On one of our TEST servers, several databases are hosted on a single mount point backed by a single ZFS data pool. We received multiple reports from users that the databases running on this server were not accessible. When we investigated, none of the instances running on the server were responding, and after some time they terminated.
In the database alert log file, the following error messages were reported:
WARNING: aiowait timed out 1 times
Thu May 21 14:09:38 2015
WARNING: aiowait timed out 1 times
Thu May 21 14:09:38 2015
opiodr aborting process unknown
– Check the status of the data pool:
root@dbnode1:~# zpool status data_pool
  pool: data_pool
 state: SUSPENDED
status: One or more devices are unavailable in response to IO failures.
        The pool is suspended.
action: Make sure the affected devices are connected, then run 'zpool clear' or
        'fmadm repaired'.
        Run 'zpool status -v' to see device specific details.
   see: http://support.oracle.com/msg/ZFS-8000-HC
  scan: none requested
config:

        NAME         STATE      READ WRITE CKSUM
        data_pool    SUSPENDED     0     1     0
          c1d2s0     ONLINE        0     0     0
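When several pools or servers have to be checked, it can help to pull just the state field out of the status output. The following is a minimal sketch, not part of the original troubleshooting session: pool_state is a hypothetical helper name, and it only assumes that the output contains a "state:" line like the one shown above.

```shell
#!/bin/sh
# Hypothetical helper (not from the original session): read `zpool status`
# output on stdin and print the value of the first "state:" field.
pool_state() {
    awk '$1 == "state:" { print $2; exit }'
}

# Example usage on a live system (assumes zpool is in PATH):
#   zpool status data_pool | pool_state
```

For the output shown above, this would print SUSPENDED.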
To fix this issue, we need to clear the errors on the pool and then run the repair command for this specific data pool.
– Clear the errors on the data pool:
root@dbnode1:~# zpool clear data_pool
– Check the faults recorded for the data pool:
root@dbnode1:~# fmadm faulty
--------------- ------------------------------------ -------------- ---------
TIME            EVENT-ID                             MSG-ID         SEVERITY
--------------- ------------------------------------ -------------- ---------
May 20 22:02:50 202f0e1f-3cd8-4ce6-a495-8e22a94796ee ZFS-8000-8A    Critical

Problem Status    : solved
Diag Engine       : zfs-diagnosis / 1.0
System
    Manufacturer  : unknown
    Name          : SPARC-T5-2
    Part_Number   : unknown
    Serial_Number : unknown
    Host_ID       : 84fbc1b9

----------------------------------------
Suspect 1 of 1 :
   Fault class : fault.fs.zfs.object.corrupt_data
   Certainty   : 100%
   Affects     : zfs://pool=b07605b211b6c1f3/pool_name=data_pool
   Status      : faulted but still in service

   FRU
     Name      : "zfs://pool=b07605b211b6c1f3/pool_name=data_pool"
     Status    : faulty

Description : A file or directory in pool 'data_pool' could not be read due to
              corrupt data.

Response    : No automated response will occur.

Impact      : The file or directory is unavailable.

Action      : Use 'fmadm faulty' to provide a more detailed view of this event.
              Run 'zpool status -xv' and examine the list of damaged files to
              determine what has been affected. Please refer to the associated
              reference document at http://support.oracle.com/msg/ZFS-8000-8A
              for the latest service procedures and policies regarding this
              diagnosis.

--------------- ------------------------------------ -------------- ---------
TIME            EVENT-ID                             MSG-ID         SEVERITY
--------------- ------------------------------------ -------------- ---------
May 20 22:02:50 004d413b-9765-4e5f-8763-dd29281b0cad ZFS-8000-HC    Major

Problem Status    : solved
Diag Engine       : zfs-diagnosis / 1.0
System
    Manufacturer  : unknown
    Name          : SPARC-T5-2
    Part_Number   : unknown
    Serial_Number : unknown
    Host_ID       : 84fbc1b9

----------------------------------------
Suspect 1 of 1 :
   Fault class : fault.fs.zfs.io_failure_wait
   Certainty   : 100%
   Affects     : zfs://pool=b07605b211b6c1f3/pool_name=data_pool
   Status      : faulted but still in service

   FRU
     Name      : "zfs://pool=b07605b211b6c1f3/pool_name=data_pool"
     Status    : faulty

Description : ZFS pool 'data_pool' has experienced currently unrecoverable I/O
              failures.

Response    : No automated response will occur.

Impact      : Read and write I/Os cannot be serviced.

Action      : Use 'fmadm faulty' to provide a more detailed view of this event.
              Make sure the affected devices are connected, then run
              'zpool clear'. Please refer to the associated reference document
              at http://support.oracle.com/msg/ZFS-8000-HC for the latest
              service procedures and policies regarding this diagnosis.
– Run the repair command:
root@dbnode1:~# fmadm repaired zfs://pool=b07605b211b6c1f3/pool_name=data_pool
fmadm: recorded repair to of zfs://pool=b07605b211b6c1f3/pool_name=data_pool
root@dbnode1:~#
– Check the status of the data pool:
root@dbnode1:~# zpool status data_pool
  pool: data_pool
 state: ONLINE
config:

        NAME         STATE     READ WRITE CKSUM
        data_pool    ONLINE       0     0     0
          c1d2s0     ONLINE       0     0     0
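The commands used above (zpool clear, fmadm repaired, zpool status) can be collected into a small script for repeat use. This is only a sketch under stated assumptions: run, recover_pool and the DRY_RUN variable are hypothetical names introduced here, the script defaults to printing the commands rather than executing them, and the FMRI argument must be taken from the "Affects" line of 'fmadm faulty' on your own system.

```shell
#!/bin/sh
# Sketch only: replay the recovery steps from this post for one pool.
# DRY_RUN=1 (the default here) prints the commands instead of running them.
DRY_RUN=${DRY_RUN:-1}

# Hypothetical wrapper: echo the command in dry-run mode, otherwise run it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# Hypothetical helper: $1 is the pool name, $2 is the FMRI reported by
# `fmadm faulty`. Each step runs only if the previous one succeeded.
recover_pool() {
    pool=$1
    fmri=$2
    run zpool clear "$pool" &&
    run fmadm repaired "$fmri" &&
    run zpool status "$pool"
}

# Example (dry run):
#   recover_pool data_pool 'zfs://pool=b07605b211b6c1f3/pool_name=data_pool'
```

Keeping dry-run as the default is a deliberate choice: the commands should be reviewed first, and the affected devices confirmed as connected, before running the script with DRY_RUN=0 as root.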
Conclusion:
The data pool status has now changed from “SUSPENDED” to “ONLINE”. It is recommended to restart all instances that were running on that mount point, and to check for database block corruption once all DB instances have started normally.
Thanks for reading
regards,
X A H E E R