vSAN can declare disks permanently unavailable, dead, or unmounted for multiple reasons. You will have to investigate the logs further to find the cause of such unmounts. Here is a small guide which should enable you to isolate the problem.
It is always good to start by looking into the vmkernel.log and vobd.log for the drive (naa.ID) to see the reason for the disk or DG unmounts. The vCenter ⇒ Cluster ⇒ Configure (Manage in 6.2) ⇒ Disk Management view should indicate the disk or DG which is unhealthy, failed, or unmounted.
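For example, assuming the affected device is naa.51402ec0104cc514 (substitute your own naa.ID), a quick way to pull the relevant entries from the ESXi shell on the host:

# Search the live logs for the affected device
grep naa.51402ec0104cc514 /var/run/log/vmkernel.log
grep naa.51402ec0104cc514 /var/run/log/vobd.log

# Rotated logs are gzip-compressed, so cover those as well
zcat /var/run/log/vmkernel.*.gz | grep naa.51402ec0104cc514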
A case study: frequent disk and DG unmounts in an all-flash cluster with deduplication and compression enabled
Here is a case study for one of the clusters which kept unmounting DGs randomly. I will start with the log evidence for the cause of the disk group unmounts. Over a period of weeks we learnt that the cluster was using unsupported SAS expanders with about 28 disks (including cache tier and capacity tier), while the controller in use could only handle 14 drives. Per https://blogs.vmware.com/virtualblocks/2016/05/17/vmware-virtual-san-sas-expanders/ , SAS expanders are only supported with vSAN ReadyNodes, which are pre-tested and validated for saturation issues on the SAS bus. In this case we used to see a pattern where drives on a random host in the cluster started to report read failures, the drive then got unmounted by vSAN, and after that the entire DG got unmounted (because this is an all-flash cluster with dedup and compression enabled), triggering a huge resync. We assume this happened because the drives on the bus did not get enough bandwidth to acknowledge the I/O, as the other drives/DGs on the host had saturated the SAS bus.
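We only confirmed the oversubscription after mapping every disk to its controller. As a rough sketch (not an official procedure), you can count the storage paths behind each adapter on a host and compare the number against the controller's supported drive count on the vSAN HCL:

# Count paths per HBA; a count well above what the controller is rated
# for suggests an oversubscribed SAS expander
esxcli storage core path list | grep "Adapter:" | sort | uniq -c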
Check vmkernel.log under /var/run/log.

Look for permanent errors:

Disk Event permanent error for MD 52827e2f-7a2c-8931-e978-1e57a2ff9078 (naa.51402ec0104cc514:2)
2018-02-22T22:12:56.479Z cpu10:28470336)WARNING: LSOM: LSOMEventNotify:6861: Virtual SAN device 52827e2f-7a2c-8931-e978-1e57a2ff9078 is under permanent error.
2018-02-22T22:12:56.479Z cpu10:28470336)LSOM: LSOMLogDiskEvent:5602: Disk Event permanent error propagated for MD 52726770-6b21-aa35-3227-ba18bec45c89 (naa.51402ec0104cc517:2)
2018-02-22T22:12:56.479Z cpu10:28470336)WARNING: LSOM: LSOMEventNotify:6872: Virtual SAN device 52726770-6b21-aa35-3227-ba18bec45c89 is under propagated permanent error.
2018-02-22T22:12:56.479Z cpu10:28470336)LSOM: LSOMLogDiskEvent:5602: Disk Event permanent error propagated for MD 528605c7-70cd-b6df-78bc-1c6ea90f57d2 (naa.51402ec0104cc507:2)
2018-02-22T22:12:56.479Z cpu10:28470336)WARNING: LSOM: LSOMEventNotify:6872: Virtual SAN device 528605c7-70cd-b6df-78bc-1c6ea90f57d2 is under propagated permanent error.
2018-02-22T22:12:56.479Z cpu10:28470336)LSOM: LSOMLogDiskEvent:5602: Disk Event permanent error propagated for SSD 5256b4fa-5f83-5b4a-8e9c-ae49c87fa660 (naa.58ce38ee200a70d9:2)
2018-02-22T22:12:56.479Z cpu10:28470336)WARNING: LSOM: LSOMEventNotify:6872: Virtual SAN device 5256b4fa-5f83-5b4a-8e9c-ae49c87fa660 is under propagated permanent error.
2018-02-22T22:17:17.897Z cpu12:66491)WARNING: PLOG: PLOGPropagateErrorInt:2809: Ignored permanent error event on stashed 52827e2f-7a2c-8931-e978-1e57a2ff9078 state=0xc09

You may also see power-on reset errors:

2018-02-22T22:28:02.296Z cpu37:66492)ScsiCore: 1705: Power-on Reset occurred on naa.51402ec0104cc514
2018-02-22T22:28:10.299Z cpu22:68625024)ScsiCore: 1705: Power-on Reset occurred on vmhba0:C2:T20:L0
2018-02-22T22:28:18.307Z cpu23:66492)ScsiCore: 1705: Power-on Reset occurred on naa.51402ec0104cc514
2018-02-22T22:28:26.310Z cpu24:68625024)ScsiCore: 1705: Power-on Reset occurred on vmhba0:C2:T20:L0
2018-02-22T22:28:34.318Z cpu30:66492)ScsiCore: 1705: Power-on Reset occurred on naa.51402ec0104cc514
2018-02-22T22:29:05.337Z cpu24:65910)ScsiCore: 1705: Power-on Reset occurred on vmhba0:C2:T20:L0
2018-02-22T22:29:28.355Z cpu8:66491)ScsiCore: 1705: Power-on Reset occurred on naa.51402ec0104cc514

Read (0x28) or write (0x2a) failures (SCSI errors):

2018-02-22T22:11:53.634Z cpu13:66491)ScsiDeviceIO: 2927: Cmd(0x439a1e94ba80) 0x28, CmdSN 0x4ae975ed from world 0 to dev "naa.51402ec0104cc514" failed H:0x5 D:0x0 P:0x0 Invalid sense data: 0x4 0x43 0x0.
2018-02-22T22:11:53.634Z cpu13:66491)NMP: nmp_ThrottleLogForDevice:3617: Cmd 0x28 (0x439a1e617180, 0) to dev "naa.51402ec0104cc514" on path "vmhba0:C2:T20:L0" Failed: H:0x5 D:0x0 P:0x0 Invalid sense data: 0x0 0x0 0x0. Act:EVAL
2018-02-22T22:12:01.640Z cpu38:66492)ScsiDeviceIO: 2927: Cmd(0x43a20b6ead00) 0x28, CmdSN 0x4ae975f6 from world 0 to dev "naa.51402ec0104cc514" failed H:0x5 D:0x0 P:0x0 Invalid sense data: 0x4 0x43 0x0.
2018-02-22T22:12:01.640Z cpu38:66492)ScsiDeviceIO: 2927: Cmd(0x439d0083e640) 0x28, CmdSN 0x4ae975f5 from world 0 to dev "naa.51402ec0104cc514" failed H:0x5 D:0x0 P:0x0 Invalid sense data: 0x4 0x43 0x0.
2018-02-22T22:12:01.640Z cpu38:66492)NMP: nmp_ThrottleLogForDevice:3617: Cmd 0x28 (0x43a392885e40, 0) to dev "naa.51402ec0104cc514" on path "vmhba0:C2:T20:L0" Failed: H:0x5 D:0x0 P:0x0 Invalid sense data: 0x0 0x0 0x0. Act:EVAL
2018-02-22T22:12:09.646Z cpu35:66492)ScsiDeviceIO: 2927: Cmd(0x43a384c7e480) 0x28, CmdSN 0x4ae975fd from world 0 to dev "naa.51402ec0104cc514" failed H:0x5 D:0x0 P:0x0 Invalid sense data: 0x80 0x41 0x0.
2018-02-22T22:12:09.646Z cpu35:66492)NMP: nmp_ThrottleLogForDevice:3617: Cmd 0x28 (0x439d0083e640, 0) to dev "naa.51402ec0104cc514" on path "vmhba0:C2:T20:L0" Failed: H:0x0 D:0x2 P:0x0 Valid sense data: 0x6 0x29 0x0. Act:NONE
2018-02-22T22:12:09.646Z cpu35:66492)ScsiDeviceIO: 2927: Cmd(0x439d0083e640) 0x28, CmdSN 0x4ae975f5 from world 0 to dev "naa.51402ec0104cc514" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x6 0x29 0x0.
In the above example, the disk “naa.51402ec0104cc514” started to report power-on reset events. These events are usually seen whenever a disk stops responding to SCSI commands from the host; the host resets the LUN (naa.ID) to see if the disk can come back to a responding state. After retrying multiple times, vSAN marks the drive as PDL (permanent device loss). This may or may not be a genuine drive fault caused by media errors or a full drive failure, so further investigation will also need to be done from the hardware perspective. In this example it is very clear that the drive “naa.51402ec0104cc514” reported read failures (0x28) continuously, the host retried resetting the LUN to bring it back into an accessible state, and eventually gave up and marked it as permanently lost.
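To check what state the host currently holds the device in, you can query it directly; for a drive in PDL the Status field typically shows the device as dead:

# Show the device state as the host sees it (Status and related fields)
esxcli storage core device list -d naa.51402ec0104cc514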
Note: If this is an all-flash vSAN cluster with dedup and compression enabled, you are expected to see the entire disk group being unmounted; this is expected behavior for a capacity-tier SSD drive failure. In a non-dedup environment we expect only a cache-tier drive failure to cause a DG unmount; any capacity-tier drive failure should leave the DG functioning without the faulty disk.
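To see which disk group a given drive belongs to and whether the disks are still in use by vSAN, the vdq utility on the ESXi host gives a quick view (output fields vary slightly between vSAN versions):

# Dump vSAN disk mappings (disk groups) in human-readable form
vdq -iH
# Query per-disk vSAN state and eligibility
vdq -qH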
Here is a list of possible reasons which can potentially mark disks as failed:
- The disk stopped responding to vSAN I/O requests in a timely manner. When the disk failure is declared due to an I/O timeout, the vmkernel log will report “maximum kernel-level retries exceeded” in association with the permanent error message. For example:
LSOMCommon: IORETRYParentIODoneCB:1043: Throttled: split status Maximum kernel-level retries exceeded
- The disk encountered a genuine fault such as a medium error. In these cases the disk will be marked as faulted, even though it may still appear in the system. When a medium error is encountered, Virtual SAN will mark the disk as being in a permanent error state, and the problem will manifest in two ways:
- The vmkernel log will reflect that the permanent-error event is due to “I/O error”:
WARNING: LSOMCommon: IORETRYParentIODoneCB:1466: Throttled: split status I/O error
- The vmkernel log will report that at least one I/O operation to the storage device failed due to a medium error. This is reflected by sense key 0x3:
NMP: nmp_ThrottleLogForDevice:3178: Cmd 0x2a (0x439fc298e0c0, 0) to dev "naa.5000cca072ac3b58" on path "vmhba2:C0:T14:L0" Failed: H:0x0 D:0x2 P:0x0 Valid sense data: 0x3 0x15 0x0. Act:NONE
- For more information on decoding SCSI error messages: https://en.wikipedia.org/wiki/SCSI_command
- The disk was removed from the storage path. When this is encountered, the vmkernel log will report a PDL (permanent device loss) or APD (all paths down) condition associated with a device.
The most common scenario is a disk going into PDL. Virtual SAN will interpret this as a permanent condition and will mark the disk as permanently unavailable, as I/O will fail due to “not supported”:
WARNING: NMP: nmp_PathDetermineFailure:2961: Cmd (0x2a) PDL error (0x5/0x25/0x0) - path vmhba2:C2:T2:L0 device naa.600605b0099250d01da7c6a019312a53 - triggering path evaluation
NMP: nmp_ThrottleLogForDevice:3286: Cmd 0x2a (0x439ee894cc00, 0) to dev "naa.600605b0099250d01da7c6a019312a53" on path "vmhba2:C2:T2:L0" Failed: H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x25 0x0. Act:EVAL
LSOMCommon: IORETRYCompleteIO:495: Throttled: 0x439ee7ea0c00 IO type 304 (WRITE) isOdered:NO since 20392 msec status Not supported
WARNING: LSOM: LSOMEventNotify:6126: Virtual SAN device 52a89ef4-6c3f-d16b-5347-c5354470b465 is under permanent error.
- When the failure is due to APD instead of PDL (a comparatively rare scenario), the failure will be due to “status Not found”
- For more information on APD and PDL behavior, please see: Permanent device loss and All-Paths-Down handling in vSphere 5.x and 6.x
- The overall disk performance level deteriorated to the point that Virtual SAN marked the disks as offline to prevent system performance degradation. When this occurs, Virtual SAN will unmount the affected disks or disk groups. The vmkernel log will report that the VSAN Device Monitor took the unmount action:
VSAN Device Monitor: Unmounting VSAN diskgroup eui.2114aa100d00001
- Please see the Dying Disk Handling (DDH) KB: vSAN 6.1/5.5 Update 3 Disk Groups show as Unmounted in the vSphere Web Client (2132079)
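Whichever of the above caused the unmount, it is worth confirming the current state from the affected host before taking action. A minimal check from the ESXi shell (field names vary slightly between vSAN versions):

# List all disks claimed by vSAN with their disk group membership;
# a disk that should be active but shows "In CMMDS: false" points at
# an unmounted or failed disk / disk group
esxcli vsan storage list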