VxRAIL Hosts going Non-responsive due to a plugin conflict
“VxRAIL hosts going non-responsive” symptoms can have many causes. Here we discuss one cause that has become very common lately.
Hosts generally go non-responsive because the hostd service on the ESXi host has stopped responding. The fastest way to confirm this is to open the iDRAC / KVM console to the host directly and press ALT+F12 (function key 12) or F11 (function key 11) on your keyboard to see live vmkernel logging, where you will see messages about the hostd service going non-responsive. If SSH is available on the host, you can also check the logs live from there.
Lately we have noticed VxRAIL environments running on DELL hardware hitting this issue because of a plugin conflict that locks the esx.conf file (the ESXi configuration file) and never releases the lock, which eventually causes the hostd service to go non-responsive. The conflict is between Dell-PTA AGENT and lsu-lsi-lsi-msgpt3-plugin.
This is a known issue when both the lsu-lsi-lsi-msgpt3-plugin and Dell-PTA AGENT VIBs are installed on VxRAIL nodes: it causes a lock on the esx.conf file, which in turn causes the hosts to go non-responsive. The root cause is documented in EMC KB : https://emcservice.force.com/CustomersPartners/kA2f1000000FvH4CAK .
Cause : “esxcfg-mpath” related commands call lsu plugin init function, which calls StoreLib looking for LSI controllers. StoreLib sends a lot of IOCTL command to driver. In an escalated case, one command hang at HW side for 129 seconds, it holds the lock for /etc/vmware/esx.conf, the next command waiting to request the esx.conf.LOCK eventually leads to ESXi host hang and non-responsive. Removing the lsu-lsi-lsi-msgpt3-plugin could mitigate the issue.
It is recommended to uninstall “lsu-lsi-lsi-msgpt3-plugin” proactively whenever both of these plugins are present, even if you have not hit this issue yet. If you don't have access to the KB link, the steps are given below.
Conflicting Drivers found installed :
esxcli software vib list | grep -i "lsu-lsi-lsi-msgpt3" && esxcli software vib list | grep -i "dellptagent"
lsu-lsi-lsi-msgpt3-plugin 1.0.0-1vmw.600.0.0.2494585 VMware VMwareCertified 2017-07-05
dellptagent 1.1-0.5885 Dell PartnerSupported 2017-07-05
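The check above can be wrapped in a small helper that succeeds only when both VIBs are present. This is a minimal sketch and not part of the KB: the function name `has_conflicting_vibs` is hypothetical, and it reads the `esxcli software vib list` output from standard input so that it can also be tried against saved output.

```shell
#!/bin/sh
# Hypothetical helper (not from the KB): reads `esxcli software vib list`
# output on stdin and succeeds only if BOTH conflicting VIBs are listed.
has_conflicting_vibs() {
    vib_list=$(cat)
    # Same case-insensitive patterns as the manual grep commands.
    printf '%s\n' "$vib_list" | grep -i 'lsu-lsi-lsi-msgpt3' >/dev/null || return 1
    printf '%s\n' "$vib_list" | grep -i 'dellptagent' >/dev/null || return 1
    return 0
}

# Usage on a host (assumes SSH access to the ESXi shell):
#   esxcli software vib list | has_conflicting_vibs && echo "both VIBs present"
```

If only one of the two VIBs is installed, the helper returns a non-zero status, which matches the recommendation to act only when both plugins are present.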
Log Evidence for the issue :
vmkernel.log:2017-12-01T06:49:07.114Z cpu24:49140)FSS: 6264: Failed to open file 'naa.5000cca0805ff528'; Requested flags 0x5, world: 49140 [DellPTAgent], (Existing flags 0x4005, world: 35442 [hostd-worker]): Busy
vmkernel.log:2017-12-01T06:52:12.662Z cpu18:49140)FSS: 6264: Failed to open file 'naa.5000cca0805ff264'; Requested flags 0x5, world: 49140 [DellPTAgent], (Existing flags 0x4005, world: 38219 [hostd-worker]): Busy
vmkernel.log:2017-12-01T06:54:22.738Z cpu7:49140)FSS: 6264: Failed to open file 'naa.5000cca0805ff264'; Requested flags 0x5, world: 49140 [DellPTAgent], (Existing flags 0x4005, world: 35927 [hostd-worker]): Busy
vmkernel.1:2017-12-01T03:17:29.546Z cpu40:194056575)ALERT: hostd detected to be non-responsive
vmkernel.1:2017-12-01T03:18:33.229Z cpu38:200259473)ALERT: hostd detected to be non-responsive
hostd.log:2017-11-30T21:24:54.179Z warning hostd[A42C1B70] [Originator@6876 sub=Hostsvc.NetworkProvider opID=1313cfa6-43ec user=vpxuser] Error getting dvs fa 3f 02 50 f3 c3 f2 a7-bd 52 85 de 45 c5 3b d5 : Error interacting with configuration file /etc/vmware/esx.conf: Timout while waiting for lock, /etc/vmware/esx.conf.LOCK, to be released. Another process has kept this file locked for more than 30 seconds. The process currently holding the lock is esxcfg-mpath(198411757). This is likely a temporary condition. Please try your operation again.
hostd.log:--> value = "Error interacting with configuration file /etc/vmware/esx.conf: Timout while waiting for lock, /etc/vmware/esx.conf.LOCK, to be released. Another process has kept this file locked for more than 30 seconds. The process currently holding the lock is esxcfg-mpath(198411757). This is likely a temporary condition. Please try your operation again."
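To look for the same evidence on your own hosts, the two signatures in the log excerpt can be searched for directly. The following is a rough sketch rather than an official procedure: the function names `count_hostd_alerts` and `lock_holders` are made up for illustration, both read log text from standard input, and the log paths in the usage comments are assumptions about where the logs live on your host.

```shell
#!/bin/sh
# Hypothetical helpers for scanning log text supplied on stdin.

# Count vmkernel ALERT lines that report hostd as non-responsive.
count_hostd_alerts() {
    grep -c 'hostd detected to be non-responsive'
}

# List the unique processes reported as holding esx.conf.LOCK,
# for example "esxcfg-mpath(198411757)".
lock_holders() {
    sed -n 's/.*process currently holding the lock is \([^ ]*([0-9]*)\)\..*/\1/p' | sort -u
}

# Usage on a host (paths are assumptions; adjust to your environment):
#   count_hostd_alerts < /var/log/vmkernel.log
#   lock_holders < /var/log/hostd.log
```

A lock holder of esxcfg-mpath together with the DellPTAgent "Busy" messages is the combination described in the KB.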
Step1 : Place one host in maintenance mode with the “Ensure accessibility” option
Please also see : http://virtuallysensei.com/steps-for-performing-vsan-host-maintenance/ , which explains a few basic checks to perform before placing hosts in maintenance mode.
Step2 : Remove the problematic VIB
esxcli software vib remove --no-live-install -n lsu-lsi-lsi-msgpt3-plugin
Removal Result
   Message: The update completed successfully, but the system needs to be rebooted for the changes to be effective.
   Reboot Required: true
   VIBs Installed:
   VIBs Removed: VMware_bootbank_lsu-lsi-lsi-msgpt3-plugin_1.0.0-1vmw.600.0.0.2494585
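If you script the removal, the result text shown above can be checked before moving on to the reboot. This is only a sketch under the assumption that the output looks like the sample above; the function names `reboot_required` and `vib_removed` are hypothetical, and both read the command output from standard input.

```shell
#!/bin/sh
# Hypothetical helpers: inspect the text printed by
# `esxcli software vib remove ...`, supplied on stdin.

# Succeeds if the output says a reboot is needed.
reboot_required() {
    grep -i 'Reboot Required: true' >/dev/null
}

# Succeeds if the msgpt3 plugin appears in the "VIBs Removed" field.
vib_removed() {
    grep 'VIBs Removed:.*lsu-lsi-lsi-msgpt3-plugin' >/dev/null
}

# Usage (assumes the removal output was saved to a file of your choosing):
#   reboot_required < removal_output.txt && echo "reboot the host next"
```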
Step3 : Reboot the host
After rebooting, wait for the host to come back up in maintenance mode, then check for any resync activity. You can use the Web Client on vCenter:
Select : Cluster ⇒ Monitor ⇒ vSAN ⇒ Resyncing Components, or run the RVC command "vsan.resync_dashboard ." from the cluster directory (see how-to-login to RVC). If a resync is in progress, DO NOT attempt maintenance on any of the other hosts; wait for the resync to complete. If no resync operations are seen, repeat the same steps on each of the remaining hosts, one at a time: place the host in maintenance mode, remove the VIB, and reboot it.