VxRail hosts going non-responsive due to a plugin conflict

“VxRail hosts going non-responsive” symptoms can have many causes. Here we discuss one cause that has become very common lately.

Hosts generally go non-responsive because the hostd service on the ESXi host has stopped responding. The fastest way to confirm this is to open the iDRAC / KVM console to the host and press ALT+F12 (function key 12) or F11 (function key 11) on your keyboard to see live vmkernel logging, where you will see messages about the hostd service going non-responsive. If SSH access to the host is still available, you can also check the logs live from there.

Lately we have noticed that VxRail environments running on Dell hardware hit this issue because of a plugin conflict that locks the esx.conf file (the ESXi configuration file) and never releases the lock, which eventually causes the hostd service to go non-responsive. The conflict is between the Dell PTAgent (dellptagent) and lsu-lsi-lsi-msgpt3-plugin.

This is a known issue: with both the lsu-lsi-lsi-msgpt3-plugin and Dell PTAgent (dellptagent) VIBs installed on the VxRail nodes, a lock is held on the esx.conf file, which in turn causes the hosts to go non-responsive. The root cause is documented in EMC KB: https://emcservice.force.com/CustomersPartners/kA2f1000000FvH4CAK .

Cause: “esxcfg-mpath” related commands call the lsu plugin init function, which calls StoreLib to look for LSI controllers. StoreLib sends a lot of IOCTL commands to the driver. In one escalated case, a command hung on the hardware side for 129 seconds while holding the lock for /etc/vmware/esx.conf; the next command, waiting to acquire esx.conf.LOCK, eventually led to the ESXi host hanging and going non-responsive. Removing the lsu-lsi-lsi-msgpt3-plugin can mitigate the issue.

 

It is recommended to uninstall the “lsu-lsi-lsi-msgpt3-plugin” proactively whenever both of these plugins are present, even if you have not hit this issue yet. If you don't have access to the KB link, follow the steps below.

Conflicting VIBs found installed:

esxcli software vib list | grep -i "lsu-lsi-lsi-msgpt3" && esxcli software vib list | grep -i "dellptagent"

lsu-lsi-lsi-msgpt3-plugin 1.0.0-1vmw.600.0.0.2494585 VMware VMwareCertified 2017-07-05

dellptagent 1.1-0.5885 Dell PartnerSupported 2017-07-05
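The same check can be run offline against a saved copy of the `esxcli software vib list` output, for example when reviewing several hosts at once. The following is a minimal sketch in Python; the function names are hypothetical helpers written for this post, not part of esxcli or any Dell/VMware tooling, and the sample text is the output shown above.

```python
# Hypothetical helpers for illustration only; they just parse text that was
# captured from "esxcli software vib list" on a host.
CONFLICTING_VIBS = ("lsu-lsi-lsi-msgpt3-plugin", "dellptagent")


def find_conflicting_vibs(vib_list_output):
    """Return {vib_name: version} for each of the two VIBs found in the text."""
    found = {}
    for line in vib_list_output.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # blank line or something that is not a VIB row
        name = fields[0].lower()
        if name in CONFLICTING_VIBS:
            found[name] = fields[1]
    return found


def has_plugin_conflict(vib_list_output):
    """True only when both VIBs are installed on the host."""
    return len(find_conflicting_vibs(vib_list_output)) == len(CONFLICTING_VIBS)


# Sample rows copied from the output above.
SAMPLE = """\
lsu-lsi-lsi-msgpt3-plugin 1.0.0-1vmw.600.0.0.2494585 VMware VMwareCertified 2017-07-05
dellptagent 1.1-0.5885 Dell PartnerSupported 2017-07-05
"""

if __name__ == "__main__":
    print(find_conflicting_vibs(SAMPLE))
    print(has_plugin_conflict(SAMPLE))
```

A host that lists only one of the two VIBs is reported as not affected, which matches the recommendation to act when both are present.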

Log evidence for the issue:

vmkernel.log:2017-12-01T06:49:07.114Z cpu24:49140)FSS: 6264: Failed to open file 'naa.5000cca0805ff528'; Requested flags 0x5, world: 49140 [DellPTAgent], (Existing flags 0x4005, world: 35442 [hostd-worker]): Busy
vmkernel.log:2017-12-01T06:52:12.662Z cpu18:49140)FSS: 6264: Failed to open file 'naa.5000cca0805ff264'; Requested flags 0x5, world: 49140 [DellPTAgent], (Existing flags 0x4005, world: 38219 [hostd-worker]): Busy
vmkernel.log:2017-12-01T06:54:22.738Z cpu7:49140)FSS: 6264: Failed to open file 'naa.5000cca0805ff264'; Requested flags 0x5, world: 49140 [DellPTAgent], (Existing flags 0x4005, world: 35927 [hostd-worker]): Busy

vmkernel.1:2017-12-01T03:17:29.546Z cpu40:194056575)ALERT: hostd detected to be non-responsive
vmkernel.1:2017-12-01T03:18:33.229Z cpu38:200259473)ALERT: hostd detected to be non-responsive


hostd.log:2017-11-30T21:24:54.179Z warning hostd[A42C1B70] [Originator@6876 sub=Hostsvc.NetworkProvider opID=1313cfa6-43ec user=vpxuser] Error getting dvs fa 3f 02 50 f3 c3 f2 a7-bd 52 85 de 45 c5 3b d5 : Error interacting with configuration file /etc/vmware/esx.conf: Timout while waiting for lock, /etc/vmware/esx.conf.LOCK, to be released. Another process has kept this file locked for more than 30 seconds. The process currently holding the lock is esxcfg-mpath(198411757). This is likely a temporary condition. Please try your operation again.
hostd.log:--> value = "Error interacting with configuration file /etc/vmware/esx.conf: Timout while waiting for lock, /etc/vmware/esx.conf.LOCK, to be released. Another process has kept this file locked for more than 30 seconds. The process currently holding the lock is esxcfg-mpath(198411757). This is likely a temporary condition. Please try your operation again."
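When going through a log bundle, it can help to count these signatures rather than read them line by line. Below is a rough Python sketch that looks for the two patterns shown above: the "hostd detected to be non-responsive" alert and the name of the process holding esx.conf.LOCK. The function name and the regular expression are assumptions based only on the log lines in this post; other ESXi builds may word these messages differently.

```python
import re

# Pattern derived from the hostd.log lines quoted above (an assumption, not an
# official log format definition).
LOCK_HOLDER_RE = re.compile(
    r"/etc/vmware/esx\.conf\.LOCK.*?"
    r"The process currently holding the lock is (?P<name>[\w.-]+)\((?P<pid>\d+)\)"
)
HOSTD_ALERT = "ALERT: hostd detected to be non-responsive"


def summarize(lines):
    """Count hostd alerts and tally which process held the esx.conf lock."""
    holders = {}
    alerts = 0
    for line in lines:
        if HOSTD_ALERT in line:
            alerts += 1
        match = LOCK_HOLDER_RE.search(line)
        if match:
            key = (match.group("name"), int(match.group("pid")))
            holders[key] = holders.get(key, 0) + 1
    return {"hostd_alerts": alerts, "lock_holders": holders}


# Lines copied from the log evidence above.
SAMPLE_LINES = [
    "vmkernel.1:2017-12-01T03:17:29.546Z cpu40:194056575)ALERT: hostd detected to be non-responsive",
    "vmkernel.1:2017-12-01T03:18:33.229Z cpu38:200259473)ALERT: hostd detected to be non-responsive",
    'hostd.log:--> value = "Error interacting with configuration file /etc/vmware/esx.conf: '
    "Timout while waiting for lock, /etc/vmware/esx.conf.LOCK, to be released. "
    "Another process has kept this file locked for more than 30 seconds. "
    "The process currently holding the lock is esxcfg-mpath(198411757). "
    'This is likely a temporary condition. Please try your operation again."',
]

if __name__ == "__main__":
    print(summarize(SAMPLE_LINES))
```

If esxcfg-mpath shows up repeatedly as the lock holder alongside hostd alerts, that is consistent with the cause described in the KB.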

 

Step 1: Place one host in maintenance mode with the “Ensure accessibility” option

Please also see https://virtuallysensei.com/steps-for-performing-vsan-host-maintenance/ , which explains a few basic checks to perform before placing hosts in maintenance mode.

Step 2: Remove the problematic VIB

esxcli software vib remove --no-live-install -n lsu-lsi-lsi-msgpt3-plugin

Removal Result
Message: The update completed successfully, but the system needs to be rebooted for the
changes to be effective.
Reboot Required: true
VIBs Installed:
VIBs Removed: VMware_bootbank_lsu-lsi-lsi-msgpt3-plugin_1.0.0-1vmw.600.0.0.2494585
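If the removal is run across many hosts and the output is captured to a file, a small parser can confirm that the expected VIB was removed and that a reboot is pending. This is only a sketch in Python; the helper names are hypothetical and the parsing assumes the "Key: value" layout of the removal result shown above.

```python
import re

# A field line looks like "Reboot Required: true"; anything else is treated as
# a continuation of the previous field (the Message wraps onto a second line).
FIELD_RE = re.compile(r"^([A-Za-z][A-Za-z ]*):\s*(.*)$")


def parse_removal_result(text):
    """Turn the captured removal result into a {field: value} dictionary."""
    fields = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        match = FIELD_RE.match(line)
        if match:
            current = match.group(1)
            fields[current] = match.group(2)
        elif current is not None:
            fields[current] = (fields[current] + " " + line).strip()
        # Lines before the first field (the "Removal Result" header) are skipped.
    return fields


def needs_reboot(fields):
    """True when the result reports that a reboot is required."""
    return fields.get("Reboot Required", "").lower() == "true"


# Output copied from the removal result above.
SAMPLE_RESULT = """\
Removal Result
Message: The update completed successfully, but the system needs to be rebooted for the
changes to be effective.
Reboot Required: true
VIBs Installed:
VIBs Removed: VMware_bootbank_lsu-lsi-lsi-msgpt3-plugin_1.0.0-1vmw.600.0.0.2494585
"""

if __name__ == "__main__":
    parsed = parse_removal_result(SAMPLE_RESULT)
    print(parsed["VIBs Removed"], needs_reboot(parsed))
```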

Step 3: Reboot the host

After the reboot, wait for the host to come back up in maintenance mode, then wait for any resync to finish. You can check this from the Web Client on vCenter:

Select: Cluster ⇒ Monitor ⇒ vSAN ⇒ Resyncing Components, or run the RVC command “vsan.resync_dashboard .” from the cluster directory (see how-to-login to RVC). If a resync is in progress, DO NOT attempt maintenance on any of the other hosts; wait for the resync to complete. If no resync operations are seen, repeat the same steps on each of the remaining hosts, one at a time: place the host in maintenance mode, remove the VIB, and reboot.
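The one-host-at-a-time rule with a resync gate between hosts can be expressed as a short loop. The Python sketch below only shows the ordering logic: `remediate_host` and `resyncing_components` are hypothetical callables that you would have to supply yourself (for example by wrapping your own automation or a manual check); neither is a real VMware API.

```python
import time


def rolling_remediation(hosts, remediate_host, resyncing_components,
                        wait_seconds=60, sleep=time.sleep):
    """Remediate hosts strictly one at a time, waiting out any vSAN resync.

    remediate_host(host): hypothetical hook that performs steps 1 to 3 for one
        host (maintenance mode, VIB removal, reboot) and returns when done.
    resyncing_components(): hypothetical hook returning how many components
        are still resyncing (0 means the resync view is empty).
    """
    completed = []
    for host in hosts:
        remediate_host(host)
        # Do not touch the next host while a resync is still in progress.
        while resyncing_components() > 0:
            sleep(wait_seconds)
        completed.append(host)
    return completed


if __name__ == "__main__":
    # Dry run with stand-in hooks: nothing is executed against real hosts.
    remaining = [3, 1, 0, 0]

    def fake_resync():
        return remaining.pop(0) if remaining else 0

    print(rolling_remediation(["host-a", "host-b"], print, fake_resync,
                              wait_seconds=0))
```

Keeping the hooks as parameters means the loop can be exercised with stand-ins first, as in the dry run above, before it is connected to anything that touches a production cluster.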

 


Hareesh K G is a Site Reliability Engineer with VMware vSAN Engineering. His current focus is VMware vSAN® on-premises, and his overall expertise covers the Storage Availability Business Unit products (VMware vSAN®, VMware Site Recovery Manager® and vSphere Data Protection®). He started his career with EMC support for CLARiiON and VNX block storage in 2012 and has been with VMware since 2015.
