Troubleshooting vSAN cluster partition (vSAN 6.6 and vSAN 6.7)

Here are the troubleshooting steps to resolve unicastagent issues on a vSAN cluster after an upgrade, usually from vSAN 6.1 / 6.2 / 6.5 to vSAN 6.6 (ESXi 6.5.0d) or later. From vSAN 6.6 onwards multicast is no longer used, so if the unicastagent list is not updated with the correct details on one or more hosts, you will see a vSAN network partition on those hosts.

In these cases you may need to manually add the unicastagent address list on all hosts that are part of the cluster; follow the steps listed below.

Note: From ESXi 6.5 Update 1 onwards, all unicastagent entries are controlled from vCenter Server, which pushes the entries to the hosts; no manual task is needed to add the unicast entries. In very rare cases, after a host removal and re-addition, a host can still end up in a network partition state that has to be resolved manually.
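As a quick sanity check that vCenter is actually pushing these entries, the following can be run over SSH on any host; this is a minimal sketch using only the commands shown later in this article, plus grep to filter the relevant fields:

esxcli vsan cluster get | grep -E "Member Count|Unicast Mode"    # cluster membership and unicast mode
esxcli vsan cluster unicastagent list                            # should contain an entry for every OTHER host in the cluster
esxcfg-advcfg -g /VSAN/IgnoreClusterMemberListUpdates            # 0 means vCenter is allowed to manage this list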

How and where do you start troubleshooting a cluster partition issue?

The most obvious symptom is a warning on the cluster and in the vSAN health plugin, as shown in the examples below, from which you can easily identify the host(s) that are network partitioned. In the example below, host 10.109.10.155, which is part of a 3-node all-flash vSAN 6.7 cluster, is network partitioned, and multiple virtual machines have gone inaccessible due to this issue.

Note: Always work towards bringing the virtual machines back up first, since they may be impacting production, and worry about fixing the network partition afterwards. However, when the objects are non-compliant with their VM storage policy, the VM objects may actually go inaccessible, because the partitioned host(s) hold a component that is probably the only good copy or the most recently updated component for the VM. In such cases the isolated/partitioned host must be brought back into the cluster to make the VMs accessible.

Step 1:

Go to the vSAN health plugin (Cluster ⇒ Monitor ⇒ vSAN ⇒ Health). We see warnings under “Network ⇒ vSAN cluster partition”, where hosts 10.109.10.156 and 10.109.10.157 are in partition 1 and host 10.109.10.155 is in partition 2.
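The same partition information can also be read from the command line on each host. This is a sketch, assuming your build exposes the esxcli vsan health namespace (present on the 6.6/6.7 hosts used here, but verify on your own build):

esxcli vsan health cluster list        # lists the health checks and their current status
esxcli vsan cluster get                # Sub-Cluster Member Count / Member UUIDs show which hosts this node can see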

Step 2:

Find out how many objects are inaccessible in the health plugin, which should show a warning under Data ⇒ vSAN object health. In this case we see about 20 objects that have gone inaccessible and multiple other objects in reduced availability.
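For a CLI view of the same data, RVC has an object status summary; a sketch, run from the cluster context used in Step 3 (the -t flag is optional):

vsan.obj_status_report .              # histogram of healthy / reduced-availability / inaccessible objects
vsan.obj_status_report . -t           # same summary plus a table of the affected object UUIDs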

Step 3:

Try to fix some of the inaccessible VMs by refreshing the object state (an unregister/re-register process). If the vCenter Server managing this cluster is accessible, this can easily be done by running a few RVC commands to see how many inaccessible VMs can be brought back to an accessible state.

  • SSH to the vCenter Server, log in to the RVC console and navigate all the way to the cluster (see the RVC section for assistance with logging in to RVC, and the login sketch below).
  • Run the command “vsan.check_state .” (see the example output below)
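For reference, logging in to RVC on the vCenter Server Appliance generally looks like the following: start RVC from the VCSA shell, authenticate, then cd to the cluster path and run ls to confirm you are at the cluster object. The SSO account, domain and cluster path below are placeholders from this lab setup:

rvc administrator@vsphere.local@localhost
cd /localhost/6.7_DC/computers/vSAN-6.7-AF3Node
ls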
/localhost/6.7_DC/computers/vSAN-6.7-AF3Node> vsan.check_state .
2018-05-02 04:06:35 +0000: Step 1: Check for inaccessible vSAN objects
Detected 18 objects to be inaccessible
Detected 07484b5a-490f-0505-a4c3-ecf4bbec65d8 on 10.109.10.156 to be inaccessible
Detected 139f475a-5988-2a0d-4aff-ecf4bbec91a8 on 10.109.10.156 to be inaccessible
Detected 3794b35a-7742-a83d-d5c2-ecf4bbec65d8 on 10.109.10.156 to be inaccessible
Detected 3db8b359-50f8-2d4a-3fdb-ecf4bbec6050 on 10.109.10.156 to be inaccessible
Detected f746475a-ba30-8055-71d1-ecf4bbec91a8 on 10.109.10.156 to be inaccessible
Detected 6c44a35a-e2be-c667-c6cc-ecf4bbec91a8 on 10.109.10.156 to be inaccessible
Detected a630ab5a-9df8-3273-e07f-ecf4bbec91a8 on 10.109.10.156 to be inaccessible
Detected 109f475a-03aa-238c-e89b-ecf4bbec91a8 on 10.109.10.156 to be inaccessible
Detected aac9385a-698f-0097-df6a-ecf4bbec65d8 on 10.109.10.156 to be inaccessible
Detected 6c44a35a-8116-3ca7-eaba-ecf4bbec91a8 on 10.109.10.156 to be inaccessible
Detected 9aec485a-01cf-5eb5-1f42-ecf4bbec6050 on 10.109.10.156 to be inaccessible
Detected 51c2c059-55c4-87ba-172a-ecf4bbec6050 on 10.109.10.156 to be inaccessible
Detected 359eb35a-f9f2-bdbd-7eb5-ecf4bbec65d8 on 10.109.10.156 to be inaccessible
Detected a830ab5a-5a2f-24cc-b351-ecf4bbec91a8 on 10.109.10.156 to be inaccessible
Detected 3694b35a-f9c5-4dd5-6c6b-ecf4bbec65d8 on 10.109.10.156 to be inaccessible
Detected 9cec485a-aee5-29e3-dff3-ecf4bbec6050 on 10.109.10.156 to be inaccessible
Detected 129f475a-2ac2-edea-3e00-ecf4bbec91a8 on 10.109.10.156 to be inaccessible
Detected 3694b35a-8229-9af3-325e-ecf4bbec65d8 on 10.109.10.156 to be inaccessible

2018-05-02 04:06:35 +0000: Step 2: Check for invalid/inaccessible VMs
Detected VM 'CLONED' as being 'inaccessible'
Detected VM '%2fvmfs%2fvolumes%2fvsan:523d5e5605a4d751-0c3304ae7a42599b%2f04484b5a-e35b-84dc-' as being 'inaccessible'
Detected VM '%2fvmfs%2fvolumes%2fvsan:523d5e5605a4d751-0c3304ae7a42599b%2fa630ab5a-9df8-3273-' as being 'inaccessible'
Detected VM 'server2012' as being 'inaccessible'
Detected VM 'ComposerServer' as being 'inaccessible'
Detected VM 'VMware-vR-Appliance-7.3.0' as being 'inaccessible'
  • Next, check how many objects/VMs can be brought online by running the RVC command “vsan.check_state . -r” (it is recommended to read how to log in to RVC first). In the example above there are 6 inaccessible VMs; when this command runs you are asked to type [Y/N] for each, because re-registering a VM causes loss of some of its management state (e.g. storage policy, permissions, tags, scheduled tasks, etc., but NO data loss). Hit Y for all the prompts, and at the end you will see how many VMs are still inaccessible after refreshing. Here we still see 2 VMs that cannot be brought online by mere re-registration, which means we do not have 50% availability of the components needed to keep those virtual machine objects accessible. (If RVC is not available, a per-host alternative using vim-cmd is sketched after the output below.)
/localhost/6.7_DC/computers/vSAN-6.7-AF3Node> vsan.check_state . -r
2018-05-02 04:07:28 +0000: Step 1: Check for inaccessible vSAN objects
Detected 07484b5a-490f-0505-a4c3-ecf4bbec65d8 to be inaccessible, refreshing state
Detected 139f475a-5988-2a0d-4aff-ecf4bbec91a8 to be inaccessible, refreshing state
Detected 3794b35a-7742-a83d-d5c2-ecf4bbec65d8 to be inaccessible, refreshing state
Detected 3db8b359-50f8-2d4a-3fdb-ecf4bbec6050 to be inaccessible, refreshing state
Detected f746475a-ba30-8055-71d1-ecf4bbec91a8 to be inaccessible, refreshing state
Detected 6c44a35a-e2be-c667-c6cc-ecf4bbec91a8 to be inaccessible, refreshing state
Detected a630ab5a-9df8-3273-e07f-ecf4bbec91a8 to be inaccessible, refreshing state
Detected 109f475a-03aa-238c-e89b-ecf4bbec91a8 to be inaccessible, refreshing state
Detected aac9385a-698f-0097-df6a-ecf4bbec65d8 to be inaccessible, refreshing state
Detected 6c44a35a-8116-3ca7-eaba-ecf4bbec91a8 to be inaccessible, refreshing state
Detected 9aec485a-01cf-5eb5-1f42-ecf4bbec6050 to be inaccessible, refreshing state
Detected 51c2c059-55c4-87ba-172a-ecf4bbec6050 to be inaccessible, refreshing state
Detected 359eb35a-f9f2-bdbd-7eb5-ecf4bbec65d8 to be inaccessible, refreshing state
Detected a830ab5a-5a2f-24cc-b351-ecf4bbec91a8 to be inaccessible, refreshing state
Detected 3694b35a-f9c5-4dd5-6c6b-ecf4bbec65d8 to be inaccessible, refreshing state
Detected 9cec485a-aee5-29e3-dff3-ecf4bbec6050 to be inaccessible, refreshing state
Detected 129f475a-2ac2-edea-3e00-ecf4bbec91a8 to be inaccessible, refreshing state
Detected 3694b35a-8229-9af3-325e-ecf4bbec65d8 to be inaccessible, refreshing state

2018-05-02 04:07:33 +0000: Step 1b: Check for inaccessible vSAN objects, again
Detected 07484b5a-490f-0505-a4c3-ecf4bbec65d8 is still inaccessible
Detected 139f475a-5988-2a0d-4aff-ecf4bbec91a8 is still inaccessible
Detected 3794b35a-7742-a83d-d5c2-ecf4bbec65d8 is still inaccessible
Detected 3db8b359-50f8-2d4a-3fdb-ecf4bbec6050 is still inaccessible
Detected f746475a-ba30-8055-71d1-ecf4bbec91a8 is still inaccessible
Detected 6c44a35a-e2be-c667-c6cc-ecf4bbec91a8 is still inaccessible
Detected a630ab5a-9df8-3273-e07f-ecf4bbec91a8 is still inaccessible
Detected 109f475a-03aa-238c-e89b-ecf4bbec91a8 is still inaccessible
Detected aac9385a-698f-0097-df6a-ecf4bbec65d8 is still inaccessible
Detected 6c44a35a-8116-3ca7-eaba-ecf4bbec91a8 is still inaccessible
Detected 9aec485a-01cf-5eb5-1f42-ecf4bbec6050 is still inaccessible
Detected 51c2c059-55c4-87ba-172a-ecf4bbec6050 is still inaccessible
Detected 359eb35a-f9f2-bdbd-7eb5-ecf4bbec65d8 is still inaccessible
Detected a830ab5a-5a2f-24cc-b351-ecf4bbec91a8 is still inaccessible
Detected 3694b35a-f9c5-4dd5-6c6b-ecf4bbec65d8 is still inaccessible
Detected 9cec485a-aee5-29e3-dff3-ecf4bbec6050 is still inaccessible
Detected 129f475a-2ac2-edea-3e00-ecf4bbec91a8 is still inaccessible
Detected 3694b35a-8229-9af3-325e-ecf4bbec65d8 is still inaccessible

2018-05-02 04:07:33 +0000: Step 2: Check for invalid/inaccessible VMs
Detected VM 'CLONED' as being 'inaccessible', reloading ...
RbVmomi::Fault: SystemError: A general system error occurred: Invalid fault
You have chosen to fix these VMs. 
This involves re-registering the VM which will cause loss of some of the management state of this VM 
(for eg. storage policy, permissions, tags, scheduled tasks, etc. but NO data loss). Do you want to continue [Y/N] ?
y
Attempting to fix the vm...
Unregistering VM CLONED
Registering VM CLONED
RegisterVM Discovered virtual machine: success
Detected VM '%2fvmfs%2fvolumes%2fvsan:523d5e5605a4d751-0c3304ae7a42599b%2f04484b5a-e35b-84dc-' as being 'inaccessible', reloading ...
RbVmomi::Fault: SystemError: A general system error occurred: Invalid fault
You have chosen to fix these VMs. 
This involves re-registering the VM which will cause loss of some of the management state of this VM 
(for eg. storage policy, permissions, tags, scheduled tasks, etc. but NO data loss). Do you want to continue [Y/N] ?
y
Attempting to fix the vm...
Unregistering VM %2fvmfs%2fvolumes%2fvsan:523d5e5605a4d751-0c3304ae7a42599b%2f04484b5a-e35b-84dc-
Registering VM %2fvmfs%2fvolumes%2fvsan:523d5e5605a4d751-0c3304ae7a42599b%2f04484b5a-e35b-84dc-
RegisterVM Discovered virtual machine: InvalidArgument: A specified parameter was not correct: path
Detected VM '%2fvmfs%2fvolumes%2fvsan:523d5e5605a4d751-0c3304ae7a42599b%2fa630ab5a-9df8-3273-' as being 'inaccessible', reloading ...
RbVmomi::Fault: SystemError: A general system error occurred: Invalid fault
You have chosen to fix these VMs. 
This involves re-registering the VM which will cause loss of some of the management state of this VM
 (for eg. storage policy, permissions, tags, scheduled tasks, etc. but NO data loss). Do you want to continue [Y/N] ?
.
.
2018-05-02 04:16:53 +0000: Step 2: Check for invalid/inaccessible VMs
Detected VM 'CLONED' as being 'inaccessible'
Detected VM 'server2012' as being 'inaccessible'

2018-05-02 04:16:53 +0000: Step 3: Check for VMs for which VC/hostd/vmx are out of sync
Did not find VMs for which VC/hostd/vmx are out of sync
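If RVC is not available, the same unregister/re-register refresh can be attempted per VM directly on a host that can see the VM's files; a minimal sketch, where the VM ID and the .vmx path are placeholders to be replaced with real values:

vim-cmd vmsvc/getallvms                                  # note the Vmid and the .vmx path of the affected VM
vim-cmd vmsvc/unregister <Vmid>                          # remove the VM from inventory (no data is deleted)
vim-cmd solo/registervm /vmfs/volumes/vsanDatastore/<VM-folder>/<VM-name>.vmx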

Step 4:

We were successful in getting at least some of the inaccessible virtual machines back online; however, we still need to fix the remaining inaccessible VMs. As explained earlier, we now need to investigate the hosts themselves to find out what caused the cluster/network partition. Over SSH to every host in the cluster, check the cluster members, the vmkernel interface used for vSAN, and the unicastagent entries, to figure out which entry is missing from the unicast address list and caused the cluster partition.

  • Run the commands “esxcli vsan cluster get”, “esxcfg-vmknic -l” and “esxcli vsan cluster unicastagent list” on all the hosts and see which entry is missing on each host (a short correlation of these outputs follows below). In this example, host-156 and host-157 have each other's vSAN vmkernel IP address listed as a unicast neighbor but are missing the entry for host-155, while host-155 has the vSAN vmkernel IP addresses of both host-156 and host-157 in its unicast neighbor address list.
[root@is-tse-d155:~] esxcli vsan cluster get
Cluster Information
 Enabled: true
 Current Local Time: 2018-05-02T03:25:38Z
 Local Node UUID: 5938de9a-e35b-d745-c9ff-ecf4bbec65d8
 Local Node Type: NORMAL
 Local Node State: MASTER
 Local Node Health State: HEALTHY
 Sub-Cluster Master UUID: 5938de9a-e35b-d745-c9ff-ecf4bbec65d8
 Sub-Cluster Backup UUID:
 Sub-Cluster UUID: 523d5e56-05a4-d751-0c33-04ae7a42599b
 Sub-Cluster Membership Entry Revision: 5
 Sub-Cluster Member Count: 1
 Sub-Cluster Member UUIDs: 5938de9a-e35b-d745-c9ff-ecf4bbec65d8
 Sub-Cluster Membership UUID: e826e95a-afff-8356-8c7c-ecf4bbec65d8
 Unicast Mode Enabled: true
 Maintenance Mode State: OFF
 Config Generation: e1acbef4-fc1e-4901-b365-091166f8d30e 4 2017-09-16T19:31:38.194
 
 
[root@is-tse-d155:~] esxcfg-vmknic -l
Interface Port Group/DVPort/Opaque Network IP Family IP Address Netmask Broadcast MAC Address MTU TSO MSS Enabled Type NetStack
vmk1 vSAN-network IPv4 10.109.44.30 255.255.240.0 10.109.47.255 00:50:56:68:12:89 1500 65535 true STATIC defaultTcpipStack



[root@is-tse-d155:~] esxcli vsan cluster unicastagent list
NodeUuid                             IsWitness Supports Unicast IP Address   Port  Iface Name
------------------------------------ --------- ---------------- ------------ ----- ----------
5937c663-8cb8-3d48-d3ad-ecf4bbec91a8         0 true             10.109.44.31 12321
5937c679-f343-be43-49a3-ecf4bbec6050         0 true             10.109.44.32 12321


[root@is-tse-d157:~] esxcli vsan cluster get
Cluster Information
 Enabled: true
 Current Local Time: 2018-05-02T03:25:24Z
 Local Node UUID: 5937c679-f343-be43-49a3-ecf4bbec6050
 Local Node Type: NORMAL
 Local Node State: BACKUP
 Local Node Health State: HEALTHY
 Sub-Cluster Master UUID: 5937c663-8cb8-3d48-d3ad-ecf4bbec91a8
 Sub-Cluster Backup UUID: 5937c679-f343-be43-49a3-ecf4bbec6050
 Sub-Cluster UUID: 523d5e56-05a4-d751-0c33-04ae7a42599b
 Sub-Cluster Membership Entry Revision: 0
 Sub-Cluster Member Count: 2
 Sub-Cluster Member UUIDs: 5937c679-f343-be43-49a3-ecf4bbec6050, 5937c663-8cb8-3d48-d3ad-ecf4bbec91a8
 Sub-Cluster Membership UUID: 6b83e15a-2033-2928-572e-ecf4bbec91a8
 Unicast Mode Enabled: true
 Maintenance Mode State: OFF
 Config Generation: e1acbef4-fc1e-4901-b365-091166f8d30e 11 2018-05-02T02:48:04.503
 
[root@is-tse-d157:~] esxcfg-vmknic -l
Interface Port Group/DVPort/Opaque Network IP Family IP Address Netmask Broadcast MAC Address MTU TSO MSS Enabled Type NetStack 
vmk1 vSAN-network IPv4 10.109.44.32 255.255.240.0 10.109.47.255 00:50:56:65:5b:ea 1500 65535 true STATIC defaultTcpipStack 
 
[root@is-tse-d157:~] esxcli vsan cluster unicastagent list
NodeUuid                             IsWitness Supports Unicast IP Address   Port  Iface Name
------------------------------------ --------- ---------------- ------------ ----- ----------
5937c663-8cb8-3d48-d3ad-ecf4bbec91a8         0 true             10.109.44.31 12321


[root@is-tse-d156:~] esxcli vsan cluster get
Cluster Information
 Enabled: true
 Current Local Time: 2018-05-02T04:03:29Z
 Local Node UUID: 5937c663-8cb8-3d48-d3ad-ecf4bbec91a8
 Local Node Type: NORMAL
 Local Node State: MASTER
 Local Node Health State: HEALTHY
 Sub-Cluster Master UUID: 5937c663-8cb8-3d48-d3ad-ecf4bbec91a8
 Sub-Cluster Backup UUID: 5937c679-f343-be43-49a3-ecf4bbec6050
 Sub-Cluster UUID: 523d5e56-05a4-d751-0c33-04ae7a42599b
 Sub-Cluster Membership Entry Revision: 0
 Sub-Cluster Member Count: 2
 Sub-Cluster Member UUIDs: 5937c679-f343-be43-49a3-ecf4bbec6050, 5937c663-8cb8-3d48-d3ad-ecf4bbec91a8
 Sub-Cluster Membership UUID: 6b83e15a-2033-2928-572e-ecf4bbec91a8
 Unicast Mode Enabled: true
 Maintenance Mode State: OFF
 Config Generation: e1acbef4-fc1e-4901-b365-091166f8d30e 11 2018-05-02T02:48:04.441
 
[root@is-tse-d156:~] esxcli vsan cluster unicastagent list
NodeUuid                             IsWitness Supports Unicast IP Address   Port  Iface Name
------------------------------------ --------- ---------------- ------------ ----- ----------
5937c679-f343-be43-49a3-ecf4bbec6050         0 true             10.109.44.32 12321
[root@is-tse-d156:~] esxcfg-vmknic -l
Interface Port Group/DVPort/Opaque Network IP Family IP Address Netmask Broadcast MAC Address MTU TSO MSS Enabled Type NetStack
vmk1 vSAN-network IPv4 10.109.44.31 255.255.240.0 10.109.47.255 00:50:56:6e:0f:a4 1500 65535 true STATIC defaultTcpipStack

[root@is-tse-d156:~] esxcli vsan cluster unicastagent list
NodeUuid IsWitness Supports Unicast IP Address Port Iface Name
------------------------------------ --------- ---------------- ------------ ----- ----------
5937c679-f343-be43-49a3-ecf4bbec6050 0 true 10.109.44.32 12321
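Putting the three outputs together: every host's “Local Node UUID” and its vSAN vmkernel IP must appear in the unicastagent list of every other host. A minimal per-host correlation sketch, using only the commands already shown above plus esxcli vsan network list to confirm which vmknic is tagged for vSAN traffic:

esxcli vsan cluster get | grep "Local Node UUID"     # this host's node UUID
esxcli vsan network list                             # which vmkernel interface is tagged for vSAN
esxcfg-vmknic -l                                     # that interface's IP address
esxcli vsan cluster unicastagent list                # must list the UUID/IP pair of every OTHER host

In this example the missing pair is host-155's Local Node UUID 5938de9a-e35b-d745-c9ff-ecf4bbec65d8 with vSAN IP 10.109.44.30, which is absent from the unicastagent lists of both host-156 and host-157.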

Step 5:

In vSAN 6.6 and above there is a new check in the vSAN health plugin on vCenter Server called “vCenter state is authoritative” (Cluster ⇒ vCenter state is authoritative), which checks and manages the unicast entries on all the hosts along with a few other tasks. In this case the check reports a warning that the last update on this cluster was made by a different vCenter Server, which may be the actual cause of the problem. This is possible if the cluster was moved from a different vCenter Server to this one and then hit the network partition issue. We may not need to add the unicast entries manually to fix the partition; clicking “Update ESXi Configuration” should fix the issue. In this scenario I had removed the hosts from a different vCenter Server, added them to this new cluster, and hit this problem. I later found a setting mismatch on host-155, where I had previously set “/VSAN/IgnoreClusterMemberListUpdates” to 1, which did not allow vCenter Server to automatically add/fix the unicast entries. Setting this value back to 0 and clicking “Update ESXi Configuration” fixed the problem.

[root@is-tse-d155:~] esxcfg-advcfg -g /VSAN/IgnoreClusterMemberListUpdates
Value of IgnoreClusterMemberListUpdates is 1
[root@is-tse-d155:~] esxcfg-advcfg -s 0 /VSAN/IgnoreClusterMemberListUpdates
Value of IgnoreClusterMemberListUpdates is 0

If the above step did not fix the problem, make sure the command "esxcfg-advcfg -s 0 /VSAN/IgnoreClusterMemberListUpdates" has been run on all the hosts and then retry "Update ESXi Configuration"; otherwise the unicast entries might have to be added manually on all the hosts. Reach out to VMware support for help adding the unicast values on each host.
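Whichever way the entries get corrected, you can confirm from the host side that the partition has healed, using the same commands from Step 4; in this 3-node cluster:

esxcli vsan cluster get | grep -E "Member Count|Member UUIDs"   # should now show all 3 nodes in one sub-cluster
esxcli vsan cluster unicastagent list                           # should list both of the other hosts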

Note: Please “DO NOT” attempt to add unicast entries on your own; adding a wrong unicast entry can cause hosts to PSOD. Please engage VMware vSAN support to help fix the issue.

Step 6:

After the network partition issue is fixed, there will be a brief resync for the objects that are not up to date (an esxcli alternative for watching the resync is sketched after the RVC output below). Once all resyncs finish, run Retest in the vSAN health plugin; all health checks should come back clean.

/localhost/6.7_DC/computers/vSAN-6.7-AF3Node> vsan.resync_dashboard .
2018-05-02 05:08:41 +0000: Querying all VMs on vSAN ...
2018-05-02 05:08:41 +0000: Querying all objects in the system from 10.109.10.156 ...
2018-05-02 05:08:41 +0000: Got all the info, computing table ...
+--------------------------------------------------------------------------------+-----------------+---------------+
| VM/Object | Syncing objects | Bytes to sync |
+--------------------------------------------------------------------------------+-----------------+---------------+
| 3Node-AF-VCVA67 | 2 | |
| [vsanDatastore] 5728e95a-29b0-552d-5860-ecf4bbec91a8/3Node-AF-VCVA67_1.vmdk | | 1.84 GB |
| [vsanDatastore] 5728e95a-29b0-552d-5860-ecf4bbec91a8/3Node-AF-VCVA67_6.vmdk | | 0.51 GB |
| 3Node-AF-PSC67 | 1 | |
| [vsanDatastore] a427e95a-389e-a03d-e009-ecf4bbec91a8/3Node-AF-PSC67.vmdk | | 2.28 GB |
+--------------------------------------------------------------------------------+-----------------+---------------+
| Total | 3 | 4.63 GB |
+--------------------------------------------------------------------------------+-----------------+---------------+
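The resync progress can also be watched directly from a host; a sketch, assuming the esxcli vsan debug namespace is available on your build (added around vSAN 6.6):

esxcli vsan debug resync summary get     # total objects and bytes left to resync
esxcli vsan debug resync list            # per-object resync detail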

Retest:
