While managing multiple vSAN clusters in a production environment, you may often ask yourself, "What are the things I need to know for vSAN management?" Here is a complete guide to your everyday tasks while managing vSAN clusters, along with things to know that can cause potential disasters if precautions are not taken.
Virtual machine storage policy compliance
You might have already seen all my previous posts, and by now you will have realized that the virtual machine storage policy is everything when it comes to data protection. We need to make sure all the important virtual machines are compliant with their storage policies at all times, and if any VMs that still need redundancy are running FTT=0, change the policy to FTT=1 (failures to tolerate). Please see How-to-create-VM-policy-vSAN .
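As a quick sketch, you can spot-check a VM's vSAN object layout and the policy applied to it from RVC; the cluster and VM paths below are placeholders for your own inventory:

vsan.check_state ~/computers/(cluster)   # flag inaccessible or invalid vSAN objects and VMs
vsan.vm_object_info ~/vms/(vm-name)      # dump each object's layout and policy attributes

Any object whose policy shows hostFailuresToTolerate = 0 is running without redundancy and is a candidate for a move to FTT=1.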
Always make a habit of reading release notes and patch information for vSAN
It is highly recommended to know what has been released and what is coming, to stay on top of things and avoid major and minor issues in the environment. Hence, always make a habit of reading the field advisory emails we receive when new patches and updates are released, at least for the critical ones. From my experience so far in handling vSAN issues vs fixes, any version running ESXi 6.0 Update 2 or below needs an immediate upgrade to the latest version, ESXi 6.0 Update 3 Patch 06, or, if the hardware is compatible with vSAN 6.6 (ESXi 6.5 Update 1+), we should go for that. There are always enhancements and fixes for certain known issues in the latest release, and a few of those may be nasty issues we always want to avoid.
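Before planning any patching, first confirm exactly which build each host is running; a minimal check from the ESXi shell (both commands report the same version and build information):

vmware -vl                   # show the ESXi version and build number
esxcli system version get    # same information in a structured form

You can then map the build number against the VMware release notes to see which of the fixes below you already have.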
We know upgrades are always painful, as we may be involved with multiple products and dependencies on other suites (vCenter, SRM, Horizon, and even ready appliances like VxRail and VxRack SDDC). Hence, plan the upgrades well and do a pre-check with all vendors and suites for compatibility/interoperability. In many cases it might just be a patch to the current version level and may not require a version upgrade, so in these situations we may not necessarily upgrade other product suites, and the update process would be hassle free. See the KB articles below for awareness.
The following KBs are applicable to both Hybrid and All-Flash setups:
ESXi 6.0 Patch 06
- KB2146345 – ESXi host experiences a PSOD due to a vSAN race condition
- KB2145347 – Component metadata health check fails with invalid state error
- KB2150189 – vSAN de-staging may cause a brief PCPU lockup during heavy client I/O
- KB2150395 – Bytes to sync values for RAID5/6 objects appear incorrectly in vCenter and RVC
- KB2150396 – Using objtool on a vSAN witness node may result in a PSOD
- KB2150390 – Health check for vSAN vmknic configuration may display a false positive
- KB2150389 – SSD congestion may cause multiple virtual machines to become unresponsive
- KB2150387 – vSAN Datastores may become inaccessible during log or memory congestion
- KB2151127 – vSAN and VMware bootbank critical fix
- KB2151132 – vSAN and VMware bootbank critical fix
The following KBs are highly recommended for All-Flash setups:
ESXi 6.5 Express Patch 4 : Critical fix for All-Flash, for environments running 6.5 releases
ESXi 6.0 Express Patch 11 : Critical fix for All-Flash, for environments running 6.0 releases
Note*: If you are using an appliance such as Dell EMC (VxRail, VxRack), IBM Bluemix, etc., you may need to contact your hardware vendor prior to applying patches.
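One hedged way to confirm what actually landed after patching is to list the vSAN-related VIBs on a host (exact VIB names vary by release):

esxcli software vib list | grep -i vsan   # show installed vSAN VIB versions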
vSAN Health check plugin
It is highly recommended to monitor the vSAN health check plugin (Cluster ⇒ Monitor ⇒ vSAN) for all your clusters for any issues with vSAN. Most of the alerts are self-explanatory, with a proper description; if you need further assistance in troubleshooting an issue, feel free to contact technical support or use the Ask VMware button.
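On vSAN 6.6 and later, the same health checks can also be pulled from the CLI of any host in the cluster, which is handy when the Web Client is unavailable; a minimal sketch:

esxcli vsan health cluster list   # summary of all vSAN health checks and their current status (vSAN 6.6+)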
Handling and troubleshooting Non-responsive Hosts
Many times, admins tend to reboot a host, or even multiple hosts, to get the hosts responsive in vCenter for management. You should strongly refrain from taking such actions unless you know what you are doing; such actions can cause "Data Unavailability and Data Loss" situations. It is better to contact vSAN technical support prior to taking such actions. Here is an example: one host running into high congestion (log/SSD/etc.) can cause multiple or all hosts in the cluster to go non-responsive, and addressing that one host may get all the other hosts responding again. We need to take preventive measures so that we only reboot that one host. Hence, do not panic and reboot hosts, as you may end up with a bigger problem to deal with.
Tip*: It is good to have SSH turned on for all hosts running vSAN, unless company policy dictates otherwise for a reason. It is easy to isolate the problem to a node or to specific objects when we have SSH access to the hosts, and looking at the live vmkernel.log / vobd.log / hostd.log will surely give us something.
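For example, once an SSH session is open, a minimal sketch of a live log review (the grep pattern is just an illustrative starting point):

tail -f /var/log/vmkernel.log              # watch vSAN (LSOM/DOM/CMMDS) kernel messages live
tail -f /var/log/vobd.log                  # watch VMkernel observation events (device/network issues)
grep -i congestion /var/log/vmkernel.log   # hunt for congestion messages after the fact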
How and when to change the storage policy for the VMs
You may need to change the storage policy for all VMs at once, or for just some of the VMs. You may wonder whether there are any consequences of doing this; well, YES, you might just take down the whole cluster!! Here is the explanation: if we change the policy for a VM (e.g., FTT=1 SW=1 "Default vSAN Policy" to FTT=1 SW=2 "Custom Default Policy"), vSAN creates a new top-level RAID-1 object and starts to create the components for the new policy, and once the components are successfully created and synced, the older components are cleaned up (deleted).
This means that if we have a VM of size 5 TB (current allocation) and you apply a new policy to it, we will be utilizing an additional 5 TB of storage space on vSAN while the transition is in progress; hence, ensure there is enough headroom on the vSAN datastore when changing the storage policy for one or more VMs. Also note that every policy change triggers resyncs, which can cause performance issues and congestion on hosts if there are too many re-syncing objects (VMs). See How-to-change-policy
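A hedged pre-flight check in RVC before (and during) a policy change, to confirm headroom and watch the resync traffic (the cluster path is a placeholder):

vsan.disks_stats ~/computers/(cluster)       # per-disk capacity and usage, to verify free headroom
vsan.resync_dashboard ~/computers/(cluster)  # bytes left to sync per object while the change runs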
Best Practices for storage policy change
- Make sure you have enough space to accommodate the additional consumption during the resync, with good headroom (VM size + headroom).
- Make sure there is no large resync already in progress before triggering a policy change.
- Make sure there are no congestion issues/warnings against the hosts before triggering a policy change (see the vSAN health plugin).
- Ensure all the capacity disks in your disk groups are balanced; otherwise, run a proactive rebalance first and then attempt the policy change. Please use RVC commands to achieve this (see the sketch after this list).
- As a best practice, do not trigger a policy change for too many VMs in one go.
- If any VMs run into performance issues while the policy change is in progress, reduce the resync threshold (via the GUI option available with vSAN 6.6), or else contact support for recommendations.
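A minimal sketch of the RVC proactive rebalance workflow referenced above (the cluster path is a placeholder; the command also accepts time and threshold options):

vsan.proactive_rebalance_info ~/computers/(cluster)     # check current disk balance and rebalance status
vsan.proactive_rebalance --start ~/computers/(cluster)  # start a proactive rebalance run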
Disk Group Maintenance Activities
If a situation arises where you need to delete a disk group for whatever reason, it is recommended to perform a pre-check on the cluster before attempting any disk-group deletion activity. The following steps will help you with the pre-check. Deleting a disk group can cause potential data loss and data unavailability (DU/DL) if you don't understand what you are doing.
- Understand the number of nodes you have in the cluster and pick the correct data migration mode for deleting a disk group on your cluster.
- If you are running a three-node vSAN setup, the only two available options are no data migration and ensure accessibility mode.
- Make sure there are no vSAN objects or virtual machines running with a policy of FTT=0 (RAID-0) on a disk group that is about to be deleted, as their only copy lives there. This is an irreversible operation, hence do such tasks at your own risk.
- If you are going to choose ensure accessibility to complete the task, you are at risk of losing good data objects/components if something were to happen to the other hosts and disk groups while this disk group is being deleted. Remember: "Always take backups" before taking any disruptive action like this.
- When we talk about full data migration/evacuation, we need to take additional precautions on an All-Flash setup with deduplication and compression enabled on the cluster. Please remember that all dedup and compression happens within a disk group, never across other disk groups within a host. Hence, when you delete a disk group with full data migration/evacuation, you will trigger a resync of uncompressed/un-deduped data, which may be huge depending on the dedup and compression ratio previously achieved on the disk group being deleted. Make sure you have enough free space (equivalent to or higher than the disk group actually being deleted) to avoid any cluster-wide problems (DU/DL).
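Finally, a hedged sketch of the pre-checks above before removing a disk group (the cluster path is a placeholder; the esxcli command runs on the host that owns the disk group):

vsan.disks_stats ~/computers/(cluster)           # see per-disk usage and component counts
vsan.whatif_host_failures ~/computers/(cluster)  # model whether remaining capacity can absorb the evacuation
esxcli vsan storage list                         # list the cache and capacity disks in each disk group on this host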