How to Upgrade a vSAN Cluster with Best Practices

Here is your simple "How to Upgrade vSAN with Best Practices" guide. I have been asked by a lot of people recently about the vSAN upgrade sequence, with questions like: what are my threats? What are the prerequisites? Should I upgrade all hosts together or one node after another? What should I check before upgrading my vSAN hosts? What are the dependencies before I start? You may also ask why you should upgrade your hosts / vCenter Servers at all. The answer is very simple: fixes for the known issues seen on each version or build are delivered in subsequent patches, updates, or version releases. It is recommended to bring your hosts to the latest patch available on your current version if you are not moving to the next version release.

If you are running vSAN 6.1 / 6.2 with hosts on 6.0 U1 / 6.0 U2, it is highly recommended to move to the latest build, ESXi 6.0 U3 (see Link and Link), where multiple known issues with vSAN have been addressed; moving to vSAN 6.6.1 is always a better option. Similarly, if you are running vSAN 6.5 (ESXi 6.5 GA or the 6.5a release), it is recommended to move to the latest build, ESXi 6.5 U1 (see Link and Link), which is vSAN 6.6.1.

Before you start to think about upgrades, make a list of things to be addressed along with your host upgrades: host BIOS, firmware, I/O controller firmware and drivers, and network card firmware and drivers. Putting these together smartly will save you significant time and avoid multiple reboots. I would also highly encourage you to go through the official VMware upgrade docs here, where a few additional steps are explained.

Upgrade Recipe Pre-Checks

  1. Check for possible inaccessible objects and VMs in the cluster, and log a case with VMware Support if you cannot determine what those objects are.
  2. Check virtual machine storage policy compliance. If there are any VMs with Failures to Tolerate (FTT) = 0, decide whether you need them; if you do, consider changing their policy to the default or another data-redundant policy.
  3. Check the current utilization of the vSAN datastore, estimate the possible resync/rebuild data when a host goes down, and confirm that the remaining hosts can accommodate that data. Also ensure there is no ongoing resync before attempting any maintenance mode task on the cluster.
  4. Check the utilization of each disk group and disk, and check whether they need a proactive rebalance.
  5. Ensure you have the latest backup for all VMs.
  6. Check and verify the current driver and firmware for the I/O controller and network card, and the disk firmware (cache and capacity).
  7. Ensure there are no critical errors in the vSAN health plugin (note: only versions 6.2 and above have a working health plugin) before starting the upgrade process.


Upgrade Recipe Pre-Checks Explained

1. Check for possible inaccessible objects and VMs

The easiest way is to go to the health plugin (works only on 6.2 and above releases) to see whether there are any inaccessible objects. I recommend, however, that you use and familiarize yourself with RVC commands, as these will be your best friend! It is also easy to check the same status from the RVC command line on the vCenter Server where your vSAN cluster resides; please see the RVC command line guide here. First log into RVC (see how to log into RVC), change directory to the vSAN cluster, then run vsan.check_state to look for any inaccessible objects.
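
A minimal RVC session might look like this (the vCenter SSO user and the datacenter/cluster path below are placeholders for your environment):

rvc 'administrator@vsphere.local'@localhost
cd /localhost/<Datacenter>/computers/<vSAN-Cluster>
vsan.check_state .

vsan.check_state reports inaccessible vSAN objects as well as VMs that are invalid or out of sync, which is exactly what you want to see come back clean before touching any host.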

  

2. Virtual machine compliance status check.

It is very important to ensure that the VMs on the vSAN datastore are always compliant with their assigned policies. If we miss this and proceed with host upgrades while VMs are non-compliant, we may end up in a possible data loss situation. Also, if there are VMs using an FTT=0 policy, either convert them to the vSAN default policy (and hit apply) if they are required, or delete them to avoid data loss. If you find VMs that are non-compliant with their policy but whose objects/components show as active, simply check for compliance and they should turn green: right click ⇒ VM policies ⇒ Check for compliance. If this does not work, choose Re-apply policy and they should become compliant.

  

3. Check the current utilization of the vSAN datastore and the what-if host failure scenario

Ensure that we have enough free space on the vSAN datastore to accommodate the data from a complete host, since performing upgrades on a per-host basis may trigger a complete resync/rebuild of the host that is under maintenance. Again, RVC is your best friend here: run "vsan.whatif_host_failures ." to see the vSAN datastore utilization after a host is taken down, and check whether there is any ongoing resync.
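
Run from the cluster directory in RVC, the output looks roughly like this (the numbers below are purely illustrative):

vsan.whatif_host_failures .
Simulating 1 host failures:

+-----------------+----------------------------+-----------------------------------+
| Resource        | Usage right now            | Usage after failure/re-protection |
+-----------------+----------------------------+-----------------------------------+
| HDD capacity    | 35% used (10.5 TB free)    | 47% used (8.2 TB free)            |
| Components      | 4% used (8600 available)   | 5% used (8200 available)          |
| RC reservations | 0% used (0 B reserved)     | 0% used (0 B reserved)            |
+-----------------+----------------------------+-----------------------------------+

If the "after failure" column would push the datastore uncomfortably full, free up space before you begin. "vsan.resync_dashboard ." in the same directory should report 0 bytes to sync before you proceed.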

4. Check the utilization of each disk group and disks, and check if they need a proactive rebalance

It is good to check whether there are any over-utilized disk groups or disks before starting an upgrade procedure. The RVC command "vsan.disks_stats ." at the cluster level shows the current disk group utilization, and "vsan.proactive_rebalance_info ." shows how much data needs to be rebalanced between disk groups and disks.
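
A quick sketch from the cluster directory in RVC; if the info command reports data to move, a rebalance can be kicked off with vsan.proactive_rebalance (option names can vary slightly by version, so treat this as illustrative):

vsan.disks_stats .
vsan.proactive_rebalance_info .
vsan.proactive_rebalance --start .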

5. Ensure you have taken the latest backups for all VMs

We all understand that the upgrades are non-disruptive, but does that mean nothing can go wrong? It is always good to have backups, and taking them for all VMs is highly recommended per best practices.

6. Check and verify the current driver and firmware for the I/O controller, the disk firmware (cache and capacity), and the network card drivers and firmware

Please note that you may not be able to get the current firmware versions from ESXi; you will need to use the hardware console for the hosts (Dell: iDRAC, HP: iLO, Cisco: KVM, etc.) from the respective vendor to determine the firmware for your controller, NIC cards, and other peripherals.

To check the current driver for compatibility, consult the VMwareHCL for your network card/driver, and the VMware-vSAN-HCL for I/O controller compatibility with your vSAN/host version, the recommended driver/firmware release, and SSD/HDD compatibility and recommended versions.

Here is an example of finding the current controller driver and firmware and comparing them with the VMware certified driver and firmware. If you find a mismatch, please correct it per the recommended version in the respective HCL guide. You will need to follow similar steps to identify the NIC cards in use on the host and their current drivers and firmware, and have them upgraded per the HCL guide. Please follow the KB article here to verify your current driver and firmware for all NIC cards and HBAs.

Step 1: Find the current I/O controller in use for vSAN

[root@is-tse-d155:~] esxcfg-scsidevs -a
vmhba0  lsi_mr3   link-n/a  sas.514187706c076000 (0000:02:00.0) Avago (LSI) Dell PERC H730 Mini
vmhba1  vmw_ahci  link-n/a  sata.vmhba1 (0000:00:11.4) Intel Corporation Wellsburg AHCI Controller
vmhba2  vmw_ahci  link-n/a  sata.vmhba2 (0000:00:1f.2) Intel Corporation Wellsburg AHCI Controller
vmhba64 iscsi_vmk online    iscsi.vmhba64 iSCSI Software Adapter
vmhba32 vmkusb    link-n/a  usb.vmhba32 () USB

Step 2: Find the current driver in use for the controller

[root@is-tse-d155:~] vmkload_mod -s lsi_mr3 | grep Version
 Version: 6.910.18.00-1vmw.650.0.0.4564106

Step 3: Get the VID, DID, SVID, and SSID to search for your component accurately in the HCL guide

[root@is-tse-d155:~] vmkchdev -l | grep vmhba0
0000:02:00.0 1000:005d 1028:1f49 vmkernel vmhba0

Step 4: Verify the hardware against the VMware vSAN HCL guide

I prefer the website https://hcl.captain-vsan.com, where you can enter the details above (VID, DID, SVID, SSID) and it will direct you to the corresponding VMware-vSAN-HCL entry. From this step you will know the recommended driver and firmware combination certified for vSAN for the respective controller.

Step 5: Verify the current firmware on the drives in use by vSAN, and whether they also need an upgrade

Please note that the firmware revision on the drives (both cache and capacity) should be at par with, or higher than, what is described in the vSAN-HCL guide. This is very critical: we have seen issues related to race conditions between the controller and the disks that cause all disks in a host to go into permanent device loss and the host to become non-responsive, triggering a massive resync and cluster-wide issues such as congestion.

To validate the current firmware, use the command esxcli storage core device list and check the Revision field, which corresponds to the firmware of the drive; then check the vSAN-HCL for the recommended firmware for that drive.

Example:

[root@is-tse-d155:~] esxcli storage core device list | less

naa.500003969809b581
 Display Name: Local TOSHIBA Disk (naa.500003969809b581)
 Has Settable Display Name: true
 Size: 286102
 Device Type: Direct-Access
 Multipath Plugin: NMP
 Devfs Path: /vmfs/devices/disks/naa.500003969809b581
 Vendor: TOSHIBA
 Model: AL13SXB30EN
 Revision: DK02
 SCSI Level: 6
 Is Pseudo: false
 Status: on
 Is RDM Capable: true
 Is Local: true
 Is Removable: false
 Is SSD: false
 Is VVOL PE: false
 Is Offline: false
 Is Perennially Reserved: false
 Queue Full Sample Size: 0
 Queue Full Threshold: 0
 Thin Provisioning Status: unknown
 Attached Filters:
 VAAI Status: unsupported
 Other UIDs: vml.0200000000500003969809b581414c31335358
 Is Shared Clusterwide: false
 Is Local SAS Device: true
 Is SAS: true
 Is USB: false
 Is Boot USB Device: false
 Is Boot Device: true
 Device Max Queue Depth: 64
 No of outstanding IOs with competing worlds: 32
 Drive Type: physical
 RAID Level: NA
 Number of Physical Drives: 1
 Protection Enabled: false
 PI Activated: false
 PI Type: 0
 PI Protection Mask: NO PROTECTION
 Supported Guard Types: NO GUARD SUPPORT
 DIX Enabled: false
 DIX Guard Type: NO GUARD SUPPORT
 Emulated DIX/DIF Enabled: false

With the model number and revision number you see for the disk in the example above, you can verify the recommended firmware revision for the disk on the vSAN-HCL.
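
The full listing is long; a quick filter (a minimal sketch, reusing the host and disk from the example above) pulls out just the fields needed for the HCL lookup:

[root@is-tse-d155:~] esxcli storage core device list | grep -E 'Model|Revision'
 Model: AL13SXB30EN
 Revision: DK02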

You have now successfully completed the pre-upgrade checks.


Begin Upgrade Process

 

Let's discuss the upgrade options. You may choose from three approaches, each with its own benefits and drawbacks; proceed with whichever suits your requirements:

  1. With full downtime on all the VMs: complete the upgrade on all hosts, then bring up the environment once the upgrade is done.
  2. With no downtime: maintenance mode with Ensure accessibility.
  3. With no downtime: maintenance mode with Full data migration.

Recipe 1: Upgrade with Full Downtime

This approach is the most preferred in many cases. The steps are very straightforward; all you need is downtime on all the virtual machines residing on the cluster that is about to be upgraded. Please ensure you have followed all pre-upgrade checks before proceeding with these steps.

  • Step 1: Shut down all the virtual machines on the cluster.
  • Step 2: Check VM storage policy compliance on all the virtual machines and ensure they are compliant with their assigned storage policy. Note: if there are any VMs with an FTT=0 (Failures to Tolerate) policy that are considered critical, please change their policy to the vSAN default policy before attempting any host maintenance mode.
  • Step 3: Once you have confirmed that all VMs are powered off and their compliance check is clean, proceed to put all hosts into maintenance mode with No data migration (see the command sketch after this list).
  • Step 4: You may now upgrade all hosts in one go to the recommended/desired build, and perform all the required upgrades at the hardware layer (BIOS, firmware, etc.).
  • Step 5: Once all hosts are upgraded to the target build, push the driver VIBs for the network cards, the controller driver, and any additional post-upgrade patches, then reboot all hosts once more. The upgrade sequence is now complete, and you may take all hosts out of maintenance mode.
  • Step 6: The VMs can now be powered on and brought back into production.
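
If you prefer the ESXi shell over the Web Client for Step 3, a minimal sketch (hostname reused from the examples above) of entering maintenance mode without data migration is:

[root@is-tse-d155:~] esxcli system maintenanceMode set -e true -m noAction

and once the upgrades are done, to exit maintenance mode:

[root@is-tse-d155:~] esxcli system maintenanceMode set -e false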

The advantage of this option is that no resync is required after the upgrade; the VMs can be brought online immediately and the complete vSAN bandwidth is available for VM storage traffic. The only drawback is the downtime requirement, which may not be feasible in all production scenarios.

 

Recipe 2: Upgrade with No Downtime (Ensure Accessibility)

This approach is preferred where there can be no downtime for the running VMs, which mostly host mission-critical business workloads. The steps are explained below; please ensure you have followed all pre-upgrade checks before proceeding.

 

  • Step 1: Check VM storage policy compliance on all the virtual machines and ensure they are compliant with their assigned storage policy. Note: if there are any VMs with an FTT=0 (Failures to Tolerate) policy that are considered critical, please change their policy to the vSAN default policy before attempting any host maintenance mode.
  • Step 2: Once all VMs are checked for compliance, make sure there is no ongoing resync or rebalance in the cluster, using the RVC commands on the vCenter Server.
  • Step 3: Increase the CLOM repair delay to two hours (120 minutes) ONLY if you think you will need more than one hour to complete the upgrade on a host; otherwise, once the delay expires, a rebuild kicks off, causing a complete resync of the components residing on the host in maintenance to the other hosts in the cluster (see the command sketch after this list). Then proceed to put one host into maintenance mode with the Ensure accessibility option from the Web Client; if DRS is not fully automated, you will have to manually vMotion the VMs to other hosts. Note that the wizard for changing the repair delay looks slightly different between 6.0 and 6.5.
  • Step 4: Once the first host is in maintenance mode, you may proceed with all of the upgrades. Start with the host upgrade to the target build, then proceed with any hardware recommendations (BIOS, firmware, etc., including your I/O controller firmware, SSD/HDD firmware, and NIC firmware). Once these are completed, check whether the host requires any additional patches or driver upgrades for your I/O controller and NICs; push them all in one go and reboot once more. The host can now exit maintenance mode.
  • Step 5: After taking the first host out of maintenance mode, watch for a possible resync using the RVC command "vsan.resync_dashboard ." in the cluster directory. If there is an ongoing resync, DO NOT attempt maintenance mode on any of the other hosts; wait for the resync to complete. Please contact VMware technical support if there are anomalies around resync completion or any other issue.
  • Step 6: After confirming the resync is complete (0 GB to resync), proceed to upgrade the next host, following Steps 1 to 5 in a cyclic manner, one host after another.
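
For Step 3, a minimal sketch of changing the repair delay from the ESXi shell instead of the Web Client (the advanced option is VSAN.ClomRepairDelay, the value is in minutes, and clomd must be restarted for the change to take effect; set it on every host in the cluster, and remember to revert it to the default of 60 minutes once the upgrade is complete):

[root@is-tse-d155:~] esxcli system settings advanced set -o /VSAN/ClomRepairDelay -i 120
[root@is-tse-d155:~] /etc/init.d/clomd restart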

Recipe 3: Upgrade with No Downtime and Full Data Migration

This approach is preferred where company policy dictates no downtime and no compromise on data redundancy. It is only applicable to normal vSAN clusters with more than three nodes; with fault domains, you must have enough free space on the other hosts within the same fault domain to achieve this. This process is time consuming and will take longer to complete across all hosts, as there are multiple resync cycles involved. Please ensure you have followed all pre-upgrade checks before proceeding with these steps.

  • Step 1: Check VM storage policy compliance on all the virtual machines and ensure they are compliant with their assigned storage policy. Note: if there are any VMs with an FTT=0 (Failures to Tolerate) policy that are considered critical, please change their policy to the vSAN default policy before attempting any host maintenance mode.
  • Step 2: Once all VMs are checked for compliance, make sure there is no ongoing resync or rebalance in the cluster, using the RVC commands on the vCenter Server.
  • Step 3: Proceed to put one host into maintenance mode with Full data migration, and wait for the resync to complete; you may monitor the resync using RVC. The host will only finish entering maintenance mode after the resync completes (see the command sketch after this list).

  • Step 4: Once the first host is in maintenance mode, you may proceed with all of the upgrades. Start with the host upgrade to the target build, then proceed with any hardware recommendations (BIOS, firmware, etc., including your I/O controller firmware, SSD/HDD firmware, and NIC firmware). Once these are completed, check whether the host requires any additional patches or driver upgrades for your I/O controller and NICs; push them all in one go and reboot once more. The host can now exit maintenance mode.
  • Step 5: After taking the first host out of maintenance mode, watch for a possible resync using the RVC command "vsan.resync_dashboard ." in the cluster directory. If there is an ongoing resync, DO NOT attempt maintenance mode on any of the other hosts; wait for the resync to complete.
  • Step 6: After confirming the resync is complete (0 GB to resync), proceed to upgrade the next host, following Steps 1 to 5 in a cyclic manner, one host after another.
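
The shell equivalent of Step 3 (a sketch; the host will not finish entering maintenance mode until the evacuation completes, so expect this to run for a while):

[root@is-tse-d155:~] esxcli system maintenanceMode set -e true -m evacuateAllData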

 

Note: there is a chance that the vSAN cluster becomes network partitioned after the upgrade, especially when moving from vSAN 6.1 / 6.2 / 6.5 to vSAN 6.6, as multicast is no longer used from version 6.6 onwards. In these cases you will need to manually add the unicastagent address list on all hosts in the cluster; I encourage you to see Troubleshooting-Unicastagent.
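
For reference, a minimal sketch of inspecting and populating the unicast agent list from the ESXi shell (the UUID and IP below are placeholders; on each host, add one entry for every other host in the cluster, as described in the troubleshooting guide linked above):

[root@is-tse-d155:~] esxcli vsan cluster unicastagent list
[root@is-tse-d155:~] esxcli vsan cluster unicastagent add -t node -u <other-host-uuid> -U true -a <other-host-vsan-ip> -p 12321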

Upgrading the vSAN Disk Format Version

The disk format upgrade may be performed either from the vCenter Web Client or through RVC; I suggest you first meet the prerequisites for the disk format upgrade before attempting it. I personally prefer RVC for this, and recommend you refer to the links RVC-Method or Webclient-method. Please also refer to Link for a better understanding of disk format upgrades and their prerequisites, and refer to How to log into RVC.
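
Through RVC, the upgrade itself is a single command run from the cluster directory (a sketch; vsan.ondisk_upgrade rolls through the disk groups one at a time, and vsan.disks_stats afterwards shows the on-disk format version per disk):

vsan.ondisk_upgrade .
vsan.disks_stats .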

