Huawei FusionServer RH2288 V3: the not-so-ready “vSAN Ready Node”

On the face of it, this heading may seem as if I’m taking a shot at Huawei. I want to make it clear that this post is in no way aimed at bad-mouthing Huawei or any other vendor; it is merely here to assist anyone in the community who may face a problem similar to the one I had with my vSAN deployment on this specific hardware platform.

This was my first VMware vSAN deployment and the “nerd” in me was super excited. I knew that I would be learning a “whack load” of new cool things. Fortunately, I had read “Essential Virtual SAN” by Duncan Epping and Cormac Hogan while studying for my VCAP6-DCD exam, so I had a good technical understanding of how vSAN works and how it should be deployed.

As with many new vSAN deployments, a lot of research goes into finding the correct hardware. VMware has “pre-certified” certain vendor hardware to make our jobs easier. VMware refers to these as “vSAN Ready Nodes”; you can get more info here.

The hardware vendor we used was Huawei and the model is the “RH2288H V3”. We opted for the All-Flash configuration. The most important component with any vendor choice is the storage controller. Because this server was certified by VMware as a Ready Node, I expected no installation problems. Little did I know I was in for two days of troubleshooting. Anyway, I’ll get to that later…

Before you begin installing anything, there are a couple of things you’ll need to change in the BIOS. Every Huawei server’s BIOS is protected with the default password “Huawei12#$”. This can be disabled or changed within the BIOS setup utility. If you are going to be enabling EVC on your cluster, you will also need to enable “Monitor/MWait”.

huawei_bios

Next, we need to configure the RAID controller for pass-through. The RAID controller that ships with the RH2288H V3 is Huawei’s own “SR 430C / RU 430C”, which is based on the LSI3108 chipset. Our vSAN nodes required two disk groups, and for that reason we also got an additional PCIe RAID controller for the second disk group. To get into the RAID card BIOS, press the standard “CTRL-R” during RAID card initialisation. Once in, select your RAID controller and enable JBOD. In my case I had to do this for both controllers. Save your config and restart.

LSI3108
I’m not going to post about how to install ESXi on this node. The process is similar to my post on Installing ESXi on Huawei : RH5885H V3 (FusionServer). The one exception is an updated driver that you can obtain from Huawei’s support site here. Download the latest version available; we’ll need it later in the post.

Now that you have ESXi installed, you’ll need to do all the basic configuration and get it connected to your VCSA. In my deployment I made use of LAGs and therefore required additional configuration to get this up. This is not a requirement for vSAN; you can use a standard virtual switch.
Once you have your port groups configured, you’ll need to configure the vmkernel ports for vSAN. You have the option to create a new TCP/IP stack for vSAN, or you can use the default stack. The latter is mainly used when you have a stretched vSAN cluster and require routing across your vSAN vmkernel ports. In my design this was not a requirement.
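As a sketch, the vmkernel port can also be created and tagged for vSAN traffic from the CLI; the interface name, port group name, and addressing below are examples and will differ in your environment:

```shell
# Create a vmkernel interface on an existing port group (names/IPs are examples)
esxcli network ip interface add --interface-name=vmk2 --portgroup-name=vSAN-PG
esxcli network ip interface ipv4 set -i vmk2 -I 10.10.50.11 -N 255.255.255.0 -t static

# Tag the interface for vSAN traffic
esxcli vsan network ip add -i vmk2

# Verify the vSAN network configuration
esxcli vsan network list
```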

vmk_create_vsan

The next step would be to enable vSAN at the cluster level. It’s at this point that I discovered the “vSAN Ready Node” was far from ready for anything.
If you enable vSAN at this point, you will notice that the disk group creation gets stuck at 21% and the host becomes unresponsive.

You will see errors similar to this in the vmkernel.log:

VSAN error
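If you want to watch for these errors live, you can tail the log from an SSH session on the host; the filter pattern below is just an example:

```shell
# Follow the vmkernel log and filter for storage-controller/vSAN related messages
tail -f /var/log/vmkernel.log | grep -iE 'lsi_mr3|megaraid|vsan'
```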

This is a known issue and VMware has a KB article for it. Although the KB is written for a Dell PERC controller, further research has also linked the issue to Huawei’s “SR 430C / RU 430C”.

To resolve the issue, I restarted the host and reconfigured the RAID controllers to remove the disks from the host. This can be done by entering the RAID controller configuration utility and disabling “JBOD”.

The host will start up and connect back to vCenter. Now we’re going to follow the KB article related to the problem. Enable SSH on the host, connect to it, and execute:

esxcli system module set --enabled=false --module=lsi_mr3

This will force ESXi to use the correct driver, which is megaraid_sas.
Restart the host.
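After the reboot you can confirm the change took effect; this is a quick sanity check, not part of the KB itself:

```shell
# lsi_mr3 should now show as disabled
esxcli system module list | grep -E 'lsi_mr3|megaraid_sas'

# The controller should now be claimed by the megaraid_sas driver
esxcli storage core adapter list
```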

I decided to go over Huawei’s documentation to see if there were any known issues related to this, and discovered that they have a separate utility called “iDriver”. This tool will check all the firmware and driver versions on the host and update them if needed. You can find the tool here.

Extract the tool and copy it to your host using scp. Open another SSH session to your host and execute install_driver.sh.
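For example (the extracted bundle directory and hostname below are placeholders for your own environment):

```shell
# Copy the extracted iDriver bundle to the host
scp -r iDriver/ root@esxi-host:/tmp/

# Then, from an SSH session on the host, run the installer
cd /tmp/iDriver
sh install_driver.sh
```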

You will be prompted with an iDriver installer menu.

idriver_menu

From here you can either choose to automatically install the required driver and firmware versions, or you can validate the current info. I’m going to select option two, but you would select option one if this were a new installation. You can then reboot and use option two to validate the firmware and driver status. You should get an output like the one below:

Screenshot from 2017-02-22 14-00-59

After this you can reboot your ESXi node and reconfigure the RAID controllers (re-enabling “JBOD”).

You should now be able to enable and configure vSAN.
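Once vSAN is up, a couple of optional CLI checks from any host in the cluster can confirm that the host has joined and its disks were claimed:

```shell
# Show the host's vSAN cluster membership and state
esxcli vsan cluster get

# List the devices claimed by vSAN for disk groups
esxcli vsan storage list
```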

 

