Storage Resiliency in Nutanix. (Think about the Architecture!)

Hyperconverged infrastructure is a great technology, but it does have its caveats.  You have to understand the architecture and design your environment appropriately.  Recently I had a Nutanix cluster that had lost storage resiliency.  Storage resiliency means there is enough free storage to rebuild data in the event of the loss of a node; losing it means a node failure can no longer be tolerated.  When storage is written, it is written both locally and on a remote node.  This provides data resiliency, but at the cost of increased storage usage, conceptually similar to RAID on traditional storage.

I had three nodes that were getting close to 80% usage on the storage container.  80% is fairly full, and if one node went down the VMs running on that host would not be able to fail over, because the remaining nodes would not have enough free storage for those VMs to HA to.  Essentially, whatever was running on that host would be lost, including what is on its drives.  I really wish Nutanix would add a feature that prevents you from using more storage than resiliency allows.
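For a rough sense of the math, here is a minimal Python sketch of the back-of-the-envelope check I do now.  The capacities are hypothetical and the model is simplified (it ignores CVM overhead, reserved space, and per-node imbalance), but it shows why roughly two thirds of total capacity is the ceiling on a three-node cluster: the remaining two nodes have to be able to absorb everything.

# Simplified resiliency check -- hypothetical numbers; ignores CVM/Curator
# overhead, reserved space, and per-node imbalance.
def can_tolerate_node_loss(node_capacities_tib, used_tib):
    """True if the remaining nodes could hold all data after losing
    the largest node (the worst case)."""
    total = sum(node_capacities_tib)
    worst_case_remaining = total - max(node_capacities_tib)
    return used_tib <= worst_case_remaining

nodes = [20.0, 20.0, 20.0]        # raw usable TiB per node (hypothetical)
used = 0.80 * sum(nodes)          # roughly 80% full, as in the cluster above

print(can_tolerate_node_loss(nodes, used))   # False: 48 TiB used, only 40 TiB left
# Staying under (N-1)/N of total capacity -- about 66% on three nodes --
# is what keeps storage resiliency.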

I had two options to remedy this.  I could either add more storage, which would also require the purchase of another node, or I could turn off replication.  Each cluster was replicating to the other, resulting in double the storage usage.  With replication the RPO was one hour, but there were also backups, which gave an RPO of 24 hours.  An RPO of 24 hours was deemed acceptable, so replication was disabled.  The space freed up was not available instantly; Curator still needed to run background jobs to make the new storage available.


A lot of the time users will just look at the CPU overcommitment ratio or the memory usage and forget about the storage.  They are still thinking in the traditional three-tier world.  Like any technology, you need to understand how everything works underneath.  At the end of the day, architecture is what matters.

Nutanix Node Running Low On Storage

I manage a few Nutanix clusters, and they are all-flash, so the normal tiering of data does not apply.  In a hybrid configuration, which has both spinning and solid-state drives, the SSDs are used as the read and write cache, and only “cold” data is moved down to the slower spinning drives as needed.  The other day one of the nodes’ local drives was running out of free space, and it made me wonder: what happens if they do fill up?

Nutanix tries to keep everything local to the node.  This provides low-latency reads, since the data does not have to cross the network, but writes still do.  The reason is that you want at least two copies of the data: one local and one remote.  So when a write happens, it is written synchronously to the local node and to a remote node.  Those remote copies are spread across all nodes in the cluster, so in the event of a lost node the cluster can use every remaining node to rebuild the data.
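As a conceptual illustration only, and not the actual Nutanix placement logic, the little Python sketch below mimics the idea of replication factor 2: every write lands on the local node plus one other node picked from the rest of the cluster, so the loss of any single node still leaves a copy somewhere else.

import random

# Conceptual RF2 placement sketch -- not the real Nutanix algorithm, which
# also weighs capacity, health, and fault domains.
def place_rf2_write(local_node, cluster_nodes):
    """Return the two nodes that will hold copies of this write."""
    remote_candidates = [n for n in cluster_nodes if n != local_node]
    remote_node = random.choice(remote_candidates)
    return [local_node, remote_node]

cluster = ["node-a", "node-b", "node-c"]
print(place_rf2_write("node-a", cluster))   # e.g. ['node-a', 'node-c']

Because the remote copies end up scattered across every node rather than mirrored to a single partner, a rebuild after a failure can pull data from all surviving nodes in parallel.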

When the drives do fill up, nothing dramatic happens.  Everything keeps working and there is no downtime.  The local drives become read-only, and writes are then sent to at least two other nodes, which keeps the data redundant.

To check the current utilization of your drives, go to Hardware > Table > Disk.
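If you would rather script that check, something like the Python sketch below should work against the Prism v2 REST API.  The endpoint and stat keys reflect my reading of the API and may differ by AOS version, and the address and credentials are placeholders.

import requests

PRISM = "https://prism.example.com:9440"   # placeholder cluster address
AUTH = ("admin", "password")               # placeholder credentials

# Pull per-disk utilization; field names may vary between AOS versions.
resp = requests.get(
    PRISM + "/PrismGateway/services/rest/v2.0/disks/",
    auth=AUTH,
    verify=False,   # lab only -- use proper certificates in production
)
resp.raise_for_status()

for disk in resp.json().get("entities", []):
    stats = disk.get("usage_stats", {})
    used = int(stats.get("storage.usage_bytes", 0))
    capacity = int(stats.get("storage.capacity_bytes", 1))
    print("{}: {:.1f}% used".format(disk.get("disk_uuid"), 100.0 * used / capacity))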


So it is best practice to “right-size” your workloads and make sure the VMs’ storage needs can be met by the local drives.  HCI is a great technology; it just has a few different caveats to consider when designing for your workloads.

If you want a deeper dive, check out Josh Odgers’ post on the subject.

ECC Errors On Nutanix

When logging into a Nutanix cluster, I saw that I had two critical alerts.


A quick search turned up KB 3357.  I SSHed into one of the CVMs running on my cluster and ran the following command as one line.

ncc health_checks hardware_checks ipmi_checks ipmi_sel_correctable_ecc_errors_check

Looking over the output, I quickly found the line identifying the failing memory module.


I forwarded all the information to support and will replace the faulty memory module when it arrives.  Luckily, I have not seen any problems from it so far, and I really liked how quick and easy the whole process was on Nutanix.

Nutanix .NEXT

I was honored this year to be chosen as a part of the Nutanix Technology Champions.  I have only recently started using Nutanix, but I could clearly see what made it different.  In an age when the public cloud is all the rage, I could see something like Nutanix really keeping the private data center alive.  The public cloud is so popular because it is such an easy way to consume resources as you need them.  I see what Nutanix is doing and how it is really converging everything: not only are storage and compute converged, but so is the underlying software.  One software package to rule them all, instead of a separate piece of software for every feature you need.

That is why I am so excited about being a part of the NTC, and about the free pass I received to go to the Nutanix .NEXT conference.  Conferences can be really expensive, and I would like to thank Nutanix for investing in its community and showing its appreciation.  I know I will meet a lot of new people there and learn a lot of new things.  Conferences are really about networking with your peers and sharing knowledge, and for that reason I see them as valuable as any training.  Hopefully I will meet everyone who reads this post, and if you have not registered yet, you can at .NEXT.

Updating Prism Central to 5.0

Nutanix recently released the 5.0 code, and it has a lot of really nice new features.  In a future post I plan on going over some of them and detailing why they are so important.

Before you start upgrading all of your Nutanix hosts, you should first upgrade Prism Central to 5.0.  This server gives you an overview of all your Nutanix clusters and some management capabilities.  I am still fairly new to Nutanix, so I was not sure how to upgrade Prism Central.  Usually you can upgrade from within the Nutanix console, but with this being a brand-new release, you had to download it directly from the website.  Sometime soon it will be part of the automatic upgrade within the console.

At first I was a little confused about how to upgrade.  You would think there would be a separate upgrade for Prism Central, since it was originally a separate install.  Instead the update is included in the AOS 5.0 download.  Download the complete AOS 5.0 bundle and also download the upgrade metadata file.
Once you have everything downloaded, log in to Prism Central.  Next, click the gear icon and then click Upload the Prism Central Binary.  Point it to the AOS 5.0 download and the metadata file, click Upgrade, and soon you will be running Prism Central 5.0.
