Migrating from VMware to Nutanix: Before It Happens
My Role
So I am fairly low on the totem pole for decision making where I work, and I only started here recently (2023). In late 2024 I was placed on a team migrating our on-prem “cloud” to another datacenter. That move was nearly complete, and I was brought on specifically to support the follow-on goal of migrating to Nutanix. As of now, I am the architect of a large-scale migration/conversion of virtually all of our data. We are moving data from VMDKs residing in ESXi datastores to SMB and NFS shares hosted on Nutanix Unified Storage (NUS). Because of issues surrounding this process, I am also providing input on our plan to migrate our clusters from ESXi to AHV. That involves procedure planning and remediating existing issues we live with today that would be amplified by the move to AHV (mostly around End User Computing (EUC) and networking). The job I am supposed to do? Digital forensics. Lol.
How We Got Where We Are
So the name of this article is a bit of a lie. We already run on Nutanix-based infrastructure (and a lot of it). However, on top of the physical Nutanix hardware we are still running ESXi hypervisors. This is due to strange planning and execution during a move between datacenters we performed recently. In the old datacenter, we had a standard VMware vSphere-based environment (as most people seemingly did). A plan was hatched around 2020 to migrate to a new datacenter with better hardware and a better location. However, this plan was more of a concept than anything else, so over time parts of it were planned out and executed in small bursts. Hardware was purchased and shipped to the datacenter, and the choice was made to go with Nutanix. The nodes were gradually set up and networked, and by late 2024 we had a complete environment in the new datacenter modeled after the old one. This, however, meant we made no design decisions that took advantage of the Nutanix ecosystem. It became essentially the same environment as the old datacenter, running on newer hardware with the Nutanix control plane layered over it.
Fast forward to now: we have over 150 Nutanix nodes, and we continue to throw more hardware at a live system while simultaneously attempting to migrate off of VMware. The pressure and decisions that led to cutting operations over from the old datacenter without reconsidering the design beforehand have created a nightmare. During the cutover we hit many problems. Restoring data from backups was slower than expected and produced many corruption errors that we could not easily remediate due to the unique way we logically store data.
The restore process worked poorly because of how we handled sending data from our backup provider into ESXi datastores: we were tunneling all restore traffic through one Nutanix Controller VM (CVM) per cluster. I was not directly involved in this process, but the way I understand it is that our backup provider was not designed to send data directly to Nutanix storage containers (which host these datastores when using Nutanix on top of ESXi). As a result, we were attempting to push an incredible amount of data through a single VM, when this traffic is intended to be distributed across the CVMs. We overloaded these VMs, so they missed writes and often crashed. During the cutover, much of my work involved monitoring the CVMs and ensuring they were working properly. After talking with Nutanix engineers, we were told directly that this was not a supported method of handling data ingress. To work around it, we increased the memory available to the CVMs to mitigate the crashing and split the backup restore jobs to point at different CVMs within a cluster. This was the best solution we found; that process is long over, so it is no longer a concern for us.
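For illustration, the job-splitting we ended up doing amounted to manual round-robin. Here is a minimal sketch of that idea; the CVM IPs, job names, and the assign_restore_targets helper are all hypothetical placeholders, not our backup provider's actual API:

```python
from itertools import cycle

# Hypothetical CVM IPs for one cluster; in reality these came from Prism.
cvm_ips = ["10.0.16.11", "10.0.16.12", "10.0.16.13", "10.0.16.14"]

# Hypothetical restore jobs, roughly one per datastore being restored.
restore_jobs = [f"restore-job-{i:03d}" for i in range(1, 25)]

def assign_restore_targets(jobs, targets):
    """Spread restore jobs across CVMs instead of funneling everything through one."""
    rotation = cycle(targets)
    return {job: next(rotation) for job in jobs}

assignments = assign_restore_targets(restore_jobs, cvm_ips)
for job, cvm in assignments.items():
    print(f"{job} -> {cvm}")
```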
Regarding the corruption errors, we never found out whether the fault lay with the backup provider or with how we were writing the backups to our Nutanix hosts. The corruption was silent, though. We were migrating close to 4 petabytes of data, the majority of it stored on VMDKs. Because of how our backup provider stored this data and the way restores happened, we could not simply hash everything to verify it (and hashing that much data on-demand would take an enormous amount of time). Restored VMDKs came back thin-provisioned even though they were originally thick-provisioned, so each disk would have to be inflated before a hash could even be run, and no one working on this wanted to test whether that would even work. Instead, we tried moving VMDKs between datastores: vSphere throws an error when you attempt to move a VMDK whose metadata is corrupted, and any disk that failed the move could be re-restored. This hacky approach let us find the low-hanging fruit. The remaining corruption could not easily be found, so we have relied on user reports to decide whether any ad-hoc restores are needed (which we have had to do as recently as last week).
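To put "an enormous amount of time" into rough numbers, here is a back-of-envelope estimate. The throughput and parallelism figures are assumptions for illustration, not measurements from our environment:

```python
# Back-of-envelope: how long would it take to hash ~4 PB of restored data?
# Per-stream throughput and stream count below are assumptions, not measurements.
total_bytes = 4 * 1000**5            # ~4 PB
per_stream_mb_s = 500                # assumed sustained read+hash rate per worker
streams = 10                         # assumed parallel hashing workers

seconds = total_bytes / (per_stream_mb_s * 1000**2 * streams)
print(f"{seconds / 86400:.1f} days")  # roughly 9 days even with 10 parallel streams
```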
Our environment is split into two parts: a one-cluster development environment and a nine-cluster production environment. We have no staging environment, and our development environment fundamentally lacks components present in production. As such, testing occurs in production and requires standing up a mini-“staging” environment for each change. Due to issues during the Foundation process (which have somehow persisted for nearly six months at this point), the team responsible has since given up, and life simply has to be lived without an isolated staging environment.
Problems Since Moving
Since completing this move, we have experienced many cases of things not working exactly as intended. For instance, much of our core infrastructure (vCenter, domain controllers, web servers, and more) resides on one cluster. When we ran AOS and other upgrades on that cluster, we found that the VMs were not being properly migrated. This led to a loss of availability: our domain controllers would lock up until hard-reset, and we would lose access to vSphere and have to remote into the host to recover it manually. This is still a problem for us, so we save upgrades for off-hours, losing one of the benefits of Nutanix that we are meant to have.
Our network configuration also has serious design issues. For instance, our EUC VMs are deployed on the same VLAN as our Nutanix hosts and CVMs (which is a /20 subnet, by the way, lol). This directly goes against Nutanix recommendations and makes the environment significantly less secure. That risk is only mitigated by the environment being a dark site (offline); all it takes is one malicious user. This issue is being worked in tandem with our migration to AHV, since that move will require us to change how we serve VMs to end users.
Another network problem came up when a team new to our cloud environment needed an object store for data science work. Nutanix offers an essentially one-click solution for this that stands up a managed Kubernetes cluster with access to an entire cluster’s storage for objects. However, for six weeks they were unable to deploy it. The network team blamed Nutanix, and Nutanix blamed the network team. Nutanix engineers found that packets were being dropped intermittently, at a high enough rate that deployment would fail. Eventually the network team was convinced to move that data science cluster to another VLAN, and it worked. They will still swear it is not a network issue, by the way. We are now essentially at the “just one more lane” phase of our network design. Just one more VLAN will fix it.
Data Migration Design
Now, back to my part in all of this. I have been tasked with setting up our NUS infrastructure and moving all of our data onto it. This means deploying file servers on every cluster we plan to use for data storage and migrating the data to them. Seems simple enough, and it really should be. However, there are a few sticky issues: network considerations, where to place the data, and how to avoid disrupting users while the data moves.
Network Considerations
Nutanix File Servers are a very simple one-click way to deploy a set of servers that are automatically load-balanced and serve NFS or SMB shares. Give it an internal network to talk to the CVMs, a client network to talk to the EUC VMs, configure the shares, and you are golden. However, there are some restrictions. Depending on how your network is configured, the VLANs assigned to the internal, client, and EUC networks can leave the deployment inoperable. Since our EUC VMs are on the same subnet as our CVMs, the only working configuration for us would have been placing the file servers’ internal and client subnets on that same subnet. That would have been an incredibly bad band-aid, amplifying our existing network problems and piling up more future work. Instead, we decided to move the CVMs to a different subnet and use a previously unused subnet for the Files clients, which left all three components on separate subnets.
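To illustrate the kind of sanity check this involves, here is a minimal sketch using Python’s ipaddress module. The subnet values are made up for illustration (deliberately reproducing the EUC/CVM overlap we started with), not our actual addressing plan:

```python
import ipaddress
from itertools import combinations

# Hypothetical subnets for illustration; not our real addressing plan.
# The EUC range deliberately overlaps the CVM range, mirroring our original layout.
subnets = {
    "cvm_internal": ipaddress.ip_network("10.10.0.0/24"),
    "files_client": ipaddress.ip_network("10.20.0.0/24"),
    "euc_clients":  ipaddress.ip_network("10.10.0.0/20"),
}

# Flag any pair of networks that overlap; overlapping internal/client/EUC
# subnets is exactly the situation that made the original design unworkable.
for (name_a, net_a), (name_b, net_b) in combinations(subnets.items(), 2):
    if net_a.overlaps(net_b):
        print(f"Overlap: {name_a} ({net_a}) and {name_b} ({net_b})")
```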
Where to Place the Data
So we knew we wanted to place the data on NUS File Servers. However, we also knew that we wanted to move to AHV sooner rather than later. We had one large storage cluster that was already on AHV, so we decided to make it the first landing spot for the data. The plan became to empty another cluster of data, convert it to AHV, stand up another file server on that cluster, and then load balance between the clusters, repeating until all of our clusters were on AHV. This meant I would have to incorporate the cross-cluster load-balancing operation into the design. Since our data has unusual storage requirements due to the nature of what it is used for, I unfortunately cannot go too deep into this process. However, a solution was found that uses the Nutanix Move utility to move selected data from the old file server to the new one.
Avoiding User Disruption
The biggest problem with this migration is that it is operating in a live system. While data is being moved, it cannot be accessed by users (since the majority of the VMDKs users access are thin provisioned). User work cannot be interrupted, and that work can span weeks. Our maintenance windows are infrequent, and even a three-day window would not be enough to move any significant portion of this data. Due to strict integrity requirements, and because we do not want to risk the corruption we saw during the restore, data must be hashed before and after being moved to the file server. Each source is an NTFS partition on a VMDK, so the hashing itself is not a big concern; it just takes a lot of time to hash every file when the data is in the petabytes and consists of billions of individual files. As such, the migration is done in sweeps.
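As a rough illustration of the before/after verification, here is a minimal sketch that builds a hash manifest for a mounted source and compares it against the destination share. The mount paths and the choice of SHA-256 are assumptions for illustration, not our actual tooling:

```python
import hashlib
from pathlib import Path

def build_manifest(root: Path) -> dict[str, str]:
    """Hash every file under root, keyed by its path relative to root."""
    manifest = {}
    for path in root.rglob("*"):
        if path.is_file():
            digest = hashlib.sha256()
            with path.open("rb") as f:
                for chunk in iter(lambda: f.read(1024 * 1024), b""):
                    digest.update(chunk)
            manifest[str(path.relative_to(root))] = digest.hexdigest()
    return manifest

# Hypothetical mount points: the source NTFS volume and the destination SMB share.
source = build_manifest(Path("/mnt/source-vmdk"))
destination = build_manifest(Path("/mnt/files-share"))

mismatches = {p for p in source if destination.get(p) != source[p]}
print(f"{len(mismatches)} files failed verification")
```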
Each VMDK belongs to a group of VMDKs. When any one VMDK in a group is in use, the entire group must be left alone; likewise, while any VMDK in a group is being migrated, no other VMDK in that group may be used by an end user. We scan by group to find groups that are entirely idle, lock the group, and work exclusively on it. Unfortunately, a single group can take well over a day - the variation between VMDKs is incredible. Some contain literally nothing, while others contain millions of 4 KB files. There is no getting around this, and since work has not officially begun we cannot tell how much of a problem these extremes will present. It also makes giving estimates a bit sticky. For instance, the largest group contains around 1,500 VMDKs. The average VMDK holds less than 10 GB of data spread across fewer than 1,000 files, but for extreme power users this can be several hundred GB.
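For what it’s worth, the sweep logic boils down to something like the following sketch. The group objects, the in-use check, and the lock/migrate/verify helpers are all hypothetical placeholders standing in for our actual tooling:

```python
def sweep(groups, is_in_use, lock, unlock, migrate_and_verify):
    """One pass over all VMDK groups: lock and migrate idle groups, defer busy ones."""
    deferred = []
    for group in groups:
        # Skip the whole group if any member VMDK is currently in use.
        if any(is_in_use(vmdk) for vmdk in group.vmdks):
            deferred.append(group)
            continue
        lock(group)                     # block end-user access to the whole group
        try:
            migrate_and_verify(group)   # copy to the file server, then hash-compare
        finally:
            unlock(group)
    return deferred                     # groups to retry on the next sweep

# Hypothetical usage: keep sweeping until every group has been migrated.
# while groups:
#     groups = sweep(groups, is_in_use, lock, unlock, migrate_and_verify)
```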
We plan to divide these groups into subgroups so that work begins with the low-disruption groups and moves to the higher-disruption ones as we gain confidence in the process.
Going Forward
This plan is set to begin execution by July 2025, and I will provide updates and more specific details once it does. Similarly, when we convert a cluster to AHV and eventually migrate completely off of VMware infrastructure, I will share the best details I can about that process. My work is a major prerequisite for that conversion, but I will not be directly involved beyond tracking it and reporting to my management. I apologize for the lack of detail here; this was primarily a vent post, as this work has stressed me out for months. I am not qualified to do this. It is not even close to my job description. I do not want to do this. But it has ultimately become my task, I guess. At least my involvement is a fun resume point.