Moving terabytes or even petabytes of data to the cloud is a daunting task. But it is important to look beyond the number of bytes. You probably know that your applications are going to behave differently when accessed in the cloud, that cost structures will be different (hopefully better), and that it will take time to move all that data.
Because my company, Data Expedition, is in the business of high-performance data transfer, customers come to us when they expect network speed to be a problem. But in the process of helping companies overcome that problem, we have seen many other factors that threaten to derail cloud migrations if left overlooked.
Collecting, organizing, formatting, and validating your data can present much bigger challenges than moving it. Here are some common factors to consider in the planning stages of a cloud migration, so you can avoid time-consuming and expensive problems later.
Cloud migration bottleneck #1: Data storage
The most common mistake we see in cloud migrations is pushing data into cloud storage without considering how that data will be used. The typical thought process is, “I want to put my documents and databases in the cloud and object storage is cheap, so I’ll put my document and database files there.” But files, objects, and databases behave very differently. Putting your bytes into the wrong one can cripple your cloud plans.
Files are organized by a hierarchy of paths, a directory tree. Each file can be quickly accessed, with minimal latency (time to first byte) and high speed (bits per second once the data begins flowing). Individual files can be easily moved, renamed, and changed down to the byte level. You can have many small files, a small number of large files, or any mix of sizes and data types. Traditional applications can access files in the cloud just like they would on premises, without any special cloud awareness.
All of these advantages make file-based storage the most expensive option, but storing files in the cloud has a few other disadvantages. To achieve high performance, most cloud-based file systems (like Amazon EBS) can be accessed by only one cloud-based virtual machine at a time, which means all applications needing that data must run on a single cloud VM. To serve multiple VMs (like Azure Files) requires fronting the storage with a NAS (network attached storage) protocol like SMB, which can severely limit performance. File systems are fast, flexible, and legacy compatible, but they are expensive, useful only to applications running in the cloud, and do not scale well.
Objects are not files. Remember that, because it is easy to forget. Objects live in a flat namespace, like one giant directory. Latency is high, sometimes hundreds or thousands of milliseconds, and throughput is low, often topping out around 150 megabits per second unless clever tricks are used. Much about accessing objects comes down to clever tricks like multipart upload, byte range access, and key name optimization. Objects can be read by many cloud-native and web-based applications at once, from both within and outside the cloud, but traditional applications require performance crippling workarounds. Most interfaces for accessing object storage make objects look like files: key names are filtered by prefix to look like folders, custom metadata is attached to objects to appear like file metadata, and some systems like FUSE cache objects on a VM file system to allow access by traditional applications. But such workarounds are brittle and sap performance. Cloud storage is cheap, scalable, and cloud native, but it is also slow and difficult to access.
Databases have their own complex structure, and they are accessed by query languages such as SQL. Traditional databases may be backed by file storage, but they require a live database process to serve queries. This can be lifted into the cloud by copying the database files and applications onto a VM, or by migrating the data into a cloud-hosted database service. But copying a database file into object storage is only useful as an offline backup. Databases scale well as part of a cloud-hosted service, but it is critical to ensure that the applications and processes that depend on the database are fully compatible and cloud-native. Database storage is highly specialized and application-specific.
Balancing the apparent cost savings of object storage against the functionality of files and databases requires careful consideration of exactly what functionality is required. For example, if you want to store and distribute many thousands of small files, archive them into a ZIP file and store that as a single object instead of trying to store each individual file as a separate object. Incorrect storage choices can lead to complex dependencies that are difficult and expensive to change later.
Cloud migration bottleneck #2: Data preparation
Moving data to the cloud is not as simple as copying bytes into the designated storage type. A lot of preparation needs to happen before anything is copied, and that time requires careful budgeting. Proof-of-concept projects often ignore this step, which can lead to costly overruns later.
Filtering out unnecessary data can save a lot of time and storage costs. For example, a data set may contain backups, earlier versions, or scratch files that do not need to be part of the cloud workflow. Perhaps the most important part of filtering is prioritizing which data needs to be moved first. Data that is being actively used will not tolerate being out of sync by the weeks, months, or years it takes to complete the entire migration process. The key here is to come up with an automated means of selecting which data is to be sent and when, then keep careful records of everything that is and is not done.
Different cloud workflows may require the data to be in a different format or organization than on-premises applications. For example, a legal workflow might require translating thousands of small Word or PDF documents and packing them in ZIP files, a media workflow might involve transcoding and metadata packing, and a bioinformatics workflow might require picking and staging terabytes of genomics data. Such reformatting can be an intensely manual and time-consuming process. It may require a lot of experimentation, a lot of temporary storage, and a lot of exception handling. Sometimes it is tempting to defer any reformatting to the cloud environment, but remember that this does not solve the problem, it just shifts it to an environment where every resource you use has a price.
Part of the storage and formatting questions may involve decisions about compression and archiving. For example, it makes sense to ZIP millions of small text files before sending them to the cloud, but not a handful of multi-gigabyte media files. Archiving and compressing data makes it easier to transfer and store the data, but consider the time and storage space it takes to pack and unpack those archives at either end.
Cloud migration bottleneck #3: Information validation
Integrity checking is the single most important step, and also the easiest to get wrong. Often it is assumed that corruption will occur during the data transport, whether that is by physical media or network transfer, and can be caught by performing checksums before and after. Checksums are a vital part of the process, but it is actually the preparation and importing of the data where you are most likely to suffer loss or corruption.
When data is shifting formats and applications, meaning and functionality can be lost even when the bytes are the same. A simple incompatibility between software versions can render petabytes of “correct” data useless. Coming up with a scalable process to verify that your data is both correct and useable can be a daunting task. At worst, it may devolve into a labor-intensive and imprecise manual process of “it looks okay to me.” But even that is better than no validation at all. The most important thing is to ensure that you will be able to recognize problems before the legacy systems are decommissioned!
Cloud migration bottleneck #4: Transfer marshaling
When lifting a single system to the cloud, it is relatively easy to just copy the prepared data onto physical media or push it across the Internet. But this process can be difficult to scale, especially for physical media. What seems “simple” in a proof-of-concept can balloon to “nightmare” when many and varied systems come into play.
A media device, such as an AWS Snowball, must be connected to each machine. That could mean physically walking the device around one or more data centers, juggling connectors, updating drivers, and installing software. Connecting over the local network saves the physical movement, but software setup can still be challenging and copy speed may drop to well below what could be achieved with a direct Internet upload. Transferring the data directly from each machine over the Internet saves many steps, especially if the data is cloud-ready.
If data preparation involves copying, exporting, reformatting, or archiving, local storage can become a bottleneck. It may be necessary to set up dedicated storage to stage the prepared data. This has the advantage of allowing many systems to perform preparation in parallel, and reduces the contact points for shippable media and data transfer software to just one system.
Cloud migration bottleneck #5: Data transfer
When comparing network transfer to media shipment, it is easy to focus on just the shipping time. For example, an 80 terabyte AWS Snowball device might be sent by next-day courier, achieving an apparent data rate of more than eight gigabits per second. But this ignores the time it takes to acquire the device, configure and load it, prepare it for return, and allow the cloud vendor to copy the data off on the back-end. Customers of ours who do this regularly report that four-week turnaround times (from device ordering to data available in the cloud) are common. That brings the actual data transfer rate of shipping the device down to just 300 megabits per second, much less if the device is not completely filled.
Network transfer speeds likewise depend on a number of factors, foremost being the local uplink. You can’t send data faster than the physical bit rate, though careful data preparation can reduce the amount of data you need to send. Legacy protocols, including those that cloud vendors use by default for object storage, have difficulty with speed and reliability across long-distance Internet paths, which can make achieving that bit rate difficult. I could write many articles about the challenges involved here, but this is one you do not have to solve yourself. Data Expedition is one of a few companies that specialize in ensuring that the path is fully utilized regardless of how far away your data is from its cloud destination. For example, one gigabit Internet connection with acceleration software like CloudDat yields 900 megabits per second, three times the net throughput of an AWS Snowball.
The biggest difference between physical shipment and network transfer is also one of the most commonly overlooked during proof-of-concept. With physical shipment, the first byte you load onto the device must wait until the last byte is loaded before you can ship. This means that if it takes weeks to load the device, then some of your data will be weeks out of date by the time it arrives in the cloud. Even when data sets reach the petabyte levels where physical shipment may be faster over all, the ability to keep priority data current during the migration process may still favor network transfer for key assets. Careful planning during the filtering and prioritization phase of data preparation is essential, and may allow for a hybrid approach.
Getting the data into a cloud provider may not be the end of the data transfer step. If it needs to be replicated to multiple regions or providers, plan carefully how it will get there. Upload over the Internet is free, while AWS, for example, charges up to two cents per gigabyte for interregional data transfer and nine cents per gigabyte for transfer to other cloud vendors. Both methods will face bandwidth limitations that could benefit from transport acceleration such as CloudDat.
Cloud migration bottleneck #6: Cloud scaling
Once data arrives at its destination in the cloud, the migration process is only half finished. Checksums come first: Make sure that the bytes that arrived match those that were sent. This can be trickier than you may realize. File storage uses layers of caches that can hide corruption of data that was just uploaded. Such corruption is rare, but until you’ve cleared all of the caches and re-read the files, you can’t be sure of any checksums. Rebooting the instance or unmounting the storage does a tolerable job of clearing caches.
Validating object storage checksums requires that each object be read out into an instance for calculation. Contrary to popular belief, object “E-tags” are not useful as checksums. Objects uploaded using multipart techniques in particular can only be validated by reading them back out.