Deconstructed Data
If you find yourself with a hunger for Greek food in San Francisco, you may want to visit Mezes Kitchen and order a deconstructed gyro. Rather than arriving as a whole, the dish's individual components are separated out on the plate to be combined at your whim.
A little over ten years ago the same gastronomical experiment was taking place within Amazon’s engineering department. The theory was that if you could distill complex applications down to certain “primitives”, then you could more easily compose them into any type of application you want. These primitives were to be simple, reliable, scalable, and available on demand. Elastic Compute Cloud (EC2) and Simple Storage Service (S3) were the first such products to be released in what would come to be known as Amazon Web Services (AWS).
These primitives serve as the basic building blocks of compute and storage that many other products rely on. One of the main advantages of splitting compute from storage is that you can separate the cost of processing your data from the cost of storing it. With S3, you may pay less than $0.03 per gigabyte per month for storage. The data is automatically replicated several times across multiple data centers and comes with a durability guarantee of 99.999999999%.
If some of your data is old and infrequently accessed, but you are not ready to purge it, you can move it to AWS Glacier for a lower cost of $0.007 per gigabyte per month. There is no storage requisition process to go through, no limit to the amount of storage you can use, and no people to hire to replace hard drives when they go bad.
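To give a sense of how little ceremony is involved, here is a minimal sketch using boto3, the AWS SDK for Python. The bucket name, object key, and ten-year cutoff are placeholders, not recommendations:

```python
import boto3

s3 = boto3.client("s3")

# Upload a file; S3 replicates it across facilities automatically.
# "my-analytics-bucket" and the key layout are hypothetical.
s3.upload_file("events-2017-06.parquet", "my-analytics-bucket",
               "raw/events/2017/06/events.parquet")

# A lifecycle rule that transitions objects under the "raw/" prefix
# to Glacier once they are roughly ten years old (~3650 days).
s3.put_bucket_lifecycle_configuration(
    Bucket="my-analytics-bucket",
    LifecycleConfiguration={
        "Rules": [{
            "ID": "archive-old-raw-data",
            "Filter": {"Prefix": "raw/"},
            "Status": "Enabled",
            "Transitions": [{"Days": 3650, "StorageClass": "GLACIER"}],
        }]
    },
)
```

Once a rule like this is in place, the archiving happens on its own; there is no batch job to babysit and no tape library to manage.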
As for EC2, there are a number of instance types to choose from which offer various configurations of CPUs, RAM, storage, and even GPUs. Some instance types can be rented for as low as a few cents per hour. Within minutes, you can have a machine available, deployed with the image of your choice for a variety of operating systems. At any point in time, Amazon has excess capacity in their data centers because not all of their servers are being utilized. To encourage use, they auction off this excess capacity. You can bid on instances and pay a “Spot” price which can be much lower than the list price of an on-demand instance. It’s not uncommon to pay 10% of the on-demand price or less for the same machine. The only caveat is that your Spot Instance can be taken away from you at any point in time if the Spot price rises above your bid. To mitigate this, you can compose clusters with a mixture of on-demand and Spot Instances to drive down your overall compute cost.
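If you want to try your hand at the Spot market, a request looks roughly like the sketch below, again with boto3. The AMI ID, key pair, instance type, and bid price are purely illustrative:

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Bid on spare capacity: request one m4.xlarge at a maximum price
# of $0.06 per hour. If the Spot price rises above the bid, the
# instance can be reclaimed at any time.
response = ec2.request_spot_instances(
    SpotPrice="0.06",
    InstanceCount=1,
    LaunchSpecification={
        "ImageId": "ami-0123456789abcdef0",   # placeholder AMI
        "InstanceType": "m4.xlarge",
        "KeyName": "my-keypair",              # placeholder key pair
    },
)

for req in response["SpotInstanceRequests"]:
    print(req["SpotInstanceRequestId"], req["State"])
```

Cluster managers can issue requests like this for the interruptible portion of a cluster while keeping a core of on-demand instances for work that must not be preempted.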
This on-demand infrastructure allows you to conduct more experiments than you might otherwise have if you were confined to a set number of machines, or if you needed to make a business case for purchasing a large number of them. Analytics can now be performed ad hoc on rented infrastructure to see if the data yields value before going through an expensive process to fit it into an established data warehouse.
It’s not just infrastructure that has been decomposed into primitives. Over the past ten years or so, the various components that comprise a traditional database have been split apart to be recomposed at whim. The engine of a database has retained its familiar SQL interface, but its query planning and execution have been replaced by open source projects such as Apache Spark. The data storage layer has been replaced by object stores, like S3. This allows the data layer to expand at the same rate that data volume increases, and the compute layer to elastically expand and contract based on the workload being performed. You can spin up a separate cluster of compute for each department and shut them down after the employees leave work for the day. Your ETL jobs can spin up their own clusters and then shut down immediately after the jobs complete. This means ETL can run at the same time as your users’ queries, because they aren’t competing for the same resources.
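As a rough sketch of what such an ephemeral job might look like with Spark SQL (the bucket paths, table, and columns below are made up for illustration):

```python
from pyspark.sql import SparkSession

# An ephemeral ETL job: the cluster behind this session can be created
# just for the job and torn down as soon as it finishes.
spark = (SparkSession.builder
         .appName("nightly-etl")
         .getOrCreate())

# Compute reads directly from the object store; nothing is "loaded"
# into the engine ahead of time.
orders = spark.read.parquet("s3a://my-analytics-bucket/warehouse/orders/")
orders.createOrReplaceTempView("orders")

daily_totals = spark.sql("""
    SELECT order_date, SUM(amount) AS total_amount
    FROM orders
    GROUP BY order_date
""")

# Write the results back to S3 and let the cluster shut down.
daily_totals.write.mode("overwrite").parquet(
    "s3a://my-analytics-bucket/warehouse/daily_totals/")
```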
Even the format that data is saved in has gravitated to open source formats that can be read by many different query engines. This includes column-optimized formats like Apache Parquet and ORC. Large datasets can now be joined in memory across a cluster of machines, with each node adding CPU and memory to the total capacity.
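A distributed join over Parquet files in S3 might look something like the following sketch; the datasets, paths, and join key are hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("orders-enrichment").getOrCreate()

# Two large Parquet datasets, each spread across many files in S3.
customers = spark.read.parquet("s3a://my-analytics-bucket/warehouse/customers/")
orders = spark.read.parquet("s3a://my-analytics-bucket/warehouse/orders/")

# The join is executed in memory across the cluster; every node
# contributes CPU and memory to the shuffle.
orders_enriched = orders.join(customers, on="customer_id", how="left")

orders_enriched.write.mode("overwrite").parquet(
    "s3a://my-analytics-bucket/warehouse/orders_enriched/")
```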
Compare this to a traditional data warehouse built on an RDBMS with 10 years of data. There needs to be enough storage attached to the server to accommodate all 10 years of data plus extra capacity, which is traditionally planned 6 months to a year in advance. It’s not uncommon for this storage to be part of an expensive corporate SAN that is shared across various OLAP and OLTP databases. Anything older than 10 years may need to be archived off to CSV files somewhere, with fingers crossed that the online schema doesn’t change too much before the data ever needs to be restored.
The RDBMS server itself is probably a single large, expensive machine, with an equally sized (and expensive) failover server sitting idle in case the master should fail. It needs to handle multiple queries from different types of users during the day. Rarely will those queries access all 10 years of data, but rather only the last year or two. There may be ETL jobs that run within the database at night, but never during the day, because the utilization of the server’s resources would disrupt the queries of business users. If the data were ever to be migrated from that RDBMS to another data store, you could not simply point a new query engine at it. You would need to export the data out of the database into some common (but sub-optimal) format like CSV files before importing it into the new data store.
In an elastic cloud architecture, there is no limit to the amount of data you can store in S3. Store 10, 20, 30 years of data! If you don’t frequently access anything past 10 years, you can move that older data to AWS Glacier at a lower cost. Storing the data in the Parquet format provides a compressed columnar representation and schema evolution over time. Using a query engine like Spark SQL allows you to scale out the computation across as many ephemeral clusters as needed, big or small. Because the data is stored in an open format, you can change query engines if desired without needing to reload data. There is already a shift underway in migrating older workloads from Apache Hive to Apache Spark.
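For instance, Spark can reconcile Parquet files whose schemas have drifted over the years at read time; the path below is a placeholder:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("schema-evolution").getOrCreate()

# Parquet files written in different years may carry slightly different
# columns; mergeSchema reconciles them into one unified schema at read
# time, so older files remain queryable alongside newer ones.
events = (spark.read
          .option("mergeSchema", "true")
          .parquet("s3a://my-analytics-bucket/raw/events/"))

events.printSchema()
```

Because those same Parquet files sit in an open format on S3, a different engine could be pointed at them later without an export-and-reload cycle.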
Amazon is not the only game in town for cloud computing, though it is the biggest and most mature. Microsoft, Google, IBM, and even Oracle offer their own primitives for storage and compute in their respective clouds.
Whether you have an appetite for gyros or data, you may want to order it deconstructed. The ability to combine tools at whim from cloud providers and open source software allows you to be more agile with your data solutions. As you do so, remember to heed the words of another Greek institution: “The whole is greater than the sum of its parts.”