Deconstructed Data

If you find yourself with a hunger for Greek food in San Francisco, you may want to visit Mezes Kitchen and order a deconstructed gyro.  Rather than presenting the dish as a whole, its individual components are separated out on the plate to be combined at your whim.

A little over ten years ago, the same gastronomical experiment was taking place within Amazon’s engineering department.  The theory was that if you could distill complex applications down to certain “primitives”, then you could more easily compose them into any type of application you wanted.  These primitives were to be simple, reliable, scalable, and available on demand.  Elastic Compute Cloud (EC2) and Simple Storage Service (S3) were the first such products to be released in what would come to be known as Amazon Web Services (AWS).

These primitives serve as the basic building blocks of Compute and Storage that many other products rely on.  One of the main advantages of splitting compute from storage is that you can more effectively separate the cost of processing your data from the cost of storing it.  With S3, you may pay less than $0.03 per gigabyte per month for storing your data.  The data is automatically replicated several times across multiple data centers and carries a durability guarantee of 99.999999999% (eleven nines).

If some of your data is old and not frequently accessed, but you are not ready to purge it, you can move it to AWS Glacier at a lower cost of $0.007 per gigabyte per month.  There is no storage requisition process to go through, no limit to the amount of storage you can use, and no one to hire to replace hard drives when they go bad.
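As a rough sketch of how that transition can be automated, the boto3 snippet below attaches a lifecycle rule that moves objects older than a year to Glacier.  The bucket name and prefix are placeholders, and the exact rule would depend on your own access patterns.

import boto3  # AWS SDK for Python

s3 = boto3.client("s3")

# Hypothetical bucket and prefix; substitute your own.
s3.put_bucket_lifecycle_configuration(
    Bucket="my-analytics-archive",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-old-data",
                "Filter": {"Prefix": "events/"},
                "Status": "Enabled",
                # After 365 days, transition objects to Glacier storage.
                "Transitions": [{"Days": 365, "StorageClass": "GLACIER"}],
            }
        ]
    },
)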

As for EC2, there are a number of instance types to choose from, offering various configurations of CPU, RAM, storage, and even GPUs.  Some instance types can be rented for as low as a few cents per hour.  Within minutes, you can have a machine available, deployed with the image of your choice from a variety of operating systems.  At any point in time, Amazon has excess capacity in its data centers because not all of its servers are being utilized.  To encourage use, Amazon auctions off this excess capacity.  You can bid on instances and pay a “Spot” price, which can be much lower than the list price of an on-demand instance.  It’s not uncommon to pay 10% of the on-demand price or less for the same machine.  The only caveat is that your Spot Instance can be taken away from you at any time if the Spot price rises above your bid.  To mitigate this, you can compose clusters with a mixture of on-demand and Spot instances to drive down your overall compute cost.
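To give a feel for how little ceremony is involved, here is a minimal boto3 sketch of placing a Spot bid; the AMI ID, instance type, and bid price are placeholders you would tailor to your own workload.

import boto3

ec2 = boto3.client("ec2")

# Bid for a single Spot Instance; the image ID and price are placeholders.
response = ec2.request_spot_instances(
    SpotPrice="0.05",  # the most you are willing to pay per hour
    InstanceCount=1,
    LaunchSpecification={
        "ImageId": "ami-12345678",   # hypothetical machine image
        "InstanceType": "m4.large",
    },
)

print(response["SpotInstanceRequests"][0]["SpotInstanceRequestId"])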

This on-demand infrastructure allows you to conduct more experiments than you otherwise might if you were confined to a set number of machines, or if you needed to make a business case for purchasing a large number of them.  Analytics can now be performed ad hoc on data with rented infrastructure to see if it yields value before going through the expensive process of fitting it into an established data warehouse.

It’s not just infrastructure that has been decomposed into primitives.  Over the past ten years or so, the various components that comprise a traditional database have been split apart to be recomposed at whim.  The engine of a database has retained its familiar SQL interface, but its query planning and execution have been replaced by open source projects such as Apache Spark.  The data storage layer has been replaced by object stores, like S3.  This allows the data layer to expand at the same rate that data volume increases, and the compute layer to elastically expand and contract based on the workload being performed.  You can spin up a separate cluster of compute for each department and shut it down after the employees leave work for the day.  Your ETL jobs can spin up their own clusters and then shut down immediately after the jobs complete.  Because ETL and user queries no longer compete for the same resources, they can run at the same time.
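As a minimal sketch of that separation, assuming a hypothetical bucket of Parquet files on S3 and a cluster configured with the s3a connector, an ephemeral Spark SQL job might look something like this:

from pyspark.sql import SparkSession

# Ephemeral compute: this session lives only as long as the job does.
spark = SparkSession.builder.appName("nightly-etl").getOrCreate()

# Storage lives independently in S3; the bucket and paths are placeholders.
orders = spark.read.parquet("s3a://my-data-lake/orders/")
orders.createOrReplaceTempView("orders")

# The familiar SQL interface, executed by Spark's engine over object storage.
daily_totals = spark.sql("""
    SELECT order_date, SUM(amount) AS total_sales
    FROM orders
    GROUP BY order_date
""")

daily_totals.write.parquet("s3a://my-data-lake/daily_totals/")
spark.stop()  # the cluster can be torn down; the data remains in S3

When the job finishes, the cluster disappears, but the results sit in S3 waiting for whichever cluster spins up next.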

Even the format that data is saved in has gravitated toward open standards that many different query engines can read, including column-optimized formats like Apache Parquet and ORC.  Large datasets can now be joined together in memory across a cluster of computers, with each node adding CPU and memory to the total capacity.
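To make that concrete, a distributed join in Spark is a one-liner; the dataset paths and join key below are hypothetical, and the shuffle work is spread across whatever nodes the cluster happens to have.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("distributed-join").getOrCreate()

# Two large Parquet datasets; the paths and join key are placeholders.
orders = spark.read.parquet("s3a://my-data-lake/orders/")
customers = spark.read.parquet("s3a://my-data-lake/customers/")

# The join is partitioned across the cluster, so every node contributes
# CPU and memory rather than one server doing all of the work.
enriched = orders.join(customers, on="customer_id", how="inner")

enriched.write.parquet("s3a://my-data-lake/orders_enriched/")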

Compare this to a traditional data warehouse built on an RDBMS with 10 years of data.  There needs to be enough storage attached to the server to accommodate all 10 years of data plus extra capacity, which is traditionally planned 6 months to a year in advance.  It’s not uncommon for this storage to be part of an expensive corporate SAN that is shared across various OLAP and OLTP databases.  Anything older than 10 years may need to be archived off to CSV files somewhere, with fingers crossed that the online schema doesn’t change too much before the data ever needs to be restored.

The RDBMS server itself is probably a single large, expensive machine, with an equally sized (and expensive) failover server sitting idle in case the master should fail.  It needs to handle multiple queries from different types of users during the day.  Rarely will those queries access all 10 years of data, but rather only the last year or two.  There may be ETL jobs that run within the database at night, but never during the day, because the load on the server’s resources might disrupt the queries of business users.  If the data were ever migrated from that RDBMS to another data store, you could not simply point a new query engine at it.  You would need to export the data out of the database into some common (but sub-optimal) format like CSV files before importing it into the new data store.

In an elastic cloud architecture, there is no limit to the amount of data you can store in S3.  Store 10, 20, 30 years of data!  If you don’t frequently access anything past 10 years, you can move that older data to AWS Glacier at a lower cost.  Storing the data in the Parquet format provides a compressed columnar representation and schema evolution over time.  Using a query engine like Spark SQL allows you to scale out the computation across as many ephemeral clusters as needed, big or small.  Because the data is stored in an open format, you can change query engines if desired without needing to reload the data.  There is already a shift underway in migrating older workloads from Apache Hive to Apache Spark.
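Schema evolution, in particular, is handled at read time.  As a rough sketch, assuming Parquet files written over the years with slightly different columns under a placeholder path, Spark can merge the schemas when the data is loaded:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("schema-evolution").getOrCreate()

# Files written in 2010 may lack columns that were added in 2015;
# mergeSchema reconciles them into one logical table at read time.
history = (
    spark.read
    .option("mergeSchema", "true")
    .parquet("s3a://my-data-lake/events/")  # placeholder path
)

history.printSchema()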

Amazon is not the only game in town for cloud computing, though it is the biggest and most mature.  Microsoft, Google, IBM, and even Oracle itself have their own primitives for storage and compute in their respective cloud offerings.

Whether you have an appetite for gyros or data, you may want to order it deconstructed.  The ability to combine tools at whim from cloud providers and open source software allows you to be more agile with your data solutions.  As you do so, remember to heed the words of another Greek institution: “The whole is greater than the sum of its parts.”

The Adventures of Mark Twain and Proprietary Hardware

Samuel Clemens, aka Mark Twain, was a celebrated author and humorist, but did you know that he was also a pioneer in Big Data?  As a teenager, young Samuel worked an apprenticeship with a printer as a typesetter.  After earning success with The Adventures of Tom Sawyer, he started to invest in a new way of setting type for a printing press.  It was called the Paige Compositor, and it promised to revolutionize the way information would be printed by setting and printing type one line at a time, rather than a character at a time.  This would dramatically increase data velocity for the times.  Mark Twain became an investor in 1877, but the invention never took off.  After 17 years, only 2 production machines had been built, and Mark Twain was forced to file for bankruptcy.  He had failed to follow one of the cardinal rules of wealth management: diversification.

Just as it is not prudent to keep all of your financial eggs in one basket, it is not recommended to keep all of your data resources in a single server.  Big Data doesn’t mean Big Server; rather, it necessitates that every layer of an application be distributed across multiple nodes.  This includes file systems, message queues, processing, and data stores.  Distributing your computing resources diversifies the risk of system failure and allows easier scalability by adding more nodes, rather than by purchasing increasingly bigger singleton servers.

How many of you have been involved in a data warehouse project where there is an ETL server extracting from a bunch of relational databases, transforming the data, and loading it into an even bigger relational database (ODS/DW), which is then queried by some business intelligence tool?  The BI layer can scale by adding more servers.  The more robust ETL tools can scale by adding more servers and partitioning the data.

The one component that can’t be easily distributed across multiple servers is the RDBMS.  Just like a financial portfolio, it is risky to keep most of your assets in one place, and a distributed computer system is only as diversified as its weakest link.  When the database needs to scale, this usually means buying a more powerful (and more expensive) machine.  In my past experience, more often than not, the database was also where the performance bottleneck would be.  A typical complaint from the business would start with “Informatica is slow.”  After delving into the details, however, I would find that the ETL server was either waiting on the source database to read from, or waiting on the target database to write to.  I have been fortunate in my career to work with some extremely talented DBAs who could really make Oracle purr.  At some point, though, the data starts to get too big to manage easily within a standard RDBMS.

About 10 years ago, massively parallel processing (MPP) databases and data warehouse appliances were starting to gain attention because of their fast performance and lower cost compared to Teradata.  Data was automatically replicated, and disks or nodes could easily be swapped out if there was a hardware failure.  If your data pushed the capacity of the appliance, it could be extended, or a bigger model purchased.  I partnered with Netezza 7 years ago and was very impressed by the performance of their technology at the time.  I bumped into a Netezza user at a conference recently who shared my appreciation for the product, but then added that license costs had doubled after the company was acquired by IBM.  Being acquired was not unique to Netezza.  Here’s a brief timeline of MPP M&A:

2008

  • DATAllegro is acquired by Microsoft and rebranded SQL Server Parallel Data Warehouse.

2009

  • Oracle releases its first version of Exadata in conjunction with HP.

2010

  • Netezza is acquired by IBM.
  • Greenplum is acquired by EMC and is now part of its Pivotal big data initiative.
  • Sybase IQ is acquired by SAP.

2011

  • Aster Data Systems is acquired by Teradata.
  • HP retires its homegrown Neoview line.
  • Oracle announces that Exadata will use Sun-based hardware (Sun was acquired in 2010).
  • Vertica is acquired by HP.

2013

  • ParAccel is acquired by Actian.

Some of these use proprietary hardware, while others can be run over a cluster of commodity servers.  In 2012, Amazon chose ParAccel to be the underpinnings of its Redshift MPP service, which manages all the administration behind the scenes and provides the service at $1,000 per terabyte.  There is a bit of overlap between Hadoop and MPP databases, and a common pattern is emerging, with Hadoop handling the batch ETL and the MPP database acting as a cache for user queries.  Every vendor is scrambling to tell its own story as to how its MPP database lives in harmony with Hadoop.  I haven’t heard of a new entrant into the MPP appliance space in years.  However, Hadoop appliances are now on the scene from many of the same vendors that acquired the MPP upstarts, and even from a few high-performance computing manufacturers such as Cray and SGI.  This recent article from InformationWeek does a good job of describing the landscape.

The trend in both Big Data and Cloud Computing is to spend money on lots of cheaper commodity machines rather than on proprietary hardware.  These cheaper commodity machines don’t need to be requisitioned and purchased with upfront costs if they are part of a public cloud.  If you want to do a proof of concept, just purchase compute and storage on demand.  HDFS is more than capable of scaling to petabytes of data, and there is increased focus on running SQL-like queries against it.  The Apache Spark project leverages the resources of a commodity cluster to distribute resilient datasets in memory, as in the sketch below.  Before mapping out a space in your data center for that shiny new piece of expensive hardware, you might want to see how far you can get on commodity boxes.  There will always be a place for specialty hardware in special situations, but is your Big Data use case really that special compared to the common ones?
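As a rough sketch of that idea, assuming a set of log files already sitting on a hypothetical HDFS path, Spark can load them once, pin them in cluster memory, and make repeated passes without touching disk again.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("commodity-cluster-demo").getOrCreate()
sc = spark.sparkContext

# Load a dataset from HDFS (the path is a placeholder) and keep it in memory,
# spread across the commodity nodes of the cluster.
logs = sc.textFile("hdfs:///data/web_logs/*.log").cache()

# Repeated passes over the data hit memory instead of disk.
errors = logs.filter(lambda line: "ERROR" in line).count()
warnings = logs.filter(lambda line: "WARN" in line).count()

print(errors, warnings)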

Mark Twain was not ill-informed about the trends in publishing.  He had been in and around the industry since he was a teenager, and the Paige Compositor was headed in the right direction by setting type one line at a time.  He just happened to bet on the wrong hardware.  The Linotype machine began taking orders 3 years prior to the Paige Compositor and had locked up the market with its simpler design and exceptional durability.  After 17 years of investment, it cannot be said that Mark Twain was not a patient man.  I think we can all agree that he would have made a better blogger than venture capitalist.  He is quoted as having learned two lessons from the experience: “not to invest when you can’t afford to, and not to invest when you can.”