Deconstructed Data

If you find yourself with a hunger for Greek food in San Francisco, you may want to visit Mezes Kitchen and order a deconstructed gyro.  Rather than arriving as an assembled whole, the dish’s individual components are separated out on the plate to be combined at your whim.

A little over ten years ago the same gastronomical experiment was taking place within Amazon’s engineering department.  The theory was that if you could distill complex applications down to certain “primitives”, then you could more easily compose them into any type of application you want.  These primitives were to be simple, reliable, scalable, and available on demand.  Elastic Compute Cloud (EC2) and Simple Storage Service (S3) were the first such products to be released in what would come to be known as Amazon Web Services (AWS).

These primitives serve as the basic building blocks of Compute and Storage that many other products rely on.  One of the main advantages of splitting Compute from Storage is that you can separate the cost of processing your data from the cost of storing it.  When you pay for S3, you may pay less than $0.03 per gigabyte per month to store your data.  The data is automatically replicated several times across multiple data centers and comes with a durability guarantee of 99.999999999%.

If some of your data is old and not frequently accessed, but you are not ready to purge it, you can move it to AWS Glacier at a lower cost of $0.007 per gigabyte per month.  There is no storage requisition process to go through, no limit to the amount of storage you can use, and no one to hire to replace hard drives when they go bad.
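
If you want a feel for how little ceremony is involved, here is a minimal sketch using boto3, the AWS SDK for Python.  The bucket name, key, and lifecycle rule are hypothetical, and the one-year transition threshold is just an example.

```python
# A rough sketch with boto3; the bucket, key, and rule names are hypothetical.
import boto3

s3 = boto3.client("s3")

# Store an object; you pay only for the bytes stored, with no capacity
# to provision up front.
s3.put_object(
    Bucket="my-analytics-bucket",
    Key="events/2015/01/events.csv",
    Body=b"user_id,amount\n42,19.99\n",
)

# A lifecycle rule transitions objects under the prefix to Glacier after
# a year, lowering the per-gigabyte cost automatically.
s3.put_bucket_lifecycle_configuration(
    Bucket="my-analytics-bucket",
    LifecycleConfiguration={
        "Rules": [{
            "ID": "archive-old-events",
            "Filter": {"Prefix": "events/"},
            "Status": "Enabled",
            "Transitions": [{"Days": 365, "StorageClass": "GLACIER"}],
        }]
    },
)
```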

As for EC2, there are a number of instance types to choose from, offering various configurations of CPU, RAM, storage, and even GPUs.  Some instance types can be rented for as low as a few cents per hour.  Within minutes, you can have a machine available, deployed with the image of your choice from a variety of operating systems.  At any point in time, Amazon has excess capacity in their data centers because not all of their servers are being utilized.  To encourage use, they auction off this excess capacity.  You can bid on instances and pay a “Spot” price, which can be much lower than the list price of an on-demand instance.  It’s not uncommon to pay 10% of the on-demand price or less for the same machine.  The only caveat is that your Spot instance can be taken away from you at any point if the Spot price rises above your bid.  To mitigate this, you can compose clusters from a mixture of on-demand and Spot instances to drive down your overall compute cost.
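
Requesting Spot capacity is likewise just an API call.  Below is a hedged sketch using boto3; the AMI ID, instance type, and bid price are placeholders, not recommendations.

```python
# A hedged sketch of a Spot request with boto3; the AMI ID, instance
# type, and bid price are placeholders.
import boto3

ec2 = boto3.client("ec2")

response = ec2.request_spot_instances(
    SpotPrice="0.05",               # maximum price per instance-hour, in USD
    InstanceCount=2,
    Type="one-time",
    LaunchSpecification={
        "ImageId": "ami-12345678",  # hypothetical machine image
        "InstanceType": "m3.large",
    },
)

# If the Spot price later rises above the bid, these instances can be
# reclaimed, which is why they are best mixed with on-demand nodes.
for req in response["SpotInstanceRequests"]:
    print(req["SpotInstanceRequestId"], req["State"])
```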

This on-demand infrastructure allows you to conduct more experiments than you might otherwise have if you were confined to a set number of machines, or if you needed to make a business case for purchasing a large number of them.  Analytics can now be performed ad hoc on rented infrastructure to see whether data yields value before going through an expensive process to fit it into an established data warehouse.

It’s not just infrastructure that has been decomposed into primitives.  Over the past ten years or so, the various components that comprise a traditional database have been split apart to be recomposed at whim.  The engine of a database has retained its familiar SQL interface, but its query planning and execution have been replaced by open source projects such as Apache Spark.  The data storage layer has been replaced by object stores like S3.  This allows the storage layer to expand at the same rate that data volume increases, and the compute layer to elastically expand and contract based on the workload being performed.  You can spin up a separate cluster of compute for each department and shut them down after the employees leave work for the day.  Your ETL jobs can spin up their own clusters and then shut down immediately after the jobs complete.  As a result, ETL can run at the same time as your users’ queries because they aren’t competing for the same resources.
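
To make the “bring your own cluster” idea concrete, here is a minimal sketch of an ephemeral ETL job written against the Spark 1.x Python API.  The bucket, paths, and column names are invented.

```python
# A minimal sketch of an ephemeral ETL job; bucket, paths, and columns
# are invented.
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

conf = SparkConf().setAppName("nightly-etl")
sc = SparkContext(conf=conf)        # the cluster exists only for this job
sqlContext = SQLContext(sc)

orders = sqlContext.parquetFile("s3n://my-analytics-bucket/orders/")
orders.registerTempTable("orders")

daily_revenue = sqlContext.sql("""
    SELECT order_date, SUM(amount) AS revenue
    FROM orders
    GROUP BY order_date
""")

daily_revenue.saveAsParquetFile("s3n://my-analytics-bucket/daily_revenue/")

sc.stop()                           # release the compute as soon as we're done
```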

Even the formats that data is saved in have gravitated toward open source standards that can be read by many different query engines, including column-oriented formats like Apache Parquet and ORC.  Large datasets can now be joined together in memory across a cluster of computers, with each node adding CPU and memory to the total capacity.
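
As an illustration, the sketch below converts raw text into Parquet and then joins it with a second dataset, with the work spread across the memory and CPU of the whole cluster.  Paths and field names are invented.

```python
# A sketch that lands raw text as Parquet and then joins it with a second
# dataset; paths and field names are invented.
from pyspark import SparkContext
from pyspark.sql import SQLContext, Row

sc = SparkContext(appName="parquet-join")
sqlContext = SQLContext(sc)

# Parse comma-separated text into typed rows and save as Parquet, an open
# columnar format any Parquet-aware engine can read back later.
lines = sc.textFile("s3n://my-analytics-bucket/raw/events.csv")
events = sqlContext.inferSchema(
    lines.map(lambda l: l.split(","))
         .map(lambda f: Row(user_id=f[0], event_type=f[1], amount=float(f[2])))
)
events.saveAsParquetFile("s3n://my-analytics-bucket/events_parquet/")

# Each node holds a slice of both datasets in memory and contributes its
# CPU and RAM to the join.
events.registerTempTable("events")
users = sqlContext.parquetFile("s3n://my-analytics-bucket/users/")
users.registerTempTable("users")

revenue_by_country = sqlContext.sql("""
    SELECT u.country, SUM(e.amount) AS revenue
    FROM events e JOIN users u ON e.user_id = u.user_id
    GROUP BY u.country
""")
print(revenue_by_country.collect())
```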

Compare this to a traditional data warehouse built on an RDBMS with 10 years of data.  There needs to be enough storage attached to the server to accommodate all 10 years of data plus extra capacity, which is traditionally planned 6 months to a year in advance.  It’s not uncommon for this storage to be part of an expensive corporate SAN that is shared across various OLAP and OLTP databases.  Anything older than 10 years may need to be archived off to CSV files somewhere, with fingers crossed that the online schema doesn’t change too much before the data ever needs to be restored.

The RDBMS server itself is probably a single large, expensive machine, with an equally sized (and equally expensive) failover server sitting idle in case the master should fail.  It needs to handle multiple queries from different types of users during the day.  Rarely will those queries access all 10 years of data; more often only the last year or two.  There may be ETL jobs that run within the database at night, but never during the day, because the load on the server’s resources may disrupt the queries of business users.  If the data were ever migrated from that RDBMS to another data store, you could not simply point a new query engine at it.  You would need to export the data out of the database into some common (but sub-optimal) format like CSV before importing it into the new data store.

In an elastic cloud architecture, there is no limit to the amount of data you can store in S3.  Store 10, 20, 30 years of data!  If you don’t frequently access anything past 10 years, you can move that older data to AWS Glacier at a lower cost.  Storing the data in the Parquet format provides a compressed, columnar representation and allows the schema to evolve over time.  Using a query engine like Spark SQL allows you to scale out the computation across as many ephemeral clusters as needed, big or small.  Because the data is stored in an open format, you can change query engines if desired without needing to reload data.  There is already a shift underway in migrating older workloads from Apache Hive to Apache Spark.
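
And because those Parquet files are just files in S3, a different engine, or simply a newer cluster, can be pointed at them in place.  The sketch below assumes a Spark build with Hive support; the table layout and location are invented.

```python
# A sketch of pointing a different engine at the same files; assumes a
# Spark build with Hive support, and the table layout is invented.
from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext(appName="point-and-query")
hive = HiveContext(sc)

# The Parquet files written earlier become a table without any reload:
# the DDL only records where the files live and what they look like.
hive.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS events (
        user_id    STRING,
        event_type STRING,
        amount     DOUBLE
    )
    STORED AS PARQUET
    LOCATION 's3n://my-analytics-bucket/events_parquet/'
""")

print(hive.sql("SELECT event_type, COUNT(*) FROM events GROUP BY event_type").collect())
```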

Amazon is not the only game in town for cloud computing, though they are the biggest and most mature.  Microsoft, Google, IBM, and even Oracle have their own primitives for storage and compute in their respective cloud offerings.

Whether you have an appetite for gyros or data, you may want to order it deconstructed.  The ability to combine tools at whim from cloud providers and open source software allows you to be more agile with your data solutions.  As you do so, remember to heed the words of another Greek institution: “The whole is greater than the sum of its parts.”

Databricks Lights a Spark Underneath Your SaaS

On January 13, Databricks hosted a meetup in their brand new San Francisco headquarters.  On the agenda was what to expect from their roadmap in 2015.  You can find the entire video here and the slide deck here.

For those who are unfamiliar, Databricks is the company behind the open source software Apache Spark, which is becoming an extremely popular engine for Big Data applications.  You may have heard Spark described as the next generation of MapReduce, but it is not tied to Hadoop.  Some might say that Spark is in the fast lane to become the killer app for Big Data.  Both the number of patches merged and the number of contributors tripled from 2013 to 2014.

At the core of Spark is a concept called Resilient Distributed Datasets (RDDs).  Without RDDs, there is no Spark.  An RDD is an immutable, partitioned collection of elements that can be distributed across a cluster in a manner that is fault tolerant, scales linearly, and is mostly* in-memory.  An element can be anything that is serializable.  Working with collections in a language like Java is convenient when the number of elements is in the hundreds or thousands.  However, when that number jumps to millions or billions, you can very quickly run into capacity issues on a single machine.  The beauty of Spark is that this collection can be spread out over an entire cluster of machines (in memory) without the developer needing to think too much about it.  Each RDD is immutable and remembers how it was created.  If there is a failure somewhere in the cluster, the lost partitions are automatically re-created elsewhere in the cluster.  I prefer to think of it as akin to building ETL with CREATE TABLE AS statements, except it is not limited to the resources of a single database server, nor is it limited to just SQL statements.  If you are curious about the underpinnings of RDDs, there is no better resource than the 2012 white paper written by their Berkeley creators.

*If the dataset does not fit into memory, Spark can either recompute the partitions that don’t fit into RAM each time they are requested or spill them to disk (support added in 2014).
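
To make the RDD concept concrete, here is a tiny PySpark sketch.  The data is a toy example; the same code runs unchanged when the collection is spread across a cluster of machines.

```python
# A toy sketch of the RDD idea; on a real cluster the same code runs
# unchanged over billions of elements spread across many machines.
from pyspark import SparkContext

sc = SparkContext(appName="rdd-basics")

numbers = sc.parallelize(range(1, 1000001), 8)   # a partitioned collection

# Transformations are lazy; they only record lineage ("how was I created?").
evens   = numbers.filter(lambda n: n % 2 == 0)
squares = evens.map(lambda n: n * n)

squares.cache()   # keep the partitions in memory across the cluster

# Actions trigger the distributed work.  If a node dies, the lost
# partitions are rebuilt elsewhere from the recorded lineage.
print(squares.count())
print(squares.take(5))
```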

The big takeaway from this meetup is that Databricks is doubling down on Schema RDDs to help stabilize their APIs so that they can encourage a strong ecosystem of projects outside of Spark.  A Spark Packages index has already been launched so that developers can create and submit their own packages, much like the ecosystems that already exist for Python and R.  In particular, Spark needs to play catch-up to more established projects by adding machine learning algorithms, and the introduction of Spark Packages will accelerate this.  It was revealed that many Apache Mahout developers have already begun to turn their attention to reengineering their algorithms on Spark.  The Databricks team recognizes that it will be difficult for Spark Packages to reach critical mass without stable APIs.  Therefore, the immediate priority appears to be leveraging Schema RDDs in both internal and pluggable APIs so that they can graduate from their alpha versions.

So what exactly is a Schema RDD?  These were introduced last year as part of Spark SQL, the newest component of the Spark family.  A Schema RDD is “an RDD of Row objects that has an associated schema.”  This essentially puts structure and types around your data.  Not only does this help to better define interfaces, but it allows Spark to optimize for performance.  Now data scientists can interact with RDDs just like they do with data frames in R and Python.
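
Here is a small sketch of a Schema RDD in Python, using invented data: an ordinary RDD of Row objects gains a schema, can be inspected like a table, and can be queried with SQL.

```python
# A small sketch with invented data: an RDD of Row objects plus an
# inferred schema becomes a Schema RDD that can be queried with SQL.
from pyspark import SparkContext
from pyspark.sql import SQLContext, Row

sc = SparkContext(appName="schema-rdd")
sqlContext = SQLContext(sc)

people = sc.parallelize([
    Row(name="Alice", age=34, city="San Francisco"),
    Row(name="Bob",   age=41, city="Oakland"),
])

people_schema = sqlContext.inferSchema(people)   # RDD + schema = Schema RDD
people_schema.printSchema()
people_schema.registerTempTable("people")

print(sqlContext.sql("SELECT name, city FROM people WHERE age > 35").collect())
```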

It wasn’t clear how extensible the metadata in a Schema RDD will be, or whether it will be restricted to basics such as names and types.  This could be a very powerful unifying concept for everything Spark.  In data warehousing, it is not uncommon to build a bespoke solution with Informatica or DataStage as the ETL engine, Business Objects or Cognos as the BI tool, Embarcadero ER/Studio or a similar product as the data modeling tool, and a mixture of Oracle, DB2, and SQL Server databases.  Each of these applications has its own catalog of metadata to manage, and too much of the work involves keeping all of this metadata in sync.  Schema RDDs have the potential of using a single set of metadata from the point of sourcing data all the way through to dashboards.

Loading these Schema RDDs from any data source will be accomplished by a new Data Source API.  Data can be sourced into a Schema RDD using a plugin and then manipulated in any supported Spark language (Java, Scala, Python, SQL, R).  The Schema RDD will serve as a common interchange format so that the data can be accessed via Spark SQL, GraphX, MLlib, or Streaming, regardless of which programming language is used.  There are already plugins for Avro, Cassandra, Parquet, Hive, and JSON.  Support for partitioned data sources will be included in a release later this year so that a predicate can determine which HDFS directories should be accessed.
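
A hedged sketch of what this looks like from Python: a built-in source (JSON) is loaded directly, while an external plugin is declared through the data sources DDL.  The paths and column names are invented, and the Avro provider assumes the separately distributed spark-avro package is on the classpath.

```python
# A hedged sketch of the Data Source API from Python.  Paths and columns
# are invented; the Avro source assumes the spark-avro package is present.
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="data-sources")
sqlContext = SQLContext(sc)

# A built-in source (JSON) loads directly into a Schema RDD.
clicks = sqlContext.jsonFile("s3n://my-analytics-bucket/clicks.json")
clicks.registerTempTable("clicks")

# An external plugin is registered through the data sources DDL.
sqlContext.sql("""
    CREATE TEMPORARY TABLE impressions
    USING com.databricks.spark.avro
    OPTIONS (path "s3n://my-analytics-bucket/impressions.avro")
""")

# Once loaded, both behave the same, regardless of where the data came from.
report = sqlContext.sql("""
    SELECT c.campaign, COUNT(*) AS clicks
    FROM clicks c JOIN impressions i ON c.ad_id = i.ad_id
    GROUP BY c.campaign
""")
print(report.collect())
```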

The Spark Machine Learning Library (MLlib) will leverage Schema RDDs to ensure interoperability and a stable API.  The main focus for MLlib in 2015 will be the Pipelines API.  This API provides a language for describing the workflows that glue together all the data munging and machine learning necessary in today’s analytics.  There was even a suggestion of partial PMML support for importing and exporting models.  It became apparent that MLlib has some catching up to do when a list of 14+ candidate algorithms for 2015 was projected on the screen, along with 5 candidates for optimization primitives.  The sooner all these APIs are stabilized, the sooner the Spark community can get to work on delivering stable packages, rather than relying on additions to MLlib itself.  The Pipelines API will become the bridge between prototyping data scientists and production-deploying data engineers.
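
For a sense of what this looks like, here is a minimal sketch written against the pyspark.ml Pipelines API as it firmed up in later releases; the training data and column names are invented.

```python
# A minimal sketch of the Pipelines API, written against pyspark.ml as it
# firmed up in later releases; the training data and columns are invented.
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF
from pyspark.ml.classification import LogisticRegression

sc = SparkContext(appName="pipeline-sketch")
sqlContext = SQLContext(sc)

training = sqlContext.createDataFrame([
    ("spark is great for big data", 1.0),
    ("the cat sat on the mat",      0.0),
    ("schema rdds unify spark",     1.0),
    ("dogs chase the mailman",      0.0),
], ["text", "label"])

# Data munging and model fitting glued together as one reusable workflow.
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol="words", outputCol="features")
lr        = LogisticRegression(maxIter=10)

pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])
model = pipeline.fit(training)

test = sqlContext.createDataFrame([("spark pipelines make this simple",)], ["text"])
model.transform(test).select("text", "prediction").show()
```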

Those were my main takeaways from the prepared remarks.  There were a few interesting discoveries in the Q&A.  A general availability version of SparkR is expected sometime in the first half of this year.  There is a hold-up right now due to a licensing incompatibility.  This project is being driven by Berkeley’s AMPLab rather than Databricks.  There still seem to be more questions than answers about how R is going to fit into the Spark ecosystem, other than leveraging Schema RDDs as data frames.

It was admitted that there is “not much going on” for YARN support in the 2015 roadmap.  Databricks leverages Apache Mesos to administer their clusters, so it is logical to assume their incentive will be to promote Mesos integration as a higher priority.  There are plenty of partners in the Spark ecosystem who are more closely aligned with Hadoop, so I would not be surprised to see one or more of them assuming the mantle for this area.

All of the above is open source and freely available to be installed on your own local cluster.  Databricks is in the business of selling access to their own hosted Spark platform, which has been architected to run on Amazon Web Services.  You can leverage their SaaS solution to have an “instant on” environment for working with Big Data.  Their “Notebooks” interface allows you to write code in a browser and run it interactively.  You can scale clusters up and down on the fly, create data pipelines, and even create dashboards to be published to other users.  From a business intelligence / data analysis point of view, it is very attractive to see how easy it is to analyze data and quickly generate appealing charts.  All of this can be done without the need to install a single server in your data center, or the need to hire a team of people to support the inevitable failure of components.

Ali Ghodsi leads engineering and product management for Databricks, and the theme of his presentation at the 2014 Spark Summit paid homage to computing pioneer Alan Kay and their shared desire to “make simple things simple, and complex things possible.”  That sentiment is consistent with the roadmap previewed for 2015.  Simple operations, such as basic aggregations, will become easy to perform with Schema RDDs.  Complex things, such as engineering features as part of a data pipeline to build a gradient boosted decision tree model that is then hosted as part of a real-time data stream, will become possible.  It is still early days, but the excitement around Databricks is catching fire, and as Bruce Springsteen correctly observed, “you can’t start a fire without a spark.”