Samuel Clemens, aka Mark Twain, was a celebrated author and humorist, but did you know that he was also a pioneer in Big Data? As a teenager, young Samuel worked as a typesetter during an apprenticeship with a printer. After earning success with The Adventures of Tom Sawyer, he started to invest in a new way of setting type for a printing press. It was called the Paige Compositor, and it promised to revolutionize the way information would be printed by setting and printing type one line at a time, rather than one character at a time. This would dramatically increase data velocity for the times. Mark Twain became an investor in 1877, but the invention never took off. After 17 years, only 2 production machines had been built, and Mark Twain was forced to file for bankruptcy. He had failed to follow one of the cardinal rules of wealth management: diversification.
Just as it is not prudent to keep all of your financial eggs in one basket, it is not wise to keep all of your data resources on a single server. Big Data doesn't mean Big Server; rather, it requires that every layer of an application be distributed across multiple nodes. This includes file systems, message queues, processing, and data stores. Distributing your computing resources diversifies the risk of system failure and allows easier scalability by adding more nodes, rather than by purchasing increasingly bigger singleton servers.
How many of you have been involved in a data warehouse project where an ETL server extracts from a bunch of relational databases, transforms the data, and loads it into an even bigger relational database (an ODS or DW), which is then queried by some business intelligence tool? The BI layer can scale by adding more servers. The more robust ETL tools can scale by adding more servers and partitioning the data.
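For the sake of illustration, here is a minimal, hypothetical sketch of that classic single-server ETL pattern in Python; the connection strings, table names, and the pandas/SQLAlchemy usage are my own assumptions, not any particular vendor's tooling.

```python
# A deliberately centralized ETL job: one process extracts from a source
# RDBMS, transforms in local memory, and loads into the warehouse RDBMS.
# Every byte funnels through this one server and the two databases around it.
import pandas as pd
from sqlalchemy import create_engine

source = create_engine("postgresql://user:pass@source-db/orders")      # hypothetical
warehouse = create_engine("postgresql://user:pass@dw-host/warehouse")  # hypothetical

# Extract: pull yesterday's orders from the operational system.
orders = pd.read_sql(
    "SELECT order_id, customer_id, amount FROM orders "
    "WHERE order_date = CURRENT_DATE - 1",
    source,
)

# Transform: aggregate to the grain the BI tool expects.
daily_sales = (orders.groupby("customer_id", as_index=False)["amount"]
                     .sum()
                     .rename(columns={"amount": "daily_total"}))

# Load: append into the warehouse fact table the BI tool queries.
daily_sales.to_sql("fact_daily_sales", warehouse, if_exists="append", index=False)
```

Notice that no matter how many BI or ETL servers you add, the read on the source database and the write on the warehouse still happen in one place.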
The one component that can't be easily distributed across multiple servers is the RDBMS. Just like a financial portfolio, it is risky to keep most of your assets in one place, and a distributed computer system is only as diversified as its weakest link. When the database needs to scale, this usually means buying a more powerful (and more expensive) machine. In my experience, more often than not, the database was where the performance bottleneck turned out to be, too. A typical complaint from the business would start with "Informatica is slow." After delving into the details, however, the ETL server was usually either waiting on the source database to read from, or waiting on the target database to write to. I have been fortunate in my career to work with some extremely talented DBAs who could really make Oracle purr. At some point, though, the data gets too big to manage easily within a standard RDBMS.
About 10 years ago, massively parallel processing (MPP) databases and data warehouse appliances started to gain attention because of their fast performance and lower cost compared to Teradata. Data was automatically replicated, and disks or nodes could easily be swapped out in the event of a hardware failure. If your data pushed the capacity of the appliance, it could be extended, or a bigger model purchased. I partnered with Netezza 7 years ago and was very impressed by the performance of their technology at the time. I bumped into a Netezza user at a conference recently who shared my appreciation for the product, but then added that license costs had doubled after the company was acquired by IBM. Being acquired was not unique to Netezza. Here's a brief timeline of MPP M&A:
2008
- DATAllegro is acquired by Microsoft and rebranded SQL Server Parallel Data Warehouse.
2009
- Oracle releases its first version of Exadata in conjunction with HP.
2010
- Netezza is acquired by IBM.
- Greenplum is acquired by EMC and is now part of their Pivotal big data initiative.
- Sybase (maker of Sybase IQ) is acquired by SAP.
2011
- Aster Data Systems is acquired by Teradata.
- HP retires its homegrown Neoview line.
- Oracle announces that Exadata will use Sun-based hardware (Sun was acquired in 2010).
- Vertica is acquired by HP.
2013
- ParAccel is acquired by Actian.
Some of these use proprietary hardware, while others can run on a cluster of commodity servers. In 2012, Amazon chose ParAccel to be the underpinnings of its Redshift MPP service, which handles all of the administration behind the scenes and is priced at roughly $1,000 per terabyte per year. There is a fair amount of overlap between Hadoop and MPP databases, and a common pattern is emerging with Hadoop handling the batch ETL and the MPP database acting as a cache for user queries. Every vendor is scrambling to tell their own story as to how their MPP database lives in harmony with Hadoop. I haven't heard of a new entrant into the MPP appliance space in years. However, Hadoop appliances are now on the scene from many of the same vendors that acquired the MPP upstarts, and even from a few high performance computing manufacturers such as Cray and SGI. This recent article from InformationWeek does a good job of describing the landscape.
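To make the Hadoop-plus-MPP pattern a bit more concrete, here is a hypothetical sketch in Python: assume a Hadoop or Spark batch job has already written aggregated, gzipped CSV files to S3, and Redshift then ingests them with a COPY command so interactive BI queries hit the MPP cache instead of the raw data. The cluster endpoint, bucket, table, and IAM role names are all made up.

```python
# Load batch-produced aggregates from S3 into Redshift so interactive
# queries hit the MPP database rather than the Hadoop cluster.
import psycopg2

conn = psycopg2.connect(
    host="example-cluster.abc123.us-east-1.redshift.amazonaws.com",  # hypothetical
    port=5439,
    dbname="analytics",
    user="loader",
    password="********",
)

copy_sql = """
    COPY fact_daily_sales
    FROM 's3://example-bucket/aggregates/daily_sales/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
    FORMAT AS CSV GZIP;
"""

with conn, conn.cursor() as cur:
    cur.execute(copy_sql)  # Redshift pulls the files in parallel across its slices
```

That is the division of labor in a nutshell: Hadoop does the heavy batch lifting, and the MPP database serves the user queries.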
The trend in both Big Data and Cloud Computing is to spend money on lots of cheaper commodity machines rather than on proprietary hardware. These cheaper commodity machines don't need to be requisitioned and purchased with upfront costs if they are part of a public cloud. If you want to do a proof of concept, just purchase compute and storage on demand. HDFS is more than capable of scaling to petabytes of data, and there is increased focus on running SQL-like queries against it. The Apache Spark project seeks to leverage the resources of a commodity cluster by distributing resilient datasets in memory. Before mapping out a space in your data center for that shiny new piece of expensive hardware, you might want to see how far you can get on commodity boxes. There will always be a place for specialty hardware in special situations, but is your Big Data use case really that special compared to the common use cases?
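As a small taste of what that looks like in practice, here is a minimal PySpark sketch: it loads a (hypothetical) file from HDFS into a resilient distributed dataset, caches it in cluster memory, and runs a couple of aggregations against the cached copy. The HDFS path, field layout, and application name are assumptions for illustration.

```python
# Cache a dataset in memory across a commodity cluster and query it repeatedly.
from pyspark import SparkContext

sc = SparkContext(appName="commodity-cluster-sketch")

# Each worker parses the blocks of the file it holds locally.
events = sc.textFile("hdfs:///data/events/*.csv") \
           .map(lambda line: line.split(","))

# cache() keeps the parsed partitions in memory across the cluster,
# so repeated queries avoid re-reading from HDFS.
events.cache()

total = events.count()

# Count events by type, assuming the second column holds the event type.
by_type = events.map(lambda fields: (fields[1], 1)) \
                .reduceByKey(lambda a, b: a + b) \
                .collect()

print(total, by_type)
sc.stop()
```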
Mark Twain was not ill-informed about the trends in publishing. He had been in and around the industry since he was a teenager, and the Paige Compositor was headed in the right direction by setting type one line at a time. He just happened to bet on the wrong hardware. The Linotype machine had begun taking orders 3 years prior to the Paige Compositor and had locked up the market with its simpler design and exceptional durability. After 17 years of investment, it cannot be said that Mark Twain was not a patient man. I think we can all agree that he would have made a better blogger than venture capitalist. He is quoted as having learned two lessons from the experience: "not to invest when you can't afford to, and not to invest when you can."