Three V’s are Enough!
Any man who has been fortunate enough to find that special someone with whom he wants to share the rest of his life has been through the experience of picking out an engagement ring, and has therefore become familiar with the 4 C’s. I’m talking about the Cut, Color, Clarity, and Carat of a diamond. This standard was created by the Gemological Institute of America (GIA) in the mid-20th century to standardize how the quality of diamonds is described. It’s a useful way to capture the aspects that affect a diamond’s value in an easy-to-remember list. You can tweak each dimension up or down its scale and immediately see the impact on your bank account.
Just as diamonds have a snappy way to describe them, so too does Big Data. Instead of the 4 C’s, we have the 3 V’s: Volume, Velocity, and Variety. You can trace the standardization of this definition back to a paper by Doug Laney for the META Group in 2001 titled “3D Data Management: Controlling Data Volume, Velocity, and Variety.” This paper was written at a time when the seeds for Web 2.0 were being sown, and Doug was mainly referring to the challenges of managing data at web scale. In addition to DBMS data, this included web logs to track clickstream data.
The definitions of the 3 V’s might seem a bit obvious, but I will briefly explain each of them in this blog post. The Volume dimension is the one most readily associated with Big Data, just as temperature change is the best-known aspect of Global Warming. Although this is the V that might get the most attention because of its “Bigness”, it is probably the least interesting of the dimensions. The volume of data to manage has been increasing at an exponential rate, but so too has our ability to process and store it, and at an ever-lower cost. Disk storage cost roughly $10 per gigabyte when Doug wrote his paper; today it is closer to $0.10 per gigabyte, a hundredfold drop. I have more (flash) storage on my phone today than the total disk storage I had on my computer back in 2001.
Whoever coined the term Big Data could just as easily have chosen the term Fast Data to describe this new paradigm. Had that been the case, the Velocity dimension would have garnered the same attention that Volume does. Because data is now being generated at such a fast rate and in such great volume, it is neither acceptable nor feasible to squeeze its processing into some batch window at the end of the working day. As a data warehouse architect, I would constantly encounter the business requirement for real-time reporting. There are various strategies to enable this, but all of them required adding a layer of complexity and cost on top of the batch-focused ETL infrastructure, including staffing up the support team to respond in real time when failures occurred. Sometimes the business went ahead with it, but in many cases they decided the higher costs weren’t justified. If you architect a system for real-time data processing from the beginning, you can augment it with batch processing for large workloads. Conversely, if you start with a batch processing system, it is much more difficult to augment or transform it into a real-time one.
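To make that last point concrete, here is a minimal, hypothetical Python sketch (the ClickCounter class and its event fields are made up for illustration, not any real product or API). A processor written around individual events can serve both a real-time feed and a batch file, because a batch is just a stream of events replayed quickly; code written around a complete end-of-day file cannot make the reverse trip so easily.

```python
from collections import Counter

# Hypothetical sketch: a click counter written around single events.
class ClickCounter:
    def __init__(self):
        self.counts = Counter()

    def on_event(self, event):
        # Real-time path: called once per event as it arrives.
        self.counts[event["page"]] += 1

    def process_batch(self, events):
        # Batch path: an end-of-day file is just many events replayed
        # through the same per-event code.
        for event in events:
            self.on_event(event)

counter = ClickCounter()
counter.on_event({"page": "/home"})                               # streaming record
counter.process_batch([{"page": "/home"}, {"page": "/pricing"}])  # batch file
print(counter.counts)  # Counter({'/home': 2, '/pricing': 1})
```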
Variety is the spice of life, and data is no exception. The world of Big Data does not constrain itself to structured data residing in relational databases. Rather, there is no limit to the type of data that can be ingested and analyzed: documents, photos, audio, and video recordings. Part of the allure of HDFS is that any type of data can be stored, accessed, and analyzed. It frees data from the constraints and structure of a database, and pushes the responsibility of dealing with schemas from the point where data is written to the point where it is read (often called schema-on-read).
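As a minimal sketch of what schema-on-read looks like in practice (the file name and event fields below are invented for illustration; a real system would land files like this in HDFS), the writer dumps whatever records it has, and structure is applied only when the data is read for analysis:

```python
import json

# Write side: land heterogeneous events as raw lines, with no schema enforced.
events = [
    {"type": "click", "page": "/home", "ts": "2001-02-06T12:00:00Z"},
    {"type": "photo_upload", "size_kb": 2048, "user": "alice"},
    {"type": "click", "page": "/pricing", "ts": "2001-02-06T12:01:30Z"},
]
with open("events.jsonl", "w") as f:
    for e in events:
        f.write(json.dumps(e) + "\n")

# Read side: the "schema" (which fields matter and how to interpret them)
# is imposed here, at analysis time -- schema-on-read.
with open("events.jsonl") as f:
    records = [json.loads(line) for line in f]
clicked_pages = [r["page"] for r in records if r.get("type") == "click"]
print(clicked_pages)  # ['/home', '/pricing']
```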
If you’re thinking of experimenting with Big Data and are curious what type of initiative might best lend itself to it, think about which of the 3 V’s pose the biggest challenge to fitting your initiative into the traditional data warehouse model built on a DBMS. When you need a killer app that can’t be built easily or cheaply the old way, you may have a good candidate for Big Data.
Most people would agree that these 3 V’s are a reasonable and succinct way to characterize this new paradigm of data. However, if you follow any Big Data web forums, you’ll find all manner of suggestions for how the 3 V’s should be expanded to include other V’s such as Variability, Verification, Viability, Vision, Validity, and even Virility. I was at a conference where an architect put up a slide with 18 different V words! Leave it to architects and engineers to take an easy-to-comprehend concept and pedantically pick it apart until it completely loses its intended purpose. My personal favorite is when engineers try to inject some math into the marketing. I found this gem on a web forum:
(Volume + Variety + Velocity + Veracity) * Visualization = Value
Einstein is generally regarded as a pretty smart guy, and he managed to distill his theory of relativity down to E = mc². Why should data be more complicated than the theory supporting our entire universe? Let’s all just agree that Doug’s original 3 dimensions of Volume, Velocity, and Variety are sufficient to characterize what sets Big Data apart from our traditional notions of data. If we can make one of the most important purchasing decisions of our lives based upon 4 C’s, then surely we can get by with just 3 V’s to communicate the dimensions that make data the most challenging to work with. When the next major shift in technology presents itself, we can always come up with a new snappy way to communicate it. After all, diamonds may be forever, but technical paradigms are not.