Three V’s are Enough!

Any man who has been fortunate enough to find that special someone he wants to share the rest of his life with has been through the experience of picking out an engagement ring, and has therefore become familiar with the 4 C’s: the Cut, Color, Clarity, and Carat of a diamond.  The Gemological Institute of America (GIA) created this standard in the mid-20th century to describe the quality of diamonds in a consistent way.  It’s a useful, easy-to-remember shorthand for the aspects that drive a diamond’s value: you can tweak each dimension up or down its scale and immediately see the impact on your bank account.

Just as diamonds have a snappy way to describe them, so too does Big Data.  Instead of the 4 C’s, we have the 3 V’s: Volume, Velocity, and Variety.  You can trace the standardization of this definition back to a paper by Doug Laney for the META Group in 2001 titled “3D Data Management: Controlling Data Volume, Velocity, and Variety.”  This paper was written at a time when the seeds for Web 2.0 were being sown, and Doug was mainly referring to the challenges of managing data at web scale.  In addition to DBMS data, this included web logs to track clickstream data.

The definitions of the 3 V’s might seem a bit obvious, but I will briefly explain them in this blog post.  The Volume dimension is the one most readily associated with Big Data, just as rising temperature is the best-known aspect of Global Warming.  Although this is the V that gets the most attention because of its “Bigness”, it is probably the least interesting of the dimensions.  The volume of data to manage has been increasing at an exponential rate, but so too has our ability to process and store it, and at an ever lower cost.  Disk storage cost roughly $10 per gigabyte when Doug wrote his paper; today it is closer to 10 cents per gigabyte.  I have more (flash) storage on my phone today than the total disk storage I had on my computer back in 2001.

Whoever coined the term Big Data could just as easily have chosen the term Fast Data to describe this new paradigm.  Had that been the case, the Velocity dimension would have garnered the same attention that Volume does.  Because data is now generated at such a fast rate and in such great volume, it is neither acceptable nor feasible to fit its processing into some batch window at the end of the working day.  As a data warehouse architect, I would always encounter the requirement for real-time reporting from the business.  There are various strategies to enable this, but all of them required adding complexity and cost on top of the batch-focused ETL infrastructure, including staffing up the support team so that failures could be handled in real time.  Sometimes the business went ahead with it, but in many cases they decided the higher costs weren’t justified.  If you architect a system for real-time data processing from the beginning, you can augment it with batch processing for large workloads.  Conversely, if you start with a batch processing system, it is much more difficult to augment or transform it into a real-time one.

Variety is the spice of life, and data is no exception.  The world of Big Data does not constrain itself to structured data residing in relational databases; rather, there is no limit to the types of data that can be ingested and analyzed, including documents, photos, audio, and video recordings.  Part of the allure of HDFS is that any type of data can be stored, accessed, and analyzed.  It frees data from the constraints and structure of a database and shifts the responsibility for dealing with schemas from the point of writing the data to the point of reading it.
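
To make that schema-on-read idea concrete, here is a minimal sketch (the file name and field names are invented, and no particular Big Data tool is implied): records are written exactly as they arrive, and a schema is imposed only when someone reads them for a specific question.

    import json

    # Write: append raw events to a plain file exactly as they arrive -- no schema is enforced.
    raw_events = [
        '{"user": "al", "action": "click", "page": "/home"}',
        '{"user": "bo", "action": "play", "video_id": 42, "seconds": 17}',
    ]
    with open("events.log", "w") as f:
        f.write("\n".join(raw_events) + "\n")

    # Read: the "schema" (which fields this reader cares about) is applied only at read time.
    with open("events.log") as f:
        for line in f:
            event = json.loads(line)
            print(event.get("user"), event.get("action"))  # missing fields simply come back as None

Contrast that with a relational table, where every column must be declared before the first row can be inserted.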

If you’re thinking of experimenting with Big Data and are curious which type of initiative may best lend itself to it, consider which of the 3 V’s pose the biggest challenge to fitting your initiative into a traditional data warehouse built on a DBMS.  When you need a killer app that can’t be built easily or cheaply the old way, you may have a good candidate for Big Data.

Most people would agree that these 3 V’s are a reasonable and succinct way to characterize this new paradigm of data.  However, if you follow any Big Data web forums, you’ll find all manner of suggestions for expanding the 3 V’s with other V’s such as Variability, Verification, Viability, Vision, Validity, and even Virility.  I was at a conference where an architect put up a slide with 18 different V words!  Leave it to architects and engineers to take an easy-to-comprehend concept and pedantically pick it apart until it completely loses its intended purpose.  My personal favorite is when engineers try to inject some algorithms into the marketing.  I found this gem on a web forum:

    (Volume + Variety + Velocity + Veracity) * Visualization = Value

Einstein is generally regarded as a pretty smart guy, and he managed to distill his theory of relativity down to E = mc².  Why should data be more complicated than the theory describing our entire universe?  Let’s all just agree that Doug’s original 3 dimensions of Volume, Velocity, and Variety are sufficient to characterize what sets Big Data apart from our traditional notions of data.  If we can make one of the most important purchasing decisions of our lives based upon 4 C’s, then surely we can get by with just 3 V’s to communicate what makes data the most challenging to work with.  When the next major shift in technology presents itself, we can always come up with a new snappy way to communicate.  After all, diamonds may be forever, but technical paradigms are not.

4 comments

  1. Frederic Esnault (@fesnault) · November 7, 2014

    I must disagree that 3 V’s are enough. Most Big Data projects or infrastructures are meant to deal with one, two, or even three of these V’s, and they do it well, but they miss one very important aspect: if the data to process is not “clean”, then no (meaningful) value can be extracted from it. So I would say that, of course, 18 V’s are too many, but we can’t go under 4: Volume, Variety, Velocity, and Veracity. In my opinion, the last V (Veracity) is probably the most important of all, as the others are irrelevant if the data is not clean enough.

    • Pohl Position · November 8, 2014

      Of all the potential V’s to add to the 3, the top 2 are most certainly Veracity and Value. I know that IBM intentionally includes Veracity in its slide decks. I intentionally left these two out because they aren’t what sets Big Data apart from regular data. It is a prerequisite when starting a data project to know what Value you are driving at. Likewise, you must have accurate data to work with; if you do not address the quality of the data, user acceptance will erode. That was true when building data warehouses 15 years ago, and it is still true today with Big Data. The 3 V’s are game changers compared to what was traditionally thought acceptable when building a data warehouse 15 or even 5 years ago.

  2. mack · November 6, 2014

    Back in the day, the “three V’s” used to be called Volume, Entropy and Churn, but they’re still the right measures.

    There is a fourth, though, that's very much worth considering: Orthogonality (or Independence, depending on your point of view). Orthogonality is a measure of the degree of interdependency among the pieces of data. A fully orthogonal (or fully independent) dataset is one in which all the elements can change freely and at independent rates without affecting the consistency and correctness of the data. A highly interdependent dataset would be one with many complex interrelationships that strongly affect how and when data can change while maintaining correctness.

    In crude terms, you can size the storage of a system by thinking of Volume(time) = Entropy * Churn * time (* some optional compression constant). You can size the compute resources as k(Orthogonality) * Churn.
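
    A rough sketch of that heuristic in code, with invented numbers and units, and with k() assumed to be a simple linear factor:

        # Back-of-the-envelope sizing based on the formulas above.
        # Every constant here is made up purely for illustration.

        def storage_gb(entropy_gb_per_event, churn_events_per_day, days, compression=1.0):
            # Volume(time) = Entropy * Churn * time (* optional compression constant)
            return entropy_gb_per_event * churn_events_per_day * days * compression

        def compute_units(orthogonality, churn_events_per_day, k=0.001):
            # Compute = k(Orthogonality) * Churn, with k() assumed to be linear
            return k * orthogonality * churn_events_per_day

        print(storage_gb(1e-6, 5_000_000, 365, compression=0.5))  # ~912 GB of storage per year
        print(compute_units(0.8, 5_000_000))                      # ~4,000 compute units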

    • Pohl Position · November 8, 2014

      Orthogonality could be viewed as a reflection of the data modeling technique. In 3NF data modeling, the data is highly interrelated so that no element is represented more than once.

      Recent NoSQL data stores keep each record as a separate document containing all the elements relevant to that document’s root. If you imagine an insurance policy as a single document, it would contain the primary insured’s address. If the same insured had more than one policy, then his or her address would be duplicated across all the policy documents, whereas in 3NF the same address would be modeled only once. It is an implementation detail that reflects the trade-offs one is willing to make when building a system.
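
      As a tiny illustration (field names and values invented), here is how the same insured’s address looks under the two approaches, and where the duplication bites when it changes:

          # Document model: each policy is self-contained, so shared data repeats.
          policies_as_documents = [
              {"policy_id": "P-1", "insured": {"name": "Ann", "address": "12 Oak St"}},
              {"policy_id": "P-2", "insured": {"name": "Ann", "address": "12 Oak St"}},
          ]

          # 3NF-style model: the address lives in one place and is referenced by key.
          insureds = {"I-1": {"name": "Ann", "address": "12 Oak St"}}
          policies = [
              {"policy_id": "P-1", "insured_id": "I-1"},
              {"policy_id": "P-2", "insured_id": "I-1"},
          ]

          # A change of address is one write in 3NF, but one write per document otherwise.
          insureds["I-1"]["address"] = "99 Elm Ave"
          for doc in policies_as_documents:
              doc["insured"]["address"] = "99 Elm Ave"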

      When storage was expensive, 3NF was the standard. In the era of Big Data and cheap storage, more emphasis is being placed on partitioning data so that it can be better distributed. Because of this, the pendulum is swinging away from consistency.

      Orthogonality is characteristic of how every part of a computer system now needs to be distributed. It’s an interesting aspect of Big Data, and worthy of an entire blog post. Stay tuned…
