“Big data” is high-volume, -velocity and -variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.
Source – Gartner Research
Industry analysts widely describe the 3 V's (Volume, Velocity and Variety) as the definitional trifecta. Let's add a fourth V to it, Veracity – which pertains to the signal-to-noise ratio and the concomitant problem of unclean data.
The infographic below from IBM's research group does a tremendous job of describing each of the V's in one succinct picture.
Source – http://www.ibmbigdatahub.com/infographic/four-vs-big-data
Volume refers to rapidly expanding data volumes. We now create as much information every two days as mankind created from the dawn of time until 2003. In the past, only humans created all this data; now machines do as well (see IoT).
Thus, Volume = tremendous scale (with a diverse variety of data).
One can see many industry examples in the graphic. Prior to the emergence of projects in the Apache Hadoop ecosystem, a lot of this data was getting trashed, as organizations did not know how to accommodate it with conventional RDBMS/EDW techniques.
Existing approaches simply cannot scale to handle all this data. A modern data platform built on Hadoop changes all that, as we will see in subsequent posts.
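To make that concrete, here is a minimal sketch of landing a raw feed as-is and querying it later – no upfront schema design of the kind an RDBMS/EDW load would demand. It assumes a PySpark environment; the HDFS path and the field name are hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("raw-landing").getOrCreate()

# Land the raw feed as-is: no upfront schema design, unlike an RDBMS/EDW load.
# The HDFS path and the "event_type" field are hypothetical.
raw = spark.read.json("hdfs:///data/raw/clickstream/")

# The schema is inferred on read, so data that used to be trashed stays queryable.
raw.groupBy("event_type").count().show()
```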
Velocity refers to the speed at which these feeds – spanning a multitude of business sources, from sensors in cars, power generation & distribution, and RFIDs in manufacturing to stock ticks in financial services and social media feeds – are moving into corporate applications, which then need to cleanse, normalize and sort the actionable information from the haystack. The other important aspect of velocity is that all this new data is not just batch-oriented but also streaming, real-time & near-real-time.
This velocity cuts both ways: ingress velocity (the speed at which these feeds can be sent into your architecture) and egress velocity (the speed at which actionable results need to be gleaned from this data).
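Here is a minimal sketch of both sides using Spark Structured Streaming – the Kafka read is the ingress, the continuously updated windowed count is the egress. It assumes Spark with its Kafka connector; the broker address and topic name are hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import window, col

spark = SparkSession.builder.appName("velocity-sketch").getOrCreate()

# Ingress: a continuous feed (e.g., stock ticks) arriving over Kafka.
# The broker address and topic name are hypothetical.
ticks = (spark.readStream
         .format("kafka")
         .option("kafka.bootstrap.servers", "broker:9092")
         .option("subscribe", "stock-ticks")
         .load())

# Egress: actionable results gleaned continuously; here, a count of events
# per 10-second window, keyed on the timestamp the Kafka source provides.
counts = ticks.groupBy(window(col("timestamp"), "10 seconds")).count()

# Stream the running counts out (to the console for this sketch).
query = counts.writeStream.outputMode("update").format("console").start()
query.awaitTermination()
```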
Variety refers to the emergence of semi-structured and unstructured data. Data in the past used to be processed & stored in formats compatible with spreadsheets and relational databases. Now you get all sorts of data – photos, tweets, XML, emails and so on. This puts tremendous strain on techniques for collecting, processing and storing data; existing approaches are simply unable to accommodate it at scale.
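As a quick illustration (again a PySpark sketch; the paths and field names are hypothetical), a modern engine can ingest nested, semi-structured JSON right alongside flat, spreadsheet-style CSV:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("variety-sketch").getOrCreate()

# Semi-structured: nested JSON (e.g., tweets), with the schema inferred at
# read time. The path and field names are hypothetical.
tweets = spark.read.json("hdfs:///data/raw/tweets/")
tweets.select("user.screen_name", "text").show(5)

# Structured: the same engine reads spreadsheet-style CSV side by side.
trades = spark.read.option("header", "true").csv("hdfs:///data/raw/trades.csv")
trades.show(5)
```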
The fourth V, Veracity, refers to the data governance process for all this data: how the development and management of all these data pipelines is to be handled, with a focus on metadata management, data lineage, data cleanliness and so on. The business impetus is the need to reason over data that is consistent and correct.
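A minimal sketch of what a veracity check in such a pipeline might look like (assuming PySpark; the validity rules and paths are hypothetical) – clean records flow downstream while suspect ones are quarantined for review:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("veracity-sketch").getOrCreate()

# The path and field names below are hypothetical.
records = spark.read.json("hdfs:///data/raw/customer-events/")

# A simple cleanliness rule: required fields present and amounts non-negative.
is_valid = (col("customer_id").isNotNull()
            & col("amount").isNotNull()
            & (col("amount") >= 0))

clean = records.filter(is_valid)
dirty = records.filter(~is_valid)

# Clean data feeds downstream analytics; suspect records are quarantined for
# review, so the pipeline stays auditable.
clean.write.mode("overwrite").parquet("hdfs:///data/clean/customer-events/")
dirty.write.mode("overwrite").json("hdfs:///data/quarantine/customer-events/")
```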
According to this model, the challenges of big data management result from the expansion of all four properties, rather than from volume alone – the sheer amount of data to be managed.
One question I get very frequently is whether existing technology like Complex Event Processing (CEP) can help handle the velocity aspect of Big Data, since processing big data volumes seems awfully close to what many practitioners have encountered with CEP before.
Let’s talk about this in the next post before addressing the other V’s one by one.