Everyone has heard of the V’s of Big Data. If not, the five covered here are Volume, Veracity, Variety, Vulnerability, and Value. Each is self-explanatory, but their implications are not. These V’s have existed since the beginning of data, so there is nothing new … it’s just that Big Data is now bringing all of them together. Here is a brief commentary on why each of them is important in the context of Big Data.
Volume – data is growing … rapidly … out of control! While this is starting to sound like a cliche, it is still one of the key realities of Big Data. Whether you try to store all of it (impossible) or model all of it (impossible), you will have to deal with the volume effectively. They don’t call it BIG data for nothing. Buying commercially “robust” and expensive storage systems is no longer an option (the storage vendors will disagree). You must now deploy a smart storage strategy where you have huge amounts of storage that is dirt-cheap. The key here is to deploy the Hadoop Distributed File System (HDFS), an open-source distributed file system that lets you leverage very affordable hardware and make it look like one giant storage device. HDFS also gives you the control layer to access the vast storage resources. Its design was inspired by Google’s own file system … enough said, right?
The key to this “V” is that you have to have a) an effective strategy to store data, which includes prioritizing which data is stored and which is simply analysed, b) storage technology (HDFS and its competitors), and c) the ability to “glean” patterns from samples. It really goes back to statistics … we use smaller samples to describe the whole. It’s no different in Big Data.
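To make the sampling point concrete, here is a minimal sketch in Python. The data set is synthetic and the function name `estimate_mean_from_sample` is made up for illustration; the only idea it shows is that a small random sample can describe the whole within a measurable margin of error.

```python
import random
import statistics


def estimate_mean_from_sample(values, sample_size, seed=0):
    """Estimate the mean of `values` from a simple random sample.

    Returns the sample mean and its standard error.
    """
    rng = random.Random(seed)
    sample = rng.sample(values, sample_size)
    mean = statistics.fmean(sample)
    # Standard error of the mean: sample standard deviation / sqrt(n).
    stderr = statistics.stdev(sample) / sample_size ** 0.5
    return mean, stderr


# Hypothetical "full" data set: synthetic order amounts (an assumption,
# standing in for data too large to process in full).
data_rng = random.Random(42)
population = [data_rng.gauss(100.0, 15.0) for _ in range(200_000)]

sample_mean, stderr = estimate_mean_from_sample(population, 10_000)
true_mean = statistics.fmean(population)

print(f"sample estimate: {sample_mean:.2f} (+/- {stderr:.2f})")
print(f"full-data mean:  {true_mean:.2f}")
```

Here only 5% of the rows are read, and the standard error tells you how far off the estimate is likely to be. Real data is rarely this well behaved, so treat the sketch as a starting point rather than a recipe.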
Veracity – this is another way of talking about data quality. This might just be the most important part of any data initiative. Surely you have heard the old adage of Garbage In, Garbage Out (GIGO). If the source of your data is dirty, no other value add (reporting, analytics, predictive, etc.) will make sense. While this is the most boring part of data, it is also the very foundation of data. At GBI our experience tells us that any data initiative must focus first on data quality before any other steps are taken. As part of the “cleanup” of data, one also has to think of Master Data Management (and stewardship) and Data Modeling (to make sure that the “meaning” of data is preserved as it is transformed) as key pillars to success on the Big Data journey.
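As a rough illustration of what a first data quality pass could look like, the Python sketch below counts rows with missing required fields and duplicated keys. The records, the field names, and the `quality_report` helper are all invented for this example; a real cleanup effort would cover far more rules.

```python
from collections import Counter


def quality_report(rows, key, required):
    """Count rows with missing required fields and duplicated keys."""
    # A field counts as missing when it is absent, None, or an empty string.
    missing = sum(
        1
        for row in rows
        if any(row.get(field) in (None, "") for field in required)
    )
    # Every occurrence of a key beyond the first counts as a duplicate.
    key_counts = Counter(row.get(key) for row in rows)
    duplicates = sum(count - 1 for count in key_counts.values() if count > 1)
    return {
        "rows": len(rows),
        "missing_required": missing,
        "duplicate_keys": duplicates,
    }


# Hypothetical customer records, made up for illustration.
customers = [
    {"id": 1, "email": "a@example.com", "country": "US"},
    {"id": 2, "email": "", "country": "DE"},
    {"id": 2, "email": "b@example.com", "country": "DE"},
    {"id": 3, "email": "c@example.com", "country": None},
]

report = quality_report(customers, key="id", required=["email", "country"])
print(report)
```

Running a report like this before any reporting or modeling work gives you a quick read on how much garbage is going in.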
Variety – When we talk about data, we never talk about just one type … we talk about pictures, sound, movies, Twitter feeds, Facebook posts, webpages, sentiments (free-form text), streams (e.g. sensor data), and the list goes on. The most mature (and reliable / robust) technology for data is database technology. Databases store data in tables (similar to Excel). So, how do you combine tabular (also known as structured) data with unstructured data (e.g. a comment on a website)? This is easier said than done. While we have an abundance of technologies that can handle each type of data, it’s the combination of the two and the extraction of meaning from that combination that is important.
A simple strategy to deal with varied data is to realize that most metadata (i.e. data about the data) is still structured, and we are really good at understanding structured data. Know that most analytics still works on structured data. Once you extract metadata from unstructured data, you have an almost complete picture. So, for example, I don’t need to keep the full text of a comment on Twitter; I can store its sentiment (approximated by AI) as a simple variable. Surprisingly, you don’t lose that much and are still able to capture the essence.
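Here is a small Python sketch of that idea: turning a free-form comment into a structured row. Everything in it is an assumption made for illustration. In particular, the keyword lists are a crude stand-in for the AI sentiment model mentioned above, and `CommentMetadata` and `extract_metadata` are hypothetical names.

```python
import re
from dataclasses import dataclass

# Toy keyword lists; a real system would use a trained sentiment model.
POSITIVE = {"love", "great", "good", "excellent", "happy"}
NEGATIVE = {"hate", "bad", "terrible", "slow", "broken"}


@dataclass
class CommentMetadata:
    word_count: int
    hashtag_count: int
    sentiment: int  # -1 negative, 0 neutral, 1 positive


def extract_metadata(text: str) -> CommentMetadata:
    """Reduce an unstructured comment to a few structured fields."""
    words = re.findall(r"[a-z']+", text.lower())
    hashtags = re.findall(r"#\w+", text)
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    # Collapse the raw score to its sign so it fits in one small column.
    sentiment = (score > 0) - (score < 0)
    return CommentMetadata(len(words), len(hashtags), sentiment)


row = extract_metadata("Love the new release, great job! #bigdata")
print(row)
```

The resulting row has three numeric columns and can sit in an ordinary database table next to the rest of your structured data, while the original text can be archived or discarded.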
Vulnerability – important data is dangerous in the wrong hands. It has always been that way. How is it any different in Big Data? It’s not. Now you just have a ton more data to secure. The only piece of advice I can offer on this is to be smart about your data. Data can be “masked” very easily, and therefore not all data needs to be treated like Fort Knox. You just have to find the “critical path” in your environment and invest accordingly. Don’t listen to vendors. Use your own judgment. You are the smartest about your data (if not, then get smart now). From all my experience in the field with so many clients, I almost always (as in 90%+ of the time) see overspend on data. Invest in intelligent people and use data as raw material. You know what I am talking about … you are smarter than me about your data anyway.
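For a sense of how simple basic masking can be, here is a minimal Python sketch. The record, the field names, the salt value, and both helper functions are invented for illustration. Note also that a truncated salted hash is pseudonymization rather than strong anonymization, so this shows the general approach and should not be taken as a security recommendation.

```python
import hashlib


def mask_account_number(value: str, visible: int = 4) -> str:
    """Replace all but the last `visible` characters with '*'."""
    hidden = max(len(value) - visible, 0)
    return "*" * hidden + value[hidden:]


def pseudonymize(value: str, salt: str) -> str:
    """Map a value to a stable, opaque key using a salted SHA-256 hash."""
    digest = hashlib.sha256((salt + value).encode("utf-8")).hexdigest()
    return digest[:16]


# Hypothetical customer record, made up for illustration.
record = {
    "name": "Jane Doe",
    "email": "jane@example.com",
    "account": "1234567890123456",
}

# The masked copy keeps what analysts need (a stable join key and the
# last four digits) and drops the name and raw email entirely.
masked = {
    "customer_key": pseudonymize(record["email"], salt="demo-salt"),
    "account": mask_account_number(record["account"]),
}
print(masked)
```

A masked copy like this can be handed to analysts, so the Fort Knox treatment is reserved for the one place where the raw values live.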
Value – The industry is so caught up in the “how” of Big Data that it has forgotten the What and the Why?! So far, this blog has talked about the various aspects of Big Data that need to be addressed to derive value. Yes, VALUE!
You have a lot of data – so what, who cares?
You have very clean data – so what, who cares?
You have master data – so what, who cares?
You have advanced BI – so what, who cares?
You care if:
… you made more money with data
… you improved your customer service with data
… you improved your marketing with data
… etc.
Needless to say, the point here is to start with the end in mind. Start with the questions you want answered that are going to create a lot of business value. Then decide how to answer those questions! Clarity is power … clarity about your business challenges is a great power … the ability to formulate effective questions to solve these challenges is an even greater power!
And … when you have your questions ready … call GBI Data Science and we will do the rest!!