How BIG is your big data?
To measure, or not to measure...
Big data, small data, everywhere there's data!
...but how BIG is "Big Data"?
Could we please STOP focusing on this ridiculous problem - the semantics of how big is "BIG"? Yes, data has exploded recently, and our world is filled with more and more data, data-capturing devices, data analysis, etc. But why do we care so much about categorizing the data into "Big" data buckets? We are often so focused on how big the data is that we don't get much further than finding new and more accurate ways to measure the size of the bits and bytes. Funnily enough, the size of any particular data agglomeration is actually meaningless except in a geeky data one-upmanship contest.
How do I, as an active data scientist and coach, define and explain "Big Data" to my clients? How do I accurately judge an analytics problem to figure out appropriate tools and architecture solutions?
Strangely enough, I don't ask if they have "Big Data" - here's how the conversation goes:
Me: Do you have access to the data?
Me: Awesome! Sometimes this is actually the hardest part of a project.
Me: What does your data look like? Where is it stored?
Client: My X (usually outdated) system or flat file
Me: Ok, I can work with that.
Me: How much data do you have?
Client: Um, oh, well, ...
Me: Is it 10 MB, a GB, a TB, etc.? How many tables, or rows and columns, approximately?
Client: Oh, it's about 100 MB. I know it's not big data...
Me: That's a great size!
Me: How many records or fields?
You may have noticed how quickly the apologizing started about data size. This happens frequently in my experience, regardless of how big or small a client (or their data) is. I spend a good amount of time at this point assuring every client that yes, I can still work toward some awesome analytics goals regardless of their data size.
Did you know that it takes only ~30 observations in a good dataset to perform certain statistical tests? For a rare disease, 30 patient instances might actually be a pretty large sample, but most modern data science applications have many more observations to work with. A 1 MB text file contains approximately 1 million characters. Say each line has nine 10-character fields and nine commas: that is about 100 characters per line, so the file contains ~10,000 observations - more than enough to do most (if not all) modern statistical tests!
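The back-of-the-envelope arithmetic above can be sketched in a few lines of Python. The function name and parameters are purely illustrative, and the sketch assumes single-byte characters and one newline character per line (an assumption the text does not state, but it is what brings each line to a round 100 characters).

```python
def estimate_rows(file_size_chars, fields_per_line=9, chars_per_field=10,
                  commas_per_line=9, newline_chars=1):
    """Roughly estimate how many rows fit in a delimited text file.

    Assumes fixed-width fields and single-byte characters; real files
    will vary, so treat the result as an order-of-magnitude guess.
    """
    chars_per_line = (fields_per_line * chars_per_field
                      + commas_per_line
                      + newline_chars)
    return file_size_chars // chars_per_line


# A ~1 MB file is roughly 1 million characters.
print(estimate_rows(1_000_000))
```

The same function gives a quick answer to "how many records?" for any file size you plug in, which is usually a more useful number than the raw byte count.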
Wait - are you telling me that 30 observations is big data?!
No, I am pointing out that it could be 'significant' data, or data that can be used to perform an analysis that has statistical significance. There are many caveats, chiefly whether a test on a very small sample has enough statistical power to suit the purpose of the analysis. What I am telling you is that it is possible for very small data sets to (potentially) provide a statistically significant analysis result.
So before you start apologizing for the 'size' of your data, think about it this way:
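To make the power caveat concrete, here is a minimal sketch using only the Python standard library. It uses a normal approximation for a two-sided one-sample test, which is my simplification rather than a method from this article; the function name `approx_power` and the effect sizes tried are illustrative assumptions, where effect size means the true shift measured in standard deviations.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(n, effect_size, alpha=0.05):
    """Approximate power of a two-sided one-sample z-test.

    Normal approximation only: for small n an exact t-based
    calculation would give somewhat lower power.
    """
    z = NormalDist()                       # standard normal
    z_crit = z.inv_cdf(1 - alpha / 2)      # two-sided critical value
    shift = effect_size * sqrt(n)          # how far the true mean moves the statistic
    # Probability of landing in either rejection region
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


for d in (0.2, 0.5, 0.8):
    print(f"n=30, effect size {d}: approximate power {approx_power(30, d):.2f}")
```

Under these assumptions, 30 observations will almost always detect a large effect and will usually miss a small one. That is why the question is whether you have "enough" data for your particular analysis, not whether your data is "big".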
- What is the analysis you want to do?
- Do you have "enough" data?
- Do you have "too much" data?
- Do you have a data "problem"?
The hype buildup around big data really leaves out a key term: 'problem'. "Do you have Big Data?" should really be "Do you have a Big Data Problem?" What's a Big Data Problem? It's ultimately any size of data you can't reasonably handle in the way you need to - which means different things to different people. I believe the term 'big data' might even have stemmed from excuses for systems that couldn't handle a change in data volume or flow.
Consider the following scenarios:
- Someone at IBM or Cisco or Bell might have a "Big Data Problem" analyzing the packets from entire countries of network operations - so much data coming in so fast that they can't process it in time to act on repairing infrastructure hardware.
- A retail store might have a "Big Data Problem" because they have 10 years of store transactions stored in Excel spreadsheets at each retail location in different formats. The data is messy, disconnected, and likely too large for their tools.
- A successful mom-and-pop bakery might have a "Big Data Problem" handling their inventory and sales forecasting using the one big spreadsheet they have built up for years, which now takes 20 minutes to open on their computer!
- A third-world doctor's group might have a "Big Data Problem" crunching imaging lab results using their cell phone - which is the most ubiquitous computer in remote areas. However, a cell phone is only a small computing device and can't really hold much data or perform intense calculations.
All of these have a common thread: data. The comparative size of the data actually doesn't matter - is the mom-and-pop bakery's data not important because it still (barely) fits into a spreadsheet? Should we discount the need to analyze lab results using cell phones in third-world countries because the data is only a few images' worth? Are only the "big" players with terabytes of data the ones with data problems? No.
Data Science needs to address all of these types of data problems, and ultimately it doesn't matter how big each scenario's data is - it is "too big" for their use case, and causes them a data problem.
So let's agree to stop measuring our data and looking to compare sizes like some set of school kids in the locker room! Let's continue to solve "data problems" regardless of size so that greater value can be derived from datasets. We need to focus our energy on creating the techniques, tools, and solutions that will answer the endless stream of questions we want to investigate using the plethora of data that exists in our world today.