You’ve probably already heard plenty of people complain about the amorphous definition of big data. In this article, I’m going to define it as databases that are large enough to require MapReduce or some other specialized algorithm to query. Medium data is anything that requires enterprise-level SQL to query. You probably already have medium data if you are enough of a functional business to be reading this. You already know that capturing and having access to your operational data is important. The million dollar question (literally) is whether you need to upgrade from medium data to big data technology.
As in the American justice system, you should consider your current infrastructure to be innocent until proven guilty. Guilty, of course, of slowing down your business and failing to deliver the return on investment you need to remain competitive. So let’s take a look at the evidence against medium data:
1. Once you have more than 1 billion rows, it can be slow to query
This is certainly true, but let’s define slow. If you’re extracting to run overnight, is that really such a bad thing? How often do you need to make a decision based on data that you aggregate within one hour? Is one of your customers, perhaps, being held hostage at a location that can only be discovered via a rapid query of data collected over the last ten years? With proper planning and a little bit of patience, you can have all the answers that you need available each morning as you arrive. Remember, big data solutions don’t give you instant results either, especially on your fastest moving projects, where your data engineers will struggle to keep up with the latest changes in operational data.
2. You need to run advanced algorithms on your most granular data
Although it’s very easy to aggregate data in such a way that you’ll be able to find the average, the sum, and even approximations of the median by any dimension that you’d like, it’s true that you won’t be able to run algorithms such as clustering, classification trees, and other machine learning on data that you’ve already rolled up. My question is, how often do you need to do this across all of your data? You’ll often find that you want to do it for a specific sample, which usually ends up being perfectly manageable as an extract.
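To make the distinction concrete, here is a minimal sketch of the two approaches, using Python's built-in sqlite3 module as a stand-in for an enterprise SQL database. The `orders` table, its columns, and the row counts are all hypothetical and chosen only for illustration. The rollup gives you sums and averages by dimension; the sample gives you granular rows that are small enough to hand to a clustering or classification tool.

```python
import random
import sqlite3

# Hypothetical granular table: one row per order (names are illustrative only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, amount REAL)")

# Generate made-up data with a fixed seed so the sketch is repeatable.
rng = random.Random(42)
rows = [
    (rng.choice(["north", "south", "east"]), round(rng.uniform(5, 500), 2))
    for _ in range(10_000)
]
conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)

# Rollup: counts, sums, and averages by dimension. Once the data is
# aggregated like this, the individual orders are no longer available.
rollup = conn.execute(
    "SELECT region, COUNT(*), SUM(amount), AVG(amount) "
    "FROM orders GROUP BY region ORDER BY region"
).fetchall()

# Extract: a random sample of granular rows, small enough to analyze on a
# single machine. ORDER BY RANDOM() scans the whole table, which is fine for
# a sketch; a very large table would likely call for a cheaper sampling method.
sample = conn.execute(
    "SELECT region, amount FROM orders ORDER BY RANDOM() LIMIT 500"
).fetchall()

for region, n, total, avg in rollup:
    print(f"{region}: {n} orders, total {total:.2f}, average {avg:.2f}")
print(f"sampled {len(sample)} granular rows for modeling")
```

The design point is simply that the rollup and the extract answer different questions, and both can come out of the same SQL database on an overnight schedule.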
3. All the cool kids are doing it!
Emulating the cool kids didn’t serve me very well in high school, and by going to an engineering college I managed to lose touch entirely with what cool meant. Look at me now: I’ve managed to build a good career, family life, and a blog that at least one person appears to be reading! Seriously though, it’s important to remember that the companies you read about are almost by definition outliers. Of the thousand companies in the Fortune 1000, how many get written about on a regular basis? The ones that you hear about again and again are the remarkable ones with both the needs and the budget to be pushing the limits on a regular basis. The thing about pushing the limits is that there’s a lot of pushing involved! And that pushing, while heroic, is effort that you could have spent on internal projects with a much clearer return on investment. Think for a moment: don’t you have important projects that you don’t have time to implement? Spending precious resources to emulate another leader is neither good leadership nor good management unless you can clearly prove the need for your specific organization.
I hope that this post has made you think about whether you really, truly need big data in your organization. I am not, of course, arguing that no organizations need big data, but even though you may think you want big data, it’s improbable that you need it. Best of luck with your ‘big’ decision!