In a post that recently went viral, Bernard Marr asked the controversial question “why are there so many fake data scientists?” He argues that the incredible demand for data scientists has led many people without the proper qualifications to start using that title.
I can understand his frustration, and I think he’s right that there has been a flood of people changing their titles without necessarily changing their focus or training. However, I’d like to humbly add a few more factors to the discussion.
To understand the first factor, it’s interesting to compare data science with software development. Although there has been tremendous demand for software developers for nearly a decade now, especially in an era where people struggle to find a job, there has not been a similar surge of unqualified people simply changing their titles and hoping that will help them get hired. It is valuable to look into why these two fields differ, and how we can use that to better understand the dynamics of the data science job market.
The Core Difference Between Software Development and Data Science
If you drew a graph of output by skill for software engineers, you would see that up to a very significant level of skill, one that would take the average person at least a year to reach, they have basically no output. They couldn’t even read the existing code, much less contribute functionality that doesn’t break more often than it works. An impostor would last about five minutes in a software engineering organization.
A data scientist, on the other hand, can start to contribute value after a very small amount of training, often just a solid few days of watching tutorials on Excel. Excel is designed to be user-friendly (I’ll stick to one controversy at a time by withholding my opinions on how well it achieves that goal). Most data sets have at least a few pieces of low-hanging fruit that can bring real value to anyone with a tiny bit of business sense and the ability to use a pivot table.
This isn’t a bad thing! The fact that some results can be achieved by a relative newcomer does not diminish the fact that extraordinary results can be achieved by a skilled and experienced practitioner. It simply means that data science is a title with enormous flexibility, one where anyone with common sense and a strong work ethic can start climbing the mountain and contributing to the community. We should be proud of that, and not try to close it off by insisting that only certain people count as data scientists. What we should do instead is create an understanding of the various levels of value that data scientists can bring, and celebrate people who are delivering more value every single day.
Michael Jordan (see Wikipedia if you haven’t at least seen Space Jam) has never insisted that I am not a basketball player, despite the enormous gulf in talent between us. He doesn’t need to, because the difference in our skill is plainly visible in objective measurements, so he can rest assured that anyone who cares about the matter will quickly be able to determine our relative altitudes on the mountain of basketball greatness.
On the other end of the spectrum, I’ve always suspected that one reason for the constant debate about the definition of “art” is that people who are experienced and skilled at the craft worry that unsophisticated observers will not be able to tell the difference between their work and the random dabbling of a newcomer. There is a difference between good art and bad art, but I don’t believe that anyone’s work should be dismissed as “not art”.
There Is Still a Boundary
Bringing the conversation back to data science, there are certainly analyses that are simply wrong. While it’s not exactly accurate to say that they are “not data science”, that does create an objective boundary between things that have no value and those whose value simply depends on the context (generally, the needs of the stakeholder). Just like in art, sometimes the most valuable and important works are not the most sophisticated. Even then, you often find that the practitioner was more than qualified to dazzle us with their technical ability, but had the wisdom and insight to focus on the specific tools required to do a specific job. You don’t often find an inexperienced artist or data scientist creating an elegant masterpiece by accident on their first try.
How to Get Started With Data Science
Newcomers benefit from a variety of publicly available tools and data sets. Alexis Perrier, data scientist and software engineer at Berklee Online, notes that Amazon Web Services has a set of Predictive Analytics tools that let you experiment with concepts like cross-validation and AUC. He also suggests Twitter feeds as a great starting data source. This freely available stream of data offers a surprisingly rich diversity of information to play with, including:
- Exact timestamps
- Free text
- RT counts
- IDs to join to other Twitter accounts and their tweets
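As a rough sketch of what playing with those fields might look like, here is a small Python example using only the standard library. The records, field names (`created_at`, `retweets`, `user_id`), and account names are all made up for illustration; they are not the actual Twitter API schema. The `auc` helper is a plain pairwise version of the AUC mentioned above, not the AWS tooling.

```python
from collections import Counter
from datetime import datetime

# Made-up records for illustration only; the field names are assumptions,
# not the real Twitter API schema.
tweets = [
    {"id": 1, "user_id": 10, "created_at": "2015-03-02T09:15:00",
     "text": "New pivot table tutorial is up", "retweets": 12},
    {"id": 2, "user_id": 11, "created_at": "2015-03-02T09:40:00",
     "text": "Coffee first", "retweets": 0},
    {"id": 3, "user_id": 10, "created_at": "2015-03-02T14:05:00",
     "text": "Cross validation explained with pictures", "retweets": 30},
    {"id": 4, "user_id": 12, "created_at": "2015-03-03T14:30:00",
     "text": "Monday again, and the build is broken", "retweets": 1},
]
# Hypothetical account lookup, keyed by user ID.
users = {10: "data_fan", 11: "sleepy_dev", 12: "weekday_bot"}


def tweets_per_hour(records):
    """Count tweets by hour of day, using the exact timestamps."""
    return Counter(datetime.fromisoformat(r["created_at"]).hour for r in records)


def retweets_by_user(records, user_names):
    """Join tweets to accounts on user_id and total the retweet counts."""
    totals = Counter()
    for r in records:
        totals[user_names[r["user_id"]]] += r["retweets"]
    return dict(totals)


def auc(scores, labels):
    """Share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# A deliberately naive question: does word count predict a "popular" tweet?
# The threshold of 10 retweets is an arbitrary choice for this toy sample.
word_counts = [len(t["text"].split()) for t in tweets]
is_popular = [t["retweets"] >= 10 for t in tweets]

print(dict(tweets_per_hour(tweets)))
print(retweets_by_user(tweets, users))
print(auc(word_counts, is_popular))
```

On this toy sample the word-count score earns an AUC of 0.5, no better than a coin flip, which is itself a useful first lesson: a quick count or join can surface something real, but a plausible-sounding predictor still has to be measured.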
So let’s welcome the new members of our data science community, and make clear that they bring value. It may not yet be at the level of a much more experienced data scientist, but it is value they can measure, contribute, and, most importantly, improve with time.