Over the last few years, a question I’ve often been asked is: what does it take to become a data scientist? Often my answers centered on the technology, i.e. learn Spark, Python, and/or R; take courses in Data Sciences; play with data sets; etc. Yet, I was never fully satisfied with that answer because I had always felt that the heart of Data Sciences (and Big Data in more generic terms) is the data, or more specifically, the ability to understand the data.
Recently, I re-read “Freakonomics: A Rogue Economist Explores the Hidden Side of Everything” and it dawned on me that this is in fact the book you should read to find out if you are interested in a career in Data Sciences. Note, this isn’t a book about techniques or technologies, and it would have been pretty boring if it were. Instead the book focuses on the essence of data sciences: asking the right questions, looking for patterns in your data, breaking conventional wisdom, and asking even more questions.
Ask the right questions
In the book, the authors tell a story about Paul Feldman, who had built a business selling bagels within companies by using an honor system: leave a box to collect money next to the bagels and people will pay for their bagels whenever they take them. The book includes a fascinating amount of detail about how holidays, type of organization, management type, etc. affected whether people paid for the bagels.
As a self-professed geek, I immediately wanted to know whether Feldman could identify if some people had paid more than their fair share because of other social characteristics, such as not having paid the last time, feeling charitable, etc. This way it might be possible to identify which individuals were more benevolent (or malevolent). I even started going down the path of how to design the experiment (e.g. could one identify more charitable people by the larger denominations of bills they left? Perhaps one could use Apple Pay to pay for bagels so you could associate (though not identify) individuals with the amounts they pay, broken out by industry, etc.).
Yet, those questions, while interesting to me, are beside the point from an economic standpoint. The real question was whether Feldman could be profitable by delivering and selling bagels using an honor system. He needed only to identify which companies could be trusted to follow through on the honor system.
The difficulty of obtaining high quality data
An exceedingly interesting section of the book asks the question “Why Do Drug Dealers Still Live with Their Moms?” Ultimately, Sudhir Venkatesh was able to obtain the data needed to answer this question by living with the Black Disciples gang in Chicago for a number of years to gain unfettered access to their lives.
While this is certainly not the most common experience for a Data Scientist, it does drive home the lesson that obtaining high quality data is far more difficult than most people realize. An oft-quoted statistic in Data Sciences (and Data Mining and Business Intelligence before it) is that 80% of your time as a Data Scientist is spent making sense of the data. While some Data Scientists consider themselves modelers (and that’s completely cool), it is often difficult to gain context for the data without actually getting your hands dirty. This often means doing the mind-numbing task of parsing the data, the dull task of extracting the data from multiple data sources, asking survey questions and manually applying qualitative analysis (e.g. theming), among the hundreds of other relatively boring tasks, before finally getting to the heart of the data.
Correlation does not equal Causation
A common problem when sifting through and analyzing a lot of data is that people mistake correlation for causation. The existence of a correlation does not necessarily imply causation. If it did, then Nicolas Cage would really need to stop appearing in movies for the sake of all the people who drown by falling into a swimming pool.
From: Fast Company Fast Design: Hilarious Graphs Prove That Correlation Isn’t Causation
While this correlation / causation graph is more obvious (and absolutely hilarious), many times we succumb to conventional wisdom and immediately equate correlation with causation. For example, it is conventional wisdom that the increase in incarceration times and the use of capital punishment are the primary reasons we see the drop in crime in the United States. But a deeper analysis of the data reveals a more controversial picture: one of the major factors in the drop in crime was the legalization of abortion, along with birth control becoming cheaper and more widely available. Please note, this is not an attempt to advocate for abortions; rather, this is a case where conventional wisdom is not necessarily right and you have to dive deeper into the data to try to understand it.
There is always the risk that you will see patterns where none exist, so it’s important to let the data speak to you instead of modifying the data to fit your understanding.
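To make the point concrete, here is a minimal Python sketch (my own illustration, not something from the book). It assumes two made-up, independently generated series that both happen to trend upward over time; the names `series_a` and `series_b`, the slopes, and the noise levels are all hypothetical. Because both series share a trend, their raw correlation will typically come out close to 1 even though neither has anything to do with the other. Comparing the period-to-period changes instead removes the shared trend, and the correlation usually falls toward zero.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def make_trending_series(rng, n, slope, noise):
    """Synthetic data: a straight-line trend plus independent Gaussian noise."""
    return [slope * t + rng.gauss(0, noise) for t in range(n)]


def differences(xs):
    """Period-to-period changes, which remove a shared linear trend."""
    return [later - earlier for earlier, later in zip(xs, xs[1:])]


# Hypothetical, unrelated series that both grow over 50 periods.
rng = random.Random(42)
series_a = make_trending_series(rng, 50, slope=1.0, noise=2.0)
series_b = make_trending_series(rng, 50, slope=3.0, noise=6.0)

raw_r = pearson(series_a, series_b)
diff_r = pearson(differences(series_a), differences(series_b))

print(f"correlation of raw series:     {raw_r:.2f}")
print(f"correlation of period changes: {diff_r:.2f}")
```

Differencing is only one way to check whether a shared trend is doing all the work, and a low number here does not prove the absence of a relationship any more than a high one proves its presence. It is simply a cheap sanity check before reading meaning into a chart.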
The Importance and Impact of Outliers
The authors end Freakonomics with a short but astoundingly interesting observation concerning outliers titled “Epilogue: Two Paths to Harvard”. Without revealing the contents, the important callout is that underneath every population study, regression analysis, or application of an algorithm is a person (or the events pointing to that person). No matter what the norm is, outliers do exist and it is important to count them.
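As a small illustration of counting outliers rather than averaging them away, here is a hedged Python sketch. The `incomes` list is invented for this example, and the `flag_outliers` helper is hypothetical; it uses a modified z-score based on the median absolute deviation with a cutoff of 3.5, which is one common rule of thumb and not the only reasonable choice.

```python
import statistics


def flag_outliers(values, threshold=3.5):
    """Return values whose modified z-score (median/MAD based) exceeds threshold."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:
        # At least half the values equal the median; this rule cannot rank them.
        return []
    return [v for v in values if 0.6745 * abs(v - med) / mad > threshold]


# Hypothetical incomes (in thousands); one person is far from everyone else.
incomes = [32, 35, 38, 40, 41, 43, 45, 47, 50, 900]

print("mean:    ", statistics.mean(incomes))
print("median:  ", statistics.median(incomes))
print("outliers:", flag_outliers(incomes))
```

The mean is pulled far above what any typical person in the list earns, while the median barely moves. Flagging the unusual value is the start of the work, not the end: the next step is to go and look at who or what is behind it.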
From asking the right questions, to getting high quality data, to breaking conventional wisdom, Freakonomics is a great primer for budding data scientists. If after reading this book you are fascinated by its pattern matching and detective work, this is a solid signal that you may enjoy data sciences. Note, this is not to trivialize the importance of understanding key technologies such as Spark, Python, R, Machine Learning, etc. But you need to be interested in continually asking questions and going past conventional wisdom to make that technology work for you.