Great books have the power to transform us, shaping our thoughts and expanding our horizons. For example, the book "How to Change Your Mind" literally changed my mind on the topic of consciousness and psychedelics.
When it comes to data science, there have been many books that have greatly impacted my thinking. Many of their lessons have stood the test of time, offering insights that remain relevant even as technology evolves at breakneck speed. Here are ten of them.
1 - "Thinking, Fast and Slow" by Daniel Kahneman (2011)
While not a data science book, "Thinking, Fast and Slow" is a cornerstone text for any aspiring unicorn data scientist. Daniel Kahneman, a Nobel laureate in Economics, presents a groundbreaking exploration of the two systems that drive the way we think:
- System 1: Fast, intuitive, and emotional
- System 2: Slower, more deliberative, and more logical
Why It's Essential for Data Scientists
- Meta-Understanding of Data Science: On a meta level, data science functions as the System 2 of an organization - deliberate, analytical, and rational. It needs to co-exist with and complement the fast, heuristic-driven System 1 thinking that often dominates business decision-making.
- Cognitive Biases: The book delves deep into various cognitive biases that affect decision-making. As a data scientist, understanding these biases is crucial when interpreting data, designing experiments, or presenting findings to stakeholders.
- Improved Data Interpretation: Kahneman's insights help data scientists recognize when intuition (System 1) might be misleading us in data analysis, encouraging a more thoughtful, System 2 approach.
- Effective Communication: Understanding how others think and make decisions can help data scientists communicate their findings more effectively, bridging the gap between analytical insights and intuitive decision-making.
2 - "Factfulness" by Hans Rosling, with Ola Rosling and Anna Rosling Rönnlund (2018)
Many data enthusiasts may already be familiar with Hans Rosling's captivating data visualization videos and his pioneering work with Gapminder. But even if you have already seen all his videos and TED talks, you still should not miss his book "Factfulness".
Why It's Essential for Data Scientists
- Human Biases in Data Interpretation: At its core, the book is less about data itself and more about the human biases that influence how we perceive and interpret data. In this sense, it serves as a perfect companion to "Thinking, Fast and Slow," offering a data-centric perspective on cognitive biases.
- Data-Driven Worldview: I deeply appreciate the authors' fact-based yet optimistic view of global progress. Being grounded in reality doesn't mean one has to become jaded or cynical. This "data-driven optimism" is refreshing and inspiring, and it is a reminder that data scientists can have a real positive impact through their work.
- Practical Frameworks: The book masterfully demonstrates how to think about large-scale global issues through the lens of analytics and statistics. Even when countless rounds of data cleaning and analysis get tedious, that work can uncover a grander understanding of the world. Isn't that motivating?
3 & 4 - "How to Lie with Statistics" by Darrell Huff (1954) and "What is a p-value anyway?" by Andrew Vickers (2009)
This pairing of books, separated by over half a century, offers a fascinating look at the enduring challenges and potential pitfalls in statistical analysis and interpretation. They are both extremely engaging and informative, presenting real-world examples with humor and wit.
If "Factfulness" demonstrates the power of data mastery, then these two books show what lies at the other end of the spectrum. Every data scientist should read them, both for entertainment and to avoid getting caught in common statistical traps.
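A classic example of such a trap is running many comparisons and reporting only the ones that look "significant." The sketch below is a hypothetical simulation, not taken from either book: it runs 1,000 experiments in which both groups are drawn from the same distribution, so any "significant" result is a false positive. The function names, sample sizes, seed, and the simplifying assumption of a known standard deviation are all illustrative choices.

```python
import math
import random


def two_sided_p_value(group_a, group_b, sigma=1.0):
    """z-test p-value for a difference in means, assuming a known common sigma."""
    n_a, n_b = len(group_a), len(group_b)
    diff = sum(group_a) / n_a - sum(group_b) / n_b
    std_err = sigma * math.sqrt(1 / n_a + 1 / n_b)
    z = diff / std_err
    # P(|Z| >= |z|) for a standard normal Z
    return math.erfc(abs(z) / math.sqrt(2))


def count_false_positives(n_experiments=1000, n_per_group=30, alpha=0.05, seed=0):
    """Count 'significant' results when both groups share the SAME distribution."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_experiments):
        a = [rng.gauss(0, 1) for _ in range(n_per_group)]
        b = [rng.gauss(0, 1) for _ in range(n_per_group)]
        if two_sided_p_value(a, b) < alpha:
            hits += 1
    return hits


if __name__ == "__main__":
    hits = count_false_positives()
    print(f"{hits} of 1000 no-effect experiments had p < 0.05")
```

Roughly 5% of these no-effect experiments should cross the 0.05 threshold by chance alone, which is why a single small p-value picked out of many tests deserves skepticism.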
5 & 6 - "The Black Swan" by Nassim Nicholas Taleb (2007) and "The Physics of Wall Street" by James Owen Weatherall (2013)
Before data scientists, quants were already making a name for themselves in the world of finance. These two books offer crucial insights into the work of quantitative analysts, the complexities of financial modeling, and the limitations of predictive analytics.
Why They're Essential for Data Scientists
- Limitations of Models: As the statistician George Box put it, "All models are wrong, but some are useful." Our work as data scientists often revolves around creating and tuning models. We can become so attached to the models we create that we try to bend the world to fit the model instead of adjusting the model to fit the world. Every model has built-in assumptions, so every model should be applied with caution and interpreted with care and humility.
- Black Swan Events: Taleb's concept of "Black Swan" events - rare, unpredictable occurrences with extreme impact - is particularly relevant in today's hyper-connected world. It reminds us that not everything can be predicted, even with the most sophisticated models. When systems are interconnected through intricate dependencies, sudden and widespread chaos is more common than we might think.
- Managing Uncertainty: These books provide valuable perspectives on handling uncertainty in complex systems. They teach us that uncertainty is not just a nuisance to be eliminated, but a fundamental aspect of reality that must be thoughtfully managed. For data scientists, this means developing robust methodologies that account for uncertainty, communicating the limitations of our predictions clearly, and helping decision-makers navigate ambiguous situations.
- Interdisciplinary Thinking: Data scientists come from diverse backgrounds, and this diversity is a strength of the field. Weatherall's book showcases the value of interdisciplinary approaches, illustrating how concepts from physics have been applied to finance. This encourages data scientists to think beyond their immediate field, drawing insights and methodologies from various disciplines to solve complex problems. It reminds us that some of the most innovative solutions come from unexpected connections between different areas of knowledge.
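To make the point about extreme events concrete, here is a minimal sketch (an illustration of fat tails in general, not code or data from either book). It compares a thin-tailed sample with a fat-tailed one and asks how much of the total comes from the single largest observation. The choice of a Pareto distribution with shape 1.1, the sample size, and the function names are assumptions made for illustration.

```python
import random


def largest_share(samples):
    """Fraction of the total contributed by the single largest observation."""
    return max(samples) / sum(samples)


def compare_tails(n=100_000, pareto_alpha=1.1, seed=42):
    """Return (thin-tailed share, fat-tailed share) of the largest observation."""
    rng = random.Random(seed)
    # Thin-tailed: absolute values of standard normal draws
    thin = [abs(rng.gauss(0, 1)) for _ in range(n)]
    # Fat-tailed: Pareto draws; a small alpha means very heavy tails
    fat = [rng.paretovariate(pareto_alpha) for _ in range(n)]
    return largest_share(thin), largest_share(fat)


if __name__ == "__main__":
    thin_share, fat_share = compare_tails()
    print(f"thin-tailed: largest point is {thin_share:.4%} of the total")
    print(f"fat-tailed:  largest point is {fat_share:.4%} of the total")
```

In the thin-tailed sample no single point matters much, while in the fat-tailed sample one observation can account for a noticeable share of the whole. A model calibrated on "typical" data can miss exactly that kind of observation.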
7 - "The Book of Why" by Judea Pearl (2018)
If I had a quarter for every time I heard someone say "correlation does not imply causation" as a way to shut down a potentially productive line of thinking, well, I'd have many quarters.
While this tenet should indeed be deeply ingrained in every data scientist's mind, it doesn't mean we should be constrained by it to the point of uselessness. Every analysis wants to be a causal analysis. What I mean by this is that the goal of an analysis is to understand the world better, and understanding the world better is not purely descriptive. It often means providing a mechanistic explanation for how things work, which is inherently a causal interpretation. Therefore, an analysis with a causal interpretation that is potentially on the right track can be far more useful than one that is purely descriptive.
For decades, Professor Pearl has been advocating causal inference in a field notorious for its insistence on model-free frameworks. "The Book of Why" is a call to action. While we must be cautious about inferring causation, we shouldn't let this caution paralyze us. Instead, we should dare to make causal hypotheses and seek creative ways to design experiments to test them. Pearl provides a framework for thinking about causation rigorously, moving beyond mere statistical associations and towards a deeper understanding of the phenomena we study.
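A toy simulation can show why stating a causal hypothesis explicitly is useful. The sketch below is a hypothetical example, not Pearl's notation or code from the book: it assumes a causal structure in which a confounder z drives both x and y, with no direct link between x and y. The naive correlation between x and y is strong, but it largely disappears once z is adjusted for (here via simple linear regression residuals). The variable names, noise levels, and seed are illustrative assumptions.

```python
import random


def mean(values):
    return sum(values) / len(values)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def residuals(ys, zs):
    """Residuals of a simple least-squares regression of ys on zs."""
    mz, my = mean(zs), mean(ys)
    slope = (sum((z - mz) * (y - my) for z, y in zip(zs, ys))
             / sum((z - mz) ** 2 for z in zs))
    return [(y - my) - slope * (z - mz) for z, y in zip(zs, ys)]


def simulate(n=20_000, seed=1):
    """Return (naive correlation of x and y, correlation after adjusting for z)."""
    rng = random.Random(seed)
    # Assumed causal structure: z -> x and z -> y, with NO arrow between x and y
    z = [rng.gauss(0, 1) for _ in range(n)]
    x = [zi + rng.gauss(0, 1) for zi in z]
    y = [zi + rng.gauss(0, 1) for zi in z]
    naive = pearson(x, y)
    adjusted = pearson(residuals(x, z), residuals(y, z))
    return naive, adjusted


if __name__ == "__main__":
    naive, adjusted = simulate()
    print(f"naive correlation:      {naive:.3f}")
    print(f"adjusted for confounder: {adjusted:.3f}")
```

The adjustment only works because the causal structure was written down first. The same data with a different assumed structure would call for a different analysis, which is the kind of explicit reasoning the book argues for.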
8 - "Effective Python" by Brett Slatkin (2nd edition, 2019)
I know I proposed "SQL" as my answer to the age-old question of "Python or R," but investing in Python skills will open doors (specifically, career opportunities).
Improving your Python skills can be overwhelming, however, because there are so many books to choose from. I have tried several of them, and "Effective Python" is the one I recommend. It's concise, clear, and practical. The book organizes key Python concepts into small, digestible topics, so you can pick up one topic at a time whenever you have a few minutes to spare. Before you know it, you'll have gone through all the topics in the book!
9 - "Machine Learning: A Probabilistic Perspective" by Kevin P. Murphy (2012)
In the vast sea of machine learning literature, "Machine Learning: A Probabilistic Perspective" by Kevin P. Murphy stands out as a unique resource. By framing machine learning concepts through a probabilistic lens, the book provides a unifying principle that applies both to traditional ML methods and to the models behind today's AI-driven world.
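As a small taste of what a probabilistic lens looks like in practice, here is a minimal sketch of Bayesian updating with a Beta prior and binomial data. It is a standard textbook example, not an excerpt from Murphy's book, and the function names and the numbers in the usage example are hypothetical.

```python
def beta_binomial_update(prior_a, prior_b, successes, failures):
    """Posterior Beta parameters after observing binomial data under a Beta prior."""
    return prior_a + successes, prior_b + failures


def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)


if __name__ == "__main__":
    # Start from a uniform Beta(1, 1) prior, then observe 7 successes and 3 failures
    a, b = beta_binomial_update(1, 1, successes=7, failures=3)
    print(f"posterior: Beta({a}, {b}), mean = {beta_mean(a, b):.3f}")
```

Instead of a single point estimate, the result is a full distribution over the unknown rate, so the uncertainty is part of the answer.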
10 - "Designing Data-Intensive Applications" by Martin Kleppmann (2017)
We round out our list with "Designing Data-Intensive Applications" by Martin Kleppmann, because data engineering is a crucial skill for the modern data scientist. This comprehensive book covers essentially all topics a data engineer should know, from the fundamentals of data models and query languages to the intricacies of distributed systems and the challenges of maintaining data integrity at scale.
Build Your Own Data Science Library
Rome wasn't built in a day, and neither is a unicorn data scientist. While this list might look long, and some of these books are quite hefty, remember that learning is a marathon, not a sprint. Over time, as you explore these books, you'll find yourself building a robust foundation in critical thinking, causal reasoning, statistical analysis, and data engineering.
Do you have a favorite data science book? Would you like to recommend it to other data scientists? Let us know!