How to Transition into a Data Scientist Role: A Comprehensive Guide

Many individuals have reached out to me recently, inquiring about how they can pursue a career as a data scientist. After gathering the most common questions and refining my responses over time, I've put together this comprehensive guide. It's my hope that this resource will help anyone interested in this field to learn and grow.

Index

What is a data scientist?
What background is needed?
What should be studied?
Step-by-step subject breakdown.
Recommended courses, books, and films.
Next steps.

1. What is a Data Scientist?

Data scientists are hard to find, in part because even seasoned professionals often struggle to define the boundaries of the role. One way to delineate it is to say that a data scientist is responsible for developing predictive and/or explanatory models using machine learning and statistical techniques. For a more detailed description, you can refer to the following link:

After all, what’s new in Data Science? The 10 Bias and Causality Techniques that Everyone Needs to Master

Being a data scientist requires continuous learning and adapting. Simply mastering machine learning techniques isn't enough; it also takes a shift in mindset, approaching problems with skepticism and without bias. This introductory guide aims to save you time in seeking out the best materials and learning paths, while acknowledging that proper training still takes hundreds of hours.

2. What Background is Necessary?

The first wave of data scientists predominantly came from software development, computer science, and engineering backgrounds. They were tasked with creating machine learning models, optimizing processes, and minimizing cost functions. Their work involved analyzing unstructured data and developing problem-specific programs, often resorting to manual processing due to computational limitations. Fortunately, most of these tasks are now streamlined by high-performance software, allowing current data scientists to focus more on modeling than engineering.

The good news is that the learning curve has become less steep. People from various backgrounds can enter this field, largely due to the prevalent use of Python—a high-level programming language that simplifies coding. Writing in Python is akin to writing in English, allowing newcomers to grasp the basics in just a few weeks. Furthermore, much of the data scientist's workload is being automated or is shifting towards specialized roles like Machine Learning Engineer or Data Engineer.

Working with Big Data is now as straightforward as writing SQL in environments like Databricks. Creating scalable algorithms for production is becoming simpler with tools like SageMaker. Even complex feature engineering is increasingly being automated with AutoML.

In summary, while a programming background remains important, its significance is diminishing. I anticipate that programming will evolve into a niche skill confined to IT professionals. Therefore, my advice is to channel your efforts into analysis, modeling, and scientific inquiry.

3. So, What Should Be Studied?

Programming: Python and SQL

Learning to program is essential, and many languages can meet that need. However, for beginners, Python stands out due to its extensive community support for data analysis. You'll find numerous examples on platforms like Kaggle and Stack Overflow, making it easier to learn and find job opportunities.

Machine Learning: The Common Denominator

You cannot avoid covering the fundamentals of machine learning. When I began studying this field in 2014, most courses focused heavily on deriving models. That approach gives a solid understanding of how different models operate, but I recommend first treating a model as a "black box" that converts inputs into outputs. Many techniques, such as validation and feature selection, are common across models, so it's best to learn these before diving into the mathematical details.
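To make the "black box" idea concrete, here is a minimal sketch in plain Python. The class names `MeanModel` and `NearestNeighborModel`, the `fit`/`predict` method names, and the toy data are all assumptions chosen for illustration, not taken from any particular library. The point is that the evaluation code never looks inside the box.

```python
# Two toy models sharing one interface: fit(xs, ys), then predict(xs).
# All names and data are illustrative assumptions.

class MeanModel:
    """Always predicts the average of the training targets."""
    def fit(self, xs, ys):
        self.mean_ = sum(ys) / len(ys)
        return self

    def predict(self, xs):
        return [self.mean_ for _ in xs]


class NearestNeighborModel:
    """Predicts the target of the closest training input (1-nearest neighbor)."""
    def fit(self, xs, ys):
        self.xs_, self.ys_ = list(xs), list(ys)
        return self

    def predict(self, xs):
        preds = []
        for x in xs:
            nearest = min(range(len(self.xs_)), key=lambda i: abs(self.xs_[i] - x))
            preds.append(self.ys_[nearest])
        return preds


def mean_absolute_error(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def evaluate(model, train, test):
    """Holdout validation: fit on train, score on unseen test data."""
    (x_train, y_train), (x_test, y_test) = train, test
    model.fit(x_train, y_train)
    return mean_absolute_error(y_test, model.predict(x_test))


# Toy data in which y is exactly 2 * x.
train = ([1, 2, 3, 4, 5, 6], [2, 4, 6, 8, 10, 12])
test = ([2.4, 5.6], [4.8, 11.2])

scores = {type(m).__name__: evaluate(m, train, test)
          for m in (MeanModel(), NearestNeighborModel())}
print(scores)
```

The same `evaluate` function works for both models, which is why validation can be learned once and reused before studying how any individual model is derived.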

Statistics

I’ve saved the most critical and challenging aspect for last. Mastery of statistics is what will set a data scientist apart from a Machine Learning Engineer. Start with descriptive statistics and learn how to conduct thorough exploratory data analyses (EDA). Understanding selection bias, Simpson's Paradox, variable associations, and the fundamentals of statistical inference will be crucial for your development.
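As a rough illustration of Simpson's Paradox, the sketch below uses made-up counts (they are not real data) for two hypothetical treatments, A and B, split by case severity. Treatment A has the higher recovery rate within each severity group, yet B looks better once the groups are pooled.

```python
# Simpson's Paradox with invented counts: (recovered, total) per treatment and severity.
counts = {
    "A": {"mild": (18, 20), "severe": (48, 80)},
    "B": {"mild": (64, 80), "severe": (10, 20)},
}

def rate(recovered, total):
    return recovered / total

def overall_rate(groups):
    """Pool all severity groups of one treatment into a single recovery rate."""
    recovered = sum(r for r, _ in groups.values())
    total = sum(t for _, t in groups.values())
    return recovered / total

# Within each severity level, A beats B.
for severity in ("mild", "severe"):
    a = rate(*counts["A"][severity])
    b = rate(*counts["B"][severity])
    print(f"{severity:>6}: A={a:.0%}  B={b:.0%}")

# Pooled over severity, the ranking flips.
overall = {name: overall_rate(groups) for name, groups in counts.items()}
print(f"overall: A={overall['A']:.0%}  B={overall['B']:.0%}")
```

In these invented numbers A wins 90% to 80% on mild cases and 60% to 50% on severe ones, but loses 66% to 74% overall, because A was applied mostly to severe cases. That uneven assignment is a form of selection bias, which is why the two topics are worth studying together.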

4. What is the Ideal Track?

This question is complex, and the answer varies with individual background. For beginners, I suggest pursuing the subfield that resonates most with you as you progress.

Prerequisites:

Mathematics: Algebra → Calculus
Statistics: Descriptive Statistics → Probability → Inference
Python: Data types → Iterations → Conditionals → Functions
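For a sense of scale, the four Python prerequisites fit into a handful of lines. The following is only a small sketch; the variable names, the temperature values, and the 22-degree threshold are arbitrary choices for illustration.

```python
# Data types: a list of floats and a dict mapping strings to lists.
temperatures_c = [18.5, 21.0, 25.5, 30.0]
labels = {"cool": [], "warm": []}

# Function: reusable logic with a clear input and output.
def to_fahrenheit(celsius):
    return celsius * 9 / 5 + 32

# Iteration and conditionals: loop over the values and branch on each one.
for temp in temperatures_c:
    if temp < 22:
        labels["cool"].append(to_fahrenheit(temp))
    else:
        labels["warm"].append(to_fahrenheit(temp))

print(labels)
```

If you can read and modify a snippet like this comfortably, you are ready for the basic knowledge stage.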

Basic Knowledge:

Basic knowledge encompasses what every data scientist should know, irrespective of their specialization:

- Data Analysis with Pandas: Managing various file types.
- Statistics: Exploring associations between variables and hypothesis testing.
- Visualization: Using tools like matplotlib and bokeh.
- Data Handling: Proficiency in SQL and querying APIs.
- Supervised Machine Learning: Understanding gradient descent, bias-variance trade-off, validation, and feature selection.
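Of the supervised learning topics listed above, gradient descent and validation are the easiest to show in a few lines. The sketch below fits a one-feature linear model in plain Python. The toy data, the learning rate, and the step count are assumptions picked so the example converges, and in practice a library such as scikit-learn would normally do this for you.

```python
# Gradient descent for a one-feature linear model y = w * x + b,
# minimizing mean squared error. Data and hyperparameters are illustrative.

def mse(w, b, xs, ys):
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def gradient_descent(xs, ys, learning_rate=0.05, steps=2000):
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Partial derivatives of the MSE with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        # Step against the gradient to reduce the error.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

# Toy data generated from y = 2x + 1 (no noise), split into train and validation.
x_train, y_train = [0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0]
x_val, y_val = [4.0, 5.0], [9.0, 11.0]

w, b = gradient_descent(x_train, y_train)
print(f"w={w:.3f}, b={b:.3f}, validation MSE={mse(w, b, x_val, y_val):.6f}")
```

The validation points are never used during fitting, so a low validation error suggests the model has learned the underlying relationship rather than memorizing the training points.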

Intermediate Knowledge:

At this stage, a data scientist should specialize in more niche areas:

- Statistics: Bayesian statistics and causal experiments.
- Data Handling: Data ingestion and working with unstructured data.
- Production Algorithms: Developing transformation pipelines and APIs.

Advanced Knowledge:

Advanced skills may not be required for every data scientist, but familiarity is beneficial:

- Deep Learning: Exploring reinforcement learning and computer vision.
- Statistics: MCMC and causal modeling techniques.

5. Where to Learn All This?

Since this guide is intended for anyone interested in data science globally, I will recommend online courses, preferably free options with English subtitles that cover the essential topics outlined above.

Python:

The most comprehensive Python courses include MITx's “Introduction to Computer Science and Programming Using Python” and “Python for Everybody” by the University of Michigan.

Machine Learning:

Andrew Ng's famous course on Machine Learning at Stanford is highly recommended, as it provides a solid technical foundation. Other options include courses from the University of Washington and Udacity.

Statistics:

For a more extensive exploration, consider MIT’s “Fundamentals of Statistics.” If you prefer shorter, more basic courses, options include Harvard's Probability and Inference courses.

Book Recommendations:

Data Science From Scratch by Joel Grus: A great introductory book for beginners.
Python for Data Analysis by Wes McKinney: A must-read for anyone working with data.
Hands-On Machine Learning with Scikit-Learn and TensorFlow by Aurélien Géron: A detailed and updated guide on machine learning concepts.

Film and Documentary Recommendations:

Moneyball (2011): A film showcasing data analysis in baseball.
AlphaGo (2017): A documentary that illustrates the power of machine learning.
The Joy of Stats (2010): A fun look at statistics presented by Hans Rosling.

6. Next Steps:

As you advance, you'll likely specialize in a sub-area that captures your interest, whether that's Bayesian statistics, econometrics, or deep learning.

If you found this guide helpful, you may also enjoy:

- Understanding the scope of data science.
- Interpreting machine learning outcomes.
- A brief history of statistics.