Exploratory Data Analysis: A Comprehensive Guide for Data Scientists

Introduction

Exploratory Data Analysis (EDA) is a crucial initial step in any data science project. It involves a detailed examination of your dataset to uncover its fundamental characteristics, identify potential anomalies, and reveal hidden patterns and relationships. Gaining this insight is vital for guiding subsequent steps in the machine learning workflow, including data preprocessing, model development, and result analysis.

The EDA process typically consists of three primary tasks:

  1. Dataset Overview and Descriptive Statistics
  2. Feature Assessment and Visualization
  3. Data Quality Evaluation

Each of these tasks can involve extensive analysis, often requiring you to slice, print, and plot your pandas DataFrames repeatedly.

Selecting the Right Tools

In this article, we will explore each step of a productive EDA process and explain why you should consider using ydata-profiling as your go-to tool. We will employ the Adult Census Income Dataset, which can be accessed freely on platforms like Kaggle or the UCI Repository (License: CC0: Public Domain).

Join Us!

If you're new to data science, we invite you to join us for an event on Thursday, December 7! The Data-Centric AI Community is a welcoming group of data enthusiasts passionate about learning data science. We host engaging "Code with Me" sessions and are initiating informal study groups.

Next Event: December 7, 2023 — Learn how to build a Data Science Portfolio to land your dream job!

Step 1: Dataset Overview and Descriptive Statistics

Upon accessing a new dataset, the first question that arises is: What am I working with? A comprehensive understanding of your data is essential for effective handling in future machine learning tasks.

Typically, you start by analyzing the number of observations, the types of features, the overall missing value rate, and the percentage of duplicate entries. With some pandas manipulation, you can extract this information using simple code snippets.
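As a sketch of those snippets, the overview statistics can be pulled together in a few lines of pandas. The small inline DataFrame below is a hypothetical stand-in for the Adult dataset so the example runs anywhere; in practice you would load the real CSV instead.

```python
import pandas as pd

# Hypothetical stand-in for the Adult Census Income dataset;
# replace with pd.read_csv(<path to your download>) in practice.
df = pd.DataFrame({
    "age": [39, 50, 38, 39],
    "workclass": ["State-gov", "Private", "Private", "State-gov"],
    "capital.gain": [2174, 0, 0, 2174],
})

n_obs, n_feats = df.shape
feature_types = df.dtypes                       # type of each feature
missing_rate = df.isna().mean().mean() * 100    # overall % of missing cells
duplicate_rate = df.duplicated().mean() * 100   # % of fully duplicated rows

print(f"{n_obs} observations, {n_feats} features")
print(f"Missing rate: {missing_rate:.1f}% | Duplicates: {duplicate_rate:.1f}%")
```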

The output of such snippets, however, can be hard to scan. If you’re familiar with pandas, you may start your EDA process with df.describe(), which provides statistics for numeric features. To get insights on categorical features, you can use df.describe(include='object'), but a more efficient approach is to generate a comprehensive profiling report with ydata-profiling.
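For reference, the manual pandas route looks like this (again on a tiny stand-in frame rather than the full dataset):

```python
import pandas as pd

df = pd.DataFrame({
    "age": [39, 50, 38],
    "workclass": ["State-gov", "Private", "Private"],
})

num_stats = df.describe()                  # count, mean, std, min, quartiles, max
cat_stats = df.describe(include="object")  # count, unique, top, freq
print(num_stats)
print(cat_stats)
```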

The report generates a complete overview of the dataset, including essential statistics in the Overview section. For instance, the Adult dataset consists of 15 features and 32,561 observations, with 23 duplicate records and an overall missing rate of 0.9%. The dataset is recognized as tabular and heterogeneous, containing both numerical and categorical features.

You can also examine the raw data and duplicates to further grasp the features before delving into more complex analyses.

Step 2: Feature Assessment and Visualization

After reviewing the overall data descriptors, it’s essential to focus on the individual properties of each feature through Univariate Analysis, as well as their interactions and relationships via Multivariate Analysis. Both analyses rely on appropriate statistics and visualizations tailored to the feature types.

Univariate Analysis

Examining the characteristics of each feature is vital for determining their relevance and the necessary data preparation for optimal results. Outliers and inconsistencies may surface, and individual features may require preparation such as standardization for numerical data or one-hot encoding for categorical data.

Best practices dictate a thorough investigation of descriptive statistics and data distributions to identify potential outlier removal, standardization, label encoding, data imputation, and other preprocessing tasks.
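Two of those preprocessing tasks can be sketched with pandas alone: z-score standardization of a numeric feature and one-hot encoding of a categorical one. The column values below are illustrative stand-ins, not rows from the actual dataset.

```python
import pandas as pd

df = pd.DataFrame({
    "age": [39, 50, 38, 53],
    "workclass": ["State-gov", "Private", "Private", "Self-emp"],
})

# Standardize the numeric column to zero mean / unit variance (z-score).
df["age_std"] = (df["age"] - df["age"].mean()) / df["age"].std()

# One-hot encode the categorical column into indicator features.
encoded = pd.get_dummies(df, columns=["workclass"], prefix="workclass")
print(encoded.columns.tolist())
```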

For example, assessing race and capital.gain reveals that the latter has a high percentage of zero values, raising questions about its contribution to the analysis. In contrast, analyzing race highlights an underrepresentation of non-white categories, leading to concerns about bias and fairness in machine learning models.

Multivariate Analysis

Multivariate Analysis focuses on the interactions and correlations between features. Interactions illustrate how feature pairs behave together, while correlations quantify the strength of their relationships.

For instance, examining the interaction between age and hours.per.week shows that most individuals work around 40 hours a week, with some working longer hours, particularly among those aged 30 to 45.
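An interaction like this can also be inspected numerically, for example by binning age and comparing typical weekly hours per bin. The values below are a small synthetic stand-in chosen to mirror the pattern described, not actual dataset rows.

```python
import pandas as pd

df = pd.DataFrame({
    "age": [25, 32, 40, 44, 58, 61],
    "hours.per.week": [38, 45, 50, 48, 40, 35],
})

# Bin age into groups and compare the average weekly hours per group.
df["age_group"] = pd.cut(df["age"], bins=[18, 30, 45, 90],
                         labels=["18-30", "30-45", "45+"])
mean_hours = df.groupby("age_group", observed=True)["hours.per.week"].mean()
print(mean_hours)
```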

Correlations can be visualized using a correlation matrix or heatmap, which summarizes the strength of the relationships between features. The correlation between education and education.num stands out, indicating redundancy, while other correlations like sex and occupation reveal interesting insights.
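Computing such a matrix for the numeric features is a one-liner in pandas (Pearson correlation by default); the stand-in values below are fabricated for illustration only.

```python
import pandas as pd

df = pd.DataFrame({
    "education.num": [9, 13, 9, 7, 14],
    "hours.per.week": [40, 50, 40, 35, 55],
    "age": [39, 50, 38, 53, 28],
})

# Pairwise Pearson correlations between all numeric features.
corr = df.corr(numeric_only=True)
print(corr.round(2))
```

Passing the result to a heatmap function (e.g. in seaborn or matplotlib) gives the visual form discussed above.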

Step 3: Data Quality Evaluation

As we transition into a data-centric approach in AI development, understanding potential complicating factors in our data is crucial. These factors may stem from errors during data collection or intrinsic characteristics of the data itself, including missing values, imbalanced data, duplicates, and highly correlated features.

Identifying these data quality issues early in a project and continuously monitoring them is vital. Failure to address them before building models can jeopardize the entire machine learning pipeline, leading to flawed analyses and conclusions.
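These checks can be run by hand before reaching for a profiling tool. The sketch below covers duplicates, missing values, zero-inflation, and class imbalance on a small synthetic frame constructed to echo the issues discussed next; the exact column names mirror the Adult dataset.

```python
import pandas as pd

# Synthetic stand-in echoing the Adult dataset's quality issues.
df = pd.DataFrame({
    "race": ["White"] * 8 + ["Black", "Other"],
    "capital.gain": [0, 0, 0, 0, 0, 0, 0, 2174, 0, 0],
    "age": [39, 50, 38, 53, 28, 37, 49, 52, 31, 42],
})

checks = {
    "duplicate_rows": int(df.duplicated().sum()),
    "missing_rate_%": round(df.isna().mean().mean() * 100, 1),
    "zeros_in_capital.gain_%": round((df["capital.gain"] == 0).mean() * 100, 1),
    "race_majority_share_%": round(df["race"].value_counts(normalize=True).iloc[0] * 100, 1),
}
print(checks)
```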

The ydata-profiling tool excels in automatically generating data quality alerts, highlighting issues such as duplicates, high correlation, imbalanced data, missing values, and zeros. For example, the feature race is highly imbalanced, and capital.gain is predominantly zero.

Key Insight: Beyond EDA

Data profiling extends beyond EDA. While EDA is typically an exploratory step before data pipeline development, data profiling should be an iterative process integrated into every stage of data preprocessing and model building.

Conclusions

A thorough EDA is foundational to a successful machine learning pipeline. It serves as a diagnostic tool for understanding the dataset's properties, relationships, and issues, enabling you to address them effectively.

This guide has outlined the three fundamental steps for an effective EDA, emphasizing the role of ydata-profiling in streamlining the process and alleviating mental burden. I hope this resource aids you in mastering the art of data exploration. Feedback, questions, and suggestions are always welcome. Let's collaborate in the Data-Centric AI Community!

Join Us Next Thursday, November 23!

If you are not yet part of the Data-Centric AI Community, consider joining us! We are a friendly group of data enthusiasts passionate about learning data science. Our next session will be on November 23, 2023, where we will explore Exploratory Data Analysis concepts.

See you there!

About Me

I am a Ph.D. Machine Learning Researcher, Educator, and Data Advocate. Here on Medium, I write about Data-Centric AI and Data Quality, educating the Data Science and Machine Learning communities on transitioning from imperfect to intelligent data.

Google Scholar | LinkedIn | Data-Centric AI Community | GitHub | Instagram