Exploratory data analysis or EDA

Review

  • Practices of EDA
  • PACE workflow
  • Ethics of working with data
  • Data visualizations

Exploratory data analysis (EDA): The process of investigating, organizing, and analyzing datasets and summarizing their main characteristics, often employing data wrangling and visualization methods.

EDA is an important first step in understanding any new dataset or question.

Practices of EDA

  • Discovering: Data professionals familiarize themselves with the data so they can start conceptualizing how to use it.
  • Structuring: The process of taking raw data and organizing or transforming it to be more easily visualized, explained, or modeled.
      • Bias: Organizing data in groupings, categories, or variables that don't accurately represent the whole dataset. It's essential to constantly check our biases to ensure a true representation of what the data is showing.
  • Cleaning: The process of removing errors that may distort your data or make it less useful.
  • Joining: The process of augmenting or adjusting data by adding values from other datasets.
  • Validating: The process of verifying that the data is consistent and high quality.
  • Presenting: Making your cleaned dataset or data visualizations available to others for analysis or further modeling.
      • Data visualization: A graph, chart, diagram, or dashboard that is created as a representation of information.

Key Takeaways:

  1. Exploratory Data Analysis (EDA) is a process used by data professionals to investigate, organize, and analyze data sets. It helps to summarize the main characteristics of the data.
  2. The six main practices of EDA are discovering, structuring, cleaning, joining, validating, and presenting. These practices can be performed in any order and often the EDA process is iterative.
  3. Discovering is about familiarizing oneself with the data, structuring is about organizing the data, cleaning is about removing errors, joining is about augmenting the data with values from other data sets, validating is about verifying the data's consistency and quality, and presenting is about making the cleaned data set or data visualizations available to others.
  4. Bias can occur during the structuring process and it's important for data professionals to try to avoid this. Bias occurs when data is organized in a way that doesn't accurately represent the whole data set.
  5. The story uncovered from the data should come from the data itself, not from the individual's mind or biases in the data. Data should be conveyed in both an ethical and accessible way.
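One way to check for the structuring bias described in the takeaways above is to compare how categories are distributed in a grouping against the whole dataset. The following is a minimal sketch, not a standard method: the helper names, the region labels, and the counts are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def category_shares(values):
    """Return each category's share of the total as a fraction."""
    counts = Counter(values)
    total = sum(counts.values())
    return {category: count / total for category, count in counts.items()}


def max_share_gap(full, subset):
    """Largest absolute difference in category share between a subset and the full data."""
    full_shares = category_shares(full)
    subset_shares = category_shares(subset)
    categories = set(full_shares) | set(subset_shares)
    return max(
        abs(full_shares.get(c, 0.0) - subset_shares.get(c, 0.0))
        for c in categories
    )


# Hypothetical data: region labels for the whole dataset and for one grouping.
full_regions = ["north"] * 50 + ["south"] * 50
grouped_regions = ["north"] * 45 + ["south"] * 5

gap = max_share_gap(full_regions, grouped_regions)
print(f"Largest share gap: {gap:.2f}")
```

A large gap suggests the grouping over-represents some categories relative to the whole dataset. What counts as "large" depends on the project and is a judgment call.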

Storytelling

Storytelling is how your insights reach other people and drive change.
A really good way to tell a story with data is to think about categories of users, categories of devices, or categories of use cases.
  • Storytelling with data is crucial for effectively communicating insights and driving change.
  • When telling a story with data, consider categories of users, devices, or use cases to provide a clear and concise narrative.
  • Curiosity is a key trait for any data analyst, driving the desire to learn and understand more.

Combine PACE and EDA practices

  • The PACE (Plan, Analyze, Construct, and Execute) workflow is used by data professionals to stay focused on the end goal of any given dataset.
  • EDA (Exploratory Data Analysis) applies to every part of PACE, as its six practices all intersect with PACE's parts.
  • Data insights should always be guided by a project's purpose and goals. Miscommunication can lead to confusion, disagreement, and wasted time.
  • It's crucial for data professionals to accurately represent the data. If the project plan does not align with what the data is saying, it's their responsibility to communicate that to stakeholders.
  • Data professionals should never bypass what is required by the data due to timelines, stakeholder pressure, or client needs.

Reference guide: The EDA process

The six practices of EDA are iterative and non-sequential

Exploratory data analysis (EDA) is not like a cake recipe. It is not a step-by-step process you follow. Instead, the six practices of EDA are iterative and non-sequential.

  • Iterative: Relating to or involving repetition of a process.
  • Non-sequential: Not arranged in or following an order or sequence.

Because of the varying nature of datasets, the approach to exploring that data will be different each time. That means that you will need to use your logic and experience throughout the EDA process to determine which of the six practices to utilize, how many times to apply them, and when in the process you should apply them.

Visual example

Imagine you are assigned a dataset that has only 200 rows and five columns of data about trees in a coniferous forest in Norway. You know that to complete your full analysis you’ll need more than 1,000 rows and at least two more columns. Even without much more detail than that, your entire EDA process might look something like this:

[Image by Google: the example EDA sequence listed below]
  1. Discovering: You check out the overall shape, size, and content of the dataset. You find it is short on data.
  2. Joining: You add more data.
  3. Validating: You perform a quick check that the new data doesn’t have mistakes or misspellings.
  4. Structuring: You structure the data in different time periods and segments to understand trends.
  5. Validating: You do another quick check to ensure the new columns you’ve made in structuring are correctly designed.
  6. Cleaning: You check for outliers, missing data, and any needed conversions or transformations.
  7. Validating: After cleaning, you double-check that the changes you made are correct and accurate.
  8. Presenting: You share your dataset with a peer.

Notice you performed the “validating” practice iteratively, or multiple times, to make sure your changes to the data did not unwittingly introduce errors. Also, because you recognized the need for more data up front, the practice of “joining” was performed immediately following the practice of “discovering.”

After you present your cleaned dataset to a peer, there is a good chance you will receive notes or ideas for more exploration and/or cleaning. Because of that, you will see even more iterations.
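The tree example above can be sketched in pandas to show the order of the practices. This is an illustrative sketch only: the DataFrames, column names, and values are hypothetical, the data is far smaller than the 200-row dataset described, and the specific checks are just one reasonable choice for each practice.

```python
import pandas as pd

# Hypothetical tree survey data; every column name and value is made up.
trees = pd.DataFrame({
    "tree_id": [1, 2, 3, 4],
    "species": ["spruce", "pine", "spruce", "pine"],
    "height_m": [21.5, 18.0, None, 19.2],
    "survey_year": [2019, 2019, 2020, 2020],
})
more_trees = pd.DataFrame({
    "tree_id": [5, 6, 6],
    "species": ["Spruce ", "pine", "pine"],
    "height_m": [23.1, 17.4, 17.4],
    "survey_year": [2021, 2021, 2021],
})

# 1. Discovering: check overall shape and column types.
print(trees.shape)
print(trees.dtypes)

# 2. Joining: append the additional rows.
combined = pd.concat([trees, more_trees], ignore_index=True)

# 3. Validating: surface inconsistent labels and repeated IDs in the new data.
raw_labels = sorted(combined["species"].unique())
repeated_ids = int(combined["tree_id"].duplicated().sum())

# 4. Structuring: segment rows into survey periods.
combined["period"] = combined["survey_year"].map(
    lambda year: "2019-2020" if year <= 2020 else "2021"
)

# 5. Validating: every row should have landed in a period.
assert combined["period"].notna().all()

# 6. Cleaning: normalize labels, drop exact duplicates and rows missing height.
cleaned = combined.copy()
cleaned["species"] = cleaned["species"].str.strip().str.lower()
cleaned = cleaned.drop_duplicates().dropna(subset=["height_m"])

# 7. Validating: confirm the cleaning did what was intended.
assert cleaned["tree_id"].is_unique
assert set(cleaned["species"]) == {"spruce", "pine"}

# 8. Presenting: summarize mean height per period for a peer to review.
summary = cleaned.groupby("period")["height_m"].mean()
print(summary)
```

Dropping rows with a missing height is only one cleaning option. Depending on the project goal, flagging or imputing those values may be more appropriate.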

Pro tip: Data scientists expect to perform the practices of EDA multiple times on a dataset before they feel comfortable declaring it “clean” and ready for modeling or machine learning algorithms.

The importance of EDA in ethical machine learning

As algorithms and machine learning networks begin to make more and more decisions on behalf of individuals, companies, and even governments, the discussion of ethics and regulation becomes more and more important. According to the Institute for Ethical AI & Machine Learning, there are eight principles for developing machine learning systems in a responsible way.

Key principles of the EDA process

The following two principles are inherently part of the EDA process:

  • Human augmentation: This principle keeps humans involved throughout AI and machine learning systems to provide oversight. Thorough EDA, performed by data scientists, is perhaps one of the best ways to limit the bias, imbalance, and inaccuracies fed into an algorithm.
  • Bias evaluation: Without human oversight, bias is too easily injected and reproduced in machine learning models. Performing methodical EDA helps data scientists become aware of, and act on, biases and imbalances in the data.
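As one small example of the kind of imbalance check that EDA can include before data reaches a model, the sketch below measures how lopsided a set of class labels is. The function name, the labels, and the threshold of 3 are all assumptions made for illustration; they are not taken from the principles cited above.

```python
from collections import Counter


def imbalance_ratio(labels):
    """Ratio of the most common class count to the least common class count."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


# Hypothetical outcome labels; the values are made up for illustration.
labels = ["approved"] * 90 + ["denied"] * 10

ratio = imbalance_ratio(labels)

# The threshold is an arbitrary assumption; choose one that fits the project.
needs_review = ratio > 3
print(f"Imbalance ratio: {ratio:.1f}, needs review: {needs_review}")
```

A high ratio does not prove the data is biased, but it is a signal that a person should look more closely before the data is used for modeling.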

Pro tip: The importance of assuring adherence to ethical standards cannot be overstated in the data career space. Data professionals need to continuously grow their capacities to recognize bias and discrimination by consistently applying an ethical mindset to their EDA work.

Beyond machine learning, EDA is applicable to nearly any important data-based decision. Moving forward, you will learn about many applications of EDA and the necessity of an iterative and non-sequential approach.


PACE with data visualizations

  1. Data visualizations are crucial tools in effectively communicating complex information and telling stories with data.
  2. Different audiences may require different types of visualizations. The design of data visualizations must consider the needs and interests of the respective audience.
  3. Misrepresentation of data through visualization can lead to confusion or misinformation. Thus, ethical presentation of data is critical.
  4. Accessibility should be considered in data visualization design. For example, color choices should be made with color blindness in mind.
  5. Tools like Tableau, Matplotlib, Seaborn, and Plotly are commonly used for creating data visualizations in a professional setting.
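The accessibility and ethical-presentation points above can be illustrated with a short Matplotlib sketch. It assumes Matplotlib is installed and uses three colors from the Okabe-Ito palette, which is widely recommended for readers with color vision deficiency. The species names and counts are hypothetical.

```python
import io

import matplotlib

matplotlib.use("Agg")  # render off-screen so the sketch runs without a display
import matplotlib.pyplot as plt

# Three Okabe-Ito colors: orange, sky blue, bluish green.
palette = ["#E69F00", "#56B4E9", "#009E73"]
# Hatch patterns give a second cue so color is not the only encoding.
hatches = ["//", "..", "xx"]

# Hypothetical counts, made up for illustration.
species = ["spruce", "pine", "fir"]
counts = [120, 85, 40]

fig, ax = plt.subplots()
bars = ax.bar(species, counts, color=palette, edgecolor="black")
for bar, hatch in zip(bars, hatches):
    bar.set_hatch(hatch)

# Start the y-axis at zero so differences between bars are not exaggerated.
ax.set_ylim(bottom=0)
ax.set_xlabel("Species")
ax.set_ylabel("Number of trees")
ax.set_title("Trees surveyed by species (hypothetical data)")

# Write the chart to an in-memory buffer; pass a file path instead to save it.
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
```

Pairing color with a pattern or direct labels means the chart still works if it is printed in grayscale or viewed by someone who cannot distinguish the colors.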