Scrubbing Data
- Scrubbing data is the second phase of the OSEMN framework, focusing on cleaning and preparing data for analysis.
- The scrubbing process transforms raw, dirty data into clean, usable data.
- Four main tasks in data scrubbing:
- Removing duplicates
- Formatting records
  - Handling missing values
- Checking records for mistakes or wrong values
- Clean data is essential for accurate analysis and reliable conclusions.
- Dirty data can lead to errors in exploration, analysis, and decision-making.
- Taking the time to scrub data thoroughly helps avoid problems in later stages of the data analysis process.
Remove Duplicate Records
- Removing duplicates is a crucial first step in the data scrubbing process to ensure accurate analysis.
- Duplicates can occur due to human error (e.g., multiple data entries) or machine errors (e.g., accidental double scanning).
- Duplicate records can distort analysis results and lead to inaccurate conclusions.
- Identifying duplicates can be done manually or using automated tools like Excel's built-in duplicate finder.
- Before removing duplicates, confirm that apparent duplicates are truly errors rather than distinct records (e.g., two different people with the same first and last names whose middle names are missing).
- Removing duplicates is an essential step in preparing data for further analysis and modeling in the OSEMN (Obtain, Scrub, Explore, Model, and iNterpret) framework.
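The notes mention Excel's built-in duplicate finder; the same workflow can be sketched in pandas. This is a minimal illustration with a hypothetical customer table, and the column names are invented for the example. Note that suspected duplicates are inspected before anything is deleted, per the caution above.

```python
import pandas as pd

# Hypothetical customer records; the columns and values are illustrative only.
df = pd.DataFrame({
    "first_name": ["Ana", "Ana", "Ben"],
    "last_name":  ["Lee", "Lee", "Ortiz"],
    "email":      ["ana@example.com", "ana@example.com", "ben@example.com"],
})

# Flag rows that exactly repeat an earlier row, and review them first
# to confirm they are true duplicates and not distinct records.
dupes = df[df.duplicated()]
print(dupes)

# Drop exact duplicates, keeping the first occurrence of each record.
cleaned = df.drop_duplicates(keep="first").reset_index(drop=True)
print(len(cleaned))  # 2
```

By default `duplicated()` and `drop_duplicates()` compare all columns; passing a `subset` of columns would catch near-duplicates, but that makes the "is this really an error?" check even more important.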
Format Your Records
- Data formatting is crucial for accurate analysis and consistent results.
- Two main aspects of data formatting:
- Ensuring consistent formatting throughout the dataset
- Associating data with the correct data type
- Inconsistent formatting can lead to inaccurate summaries and confuse data analytics tools.
- Examples of formatting issues:
- Location data: Standardizing city names (e.g., New York City, NYC, New York, NY)
- Currency: Converting prices to a single currency for comparison
- Different data types (text, numbers, dates) require specific formatting rules for correct interpretation by software and databases.
- Tools like Excel can be used to update and standardize data formatting.
- Proper formatting enables accurate summarization, analysis, and visualization of data.
- Consistent formatting is essential for calculating statistics like averages, maximums, or identifying trends.
- Data cleaning and formatting are crucial steps in the data analysis process, preparing data for further exploration and analysis.
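The two formatting tasks above (consistent values, correct data types) can be sketched in pandas using the location example from the notes. The city-name mapping and the price column are hypothetical stand-ins for real cleanup rules.

```python
import pandas as pd

# Hypothetical sales data: inconsistent city spellings, prices stored as text.
df = pd.DataFrame({
    "city":  ["NYC", "New York", "new york city", "Boston"],
    "price": ["19.99", "5.00", "12.50", "7.25"],
})

# Standardize location data: map every spelling variant to one canonical name.
city_map = {
    "nyc": "New York City",
    "new york": "New York City",
    "new york city": "New York City",
    "boston": "Boston",
}
df["city"] = df["city"].str.lower().map(city_map)

# Associate the price column with a numeric data type so that statistics
# like averages and maximums are computed correctly.
df["price"] = pd.to_numeric(df["price"])

print(df["city"].unique())   # two standardized city names
print(df["price"].mean())    # average only works once prices are numeric
```

Left as text, the price column would sort and summarize lexicographically ("19.99" < "5.00"), which is exactly the kind of inaccurate summary inconsistent formatting produces.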
Handle Missing Values
- Missing values are common in datasets and can occur due to various reasons such as unknown information or machine errors.
- There are two main options for handling missing values:
- Fill in the missing values with indicators like "unknown" or "N/A"
- Remove records with missing values (less preferred option)
- Filling in missing values is generally preferred because it preserves the other valuable data in the record and avoids introducing bias.
- When filling in missing values, consistency is key. Use the same indicator (e.g., "unknown" or "N/A") throughout the dataset.
- Removing records with missing values can lead to loss of valuable information and potentially introduce bias in the data.
- It's crucial to handle missing values before starting the analysis to ensure accurate results.
- Spreadsheet tools often provide features to easily detect and fill in missing values during the data cleaning process.
- Proper handling of missing values is an essential step in the data cleaning process, contributing to more reliable and accurate analyses.
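Both options above can be sketched in pandas on a hypothetical survey table; the column names and the "unknown" indicator are illustrative choices, with the same indicator used consistently as the notes recommend.

```python
import pandas as pd

# Hypothetical survey records with a gap in the "department" column.
df = pd.DataFrame({
    "name":       ["Ana", "Ben", "Cho"],
    "department": ["Sales", None, "IT"],
})

# Option 1 (generally preferred): fill gaps with one consistent indicator,
# preserving the rest of each record.
filled = df.fillna({"department": "unknown"})
print(filled["department"].tolist())  # ['Sales', 'unknown', 'IT']

# Option 2 (less preferred): drop any record with a missing field,
# losing whatever valid data that record contained.
dropped = df.dropna()
print(len(dropped))  # 2
```

Counting missing values first (`df.isna().sum()`) is a quick way to see how much data Option 2 would throw away before committing to it.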
Check for Wrong Values
- Checking for obviously wrong values is a crucial final step in data scrubbing.
- Understanding the context of data collection helps identify incorrect values and avoid flagging correct data as wrong.
- Treat obviously wrong data points like missing values: replace them with an error indicator, or delete the entire record if necessary.
- Negative values may be correct in some contexts, such as recording returns in a sales database.
- Always consider the data's context and expected range of values when identifying incorrect data points.
- Proper data cleaning and verification ensure a clean dataset for the next phase of analysis in the OSEMN framework.
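A range check like the one described above can be sketched in pandas. The table, the column names, and the 1-100 limit are all assumptions standing in for whatever range the data's real context dictates; out-of-range values are flagged and inspected, then converted to missing markers so they flow into the missing-value step rather than being silently deleted.

```python
import pandas as pd

# Hypothetical order data; in context, a valid quantity is assumed to be 1-100.
df = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "quantity": [3, -2, 5, 400],
})

# Define the expected range from the data's context, then flag violations.
valid = df["quantity"].between(1, 100)
print(df.loc[~valid])  # inspect suspect rows before changing anything

# Replace out-of-range values with a missing marker (NaN) so they are
# handled like missing values instead of skewing later statistics.
df["quantity"] = df["quantity"].where(valid)
```

Remember the caveat from the notes: a negative quantity might be a legitimate return in a sales database, so the validity rule must come from the data's context, not a blanket assumption.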
Summary
- Data scrubbing is crucial for analysis: It prepares your dataset by cleaning and organizing the information.
- Remove duplicates: Identify and eliminate duplicate records to ensure data accuracy.
- Maintain consistent formatting: Ensure all data follows a uniform format and has clear data types.
- Address missing values: Identify missing data and either replace it with placeholders or remove the entire record.
- Correct wrong values: Scan for incorrect data, replace with accurate information when possible, or remove if necessary.
- Use a systematic approach: Following a checklist helps ensure thorough and consistent data scrubbing.
- Data quality impacts analysis: Clean, well-organized data is essential for accurate and reliable analytical results.