Manually cleaning data

Verify and report results

  1. Verification of data is crucial to ensure the data cleaning process has been executed correctly and the data is accurate and reliable.
  2. Verification allows for the identification and rectification of errors before beginning analysis, ensuring the insights gained can be trusted for decision-making.
  3. Reporting on the data cleaning process promotes transparency, builds trust within the team, and ensures all stakeholders are informed about important project details.
  4. Changelogs, which are files containing a chronologically ordered list of modifications made to a project, are useful tools for tracking a dataset's evolution and for communicating and reporting data to others.
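
For illustration, a single entry in a hypothetical data-cleaning changelog might look like the following; the version number, date, and details are invented for the example.

    Version 1.1.0, 2024-06-03
    - Removed 12 duplicate order records found with the Remove duplicates tool
    - Corrected misspelled supplier names in the supplier column
    - Standardized all dates to the YYYY-MM-DD format
    Reason: prepare the dataset for quarterly sales analysis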

Confirm data-cleaning meets business expectations

See the big picture when verifying data-cleaning

  1. Consider the business problem
  2. Consider the goal
  3. Consider the data
🚥
Do the numbers make sense?
  • Verification is a crucial step in any data analysis project, ensuring the reliability and accuracy of the cleaned data.
  • Verification involves comparing the original, unclean dataset with the cleaned one and identifying and correcting any remaining issues (a SQL sketch of this comparison appears after this list).
  • A big-picture perspective is essential during verification. It reaffirms the alignment of the project with the original business problem and goals.
  • Considering the source of the data and testing the data collection and cleaning processes are crucial to ensure the data can solve the problem and meet the project objectives.
  • Seeking feedback from teammates and scrutinizing any suspicious or potentially problematic data can help identify overlooked issues.
  • Verification safeguards the trustworthiness of the insights gained from the analysis, helping companies avoid significant mistakes.
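
As a rough sketch of that comparison in SQL, the query below contrasts row counts and distinct customer IDs in a raw table and its cleaned counterpart. The table names customer_data.customer_name_raw and customer_data.customer_name_clean are assumptions used only for illustration.

-- Compare the uncleaned and cleaned versions of a table (hypothetical names)
SELECT
    'raw' AS source,
    COUNT(*) AS row_count,
    COUNT(DISTINCT customer_id) AS distinct_customers
FROM
    customer_data.customer_name_raw
UNION ALL
SELECT
    'clean' AS source,
    COUNT(*) AS row_count,
    COUNT(DISTINCT customer_id) AS distinct_customers
FROM
    customer_data.customer_name_clean

If the two result rows differ by more than the records you intentionally removed, revisit the cleaning steps before moving on to analysis.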

Example 1: Verify data with spreadsheets

Use spreadsheet tools such as Find and Replace and pivot tables to find, understand, and fix errors in your spreadsheet.

Use a pivot table to understand errors in a spreadsheet

  1. Select the XXX column.
  2. Select Insert > Pivot Table. In the Create pivot table dialog box, choose New Sheet then Create.
  3. This creates a new, mostly blank tab.
  4. The Pivot table editor pane also appears in the window.
  5. Next to Rows, select Add, then select the XXX column.
  6. Next to Values, select Add, then select XXX. This adds a value for the XXX column.
  7. By default, Google Sheets sets the value to summarize by COUNTA (the total number of values in a range). This shows how many times each supplier name comes up, which is a great way to check for misspellings and other anomalies. Note: Don’t use COUNT, because COUNT counts only numerical values. (A SQL version of this check is sketched after these steps.)
  8. When there is only one instance of the misspelled name, manually change it to the correct spelling.
  9. To return to the original sheet, select the Sheet1 tab.
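
The same kind of frequency check can be done in SQL. The sketch below assumes a hypothetical supplier_data.suppliers table with a supplier_name column; grouping and counting each name makes one-off misspellings easy to spot.

-- Count how many times each supplier name appears (table name is hypothetical)
SELECT
    supplier_name,
    COUNT(*) AS name_count
FROM
    supplier_data.suppliers
GROUP BY
    supplier_name
ORDER BY
    name_count

A name that appears only once but closely resembles a frequent name is a likely misspelling, just like a single stray entry in the pivot table.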

Example 2: Use a CASE statement to verify data in SQL

Use CASE statements to correct misspellings in SQL.

  1. The SQL table used in this example is not available for download, but if you were performing a similar query you’d first make sure to load the data into BigQuery.
  2. Start your SQL query with the basic structure:

SELECT
FROM
WHERE

  3. In the FROM clause, specify the table you're pulling data from, for example project-id.customer_data.customer_name.
  4. In the SELECT clause, specify the columns you want to return. In this example, you want customer_id and first_name.
  5. However, there is a misspelling in a customer’s first name.
  6. To correct the misspelled name "Tnoy" to "Tony", use a CASE statement.

Enter CASE. On the next line, enter WHEN first_name = 'Tnoy' THEN 'Tony'. This tells SQL to replace any instances of Tnoy in the first_name column with Tony.

On the next line, add the statement ELSE first_name to keep other names as they are.

End the statement with END AS cleaned_name. This creates a new column called cleaned_name that will contain the data cleaned with the CASE statement.

Delete the WHERE clause because you don’t want to filter the query.

The final statement should be:

SELECT
    customer_id,
    CASE
        WHEN first_name = 'Tnoy' THEN 'Tony'
        ELSE first_name
    END AS cleaned_name
FROM
    project-id.customer_data.customer_name

Verification of data cleaning

  • Verification of data cleaning is crucial to ensure the accuracy and reliability of data.
  • The process includes comparing the cleaned data to the original, unclean dataset and looking for common problems.
  • Some fixes, such as eliminating extra spaces or removing unwanted characters, can be made manually.
  • Spreadsheet tools like TRIM and Remove duplicates can help fix common errors automatically.
🚥
TRIM: A function that removes leading, trailing, and repeated spaces in data
🚥
Remove duplicates: A tool that automatically searches for and eliminates duplicate entries from a spreadsheet
🚥
Pivot table: A data summarization tool used in data processing. Pivot tables sort, reorganize, group, count, total, or average data stored in a database.
  • Using Find and Replace can help correct misspellings or errors throughout the dataset.
🚥
Find and replace: A tool that looks for a specified search term in a spreadsheet and allows you to replace it with something else
  • In SQL, CASE statements can be used to handle misspellings or errors, as shown in the sketch after these definitions.
🚥
CASE statement: A statement that goes through one or more conditions and returns a value as soon as a condition is met
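
For reference, SQL offers rough counterparts to these spreadsheet tools. The sketch below reuses the customer_name table from the earlier example and should be read as an illustration rather than a recipe, since the exact cleaning steps depend on your data.

SELECT DISTINCT                      -- DISTINCT plays the role of Remove duplicates
    customer_id,
    CASE                             -- CASE corrects a known misspelling, like Find and replace
        WHEN TRIM(first_name) = 'Tnoy' THEN 'Tony'
        ELSE TRIM(first_name)        -- TRIM strips leading and trailing spaces
    END AS cleaned_name
FROM
    project-id.customer_data.customer_name

Note that SQL's TRIM removes only leading and trailing spaces; repeated spaces inside a string call for REPLACE or a regular expression function instead.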

Data-cleaning verification checklist

Correct the most common problems

Make sure you identified the most common problems and corrected them, including the items below (a SQL spot-check covering several of them follows the checklist):

  • Sources of errors: Did you use the right tools and functions to find the source of the errors in your dataset?
  • Null data: Did you search for NULLs using conditional formatting and filters?
  • Misspelled words: Did you locate all misspellings?
  • Mistyped numbers: Did you double-check that your numeric data has been entered correctly?
  • Extra spaces and characters: Did you remove any extra spaces or characters using the TRIM function?
  • Duplicates: Did you remove duplicates in spreadsheets using the Remove Duplicates function or DISTINCT in SQL?
  • Mismatched data types: Did you check that numeric, date, and string data are typecast correctly?
  • Messy (inconsistent) strings: Did you make sure that all of your strings are consistent and meaningful?
  • Messy (inconsistent) date formats: Did you format the dates consistently throughout your dataset?
  • Misleading variable labels (columns): Did you name your columns meaningfully?
  • Truncated data: Did you check for truncated or missing data that needs correction?
  • Business logic: Did you check that the data makes sense given your knowledge of the business?
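
One way to spot-check several of these items in SQL is a query like the one below, again using the example customer_name table; it counts NULL first names and duplicate customer IDs. COUNTIF is a BigQuery function, so other dialects would use SUM with a CASE expression instead.

-- Spot-check for NULL values and duplicate IDs (adapt column names to your data)
SELECT
    COUNT(*) AS total_rows,
    COUNTIF(first_name IS NULL) AS null_first_names,
    COUNT(*) - COUNT(DISTINCT customer_id) AS duplicate_customer_ids
FROM
    project-id.customer_data.customer_name

Zeroes in the check columns suggest those problems have been handled; anything else points back to the relevant item on the checklist.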