Document the cleaning process
Documentation The process of tracking changes, additions, deletions, and errors involved in your data-cleaning effort
- Recover data-cleaning errors
- Inform other users of changes
- Determine quality of data
A changelog, a file containing a chronologically ordered list of modifications made to a project, is typically utilized to track these changes.
- Data cleaning is a crucial part of the data analysis process, and documenting these changes is essential.
- Documentation allows for the recovery of data-cleaning errors, informs other users of changes made, and helps determine the quality of the data to be used in analysis.
- Google Sheets version history and SQL query history are tools that can be used to create and view changelogs.
- Reporting is another effective way to share documentation and keep the team updated.
Embrace changelogs
Google Sheets | Microsoft Excel | BigQuery |
---|---|---|
1. Right-click the cell and select Show edit history. 2. Click the left-arrow < or right arrow > to move backward and forward in the history as needed. |
1. If Track Changes has been enabled for the spreadsheet: click Review. 2. Under Track Changes, click the Accept/Reject Changes option to accept or reject any change made. |
Bring up a previous version (without reverting to it) and figure out what changed by comparing it to the current version. |
Typically, a changelog records:
- Data, file, formula, query, or any other component that changed
- Description of what changed
- Date of the change
- Person who made the change
- Person who approved the change
- Version number
- Reason for the change
- Engineers, writers, and data analysts use methods like automated version control, changelogs, and version control systems to track changes in their work.
- Changelogs provide a detailed record of work, including what was changed, who made and approved the change, and why it was changed. This is especially useful for data analysts to ensure data integrity and consistency, and to understand the reasons behind specific changes.
- Version control systems are commonly used when making changes to shared queries. They allow analysts to track all changes and revert to previous versions if needed.
- Changelogs and version control systems together provide a powerful tool for maintaining data integrity and providing a clear understanding of the history of data changes.
Follow best practices for changelogs
A changelog for a personal project may take any form desired. However, in a professional setting and while collaborating with others, readability is important. These guiding principles help to make a changelog accessible to others:
- Changelogs are for humans, not machines, so write legibly.
- Every version should have its own entry.
- Each change should have its own line.
- Group the same types of changes. For example, Fixed should be grouped separately from Added.
- Versions should be ordered chronologically starting with the latest.
- The release date of each version should be noted.
All the changes for each category should be grouped together. Types of changes usually fall into one of the following categories:
- Added: new features introduced
- Changed: changes in existing functionality
- Deprecated: features about to be removed
- Removed: features that have been removed
- Fixed: bug fixes
- Security: lowering vulnerabilities
A good changelog is clear, concise, and easy to understand. It should be chronologically ordered, with each version having its own entry and each change having its own line. Grouping similar types of changes and noting the release date of each version also enhances readability.
Deciding if a change is significant enough to include in the changelog often depends on its impact on the project. If it alters the functionality, fixes a bug, or enhances security, it is typically significant enough to be included. Minor cosmetic changes or changes that do not affect the project's operation may not need to be included.
Common data errors
- Human error in data entry
- Flawed processes
- System issues
Advanced functions for speedy data cleaning
The IMPORTRANGE function in Google Sheets allow you to insert data from one sheet to another.
The QUERY function is also useful when you want to pull data from another spreadsheet.
The FILTER function is fully internal to a spreadsheet and doesn’t require the use of a query language.