What is the Best Way to Clean Up a Large Data Set?

Consider for a moment just how much data every company has access to. Each department is continuously collecting it in some form, whether it’s about customers, cash flow, employees, or the business’ products or services. There are scores of insights waiting to be uncovered. An analysis of inbound customer service queries can inform your product road map. Examining point-of-sale behavior can lead to product page adjustments. Digging into the why behind advertising performance can alter how each audience is targeted. With appropriate action, the result of each analysis could lead to a stronger conversion funnel and business growth.


But none of the aforementioned analyses will be worthwhile if the data you start with is dirty — and all data sets are vulnerable to this common misfortune. In this guide, we’ll describe what dirty data actually is, why it needs to be scrubbed, and the steps that help make it sparkling clean.

What is dirty data?

Before we share data-cleaning methods, it’s important to be on the same page about what dirty data is and why it needs to be scrubbed in the first place. Here are three terms we will use throughout this guide:

Raw data: This is information that is provided (perhaps directly from individuals), collected (such as by digital advertising platforms), and/or compiled (by a data aggregator, for example). Raw data therefore comes directly from one or multiple sources and has not been manipulated in any way.

Dirty data: This is a synonym for raw data, as it has not been processed through cleansing software or reviewed manually, and therefore still contains flaws that can negatively impact an analysis.

Clean data: This is raw or dirty data that has been modified — to remove flaws such as duplicates, typos, missing values, etc. — and run through quality control to ensure its accuracy.

How data gets dirty to begin with

You may question the assertion that there could be a multitude of issues with your data. Maybe you just collected it or you trust the source it came from. What could be wrong? Here is an example that will help you understand how a variety of flaws can end up in a given data set.

The Social WorkLife company just finished its annual employee reviews for all office locations. Now managers are supposed to input their direct reports’ results into an internal system for aggregation and analysis. Here are the six fields they must enter manually: name, date of birth, gender, position title, overall numerical rating for this year, and overall numerical rating for last year.

The managers have access to a digital handbook that explains how each field should be filled out, but they don't all refer to it. Some managers type birthdates with forward slashes between the month, day, and year, while others use dashes. For employees who do not identify as male or female, managers type non binary (two words), nonbinary (one word), or non-binary (hyphenated). Some might not have seen that the handbook says to completely exclude employees whose reviews have been delayed; instead of excluding those employees, they submit their information but leave this year's rating blank.

Even six variables can present an enormous number of input variations. Imagine these errors not just at one office, but at scale across all the Social WorkLife offices. When an analyst attempts to isolate ratings by year by gender, for example, they will be faced with numerous variations of the same gender categories that should have been condensed. And when they encounter blank cells for ratings, they have no way of knowing why those values are missing. Did the employee leave the company, or was it an oversight?
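To make that concrete, here is a minimal sketch in pandas of what an analyst might see when pulling up the gender and rating fields described above. The column names and values are made up for illustration.

```python
import pandas as pd

# Hypothetical slice of the Social WorkLife review data described above.
reviews = pd.DataFrame({
    "gender": ["male", "female", "non binary", "nonbinary", "non-binary", "female"],
    "rating_this_year": [4, 5, 3, None, 4, None],
})

# The three spellings of the same category land in three separate buckets,
# so any "ratings by gender" breakdown silently splits one group into three.
print(reviews["gender"].value_counts())

# Blank ratings show up as missing values, with no record of why they are missing.
print(reviews["rating_this_year"].isna().sum())
```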

As you can see, large raw data sets can contain a number of defects, especially when there is a manual-entry component involved. These issues could skew results and lead to an incorrect analysis. That’s why it’s best to follow the systematic approach of data collection, data cleaning and transformation, and then data analysis.

What is data cleaning?

Data cleaning is the process of first identifying irrelevant, incorrect, unwanted, and incomplete or missing information (“dirty data”), and then modifying it (“cleaning”) so that it meets the data quality standards of validity, accuracy, completeness, consistency, and uniformity. In addition to data cleaning, this process is also referred to as data cleansing or data scrubbing.

This may be done manually, but as with most processes that can be automated, it is better to use specialized data-cleansing tools and then have a data scientist run a quality-control review.

Note that this process is a type of data transformation, which also needs to be completed before an analyst can start their job. Data transformation involves changing some of the essential attributes of cleaned-up data, such as its format, values, and structure.
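As a rough illustration of what that transformation can look like in practice, here is a hedged pandas sketch that standardizes a date column's format and reshapes ratings from wide to long. The column names and values are invented for this example, not taken from any particular system.

```python
import pandas as pd

# Illustrative post-cleaning transformation: fix the format of one column
# and change the structure of the table (column names are hypothetical).
df = pd.DataFrame({
    "name": ["A. Rivera", "B. Chen"],
    "date_of_birth": ["04/12/1990", "07-03-1988"],   # slashes vs. dashes
    "rating_this_year": [4, 5],
    "rating_last_year": [3, 4],
})

# Change the format: normalize the separator, then parse into a real date type.
df["date_of_birth"] = pd.to_datetime(
    df["date_of_birth"].str.replace("/", "-", regex=False),
    format="%m-%d-%Y",
)

# Change the structure: one row per employee per rating period.
long_df = df.melt(
    id_vars=["name", "date_of_birth"],
    value_vars=["rating_this_year", "rating_last_year"],
    var_name="rating_period",
    value_name="rating",
)
print(long_df)
```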

What are the benefits of data cleaning?

Maybe your data contains some duplicates or there are missing values in each column. So what? Is that going to throw off an analysis? The answer is yes. Dirty data rarely contains just a few flaws here and there; there is a high probability of numerous issues, especially when you’re cleaning up a large data set.

Let’s look at a few of the broader benefits of turning raw data into clean data.

Efficiency: Data analysts can complete their work faster, and they won’t need to redo it because of data errors. Faster analysis also means related business decisions can be made more quickly.

Reliability: Confidence in the data and the resulting analysis cannot be overstated. Inaccurate information weakens trust and decision-making, and will ultimately have a negative impact on the business. Only with clean data can analysts create extrapolation models, map data to other sources, prepare accurate reports for executive leadership, and more.

Visibility: Knowing that you are working with a clean data set makes it easier to detect errors coming from new raw data and correct them when they appear.

Steps for cleaning a large data set

It’s best to take a systematic approach to cleaning data, as a well-organized, documented process is efficient and scalable. Use this framework to develop your own workflow based on your in-house expertise and available technology.

Step 1: Remove unwanted observations

What it is: In the context of data collection, observations are the pieces of information being gathered to later be measured, counted, monitored, etc. Unwanted observations include duplicate and irrelevant data.

A duplicate observation is an identical instance of an entry. Irrelevant observations are data that have no bearing on the task at hand and don’t fit the context of the larger goal. An example would be a column of phone numbers in a data set that is being used to analyze customer churn.

How it happens: Often, unwanted observations occur as a result of combining data sets. Identical data might be collected when a form is submitted twice or there is a manual data-entry error.
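Here is a minimal sketch of how this step might look in pandas, using a made-up churn data set with a duplicate row and the irrelevant phone-number column from the example above. The data and column names are illustrative only.

```python
import pandas as pd

# Hypothetical churn data set with one duplicate row and one irrelevant column.
df = pd.DataFrame({
    "customer_id": [101, 102, 102, 103],
    "phone_number": ["555-0101", "555-0102", "555-0102", "555-0103"],
    "churned": [False, True, True, False],
})

# Drop exact duplicate rows (identical observations).
df = df.drop_duplicates()

# Drop columns that have no bearing on the churn analysis.
df = df.drop(columns=["phone_number"])
```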

Step 2: Fix structural errors

What it is: Structural errors are typical in categorical data, where there are mislabeled classes or inconsistent capitalization and punctuation (such as n/a, N/A, and NA). They also include typos and grammatical mistakes.

A final structural component to be on the lookout for is a long categorical label. This isn’t an error per se, but characters could get cut off on labels when the data are visualized in a chart.

How it happens: Structural errors are often the result of open response fields that allow individuals to manually type their submission instead of selecting from a preset menu. For example, the “department” field in an employee feedback survey may receive responses like HR, human resources, and Human Resources — three variations of the same thing, which would result in three separate categories.
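One way this cleanup might look in pandas, using the department example above, is sketched below. The exact mappings will depend on your own data, so treat the labels here as illustrative.

```python
import pandas as pd
import numpy as np

# Hypothetical "department" column with the variants described above.
df = pd.DataFrame({
    "department": ["HR", "human resources", "Human Resources", "n/a", "N/A", "Sales "],
})

# Normalize whitespace and capitalization first.
dept = df["department"].str.strip().str.lower()

# Then collapse known variants into one canonical label,
# and turn n/a-style placeholders into true missing values.
df["department"] = dept.replace({"hr": "human resources", "n/a": np.nan, "na": np.nan})

print(df["department"].value_counts(dropna=False))
```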

Step 3: Filter unwanted outliers

What it is: These are values that are extremely high or extremely low compared with others in the same category. For example, if the majority of data points fall between 50 and 100, values of 7 and 453 would be considered outliers. It’s possible that these values are genuine, but they are more than likely incorrect inputs.

The key word in this step, though, is unwanted. If the outliers are real and relevant, then keeping them in the analysis may offer interesting insights, depending on the data that is being studied.

How it happens: Outliers that are incorrect typically occur as a result of manual mistakes during data entry.
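There is no single correct test for outliers, but one common rule of thumb is to flag values that fall more than 1.5 times the interquartile range (IQR) outside the quartiles. A hedged sketch using values like the example above:

```python
import pandas as pd

# Flag values more than 1.5 * IQR outside the quartiles -- one common
# rule of thumb, not the only option. Sample values are made up.
values = pd.Series([52, 61, 70, 7, 88, 95, 453, 74, 66])

q1, q3 = values.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Review these before dropping them -- genuine outliers may be worth keeping.
print(values[(values < lower) | (values > upper)])

filtered = values[(values >= lower) & (values <= upper)]
```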

Step 4: Deal with missing data

What it is: Literal missing values. Blank fields can skew results or not give an analyst enough information to work with. Many algorithms will not accept missing data fields.

Because of this, it is important to take corrective action, such as removing the observations, imputing values based on similar observations, or changing how the data is used so that null values aren’t included. None of these options is ideal, so it is also important to put systems in place that ensure fields are not left blank during data collection.

How it happens: It could be that the data simply weren’t collected (a field wasn’t marked as required on a survey) or were not available from the third-party source (maybe the sales team purchases a lead list in which annual income is missing from some leads). Adding data validation when you have control over the information being submitted is a good way to avoid this in the future.
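As a sketch of those corrective options, here is how they might look in pandas for a lead list where some annual income values never arrived from the third-party source. The lead_id and annual_income columns are hypothetical.

```python
import pandas as pd

# Hypothetical lead list with gaps in annual_income.
df = pd.DataFrame({
    "lead_id": [1, 2, 3, 4],
    "annual_income": [55000, None, 72000, None],
})

# Option 1: remove the observations with missing values.
dropped = df.dropna(subset=["annual_income"])

# Option 2: impute values based on similar observations (here, simply the median).
imputed = df.assign(
    annual_income=df["annual_income"].fillna(df["annual_income"].median())
)

# Option 3: keep the rows, but exclude the column from analyses that can't handle nulls.
no_income = df.drop(columns=["annual_income"])
```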

Versioning data: Why it’s important

This is obvious, but we have to say it: When you delete data, it’s potentially unretrievable. Mistakes can be made during the cleaning process. Perhaps both instances of a duplicate observation are deleted or categorical labels are merged but should not have been. Saving only the latest version of the data will wipe out the possibility of retrieval.

Create a separate file or tab for every new version of the data set to spare your business the pain of losing wanted information. For example:

Version one: raw data

Version two: clean data

Version three: QC data

Versions four+: data for analysis

In version two, keep all irrelevant observations in case they become relevant for other projects. To create version three, run the cleaned data through quality control to catch any manual or algorithmic errors. Subsequent versions will remove observations that are irrelevant to the particular analysis.
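In practice, versioning can be as simple as writing each stage to its own file or warehouse table. A minimal sketch, with invented data and file names:

```python
import pandas as pd

# Keep every stage in its own file so earlier versions can always be recovered.
raw = pd.DataFrame({
    "name": ["A. Rivera", "A. Rivera", "B. Chen"],
    "rating_this_year": [4, 4, 5],
})
raw.to_csv("reviews_v1_raw.csv", index=False)        # version one: raw data

clean = raw.drop_duplicates()                        # ...plus the other cleaning steps above
clean.to_csv("reviews_v2_clean.csv", index=False)    # version two: clean data

qc = clean.copy()                                    # after a quality-control review
qc.to_csv("reviews_v3_qc.csv", index=False)          # version three: QC data

analysis = qc[["name", "rating_this_year"]]          # only what this analysis needs
analysis.to_csv("reviews_v4_analysis.csv", index=False)  # versions four+: data for analysis
```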

Identifying opportunities to clean your data

Before collecting data, it is helpful to start with the end in mind. That means involving a data analyst from the beginning to develop the data-collection project. They will be able to identify the right variables, assist with formatting, and offer guidance on collecting the cleanest data possible.

Even so, some dirty data will fall through the cracks no matter how well you structure your data-collection project; it can’t be helped when you’re scraping data or combining multiple sources.

Mozart Data helps you clean things up. Our intuitive interface enables you to connect all of your data sources into a best-in-class warehouse, where you can clean, transform, and organize your data for analysis. Then, with just a bit of SQL knowledge, you can quickly access insights to drive your business forward.

For other data tips, like the right time to hire a data analyst, visit our blog!
