Art Of Data Cleaning

Data is the core of any analytical process. It is gathered from many sources, is sometimes entered manually, and arrives in many varieties and in high volumes. Because so many sources are involved, the combined data may contain inaccuracies. To obtain higher-quality data, we clean it before we begin analysis. Clean data leads to more accurate results from machine learning solutions and helps businesses avoid bad decisions based on ML results derived from inaccurate data. Cleaning is essential because no change to an ML model will produce good results when the underlying information is incorrect.
Data Cleaning is the process of finding inaccurate or inappropriate records in a dataset and replacing, modifying, or deleting them, depending on the type of inaccuracy.
Why Do Businesses Need Data Cleaning?
- It prevents misguided decision making
Data is the building block of every business decision. Accurate data leads to better decisions and, in turn, to better business outcomes.
- It saves time and avoids unnecessary information
Businesses need clean data so they can contact the right prospects with the right approach, without wasting time and resources on the wrong candidates.
- It increases data quality
Data quality affects business practices. Data helps businesses track the success of their products and services and understand employee performance. These performance metrics allow businesses to streamline their processes.
- It saves the cost of processing unrequired data
The cost of processing duplicate data is said to grow with the length of time it stays in your database. Resources spent on processing unnecessary information are money wasted. Removing such data lets resources be used effectively and efficiently toward business goals.
Types Of Data That Need Cleaning Include
Missing Values
Missing data refers to empty values in the dataset. There are three types of missing values:
- Missing at Random: There is a systematic relationship between whether a value is missing and the other, observed data.
- Missing Completely at Random: There is no relationship between whether a value is missing and any of the data, observed or missing.
- Missing Not at Random: Whether a value is missing depends on the missing value itself, so the cause cannot be explained from the observed data.
5 |
10 |
12 |
NaN -> Missing Value |
18 |
Outliers
Outliers are records that are distinctly different from the other records.
Types Of Outliers
- Global Outliers: A value that lies far outside the range of the other values in the dataset.
- Contextual Outliers: A value that is unusual only within a specific context, such as a particular time or location, even though it may be normal elsewhere in the data.
- Collective Outliers: A group of data points that together deviate from the rest of the data, even if each point on its own would not stand out.
5 -> Outlier |
25 |
28 |
30 |
45 -> Outlier |
Type Conversion Data
It refers to data that is not stored in the expected data type, such as a number in a column of names.
mark |
roger |
steve |
15 -> Wrong type of data |
Howard |
Irrelevant Data
It refers to the data which is not required to satisfy the analytics/machine learning problem.
Duplicate Data
It refers to the repeated values in the dataset.
1 |
2 |
2 -> Duplicate Data |
3 |
4 |
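Repeated values like the one in the sample above can be found and removed with a few lines of code. The following is only a minimal sketch using the Python standard library; the function names are illustrative, and it assumes each record is hashable (for example a number, a string, or a tuple).

```python
from collections import Counter

def find_duplicates(records):
    """Return the values that occur more than once, in first-seen order."""
    counts = Counter(records)
    return [value for value, count in counts.items() if count > 1]

def drop_duplicates(records):
    """Keep the first occurrence of every value and preserve the order."""
    seen = set()
    unique = []
    for record in records:
        if record not in seen:
            seen.add(record)
            unique.append(record)
    return unique

ids = [1, 2, 2, 3, 4]  # the sample column shown above
duplicates = find_duplicates(ids)
deduplicated = drop_duplicates(ids)
print(duplicates)    # [2]
print(deduplicated)  # [1, 2, 3, 4]
```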
Syntax Errors
It refers to text values that are considered different due to spelling errors, extra white spaces, or special characters.
t-shirt |
trousers |
T-shirts -> plural of 1st record |
hats |
tops |
Non-Uniform Data
It refers to records that are not consistent with the rest in their unit of measure.
10 Rs. |
20 Rs. |
50 USD -> different currency unit |
35 Rs. |
50 Rs. |
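One way to make such a column uniform is to convert every value to a single unit. The sketch below is illustrative only: the conversion rate is a made-up placeholder (real exchange rates change and must be looked up), and it assumes every value has the form "amount unit" as in the sample above.

```python
# Hypothetical rates to rupees. The USD figure is a placeholder, not a real rate.
RATES_TO_RS = {"Rs.": 1.0, "USD": 80.0}

def to_rupees(amount_text):
    """Convert a string such as '50 USD' into a rupee amount."""
    amount, unit = amount_text.split()
    return float(amount) * RATES_TO_RS[unit]

prices = ["10 Rs.", "20 Rs.", "50 USD", "35 Rs.", "50 Rs."]
uniform_prices = [to_rupees(price) for price in prices]
print(uniform_prices)  # [10.0, 20.0, 4000.0, 35.0, 50.0]
```

A unit that is missing from the rate table raises a KeyError here, which makes unexpected units visible instead of silently passing them through.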
Inconsistent Data
It refers to records that contradict each other within the same dataset or across different datasets.
Inaccurate Data
It refers to records whose values are not close to their true values.
How To Find Missing Data?
- Plotting data on HeatMap
When we plot data on a heatmap, the missing values show up as empty spaces or in an assigned color. This works well only for datasets with a limited number of features.
- Showing Percentage of Missing Data
For each feature, we compute the percentage of its values that are missing.
- Plotting data on Histogram
When we plot data on a histogram, we can see the patterns in the missing data.
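The percentage-of-missing-data check can be sketched in a few lines. This is a minimal illustration using only the Python standard library; it assumes the dataset is a list of dictionaries and that missing values appear as None or NaN. The function names and the sample rows are made up for the example.

```python
import math

def is_missing(value):
    """Treat None and float NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))

def missing_percentage(rows):
    """Return {feature: percentage of rows in which that feature is missing}."""
    if not rows:
        return {}
    total = len(rows)
    return {
        name: 100.0 * sum(is_missing(row.get(name)) for row in rows) / total
        for name in rows[0].keys()
    }

# Hypothetical sample: 'age' is missing in 1 of 4 rows, 'city' in 2 of 4.
rows = [
    {"age": 5, "city": "Pune"},
    {"age": float("nan"), "city": None},
    {"age": 12, "city": "Surat"},
    {"age": 18, "city": None},
]
report = missing_percentage(rows)
print(report)  # {'age': 25.0, 'city': 50.0}
```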
How To Find Outliers?
Global Outliers can be detected using the following methods:
- Sorting the data: We sort the data, and the entries at the top and bottom of the list stand out as outliers when they differ greatly from the rest of the data.
- Plotting Data: We plot the data using a box plot or a histogram, which helps in identifying patterns in the data. Asterisks or points beyond the whiskers of a box plot and isolated bins in a histogram point to outliers.
- Z-Score Calculation: A z-score is the number of standard deviations by which an observation differs from the mean. A z-score of 0 means the observation equals the mean. Observations with very high or very low z-scores are considered outliers; a common rule of thumb flags observations with z-scores beyond +/- 3.
- InterQuartile Range: It is the range between the first quartile (Q1) and the third quartile (Q3), which covers the middle 50% of the data. Values far outside this range, commonly more than 1.5 times the IQR below Q1 or above Q3, are considered outliers.
- Hypothesis Test: A null hypothesis and an alternative hypothesis are stated. The null hypothesis is that all values are drawn from the same normally distributed population; the alternative hypothesis is that one value is not drawn from that population. A p-value is calculated and compared with the significance level to decide whether the null hypothesis is rejected. A p-value below the significance level suggests that the value is an outlier.
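The z-score and interquartile-range checks from the list above can be sketched as follows. This is a minimal illustration with the Python standard library, not a definitive implementation. The cut-offs (3 for z-scores, 1.5 times the IQR) are common conventions, the quartiles use the "inclusive" method of statistics.quantiles, and the readings list is invented for the example.

```python
import statistics

def zscore_outliers(values, threshold=3.0):
    """Flag values whose z-score magnitude exceeds the threshold."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation
    if std == 0:
        return []
    return [v for v in values if abs((v - mean) / std) > threshold]

def iqr_outliers(values, k=1.5):
    """Flag values more than k * IQR below Q1 or above Q3."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical readings: nineteen values near 11 and one extreme value.
readings = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11,
            12, 10, 11, 13, 12, 11, 10, 12, 11, 100]
sample = [5, 25, 28, 30, 45]  # the sample column from the Outliers section

print(zscore_outliers(readings))  # [100]
print(iqr_outliers(readings))     # [100]
print(iqr_outliers(sample))       # [5, 45]
print(zscore_outliers(sample))    # []
```

With the five-value sample the z-score check flags nothing, because so few points cannot produce a z-score of 3, while the IQR fences flag both 5 and 45. The methods can disagree, especially on small datasets.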
How To Perform Data Cleaning?
How To Handle Missing Data/Outliers
- Drop the Observation: We remove the row that contains the missing value.
- Drop the Feature: We remove the column containing missing values.
- Impute the Missing: For a numeric variable, we fill in the missing value with the median or mean of the other values in the column.
- Replace the Missing: We replace the missing values with a special placeholder value.
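The dropping, imputing, and replacing options above could look like this for a single numeric column. It is a minimal sketch using the Python standard library; the function names and the sentinel value -1 are illustrative choices, and missing values are assumed to appear as None or NaN.

```python
import math
import statistics

def is_missing(value):
    """Treat None and float NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))

def drop_missing(values):
    """Drop the observations whose value is missing."""
    return [v for v in values if not is_missing(v)]

def impute(values, strategy="median"):
    """Fill missing entries with the median or mean of the present values."""
    present = drop_missing(values)
    if strategy == "median":
        fill = statistics.median(present)
    else:
        fill = statistics.fmean(present)
    return [fill if is_missing(v) else v for v in values]

def replace_missing(values, sentinel=-1):
    """Replace missing entries with a special placeholder value."""
    return [sentinel if is_missing(v) else v for v in values]

column = [5, 10, 12, float("nan"), 18]  # the sample from the Missing Values section
print(drop_missing(column))         # [5, 10, 12, 18]
print(impute(column, "median"))     # [5, 10, 12, 11.0, 18]
print(impute(column, "mean"))       # [5, 10, 12, 11.25, 18]
print(replace_missing(column, -1))  # [5, 10, 12, -1, 18]
```

Dropping a feature is not shown because it simply means deleting the whole column from the dataset.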
Handling Type Conversion Data
We find the values in a column that do not fit its expected data type and convert them to the appropriate type. If a value can't be converted, we remove it.
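A convert-or-remove step of this kind might be sketched as below. The example assumes a column that should hold integers but was stored as a mix of strings and numbers. The column name and values are made up for illustration.

```python
def to_int_or_none(value):
    """Try to convert a value to int; return None when that is not possible."""
    try:
        return int(str(value).strip())
    except ValueError:
        return None

raw_ages = ["21", " 34", "forty", 15]  # hypothetical column with mixed types
converted = [to_int_or_none(value) for value in raw_ages]
clean_ages = [value for value in converted if value is not None]

print(converted)   # [21, 34, None, 15]
print(clean_ages)  # [21, 34, 15]
```

Keeping the intermediate list with None markers makes it easy to inspect which values failed to convert before they are removed.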
Handling Syntax Error Data
We try to make changes to observations having syntax errors. The following changes can be made:
- We remove extra spaces.
- We change different forms of the same word so that they are not treated as separate values.
- We also change the case of the words.
- We make spelling changes.
- We remove punctuation marks from values.
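The changes listed above can be combined into one normalization function. The following is a minimal sketch using the Python standard library. The CANONICAL mapping is a hypothetical, hand-written table of word forms and misspellings, since deciding which forms mean the same thing depends on the dataset.

```python
import string

# Hypothetical table mapping variant forms or misspellings to one canonical value.
CANONICAL = {"tshirts": "tshirt", "trouser": "trousers"}

# Translation table that deletes every ASCII punctuation character.
_NO_PUNCTUATION = str.maketrans("", "", string.punctuation)

def normalize(text):
    """Lowercase, strip punctuation and extra spaces, then unify word forms."""
    text = text.lower()
    text = text.translate(_NO_PUNCTUATION)
    text = " ".join(text.split())  # trims the ends and collapses repeated spaces
    return CANONICAL.get(text, text)

items = ["t-shirt", "trousers", " T-shirts ", "hats", "Tops!"]
normalized = [normalize(item) for item in items]
print(normalized)               # ['tshirt', 'trousers', 'tshirt', 'hats', 'tops']
print(sorted(set(normalized)))  # ['hats', 'tops', 'trousers', 'tshirt']
```

After normalization, "t-shirt" and " T-shirts " are treated as the same value instead of two separate categories.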
Conclusion
Many kinds of data are involved in an analysis, and each kind of data problem calls for a different cleaning approach. We looked at the types of dirty data that occur among data features, the ways to find such data, and the techniques to handle it.
With an array of unique and competitive data science solutions for businesses, ZealousWeb strives to provide best-in-class results thereby increasing their customer satisfaction. Their level of expertise in this domain says a lot about their experience and their ability to channel exponential business growth.
We can clean all types of data ranging from numbers to characters and dates.
Anywhere from hundreds to millions of records can be cleaned.
Removing unnecessary records will only improve the results.
We choose to replace missing values with other values so that the records can still be used in the data analysis process.