DATA CLEANING AND VALIDATION
Data are the foundation of any statistical analysis: without data nothing can be done, and faulty data produce faulty output. Data are essentially information, and information can be unreliable depending on its source and on the person collecting it. Our key word is ‘source’. Like any information, data can come from various sources: primary data (collected directly from the source, i.e., from someone who experienced or witnessed the event first hand) and secondary data (collected from indirect, non-primary sources). Primary sources include raw data, interview transcripts, original documents, videos, audio recordings, government or legal documents, etc. Secondary sources include journals, articles, histories, documentaries, etc.
Both primary and secondary data can be faulty. For example, if we ask our grandparents about events that happened in the past, there is a high probability of faulty information simply because they cannot recall those events accurately. Similarly, some people do not wish to reveal true information (e.g., age, income).
Secondary data can sometimes behave like gossip, and just as most gossip deviates considerably from the real facts, the same can easily happen with secondary data. People often manipulate data to suit their purpose, or may invent data based on assumptions, so you can never be entirely sure how the original publisher conducted the research.
Raw data are often a combination of two components: signal and noise. Noise consists of random, unimportant, or wrong values that reduce the quality of the data and, if not treated properly, produce wrong and misleading results. Signal is the important, useful information that helps us solve our problem. The journey from the collection of raw data to the final conclusion is full of signals and distractions: signals guide us along the path, while noise misleads us, so we need to pick up the signals and disregard the noise. Noisy data are like weeds hidden in the forest of data; they need to be identified and treated. Noisy data may contain missing values, duplicate records, outliers, irrelevant data, structural errors, and so on. Applying data validation techniques helps us identify such noise. [1]
It is therefore very important to clean and validate data before processing them. Data collection is a difficult stage, and in fact the most important stage, of any statistical survey. Validation can be done both manually and with software, but it should never be skipped.
First, we need to perform validation and quality checks on the data. The steps for validating data are as follows.
4.1. Completeness: Data, whether collected from a primary or a secondary source, might not be complete. Incomplete fields in the data set need to be identified and either rectified or removed.
Incomplete data can be of many types:
4.1.1. Data can be incomplete when the respondent is not required to fill in a field, or when a blank implicitly means 0.
For example:
Sex | Age | Literate | Years of education
M   | 25  | Yes      | 16
F   | 21  | No       |
F   | 24  | Yes      | 3
F   | 21  | Yes      | 18
Note: Here the blank can be read as 0, and this can easily be inferred from column 3 (Literate).
Sex | Age | Number of children
M   | 25  | 1
F   | 16  |
F   | 24  | 1
F   | 14  |
Note: Here the blanks can be read as 0, and this can easily be inferred from column 2 (Age), as the respondents are themselves children.
4.1.2. Data can be missing completely at random, where the missingness does not depend on any of the available columns.
For example:
Sex | Age | Years of education
M   | 25  |
F   | 21  |
F   |     | 3
F   |     | 18
4.1.3. Data can be missing in a non-random pattern that nevertheless cannot be detected from the other columns, for example when a particular section of respondents refuses to answer or when a particular question is mostly left unanswered.
Sex | Age | Number of children
M   | 56  | 1
F   |     | 2
M   | 64  | 1
F   |     | 0
Sex | Age | Literate | Years of education
M   | 25  | Yes      |
F   | 21  | No       |
F   | 24  | Yes      |
F   | 21  | Yes      |
Note: In the first table, females generally avoided giving their age; in the second table, respondents avoided stating their years of education.
Missing data of types 4.1.2 and 4.1.3 are the hardest to detect.
4.1.4. Missing data might not always appear as blanks; they can also be recorded with filler terms such as “NA”, “Null”, etc.
Check the percentage of completeness: First identify all the blank, Null, and NA values in the dataset and count them. Then:
% of completeness = ((total data points - missing data points) / total data points) * 100
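As a minimal sketch, this check can be done in Python with pandas; the column names and the filler strings (“NA”, “Null”) below are assumptions made up for the example:

    import numpy as np
    import pandas as pd

    # Illustrative survey data; "NA" and "Null" are filler terms standing in for blanks.
    df = pd.DataFrame({
        "sex": ["M", "F", "F", "F"],
        "age": [25, 21, 24, 21],
        "years_of_education": [16, "NA", 3, "Null"],
    })

    # Treat common filler terms (and empty strings) as proper missing values.
    df = df.replace(["NA", "Null", ""], np.nan)

    # Count missing cells and compute the percentage of completeness:
    # completeness = (total cells - missing cells) / total cells * 100
    total_cells = df.size
    missing_cells = int(df.isna().sum().sum())
    completeness = (total_cells - missing_cells) / total_cells * 100
    print(f"Missing cells: {missing_cells}, completeness: {completeness:.1f}%")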
Deal with incomplete data: If the data are missing at random or completely at random, the most logical option is usually to delete those records; but if the missing values, though random in nature, can be logically inferred, we can fill them using a suitable method from the list below (a sketch follows the list).
i. Fill the missing values with the value preceding or succeeding them.
ii. Fill the missing values with the average of the column data.
iii. Fill a missing value that lies within the data series through interpolation.
iv. Fill a missing value that lies at the beginning or end of the data series through extrapolation.
v. Fill the missing values with the help of moving averages.
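As a minimal sketch of these filling methods with pandas, using a small made-up series (extrapolation at the edges would need an explicitly fitted trend, so it is only noted in a comment):

    import numpy as np
    import pandas as pd

    s = pd.Series([10.0, np.nan, 14.0, np.nan, 18.0, np.nan])

    # i. Fill with the preceding value (forward fill) or the succeeding value (backward fill).
    filled_ffill = s.ffill()
    filled_bfill = s.bfill()

    # ii. Fill with the average of the column.
    filled_mean = s.fillna(s.mean())

    # iii. Interpolate gaps that lie inside the series.
    # iv. For gaps at the beginning or end, extrapolation would require fitting a trend first.
    filled_interp = s.interpolate()

    # v. Fill with a centred moving average of the neighbouring observed values.
    rolling_mean = s.rolling(window=3, min_periods=1, center=True).mean()
    filled_rolling = s.fillna(rolling_mean)

    print(pd.DataFrame({
        "original": s, "ffill": filled_ffill, "bfill": filled_bfill,
        "mean": filled_mean, "interpolate": filled_interp, "rolling": filled_rolling,
    }))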
4.2. Accuracy: Accuracy of data means error-free, real data. Accuracy can be checked by fixing validation criteria. For example, if our survey is about the workforce in the government sector, age should lie between 18 and 60 years; if our survey is on income earned by people in India, it should be recorded in rupees; and so on. In this way we can verify the accuracy of the data. Cross-checks matter as well: if we are studying the number of children in a family and a respondent answers “No” to “Does the family have children?” but enters 2 under “Number of children”, the data need to be cross-validated. Likewise, if a question such as age expects a numerical answer but we receive text instead, the error can be caught by putting logical data verification criteria in place, as sketched below.
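A minimal sketch of such logical criteria with pandas, assuming made-up column names (age, has_children, num_children) and the 18-60 working-age range mentioned above:

    import pandas as pd

    # Illustrative responses from a survey of government-sector workers.
    df = pd.DataFrame({
        "age": [34, 17, 45, 72],
        "has_children": ["Yes", "No", "No", "Yes"],
        "num_children": [2, 0, 2, 1],
    })

    # Range check: working age in the government sector is assumed to be 18-60.
    age_ok = df["age"].between(18, 60)

    # Cross-field check: a "No" to "has children" should not come with a positive count.
    children_ok = ~((df["has_children"] == "No") & (df["num_children"] > 0))

    # Flag the rows that fail either criterion for manual review.
    print(df[~(age_ok & children_ok)])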
Inaccurate data can be of many types:
4.2.1. Outdated data: Data are usually volatile in nature and change frequently, so they need to be updated constantly. Using old data may give inaccurate results and an inaccurate understanding of the present scenario.
4.2.2. Wrong data type: We need to check the data type of each column and put a filter on it, so that wrongly typed values are either kept out at entry or identified in the dataset. For example, age should be numerical, and some questions should only take yes/no answers.
4.2.3. Inaccurate or erroneous data: The respondent or the investigator may enter inaccurate values at the data entry stage; these need to be detected and corrected.
Check inaccurate data:
i. Time validation: Checking when the data were collected, updated, etc. is very important.
ii. Data type validation: We can validate data types based on logic or using various tools, and we can filter each column on its expected data type (see the sketch after this list).
iii. Cross validation: Some fields are logically related; for example, an uneducated person should not have an answer in the column asking for the year of graduation. Data therefore need to be cross-validated as an accuracy check.
iv. Labelling data: We can label unstructured data in order to transform it into a structured format, which also helps in detecting anomalies.
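As a small sketch of the data type check with pandas, assuming a made-up age column that is supposed to be numeric:

    import pandas as pd

    # "twenty" and "yes" are examples of wrongly typed entries in a numeric column.
    df = pd.DataFrame({"age": ["25", "twenty", "31", "yes"]})

    # Coerce to numbers: anything that cannot be parsed becomes NaN.
    parsed = pd.to_numeric(df["age"], errors="coerce")

    # Rows where parsing failed hold values of the wrong type.
    print(df[parsed.isna()])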
Outliers can be treated in the following ways:
I. ‘Trimming: In this approach, the data points that fall outside a certain range or percentile are removed from the dataset. This can be done by defining upper and lower bounds based on the mean and standard deviation or other statistical measures.
II. Capping: This approach involves replacing the outlier values with the nearest reasonable value within a certain range. For example, the values above and below the upper and lower limits can be replaced with the respective limit values.
III. Transformation: This approach involves transforming the data using mathematical functions such as logarithmic, exponential, or power transformations. This can help to reduce the impact of extreme values on the overall data distribution.’ [2]
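As a minimal sketch of trimming, capping, and a log transformation with pandas, using bounds built from the mean and standard deviation (the sample values and the 2-standard-deviation cut-off are illustrative assumptions):

    import numpy as np
    import pandas as pd

    s = pd.Series([12, 15, 14, 13, 16, 14, 120])  # 120 is an obvious outlier

    # Bounds from the mean and standard deviation (here mean +/- 2 sd, an illustrative choice).
    lower, upper = s.mean() - 2 * s.std(), s.mean() + 2 * s.std()

    # I. Trimming: drop points that fall outside the bounds.
    trimmed = s[(s >= lower) & (s <= upper)]

    # II. Capping: replace values outside the bounds with the respective bound.
    capped = s.clip(lower=lower, upper=upper)

    # III. Transformation: compress extreme values, e.g. with a log transform.
    transformed = np.log1p(s)

    print(trimmed, capped, transformed, sep="\n\n")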
Outliers can be detected in the following ways:
a. If the absolute z-score of a data point is greater than 3, the point is quite different from the other data points and may be an outlier. [3]
b. The interquartile range, often abbreviated IQR, is the difference between the 75th percentile (Q3) and the 25th percentile (Q1) of a dataset; it measures the spread of the middle 50% of the values. One popular method is to declare an observation an outlier if it lies more than 1.5 times the IQR above Q3 or more than 1.5 times the IQR below Q1. [4] Both checks are sketched below.
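A minimal pandas sketch of both detection rules, on made-up sample values:

    import pandas as pd

    s = pd.Series([12, 15, 14, 13, 16, 14, 13, 15, 14, 13, 12, 15, 120])

    # a. z-score rule: flag points more than 3 standard deviations from the mean.
    z_scores = (s - s.mean()) / s.std()
    z_outliers = s[z_scores.abs() > 3]

    # b. IQR rule: flag points more than 1.5 * IQR below Q1 or above Q3.
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    iqr_outliers = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]

    print(z_outliers)
    print(iqr_outliers)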
4.4. Uniqueness: Duplicate data are a very common problem in data analysis; while collecting or recording data we often come across duplicates. Because duplicates can distort the analysis, they need to be detected and then treated, merged into a single record, or deleted, depending on the nature of the duplication. [5] A sketch of a duplicate check is given below.
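A minimal pandas sketch of a uniqueness check; the table and the idea of judging duplicates on a respondent_id key column are assumptions made for the example:

    import pandas as pd

    df = pd.DataFrame({
        "respondent_id": [101, 102, 102, 103],
        "sex": ["M", "F", "F", "F"],
        "age": [25, 21, 21, 24],
    })

    # Detect rows that repeat an earlier respondent_id.
    print(df[df.duplicated(subset="respondent_id", keep="first")])

    # Keep only the first occurrence of each respondent.
    print(df.drop_duplicates(subset="respondent_id", keep="first"))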
[1] https://www.javatpoint.com/what-is-noise-in-data-mining
[2] https://villagers.tech/machine-learning/handling-outliersz-score-trimming-capping/#:~:text=Capping%3A%20This%20approach%20involves%20replacing%20the%20outlier%20values,can%20be%20replaced%20with%20the%20respective%20limit%20values.
[3] https://www.simplypsychology.org/z-score.html
[4] https://www.statology.org/find-outliers-with-iqr/
[5] https://www.collibra.com/us/en/blog/the-6-dimensions-of-data-quality