DATA CLEANING AND VALIDATION

 


Data are the foundation of statistical analysis: without data, nothing can be done, and faulty data produce faulty output. Data are essentially information, and information can be faulty because of its source or the person collecting it. The key word here is ‘source’. Like any information, data can come from two broad kinds of sources: Primary Data (data collected directly from the source, i.e., from someone who experienced or witnessed the event first hand) and Secondary Data (data collected from indirect sources rather than the primary source). Primary data sources include raw data, interview transcripts, original documents, videos, audio recordings, government or legal documents, etc. Secondary data sources include journals, articles, histories, documentaries, etc.

Both primary and secondary data can be faulty. For example, if we ask our grandparents about events from the past, there is a high probability of faulty information simply because of their inability to recall past events accurately. Similarly, some people do not wish to reveal true information (e.g., age, income, etc.).

Secondary data can sometimes behave like gossip, and just as gossip tends to deviate considerably from the actual facts, the same can easily happen with secondary data. People often manipulate data to suit their purpose or fill gaps with assumptions, so you can never be entirely sure how the original publisher conducted the research.

Raw data are usually a mixture of two components: signal and noise. Noise is random, unimportant or wrong data that reduces data quality and, if not treated properly, produces wrong and misleading results. Signal is the important, useful information that helps us solve our problem. The journey from raw data collection to the final conclusion is full of both signals and distractions (noise): signals guide us along the path, while noise misleads us. We therefore need to pick up the signals and discard the noise. Noisy data are like weeds hidden in the forest of data, which need to be identified and treated. Noisy data may include missing data, duplicate data, outliers, irrelevant data, structural errors, etc. Applying data validation techniques can help us identify noise. [1]

It is therefore very important to clean and validate data before processing them. Data collection is quite a difficult stage, and in fact the most important stage, of any statistical survey. Validation can be done both manually and with software, but it should never be ignored.

First, we need to perform validation and quality checks on the data. The steps for validating data are:

4.1. Completeness: Data, whether collected from a primary or a secondary source, might not be complete. There can be incomplete fields in the data set which need to be identified and rectified or removed.

 

Incomplete data can be of many types:

 

4.1.1. Data can be incomplete because the respondent was not required to fill in the field, or because a blank implies 0.

For example:

Sex | Age | Literate | Years of education
M   | 25  | Yes      | 16
F   | 21  | No       |
F   | 24  | Yes      | 3
F   | 21  | Yes      | 18

Note: Here, the blank can be filled in as 0, and this can easily be deduced from column 3 (Literate = No).

Sex | Age | Number of children
M   | 25  | 1
F   | 16  |
F   | 24  | 1
F   | 14  |

Note: Here, the blanks can be filled in as 0, and this can easily be deduced from column 2 (Age), since those respondents are themselves children.

 

4.1.2. Data can be missing completely at random, where the missingness does not depend on any of the available columns.

For example:

Sex | Age | Years of education
M   | 25  |
F   | 21  |
F   |     | 3
F   |     | 18

 

4.1.3. Data can be missing in a non-random pattern, yet still “random” in the sense that the missing value cannot be deduced from the other columns; for example, a particular section of people refuses to answer, or a particular question is mostly left unanswered.

Sex | Age | Number of children
M   | 56  | 1
F   |     | 2
M   | 64  | 1
F   |     | 0

 

 

Sex | Age | Literate | Years of education
M   | 25  | Yes      |
F   | 21  | No       |
F   | 24  | Yes      |
F   | 21  | Yes      |

 

Note: In the first table we can see that females usually avoided answering their age, and in the second table that years of education was avoided by everyone. Missing data of types 4.1.2 and 4.1.3 are the hardest to detect.

 

4.1.4. Missing data might not always be blank but can contain filler terms such as “NA”, “Null”, etc.
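As a rough illustration, a minimal pandas sketch (the DataFrame, column names and the list of filler terms below are assumptions for the example) could normalize such filler terms into proper missing values:

```python
import numpy as np
import pandas as pd

# Hypothetical raw data where missing entries were typed as filler terms
df = pd.DataFrame({
    "age": ["25", "NA", "21", "Null"],
    "income": ["30000", "null", "NA", "45000"],
})

# Replace common filler terms with real missing values (NaN)
fillers = ["NA", "Null", "null", "N/A", ""]
df = df.replace(fillers, np.nan)

print(df.isna().sum())  # count of missing values per column
```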

 

Check the percentage of completeness:

 

First, identify all the blank, null and NA values in the dataset and count them.

Then, % of completeness = ((Total data points - Missing data points) / Total data points) * 100
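A minimal sketch of this check, assuming a pandas DataFrame named df with hypothetical columns (both are assumptions for the example):

```python
import numpy as np
import pandas as pd

# Hypothetical dataset with some missing values
df = pd.DataFrame({
    "age": [25, np.nan, 24, 21],
    "years_of_education": [16, np.nan, 3, 18],
})

total_points = df.size                        # total number of data points
missing_points = int(df.isna().sum().sum())   # number of missing data points
completeness = (total_points - missing_points) / total_points * 100

print(f"Completeness: {completeness:.1f}%")   # 75.0% for this example
```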

 

Deal with Incomplete Data:

 

If the missing data are random or completely at random, the most logical approach is to delete those records. However, if the missing values can be logically inferred, we can fill them in using a suitable method from the list below (see the sketch after the list):

 

i. Fill the missing values with the value preceding or succeeding them.
ii. Fill the missing values with the average of the column.
iii. Fill a missing value that lies within the data series through interpolation.
iv. Fill a missing value that lies at the beginning or end of the data series through extrapolation.
v. Fill the missing values with the help of moving averages.
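A hedged sketch of a few of these filling methods with pandas (the series values are a hypothetical example, not data from the text):

```python
import numpy as np
import pandas as pd

s = pd.Series([10.0, np.nan, 14.0, np.nan, 18.0])  # hypothetical column with gaps

filled_prev    = s.ffill()                   # i.   use the preceding value
filled_mean    = s.fillna(s.mean())          # ii.  use the column average
filled_interp  = s.interpolate()             # iii. linear interpolation within the series
filled_rolling = s.fillna(s.rolling(3, min_periods=1).mean())  # v. moving average
# iv. extrapolation at the ends would need a fitted trend, which is beyond this sketch
```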

 

4.2. Accuracy: Accuracy of data means the data are error-free and reflect reality. Accuracy can be checked by fixing some validation criteria. For example, if our survey is about the workforce in the government sector, age should lie between 18 and 60 years; if our survey is about income earned by people in India, it should be recorded in rupees; and so on. In this way we can verify the accuracy of data. Also, if we are studying the number of children in a family and a respondent answers “No” to “Does the family have children?” but puts 2 under “Number of children”, the data need to be cross-validated. Similarly, if we ask about age, or any question whose answer is expected to be numerical, but receive text instead, such things can be detected by putting logical data verification criteria in place.
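As a rough illustration (the column names and the 18-60 age rule are assumptions taken from the example above), simple rule-based checks could look like this:

```python
import pandas as pd

df = pd.DataFrame({
    "age": [25, 17, "twenty", 45],
    "has_children": ["No", "Yes", "No", "Yes"],
    "number_of_children": [2, 1, 0, 2],
})

# Rule 1: age must be numeric and lie between 18 and 60
age_numeric = pd.to_numeric(df["age"], errors="coerce")
bad_age = age_numeric.isna() | ~age_numeric.between(18, 60)

# Rule 2: cross-check related answers
inconsistent = (df["has_children"] == "No") & (df["number_of_children"] > 0)

print(df[bad_age | inconsistent])  # rows that fail the validation criteria
```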

 

Inaccurate data can be of many types:

 

4.2.1. Outdated data: Data are usually volatile in nature and change frequently, so they need to be updated constantly. If we use old data, we might get inaccurate results and an inaccurate understanding of the present scenario.

 

4.2.2. Wrong data type: We need to check the data type of each column and put a filter on it so that wrong data are either not entered or can be identified in the dataset. For example, age should be numerical, some questions should take only yes/no answers, etc.

 

4.2.3. Inaccurate or erroneous data: The respondent or investigator may enter inaccurate data at the data entry stage, which needs to be detected and corrected.

 

Check Inaccurate Data:

 

i. Time validation: Checking when the data were collected and last updated is very important.

 

ii. Data type validation: We can validate data types based on logic or using various tools. We can also filter data based on the expected data type for each column.

 

iii. Cross validation: Some columns are logically related; for example, an uneducated person should not have an answer in the column asking for the year of graduation, and so on. Data therefore need to be cross-validated as an accuracy check.

 

iv. Labelling data: We can label unstructured data in order to transform it into a structured format, which can be helpful for detecting anomalies.
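A minimal sketch of data type validation (point ii above), assuming hypothetical columns and the expected types for them:

```python
import pandas as pd

df = pd.DataFrame({
    "age": ["25", "21", "abc", "30"],
    "literate": ["Yes", "No", "Maybe", "Yes"],
})

# Expected rules for each column (assumptions for this example)
age_ok = pd.to_numeric(df["age"], errors="coerce").notna()   # age must be numeric
literate_ok = df["literate"].isin(["Yes", "No"])             # only Yes/No allowed

print(df[~(age_ok & literate_ok)])  # rows violating the expected data types/values
```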

 

4.3. Consistency: Consistency is a very important criterion for data validation; the data or responses collected should be consistent, and only then can they be considered reliable. In the case of primary data, consistency can be checked by asking similar questions within the same period and verifying the results, or by asking the same question in different periods. It can also be checked with logical criteria and by establishing correlations between different questions. Consistency can additionally be checked by detecting outliers in the data.

 


4.3.1. Outliers: Outliers are observations which are significantly different from the others. Some outliers represent true values arising from natural variation in the population; others may result from incorrect data entry, calculation errors, etc.

 

Outliers can be treated in the following ways:

 

I. ‘Trimming: In this approach, the data points that fall outside a certain range or percentile are removed from the dataset. This can be done by defining upper and lower bounds based on the mean and standard deviation or other statistical measures.

 

II. Capping: This approach involves replacing the outlier values with the nearest reasonable value within a certain range. For example, the values above and below the upper and lower limits can be replaced with the respective limit values.

 

III. Transformation: This approach involves transforming the data using mathematical functions such as logarithmic, exponential, or power transformations. This can help to reduce the impact of extreme values on the overall data distribution.’ [2]
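A hedged sketch of trimming, capping, and a log transformation with pandas and NumPy (the data values and the mean ± 3 standard deviation bounds are assumptions for the example, not prescribed by the text):

```python
import numpy as np
import pandas as pd

s = pd.Series([10, 12, 11, 13] * 5 + [95])  # hypothetical column with one extreme value

lower = s.mean() - 3 * s.std()
upper = s.mean() + 3 * s.std()

trimmed     = s[(s >= lower) & (s <= upper)]    # I.  trimming: drop points outside the bounds
capped      = s.clip(lower=lower, upper=upper)  # II. capping: replace extremes with the bounds
transformed = np.log1p(s)                       # III. log transformation to compress extremes
```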

 

Outliers can be detected in the following ways:

 

a. If the z-score of a data point is more than 3 in absolute value, it indicates that the data point is quite different from the other data points. Such a data point can be an outlier. [3]

 

b. The interquartile range, often abbreviated IQR, is the difference between the 75th percentile (Q3) and the 25th percentile (Q1) of a dataset. It measures the spread of the middle 50% of values. One popular method is to declare an observation an outlier if it lies more than 1.5 × IQR above Q3 or more than 1.5 × IQR below Q1. [4]
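Both detection rules can be sketched briefly in Python (the data series is a hypothetical example):

```python
import pandas as pd

s = pd.Series([10, 12, 11, 13] * 5 + [95])  # hypothetical data with one extreme value

# a. z-score rule: flag points more than 3 standard deviations from the mean
z = (s - s.mean()) / s.std()
z_outliers = s[z.abs() > 3]

# b. IQR rule: flag points below Q1 - 1.5*IQR or above Q3 + 1.5*IQR
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
iqr_outliers = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]
```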


 



4.4. Uniqueness: Duplicate data are a very common problem in data analysis. While collecting or recording data we often come across duplicates. These records can distort the analysis, so they need to be detected and then treated or deleted depending on the nature of the duplicates; for example, they can be merged into a single record. [5]
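A minimal sketch of detecting and dropping exact duplicates with pandas (the DataFrame below is a hypothetical example):

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 2, 3],
    "age": [25, 21, 21, 24],
})

duplicates = df[df.duplicated()]      # rows that repeat an earlier row exactly
deduplicated = df.drop_duplicates()   # keep only the first occurrence of each row

print(len(df) - len(deduplicated), "duplicate rows removed")
```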

 



[1] https://www.javatpoint.com/what-is-noise-in-data-mining

[2] https://villagers.tech/machine-learning/handling-outliersz-score-trimming-capping/#:~:text=Capping%3A%20This%20approach%20involves%20replacing%20the%20outlier%20values,can%20be%20replaced%20with%20the%20respective%20limit%20values.

[3] https://www.simplypsychology.org/z-score.html

[4] https://www.statology.org/find-outliers-with-iqr/

[5] https://www.collibra.com/us/en/blog/the-6-dimensions-of-data-quality
