Data Collection

 


As Charles Kettering said, “a problem well stated is a problem half solved.”[1]


Before moving to any further step, we need to state the problem in clear terms: ‘What is the problem?’, ‘What do we need to find out?’, ‘What is our objective?’. These things need to be very clear before we move on. Without a clearly set goal, we would move like a ship without radar. A data analyst is rarely given a problem statement; more often they are given a question and asked to find an answer to it. For example, ‘Why are we losing customers?’, ‘How can we increase sales?’, ‘How can we reduce crime?’, ‘How can we achieve our goals?’, ‘How can we increase literacy?’, ‘Why aren’t our plans working?’.

All these questions carry hidden problem statements, such as ‘we are losing customers’, ‘we are not able to increase sales’, ‘crime is increasing or is very high’, ‘we are unable to achieve our goals’, ‘the literacy rate is very low’, ‘plans aren’t working as intended’. These statements are like saying that ‘X has a fever’. But a fever by itself is not the problem; it is a symptom or effect of a hidden problem. What we need to find is the root cause of the stated problem, i.e., ‘Why is X suffering from a fever?’. This is the work of the data analyst.

Most questions are stated in plain, everyday language, but an analyst needs to restate them in mathematical or technical terms and analyze the data to find the answer. The answer then needs to be presented in layman’s terms rather than in mathematical ones, so that it can be understood. The process is circular: a question posed in common language is translated into mathematical terms, and the numerical solution is then translated back into common language. That is, the analyst first needs to find the underlying problem statement in the question asked. Then they need to find the hidden problems that might have given rise to it. These can be stated as sub-problem statements under the main one. In our earlier ‘fever’ example, a sub-statement could be ‘the patient might have an undetected stomach infection’. Each of these can act as a hypothesis for the problem, i.e., the everyday statement has been transformed into an analytical one. Once this is done, we can move on to the next steps in order to accept or reject the hypothesis. But not every analysis has a hypothesis. Sometimes we simply need to extract detailed insights or the required information from the data.



[1] A problem well stated is a problem… (Charles Kettering Quote) - Famous Inspirational Quotes & Sayings (inspirationalstories.com)

2.1. Definition

‘Data collection is the gathering of data from sources in order to find a solution to the stated problem. Analysis will produce incorrect outcomes if the data are not properly collected. In order to collect proper data, one first needs to set the aim of the study. This aim is drawn from the first step of analysis, i.e., ‘Defining the Problem’. Once the goal is set, we need to collect only relevant data. Choosing the data most relevant to our objective is an important step. So we need to select the relevant variables (measurable, quantitative characteristics that vary) and attributes (non-measurable, qualitative characteristics that vary) within the scope and area of the study. The source from which the data are collected and the method used are also very important for the relevance and precision of the data.’[1]

2.2. Types of Data Collection

There are two types of data collection: primary and secondary.

2.2.1.      ‘Primary Data: Primary data are data collected directly from the source by the analyst, or taken from data published by the authorities who are themselves responsible for their collection.

 

a.       Data collection through direct interviews.

b.      Data collection through survey-questionnaire/schedule/observation.

c.       Data collection through polls- votes/headcounts.

d.      Data collection through officially published documents, records, historical scriptures.

 

2.2.2.      Secondary Data: Secondary data are data that the analyst neither collects directly from the source nor takes from publications of the authorities responsible for their collection. They were collected for one purpose but are used for another.


a.       Data taken from unofficially published documents/journals.

b.      Data taken from the internet.

c.       Data taken from magazines.

 

Compared to secondary data, primary data contain more detailed information, fewer errors, more recent figures and more survey-relevant content. Secondary data are more economical to collect than primary data. Primary data are more reliable than secondary data.’[2]

For example, if an accident occurs, it is more reliable to obtain the information directly from an eyewitness than from someone who has only heard about it. Primary data are more reliable, detailed and precise than secondary data. On the other hand, suppose we are to analyze the infant mortality rate in state X. It is more economical and logical to use National Health Census data as a secondary source.

2.3. Methods of Collecting Primary Data

‘Primary data are obtained by the following four methods:

2.3.1.      Direct Personal Observation: In direct personal observation, the investigator collects information directly from the source or by directly observing the situation. For example, if the survey is on female students leaving school, the investigator puts questions to the students who are leaving school. Likewise, a reporter reporting from the place of occurrence and describing the situation first-hand is an example of direct personal observation.

 

2.3.2.      Indirect Oral Observation: In indirect oral observation, the investigator collects information indirectly. In the previous example, the investigator would put questions to the classmates of the female students leaving school; similarly, a reporter might interview the people who have seen a crime and then report it to us.

 

2.3.3.      Questionnaire sent through mail: A questionnaire is a list of questions set by the investigator and filled in by the respondent. It is sent to the respondent by mail and contains the list of questions, a brief description of the survey and a stipulated time within which it needs to be filled in and returned to the investigator.

 

The questionnaire should contain the minimum number of questions; the questions should mostly be multiple choice (preferably yes or no) and require minimal calculation. The questions should be easy to understand, and an explanation must be provided where required. The questions shouldn’t hurt the respondent’s sentiments.

 

2.3.4.      Schedule sent through investigators: A schedule is similar to a questionnaire, but it is filled in by the investigator, who interviews the respondents.’ [3]

2.4. Population and Sample

The collected data can come either from the entire population or from a part of it. The population is the entire group of observations under study. A sample is a part of the population that is expected to have the characteristics and properties of the population. Population characteristics such as the population mean and population variance are called parameters. Sample characteristics such as the sample mean and sample variance are called statistics. Sampling can be done completely at random, by judgement, or by a mixture of both.[4]
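The distinction between parameters and statistics can be sketched in a few lines of Python. The population here is made-up data (1000 randomly generated ages), and the sample size of 50 is an arbitrary choice for illustration only.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: ages of N = 1000 people (made-up data).
population = [random.randint(15, 60) for _ in range(1000)]

# Parameters describe the whole population.
population_mean = statistics.mean(population)
population_variance = statistics.pvariance(population)  # divides by N

# Statistics describe a sample drawn from that population.
sample = random.sample(population, 50)
sample_mean = statistics.mean(sample)
sample_variance = statistics.variance(sample)  # divides by n - 1

print(f"parameter: mean={population_mean:.2f}, variance={population_variance:.2f}")
print(f"statistic: mean={sample_mean:.2f}, variance={sample_variance:.2f}")
```

The sample statistics will usually be close to, but not equal to, the population parameters; how close depends on the sample size and on how the sample was drawn.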

 

 

 

Example:

If our study topic is illiteracy among women in India, then all women in India form the population. Women in India in the age group 15-25 years form a sample, though possibly not an appropriate one. Mathematically, a sample is a subset of the population.

 

 

2.5. Sampling

A sample is a part of the population that is expected to have the characteristics and properties of the population. At times it is impractical to analyze every population unit. For example, if we are cooking a dish and want to check whether it tastes fine, we can taste only a portion of it, not the entire dish. In such cases we take a sample (the portion), which should taste the same as the whole dish from which it was taken.[5]

2.6. Types of Sampling

There are two types of sampling technique: random and non-random sampling.

2.6.1.      Random Sampling: Random sampling, also known as probability sampling, is a sampling technique where each population unit (member of the population) has a known, non-zero chance of being included in the sample; in the simplest case every unit has an equal chance. It has less chance of bias.

 

2.6.2.      Non-Random Sampling: Non-random sampling, or non-probability sampling, is a sampling technique where the population units do not have a known or equal chance of being included in the sample. It has more chance of bias.

2.6.1.      ‘Types of Random Sampling

2.6.1.1. Simple Random Sampling: Simple random sampling is a sampling technique where each population unit (member of the population) has an equal chance of being included in the sample, so the selection itself introduces no bias. Various random number tables are available from which the numbers can be chosen at random.

 

Suppose the population is of size N and a sample of size n is to be drawn. We number all population units serially from 1 to N, then pick a page in a book of random number tables and read off n unit numbers either horizontally or vertically. We can also use the lottery method, drawing n chits from a jar containing chits numbered from 1 to N.
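On a computer, a pseudo-random number generator usually takes the place of the random number table or the jar of chits. The following is a minimal sketch; the function name `simple_random_sample` and the values of N, n and the seed are illustrative assumptions.

```python
import random

def simple_random_sample(N, n, seed=None):
    """Draw n distinct unit numbers from 1..N, each equally likely."""
    rng = random.Random(seed)
    # sample() draws without replacement, like pulling chits from a jar.
    return rng.sample(range(1, N + 1), n)

chosen = simple_random_sample(N=500, n=10, seed=42)
print(sorted(chosen))
```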

 

2.6.1.2. Stratified Random Sampling: Stratified random sampling is a sampling technique where the total population is divided into strata (groups) based on certain characteristics. Samples are then drawn at random from each stratum by simple random sampling.

 

Suppose the population is of size N and a sample of size n is to be drawn. We first divide the population into k non-overlapping groups, where each group exhibits a specific characteristic. For example, if we divide the population based on sex, we would have 3 strata: Male, Female, Others. We then draw a sample at random from each stratum in such a way that the stratum sample sizes add up to the total sample size n. If n1 units are drawn from stratum 1, n2 units from stratum 2, …, nk units from stratum k, then n1+n2+…+nk=n.
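The procedure can be sketched in Python as follows. The text does not say how the stratum sample sizes are chosen, so this sketch assumes proportional allocation (each stratum contributes in proportion to its share of the population); the function name and the population data are made up for illustration.

```python
import random
from collections import defaultdict

def stratified_sample(units, stratum_of, n, seed=None):
    """Draw n units in total, using simple random sampling inside each stratum.

    Assumption: stratum sample sizes are proportional to stratum sizes.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for unit in units:
        strata[stratum_of(unit)].append(unit)

    N = len(units)
    # Largest-remainder rounding so that n1 + n2 + ... + nk == n exactly.
    exact = {s: n * len(members) / N for s, members in strata.items()}
    sizes = {s: int(share) for s, share in exact.items()}
    leftover = n - sum(sizes.values())
    by_remainder = sorted(exact, key=lambda s: exact[s] - sizes[s], reverse=True)
    for s in by_remainder[:leftover]:
        sizes[s] += 1

    return {s: rng.sample(strata[s], sizes[s]) for s in strata}

# Made-up population of (id, group) pairs: 480 + 500 + 20 = 1000 units.
population = (
    [(i, "Male") for i in range(480)]
    + [(i, "Female") for i in range(480, 980)]
    + [(i, "Others") for i in range(980, 1000)]
)
sample = stratified_sample(population, stratum_of=lambda u: u[1], n=100, seed=1)
print({s: len(chosen) for s, chosen in sample.items()})
```

With this made-up population the allocation works out to 48, 50 and 2 units, which add up to n = 100. Other allocation rules (for example equal allocation) are equally valid choices.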

 

2.6.1.3. Systematic Random Sampling: Systematic random sampling is a sampling technique where the first sample unit is chosen at random and the following units are chosen at equal intervals.

 

Suppose, by using simple random sampling, we choose the 1st sample unit, which is the 10th population unit. Suppose the sampling interval, chosen by the lottery/random method or fixed in advance (commonly as N/n), is also 10. Then the 2nd sample unit would be the 20th population unit, the 3rd sample unit the 30th population unit, …, and the nth sample unit the (n×10)th population unit.
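A minimal Python sketch of systematic sampling is given below. It assumes the common convention of fixing the interval as k = N // n and choosing the random start among the first k units; the function name is illustrative. The last two lines reproduce the worked example with a start of 10 and an interval of 10.

```python
import random

def systematic_sample(N, n, seed=None):
    """Pick a random start in the first interval, then every k-th unit."""
    rng = random.Random(seed)
    k = N // n                   # sampling interval (assumes n <= N)
    start = rng.randint(1, k)    # random start among units 1..k
    return [start + i * k for i in range(n)]

picks = systematic_sample(N=100, n=10, seed=3)
print(picks)

# The worked example, with a fixed start instead of a random one:
start, k, n = 10, 10, 5
print([start + i * k for i in range(n)])  # [10, 20, 30, 40, 50]
```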

 

2.6.1.4. Clustered Random Sampling: Clustered random sampling is a sampling technique where the total population is divided into clusters (sub-groups) of the same size. All the sub-groups possess the same characteristics. Here, instead of choosing individual units, whole sub-groups are chosen at random.

 

Suppose the population is of size N and a sample of size n is to be drawn. We first divide the population into k distinct sub-groups. Then we select one or more clusters at random and include all units from the chosen clusters to form the desired sample of size n.
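As a rough sketch, cluster sampling can be written in Python as below. The villages and households are made-up data, and `cluster_sample` is a hypothetical helper name. Note that whole clusters are selected, and every unit inside a chosen cluster enters the sample.

```python
import random

def cluster_sample(clusters, num_clusters, seed=None):
    """Choose whole clusters at random and keep every unit inside them."""
    rng = random.Random(seed)
    chosen_ids = rng.sample(sorted(clusters), num_clusters)
    sample = [unit for cid in chosen_ids for unit in clusters[cid]]
    return chosen_ids, sample

# Made-up population: 6 villages (clusters) of 20 households each.
clusters = {
    f"village_{v}": [f"v{v}_house_{h}" for h in range(20)] for v in range(6)
}
chosen_ids, sample = cluster_sample(clusters, num_clusters=2, seed=7)
print(chosen_ids, len(sample))
```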

2.6.2.      Types of Non-Random Sampling

2.6.2.1. Convenience Sampling: In this sampling technique, the researcher chooses sample units that are easily accessible to them, i.e., based on their convenience.

 

Suppose the researcher is to conduct a survey on household expenditure in region X and, instead of choosing the units at random, chooses units that are close to each other in order to avoid travelling.

 

 

2.6.2.2. Voluntary Sampling: In this sampling technique, the respondents voluntarily choose to participate in the survey and so form the sample units.


Suppose the researcher uploads the survey to an online forum and the respondents are given the option to skip the survey or participate in it. Here, the respondents voluntarily choose to participate in the survey.

 

2.6.2.3. Judgement Sampling: In this non-random sampling technique, the researcher chooses the sample based on their experience and judgement.

 

Suppose the researcher is to conduct a survey on women’s safety in company X and on their opinion of management. Knowing that not everyone would give honest answers, the researcher selects the sample using their own judgement.

 

2.6.2.4. Quota Sampling: In this sampling technique, the entire population is divided into sub-groups and sample units are then drawn from each of them by a non-random sampling technique.


Suppose the population is of size N and a sample of size n is to be drawn. We first divide the population into k distinct sub-groups (quotas). We then draw a sample from each quota in such a way that the quota sample sizes add up to the total sample size n. The selection within each quota is non-random.
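The following Python sketch shows one possible way to fill quotas. It assumes the non-random rule is "take respondents in the order they are met", which is only one of several rules a researcher might use; the respondents, age bands and quota sizes are made-up.

```python
def quota_sample(respondents, group_of, quotas):
    """Fill each quota with the first respondents met (a non-random choice)."""
    filled = {group: [] for group in quotas}
    for person in respondents:  # order of arrival, not random
        group = group_of(person)
        if group in filled and len(filled[group]) < quotas[group]:
            filled[group].append(person)
        if all(len(filled[g]) == quotas[g] for g in quotas):
            break  # every quota is full
    return filled

# Made-up stream of respondents as (name, age_band) pairs.
respondents = [(f"person_{i}", "18-30" if i % 3 else "31-50") for i in range(30)]
sample = quota_sample(
    respondents,
    group_of=lambda p: p[1],
    quotas={"18-30": 4, "31-50": 2},
)
print({g: [name for name, _ in people] for g, people in sample.items()})
```

Because the first respondents encountered always fill the quotas, people met later never have a chance of being selected, which is where the bias of this method comes from.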

 

2.6.2.5. Snowball Sampling: In this sampling technique, the sample units are recruited along the way through a network of contacts. We use this method when surveying a topic for which respondents are difficult to find.

 

Suppose we are to conduct a survey on weed consumption among present-day youth. Here it would be difficult to identify the population or a sample, and harder still to find volunteers. So we can first find one person, ask whether he/she knows someone else who consumes weed, and proceed in the same way.’[6]
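Snowball recruitment can be pictured as walking outward through a referral network. The sketch below assumes the network is already known as a dictionary of who can introduce whom, which would not be the case in a real survey, where referrals are discovered one interview at a time. All names are hypothetical.

```python
from collections import deque

def snowball_sample(referrals, seeds, max_size):
    """Grow a sample by following referrals outward from the seed respondents."""
    sample, seen = [], set(seeds)
    queue = deque(seeds)
    while queue and len(sample) < max_size:
        person = queue.popleft()
        sample.append(person)
        for contact in referrals.get(person, []):
            if contact not in seen:  # do not recruit anyone twice
                seen.add(contact)
                queue.append(contact)
    return sample

# Made-up referral network: who each respondent says they can introduce.
referrals = {"A": ["B", "C"], "B": ["D"], "C": ["D", "E"], "D": [], "E": ["F"]}
print(snowball_sample(referrals, seeds=["A"], max_size=5))  # ['A', 'B', 'C', 'D', 'E']
```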

 

2.7. Bias and Error in the Data Collection Stage

Bias can enter the data even at the most basic stage, i.e., data collection. It can be intentional or unintentional. It is important to detect bias at this initial stage and to remove or rectify it. Data need to be verified properly before being put to use.

The respondent may provide wrong information intentionally or unintentionally: in the case of an old event the respondent might fail to recall it, or in certain cases the respondent may be reluctant to share certain information, etc.

The investigator may collect wrong information on purpose or by mistake. They may not want to visit all the units and instead try to get information from some other nearby source, or they may mishear something and record it, etc.[7]



[1] https://blog.panoply.io/data-collection-how-what-when

[2] Statistical Methods, combined edition (volumes 1 and 2), N. G. Das, McGraw Hill

[3] Statistical Methods, combined edition (volumes 1 and 2), N. G. Das, McGraw Hill

[4] Statistical Methods, combined edition (volumes 1 and 2), N. G. Das, McGraw Hill

[5] https://www.simplilearn.com/types-of-sampling-techniques-article

[6] https://www.simplilearn.com/types-of-sampling-techniques-article

[7] https://www.futurelearn.com/info/courses/data-science-artificial-intelligence/0/steps/113337#:~:text=In%20a%20statistical%20sense%2C%20bias,want%20to%20say%20something%20about.
