Data Collection
“As Charles Kettering says, a problem well-stated is a problem half solved.”[1]
All these questions have hidden problem statements, such as ‘We are losing
customers’, ‘We are not able to increase sales’, ‘Crime is increasing or is
very high’, ‘We are unable to achieve our goals’, ‘The literacy rate is very
low’ and ‘Plans aren’t working as intended’. These statements are like saying
that ‘X has a fever’. But a fever by itself is not a problem; it is a symptom
or effect of a hidden problem. What we need to find is the root cause of the
stated problem, i.e., ‘Why is X suffering from a fever?’. This is the work of
the data analyst.
Most questions are stated in simple, everyday language, but an analyst needs
to restate them in mathematical or technical terms and analyze the data to
find the answer. The solution then needs to be presented in layman's terms,
not in mathematical terms, so that it is understandable. It is a circular
process: the question posed in common language is translated into mathematical
terms, and the numerical solution is then translated back into common
language. That is, the analyst first needs to find the underlying problem
statement in the question asked. Then they need to find the hidden problems
that might have given rise to it. These can be stated as sub-problem
statements under the main one. In our earlier ‘fever’ example, a sub-statement
could be ‘the patient might have an undetected stomach infection’. All of
these can act as hypotheses for the problem, i.e., the common-language
statement has been transformed into an analytical statement. Once this is
done, we can move on to the next steps in order to accept or reject each
hypothesis. But not every analysis has a hypothesis. Sometimes we simply need
to extract detailed insights or specific information from the data.
[1] A problem well stated is a problem… (Charles Kettering Quote) - Famous Inspirational Quotes & Sayings (inspirationalstories.com)
2.1. Definition
‘Data collection is the gathering of data from sources in order to find a
solution to the stated problem. Analysis will produce incorrect outcomes if
the data are not properly collected. In order to collect proper data, one
first needs to set the aim of the study. This aim is drawn from the first step
of analysis, i.e., ‘Defining the Problem’. Once the goal is set, we need to
collect only relevant data. Choosing the data most relevant to our objective
is an important step. So we need to select the relevant variables (measurable,
quantitative data that vary) and attributes (non-measurable, qualitative data
that vary) within the scope of the study, as well as the area of study. The
source from which the data are collected and the method used are also very
important for the relevance and precision of the data.’
2.2. Types of Data Collection
There are two types of data collection methods: Primary and Secondary.
2.2.1. ‘Primary Data: Primary data are data collected directly from the source
by the analyst, or taken from data published by the authorities who are
themselves responsible for its collection.
a. Data collection through direct interviews.
b. Data collection through survey questionnaire/schedule/observation.
c. Data collection through polls: votes/headcounts.
d. Data collection through officially published documents, records and
historical scriptures.
2.2.2. Secondary Data: Secondary data are data that are neither collected
directly from the source by the analyst nor taken from data published by the
authorities who are themselves responsible for its collection. They were
collected for one purpose but are used for another.
a. Data taken from unofficially published documents/journals.
b. Data taken from the internet.
c. Data taken from magazines.
Primary data contain more detailed, more recent and more survey-relevant
information, with less error, than secondary data. Secondary data are more
economical to collect than primary data. Primary data are more reliable than
secondary data.’[2]
For example, if an accident occurs, it is more reliable to obtain the
information directly from an eyewitness than from someone who has only heard
about it. Primary data are more reliable, detailed and precise than secondary
data. On the other hand, suppose we are to analyze the infant mortality rates
in state X. It is more economical and logical to use National Health Census
data as a secondary source.
2.3. Methods of Collecting Primary Data
‘Primary data are obtained by the following four methods:
2.3.1. Direct Personal Observation: In direct personal observation, the
investigator collects information directly from the source, or by directly
observing the situation. For example, if the survey is on female students
leaving school, then in direct personal observation the investigator puts
questions to the students who are leaving school. A reporter reporting from
the place of occurrence and describing the situation first-hand is another
example of direct personal observation.
2.3.2. Indirect Oral Investigation: In indirect oral investigation, the
investigator collects information indirectly. For example, in the previous
survey the investigator would put questions to the classmates of the female
students leaving school. Similarly, a reporter may interview people who saw a
crime and then report it to us.
2.3.3. Questionnaire sent through mail: A questionnaire is a list of questions
set by the investigator and filled in by the respondent. The questionnaire is
sent to the respondent by mail. It contains the list of questions, a brief
description of the survey and a stipulated time within which it needs to be
filled in and returned to the investigator.
The questionnaire should contain the minimum number of questions. The
questions should mostly be multiple choice or yes/no and should require
minimal calculation. They should be easy to understand, with an explanation
provided wherever required. The questions should not hurt the respondent's
sentiments.
2.3.4. Schedule sent through investigators: A schedule is similar to a
questionnaire, but it is filled in by the investigator, who interviews the
respondents.’ [3]
2.4. Population and Sample
‘The collected data can come either from the entire population or from a part
of it. The population is the entire group of observations within the scope of
the study. A sample is a part of the population which is expected to have all
the characteristics and properties of the population. Population
characteristics, such as the population mean and population variance, are
called parameters. Sample characteristics, such as the sample mean and sample
variance, are called statistics. Sampling can be done completely at random, by
judgement, or by a mixture of both.’[4]
Example:
If our study topic is illiteracy among women in India, then all women in India
form the population. Women in India in the 15-25 age group form a sample,
though it might not be an appropriate one. Mathematically, a sample is a
subset of the population.
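The difference between a parameter and a statistic can be sketched in a few lines of Python. This is only an illustration: the population of ages below is made-up data, and the sample size of 50 is an arbitrary assumption.

```python
import random
import statistics

# Hypothetical population: ages of 1,000 people (made-up data for illustration).
rng = random.Random(42)
population = [rng.randint(15, 60) for _ in range(1000)]

# Parameters describe the whole population.
population_mean = statistics.mean(population)
population_variance = statistics.pvariance(population)

# Statistics describe a sample drawn from that population.
sample = rng.sample(population, 50)
sample_mean = statistics.mean(sample)
sample_variance = statistics.variance(sample)  # uses n - 1 in the denominator

print(f"parameters: mean={population_mean:.2f}, variance={population_variance:.2f}")
print(f"statistics: mean={sample_mean:.2f}, variance={sample_variance:.2f}")
```

The sample values will usually be close to, but not exactly equal to, the population values, which is why the choice of sample matters.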
2.5. Sampling
A sample is a part of the population which is expected to have all the
characteristics and properties of the population. At times it might be
impractical to analyze all the population units. For example, if we are
cooking a dish and want to check whether it tastes fine, we can taste only a
portion of it, not the entire dish. In such cases we take a sample (the
portion) and expect it to taste the same as the entire dish from which it was
taken.[5]
2.6. Types of Sampling
There are two types of sampling techniques: Random and Non-Random Sampling.
2.6.1. Random Sampling: Random Sampling, also known as Probability Sampling,
is a sampling technique where each population unit (member of the population)
has a known, non-zero chance of being included in the sample; in the simplest
designs this chance is equal for every unit. It has less chance of bias.
2.6.2. Non-Random Sampling: Non-Random Sampling, or Non-Probability Sampling,
is a sampling technique where the population units do not have a known or
equal chance of being included in the sample. It has more chance of bias.
2.6.1. ‘Types of Random Sampling
2.6.1.1. Simple Random Sampling: Simple Random Sampling is a sampling
technique where each population unit (member of the population) has an equal
chance of being included in the sample. In this technique the selection itself
introduces no bias. Various random number tables are available from which the
numbers can be chosen at random.
Suppose the population is of size N and a sample of size n is to be drawn. We
number all population units serially from 1 to N, then open a page of a random
number table and read off n numbers, either horizontally or vertically. We can
also use the lottery method, drawing n chits from a jar containing chits
numbered 1 to N.
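The lottery method can be sketched in Python, where a pseudo-random number generator plays the role of the jar of chits. The function name `simple_random_sample` and the values N = 500, n = 10 are assumptions chosen for illustration.

```python
import random

def simple_random_sample(N, n, seed=None):
    """Draw n distinct unit numbers from 1..N, each unit equally likely."""
    if not 0 <= n <= N:
        raise ValueError("sample size n must be between 0 and N")
    rng = random.Random(seed)
    # random.sample draws without replacement, like pulling chits from a jar.
    return rng.sample(range(1, N + 1), n)

chosen = simple_random_sample(N=500, n=10, seed=1)
print(sorted(chosen))
```

Fixing the seed only makes the draw repeatable; leaving it as None gives a fresh draw each time.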
2.6.1.2. Stratified Random Sampling: Stratified Random Sampling is a sampling
technique where the total population is divided into strata (groups) based on
certain characteristics. Samples are then drawn at random from each stratum
through simple random sampling.
Suppose the population is of size N and a sample of size n is to be drawn. We
first divide the population into K non-overlapping groups, where each group
exhibits a specific characteristic. For example, if we divide the population
based on sex, we would have 3 strata: Male, Female and Others. Then we draw a
sample at random from each stratum in such a way that the sample sizes drawn
from the strata add up to the total sample size n. If n1 units are drawn from
stratum 1, n2 units from stratum 2, …, and nK units from stratum K, then
n1 + n2 + … + nK = n.
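A minimal Python sketch of this procedure follows. The strata, the unit labels and the allocation n1 = 10, n2 = 9, n3 = 1 are all hypothetical; the allocation is roughly proportional to stratum size, which is one common choice but not the only one.

```python
import random

def stratified_sample(strata, allocation, seed=None):
    """strata: {name: list of units}; allocation: {name: units to draw}."""
    rng = random.Random(seed)
    sample = {}
    for name, units in strata.items():
        # Simple random sampling (without replacement) inside each stratum.
        sample[name] = rng.sample(units, allocation[name])
    return sample

# Hypothetical population of N = 100 units split into K = 3 strata.
strata = {
    "male": [f"M{i}" for i in range(48)],
    "female": [f"F{i}" for i in range(47)],
    "others": [f"O{i}" for i in range(5)],
}
allocation = {"male": 10, "female": 9, "others": 1}  # n1 + n2 + n3 = n = 20

picked = stratified_sample(strata, allocation, seed=7)
n = sum(len(units) for units in picked.values())
print(n, picked["others"])
```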
2.6.1.3. Systematic Random Sampling: Systematic Random Sampling is a sampling
technique where the first sample unit is chosen at random and the following
units are chosen at equal intervals.
Suppose that, using simple random sampling, we choose the 10th population unit
as the 1st sample unit. Then, using the lottery/random method, we choose the
interval as 10. The 2nd sample unit would be the 20th population unit, the 3rd
sample unit the 30th population unit, …, and the nth sample unit the
(n × 10)th population unit.
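Here is a rough Python sketch of systematic sampling. Note one assumption that differs from the example above: instead of picking the interval by lottery, the sketch uses the common convention of setting the interval to N // n and choosing the random start within the first interval, so that all n units fall inside the population.

```python
import random

def systematic_sample(N, n, seed=None):
    """Return n unit numbers from 1..N spaced at a fixed interval."""
    if not 1 <= n <= N:
        raise ValueError("sample size n must be between 1 and N")
    interval = N // n                    # assumed convention: k = N // n
    rng = random.Random(seed)
    start = rng.randint(1, interval)     # random start within the first interval
    return [start + i * interval for i in range(n)]

units = systematic_sample(N=100, n=10, seed=3)
print(units)
```

With N = 100 and n = 10 the interval is 10, so a start of 10 would reproduce the sequence 10, 20, 30, … described above.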
2.6.1.4. Clustered Random Sampling: Clustered Random Sampling is a sampling
technique where the total population is divided into clusters (sub-groups) of
the same size. All the sub-groups possess the same characteristics. Here,
instead of choosing individual units, entire sub-groups are chosen at random.
Suppose the population is of size N and a sample of size n is to be drawn. We
first divide the population into K distinct sub-groups. Then we select one or
more clusters at random and include all units from the chosen clusters to form
the desired sample of size n.
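The cluster procedure can be sketched as follows. The villages and households are hypothetical: N = 40 units are split into K = 8 equal clusters of 5, and 2 clusters are drawn to give a sample of n = 10.

```python
import random

def cluster_sample(clusters, num_clusters, seed=None):
    """clusters: {cluster id: list of units}. Whole clusters are drawn at random."""
    rng = random.Random(seed)
    chosen_ids = rng.sample(sorted(clusters), num_clusters)
    # Every unit of a chosen cluster enters the sample.
    sample = [unit for cid in chosen_ids for unit in clusters[cid]]
    return chosen_ids, sample

# Hypothetical population: 8 villages (clusters) of 5 households each.
clusters = {f"village_{c}": [f"v{c}_h{h}" for h in range(5)] for c in range(8)}

chosen_ids, cluster_units = cluster_sample(clusters, num_clusters=2, seed=11)
print(chosen_ids, len(cluster_units))
```

The randomness here applies to the clusters, not to the individual units inside them.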
2.6.2. Types of Non-Random Sampling
2.6.2.1. Convenience Sampling: In this sampling technique, the researcher
chooses sample units which are easily accessible to them, i.e., based on their
convenience.
Suppose the researcher is to conduct a survey on the household expenditure of
region X. Instead of choosing the units at random, the researcher chooses
units which are close to each other in order to avoid travelling.
2.6.2.2. Voluntary Sampling: In this sampling technique, the respondents
voluntarily choose to participate in the survey and so form the sample units.
Suppose the researcher uploads the survey to an online forum and the
respondents are given the option to skip the survey or participate in it.
Here, the respondents voluntarily choose to participate in the survey.
2.6.2.3. Judgement Sampling: In this non-random sampling technique, the
researcher chooses the sample based on their experience and judgement.
Suppose the researcher is to conduct a survey on women's safety in company X
and their opinion of management. Knowing that not everyone would provide
honest answers, the researcher selects the sample based on their own
judgement.
2.6.2.4. Quota Sampling: In this sampling technique, the entire population is
divided into sub-groups and then sample units are drawn from them by a
non-random sampling technique.
Suppose the population is of size N and a sample of size n is to be drawn. We
first divide the population into K distinct sub-groups/quotas. Then we draw a
sample from each quota in such a way that the sample sizes drawn from the
quotas add up to the total sample size n. The selection within each quota is
non-random.
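As a rough sketch, quota sampling can be written as "take respondents in the order they are met until every quota is full". The respondent list, the group labels and the quotas below are hypothetical, and filling quotas on a first-come basis is just one possible non-random rule.

```python
def quota_sample(respondents, quotas):
    """respondents: (unit, group) pairs in the order they are met.
    quotas: {group: number of units wanted from that group}."""
    remaining = dict(quotas)
    sample = []
    for unit, group in respondents:
        if remaining.get(group, 0) > 0:
            # Non-random rule: accept whoever comes first while the quota is open.
            sample.append((unit, group))
            remaining[group] -= 1
        if not any(remaining.values()):
            break  # every quota is filled
    return sample

respondents = [
    ("r1", "urban"), ("r2", "urban"), ("r3", "rural"), ("r4", "urban"),
    ("r5", "rural"), ("r6", "rural"), ("r7", "urban"),
]
quota_picked = quota_sample(respondents, {"urban": 2, "rural": 2})
print(quota_picked)
# [('r1', 'urban'), ('r2', 'urban'), ('r3', 'rural'), ('r5', 'rural')]
```

Respondent r4 is skipped because the urban quota is already full, which shows how the order of arrival, not chance, decides who is included.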
2.6.2.5. Snowball Sampling: In this sampling technique, the sample units are
found along the way through a network of referrals. We use this method when we
are surveying a topic for which respondents are difficult to find.
Suppose we are to conduct a survey on weed consumption among today's youth.
Here it would be difficult to identify the population or a sample, and harder
still to find voluntary respondents. So we can first find one person, then ask
whether he/she knows someone else who consumes weed, and proceed in a similar
way.’[6]
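The referral chain can be sketched in Python as a walk through a network. The referral map, the names A to E and the stopping size of 4 are hypothetical, and visiting referrals in first-in, first-out order is an assumption made for simplicity.

```python
from collections import deque

def snowball_sample(referrals, seeds, max_size):
    """referrals: {person: [people they can refer]}; seeds: first respondents."""
    sample = []
    seen = set()
    queue = deque(seeds)
    while queue and len(sample) < max_size:
        person = queue.popleft()
        if person in seen:
            continue  # already surveyed through another referral
        seen.add(person)
        sample.append(person)
        # Ask this respondent for further contacts.
        queue.extend(referrals.get(person, []))
    return sample

referrals = {"A": ["B", "C"], "B": ["D"], "C": ["D", "E"], "D": [], "E": ["A"]}
wave = snowball_sample(referrals, seeds=["A"], max_size=4)
print(wave)  # ['A', 'B', 'C', 'D']
```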
2.7. Bias and Error in the Data Collection Stage
Data bias can occur even at the most basic stage, i.e., data collection. It
can be intentional or unintentional. It is important to detect bias at this
initial stage and to remove or rectify it. Data need to be verified properly
before being put to use.
The respondent may provide wrong information intentionally or
unintentionally. In the case of an old event the respondent might fail to
recall it, and in certain cases the respondent may be reluctant to share
certain information, etc.
The investigator may collect wrong information on purpose or by mistake. They
may not want to visit all the units and instead try to get information from
another nearby source, or they may mishear something and record that, etc.[7]
[1] https://blog.panoply.io/data-collection-how-what-when
[2] N. G. Das, Statistical Methods, combined edition (volumes 1 and 2), McGraw Hill
[3] N. G. Das, Statistical Methods, combined edition (volumes 1 and 2), McGraw Hill
[4] N. G. Das, Statistical Methods, combined edition (volumes 1 and 2), McGraw Hill
[5] https://www.simplilearn.com/types-of-sampling-techniques-article
[6] https://www.simplilearn.com/types-of-sampling-techniques-article
[7] https://www.futurelearn.com/info/courses/data-science-artificial-intelligence/0/steps/113337#:~:text=In%20a%20statistical%20sense%2C%20bias,want%20to%20say%20something%20about.