lunes, 27 de noviembre de 2017

Scatter plots in survey sampling

When it comes to analyzing survey data, you have to take into account the stochastic structure of the sample that was selected to obtain the data. Plots and graphics should not be an exception. The main aim of such studies is to try to infer about how the behavior of the outcomes of interest in the finite population.

For example, you may want to know how many people are in poverty. How about counting the poor people in the sample? I know, it is not a good idea. You have to use the sampling weights to estimate the whole number of poor people in the population. The total error paradigm follows the same approach when estimating any parameter of the finite population: you have to use the sampling weights in order to obtain unbiased estimators.

The idea behind this paradigm is the representative principle. If a person $k$ is included in the sample with a sampling weight $w_k$, she/he represents to himself and $w_k-1$ remaining people. That way, you obtain design-unbiasedness.

I am not a big fan of using descriptive sample statistics to analyze survey data because it can mask the reality, and the fact is that person $k$ is included in the sample, but he/she is not alone. Behind person $k$ there are other people, not included in the sample, and you have to realize that.
So, let’s apply that principle to scatter plots. I am using the Lucy population from the TeachingSampling package to recreate my idea. The following code is used to draw a $\pi PS$ sample.

# The inclusion probability is proportional to Income 
# The selected sample of size n=400 
n <- 400
res <- S.piPS(n, Lucy$Income)
sam <- res[,1]
# The sample is stored in an data.sample 
data.sample <- Lucy[sam, ]

The sampling weights will be stored in the data.sample data in the column wk. They will be useful to reproduce our finite popultaion from the sample data.

# Pik.s is the inclusion probability of units in the sample
data.sample$Pik.s <- res[,2]
# wk is the sampling weight
data.sample$wk <- 1/data.sample$Pik.s

Now, let`s make a plot of the sampling data (with just 400 observations). I recall that this scenario is somehow misleading, because we want to know the behavior of the variables in the finite population.

ggplot(data = data.sample, aes(x = Income, y = Employees)) +

The first option that comes to mind is to include the sampling weights in the points of the scatter plot. However, this approach is not appealing to me, because it is not straightforward from this plot to visuzlize the entire finite population.

ggplot(data = data.sample, aes(x = Income, y = Employees, size = wk)) + geom_point()

In order to make the finite population scatter plot from the survey sample, I will replicate the rows of the data.sample object as many times as the sampling weight wk. I am using the mefa::rep function to achieve this goal. So, the newLucy object is an intent to mimic the finite population by using the selected sample.

newLucy <- NULL

for(i in 1:nrow(data.sample)){
newLucy <- rbind(newLucy,
rep(data.sample[i, ],

newLucy <-

Now, with the newLucy population, I will make a scatter plot. Now, as I am replicating the rows of the sample data, I will add a jitter to avoid overplotting of the points in the scatter plot. This way, this plot (with 2396 observations) looks as if it would come from the finite population.

ggplot(data = newLucy, aes(x = Income, y = Employees)) +
geom_point() + geom_jitter(width = 15, height = 15)

7 comentarios:

  1. Get the idea about the data literacy in this sectio. In this survey sampling you can see the stochlastic structure and also the exception. With essay ghostwriting service you will able to understand all the technical code mention in this stream or in this blog.

  2. Thanks for posting your valuable thoughts with us & our readers. Please keep continue writing on this blog.
    Bulk SMS Provider

  3. We can see some kind of employees graphs shown here. When i was studying in the class of intermediate then i chose the book of statistics for reading. With the help of this statistic book i learn about the calculation of big amounts and also know that how we can make the employees average graphs.

  4. The total error paradigm follows the same approach , keep writing like this amazing things click here to see more about the quality writing services.