Wednesday, August 8, 2018

Calibration of persons in household surveys


That is, under the second calibration scenario, all persons within a household share the same sampling weight, and this weight is also equal to the household weight. This property holds only under the second scenario. Moreover, under the first scenario, men and women within the same household are guaranteed to have different expansion factors. From a theoretical point of view this poses no problem, but some would like the final expansion factors to preserve those properties of the sampling design. Estevao and Särndal (2006) present some theoretical properties of the second scenario.


Silva, P. L. d. N. 2004. «Calibration estimation: when and why, how much and how». Rio de Janeiro: Instituto Brasileiro de Geografia e Estatística.

Estevao, Victor, and Carl-Erik Särndal. 2006. «Survey Estimates by Calibration on Complex Auxiliary Information». International Statistical Review / Revue Internationale de Statistique 74 (2): 127-47.

Monday, November 27, 2017

Scatter plots in survey sampling

When it comes to analyzing survey data, you have to take into account the stochastic structure of the sample that was selected to obtain the data. Plots and graphics should not be an exception. The main aim of such studies is to infer the behavior of the outcomes of interest in the finite population.

For example, you may want to know how many people are in poverty. How about counting the poor people in the sample? I know, it is not a good idea. You have to use the sampling weights to estimate the total number of poor people in the population. The total error paradigm follows the same approach when estimating any parameter of the finite population: you have to use the sampling weights in order to obtain unbiased estimators.

The idea behind this paradigm is the representative principle. If a person $k$ is included in the sample with a sampling weight $w_k$, he/she represents himself/herself and $w_k - 1$ other people. That way, you obtain design-unbiasedness.
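As a minimal sketch of the representative principle, the snippet below contrasts the naive sample count with the weighted estimate. The five sampled persons, the `poor` indicator and the weights `wk` are made-up values for illustration, not data from any survey.

```r
# Hypothetical sample of five persons: 1 = poor, 0 = not poor
poor <- c(1, 0, 1, 0, 0)
# Hypothetical sampling weights: person k represents wk[k] people
wk <- c(10, 20, 5, 40, 25)

# Naive count: number of poor people observed in the sample
naive_count <- sum(poor)

# Weighted estimate of the number of poor people in the population
t_hat <- sum(wk * poor)

# The weights themselves estimate the population size
N_hat <- sum(wk)

print(c(naive = naive_count, weighted = t_hat, N_hat = N_hat))
```

The naive count only describes the sample, while the weighted sum refers to the population that the sample represents.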

I am not a big fan of using descriptive sample statistics to analyze survey data because it can mask the reality, and the fact is that person $k$ is included in the sample, but he/she is not alone. Behind person $k$ there are other people, not included in the sample, and you have to realize that.
So, let’s apply that principle to scatter plots. I am using the Lucy population from the TeachingSampling package to recreate my idea. The following code is used to draw a $\pi PS$ sample.

library(TeachingSampling)
data(Lucy)

# The inclusion probability is proportional to Income
# The selected sample is of size n = 400
n <- 400
res <- S.piPS(n, Lucy$Income)
sam <- res[, 1]
# The sample is stored in the data frame data.sample
data.sample <- Lucy[sam, ]

The sampling weights will be stored in the column wk of the data.sample data frame. They will be useful to reproduce our finite population from the sample data.

# Pik.s is the inclusion probability of units in the sample
data.sample$Pik.s <- res[,2]
# wk is the sampling weight
data.sample$wk <- 1/data.sample$Pik.s

Now, let's make a plot of the sample data (with just 400 observations). Recall that this plot is somewhat misleading, because we want to know the behavior of the variables in the finite population.

library(ggplot2)

ggplot(data = data.sample, aes(x = Income, y = Employees)) +
  geom_point()

The first option that comes to mind is to map the sampling weights to the size of the points in the scatter plot. However, this approach is not appealing to me, because it is not straightforward to visualize the entire finite population from this plot.

ggplot(data = data.sample, aes(x = Income, y = Employees, size = wk)) + geom_point()

In order to make the finite population scatter plot from the survey sample, I will replicate each row of the data.sample object as many times as its sampling weight wk indicates. I am using the rep method for data frames from the mefa package to achieve this goal. So, the newLucy object is an attempt to mimic the finite population by using the selected sample.

library(mefa)

newLucy <- NULL

for (i in 1:nrow(data.sample)) {
  # Replicate row i; the number of copies is assumed to be the rounded weight wk
  newLucy <- rbind(newLucy,
                   rep(data.sample[i, ], round(data.sample$wk[i])))
}

newLucy <- as.data.frame(newLucy)
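The same replication idea can be sketched in base R without a loop, by repeating row indices. The toy data frame below uses made-up values, and rounding the weight to the nearest integer is my assumption about how the number of copies is chosen.

```r
# Toy sample with hypothetical values; wk plays the role of the sampling weight
toy <- data.frame(Income = c(500, 1200, 300),
                  Employees = c(10, 45, 4),
                  wk = c(2.2, 1.4, 5.7))

# Number of copies of each row: the weight rounded to the nearest integer
times <- round(toy$wk)

# Repeat each row index as many times as its rounded weight
pseudo_pop <- toy[rep(seq_len(nrow(toy)), times = times), ]
rownames(pseudo_pop) <- NULL

nrow(pseudo_pop)
```

Indexing with repeated row numbers avoids growing the data frame inside a loop, which becomes slow for large samples.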

Now, with the newLucy pseudo-population, I will make a scatter plot. Since I am replicating the rows of the sample data, I will add jitter to avoid overplotting of the points. This way, the plot (whose number of points approximates the 2396 units of the population) looks as if it came from the finite population.

ggplot(data = newLucy, aes(x = Income, y = Employees)) +
  geom_jitter(width = 15, height = 15)

Tuesday, November 21, 2017

dplyr and the design effect in survey samples

Blogdown entry here.

For those who, like me, are not such R geeks, this trick could be of interest. The dplyr package is very useful for data manipulation and lets you extract valuable information from a data frame. For example, if you want to count how many humans in the starwars data have a particular hair color, you can run the following piece of code:


library(dplyr)

starwars %>% filter(species == "Human") %>%
  group_by(hair_color) %>%
  summarise(n = n())

# hair_color       n
# auburn           1
# auburn, grey     1
# auburn, white    1
# black            8
# blond            3
# brown           14
# brown, grey      1
# grey             1
# none             3
# white            2

The former query returns a data frame, so you can use it in another query. For example, if you want to know the average number of individuals per hair color, you can use summarise twice:


starwars %>% filter(species == "Human") %>%
  group_by(hair_color) %>%
  summarise(n = n()) %>%
  summarise(x.b = mean(n))

Now, turning our attention to statistics, it is known that, when dealing with sample surveys, one measure of interest is the design effect defined as

$Deff \approx 1 + (\bar{m} - 1)\rho$

where $\bar{m}$ is the average cluster size and $\rho$ is the intraclass correlation coefficient. If you are dealing with survey data and you want to figure out the value of $\bar{m}$ and $\rho$, you can use dplyr. Let’s use the Lucy data of the samplesize4surveys package to show how you can do it.


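library(dplyr)
library(samplesize4surveys)
data(Lucy)

# Average cluster size
m <- Lucy %>% group_by(Zone) %>%
  summarise(n = n()) %>%
  summarise(m = mean(n))

# Intraclass correlation coefficient
rho <- ICC(y = Lucy$Taxes, cl = Lucy$Zone)$ICC

# m is a one-row data frame; m$m keeps the mean without truncating it
DEFF <- 1 + (m$m - 1) * rho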
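To get a feeling for the formula, here is a small sketch that wraps the approximation in a function. The name `deff_approx` and the input values are hypothetical, chosen only for illustration.

```r
# Approximate design effect for a given average cluster size and ICC
deff_approx <- function(m_bar, rho) {
  1 + (m_bar - 1) * rho
}

# Hypothetical inputs: clusters of 20 units on average, ICC of 0.05
deff_approx(m_bar = 20, rho = 0.05)

# With no intraclass correlation the design effect equals 1
deff_approx(m_bar = 20, rho = 0)
```

With these inputs the design effect is 1.95, so the variance of the estimator is roughly twice that of a simple random sample of the same size.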

Sunday, November 19, 2017

Automatic output format in Rmarkdown

I am writing an R Markdown document with plenty of tables, and I want them in a decent format, e.g. kable. However, I don't want to format them one by one. For example, I have created the following data frame with dplyr:

data2 %>% group_by(uf) %>%
  summarise(n = n()) %>%
  arrange(desc(n))

One solution to the output format of this data frame would be to store it as an object in R, and then format it by using the kable function.

t1 <- data2 %>%
  group_by(uf) %>%
  summarise(n = n()) %>% arrange(desc(n))

knitr::kable(t1)

However, if your document has hundreds of these queries and you need a faster way to compile it while keeping the kable style automatically, without naming each data frame and without calling the kable function on that name, you can use the printr package. Just add the following piece of code inside a chunk at the beginning of your document and voilà.

library(printr)

Now, all of your data frames will have a decent style, and you do not need to worry about this issue. For example, I have knitted a presentation by using printr and the first code in this post, and this is the result:

Thursday, June 15, 2017

Sampling weights and multilevel modeling in R

So many things have been said about weighting, but in my personal view of statistical inference, you do have to weight. From a single statistic to a complex model, you have to weight, because the probability measure that induces the variation of the sample comes from an (almost always) complex sampling design that you should not ignore. Weighting is a complex issue that has been discussed by several authors in recent years. Social researchers have not reached a consensus about the appropriateness of weighting when it comes to fitting statistical models. Angrist and Pischke (2009, p. 91) claim that "few things are as confusing to applied researchers as the role of sample weights. Even now, 20 years post-Ph.D., we read the section of the Stata manual on weighting with some dismay."

Anyway, despite the fact that researchers do not have consensus on when to weight, the reality is that you have to be careful when doing so. For example, when it comes to estimating totals, means or proportions, you can use the inverse probability as a way for weighting, and it looks like every social researcher agrees to weight in order to estimate this kind of descriptive statistics. The rationale behind this practice is that you suppose that every unit belonging to the sample represents itself and many others that were not selected in the sample.

When using weights to estimate model parameters, you have to keep in mind the nature of the sampling design. For example, when it comes to estimating multilevel parameters, you have to take into account not only the weights of the final sampling units but also the weights of the first-stage sampling units. For example, let’s assume that you have a sample of students, selected from a national frame of schools. Then, we have two sets of weights: the first one regarding schools (notice that one selected school represents itself as well as others not in the sample) and the second one regarding students.

Let’s assume that in the finite population we have 10,000 students and 40 schools. For the sake of my example, let's say that you have selected 500 students from 8 schools. For simplicity, let’s suppose that simple random sampling is used to select students (I know, this kind of sampling design is rarely used). Think about it: if you take into account only the students' weights to fit your multilevel model, you will be estimating parameters with an expanded sample that represents 10,000 students allocated in a sample of just eight schools. So, any conclusion stated will be wrong. For example, when performing a simple analysis of variance, the percentage of variance explained by the schools will be extremely low, because the students are expanded but the schools are not. Now, if you take into account both sets of weights (students and schools), you will find yourself fitting a model with expanded samples that represent 10,000 students and 40 schools (which is good).
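To make the arithmetic explicit, here is a sketch under the simplifying assumption (mine, for illustration) that schools and students are selected with equal probabilities, so that the average weights are just population counts divided by sample counts:

$$\bar{w}_{school} = \frac{40}{8} = 5, \qquad \bar{w}_{student} = \frac{10000}{500} = 20$$

Summing the student weights over the sample gives $500 \times 20 = 10000$ students, and summing the school weights gives $8 \times 5 = 40$ schools. Using only the student weights expands the students while the number of schools stays at 8.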

Unfortunately, as far as I know, R lacks a package that performs this kind of design-based inference when fitting multilevel models. So, for now, we can estimate model parameters without bias, but when it comes to estimating standard errors (from a design-based perspective) we need other computational resources and techniques, such as the bootstrap or the jackknife. Because they assume independence, most applied statistical methods cannot be used directly to analyze this kind of data, given the dependency among sampled observation units. Inaccurate standard errors may be produced if no adjustment is made when analyzing complex survey data.

When it comes to educational studies (based on large-scale assessment tests), we can distinguish (at least) four sets of weights: total student weight, student house weight, student senate weight and school weight. The TIMSS team states that the total student weight is appropriate for single-level, student-level analyses. The student house weight, also called normalized weight, is used when analyses are sensitive to sample size. It is essentially a linear transformation of the total student weight so that the sum of the weights is equal to the sample size. The student senate weight is used when analyses involve more than one country, because it is the total student weight scaled in such a way that all students’ senate weights sum to 500 (or 1000) in each country. The school weight should be used when analyzing school-level data, as it is the inverse of the probability of selection of the selected school.
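The two rescalings described above can be sketched as follows. The vector `total_w` holds made-up total student weights for six students, and the target of 500 for the senate weight is one of the two values mentioned in the text.

```r
# Hypothetical total student weights for a sample of six students
total_w <- c(120, 80, 150, 60, 90, 100)
n <- length(total_w)

# House weight: rescaled so that the weights sum to the sample size
house_w <- n * total_w / sum(total_w)

# Senate weight: rescaled so that the weights sum to 500 within the country
senate_w <- 500 * total_w / sum(total_w)

sum(house_w)   # equals the sample size n
sum(senate_w)  # equals 500
```

Both are proportional to the total student weight, so the relative contribution of each student is unchanged; only the scale differs.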

R workshop

We will use the student house weight to fit a multilevel model. As stated before, the sum of these weights is equal to the sample size. For the R workshop, we will use PISA 2012 data (available on the OECD website). I have filtered the data for the Colombian case and saved it in a format directly compatible with R (available here). Let’s load the data into R.

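rm(list = ls())
library(dplyr)
library(ggplot2)
library(lme4)

setwd("/your working directory")
# PisaCol, the filtered PISA 2012 data for Colombia, must be loaded at this point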

Now, we create a column containing the student house weights and summarize some results based on that set of weights. Notice that the total student weights are stored in the column W_FSTUWT of the PISA database. Recall that I am working with the first plausible value of the mathematics test, and that score will be our (dependent) variable of interest for the modeling.

n <- nrow(PisaCol)
PisaCol$W_HOUSEWHT <- n * PisaCol$W_FSTUWT / sum(PisaCol$W_FSTUWT)

PisaCol %>%
  group_by(STRATUM) %>%
  summarise(avg1 = weighted.mean(PV1MATH, w = W_HOUSEWHT),
            avg2 = weighted.mean(PV2MATH, w = W_HOUSEWHT))

We use the lmer function of the lme4 package to estimate the coefficients of the null model (where schools enter as random intercepts).

### Null model ###

HLM0 <- lmer(PV1MATH ~ (1 | SCHOOLID), data = PisaCol,
             weights = W_HOUSEWHT)

# 62.81% of the variance is due to students
# 37.19% of the variance is due to schools
100 * 3569 / (3569 + 2113)

As you may know, the PISA index of economic, social and cultural status has a strong relationship to student achievement, so it is a good idea to control for this variable in a more refined model. 

### ESCS model ###

HLM1 <- lmer(PV1MATH ~ ESCS + (1 + ESCS | SCHOOLID), data = PisaCol,
             weights = W_HOUSEWHT)

# After controlling for ESCS, 34.58% of the variance is due to schools
100 * (96.12 + 1697.36) / (3392.58 + 96.12 + 1697.36)

So then, in summary: we have 3569 units of within-school variance (63%); after controlling for ESCS that figure drops to 3392 units (student background explains 5% of that variation). We have 2113 units of between-school variance (37%); after controlling for ESCS that figure drops to 1793 (student background explains 15% of that variation). The following code makes graphs that summarize the relationship between student achievement and ESCS.

ggplot(data = PisaCol, aes(x = ESCS, y = PV1MATH, size = W_HOUSEWHT)) +
  theme_minimal() + geom_point() + theme(legend.position="none")

ggplot(data = PisaCol, aes(x = ESCS, y = PV1MATH, size = W_HOUSEWHT)) +
  geom_point(aes(colour = SCHOOLID)) + theme(legend.position="none")
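The percentages quoted in this post can be re-derived from the reported variance components. The sketch below only redoes that arithmetic; the variable names are mine.

```r
# Variance components as reported in the post
within0  <- 3569              # null model, within schools
between0 <- 2113              # null model, between schools
within1  <- 3392.58           # ESCS model, within schools
between1 <- 96.12 + 1697.36   # ESCS model, between schools (as summed in the post)

# Share of variance due to schools, before and after controlling for ESCS
share0 <- 100 * between0 / (within0 + between0)
share1 <- 100 * between1 / (within1 + between1)

# Relative reduction of each component after controlling for ESCS
red_within  <- 100 * (within0 - within1) / within0
red_between <- 100 * (between0 - between1) / between0

round(c(share0 = share0, share1 = share1,
        red_within = red_within, red_between = red_between), 2)
```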

Monday, April 17, 2017

Small Area Estimation 101

Small area estimation (SAE) has become a widely used technique in official statistics since the last decade of the past century. When the sample size is not large enough to provide reliable estimates at a very fine level, the power of models and auxiliary information must be applied with no hesitation. In a nutshell, SAE tries to exploit similarity and borrow strength from available information.

I will write some posts to present, step by step, the fundamentals of SAE and how it can be implemented in R. The first post (the one you are reading now) is about basic concepts such as sampling and databases; the second and third posts will deal with direct and indirect estimates, respectively; the fourth post will introduce model-assisted estimation; and finally, the fifth post will deal with the Fay-Herriot method.

Let's begin. First of all, the Australian Bureau of Statistics states that small area estimation refers to methods of producing sufficiently reliable estimates for geographic areas that are too fine to be estimated with precision using direct survey estimation methods. By direct estimation, we mean classical design-based survey estimation methods that use only the sample units contained in each small area. Small area estimation methods are used to overcome the problem of small sample sizes and to produce small area estimates that improve on the quality of the direct survey estimates obtained from the sample in each small area. The more sophisticated of these methods work by taking advantage of various relationships in the data, and involve, either implicitly or explicitly, a statistical model to describe these relationships.

Now, I want to reproduce a clarifying explanation from Dr. Little (paraphrasing Groves) about SAE, and its use in survey sampling. They claim that regression estimates provide relatively precise predictions for small areas from a survey that account for the differences between areas of characteristics included as predictors in the survey but do not account for differences in characteristics not included in the study. Direct estimates for each area are unique to the area and hence take into account both observed and unobserved relevant characteristics; however, they have low precision in areas where the sample size is small. The SAE model combines the regression estimate and direct estimate for each area in a sensible way, balancing bias and precision.
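One common way to write such a combination is a composite estimator. The notation below is my own shorthand, not taken from the authors quoted above: $\hat{\theta}_d^{dir}$ is the direct estimate for area $d$, $\hat{\theta}_d^{reg}$ is the regression prediction, and $\gamma_d \in [0, 1]$ is a weight that grows with the precision of the direct estimate.

$$\hat{\theta}_d^{SAE} = \gamma_d \, \hat{\theta}_d^{dir} + (1 - \gamma_d) \, \hat{\theta}_d^{reg}$$

In areas with a large sample, $\gamma_d$ is close to 1 and the direct estimate dominates; in areas with few observations, $\gamma_d$ is close to 0 and the estimate leans on the model.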

For this technique to succeed, Longford (2005) claims that areas that are known to be similar to one another should receive similar estimates, rather than estimates independent of one another. The degree to which similarity can or should be imposed can be chosen on statistical grounds by minimizing the overall discrepancy (mean squared error).

Rahman (2008) emphasizes that SAE uses data from similar domains to estimate the statistics in a particular small area of interest, and this ‘borrowing of strength’ is justified by assuming a model that relates the small area statistics. SAE is the process of using statistical models to link survey outcome or response variables to a set of predictor variables known for small areas, in order to predict small area-level estimates. Traditional area-specific estimates may not provide enough statistical precision because of the small number of sample observations in small geographical regions. In such situations, it may be worth checking whether it is possible to use indirect estimation approaches based on linking models.

R workshop

This code will not produce any small area estimates, but it will help to introduce basic concepts of sampling. We will use the BigLucy database from the TeachingSampling package to illustrate how to obtain a probability sample from a finite population. This database contains some economic variables for a population of 85296 companies spread out over 100 counties (areas) of a fictitious country in a particular year. The aim of the exercise is 1) to select a stratified sample according to the size of the companies and 2) to obtain accurate estimates of the total income within each of the 100 counties. Note that the parameters of interest are given by:

$$t_{y,d} = \sum_{k \in U_d} y_k$$

where $t_{y,d}$ denotes the total income of the $d$-th county and $y_k$ is the income of the $k$-th company belonging to the $d$-th county. The whole population of companies in the county is denoted by $U_d$. The following code computes the total income for each county along with the number of companies belonging to each county. Finally, these parameters are saved in a new database named Results.

##### Setting things up #####

rm(list = ls())
setwd("/wherever your preferred location is")

library(TeachingSampling)
library(dplyr)
data(BigLucy)

# Total income per county
Total <- BigLucy %>%
  group_by(Zone) %>%
  summarise(Income. = sum(Income))

# Number of companies per county
N <- BigLucy %>%
  group_by(Zone) %>%
  summarise(N.county = n())

Results <- data.frame(N, Total$Income.)

# Checking the population total
(Total <- sum(Results$Total.Income.))

Now that the parameters of the finite population are computed, the time has come for us to select a sample. The sampling design we will use is stratified sampling, taking advantage of the classification of each company in the database according to its level: small, medium and big. The selected sample will be stored in an object named data.sam and the sampling weights will be saved in a column of that object called FEX.

##### Drawing a stratified sample #####

# Level is the stratifying variable
# Defines the size of each stratum
N1 <- table(BigLucy$Level)[["Big"]]
N2 <- table(BigLucy$Level)[["Medium"]]
N3 <- table(BigLucy$Level)[["Small"]]
Nh <- c(N1, N2, N3)
# Defines the sample size at each stratum (5% sampling fraction)
n1 <- round(N1 * 0.05)
n2 <- round(N2 * 0.05)
n3 <- round(N3 * 0.05)
nh <- c(n1, n2, n3)
# Draws a stratified sample
sam <- S.STSI(BigLucy$Level, Nh, nh)

data.sam <- BigLucy[sam, ]

# Expansion factor: stratum size over stratum sample size
data.sam$FEX <- NA_real_
data.sam$FEX[data.sam$Level == "Big"] <- Nh[1] / nh[1]
data.sam$FEX[data.sam$Level == "Medium"] <- Nh[2] / nh[2]
data.sam$FEX[data.sam$Level == "Small"] <- Nh[3] / nh[3]

save(data.sam, file = "data.sam.RData")

In summary, we have drawn a sample of 4265 companies: 145 big companies, 1290 medium companies and 2830 small companies. These companies are spread out through the 100 counties in the country. The Results database will store the information about those counties (how many companies are in each county and how many companies are in the sample for each county) along with different estimates of the total income of the counties.
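As a rough check of the expansion factors, under stratified simple random sampling the weight of a company in stratum $h$ is the stratum size divided by the stratum sample size. With the 5% sampling fraction used in the code above, this gives approximately:

$$FEX_h = \frac{N_h}{n_h} \approx \frac{1}{0.05} = 20, \qquad \sum_{k \in s} FEX_k \approx 4265 \times 20 = 85300 \approx 85296$$

The small gap only reflects the rounding of each $n_h$; with the exact ratios $N_h / n_h$ the weights add up to the population size.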

##### In summary #####

n <- data.sam %>%
  group_by(Zone) %>%
  summarise(n.county = n())

Results$n.county <- n$n.county
Results <- Results[c("Zone", "N.county", "n.county", "Total.Income.")]

save(Results, file = "Results.RData")

In the following post, we will use this sample to estimate the total income for each of the hundred zones in BigLucy’s population. 


Saturday, January 28, 2017

Regression to the mean (or at the end, people are not as smart as you could expect)

Francis Galton very cleverly coined the term "regression to (or towards) the mean", meaning that if a variable takes an extreme value on a first measurement, subsequent observed values of that same variable will tend to be closer to the average of its distribution. The classical example is height: a tall child will have (on average) parents who are less tall than he is. Moreover, extremely short parents tend to have children who are shorter than average, but in both cases the children tend to be closer to the mean than their parents were (Senn, 2011).
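A compact way to see this, assuming (for illustration only) that the parent and child measurements $X$ and $Y$ are jointly normal with the same mean $\mu$, the same variance, and correlation $\rho$ with $|\rho| < 1$:

$$E[Y \mid X = x] = \mu + \rho \, (x - \mu)$$

Since $|\rho| < 1$, the expected value for the child is closer to $\mu$ than $x$ is, whether $x$ lies above or below the mean.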

OK, this story should be widely known by the readers of this blog. However, I want to put forward another point of view. It comes from Rolf Tarrach (former president of the University of Luxembourg), who has written a book on logical reasoning entitled "The Pleasure of Deciding". He claims that regression to the mean is a phenomenon that occurs not only in body measurements but also in cognitive measurements. That is: smart parents will tend to have children who are not as smart as expected.

So if you consider yourself an intelligent person and you have decided to share your life with a smart mate, it is very likely that your children won't be smarter than the two of you. So, do not expect your children to be geniuses. That fact goes against the common expectation that children of smart parents should be even more intelligent. Of course, there are some exceptions, such as the Bach family or the Bernoulli family. But those are isolated deviations from the normality of real life.

I want to finish with this story about the playwright George Bernard Shaw and the dancer Isadora Duncan. She told him: “Would it not be wonderful if we could have a child who had your brains and my beauty?” He replied: “Yes, but suppose the child had your brains and my beauty!”

PS: Speaking of geniuses, Koenker (1998) claims that Galton managed to invent, in a single plot, not only regression but also bivariate kernel density estimation.

Monday, January 16, 2017

Multilevel regression with poststratification (Gelman's MrP) in R - What is this all about?

Multilevel regression with poststratification (MrP) is a useful technique to predict a parameter of interest within small domains by modeling the mean of the variable of interest conditional on poststratification counts. This method (or set of methods) was first proposed by Gelman and Little (1997) and is widely used in political science, where voting intention is modeled conditional on the interaction of classification variables.

The aim of this methodology is to provide reliable estimates for strata based on census counts. For those who have some background in survey sampling, this method should look very similar to the raking method, where sampling weights are adjusted to known census cell counts. However, a significant difference with raking is that MrP is a model-based approach, rather than a design-based method. This way, even in the presence of a (maybe complex) survey design, MrP does not take it into account for inference. In other words, the sampling design is considered ignorable. So, the probability measure that governs the whole inference is based on modeling voting intention (the variable of interest) as a function of demographic categories (the auxiliary variables).

Is it a major technical issue to ignore the complex survey design? Yes, because in any case a probability measure is used to draw the sample. However, when it comes to voting intention (a major area where this technique is used), we rarely find a sophisticated complex design. Moreover, studies of this kind are rarely based on probability samples. So, if the survey lacks a proper sampling plan, it is better to model the response variable.

Therefore, the ultimate goal of this technique is to estimate a parameter of interest (totals, means, proportions, etc.) for all of the strata (domains, categories or subgroups) in a finite population. From now on, let's assume that:

  1. a population is divided into $H$ strata of interest (for example, states),
  2. the parameters of interest are the means (the same rules apply for proportions) in each stratum, $\theta_h$ ($h=1, \ldots, H$),
  3. every stratum $h$ is cross-classified by some demographics into cells $j \in h$ (from now on called post-strata), and every population count $N_j$ is known, and
  4. all of the post-strata means $\mu_j$ can be estimated by using some statistical technique, such as multilevel regression.

This way, the mean in stratum $h$ ($\theta_h$), is defined as a function of means in post-strata $j$ ($\mu_j$), and post-strata counts ($N_j$):

$$\theta_h = \frac{\sum_{j \in h} N_j \mu_j }{\sum_{j \in h} N_j}$$
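As a small numerical sketch of this formula, consider one stratum split into three post-strata. The counts `N_j` and the means `mu_j` below are hypothetical values chosen for illustration.

```r
# Post-strata counts N_j within one stratum h (hypothetical census counts)
N_j <- c(big = 50, medium = 200, small = 750)
# Post-strata means mu_j (hypothetical model predictions)
mu_j <- c(big = 1200, medium = 700, small = 400)

# Stratum mean: post-strata means weighted by their population counts
theta_h <- sum(N_j * mu_j) / sum(N_j)
theta_h
```

The result is pulled towards the mean of the largest post-stratum, because the weights are the population counts and not the sample counts.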

The first part of MrP is a multilevel regression (MR). Models of this kind are a particular case of mixed effects models. The core components of this phase are

  • the variable of interest (for example, voting intention), 
  • the auxiliary variables (classification on census demographic cells) and, 
  • the random effects (strata of interest, that usually are states or counties).

The second part of MrP is cell poststratification (P), and the predicted response variable is aggregated into states and adjusted by corresponding poststratification weights.

R workshop

Let’s consider the Lucy database of the TeachingSampling package. This population contains some economic variables of 2396 industrial companies in a particular year. Assume that we want to estimate the mean income of industries by each of the five existing zones (strata of interest) on that database. This way, our parameters of interest are $\theta_1, \ldots, \theta_5$.

Now, we also know that the population is divided into three levels (small, medium and big industries) and we have access to the total number of industries within each cross group. That is, we know exactly how many small industries are on each of the five zones, and how many medium industries are on each of the five zones, and so on. 

The following code shows how to load the database and obtain the cell counts. 

> rm(list = ls())
> set.seed(123)
> library(TeachingSampling)
> library(dplyr)
> library(lme4)
> data("Lucy")
> # Number of industries per level
> table(Lucy$Level)

   Big Medium  Small
    83    737   1576
> # Number of industries per zone
> table(Lucy$Zone)

  A   B   C   D   E
307 727 974 223 165
> # Size of post-strata
> (Np <- table(Lucy$Level, Lucy$Zone))
           A   B   C   D   E
  Big     30  13   1  16  23
  Medium 180 121 111 187 138
  Small   97 593 862  20   4

Of course, this technique works over a selected sample. That's why we are going to select a random sample of size $n = 1000$. We can also create some tables showing the counts on the sample.

> # A sample is selected
> SLucy <- sample_n(Lucy, size = 1000)
> table(SLucy$Level)

   Big Medium  Small
    33    280    687
> table(SLucy$Zone)

  A   B   C   D   E
130 295 426  86  63

The first step of MrP is the multilevel regression, used to estimate the post-strata means. The following code shows how to estimate them by using the lmer function. The object Mupred contains the estimated $\mu_j$ for each level (rows) within each zone (columns).

> # Step 1: <<MR>> - Multilevel regression
> M1 <- lmer(Income ~ Level + (1 | Zone), data = SLucy)
> coef(M1)
$Zone
  (Intercept) LevelMedium LevelSmall
A    1265.596   -579.1851  -893.8958
B    1138.337   -579.1851  -893.8958
C    1189.285   -579.1851  -893.8958
D    1248.658   -579.1851  -893.8958
E    1284.322   -579.1851  -893.8958

attr(,"class")
[1] "coef.mer"
> SLucy$Pred <- predict(M1)
> # Summary
> grouped <- group_by(SLucy, Zone, Level)
> sum <- summarise(grouped, mean2 = mean(Pred))
> (Mupred <- matrix(sum$mean2, ncol = 5, nrow = 3))
          [,1]      [,2]      [,3]      [,4]      [,5]
[1,] 1265.5959 1138.3370 1189.2852 1248.6575 1284.3224
[2,]  686.4107  559.1518  610.1000  669.4724  705.1373
[3,]  371.7001  244.4412  295.3894  354.7618  390.4267

Now that we have estimated the post-strata means, it’s time to weight every post-stratum by its corresponding count in order to obtain an estimate of the mean income by zone. As we know the size of each post-stratum, we simply use colSums to obtain the MrP estimator of the parameters of interest.

> # Step 2: <<P>> - Post-stratification
> # Mean income estimation per zone
> colSums(Np * Mupred) / table(Lucy$Zone)

       A        B        C        D        E
643.5724 312.8052 332.1726 682.8031 778.2428

How accurate was the estimation? It was good compared to the true parameters on the finite population. 

> # True parameters
> aggregate(Lucy$Income, by = list(Lucy$Zone), FUN = mean)
  Group.1        x
1       A 652.2834
2       B 320.7469
3       C 331.0195
4       D 684.9821
5       E 767.3879


Sunday, January 1, 2017

3PL models viewed through the lens of total probability theorem (updated)

As I am currently the NPM for PISA in Colombia, I must attend several meetings dealing with the proper implementation of this assessment in my country. Few of them are devoted to the analysis of this kind of data (coming from IRT models). As usual, the OECD has hired organizations with high technical standards. The institute that handles all this data and takes part in the analysis is ETS (Educational Testing Service).

PISA and ETS are changing from Rasch models to 2PL models. That involves a significant technical effort to maintain comparability over time. I had the opportunity to talk with some experts from ETS last year, and I asked them the following question: why consider 2PL models instead of 3PL models? Well, the answer was not easy, and I do not intend to explain it here in detail, in part because I am still convinced that the advantages of 3PL models exceed those of 2PL models. However, they pointed me to a recent paper entitled Is There Need for the 3PL Model? Guess What?

The article was written by Matthias von Davier from ETS. I liked the way von Davier presented 3PL models as coming from a total probability setup involving 2PL models. Consider the following hierarchical structure: first, the test-taker decides whether or not to answer the item by guessing; then, if not guessing, he/she uses his/her ability to find the correct response. So, the stochastic process behind this structure can be easily shown in a tree diagram:

[Figure: tree diagram of the guessing / no-guessing response process]


Remember that 3PL models can be written as:

$$P_{3PL}(x=1) = P(Guess) + P(NoGuess) \times P_{2PL}(x=1|NoGuess)$$

Note that, per the model, once the student has chosen to answer by guessing, a correct answer is always found (kind of weird, isn’t it?). A major criticism of 3PL models is related to this last point. In R, you can estimate 3PL models with the mirt package. For example, using the LSAT7 data, we can estimate the guessing parameter of the second item.

library(mirt)

# Expand the tabulated LSAT7 response patterns into one row per test-taker
data <- expand.table(LSAT7)
# Fit a unidimensional 3PL model to all items
md2 <- mirt(data, 1, itemtype = '3PL')
# Parameters of item 2 in the traditional IRT metric (a, b, g, u)
coef(md2, IRTpars = TRUE)$Item.2

We found that the guessing parameter is estimated as 0.295. This way, the model is specified as:

$$P_{3PL}(x=1) = 0.295 + 0.705 \times P_{2PL}(x=1|NoGuess)$$
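As a rough illustration of this decomposition, the sketch below builds the 3PL curve from a 2PL curve with the total probability formula. Only the guessing value 0.295 comes from the fit above; the discrimination `a`, the difficulty `b`, and the function names `p_2pl` and `p_3pl` are hypothetical and chosen just for the example.

```r
# 2PL probability of a correct response for ability theta
p_2pl <- function(theta, a, b) plogis(a * (theta - b))

# 3PL via total probability:
# P(guess) * 1 + P(no guess) * P_2PL(correct | no guess)
p_3pl <- function(theta, a, b, g) g + (1 - g) * p_2pl(theta, a, b)

g <- 0.295   # guessing estimate reported above for item 2
a <- 1       # hypothetical discrimination
b <- 0       # hypothetical difficulty

theta <- c(-3, 0, 3)
p_3pl(theta, a, b, g)
```

Whatever the ability, the probability never drops below the guessing value, which is exactly the lower asymptote of the 3PL curve.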

P.S.: Alexander Robitzsch pointed me to this paper (Aitkin, 2006), which proposes an alternative 3PL model that aims to address this critique.

Saturday, December 24, 2016

Computing Sample Size for Variance Estimation

The R package samplesize4surveys contains functions to calculate sample sizes for estimating proportions, means, differences of proportions, and even differences of two means. It also allows calculating the sampling error and the power level for a fixed sample size.

Here four functions are introduced for estimating a population variance and for testing statistical hypotheses about this parameter of interest. The functions are described below:

  1. Function ss4S2 computes the sample size for estimating $s^2_{y_U}$ subject to a given coefficient of variation or relative margin of error. It also lets the user plot the coefficient of variation and the margin of error as a function of the sample size, which eases the decision about $n$.
  2. Function ss4S2H computes the sample size for estimating $s^2_{y_U}$ subject to a given power level to detect a population variance greater than the value set in the null hypothesis. It also lets the user plot the power level as a function of the sample size.
  3. Function e4S2 computes the coefficient of variation and the margin of error for a given sample size. It also produces a plot similar to that of ss4S2.
  4. Function b4S2 computes the power level for a fixed sample size. It also produces a plot similar to that of ss4S2H.

To use the above functions, install the package from the Comprehensive R Archive Network (CRAN) and load it. To do so, type the following lines in the console:

install.packages("samplesize4surveys")
library(samplesize4surveys)

For example, the following line gives the sample size needed to estimate the variance of a characteristic of interest in a finite population (with a coefficient of kurtosis of one) so as to reach an estimated coefficient of variation of at most 5% and a relative margin of error of 3%.

ss4S2(N = 10000, K = 1, CV = 0.05, me = 0.03, DEFF = 2, plot = TRUE)

[Figure: output of the ss4S2 call]

On the other hand, as the package is constantly updated, the authors have set up a repository where users can try the newest features and interact with the academic community to correct possible errors in the code and improve the efficiency of the functions, among other things. To access this development version, type the following lines in R.


In this paper you can find the mathematical background behind those R functions. 

Saturday, December 3, 2016

Highlighting R code for the web

When blogging about statistics and R, it is very useful to distinguish the body text from the R code. I used to handle this by highlighting the code, and Pretty-R, from Revolution Analytics, was a valuable tool for this. However, as you may know, Microsoft acquired that company, and this feature (dressing R code for the web) is no longer available.

After some searching, I found this online syntax highlighter and it seems to work pretty well. Besides, it lets you select from different styles, and you can even choose among a lot of programming languages.

How important is that variable?

When modeling any phenomenon by including explanatory variables that are highly related to the variable of interest, one question arises: which of the auxiliary variables has the greatest influence on the response? I am not writing about significance testing or anything like that. I am just thinking like a researcher who wants to know the ranking of the variables that influence the response and their relative weight.

There are a variety of methods that try to answer that question. The one that motivates this post is very simple: remove the measurement units from the variables. Assume a linear model with the following structure (for the sake of simplicity, assume only two explanatory variables):

$$y = \beta_1 x_1 + \beta_2 x_2 + \varepsilon$$

If you take this model as true and $\beta_i > 0$, the influence of variable $x_i$ on the response $y$ can be assessed once the measurement units are removed from the variables. One option is to fit a model on the standardized variables (explanatory and response) and then compare the regression coefficients directly. Another option is the following expression:

$$I(i) = \frac{\beta_i}{sd(\beta_i)} = \beta_i\frac{ sd(x_i)}{sd(y)}$$

For example, consider the model $y = -500 x_1 + 50 x_2 + \varepsilon$, where both explanatory variables share the same scale. Then the relative importance of the first and second variables is around 500/(500+50) = 0.9 and 50/(500+50) = 0.1, respectively (the sign is ignored). The following code shows how to perform this simple analysis in R.

set.seed(123)  # for reproducibility
n <- 10000

x1 <- runif(n)
x2 <- runif(n)
y <- -500 * x1 + 50 * x2 + rnorm(n)

model <- lm(y ~ 0 + x1 + x2)

# 1a. Coefficients divided by their standard errors
sd.betas <- summary(model)$coefficients[, "Std. Error"]
betas <- coef(model)
imp <- abs(betas) / sd.betas
imp <- imp / sum(imp)
imp

# 1b. Standardized betas
imp1 <- abs(coef(model)[1] * sd(x1) / sd(y))
imp2 <- abs(coef(model)[2] * sd(x2) / sd(y))

imp1 / (imp1 + imp2)
imp2 / (imp1 + imp2)

# 2. Standardized variables
model2 <- lm(I(scale(y)) ~ 0 + I(scale(x1)) + I(scale(x2)))
abs(coef(model2)) / sum(abs(coef(model2)))

Sunday, November 20, 2016

Lord's Paradox in R

In an article called A Paradox in the Interpretation of Group Comparisons published in Psychological Bulletin, Lord (1967) made famous the following controversial story:

A university is interested in investigating the effects of the nutritional diet its students consume in the campus restaurant. Various types of data were collected including the weight of each student in the month of January and their weight in the month of June of the same year. The objective of the University is to know if the diet has greater effects on men than on women. This information is analyzed by two statisticians.

The first statistician observes that at the end of the semester (June), the average weight of the men is identical to their average weight at the beginning of the semester (January). This situation also occurs for women. The only difference is that women started the year with a lower average weight (which is to be expected given their build). On average, neither men nor women gained or lost weight during the course of the semester. The first statistician concludes that there is no evidence of any significant effect of diet (or any other factor) on student weight. In particular, there is no evidence of any differential effect between the sexes, since neither group shows systematic differences.

The second statistician examines the data more carefully. He notes that there is a group of men and women who started the semester with the same weight. This group consists of thin men and overweight women. He observes that, on average, these men gained weight and these women lost weight. The second statistician concludes that, controlling for the initial weight, the university diet has a positive differential effect on men relative to women: among men and women with the same initial weight, men gained weight and women lost weight on average.

The following chart shows the reasoning of both statisticians in dealing with the problem. Note that the black line is a 45-degree line, the green points are the data coming from the men, and the red ones from the women.

[Figure: scatter plot of final weight versus initial weight, by sex]

The reasoning of the first statistician focuses on the expectations of both distributions, specifically on the coordinates (x = 60, y = 60) for females and (x = 70, y = 70) for males, where the black, red and green lines appear to coincide. The reasoning of the second statistician is limited to the region where the red and green dots overlap, specifically the space given by x = (60, 70), y = (60, 70). Suppose we have access to this dataset as shown in the following illustration, where the first column denotes the initial weight of the students, the second column the final weight, the third column the difference between the two weights, and the last one the sex of the student.

[Figure: first rows of the dataset]

The findings of the first statistician are obtained through a simple regression analysis that takes the difference between weights as the response variable. It yields a regression coefficient equal to zero for the variable sex, which indicates that there are no significant differences in weight change between men and women.

[Figure: regression output for the first statistician's model]

The findings of the second statistician are obtained through an analysis of covariance, taking the final weight as the response variable and sex and initial weight as covariates. This method yields a regression coefficient of 5.98 for sex, which implies a significant difference in final weight between the sexes once the initial weight is held fixed.

[Figure: regression output for the second statistician's model]

For Imbens and Rubin (2015), both are right when it comes to describing the data, although both lack sound reasoning for establishing any kind of causality between the university diet and the weight loss or gain of the students. Regardless of this, I still find more interesting the analysis that arises from comparing men and women who started with the same weight (i.e. all data restricted to x = (60, 70), y = (60, 70)).

R workshop

Lord's paradox summarizes the analysis of two statisticians who study the average weight of some students within a particular university. At the end of the semester (June), the average weight of the men is identical to their average weight at the beginning of those six months (January). This situation also occurs for women. The only difference is that women started the year with a lower average weight (which is to be expected given their build). On average, neither men nor women gained or lost weight during the semester.

To perform the simulation, we assume that the final weight of both men and women follows a linear relationship with the initial weight. Thus, it is assumed that $y_{2i}^M = \beta_0^M + \beta_1 y_{1i}^M + \varepsilon_i$ for the weight of women, and $y_{2i}^H = \beta_0^H + \beta_1 y_{1i}^H + \varepsilon_i$ for the weight of men, where $y_{1i}^M$ denotes the weight of the $i$-th female at the beginning of the semester and $y_{2i}^M$ denotes her weight at the end of the semester. The notation for men (H) follows the same logic.

Now, note that, given their build, men must weigh more than women on average. Suppose that the average weight of men is equal to that of women plus a constant $c$. In addition, the mean weight in each group is the same at both times. Then, we have $\bar{y}^M = \beta_0^M + \beta_1 \bar{y}^M$ and $\bar{y}^H = \beta_0^H + \beta_1 \bar{y}^H = \beta_0^H + \beta_1 (\bar{y}^M + c).$ Hence, after some algebra, we have that $\beta_0^M = (1 - \beta_1) \bar{y}^M$ and $\beta_0^H = \bar{y}^H - \beta_1 (\bar{y}^M + c)$.
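Spelling out that algebra may help. Solving each mean equation for its intercept and using $\bar{y}^H = \bar{y}^M + c$ gives

$$\beta_0^M = \bar{y}^M - \beta_1 \bar{y}^M = (1 - \beta_1)\,\bar{y}^M$$

$$\beta_0^H = \bar{y}^H - \beta_1 (\bar{y}^M + c) = (1 - \beta_1)(\bar{y}^M + c)$$

$$\beta_0^H - \beta_0^M = (1 - \beta_1)\, c$$

So, at any fixed initial weight, the model places men $(1 - \beta_1) c$ units above women. With $\beta_1 = 0.4$ and $c = 10$ (the values `beta1` and `b` in the code that follows), this gap is $0.6 \times 10 = 6$, which agrees closely with the coefficient of 5.98 reported above for the second statistician.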

The following code replicates a set of data that follows the relationship proposed by Lord.

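set.seed(123)  # for reproducibility

N <- 1000   # students per group
b <- 10     # constant c: average weight gap between men and women
l <- 50     # lower bound of the initial weight of women
u <- 70     # upper bound of the initial weight of women

# Initial weights (January)
Mujer1 <- runif(N, l, u)
Hombre1 <- Mujer1 + b

# Common slope and intercepts that keep each group mean unchanged
beta1 <- 0.4
Mujerb0 <- (1 - beta1) * mean(Mujer1)
Hombreb0 <- mean(Hombre1) - beta1 * (mean(Mujer1) + b)

# Final weights (June)
sds <- 1
Mujer2 <- Mujerb0 + beta1 * Mujer1 + rnorm(N, sd = sds)
Hombre2 <- Hombreb0 + beta1 * Hombre1 + rnorm(N, sd = sds)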

The graph can be done with the following piece of code:

library(ggplot2)

# inicio = initial weight, final = final weight, sexo: 0 = women, 1 = men
datos <- data.frame(inicio = c(Mujer1, Hombre1), final = c(Mujer2, Hombre2))
datos$dif <- datos$final - datos$inicio
datos$sexo <- c(rep(0, N), rep(1, N))
ggplot(data = datos, aes(inicio, final, color = factor(sexo))) +
  geom_point() + stat_smooth(method = "lm")  +
  geom_abline(intercept = 0, slope = 1) +
  ggtitle("Lord's paradox") + theme_bw()
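The two analyses described in the story can be sketched on simulated data as follows. This block regenerates the data so that it runs on its own; the seed and the object names `fit1` and `fit2` are my own choices, and the exact numbers will differ from the screenshots in the post.

```r
set.seed(2016)

# Simulate Lord's data as in the workshop code (sexo: 0 = women, 1 = men)
N <- 1000
b <- 10
beta1 <- 0.4
Mujer1 <- runif(N, 50, 70)
Hombre1 <- Mujer1 + b
# Intercepts (1 - beta1) * group mean keep each group mean unchanged
Mujer2 <- (1 - beta1) * mean(Mujer1) + beta1 * Mujer1 + rnorm(N)
Hombre2 <- (1 - beta1) * mean(Hombre1) + beta1 * Hombre1 + rnorm(N)

datos <- data.frame(
  inicio = c(Mujer1, Hombre1),
  final  = c(Mujer2, Hombre2),
  sexo   = rep(c(0, 1), each = N)
)
datos$dif <- datos$final - datos$inicio

# First statistician: regress the weight change on sex
fit1 <- lm(dif ~ sexo, data = datos)

# Second statistician: analysis of covariance on the final weight
fit2 <- lm(final ~ sexo + inicio, data = datos)

coef(fit1)["sexo"]  # close to 0
coef(fit2)["sexo"]  # close to (1 - beta1) * b = 6
```

Both fits describe the same data; they simply answer different questions, which is the heart of the paradox.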

Monday, November 14, 2016

Intercept or not? That's the question!

My current passion is statistical modeling. While each model requires the researcher to make a proper contextualization of the problem he/she is addressing, which means that no model is equal to another, there is a common question that the researcher should answer before estimating model parameters.

Do I fit the model with an intercept or not?

While seeking goodness of fit, the researcher is often tempted to run automated variable selection procedures (e.g. stepwise, forward, backward). With luck, these methods will hand you "the best model" (based on the highest coefficient of determination, or the lowest AIC, BIC, or DIC). Call me old-fashioned and retrogressive, but I have always been a little reluctant to throw the data into the software and wait for the best model to come out automatically.

Returning to the subject of this entry, I will highlight the importance of including or omitting the intercept in a model. For this, I will consider the following cases:

1. If the response variable Y is continuous:

When the explanatory variable X is also continuous. This is the classic case of a linear regression model, where including the intercept allows the mean value of Y to differ from zero when X = 0; that mean value corresponds to the estimate of the intercept. However, when excluding the intercept, we are demanding that the mean value of Y be 0 when X = 0. Thus the inclusion or exclusion of the intercept, in many cases, depends on the nature and interpretation of the variables.
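A small simulated sketch of this case follows. The data-generating values (true intercept 5, slope 2) and all object names are hypothetical and serve only to show what each fit predicts at X = 0.

```r
set.seed(1)

# Hypothetical data whose true mean at X = 0 is 5, not 0
x <- runif(200, 0, 10)
y <- 5 + 2 * x + rnorm(200)

with_int <- lm(y ~ x)      # intercept estimated from the data
no_int   <- lm(y ~ 0 + x)  # regression line forced through the origin

# Predicted mean of Y at X = 0 under each fit
p0_with <- predict(with_int, newdata = data.frame(x = 0))
p0_no   <- predict(no_int, newdata = data.frame(x = 0))
c(with_intercept = unname(p0_with), without_intercept = unname(p0_no))

# The slope absorbs the missing intercept when the line is forced through 0
c(with_intercept = unname(coef(with_int)["x"]),
  without_intercept = unname(coef(no_int)["x"]))
```

When the mean of Y at X = 0 is genuinely different from zero, dropping the intercept distorts the slope, so the choice should follow from what the variables mean.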

When the explanatory variable X is categorical. Without loss of generality, let's assume it as dichotomous (two levels); in this case, when fitting a regression line including the intercept, one can define a dummy variable representing the first level of the variable X, and the model is set as

$Y_i = \beta_0 + \beta_1D_{1i} + E_i$

Where D1 = 1 for units belonging to the first level of X and D1 = 0 for units belonging to the second level of X. The interpretation of this model is as follows: for individuals belonging to level 1, the mean value of Y is given by $\beta_0 + \beta_1$. For units belonging to level 2, the mean value of Y is given by $\beta_0$. The coefficient $\beta_1$ is therefore the difference between the means of these two levels. If its estimate is significant, it implies that the variable X does have considerable influence on Y. That is, the mean value of Y varies significantly across the levels of X.

On the other hand, if the regression is fitted without an intercept, two dummy variables must be created, one for each level of X. The model is formulated as

$Y_i = \beta_0D_{1i} + \beta_1D_{2i} + E_i$

For units at the first level (D1 = 1), the mean value of Y is given by $\beta_0$ and, for units at the second level (D2 = 1), the mean value of Y is given by $\beta_1$. Thus, even if the estimate of either $\beta_0$ or $\beta_1$ is significant, that does not imply that X has any influence over Y. All we can claim in this model is that the two parameters are significantly different from zero. So, if you really want to establish whether X influences Y, then omitting the intercept would not be a good choice.
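The two parameterizations can be compared on simulated data. In this sketch the group means (12 and 10), the level labels and the object names are hypothetical.

```r
set.seed(2)

# Hypothetical dichotomous X with different group means of Y
n <- 100
x <- rep(c("level1", "level2"), each = n)
y <- c(rnorm(n, mean = 12), rnorm(n, mean = 10))

# Dummy variables for each level of X
D1 <- as.numeric(x == "level1")
D2 <- as.numeric(x == "level2")

fit_int   <- lm(y ~ D1)           # Y = b0 + b1 * D1 + E
fit_noint <- lm(y ~ 0 + D1 + D2)  # Y = b0 * D1 + b1 * D2 + E

group_means <- tapply(y, x, mean)

coef(fit_int)    # level 2 mean, then (level 1 mean - level 2 mean)
coef(fit_noint)  # level 1 mean, then level 2 mean
group_means

# Only the model with intercept tests the difference between levels
summary(fit_int)$coefficients["D1", ]
```

Both fits reproduce the same group means; only the first one has a coefficient whose test answers "does X matter?".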

2. If the response variable Y is discrete:

When the explanatory variable X is continuous. In this case, the fitted model is a logistic regression, which models the probability of success $p_i = Pr(Y_i = 1)$ as:

$logit (p_i) = \beta_0 + \beta_1X_i$

If the model includes an intercept, the estimate of $\beta_0$ can be used to estimate the probability of success when X = 0, since $p_i = \frac{\exp(\beta_0)}{1 + \exp(\beta_0)}$. On the other hand, if the estimate of $\beta_1$ is not significant, that implies that the values of X do not influence the chances of success or failure of Y. If the estimate of $\beta_1$ is significant with a positive (or negative) value, it indicates that an increase in the variable X implies an increase (or decrease) in the probability of success of Y. Note that this interpretation is the same when the regression is fitted without an intercept.

When the explanatory variable is categorical (let's assume it as a dichotomous variable). In this case, when fitting a regression including the intercept, a dummy variable representing the first level of the variable is created and the model is defined as

$logit (p_i) = \beta_0 + \beta_1D_{1i}$

The interpretation of this model is as follows: for units in the first level of X, $logit(p_i) = \beta_0 + \beta_1$. For units in the second level of X, $logit(p_i) = \beta_0$. Thus, if $ \beta_1$ is significant, it indicates that $logit(p_i)$ is different between levels of variable X, and we can conclude that X does have an important influence on Y.

On the other hand, if the intercept is not taken into account, two dummies are produced (representing X levels) and the model is formulated as

$logit (p_i) = \beta_0D_{1i} + \beta_1D_{2i}$

For this model, the estimates of $\beta_0$ and $\beta_1$ represent the values of $logit(p_i)$ in the two levels of X. Thus, the significance of X over Y cannot be assessed via $\beta_0$ or $\beta_1$. Those coefficients give no information on the influence of X on Y.
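The same comparison for a binary response, as a minimal sketch: the success probabilities (0.6 and 0.3) and the object names are hypothetical, and `plogis()` is the inverse logit, $\exp(\beta)/(1 + \exp(\beta))$.

```r
set.seed(3)

# Hypothetical dichotomous X: level 1 (D1 = 1) and level 2 (D2 = 1)
n  <- 500
D1 <- rep(c(1, 0), each = n)
D2 <- 1 - D1
y  <- rbinom(2 * n, size = 1, prob = ifelse(D1 == 1, 0.6, 0.3))

fit_int   <- glm(y ~ D1, family = binomial)           # logit(p) = b0 + b1 * D1
fit_noint <- glm(y ~ 0 + D1 + D2, family = binomial)  # logit(p) = b0 * D1 + b1 * D2

# Observed success proportions in each level
p_hat <- c(level1 = mean(y[D1 == 1]), level2 = mean(y[D1 == 0]))

# With intercept: b0 gives the level 2 probability, b1 the difference in logits
plogis(coef(fit_int)[1])
coef(fit_int)["D1"]

# Without intercept: each coefficient is just the logit of one level
plogis(coef(fit_noint))
p_hat
```

As in the linear case, only the coefficient of D1 in the model with intercept measures the contrast between the two levels.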

In summary, when the explanatory variable is continuous, the interpretation of $\beta_1$ does not change whether the intercept is included or excluded. When the explanatory variable is discrete, however, we must consider whether or not the model includes the intercept, since the interpretation of $\beta_1$ changes. Also, if what you want is to know the influence of X on Y, it is necessary to include the intercept, putting aside (just for a moment) automated procedures.

Sunday, October 30, 2016

Data Literacy

Once again I have decided to pimp my blog. This time it is a significant change: the name. This blog began a long time ago. It was 2006; I was 22 years old, and I was enrolled in a Master of Science in Statistics. I had plenty of doubts about statistics, data science, and uncertainty (fortunately, some of those questions remain) and I decided to solve them by blogging.

Thus this blog was born, with the name "Apuntes de Estadística." Ten years later, this blog has become an important space for researchers, teachers, and students who, like myself, want to solve questions about variability and statistics.

Now, after hundreds of posts, a hundred thousand visitors per year, and a migration from Spanish to English and from WordPress to Blogger, it is time for me to revisit the very name of this blog. Lately, posts are not notes anymore; I am not only answering basic questions but discussing modeling, forecasting, data and even the epistemology of statistics.

As a consequence of what I currently do in my job, I am now strongly convinced (more than ever in my life) that the time predicted by H.G. Wells has not arrived yet. Statistical thinking is not part of our culture. It should be ingrained in every one of us, but it is not. I will put in just my two cents to build an effective citizenship through data literacy: the ability to communicate information from data.