## Wednesday, August 8, 2018

That is, under the second calibration scenario, all persons within a household share the same sampling weight, and that weight equals the household weight. This property holds only under the second scenario. Moreover, under the first scenario, men and women within the same household are guaranteed to have different expansion factors. From a theoretical standpoint this poses no problem, but some practitioners would like the final expansion factors to retain those properties of the sampling scheme. Estevao and Särndal (2006) present some theoretical properties of the second scenario.

References

Silva, P. L. d. N. 2004. «Calibration estimation: when and why, how much and how». Rio de Janeiro: Instituto Brasileiro de Geografia e Estatística.

Estevao, Victor, and Carl-Erik Särndal. 2006. «Survey Estimates by Calibration on Complex Auxiliary Information». International Statistical Review / Revue Internationale de Statistique 74 (2): 127-47.

## Monday, November 27, 2017

### Scatter plots in survey sampling

When analyzing survey data, you have to take into account the stochastic structure of the sample that produced the data. Plots and graphics are no exception. The main aim of such studies is to infer the behavior of the outcomes of interest in the finite population. For example, you may want to know how many people are in poverty. How about counting the poor people in the sample? I know, it is not a good idea. You have to use the sampling weights to estimate the total number of poor people in the population. The total error paradigm follows the same approach when estimating any parameter of the finite population: you have to use the sampling weights in order to obtain unbiased estimators. The idea behind this paradigm is the representative principle. If person $k$ is included in the sample with a sampling weight $w_k$, he/she represents himself/herself and $w_k - 1$ other people. That way, you obtain design-unbiasedness.

I am not a big fan of using descriptive sample statistics to analyze survey data because they can mask reality. Person $k$ is included in the sample, but he/she is not alone: behind person $k$ there are other people, not included in the sample, and you have to keep that in mind.

So, let’s apply that principle to scatter plots. I am using the Lucy population from the TeachingSampling package to illustrate the idea. The following code draws a $\pi PS$ sample.

library(TeachingSampling)
data(Lucy)
# The inclusion probability is proportional to Income
# The selected sample is of size n = 400
n <- 400
set.seed(123)
res <- S.piPS(n, Lucy$Income)
sam <- res[,1]
# The sample is stored in the data.sample object
data.sample <- Lucy[sam, ]
attach(data.sample)


The sampling weights will be stored in the wk column of data.sample. They will be useful for reproducing our finite population from the sample data.

# Pik.s is the inclusion probability of units in the sample
data.sample$Pik.s <- res[,2]
# wk is the sampling weight
data.sample$wk <- 1/data.sample$Pik.s

Now, let's make a plot of the sample data (with just 400 observations). Recall that this picture is somewhat misleading, because we want to know the behavior of the variables in the finite population.

library(ggplot2)
ggplot(data = data.sample, aes(x = Income, y = Employees)) +
geom_point()

The first option that comes to mind is to map the sampling weights to the size of the points in the scatter plot. However, this approach is not appealing to me, because it is not straightforward to visualize the entire finite population from this plot.

ggplot(data = data.sample, aes(x = Income, y = Employees, size = wk)) +
geom_point()

In order to make the finite population scatter plot from the survey sample, I will replicate each row of the data.sample object as many times as its (rounded) sampling weight wk. I am using the rep method for data frames from the mefa package to achieve this goal. So, the newLucy object is an attempt to mimic the finite population by using the selected sample.

library(mefa)
newLucy <- NULL
for(i in 1:nrow(data.sample)){
newLucy <- rbind(newLucy, rep(data.sample[i, ], round(data.sample$wk[i])))
}

newLucy <- as.data.frame(newLucy)
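The same replication idea can be sketched in base R without a loop, by repeating row indices instead of binding rows one at a time. This is only an illustration on a made-up three-row data frame (`toy`, with hypothetical weights); it is not the Lucy sample, and it is offered as an alternative sketch rather than as the method used above.

```r
# Toy sample: three units with hypothetical sampling weights (not Lucy data)
toy <- data.frame(Income    = c(100, 250, 900),
                  Employees = c(5, 12, 40),
                  wk        = c(3.2, 2.0, 1.4))

# Repeat each row index round(wk) times, then subset the data frame once
idx <- rep(seq_len(nrow(toy)), times = round(toy$wk))
toy_pop <- toy[idx, ]
rownames(toy_pop) <- NULL

nrow(toy_pop)  # size of the pseudo-population built from rounded weights
sum(toy$wk)    # weighted estimate of the population size (unrounded)
```

Because the weights are rounded, the number of replicated rows only approximates the sum of the weights.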


With the newLucy population in hand, I will make a scatter plot. Because I am replicating the rows of the sample data, I will add jitter to avoid overplotting. This way, the plot (with 2396 observations) looks as if it came from the finite population.

ggplot(data = newLucy, aes(x = Income, y = Employees)) +
geom_jitter(width = 15, height = 15)


## Tuesday, November 21, 2017

### dplyr and the design effect in survey samples


For those like me who are not R geeks, this trick may be of interest. The dplyr package is very useful for data manipulation and lets you extract valuable information from a data frame. For example, if you want to count how many humans have a particular hair color, you can run the following piece of code:

library(dplyr)

starwars %>% filter(species == "Human") %>%
group_by(hair_color) %>%
summarise(n = n())
| hair_color | n |
|---|---|
| auburn | 1 |
| auburn, grey | 1 |
| auburn, white | 1 |
| black | 8 |
| blond | 3 |
| brown | 14 |
| brown, grey | 1 |
| grey | 1 |
| none | 3 |
| white | 2 |

The former query returns a data frame, so you can use it in another query. For example, if you want to know the average number of humans per hair color, you can use summarise twice:

library(dplyr)

starwars %>% filter(species == "Human") %>%
group_by(hair_color) %>%
summarise(n = n()) %>%
summarise(x.b = mean(n))
x.b
3.5
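For comparison, here is a minimal base R sketch of the same two-step summary. The vector `hair` is a stand-in rebuilt from the counts printed above, not the starwars data frame itself, and the object names are made up for illustration.

```r
# Stand-in data rebuilt from the counts shown above (not the starwars data frame)
counts <- c(auburn = 1, "auburn, grey" = 1, "auburn, white" = 1, black = 8,
            blond = 3, brown = 14, "brown, grey" = 1, grey = 1, none = 3, white = 2)
hair <- rep(names(counts), times = counts)

n_by_color <- table(hair)  # first step: group sizes
x.b <- mean(n_by_color)    # second step: average group size
x.b
```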

Now, turning our attention to statistics, it is known that, when dealing with sample surveys, one measure of interest is the design effect defined as

$Deff \approx 1 + (\bar{m} - 1)\rho$

where $\bar{m}$ is the average cluster size and $\rho$ is the intraclass correlation coefficient. If you are dealing with survey data and you want to figure out the value of $\bar{m}$ and $\rho$, you can use dplyr. Let’s use the Lucy data of the samplesize4surveys package to show how you can do it.
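Before turning to the data, a quick arithmetic illustration of the approximation, using made-up values ($\bar{m} = 20$ and $\rho = 0.05$) that are not taken from Lucy:

$$Deff \approx 1 + (20 - 1) \times 0.05 = 1.95$$

So, with clusters of 20 units on average, even a modest intraclass correlation almost doubles the variance relative to a simple random sample of the same size.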

library(samplesize4surveys)
data(Lucy)

m <- Lucy %>% group_by(Zone) %>%
summarise(n = n()) %>%
summarise(m = mean(n))

rho <- ICC(y = Lucy$Taxes, cl = Lucy$Zone)$ICC

# m is a one-row data frame; m$m extracts the average cluster size without truncating it
DEFF <- 1 + (m$m - 1) * rho
DEFF

## Sunday, November 19, 2017

### Automatic output format in Rmarkdown

I am writing an Rmarkdown document with plenty of tables, and I want them in a decent format, e.g. kable. However, I don't want to format them one by one. For example, I have created the following data frame with dplyr:

data2 %>%
group_by(uf) %>%
summarise(n = n()) %>%
arrange(desc(n))

One solution would be to assign the result to an object in R, and then format it with the kable function.

t1 <- data2 %>%
group_by(uf) %>%
summarise(n = n()) %>%
arrange(desc(n))

knitr::kable(t1)

However, if your document has hundreds of these queries and you need a faster way to compile the document while keeping the kable style automatically, without naming each data frame or calling kable on it, you can use the printr package. Just add the following piece of code inside a chunk at the beginning of your document and voilà.

library(printr)

Now all of your data frames will have a decent style, and you do not need to worry about this issue. For example, I have knitted a presentation by using printr and the first code in this post.

## Thursday, June 15, 2017

### Sampling weights and multilevel modeling in R

So many things have been said about weighting, but in my personal view of statistical inference, you do have to weight. From a single statistic to a complex model, you have to weight, because the probability measure that induces the variation of the sample comes from an (almost always) complex sampling design that you should not ignore. Weighting is a complex issue that has been discussed by several authors in recent years. Social researchers have not reached a consensus about the appropriateness of weighting when it comes to fitting statistical models.
Angrist and Pischke (2009, p. 91) claim that "few things are as confusing to applied researchers as the role of sample weights. Even now, 20 years post-Ph.D., we read the section of the Stata manual on weighting with some dismay." Anyway, although researchers have not reached a consensus on when to weight, the reality is that you have to be careful when doing so. For example, when it comes to estimating totals, means or proportions, you can weight by the inverse of the inclusion probability, and it looks like every social researcher agrees to weight in order to estimate this kind of descriptive statistic. The rationale behind this practice is that every unit in the sample represents itself and many others that were not selected.

When using weights to estimate model parameters, you have to keep in mind the nature of the sampling design. For example, when estimating multilevel parameters, you have to take into account not only the weights of the final sampling units but also the weights of the first-stage sampling units. Let’s assume that you have a sample of students, selected from a national frame of schools. Then we have two sets of weights: the first one for schools (notice that one selected school represents itself as well as others not in the sample) and the second one for students.

Let’s assume that the finite population has 10,000 students and 40 schools, and that you have selected 500 students allocated in 8 schools. For the sake of simplicity, suppose that simple random sampling is used to select students (I know, this kind of sampling design is rarely used). Think about it: if you take into account only the student weights to fit your multilevel model, you will be estimating parameters with an expanded sample that represents 10,000 students allocated in a sample of just eight schools.
So, any conclusion stated will be wrong. For example, when performing a simple analysis of variance, the percentage of variance explained by the schools will be extremely low, because you are expanding the sample of students but not the sample of schools. Now, if you take into account both sets of weights (students and schools), you will find yourself fitting a model with expanded samples that represent 10,000 students and 40 schools (which is good).

Unfortunately, as far as I know, the R ecosystem lacks a package that performs this kind of design-based inference for fitting multilevel models. So, right now, we can unbiasedly estimate model parameters, but when it comes to estimating standard errors (from a design-based perspective) we need to use other computational resources and techniques, like the bootstrap or the jackknife. Because most applied statistical methods assume independence, they cannot be used directly to analyze this kind of data, given the dependency among sampled observation units. Inaccurate standard errors may be produced if no adjustment is made when analyzing complex survey data.

When it comes to educational studies (based on large-scale assessment tests), we can distinguish (at least) four sets of weights: total student weight, student house weight, student senate weight and school weight. The TIMSS team claims that total student weight is appropriate for single-level student-level analyses. Student house weight, also called normalized weight, is used when analyses are sensitive to sample size; it is essentially a linear transformation of total student weight so that the sum of the weights is equal to the sample size. Student senate weight is used when analyses involve more than one country, because it is total student weight scaled in such a way that all students’ senate weights sum to 500 (or 1000) in each country.
School weight should be used when analyzing school-level data, as it is the inverse of the probability of selection for the selected school.

## R workshop

We will use the student house weight to fit a multilevel model. As stated before, the sum of these weights is equal to the sample size. For the R workshop, we will use PISA 2012 data (available on the OECD website). I have filtered the data for Colombia and saved it in a format directly compatible with R (available here). Let’s load the data into R.

rm(list = ls())

library(dplyr)
library(ggplot2)
library(lme4)

setwd("/your working directory")
load("PisaCol.RData")

head(names(PisaCol))
summary(PisaCol$STRATUM)
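The rescalings behind the house and senate weights described above can be sketched on a toy example. The five weights below are hypothetical (they are not PISA weights), and the object names are made up for illustration; the point is only that both are constant multiples of the total student weight.

```r
# Hypothetical total student weights for a toy sample of five students
w_total <- c(120, 80, 200, 150, 50)
score   <- c(410, 385, 450, 430, 370)  # made-up scores
n <- length(w_total)

# House weight: rescale so that the weights add up to the sample size
w_house <- n * w_total / sum(w_total)

# Senate weight: same idea, rescaled to add up to 500
w_senate <- 500 * w_total / sum(w_total)

sum(w_house)   # equals the sample size n
sum(w_senate)  # equals 500

# A weighted mean does not change under a constant rescaling of the weights
weighted.mean(score, w_total)
weighted.mean(score, w_house)
```

Since the rescaling is constant, point estimates such as weighted means are unaffected; what changes is the apparent sample size seen by procedures that are sensitive to it.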

Now, we create an object containing the student house weights and summarize some results based on that set of weights. Notice that the total student weights are stored in the column W_FSTUWT of the PISA database. Recall that I am working with the first plausible value of the mathematics test, and that score will be our (dependent) variable of interest for the modeling.

n <- nrow(PisaCol)
PisaCol$W_HOUSEWHT <- n * PisaCol$W_FSTUWT / sum(PisaCol$W_FSTUWT)

PisaCol %>%
group_by(STRATUM) %>%
summarise(avg1 = weighted.mean(PV1MATH, w = W_HOUSEWHT),
avg2 = weighted.mean(PV2MATH, w = W_HOUSEWHT))

We use the function lmer of the lme4 package to estimate the coefficients of the null model (where schools enter as random intercepts).

##################
### Null model ###
##################

HLM0 <- lmer(PV1MATH ~ (1 | SCHOOLID), data = PisaCol, weights = W_HOUSEWHT)
coef(HLM0)
summary(HLM0)

# 62.81% of the variance is due to students
# 37.19% of the variance is due to schools
100 * 3569 / (3569 + 2113)

As you may know, the PISA index of economic, social and cultural status (ESCS) has a strong relationship with student achievement, so it is a good idea to control for this variable in a more refined model.

##################
### ESCS model ###
##################

HLM1 <- lmer(PV1MATH ~ ESCS + (1 + ESCS | SCHOOLID), data = PisaCol, weights = W_HOUSEWHT)
coef(HLM1)
summary(HLM1)

# After controlling for ESCS, 34.58% of the variance is due to schools
100 * (96.12 + 1697.36) / (3392.58 + 96.12 + 1697.36)

So then, in summary: we have 3569 units of within-school variance (63%); after controlling for ESCS, that figure drops to 3392 units (student background explains 5% of that variation). We have 2113 units of between-school variance (37%); after controlling for ESCS, that figure drops to 1793 (student background explains 15% of that variation). The following code makes a graph that summarizes the relationship between student achievement and ESCS.
ggplot(data = PisaCol, aes(x = ESCS, y = PV1MATH, size = W_HOUSEWHT)) +
theme_minimal() + geom_point() + theme(legend.position="none")

ggplot(data = PisaCol, aes(x = ESCS, y = PV1MATH, size = W_HOUSEWHT)) +
geom_point(aes(colour = SCHOOLID)) + theme(legend.position="none")

## Monday, April 17, 2017

### Small Area Estimation 101

Small area estimation (SAE) has become a widely used technique in official statistics since the last decade of the past century. When the sample size is not enough to provide reliable estimates at a very particular level, the power of models and auxiliary information must be applied without hesitation. In a nutshell, SAE tries to exploit similarity and borrows strength from available information. I will write some posts to present, step by step, the fundamentals of SAE and how it can be implemented in R. The first post (the one you are reading now) is about basic concepts such as sampling and databases; the second and third posts will deal with direct and indirect estimates, respectively; the fourth post will introduce model-assisted estimation; and finally, the fifth post will deal with the Fay-Herriot method. Let's begin.

First of all, the Australian Bureau of Statistics declares that small area estimation refers to methods of producing sufficiently reliable estimates for geographic areas that are too fine to obtain with precision using direct survey estimation methods. By direct estimation, we mean classical design-based survey estimation methods that utilize only the sample units contained in each small area. Small area estimation methods are used to overcome the problem of small sample sizes and to produce small area estimates that improve the quality of the direct survey estimates obtained from the sample in each small area. The more sophisticated of these methods work by taking advantage of various relationships in the data, and involve, either implicitly or explicitly, a statistical model to describe these relationships.
Now, I want to reproduce a clarifying explanation from Dr. Little (paraphrasing Groves) about SAE and its use in survey sampling. They claim that regression estimates provide relatively precise predictions for small areas from a survey; these predictions account for the differences between areas in characteristics included as predictors in the survey, but do not account for differences in characteristics not included in the study. Direct estimates for each area are unique to the area and hence take into account both observed and unobserved relevant characteristics; however, they have low precision in areas where the sample size is small. The SAE model combines the regression estimate and the direct estimate for each area in a sensible way, balancing bias and precision.

For this technique to succeed, Longford (2005) claims that areas that are known to be similar to one another should receive similar estimates, rather than estimates independent of one another. The degree to which similarity can or should be imposed can be chosen on statistical grounds by minimizing the overall discrepancy (mean squared error). Rahman (2008) emphasizes that SAE uses data from similar domains to estimate the statistics in a particular small area of interest, and this ‘borrowing of strength’ is justified by assuming a model which relates the small area statistics. SAE is the process of using statistical models to link survey outcome or response variables to a set of predictor variables known for small areas, in order to predict small area-level estimates. Traditional area-specific estimates may not provide enough statistical precision because of the small number of sample observations in small geographical regions. In such a situation, it may be worth checking whether it is possible to use indirect estimation approaches based on the linking models.

## R workshop

This code will not produce any small area estimation, but it will help to introduce basic concepts of sampling.
We will use the BigLucy database from the TeachingSampling package to illustrate how to obtain a probabilistic sample from a finite population. This database contains some economic variables for a population of 85296 companies spread across 100 counties (areas) in a particular year of some fake country. The aim of the exercise is 1) to select a stratified sample according to the size of the companies and 2) to obtain accurate estimates of the total income within each of the 100 counties. Note that the parameters of interest are given by:

$$t_{y,d} = \sum_{k \in U_d} y_k$$

where $t_{y,d}$ denotes the total income of the $d$-th county and $y_k$ is the income of the $k$-th company belonging to the $d$-th county. The whole population of companies in the county is denoted by $U_d$. Thus, this code computes the total income for each county along with the number of companies belonging to each county. Finally, these parameters are saved in a new database named Results.

#############################
##### Setting things up #####
#############################

setwd("/wherever your prefer location is")
rm(list = ls())
set.seed(2017)

library(TeachingSampling)
library(dplyr)

data("BigLucy")
summary(BigLucy$Level)
levels(BigLucy$Zone)

Total <- BigLucy %>%
group_by(Zone) %>%
summarise(Income. = sum(Income)) %>%
arrange(Zone)

N <- BigLucy %>%
group_by(Zone) %>%
summarise(N.county = n()) %>%
arrange(Zone)

Results <- data.frame(N, Total$Income.)

#Checking the population total
(Total <- sum(Results$Total.Income.))

Now, once the parameters of the finite population are computed, the time has come for us to select a sample. The sampling design we will use is stratified sampling, taking advantage of the classification of each company in the database according to its level: small, medium and big. The selected sample will be stored in an object named data.sam and the sampling weights will be saved in a column called FEX.

#######################################
##### Drawing a stratified sample #####
#######################################

# Level is the stratifying variable
summary(BigLucy$Level)
attach(BigLucy)
# Defines the size of each stratum
N1 <- summary(Level)[[1]]
N2 <- summary(Level)[[2]]
N3 <- summary(Level)[[3]]
N1;N2;N3
Nh <- c(N1,N2,N3)
# Defines the sample size at each stratum

n1 <- round(N1 * 0.05)
n2 <- round(N2 * 0.05)
n3 <- round(N3 * 0.05)
(nh<-c(n1,n2,n3))
# Draws a stratified sample
sam <- S.STSI(Level, Nh, nh)

data.sam <- BigLucy[sam,]

data.sam$FEX <- NULL
data.sam$FEX[data.sam$Level == "Big"] <- Nh[1] / nh[1]
data.sam$FEX[data.sam$Level == "Medium"] <- Nh[2] / nh[2]
data.sam$FEX[data.sam$Level == "Small"] <- Nh[3] / nh[3]

dim(data.sam)
save(data.sam, file = "data.sam.RData")

In summary, we have drawn a sample of 4265 companies: 145 big companies, 1290 medium companies and 2830 small companies. These companies are spread across the 100 counties in the country. The Results database will store information about those counties (how many companies are in each county and how many of them are in the sample), along with different estimates of the total income of the counties.

######################
##### In summary #####
######################

n <- data.sam %>%
group_by(Zone) %>%
summarise(n.county = n()) %>%
arrange(Zone)

Results$n.county <- n$n.county
Results <- Results[c("Zone", "N.county", "n.county", "Total.Income.")]
save(Results, file = "Results.RData")

In the following post, we will use this sample to estimate the total income for each of the hundred zones in BigLucy’s population.

## Saturday, January 28, 2017

### Regression to the mean (or at the end, people are not as smart as you could expect)

Francis Galton very cleverly coined the term "regression to (or towards) the mean", meaning that if a variable is extreme on a first measurement, the following observed values of that variable will tend to be closer to the average of its distribution. The classical example is height: a tall child will have (on average) parents less tall than himself. Moreover, extremely small parents tend to have children who are smaller than average, but in both cases, the children tend to be closer to the mean than were their parents (Senn, 2011). OK, this story should be widely known by the readers of this blog. However, I want to put forward another point of view.
This is from Rolf Tarrach (former president of the University of Luxembourg), who has written a book on logical reasoning entitled «The Pleasure of Deciding». He claims that regression to the mean is a phenomenon that occurs not only in body measurements but also in cognitive measurements. That is: smart parents will tend to have children who are not as smart as expected. So if you consider yourself an intelligent person and you have decided to share your life with a smart mate, it is very likely that your children won't be smarter than you two. So, do not expect your children to be geniuses. That fact goes against the common expectation that children of smart parents should be even more intelligent. Of course, there are some exceptions, such as the Bach family or the Bernoulli family. But those are isolated deviations from the normality of real life.

I want to finish with this story about the playwright Bernard Shaw and the dancer Isadora Duncan. She told him: “Would it not be wonderful if we could have a child who had your brains and my beauty?” He replied: “Yes, but suppose the child had your brains and my beauty!”

PS: About geniuses, Koenker (1998) claims that Galton managed to invent, in one plot, not only regression but also bivariate kernel density estimation.

## Monday, January 16, 2017

### Multilevel regression with poststratification (Gelman's MrP) in R - What is this all about?

Multilevel regression with poststratification (MrP) is a useful technique to predict a parameter of interest within small domains by modeling the mean of the variable of interest conditional on poststratification counts. This method (or family of methods) was first proposed by Gelman and Little (1997) and is widely used in political science, where voting intention is modeled conditional on the interaction of classification variables. The aim of this methodology is to provide reliable estimates for strata based on census counts.
For those who have some background in survey sampling, this method should look very similar to raking, where sampling weights are adjusted to known census cell counts. However, a significant difference is that MrP is a model-based approach, rather than a design-based method. This way, even in the presence of a (maybe complex) survey design, MrP does not take it into account for inference. In other words, the sampling design is considered ignorable. So, the probability measure that governs the whole inference is based on modeling the voting intention (variable of interest) as a function of demographic categories (auxiliary variables).

Is it a major technical issue to ignore the complex survey design? Yes, because in any case we are considering a probability measure to draw the sample. However, when it comes to voting intention (a major area where this technique is used), we rarely find a sophisticated complex design. Moreover, this kind of study is rarely based on probabilistic polls. So, if the survey lacks a proper sampling plan, it is always better to model the response variable.

Therefore, the ultimate goal of this technique is to estimate a parameter of interest (totals, means, proportions, etc.) for all of the strata (domains, categories or subgroups) in a finite population. From now on, let's assume that:

1. a population is divided into $H$ strata of interest (for example, states),
2. the parameters of interest are the means (the same rules apply for proportions) in each stratum, $\theta_h$ ($h=1, \ldots, H$),
3. every stratum is cross-classified by some demographics $j \in h$ (from now on called post-strata), and every population count $N_j$ is known, and
4. all of the population means $\mu_j$ can be estimated by using some statistical technique, such as multilevel regression.
This way, the mean in stratum $h$ ($\theta_h$) is defined as a function of the means in post-strata $j$ ($\mu_j$) and the post-strata counts ($N_j$):

$$\theta_h = \frac{\sum_{j \in h} N_j \mu_j }{\sum_{j \in h} N_j}$$

The first part of MrP is a multilevel regression (MR). This kind of model is a particular case of mixed-effects models. The core components of this phase are

• the variable of interest (for example, voting intention),
• the auxiliary variables (classification into census demographic cells), and
• the random effects (strata of interest, which usually are states or counties).

The second part of MrP is cell poststratification (P): the predicted response variable is aggregated into states and adjusted by the corresponding poststratification weights.

## R workshop

Let’s consider the Lucy database of the TeachingSampling package. This population contains some economic variables of 2396 industrial companies in a particular year. Assume that we want to estimate the mean income of industries in each of the five existing zones (strata of interest) in that database. This way, our parameters of interest are $\theta_1, \ldots, \theta_5$. Now, we also know that the population is divided into three levels (small, medium and big industries) and we have access to the total number of industries within each cross group. That is, we know exactly how many small industries are in each of the five zones, how many medium industries are in each of the five zones, and so on. The following code shows how to load the database and obtain the cell counts.

> rm(list = ls())
> set.seed(123)
>
> library(TeachingSampling)
> library(dplyr)
> library(lme4)
>
> data("Lucy")
> # Number of industries per level
> table(Lucy$Level)

Big Medium  Small
83    737   1576
> # Number of industries per zone
> table(Lucy$Zone)

  A   B   C   D   E
307 727 974 223 165
> # Size of post-strata
> (Np <- table(Lucy$Level, Lucy$Zone))

           A   B   C   D   E
  Big     30  13   1  16  23
  Medium 180 121 111 187 138
  Small   97 593 862  20   4

Of course, this technique works over a selected sample. That's why we are going to select a random sample of size $n = 1000$. We can also create some tables showing the counts in the sample.

> # A sample is selected
> SLucy <- sample_n(Lucy, size = 1000)
> table(SLucy$Level)

Big Medium  Small
33    280    687
> table(SLucy$Zone)

  A   B   C   D   E
130 295 426  86  63

The first step of MrP is multilevel regression, in order to estimate the post-strata means. The following code shows how to estimate them by using the lmer function. The object Mupred contains the corresponding $\mu_j$ (one row per level) for each stratum (one column per zone).

> # Step 1: <<MR>> - Multilevel regression
> M1 <- lmer(Income ~ Level + (1 | Zone), data = SLucy)
> coef(M1)$Zone
(Intercept) LevelMedium LevelSmall
A    1265.596   -579.1851  -893.8958
B    1138.337   -579.1851  -893.8958
C    1189.285   -579.1851  -893.8958
D    1248.658   -579.1851  -893.8958
E    1284.322   -579.1851  -893.8958

attr(,"class")
[1] "coef.mer"
> SLucy$Pred <- predict(M1)
>
> # Summary
> grouped <- group_by(SLucy, Zone, Level)
> sum <- summarise(grouped, mean2 = mean(Pred))
> (Mupred <- matrix(sum$mean2, ncol = 5, nrow = 3))
[,1]      [,2]      [,3]      [,4]      [,5]
[1,] 1265.5959 1138.3370 1189.2852 1248.6575 1284.3224
[2,]  686.4107  559.1518  610.1000  669.4724  705.1373
[3,]  371.7001  244.4412  295.3894  354.7618  390.4267


Now that we have estimated the post-strata means, it’s time to weight every post-stratum by its corresponding count in order to obtain an estimate of the mean income by zone. As we know each post-stratum size, we simply use the function colSums to obtain the MrP estimator of the parameters of interest.
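As a hand check of this weighting, the sketch below rebuilds the two tables from the output printed earlier in this post and applies the formula for $\theta_h$ column by column. The numbers are copied from that output; the object names Np_chk and Mupred_chk are made up for illustration, and the block does not need the Lucy data or any package.

```r
# Post-strata counts N_j (rows: Big, Medium, Small; columns: zones A to E),
# copied from the Level-by-Zone table printed earlier
Np_chk <- matrix(c(30, 180,  97,
                   13, 121, 593,
                    1, 111, 862,
                   16, 187,  20,
                   23, 138,   4),
                 nrow = 3,
                 dimnames = list(c("Big", "Medium", "Small"), LETTERS[1:5]))

# Predicted post-strata means mu_j, copied from the Mupred output
Mupred_chk <- matrix(c(1265.5959, 686.4107, 371.7001,
                       1138.3370, 559.1518, 244.4412,
                       1189.2852, 610.1000, 295.3894,
                       1248.6575, 669.4724, 354.7618,
                       1284.3224, 705.1373, 390.4267),
                     nrow = 3, dimnames = dimnames(Np_chk))

# theta_h = sum_j N_j * mu_j / sum_j N_j, one value per zone
theta_hat <- colSums(Np_chk * Mupred_chk) / colSums(Np_chk)
round(theta_hat, 4)
```

Each column sum of Np_chk is the zone size, so dividing by colSums(Np_chk) plays the same role as dividing by the zone table in the code that follows.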

> # Step 2: <<P>> - Post-stratification
> # Mean income estimation per zone
> colSums(Np * Mupred) / table(Lucy$Zone)

       A        B        C        D        E
643.5724 312.8052 332.1726 682.8031 778.2428

How accurate was the estimation? It was good compared to the true parameters of the finite population.

> # True parameters
> aggregate(Lucy$Income, by = list(Lucy$Zone), FUN = mean)
  Group.1        x
1       A 652.2834
2       B 320.7469
3       C 331.0195
4       D 684.9821
5       E 767.3879

## Sunday, January 1, 2017

### 3PL models viewed through the lens of total probability theorem (updated)

As I am currently the NPM for PISA in Colombia, I must attend several meetings dealing with the proper implementation of this assessment in my country. Few of them are devoted to the analysis of this kind of data (coming from IRT models). As usual, the OECD has hired organizations with high technical standards. The institute that handles all this data and takes part in the analysis is ETS (Educational Testing Service). PISA and ETS are changing from Rasch models to 2PL models. That involves a significant technical effort to maintain comparability over time.

I had the opportunity to talk with some experts from ETS last year, and I asked them the following question: why consider 2PL models instead of 3PL models? Well, the answer was not easy, and I am not pretending to explain it here in detail, in part because I am still convinced that the advantages of 3PL models exceed those of 2PL models. However, they pointed me to a recent paper entitled Is There Need for the 3PL Model? Guess What? The article was written by Matthias von Davier from ETS.

I liked the way von Davier presented 3PL models, as coming from a total probability setup involving 2PL models. Consider the following hierarchical structure: first, the test-taker decides whether he/she is answering that item by guessing or not; then, he/she uses his/her ability to find the correct response.
So, the stochastic process behind this structure can be easily shown in a tree diagram.

Remember that 3PL models can be written as:

$$P_{3PL}(x=1) = P(Guess) + P(NoGuess) \times P_{2PL}(x=1|NoGuess)$$

Note that, per the model, once the student has chosen to answer by guessing, a correct answer is always found (kind of weird, isn’t it?). So, a major criticism against 3PL models is related to this last point.

In R, you can estimate 3PL models by using the mirt package. So, for example, when using the LSAT7 data on the second item, we can estimate this guessing parameter.

library(mirt)
data <- expand.table(LSAT7)
md2 <- mirt(data, 1, itemtype = '3PL', IRTpars = TRUE)
coef(md2, IRTpars = TRUE)$Item.2


We found that the guessing parameter is estimated as 0.295. This way, the model is specified as:

$$P_{3PL}(x=1) = 0.295 + 0.705 \times P_{2PL}(x=1|NoGuess)$$
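The total probability reading can be sketched in a few lines of base R. This is a minimal sketch and not the output of mirt: only the guessing value 0.295 comes from the estimate above, while the discrimination `a`, the difficulty `b`, the ability grid `theta` and the function names are hypothetical choices made for illustration.

```r
# 2PL item response function (IRT parameterization):
# a = discrimination, b = difficulty, theta = ability
p_2pl <- function(theta, a, b) plogis(a * (theta - b))

# 3PL as a total-probability mixture: with probability g the test-taker
# guesses (and is always correct); otherwise the 2PL model applies
p_3pl <- function(theta, a, b, g) g + (1 - g) * p_2pl(theta, a, b)

g <- 0.295            # guessing parameter reported above
a <- 1                # hypothetical discrimination
b <- 0                # hypothetical difficulty
theta <- c(-3, 0, 3)  # hypothetical ability values

p_3pl(theta, a, b, g)
```

Whatever the values of `a` and `b`, the probability of a correct response never falls below `g`, which is exactly the feature of the model that the criticism targets.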

PS: Alexander Robitzsch pointed me to this paper (Aitkin, 2006), which proposes an alternative 3PL model that aims to address this critique.

## Saturday, December 24, 2016

### Computing Sample Size for Variance Estimation

The R package samplesize4surveys contains functions for calculating sample sizes when estimating proportions, means, differences of proportions and even differences of two means. It also permits the calculation of the sampling error and the power level for a fixed sample size.

Here four functions are introduced for estimating a population variance and for conducting statistical hypothesis tests on this parameter of interest. These functions are described below:

1. Function ss4S2 calculates the sample size for estimating $s^2_{y_U}$ subject to a particular value of the coefficient of variation or the relative margin of error. Additionally, it offers the user the option of plotting the coefficient of variation and the margin of error as a function of the sample size, which makes the decision about $n$ easier.
2. Function ss4S2H calculates the sample size for estimating $s^2_{y_U}$ subject to a particular power level to detect a population variance greater than the value set in the null hypothesis. It also offers the user the option of plotting the power level as a function of the sample size.
3. Function e4S2 calculates the coefficient of variation and the margin of error for a particular sample size. It can also produce a plot similar to the one of ss4S2.
4. Function b4S2 calculates the power level for a fixed sample size. It can also produce a plot similar to the one of ss4S2H.

To use the above functions, it is necessary to install the package that contains them from the Comprehensive R Archive Network (CRAN) and then load it. To do so, type the following lines in the console:

install.packages("samplesize4surveys")
library(samplesize4surveys)


For example, the following line gives the sample size needed to estimate the variance of a characteristic of interest in a finite population of size 10000 (with a coefficient of kurtosis of one and a design effect of two) so as to reach an estimated coefficient of variation of at most 5% and a relative margin of error of 3%.

ss4S2(N = 10000, K = 1, CV = 0.05, me = 0.03, DEFF = 2, plot = TRUE)



On the other hand, as the package is constantly updated, the authors maintain a repository where users can try the newest features and interact with the academic community to correct possible errors in the code and improve the efficiency of the functions, among other things. To install this development version, type the following lines in R.

library(devtools)
install_github("psirusteam/samplesize4surveys")

In this paper you can find the mathematical background behind those R functions.

## Saturday, December 3, 2016

### Highlighting R code for the web

When blogging about statistics and R, it is very useful to differentiate the body text from R code. I used to handle this by highlighting the code, and Pretty-R, from Revolution Analytics, was a valuable tool for that. However, as you may know, Microsoft acquired that company, and this feature (dressing R code for the web) is no longer available.

After some searching, I found this online syntax highlighter, and it seems to work pretty well. Besides, it lets you select from different styles, and you can even choose among many programming languages.

### How important is that variable?

When modeling any phenomenon with explanatory variables that are highly related to the variable of interest, one question arises: which of the auxiliary variables has the greatest influence on the response? I am not writing about significance testing or anything like that. I am just thinking like a researcher who wants to know the ranking of the variables that influence the response and their relative weights.

There are a variety of methods that try to answer that question. The one motivating this post is very simple: remove the measurement units from the variables. Assume a linear model with the following structure (for the sake of simplicity, assume only two explanatory variables):

$$y = \beta_1 x_1 + \beta_2 x_2 + \varepsilon$$

If you assume that this model is true and $\beta_i > 0$, then the influence of variable $x_i$ on the response $y$ can be found by removing the measurement units from the variables. One could fit a model to the standardized variables (explanatory and response) and then directly compare the regression coefficients. Another way to do this is by means of the following expression:

$$I(i) = \frac{\beta_i}{sd(\beta_i)} = \beta_i\frac{ sd(x_i)}{sd(y)}$$

For example, consider the model $y = -500 x_1 + 50 x_2 + \varepsilon$. The relative importance of the first and second variables is then around 500/(500+50) ≈ 0.9 and 50/(500+50) ≈ 0.1, respectively. The following code shows how to perform this simple analysis in R.

n <- 10000

x1 <- runif(n)
x2 <- runif(n)
y <- -500 * x1 + 50 * x2 + rnorm(n)

model <- lm(y ~ 0 + x1 + x2)

# 1a. Standardized betas
summary(model)$coe[,2]
sd.betas <- summary(model)$coe[,2]
betas <- model$coefficients
imp <- abs(betas)/sd.betas
imp <- imp/sum(imp)
imp

# 1b. Standardized betas
imp1 <- abs(model$coefficients[1] * sd(x1)/sd(y))
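The first route mentioned above, fitting the model to standardized variables and comparing the coefficients directly, can be sketched as follows. This is a minimal sketch rather than the post's own code: the seed and the names `z`, `fit_std`, `fit_raw` and `rescaled` are assumptions, and the variables are centered through `scale()` instead of fitting a no-intercept model to the raw data.

```r
set.seed(123)  # assumed seed, only for reproducibility
n <- 10000
x1 <- runif(n)
x2 <- runif(n)
y <- -500 * x1 + 50 * x2 + rnorm(n)

# Standardize the response and the predictors (mean 0, sd 1)
z <- data.frame(y  = as.numeric(scale(y)),
                x1 = as.numeric(scale(x1)),
                x2 = as.numeric(scale(x2)))

# Coefficients of the standardized model are unit-free
fit_std <- lm(y ~ 0 + x1 + x2, data = z)
std_betas <- coef(fit_std)

# Same quantities from the raw fit: beta_i * sd(x_i) / sd(y)
fit_raw <- lm(y ~ x1 + x2)
rescaled <- coef(fit_raw)[c("x1", "x2")] * c(sd(x1), sd(x2)) / sd(y)

# Relative importance: share of each absolute standardized coefficient
imp <- abs(std_betas) / sum(abs(std_betas))
imp
```

Both routes give the same standardized coefficients, so the choice between them is a matter of convenience; taking absolute values keeps the negative sign of the first coefficient from distorting the shares.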

## Sunday, October 30, 2016

### Data Literacy

Once again I have decided to pimp my blog. This time it is a significant change: the name. This blog began a long time ago. It was 2006; I was 22 years old, and I was enrolled in a Master of Science in Statistics. I had plenty of doubts about statistics, data science, and uncertainty (fortunately, some of those questions remain), and I decided to solve them by blogging.

Thus this blog was born under the name "Apuntes de Estadística." Ten years later, it has become an important space for researchers, teachers, and students who, like myself, want to solve questions about variability and statistics.

Now, after hundreds of posts, a hundred thousand visitors per year, and a migration from Spanish to English and from WordPress to Blogger, it is time for me to revisit the very name of this blog. Lately, posts are not notes anymore; I am not only answering basic questions but also discussing modeling, forecasting, data and even the epistemology of statistics.

As a consequence of what I currently do in my job, I am now strongly convinced (more than ever in my life) that the time predicted by H.G. Wells has not arrived yet. Statistical thinking is not part of our culture. It should be ingrained in every one of us, but it is not. I will put in my two cents to build an effective citizenship through data literacy: the ability to communicate information from data.