# Sampling Distribution

In this lab, you will learn about the sampling distribution through a data simulation.

To illustrate the concept of a sampling distribution, we will look at a population with a known mean ($$\mu$$) and standard deviation ($$\sigma$$). Remember that you almost never know the actual population parameters in real life; this example is only for illustrative purposes.

Let’s consider a dataset containing mood scores from a fictitious student population of 30,538 people. Load this dataset.

```r
library(psych)     # for describe()
library(ggplot2)   # for plots
library(gridExtra) # for plots
head(pop)
##   ID     mood
## 1  1 3.447279
## 2  2 1.103520
## 3  3 6.134288
## 4  4 5.865394
## 5  5 4.479369
## 6  6 4.789060
```
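If you do not have the lab's `pop` dataset, a stand-in with similar parameters can be simulated. This is a hypothetical substitute, built only from the population mean and SD reported later in this lab, so your exact numbers will differ slightly from the ones shown here.

```r
set.seed(123) # for reproducibility
# Hypothetical stand-in for the lab's dataset: 30,538 mood scores
# with mean ~3.46 and SD ~1.40, matching the parameters reported below.
pop <- data.frame(ID = 1:30538,
                  mood = rnorm(30538, mean = 3.46, sd = 1.40))
head(pop)
```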

Now, let’s imagine that you conduct a mood survey on a random sample of 50 people from this population.

```r
survey <- sample(pop$mood, 50)
describe(survey) # sample statistics
##    vars  n mean   sd median trimmed  mad  min  max range skew kurtosis   se
## X1    1 50 3.22 1.31   3.26    3.24 1.31 0.34 5.58  5.24 -0.2    -0.74 0.18
pop_mean <- mean(pop$mood) # calculate the population mean for later use
```

Notice that your sample statistics (e.g., mean and SD) are a bit different from the population parameters.

```r
describe(pop$mood) # population parameters
##    vars     n mean  sd median trimmed mad   min  max range skew kurtosis   se
## X1    1 30538 3.46 1.4   3.46    3.46 1.4 -2.32 9.17  11.5    0     0.05 0.01
```

These deviations are due to sampling error, which occurs in any random sampling process.

## Sampling, sampling, sampling, …

Suppose that you repeat the survey, again with 50 people, ten more times.

```r
##  "Survey 1"
##    vars  n mean   sd median trimmed  mad   min  max range  skew kurtosis   se
## X1    1 50 3.73 1.52   3.77    3.71 1.29 -0.59 8.15  8.73  0.09     0.67 0.22
##  "Survey 2"
##    vars  n mean   sd median trimmed  mad   min  max range  skew kurtosis   se
## X1    1 50 3.76 1.25   3.73    3.79 0.96  0.03 6.75  6.72 -0.21     0.39 0.18
##  "Survey 3"
##    vars  n mean   sd median trimmed  mad   min  max range  skew kurtosis   se
## X1    1 50 3.24 1.45   3.39    3.28 1.70 -0.35 5.73  6.08 -0.30    -0.69 0.21
##  "Survey 4"
##    vars  n mean   sd median trimmed  mad   min  max range  skew kurtosis   se
## X1    1 50 3.43 1.40   3.31    3.40 1.24 -0.36 7.21  7.56  0.18     0.42 0.20
##  "Survey 5"
##    vars  n mean   sd median trimmed  mad   min  max range  skew kurtosis   se
## X1    1 50 3.23 1.20   3.16    3.24 1.44  0.94 5.29  4.35 -0.03    -1.08 0.17
##  "Survey 6"
##    vars  n mean   sd median trimmed  mad   min  max range  skew kurtosis   se
## X1    1 50 3.52 1.54   3.96    3.60 1.52 -0.06 6.21  6.27 -0.43    -0.80 0.22
##  "Survey 7"
##    vars  n mean   sd median trimmed  mad   min  max range  skew kurtosis   se
## X1    1 50 3.25 1.62   3.14    3.22 1.45 -0.46 7.27  7.73  0.12     0.00 0.23
##  "Survey 8"
##    vars  n mean   sd median trimmed  mad   min  max range  skew kurtosis   se
## X1    1 50 3.62 1.26   3.62    3.57 1.21  1.08 7.71  6.63  0.64     1.13 0.18
##  "Survey 9"
##    vars  n mean   sd median trimmed  mad   min  max range  skew kurtosis   se
## X1    1 50 3.27 1.74   3.21    3.29 1.84 -1.32 6.91  8.23 -0.15    -0.17 0.25
##  "Survey 10"
##    vars  n mean   sd median trimmed  mad   min  max range  skew kurtosis   se
## X1    1 50 3.23 1.19   3.18    3.22 1.06  0.83 6.22  5.39  0.14    -0.27 0.17
```

As you can see, the statistics of each survey (such as $$\bar{X}$$) varied from sample to sample. Nonetheless, most $$\bar{X}$$s were close to $$\mu$$. Behind the scenes, we recorded the mean of each survey sample into a variable `m`.

```r
# Here are the sample means of the ten surveys.
m
##  3.733677 3.764214 3.240662 3.431969 3.229464 3.521675 3.247002 3.620472 3.271393 3.234137
```

You can plot them to see the sampling distribution of the means. For each sample, the sample mean was not exactly equal to the population mean because of sampling error. Most of the time, the sample mean $$\bar{X}$$ was quite close to the population mean $$\mu$$, but occasionally you might get a sample mean quite far away from it. The distribution of the sample means is called the sampling distribution of the sample mean, or the sampling distribution for short. According to the central limit theorem, the sampling distribution will be approximately normal (a bell-shaped curve). We will demonstrate this with a simulation.

Now imagine that you can keep conducting a survey of 50 people again and again, 10,000 times. We will record each sample mean into a variable `M`.

```r
# Sample 10,000 times
M <- vector(mode = "numeric", 10000)
for (i in 1:length(M)) {
  s <- sample(pop$mood, 50)
  M[i] <- mean(s)
}
head(M)
##  3.390363 3.604792 3.719652 3.709169 3.314616 3.639776
```

We can now plot the sampling distribution. First, recall the mean and SD of the population.

```r
mean(pop$mood)
##  3.455544
sd(pop$mood)
##  1.395792
```

Here are the mean and SD of the sampling distribution:

```r
mean(M)
##  3.457466
sd(M) # the SD of the sampling distribution is the SE
##  0.197344
```
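The original lab shows the sampling distribution as a plot. Since ggplot2 is already loaded, here is a minimal sketch of that histogram. It uses a simulated normal stand-in for the population (built from the reported $$\mu$$ and $$\sigma$$), because the `pop` dataset itself is not included here; with the real data loaded, replace `mood` with `pop$mood` and `M_sim` with `M`.

```r
library(ggplot2)
set.seed(1)
# Hypothetical stand-in population: mu ~ 3.46, sigma ~ 1.40 (reported above)
mood <- rnorm(30538, mean = 3.46, sd = 1.40)
# 10,000 sample means of n = 50, as in the simulation above
M_sim <- replicate(10000, mean(sample(mood, 50)))
# Histogram of the sample means: approximately normal (bell-shaped),
# centered on the population mean (dashed line)
p <- ggplot(data.frame(M_sim), aes(x = M_sim)) +
  geom_histogram(bins = 50, fill = "grey70", colour = "white") +
  geom_vline(xintercept = mean(mood), linetype = "dashed") +
  labs(x = "Sample mean", y = "Count",
       title = "Sampling distribution of the mean (n = 50)")
p
```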

The SD of the sampling distribution tells you how large the sampling error typically is. It is called the standard error of the mean, or SE.

Remember that we have another way to estimate the standard error: $$SE = \frac{\sigma}{\sqrt{n}}$$. This value should be similar to our simulated SE, `sd(M)`.

```r
se <- sd(pop$mood)/sqrt(50) # estimated SE
se
##  0.1973948
```

Now remember that in a normal distribution, 95% of the data fall within ±1.96 SD of the mean. Therefore, in a sampling distribution, 95% of sample means will fall within ±1.96 SE of the population mean. This means that roughly 9,500 out of 10,000 samples will give you a sample mean between [3.07, 3.84]. In other words, any value beyond that interval is very unlikely (probability less than 5%). Here is how you calculate the lower limit and the upper limit, $$\mu \pm 1.96SE$$.

```r
LL <- mean(pop$mood) + (-1.96*se) # lower limit
UL <- mean(pop$mood) + (1.96*se)  # upper limit
```
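That 95% coverage claim can be checked by simulation. Below is a minimal self-contained sketch using a normal stand-in population with the parameters reported above (hypothetical data, since `pop` itself is not included here): about 95% of the simulated sample means should land inside $$\mu \pm 1.96SE$$.

```r
se <- 1.395792 / sqrt(50)     # population SD / sqrt(n)
LL <- 3.455544 - 1.96 * se    # lower limit, about 3.07
UL <- 3.455544 + 1.96 * se    # upper limit, about 3.84
set.seed(42)
# 10,000 stand-in sample means of n = 50 from a normal population
M_sim <- replicate(10000, mean(rnorm(50, mean = 3.455544, sd = 1.395792)))
mean(M_sim >= LL & M_sim <= UL) # proportion inside the interval, close to 0.95
```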
Suppose a new survey of 50 people yields a sample mean of 2.96. Is it significantly different from the population mean? We can compute a z-statistic for this sample mean.

```r
x_bar <- 2.96               # the sample mean from this survey
mu <- mean(pop$mood)        # population mean
se <- sd(pop$mood)/sqrt(50) # N = 50
z <- (x_bar - mu)/se
z
##  -2.510419
```

The critical z-value for $$\alpha = .05$$ is ±1.96. Our z-value is much lower than −1.96. Therefore, this sample mean of 2.96 is statistically significantly lower than the population mean of 3.46. You can find the p-value for this z-test by looking it up in a z-table. You will find that the p-value is lower than our $$\alpha$$ level of .05. Therefore, we reject the null hypothesis and conclude that the sample mean is significantly lower than the population mean.

# Visualizing

## Survey 3: Use an R function

Let’s conduct a third survey with 120 people. This time we will use the function z.test from the BSDA package. The main arguments of z.test are x (the data), mu (the population mean to test against), and sigma.x (the population standard deviation). The function takes care of the rest.

```r
library(BSDA)
sample3
##  3.946005 2.392582 5.284441 2.995409 3.536231 1.629863 4.765260 3.345787 3.212542 3.988457 1.422167 2.943371 2.953872 3.679483 5.612414
##  3.087106 3.116696 2.689957 3.792538 3.644504 2.727769 4.651456 1.639729 3.425729 3.196400 4.930349 4.830121 4.284867 3.662840 5.199233
##  3.251279 3.300393 4.716639 3.651345 4.386038 3.637865 4.489620 2.929944 3.266484 3.673868 4.351942 3.886374 4.283988 5.418004 3.548466
##  3.428728 4.001654 4.560274 2.747881 4.840485 4.915413 3.979009 4.815378 1.896307 4.320700 5.155028 3.100960 3.914574 3.910657 4.819158
##  3.162266 4.240176 2.983290 4.184927 3.453364 4.557898 5.282044 3.187205 3.909801 3.696107 1.827307 4.462634 5.104467 3.038866 3.712330
##  2.856146 3.004790 3.767059 4.943250 4.045457 7.626747 4.301139 3.156699 3.257084 5.260395 4.901558 4.202120 3.593405 2.087163 5.091897
##  2.898755 4.560726 3.193165 4.122850 3.474212 4.885495 2.349220 3.000059 5.671505 3.470049 4.504829 3.880489 4.613108 5.583185 3.954864
##  4.615402 3.406789 3.186280 5.154894 3.737324 3.504483 5.107426 1.513148 4.430989 2.106780 4.214694 2.488468 3.362696 2.314621 3.315720
z.test(x = sample3, mu = mean(pop$mood), sigma.x = sd(pop$mood))
```
```r
##
##  One-sample z-Test
##
## data:  sample3
## z = 2.7887, p-value = 0.005291
## alternative hypothesis: true mean is not equal to 3.455544
## 95 percent confidence interval:
##  3.561144 4.060613
## sample estimates:
## mean of x
##  3.810879
```
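The z.test result can be reproduced by hand from the values reported in the output above, using the same formula as before; the two-tailed p-value comes from pnorm rather than a z-table. A minimal sketch (the hard-coded numbers are taken from the output above):

```r
x_bar <- 3.810879        # mean of sample3, from the z.test output
mu    <- 3.455544        # population mean
sigma <- 1.395792        # population SD
n     <- 120             # size of the third survey
se <- sigma / sqrt(n)    # standard error
z  <- (x_bar - mu) / se  # about 2.7887, matching the z.test output
p  <- 2 * pnorm(-abs(z)) # two-tailed p-value, about 0.0053
```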