Sampling Distribution

In this lab, you will learn about sampling distribution through a data simulation.

To illustrate the concept of sampling distribution, we will look at a population with a known mean (\(\mu\)) and standard deviation (\(\sigma\)). Remember that you almost never know actual population parameters in real life. This example is only for illustrative purposes.

Let’s consider a dataset containing mood scores from a fictitious student population of 30,538 people. Load this dataset.

library(psych)
library(ggplot2) # for plots
library(gridExtra) #for plots
pop <- read.csv("data/RES_STAT_Lab7_data.csv")
head(pop)
##   ID     mood
## 1  1 3.447279
## 2  2 1.103520
## 3  3 6.134288
## 4  4 5.865394
## 5  5 4.479369
## 6  6 4.789060

Now, let’s imagine that you conduct a mood survey on a random sample of 50 people drawn from this population.

survey <- sample(pop$mood, 50)
describe(survey) # sample statistics
##    vars  n mean   sd median trimmed  mad  min  max range skew kurtosis   se
## X1    1 50 3.22 1.31   3.26    3.24 1.31 0.34 5.58  5.24 -0.2    -0.74 0.18
pop_mean <- mean(pop$mood) # Calculate population mean for later use.

Notice that your sample statistics (e.g., mean and SD) are a bit different from the population parameters.

describe(pop$mood) # Population parameters
##    vars     n mean  sd median trimmed mad   min  max range skew kurtosis   se
## X1    1 30538 3.46 1.4   3.46    3.46 1.4 -2.32 9.17  11.5    0     0.05 0.01

These deviations are due to sampling error that occurs during a random sampling process.

Sampling, sampling, sampling, …..

Suppose that you repeat the survey with 50 people 10 more times.

## [1] "Survey 1"
##    vars  n mean   sd median trimmed  mad   min  max range skew kurtosis   se
## X1    1 50 3.73 1.52   3.77    3.71 1.29 -0.59 8.15  8.73 0.09     0.67 0.22
## [1] "Survey 2"
##    vars  n mean   sd median trimmed  mad  min  max range  skew kurtosis   se
## X1    1 50 3.76 1.25   3.73    3.79 0.96 0.03 6.75  6.72 -0.21     0.39 0.18
## [1] "Survey 3"
##    vars  n mean   sd median trimmed mad   min  max range skew kurtosis   se
## X1    1 50 3.24 1.45   3.39    3.28 1.7 -0.35 5.73  6.08 -0.3    -0.69 0.21
## [1] "Survey 4"
##    vars  n mean  sd median trimmed  mad   min  max range skew kurtosis  se
## X1    1 50 3.43 1.4   3.31     3.4 1.24 -0.36 7.21  7.56 0.18     0.42 0.2
## [1] "Survey 5"
##    vars  n mean  sd median trimmed  mad  min  max range  skew kurtosis   se
## X1    1 50 3.23 1.2   3.16    3.24 1.44 0.94 5.29  4.35 -0.03    -1.08 0.17
## [1] "Survey 6"
##    vars  n mean   sd median trimmed  mad   min  max range  skew kurtosis   se
## X1    1 50 3.52 1.54   3.96     3.6 1.52 -0.06 6.21  6.27 -0.43     -0.8 0.22
## [1] "Survey 7"
##    vars  n mean   sd median trimmed  mad   min  max range skew kurtosis   se
## X1    1 50 3.25 1.62   3.14    3.22 1.45 -0.46 7.27  7.73 0.12        0 0.23
## [1] "Survey 8"
##    vars  n mean   sd median trimmed  mad  min  max range skew kurtosis   se
## X1    1 50 3.62 1.26   3.62    3.57 1.21 1.08 7.71  6.63 0.64     1.13 0.18
## [1] "Survey 9"
##    vars  n mean   sd median trimmed  mad   min  max range  skew kurtosis   se
## X1    1 50 3.27 1.74   3.21    3.29 1.84 -1.32 6.91  8.23 -0.15    -0.17 0.25
## [1] "Survey 10"
##    vars  n mean   sd median trimmed  mad  min  max range skew kurtosis   se
## X1    1 50 3.23 1.19   3.18    3.22 1.06 0.83 6.22  5.39 0.14    -0.27 0.17

As you can see, statistics of each survey (such as \(\bar{X}\)) varied from sample to sample. Nonetheless, most \(\bar{X}s\) were close to \(\mu\).

Behind the scenes, we recorded the mean of each survey sample in a variable m.

# Here are the sample means of the ten surveys.
m
##  [1] 3.733677 3.764214 3.240662 3.431969 3.229464 3.521675 3.247002 3.620472 3.271393 3.234137
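The loop below is a minimal sketch of how m could have been recorded. It assumes the pop data frame from the chunk above; because sample() draws at random, your values will differ from the ten means shown here.

```r
# Repeat the 50-person survey 10 times and record each sample mean.
m <- vector(mode = "numeric", 10)
for (i in 1:10) {
  survey_i <- sample(pop$mood, 50)  # draw a new random sample of 50
  m[i] <- mean(survey_i)            # store that sample's mean
}
m
```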

You can plot them to see the sampling distribution of the means.

For each sample, the sample mean was not exactly the same as the population mean due to sampling error. Most of the time, the sample means \(\bar{X}\) were quite close to the population mean \(\mu\), but occasionally you might get a sample mean quite far from the population mean. The distribution of the sample means is called the sampling distribution of the sample mean, or sampling distribution for short.

According to the central limit theorem, the sampling distribution will be approximately normal (a bell-shaped curve) when the sample size is large enough. We will demonstrate this with a simulation.

Now imagine that you can keep conducting a survey of 50 people again and again, 10,000 times. We will record each sample mean in a variable M.

# Sample 10,000 times
M <- vector(mode = "numeric", 10000)
for (i in 1:length(M)) {
  s <- sample(pop$mood, 50)
  M[i] <- mean(s)
}
head(M)
## [1] 3.390363 3.604792 3.719652 3.709169 3.314616 3.639776

We could plot the sampling distribution.
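As an example, here is one way you could draw that plot with ggplot2 (loaded at the top of the lab): a histogram of the 10,000 sample means in M, with the population mean pop_mean from the earlier chunk marked as a dashed line.

```r
# Histogram of the 10,000 sample means, with the population mean marked.
ggplot(data.frame(M = M), aes(x = M)) +
  geom_histogram(bins = 50) +
  geom_vline(xintercept = pop_mean, linetype = "dashed") +
  labs(x = "Sample mean", y = "Count",
       title = "Sampling distribution of the mean (n = 50)")
```

The histogram should look close to a bell-shaped curve centered on pop_mean, as the central limit theorem predicts.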

Now recall the mean and SD of the population.

mean(pop$mood)
sd(pop$mood)
## [1] 3.455544
## [1] 1.395792

These are the mean and SD of the sampling distribution.

mean(M)
sd(M) # SD of the sampling distribution is the SE.
## [1] 3.457466
## [1] 0.197344

The SD of the sampling distribution gives you an idea of how large the sampling error is. It is called the standard error of the mean, or SE.

Remember that we have another way to estimate the standard error, \(SE = \frac{\sigma}{\sqrt{n}}\). This should be similar to our simulated SE, sd(M).

se <- sd(pop$mood)/sqrt(50) #estimated SE
se
## [1] 0.1973948

Now remember that in a normal distribution, 95% of the data fall within ±1.96 SD of the mean.

Therefore, in a sampling distribution, 95% of sample means will fall within ±1.96 SE of \(\mu\).

This means that about 9,500 out of 10,000 samples will give you a sample mean within [3.07, 3.84].

In other words, any value outside that interval is very unlikely (probability less than 5%).

Here is how you calculate the lower limit and upper limit, \(\mu ± 1.96SE\).

LL <- mean(pop$mood) + (-1.96*se)
UL <- mean(pop$mood) + (1.96*se)
LL
UL
## [1] 3.06865
## [1] 3.842437
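As a check on the 95% claim, you could count how many of the 10,000 simulated means in M actually fall inside [LL, UL]. This snippet assumes M, LL, and UL from the chunks above; the proportion should come out close to .95, though the exact value varies from run to run.

```r
# Proportion of simulated sample means inside the 95% interval
mean(M >= LL & M <= UL)
```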

Probability of Sampling Distribution

Let’s assume that we conduct a survey again, but this time from an unknown population. Therefore, we are not sure whether this sample came from our student population in the example above.

Sample 1

Suppose that your new survey sample has M = 3.57 (N = 50).

The sample mean of 3.57 still falls within 95% of the sampling distribution. We therefore accept that the sample mean of 3.57 does not differ from the population mean of 3.46. It is plausible that sampling error produced a sample mean slightly larger than the population mean.
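You can confirm this with the interval limits LL and UL computed earlier (x_bar1 here is just a name for this new sample's mean):

```r
x_bar1 <- 3.57
x_bar1 >= LL & x_bar1 <= UL  # TRUE: 3.57 lies inside [3.07, 3.84]
```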

Sample 2

Let’s consider another sample of M = 2.96 (N = 50).

Now the sample mean lies outside the 95% interval, meaning that a mean this extreme would occur less than 5 times out of 100. It is very unlikely to obtain a sample mean this low if the sample really came from this population.

At this point, we are confident to infer that this sample may NOT come from this student population. Could we be wrong? Yes, but less than 5% of the time. This 5% chance of being wrong is called Type I error. Despite that, we are quite confident that this sample is unlikely to have come from the student population.

Null Hypothesis Testing

Now that we know the sampling distribution of the mean will be normally distributed, we do not need to simulate the sampling distribution. Instead, we could compare our data to the z-distribution.

If a sample came from the same population, the sample mean should be the same as the population mean. We call this a null hypothesis.

\[H_0: \bar{X} = \mu\] or \[H_0: \bar{X} - \mu = 0\] To test the difference between the sample mean and the population mean (when \(\mu\) and \(\sigma\) are known), we use the z-test \[z = \frac{\bar{X}-\mu}{SE} \] and \[SE = \frac{\sigma}{\sqrt{N}}\]

We will conduct a z-test on Sample 2. Recall that

x_bar <- 2.96
mu <- mean(pop$mood)
se <- sd(pop$mood)/sqrt(50) #N = 50
z <- (x_bar - mu)/se
z
## [1] -2.510419

The critical z-value for \(\alpha = .05\) is ±1.96. Our z value is much lower than -1.96. Therefore, we know that this sample mean of 2.96 is statistically significantly lower than the population mean of 3.46.

You can find the p-value for this z-test by looking up the z-table.

You will find that the p-value was lower than our \(\alpha\) level at .05.
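Instead of looking up a z-table, you can compute the two-tailed p-value in R with pnorm(), using the z value from the chunk above:

```r
# Two-tailed p-value for the z-test (z = -2.51 from above)
2 * pnorm(-abs(z))
```

This gives roughly .012, which is indeed below the .05 \(\alpha\) level.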

Therefore, we rejected the null hypothesis and concluded that the sample mean was significantly lower than the population mean.

Visualizing

Survey 3: Use an R function

Let’s conduct a third survey with a sample of 120 people.

We will use the function z.test from the BSDA package. The main arguments in z.test are x (the data), mu (the population mean to test against), and sigma.x (the population standard deviation). The function will take care of the rest.

library(BSDA)
sample3 <- sample(pop$mood, 120) # draw a third random sample of 120
sample3
##   [1] 3.946005 2.392582 5.284441 2.995409 3.536231 1.629863 4.765260 3.345787 3.212542 3.988457 1.422167 2.943371 2.953872 3.679483 5.612414
##  [16] 3.087106 3.116696 2.689957 3.792538 3.644504 2.727769 4.651456 1.639729 3.425729 3.196400 4.930349 4.830121 4.284867 3.662840 5.199233
##  [31] 3.251279 3.300393 4.716639 3.651345 4.386038 3.637865 4.489620 2.929944 3.266484 3.673868 4.351942 3.886374 4.283988 5.418004 3.548466
##  [46] 3.428728 4.001654 4.560274 2.747881 4.840485 4.915413 3.979009 4.815378 1.896307 4.320700 5.155028 3.100960 3.914574 3.910657 4.819158
##  [61] 3.162266 4.240176 2.983290 4.184927 3.453364 4.557898 5.282044 3.187205 3.909801 3.696107 1.827307 4.462634 5.104467 3.038866 3.712330
##  [76] 2.856146 3.004790 3.767059 4.943250 4.045457 7.626747 4.301139 3.156699 3.257084 5.260395 4.901558 4.202120 3.593405 2.087163 5.091897
##  [91] 2.898755 4.560726 3.193165 4.122850 3.474212 4.885495 2.349220 3.000059 5.671505 3.470049 4.504829 3.880489 4.613108 5.583185 3.954864
## [106] 4.615402 3.406789 3.186280 5.154894 3.737324 3.504483 5.107426 1.513148 4.430989 2.106780 4.214694 2.488468 3.362696 2.314621 3.315720
z.test(x = sample3, mu = mean(pop$mood), sigma.x = sd(pop$mood))
## 
##  One-sample z-Test
## 
## data:  sample3
## z = 2.7887, p-value = 0.005291
## alternative hypothesis: true mean is not equal to 3.455544
## 95 percent confidence interval:
##  3.561144 4.060613
## sample estimates:
## mean of x 
##  3.810879