In this lab, you will learn about the sampling distribution through a data simulation.

To illustrate the concept of a sampling distribution, we will look at a population with a known mean (\(\mu\)) and standard deviation (\(\sigma\)). Remember that you almost never know the actual population parameters in real life. This example is only for illustrative purposes.

Let’s consider a dataset containing mood scores from a fictitious student *population* of 30,538 people. Load this dataset.

```
library(psych)
library(ggplot2) # for plots
library(gridExtra) #for plots
pop <- read.csv("data/RES_STAT_Lab7_data.csv")
head(pop)
```

```
## ID mood
## 1 1 3.447279
## 2 2 1.103520
## 3 3 6.134288
## 4 4 5.865394
## 5 5 4.479369
## 6 6 4.789060
```

Now, let’s imagine that you conduct a mood survey on a random sample of 50 people drawn from this population.

```
survey <- sample(pop$mood, 50)
describe(survey) # sample statistics
```

```
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 50 3.22 1.31 3.26 3.24 1.31 0.34 5.58 5.24 -0.2 -0.74 0.18
```

```
pop_mean <- mean(pop$mood) # Calculate population mean for later use.
```

Notice that your sample statistics (e.g., mean and SD) are a bit different from the population parameters.

```
describe(pop$mood) # Population parameters
```

```
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 30538 3.46 1.4 3.46 3.46 1.4 -2.32 9.17 11.5 0 0.05 0.01
```

These deviations are due to *sampling error*, which occurs during any random sampling process.

Suppose that you repeat the survey with 50 people ten more times.

```
## [1] "Survey 1"
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 50 3.73 1.52 3.77 3.71 1.29 -0.59 8.15 8.73 0.09 0.67 0.22
## [1] "Survey 2"
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 50 3.76 1.25 3.73 3.79 0.96 0.03 6.75 6.72 -0.21 0.39 0.18
## [1] "Survey 3"
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 50 3.24 1.45 3.39 3.28 1.7 -0.35 5.73 6.08 -0.3 -0.69 0.21
## [1] "Survey 4"
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 50 3.43 1.4 3.31 3.4 1.24 -0.36 7.21 7.56 0.18 0.42 0.2
## [1] "Survey 5"
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 50 3.23 1.2 3.16 3.24 1.44 0.94 5.29 4.35 -0.03 -1.08 0.17
## [1] "Survey 6"
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 50 3.52 1.54 3.96 3.6 1.52 -0.06 6.21 6.27 -0.43 -0.8 0.22
## [1] "Survey 7"
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 50 3.25 1.62 3.14 3.22 1.45 -0.46 7.27 7.73 0.12 0 0.23
## [1] "Survey 8"
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 50 3.62 1.26 3.62 3.57 1.21 1.08 7.71 6.63 0.64 1.13 0.18
## [1] "Survey 9"
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 50 3.27 1.74 3.21 3.29 1.84 -1.32 6.91 8.23 -0.15 -0.17 0.25
## [1] "Survey 10"
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 50 3.23 1.19 3.18 3.22 1.06 0.83 6.22 5.39 0.14 -0.27 0.17
```
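The output above was generated behind the scenes. A minimal sketch of a loop that could produce it (the exact numbers will differ on each run because sampling is random) looks like this:

```r
# Repeat the 50-person survey ten times, printing the descriptive
# statistics and saving each sample mean into a vector `m`.
m <- vector(mode = "numeric", 10)
for (i in 1:10) {
  survey_i <- sample(pop$mood, 50)
  print(paste("Survey", i))
  print(describe(survey_i))
  m[i] <- mean(survey_i)
}
```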

As you can see, the *statistics* of each survey (such as \(\bar{X}\)) varied from sample to sample. Nonetheless, most \(\bar{X}\)s were close to \(\mu\).

Behind the scenes, we recorded the mean of each survey sample into a variable `m`.

```
# Here are the sample means of the ten surveys.
m
```

```
## [1] 3.733677 3.764214 3.240662 3.431969 3.229464 3.521675 3.247002 3.620472 3.271393 3.234137
```

You can plot them to see the sampling distribution of the means.
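For example, with `ggplot2` (loaded at the start of this lab) you could draw a simple histogram of the ten means; the binwidth and dashed reference line here are just one possible choice:

```r
# Histogram of the ten sample means, with the population mean marked
ggplot(data.frame(mean = m), aes(x = mean)) +
  geom_histogram(binwidth = 0.1) +
  geom_vline(xintercept = pop_mean, linetype = "dashed") +
  labs(x = "Sample mean", y = "Count")
```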

For each sample, the sample mean was not exactly the same as the population mean due to *sampling error*. Most of the time, the sample means \(\bar{X}\) were quite close to the population mean \(\mu\), but occasionally you might get a sample mean quite far away from it. The distribution of the sample means is called the **sampling distribution of the sample mean**, or **sampling distribution** for short.

According to the **central limit theorem**, the sampling distribution will be approximately normally distributed, i.e., a bell-shaped curve. We will demonstrate this with a simulation.

Now imagine that you can keep conducting a survey of 50 people again and again, 10,000 times. We will record each sample mean into a variable `M`.

```
# Sample 10,000 times
M <- vector(mode = "numeric", 10000)
for (i in seq_along(M)) {
  s <- sample(pop$mood, 50)
  M[i] <- mean(s)
}
head(M)
```

```
## [1] 3.390363 3.604792 3.719652 3.709169 3.314616 3.639776
```

We could plot the sampling distribution.
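One way to do this, again using `ggplot2` (the number of bins is an arbitrary choice):

```r
# Histogram of the 10,000 sample means: the sampling distribution
ggplot(data.frame(mean = M), aes(x = mean)) +
  geom_histogram(bins = 50) +
  geom_vline(xintercept = mean(pop$mood), linetype = "dashed") +
  labs(x = "Sample mean (n = 50)", y = "Count")
```

You should see a bell-shaped curve centered near \(\mu\).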

Now recall the mean and SD of the *population*.

```
mean(pop$mood)
sd(pop$mood)
```

```
## [1] 3.455544
## [1] 1.395792
```

These are the mean and SD of the *sampling distribution*.

```
mean(M)
sd(M) # The SD of the sampling distribution is the SE.
```

```
## [1] 3.457466
## [1] 0.197344
```

The SD of the sampling distribution gives you an idea of how large the sampling error is. This is called the **standard error of the mean**, or *SE*.

Remember that we have another way to estimate a standard error: \(SE = \frac{\sigma}{\sqrt{n}}\). This should be similar to our simulated *SE*, `sd(M)`.

```
se <- sd(pop$mood)/sqrt(50) #estimated SE
se
```

```
## [1] 0.1973948
```

Now remember that in a normal distribution, 95% of the data fall within ±1.96 SD of the mean.

Therefore, in the sampling distribution, 95% of sample means will fall within ±1.96 SE of \(\mu\).

This means that 9,500 out of 10,000 samples will give you a sample mean within [3.07, 3.84].

In other words, any value beyond that interval is very unlikely (less than 5%).

Here is how you calculate the lower and upper limits, \(\mu \pm 1.96 \times SE\).

```
LL <- mean(pop$mood) + (-1.96*se)
UL <- mean(pop$mood) + (1.96*se)
LL
UL
```

```
## [1] 3.06865
## [1] 3.842437
```
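Because we already simulated 10,000 sample means in `M`, we can check this claim directly; the proportion should be close to (though not exactly) .95:

```r
# Proportion of simulated sample means inside [LL, UL]
mean(M >= LL & M <= UL)
```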

Let’s assume that we conduct a survey again, but this time from an unknown population. Therefore, we are not sure if this sample came from our student population in the example above.

Suppose that your new survey sample has *M* = 3.57 (*N* = 50).

The sample mean of 3.57 is still within the 95% interval of the sampling distribution. We therefore accept that the sample mean of 3.57 does not differ from the population mean of 3.46; it is plausible that sampling error alone produced a sample mean slightly larger than the population mean.
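You can verify this against the limits computed earlier:

```r
# Check whether the new sample mean falls inside the 95% interval
x_bar_new <- 3.57
x_bar_new >= LL & x_bar_new <= UL # TRUE: within [3.07, 3.84]
```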

Let’s consider another sample of *M* = 2.96 (*N* = 50).

Now the sample mean lies outside the 95% interval, suggesting that this would happen fewer than 5 times out of 100. It is very unlikely to obtain a sample mean this low *given that* the sample comes from this population.

At this point, we are confident to *infer* that this sample may **NOT** have come from this student population. Could we be wrong? Yes, but less than 5% of the time. This 5% possibility of being wrong is called **Type I error**. Despite that, we are quite confident that this sample is unlikely to have come from the student population.

Now that we know the *sampling distribution of the mean* will be normally distributed, we do not need to simulate the sampling distribution. Instead, we could compare our data to the *z*-distribution.

If a sample came from the same population, the sample mean should be the same as the population mean. We call this a *null hypothesis*.

\[H_0: \bar{X} = \mu\] or \[H_0: \bar{X} - \mu = 0\] To test the difference between the sample mean and the population mean (when \(\mu\) and \(\sigma\) are known), we use the *z*-test \[z = \frac{\bar{X}-\mu}{SE}\] where \[SE = \frac{\sigma}{\sqrt{N}}\]

We will conduct a *z*-test on the second sample (*M* = 2.96). Recall that

```
x_bar <- 2.96
mu <- mean(pop$mood)
se <- sd(pop$mood)/sqrt(50) #N = 50
z <- (x_bar - mu)/se
z
```

```
## [1] -2.510419
```

The critical *z*-value for \(\alpha = .05\) is ±1.96. Our *z*-value of -2.51 is lower than -1.96. Therefore, we know that the sample mean of 2.96 is statistically significantly lower than the population mean of 3.46.

You can find the *p*-value for this *z*-test by looking up the *z*-table.
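If you prefer to stay in R, `pnorm()` gives the same answer without a table:

```r
# Two-tailed p-value for the z-test above (z from the previous chunk)
2 * pnorm(abs(z), lower.tail = FALSE) # approximately .012
```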

You will find that the *p*-value is lower than our \(\alpha\) level of .05.

Therefore, we rejected the null hypothesis and concluded that the sample mean was significantly lower than the population mean.

Let’s conduct a third survey, this time with a sample of 120 people.

We will use the function `z.test` from the `BSDA` package. The main arguments in `z.test` are `x =` the data, `mu =` the population mean to test against, and `sigma.x =` the population standard deviation. The function will take care of the rest.

```
library(BSDA)
sample3
```

```
## [1] 3.946005 2.392582 5.284441 2.995409 3.536231 1.629863 4.765260 3.345787 3.212542 3.988457 1.422167 2.943371 2.953872 3.679483 5.612414
## [16] 3.087106 3.116696 2.689957 3.792538 3.644504 2.727769 4.651456 1.639729 3.425729 3.196400 4.930349 4.830121 4.284867 3.662840 5.199233
## [31] 3.251279 3.300393 4.716639 3.651345 4.386038 3.637865 4.489620 2.929944 3.266484 3.673868 4.351942 3.886374 4.283988 5.418004 3.548466
## [46] 3.428728 4.001654 4.560274 2.747881 4.840485 4.915413 3.979009 4.815378 1.896307 4.320700 5.155028 3.100960 3.914574 3.910657 4.819158
## [61] 3.162266 4.240176 2.983290 4.184927 3.453364 4.557898 5.282044 3.187205 3.909801 3.696107 1.827307 4.462634 5.104467 3.038866 3.712330
## [76] 2.856146 3.004790 3.767059 4.943250 4.045457 7.626747 4.301139 3.156699 3.257084 5.260395 4.901558 4.202120 3.593405 2.087163 5.091897
## [91] 2.898755 4.560726 3.193165 4.122850 3.474212 4.885495 2.349220 3.000059 5.671505 3.470049 4.504829 3.880489 4.613108 5.583185 3.954864
## [106] 4.615402 3.406789 3.186280 5.154894 3.737324 3.504483 5.107426 1.513148 4.430989 2.106780 4.214694 2.488468 3.362696 2.314621 3.315720
```

```
z.test(x = sample3, mu = mean(pop$mood), sigma.x = sd(pop$mood))
```

```
##
## One-sample z-Test
##
## data: sample3
## z = 2.7887, p-value = 0.005291
## alternative hypothesis: true mean is not equal to 3.455544
## 95 percent confidence interval:
## 3.561144 4.060613
## sample estimates:
## mean of x
## 3.810879
```