```
# Load packages for this script.
library(ggplot2)
library(BSDA)
library(psych)
```

Most of the time, we are dealing with a sample from a population whose parameters are unknown (i.e., unknown \(\mu\) and \(\sigma\)). We therefore use the sample mean \(\bar{X}\) to estimate the parameter \(\mu\). However, because of sampling error, we know that \(\bar{X}\) is not exactly equal to \(\mu\). Instead, we look for an interval of values that is likely to contain the true population mean. This interval is called a **confidence interval**.

Recall that in a sampling distribution, 95% of sample means fall within ±1.96 *SE* of the population mean. If we put an interval of ±1.96 *SE* around any sample mean in that 95% range, that interval will also contain the population mean. That is, \[\text{95% CI} = \bar{X} \pm 1.96 \times SE\] If we could repeatedly sample from the population, a 95% confidence interval around \(\bar{X}\) would contain \(\mu\) 95% of the time.

Let’s demonstrate with a simulation. Imagine that we survey a population with \(\mu = 50\) and \(\sigma = 9\).

- We draw a random sample of 900 people and calculate that sample’s \(\bar{X}\),
- then calculate a 95% CI around that sample mean, and
- repeat steps 1 & 2 100 times.

We will find that each sample has a slightly different \(\bar{X}\), but these \(\bar{X}\)s *dance* around \(\mu\). We run a simulation in the background and record the statistics of each survey in `sim`. Then we plot each survey’s CI in vertical order.
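The simulation code itself is not shown in the text. A minimal sketch of how `sim` could be generated, assuming normally distributed draws and the column layout shown in `head(sim)` (the seed here is an arbitrary choice, not the original one):

```r
set.seed(1)  # assumed seed for reproducibility; the original seed is unknown
mu <- 50; sigma <- 9; n <- 900; n_surveys <- 100

sim <- do.call(rbind, lapply(1:n_surveys, function(i) {
  x  <- rnorm(n, mean = mu, sd = sigma)  # one survey of 900 people
  m  <- mean(x)
  s  <- sd(x)
  se <- s / sqrt(n)
  LL <- m - 1.96 * se  # lower limit of the 95% CI
  UL <- m + 1.96 * se  # upper limit of the 95% CI
  data.frame(survey_id = i, mean = m, sd = s, se = se,
             LL = LL, UL = UL, contain = (LL <= mu & mu <= UL))
}))
```

The CI plot described in the text can then be drawn from `sim`, for example with `ggplot2` using `geom_errorbar()` and coloring by `contain`.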

`head(sim)`

```
## survey_id mean sd se LL UL contain
## 1 1 50.34664 9.329415 0.3109805 49.73713 50.95615 TRUE
## 2 2 50.16078 9.001203 0.3000401 49.57271 50.74885 TRUE
## 3 3 49.93210 8.707903 0.2902634 49.36319 50.50100 TRUE
## 4 4 49.66867 9.214685 0.3071562 49.06666 50.27069 TRUE
## 5 5 49.94813 8.661217 0.2887072 49.38227 50.51399 TRUE
## 6 6 49.59184 8.573338 0.2857779 49.03173 50.15196 TRUE
```

For 95% of the surveys, the CIs cover the population mean (blue intervals). In other words, we can be confident that, most of the time, an interval of this size will include the population mean. The CI thus gives us an interval estimate of the population mean.

Suppose that we have a sample with *M* = 50.5, *SD* = 9, *N* = 900.

```
m <- 50.5
sd <- 9
n <- 900
se <- sd/sqrt(n)
```

A 95% interval would be ±1.96 *SE* around the mean.

```
LL <- m - (1.96 * se) # Lower limit
UL <- m + (1.96 * se) # Upper limit
print(paste0("95% CI [", LL,", ", UL, "]"))
```

`## [1] "95% CI [49.912, 51.088]"`

Does this sample come from the population with \(\mu = 50\)?

or \(H_0: \mu = 50\)

We could plot the sampling distribution of \(\mu = 50\).

Note that

- The 95% CI includes the population mean (50).
- The sample mean (50.5) is within the 95% area of the sampling distribution.

If the 95% CI around the sample mean includes our hypothesized value (e.g., 50), it also means that, *if the population mean is 50, there is more than a 5% chance of drawing a sample mean at least this far from 50*. In other words, you **cannot** conclude that the sample mean (50.5) is statistically significantly different from 50.
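The same conclusion follows from computing the *z* statistic by hand and converting it to a two-tailed *p*-value with `pnorm`:

```r
m <- 50.5
sd <- 9
n <- 900
se <- sd / sqrt(n)        # 0.3
z  <- (m - 50) / se       # test statistic
p  <- 2 * pnorm(-abs(z))  # two-tailed p-value
z
## [1] 1.666667
p
## [1] 0.09558
```

Because *p* > .05 (equivalently, |*z*| = 1.67 < 1.96), we fail to reject \(H_0\).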

You can also run this as a *z*-test, using `zsum.test()` from the **BSDA** package.

`zsum.test(m, sd, n, mu = 50)`

```
##
## One-sample z-Test
##
## data: Summarized x
## z = 1.6667, p-value = 0.09558
## alternative hypothesis: true mean is not equal to 50
## 95 percent confidence interval:
## 49.91201 51.08799
## sample estimates:
## mean of x
## 50.5
```

However, most of the time we are dealing with

- unknown population variance \(\sigma^2\)
- a small sample size

In these situations, the *z*-test is not appropriate because it would *underestimate* the sampling error. The *t*-statistic was developed to address this issue.

Take a look at the theoretical distributions below.

You can see that *t* distributions are flatter than the *z* distribution. The shape of the *t* distribution depends on its degrees of freedom (*df*), which accounts for the extra uncertainty that comes with a smaller sample size.

The flatter shape changes area under the curve and, therefore, changes the critical values that mark 95%.

For *df* = 10, the critical *t* = ±2.228.

For *df* = 300, the critical *t* = ±1.968.

The critical values of *t* will approach those of *z* as sample size increases.
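These critical values can be obtained directly from `qt()`, which returns the *t* quantile for a given cumulative probability (here .975, leaving 2.5% in the upper tail):

```r
qt(.975, df = 10)   # critical t for df = 10, approx. 2.228
qt(.975, df = 300)  # critical t for df = 300, approx. 1.968
qnorm(.975)         # critical z for comparison, approx. 1.96
```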

Now let’s consider a study that asks students to read two hours per day. The observed daily reading time: *M* = 104.44 minutes, *SD* = 62.33, *N* = 54.

We will call this variable `reading`.

`reading`

```
## [1] 130 52 194 85 151 63 61 92 48 106 156 116 138 123 112 205 229 138 87 0 26 157 48 91 152 183 21 104 203 88 102 71 204 99 53
## [36] 157 36 77 40 69 134 309 87 0 75 80 11 117 41 68 134 164 109 44
```
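To follow along, you can enter these values directly (reconstructed from the printed output above):

```r
reading <- c(130, 52, 194, 85, 151, 63, 61, 92, 48, 106, 156, 116, 138, 123,
             112, 205, 229, 138, 87, 0, 26, 157, 48, 91, 152, 183, 21, 104,
             203, 88, 102, 71, 204, 99, 53, 157, 36, 77, 40, 69, 134, 309,
             87, 0, 75, 80, 11, 117, 41, 68, 134, 164, 109, 44)
length(reading)  # 54 students
mean(reading)    # 104.4444 minutes
```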

`psych::describe(reading)`

```
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 54 104.44 62.33 95.5 100.73 63.01 0 309 309 0.72 0.63 8.48
```

We would like to test whether this sample mean is different from our hypothesis that students read 2 hours (120 minutes) per day. \[H_0: \mu = 120\] or \[H_0: \mu - 120 = 0\]

We would test this against the *t*-distribution \[t = \frac{\bar{X} - 120}{SE}\] and \(SE = \frac{SD}{\sqrt{N}}\)

```
n <- length(reading)
se <- sd(reading)/sqrt(n)
t <- (mean(reading)-120)/se
se
## [1] 8.48222
t
## [1] -1.833901
```

You can look up this *t* value in a *t*-distribution table. For *df* = 53, the two-tailed critical value at \(\alpha = .05\) is approximately ±2.006. Because |*t*| = 1.83 does not exceed this, we **cannot** conclude that the sample’s average reading time was statistically different from 120 minutes.
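Rather than a printed table, `qt()` returns the critical value directly, and we can compare the observed statistic against it:

```r
t_crit <- qt(.975, df = 53)  # two-tailed critical value at alpha = .05
t_crit                       # approx. 2.006
abs(-1.833901) < t_crit      # TRUE: the observed t is inside the critical bounds
```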

We can also construct a confidence interval for this sample mean. However, we cannot use ±1.96 because we are now using the *t* distribution. The function `qt` gives us the quantile of *t* for a given *p*. For a two-tailed test, we want the *t* values associated with *p* = .025 and *p* = .975.

```
df <- n - 1 # degrees of freedom
LL <- mean(reading) + (qt(.025, df) * se) #Lower limit
UL <- mean(reading) + (qt(.975, df) * se) #Upper limit
LL
## [1] 87.43127
UL
## [1] 121.4576
```

Note that the 95% CI [87.43, 121.46] *includes* the hypothesized value (120). Consistent with the two-tailed test above, we cannot conclude that this sample’s mean reading time was significantly different from 120 minutes. If, however, we had the directional hypothesis that students read *less* than two hours per day, a one-tailed test would be appropriate, as shown below.

The *t*-test can be performed with the `t.test` function. For a one-sample *t*-test, you need two arguments, `t.test(x, mu)`, where `x` is your data and `mu` is the hypothesized mean to test against.

`t.test(reading, mu = 120) #default is a two-tailed test. `

```
##
## One Sample t-test
##
## data: reading
## t = -1.8339, df = 53, p-value = 0.07229
## alternative hypothesis: true mean is not equal to 120
## 95 percent confidence interval:
## 87.43127 121.45762
## sample estimates:
## mean of x
## 104.4444
```

`t.test(reading, mu = 120, alternative = "less") #alternative lets you run a one-tailed test with "less" or "greater".`

```
##
## One Sample t-test
##
## data: reading
## t = -1.8339, df = 53, p-value = 0.03614
## alternative hypothesis: true mean is less than 120
## 95 percent confidence interval:
## -Inf 118.6447
## sample estimates:
## mean of x
## 104.4444
```