# Load packages for this script.
library(ggplot2)
library(BSDA)
library(psych)

Confidence Interval

Most of the time, we are dealing with a sample from a population with unknown parameters (i.e., unknown \(\mu\) and \(\sigma\)). Therefore, we use the sample mean \(\bar{X}\) to estimate the parameter \(\mu\). However, due to sampling error, we know that our \(\bar{X}\) is not exactly the same as \(\mu\). Instead, we look for an interval of values that is likely to contain the true population mean. This is called a confidence interval.

Recall that in a sampling distribution, 95% of sample means fall within ±1.96 SE of the population mean. If a sample mean falls within that 95% range and we put ±1.96 SE around it, the resulting interval will also contain the population mean. That is, \[\text{95% CI} = \bar{X} \pm 1.96 \times SE\] If we could repeatedly sample from a population, a 95% confidence interval around \(\bar{X}\) would contain \(\mu\) 95% of the time.

Let’s demonstrate with a simulation. Imagine that we survey a population with \(\mu = 50\) and \(\sigma = 9\).

  1. We survey 900 random respondents and calculate \(\bar{X}\) for the sample,
  2. then calculate a 95% CI around that sample mean, and
  3. repeat steps 1 & 2 100 times.

We will find that each sample has a slightly different \(\bar{X}\), but these \(\bar{X}\)s dance around \(\mu\). We run the simulation in the background, record the statistics of each survey in sim, and plot each survey’s CI, stacked vertically.
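The simulation could look like this (a sketch, since the original code is not shown; the seed is an assumption, but it produces the same columns as sim below):

set.seed(1)                          # assumed seed, for reproducibility
mu <- 50; sigma <- 9                 # population parameters
n <- 900                             # respondents per survey
sim <- do.call(rbind, lapply(1:100, function(i) {
  x <- rnorm(n, mean = mu, sd = sigma)     # one survey of 900 respondents
  se <- sd(x) / sqrt(n)                    # standard error of the mean
  LL <- mean(x) - 1.96 * se                # lower limit of the 95% CI
  UL <- mean(x) + 1.96 * se                # upper limit of the 95% CI
  data.frame(survey_id = i, mean = mean(x), sd = sd(x), se = se,
             LL = LL, UL = UL, contain = LL <= mu & mu <= UL)
}))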

head(sim)
##   survey_id     mean       sd        se       LL       UL contain
## 1         1 50.34664 9.329415 0.3109805 49.73713 50.95615    TRUE
## 2         2 50.16078 9.001203 0.3000401 49.57271 50.74885    TRUE
## 3         3 49.93210 8.707903 0.2902634 49.36319 50.50100    TRUE
## 4         4 49.66867 9.214685 0.3071562 49.06666 50.27069    TRUE
## 5         5 49.94813 8.661217 0.2887072 49.38227 50.51399    TRUE
## 6         6 49.59184 8.573338 0.2857779 49.03173 50.15196    TRUE

For 95% of the surveys, the CIs cover the population mean (the blue intervals in the plot). In other words, we can be confident that, most of the time, an interval of this size will include the population mean. The CI gives us an interval estimate of the population mean.
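You can verify the coverage from the simulation directly:

mean(sim$contain) # proportion of CIs containing mu; should be close to .95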

Manual calculation of a CI

Suppose that we have a sample with M = 50.5, SD = 9, N = 900.

m <- 50.5
sd <- 9
n <- 900
se <- sd/sqrt(n)

A 95% interval would be ±1.96 SE around the mean.

LL <- m - (1.96 * se) # Lower limit
UL <- m + (1.96 * se) # Upper limit
print(paste0("95% CI [", LL,", ", UL, "]"))
## [1] "95% CI [49.912, 51.088]"

Does this sample come from the population with \(\mu = 50\)?

or, formally, \(H_0: \mu = 50\)

We could plot the sampling distribution under \(H_0: \mu = 50\).
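One way to draw it with ggplot2 (a sketch; the original figure is not shown here), reusing m and se from above:

ggplot(data.frame(x = c(48.5, 51.5)), aes(x)) +
  stat_function(fun = dnorm, args = list(mean = 50, sd = se)) +            # sampling distribution under H0
  geom_vline(xintercept = 50 + c(-1.96, 1.96) * se, linetype = "dotted") + # bounds of the middle 95%
  geom_vline(xintercept = m, color = "blue") +                             # our sample mean (50.5)
  labs(x = "Sample mean", y = "Density")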

Note that

  1. The 95% CI includes the population mean (50).
  2. The sample mean (50.5) is within the 95% area of the sampling distribution.

If the 95% CI around the sample mean includes our hypothesized value (e.g., 50), it also means that, if the population mean really were 50, a sample mean this far away would occur more than 5% of the time. In other words, you cannot conclude that the sample mean (50.5) is statistically significantly different from 50.
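A quick check in code, reusing LL and UL from above:

LL <= 50 & 50 <= UL # TRUE: 50 falls inside the 95% CI, so we cannot reject H0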

You can also run this as a z-test, using zsum.test from the BSDA package.

zsum.test(m, sd, n, mu = 50)
## 
##  One-sample z-Test
## 
## data:  Summarized x
## z = 1.6667, p-value = 0.09558
## alternative hypothesis: true mean is not equal to 50
## 95 percent confidence interval:
##  49.91201 51.08799
## sample estimates:
## mean of x 
##      50.5

t-test

However, most of the time we are dealing with

  1. an unknown population variance \(\sigma^2\), and/or
  2. a small sample size.

In these situations, the z-test is not appropriate because it underestimates the sampling error. The t statistic was invented to address this issue.

Take a look at the theoretical distributions below.
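One way to draw these curves (a sketch; the original figure is not shown, and the df values here are chosen for illustration):

ggplot(data.frame(x = c(-4, 4)), aes(x)) +
  stat_function(aes(color = "z"), fun = dnorm) +
  stat_function(aes(color = "t (df = 2)"), fun = dt, args = list(df = 2)) +
  stat_function(aes(color = "t (df = 10)"), fun = dt, args = list(df = 10)) +
  labs(x = "Value", y = "Density", color = "Distribution")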

You can see that the t distributions are flatter than the z distribution, with heavier tails. The shape of t depends on the degrees of freedom (df), which account for the extra uncertainty that comes with a smaller sample size.

The flatter shape changes the area under the curve and, therefore, the critical values that mark off the middle 95%.

For df = 10, the critical t = ±2.228.
For df = 300, the critical t = ±1.968.
The critical values of t will approach those of z as sample size increases.
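You can verify these critical values with qt:

qt(.975, df = 10)  # 2.228139
qt(.975, df = 300) # 1.967903
qt(.975, df = 1e6) # 1.959966, essentially the z critical value of 1.96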

One sample t-test

Now let’s consider a study that asks students to read two hours per day: \(M_{\text{daily reading}} = 104.44\) minutes, SD = 62.33, N = 54.

We will call this variable reading.
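If you would like to follow along, you can enter the data as a vector:

reading <- c(130, 52, 194, 85, 151, 63, 61, 92, 48, 106, 156, 116, 138, 123,
             112, 205, 229, 138, 87, 0, 26, 157, 48, 91, 152, 183, 21, 104,
             203, 88, 102, 71, 204, 99, 53, 157, 36, 77, 40, 69, 134, 309,
             87, 0, 75, 80, 11, 117, 41, 68, 134, 164, 109, 44)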

reading
##  [1] 130  52 194  85 151  63  61  92  48 106 156 116 138 123 112 205 229 138  87   0  26 157  48  91 152 183  21 104 203  88 102  71 204  99  53
## [36] 157  36  77  40  69 134 309  87   0  75  80  11 117  41  68 134 164 109  44
psych::describe(reading)
##    vars  n   mean    sd median trimmed   mad min max range skew kurtosis   se
## X1    1 54 104.44 62.33   95.5  100.73 63.01   0 309   309 0.72     0.63 8.48

We would like to test whether this sample mean is different from our hypothesis that students read 2 hours (120 minutes) per day. \[H_0: \mu = 120\] or \[H_0: \mu - 120 = 0\]

We would test this against the t-distribution \[t = \frac{\bar{X} - 120}{SE}\] and \(SE = \frac{SD}{\sqrt{N}}\)

n <- length(reading)
se <- sd(reading)/sqrt(n) # standard error of the mean
t <- (mean(reading) - 120)/se # t statistic against H0: mu = 120
se
## [1] 8.48222
t
## [1] -1.833901

You can look up this t value in a t distribution table. With df = 53, the two-tailed critical value is about ±2.006, so t = -1.83 does not reach significance at the .05 level; against the one-tailed critical value of -1.674, however, the average reading time is significantly lower than 120 minutes.
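Instead of a table, you can compute the exact p-values with pt, reusing t and n from above:

2 * pt(t, df = n - 1) # two-tailed p = .0723, not below .05
pt(t, df = n - 1)     # one-tailed p = .0361, below .05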

Confidence interval

We can also construct a confidence interval for this sample mean. However, we cannot use ±1.96 because we are now using the t distribution. The function qt gives us the quantile of t for a given p. For a two-tailed test, we want the t values associated with p = .025 and p = .975.

df <- n - 1 # degrees of freedom
LL <- mean(reading) + (qt(.025, df) * se) #Lower limit
UL <- mean(reading) + (qt(.975, df) * se) #Upper limit
LL
## [1] 87.43127
UL
## [1] 121.4576

Note that the 95% CI does include the hypothesized value (120), though just barely. By the two-tailed criterion, then, we cannot conclude that this sample’s mean reading time was significantly different from the 120 minutes we asked them to read. Only a one-tailed test, shown in the next section, suggests that these students read less than two hours per day.

t.test function

The t-test can be performed with the function t.test. For a one-sample t-test, you need two arguments, t.test(x, mu), where x is your data and mu is the hypothesized mean to test against.

t.test(reading, mu = 120) #default is a two-tailed test. 
## 
##  One Sample t-test
## 
## data:  reading
## t = -1.8339, df = 53, p-value = 0.07229
## alternative hypothesis: true mean is not equal to 120
## 95 percent confidence interval:
##   87.43127 121.45762
## sample estimates:
## mean of x 
##  104.4444
t.test(reading, mu = 120, alternative = "less") #alternative lets you run a one-tailed test with "less" or "greater".
## 
##  One Sample t-test
## 
## data:  reading
## t = -1.8339, df = 53, p-value = 0.03614
## alternative hypothesis: true mean is less than 120
## 95 percent confidence interval:
##      -Inf 118.6447
## sample estimates:
## mean of x 
##  104.4444