
# Pearson’s Correlation Coefficient

Pearson’s r is a standardized covariance that measures the direction and magnitude of the linear relationship between two variables.

$$r = \frac{cov_{xy}}{s_x s_y}$$

We will use a built-in R dataset, mtcars, for this demonstration.

data(mtcars) # Load mtcars into a data frame.
head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

Let’s run a correlation analysis on cars’ miles per gallon (mpg) and horsepower (hp).

## Manual Calculation

First, calculate $$cov_{xy}$$:

$$cov_{xy} = \frac{\sum{(X - \bar{X})(Y - \bar{Y})}}{N - 1}$$

covxy <- sum((mtcars$mpg - mean(mtcars$mpg)) * (mtcars$hp - mean(mtcars$hp))) / (nrow(mtcars) - 1)
covxy
## [1] -320.7321
# Or use cov() function
cov(mtcars$mpg, mtcars$hp)
## [1] -320.7321

Then, divide $$cov_{xy}$$ by $$s_x s_y$$:

covxy / (sd(mtcars$mpg) * sd(mtcars$hp))
## [1] -0.7761684

## cor() and cor.test() functions

We can use the base R cor() function.

cor(mtcars$mpg, mtcars$hp)
## [1] -0.7761684

However, cor() does not give us significance testing. If we need a p-value and a confidence interval (CI), use cor.test().

cor.test(mtcars$mpg, mtcars$hp)
##
##  Pearson's product-moment correlation
##
## data:  mtcars$mpg and mtcars$hp
## t = -6.7424, df = 30, p-value = 1.788e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.8852686 -0.5860994
## sample estimates:
##        cor
## -0.7761684
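Because cor.test() returns a standard "htest" list, you can also pull out the individual pieces programmatically. A small sketch using the documented htest components:

```r
# Store the test result and extract its components.
ct <- cor.test(mtcars$mpg, mtcars$hp)

ct$estimate   # the correlation coefficient, r
ct$p.value    # the p-value
ct$conf.int   # the lower and upper 95% CI bounds
```

This is handy when you want to report or reuse the numbers without copying them from the console.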

## Other types of correlation

The base R functions can perform three types of correlation by specifying method = "pearson", "kendall", or "spearman" in the cor() or cor.test() functions.

• Pearson’s correlation (r) is suitable for testing a linear relationship between two continuous variables. Although r assumes that the variables are normally distributed, it is quite robust to non-normal continuous data. However, r is sensitive to outliers.
• Spearman’s rank correlation or Spearman’s rho ($$\rho$$) is a non-parametric test of association for ordinal (ranked) data. It is suitable for a monotonic relationship (both linear and nonlinear). It is also less sensitive to outliers.
• Kendall’s rank correlation or Kendall’s tau ($$\tau$$) is a non-parametric test of dependence for ordinal variables. Kendall’s tau is usually smaller than Spearman’s rho, but it is less sensitive to error and more accurate in small samples. It is generally preferred over Spearman’s rho.

Let’s plot mpg and hp.

The relationship shows a negative monotonic trend (i.e., as one variable increases, the other decreases), BUT it is not quite linear (blue line).
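The code chunk for this plot appears to be missing; here is a sketch that reproduces it, assuming the default geom_smooth() loess fit (which ggplot2 draws as a blue line):

```r
library(ggplot2)

# Reconstructed sketch (assumed, not from the original chunk): points plus
# the default loess smoother, whose blue line shows the nonlinear, monotonic trend.
ggplot(mtcars, aes(x = hp, y = mpg)) +
  geom_point() +
  geom_smooth(se = FALSE) +
  theme_classic()
```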

cor.test(mtcars$mpg, mtcars$hp, method = "pearson")
##
##  Pearson's product-moment correlation
##
## data:  mtcars$mpg and mtcars$hp
## t = -6.7424, df = 30, p-value = 1.788e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.8852686 -0.5860994
## sample estimates:
##        cor
## -0.7761684
cor.test(mtcars$mpg, mtcars$hp, method = "spearman")
## Warning in cor.test.default(mtcars$mpg, mtcars$hp, method = "spearman"): Cannot compute exact p-value with ties
##
##  Spearman's rank correlation rho
##
## data:  mtcars$mpg and mtcars$hp
## S = 10337, p-value = 5.086e-12
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##        rho
## -0.8946646
cor.test(mtcars$mpg, mtcars$hp, method = "kendall")
## Warning in cor.test.default(mtcars$mpg, mtcars$hp, method = "kendall"): Cannot compute exact p-value with ties
##
##  Kendall's rank correlation tau
##
## data:  mtcars$mpg and mtcars$hp
## z = -5.871, p-value = 4.332e-09
## alternative hypothesis: true tau is not equal to 0
## sample estimates:
##        tau
## -0.7428125

Because the relationship was not quite linear, Spearman’s rho was larger in magnitude than Pearson’s r. Nonetheless, r was still pretty good at capturing the relationship. Kendall’s $$\tau$$ may look smaller, but this is because of the way it is calculated; it actually has a lower p-value than Pearson’s r.

The assumption of linearity and the problem of outliers are critical to Pearson’s r. Therefore, it is important to visualize your data, so that you can select an appropriate correlation test.
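To see why, here is a hypothetical sketch (the data are invented for illustration, not part of the original analysis): a single extreme point drags Pearson’s r far more than the rank-based Spearman’s rho.

```r
set.seed(1)
x <- 1:20
y <- x + rnorm(20, sd = 2)              # near-linear, strongly positive relationship

x_out <- c(x, 50)                       # add one extreme outlier
y_out <- c(y, -10)

cor(x, y, method = "pearson")           # strong positive r
cor(x_out, y_out, method = "pearson")   # r collapses toward zero
cor(x_out, y_out, method = "spearman")  # rho remains clearly positive
```

A scatter plot would reveal the outlier immediately, which is why visualization should come before choosing a test.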

# Scatter Plot

A bivariate analysis can be visualized with a scatter plot. We will learn how to do it in Base R and ggplot2.

## Base R

# Base R: Scatter plot
plot(mtcars$hp, mtcars$mpg)

## ggplot2

In ggplot(data = , mapping = aes()), we define a dataset and an aesthetic mapping.

• data = is a data frame that contains the variables to be plotted. We put mtcars here.
• mapping = aes() specifies the aesthetics of the plot, i.e., which variables will be plotted on which axes. In this plot, we map hp to the x-axis and mpg to the y-axis.
• Then we close the ggplot() function. However, we are not done yet. We will add a layer (using +) of geometry. A geometry function defines how data points are represented. To create a scatter plot, we will use geom_point().
• Then + the classic theme to make the plot look APA-styled.

Note that you can store a ggplot object into an R variable (e.g., p) for later use.

# ggplot2
library(ggplot2)
p <- ggplot(mtcars, mapping = aes(x = hp, y = mpg)) + # Define data and aesthetic mapping.
  geom_point() +   # Point geometry for a scatter plot.
  theme_classic()
p

## Grouping with color or shape

For an advanced plot with another grouping variable (e.g., am: automatic = 0; manual transmission = 1), we can use color or shape of the dot to represent the grouping variable. We specify color or shape in aes(). Because am is a categorical variable, we need to make sure to treat it as a factor with as.factor().

# Color
ggplot(data = mtcars, aes(x = hp, y = mpg, color = as.factor(am))) +
  geom_point() +
  theme_classic()

# Shape
ggplot(data = mtcars, aes(x = hp,
                          y = mpg,
                          shape = as.factor(am))) +
  geom_point() +
  theme_classic()

# Correlation Matrix (Table)

If we are looking at multiple pairs of variables, we can run a correlation matrix.

corr_dat <- mtcars[, c("mpg", "cyl", "disp", "hp", "wt", "gear")] # Create a data frame with the variables to be analyzed.
cor(corr_dat)
##             mpg        cyl       disp         hp         wt       gear
## mpg   1.0000000 -0.8521620 -0.8475514 -0.7761684 -0.8676594  0.4802848
## cyl  -0.8521620  1.0000000  0.9020329  0.8324475  0.7824958 -0.4926866
## disp -0.8475514  0.9020329  1.0000000  0.7909486  0.8879799 -0.5555692
## hp   -0.7761684  0.8324475  0.7909486  1.0000000  0.6587479 -0.1257043
## wt   -0.8676594  0.7824958  0.8879799  0.6587479  1.0000000 -0.5832870
## gear  0.4802848 -0.4926866 -0.5555692 -0.1257043 -0.5832870  1.0000000
# Note that the upper half above the diagonal is a mirror image of the lower half.

However, cor() does not give us significance testing. We will explore several packages that offer functions for correlation tables.

## psych package

The psych package has a correlation matrix function called corr.test(). Note that psych’s function uses a double r: corr.test(). The options for method = are "pearson" (default), "spearman", and "kendall".

library("psych")
psych::corr.test(corr_dat) # You do not actually need "psych::" before corr.test(). However, it is good practice to identify which package you are using.
## Call:psych::corr.test(x = corr_dat)
## Correlation matrix
##        mpg   cyl  disp    hp    wt  gear
## mpg   1.00 -0.85 -0.85 -0.78 -0.87  0.48
## cyl  -0.85  1.00  0.90  0.83  0.78 -0.49
## disp -0.85  0.90  1.00  0.79  0.89 -0.56
## hp   -0.78  0.83  0.79  1.00  0.66 -0.13
## wt   -0.87  0.78  0.89  0.66  1.00 -0.58
## gear  0.48 -0.49 -0.56 -0.13 -0.58  1.00
## Sample Size
## [1] 32
## Probability values (Entries above the diagonal are adjusted for multiple tests.)
##       mpg cyl disp   hp wt gear
## mpg  0.00   0    0 0.00  0 0.01
## cyl  0.00   0    0 0.00  0 0.01
## disp 0.00   0    0 0.00  0 0.00
## hp   0.00   0    0 0.00  0 0.49
## wt   0.00   0    0 0.00  0 0.00
## gear 0.01   0    0 0.49  0 0.00
##
##  To see confidence intervals of the correlations, print with the short=FALSE option

Note that, in the correlation matrix, the values above the diagonal mirror those below it. In the probability values, however, the entries above the diagonal are adjusted for multiple tests, as the output notes.

To make the table easier to read, we can output only the bottom half with lowerCor().

psych::lowerCor(corr_dat) # Show only lower half.
##      mpg   cyl   disp  hp    wt    gear
## mpg   1.00
## cyl  -0.85  1.00
## disp -0.85  0.90  1.00
## hp   -0.78  0.83  0.79  1.00
## wt   -0.87  0.78  0.89  0.66  1.00
## gear  0.48 -0.49 -0.56 -0.13 -0.58  1.00
#corr.test(corr_dat, method = "kendall") # For Kendall's tau

## jamovi’s jmv package

The jmv package has a nice correlation matrix function, corrMatrix(), that produces a good-looking table. Its main arguments are:

• data = a data frame,
• vars = a vector of variable names,
• pearson = TRUE (default) or FALSE,
• spearman = TRUE or FALSE (default),
• kendall = TRUE or FALSE (default),
• n = TRUE or FALSE (default), show the number of cases,
• ci = TRUE or FALSE (default), show confidence intervals,
• plots = TRUE or FALSE (default), plot a visualization,
• plotStats = TRUE or FALSE (default), show stats in the plot,
• plotDens = TRUE or FALSE (default), plot densities.

# install.packages("jmv")
library("jmv")
corrMatrix(data = mtcars, vars = c("mpg", "cyl", "disp", "hp", "wt", "gear"))
##
##  CORRELATION MATRIX
##
##  Correlation Matrix
##  ──────────────────────────────────────────────────────────────────────────────────────────────────────────
##                           mpg           cyl           disp          hp            wt            gear
##  ──────────────────────────────────────────────────────────────────────────────────────────────────────────
##    mpg     Pearson's r             —
##            p-value                 —
##
##    cyl     Pearson's r    -0.8521620             —
##            p-value        < .0000001             —
##
##    disp    Pearson's r    -0.8475514     0.9020329             —
##            p-value        < .0000001    < .0000001             —
##
##    hp      Pearson's r    -0.7761684     0.8324475     0.7909486             —
##            p-value         0.0000002    < .0000001    < .0000001             —
##
##    wt      Pearson's r    -0.8676594     0.7824958     0.8879799     0.6587479             —
##            p-value        < .0000001     0.0000001    < .0000001     0.0000415             —
##
##    gear    Pearson's r     0.4802848    -0.4926866    -0.5555692    -0.1257043    -0.5832870            —
##            p-value         0.0054009     0.0041733     0.0009636     0.4930119     0.0004587            —
##  ──────────────────────────────────────────────────────────────────────────────────────────────────────────

Note that the table only shows the bottom half to make it easier to read.

## apaTables package

The apaTables package helps you create a nice-looking, APA-formatted table. The function can also save the output table to a document file with the filename option.
For a correlation table, we will use apa.cor.table().

# install.packages("apaTables")
library(apaTables)
apa.cor.table(corr_dat, filename = "APA_Corr_Table.doc") # This will save a Word doc to the working directory.
##
##
## Means, standard deviations, and correlations with confidence intervals
##
##
##   Variable M      SD     1            2            3            4           5
##   1. mpg   20.09  6.03
##
##   2. cyl   6.19   1.79   -.85**
##                          [-.93, -.72]
##
##   3. disp  230.72 123.94 -.85**       .90**
##                          [-.92, -.71] [.81, .95]
##
##   4. hp    146.69 68.56  -.78**       .83**        .79**
##                          [-.89, -.59] [.68, .92]   [.61, .89]
##
##   5. wt    3.22   0.98   -.87**       .78**        .89**        .66**
##                          [-.93, -.74] [.60, .89]   [.78, .94]   [.40, .82]
##
##   6. gear  3.69   0.74   .48**        -.49**       -.56**       -.13        -.58**
##                          [.16, .71]   [-.72, -.17] [-.76, -.26] [-.45, .23] [-.77, -.29]
##
##
## Note. M and SD are used to represent mean and standard deviation, respectively.
## Values in square brackets indicate the 95% confidence interval.
## The confidence interval is a plausible range of population correlations
## that could have caused the sample correlation (Cumming, 2014).
##  * indicates p < .05. ** indicates p < .01.
## 
# run getwd() to check your working directory. 

# Visualizing Correlation Matrix

You can also visualize a correlation matrix. The visualization helps you examine the relationships among variables as well as detect assumption violations. It serves a diagnostic purpose. Normally, you would not include these plots in your paper.

There are several packages with a function to visualize a correlation matrix. Here are some of them.

## Base R

The pairs() function will produce a matrix of scatter plots. You can examine whether the assumptions of linearity and homoscedasticity hold for any pair of variables.

pairs(corr_dat)

## corrplot

The corrplot package creates a nice visualization with a color scheme. In corrplot(corr, method = , type = , p.mat = ), the first argument, corr, must be a correlation matrix, NOT a data frame. Therefore, we first need to compute a correlation matrix. We will use psych::corr.test() here.

The option method controls how the correlations are visualized: method = c("circle", "square", "ellipse", "number", "shade", "color", "pie"). See ?corrplot for more info.

The option type = c("full", "lower", "upper") tells the function whether you want the full matrix, the upper half, or the lower half.

The option p.mat takes a p-value matrix and marks whether the correlation coefficients are significant.

# First, we need to compute a correlation matrix.
cor_mat <- psych::corr.test(corr_dat) # Use corr.test() to create a list of output matrices.
str(cor_mat)
## List of 14
##  $ r     : num [1:6, 1:6] 1 -0.852 -0.848 -0.776 -0.868 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:6] "mpg" "cyl" "disp" "hp" ...
##   .. ..$ : chr [1:6] "mpg" "cyl" "disp" "hp" ...
##  $ n     : num 32
##  $ t     : num [1:6, 1:6] Inf -8.92 -8.75 -6.74 -9.56 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:6] "mpg" "cyl" "disp" "hp" ...
##   .. ..$ : chr [1:6] "mpg" "cyl" "disp" "hp" ...
##  $ p     : num [1:6, 1:6] 0.00 6.11e-10 9.38e-10 1.79e-07 1.29e-10 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:6] "mpg" "cyl" "disp" "hp" ...
##   .. ..$ : chr [1:6] "mpg" "cyl" "disp" "hp" ...
##  $ p.adj : num [1:15] 7.34e-09 1.03e-08 2.70e-11 1.25e-06 3.48e-08 ...
##  $ se    : num [1:6, 1:6] 0 0.0955 0.0969 0.1151 0.0908 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:6] "mpg" "cyl" "disp" "hp" ...
##   .. ..$ : chr [1:6] "mpg" "cyl" "disp" "hp" ...
##  $ sef   : num 0.186
##  $ adjust: chr "holm"
##  $ sym   : logi TRUE
##  $ ci    :'data.frame':  15 obs. of  4 variables:
##   ..$ lower: num [1:15] -0.926 -0.923 -0.885 -0.934 0.158 ...
##   ..$ r    : num [1:15] -0.852 -0.848 -0.776 -0.868 0.48 ...
##   ..$ upper: num [1:15] -0.716 -0.708 -0.586 -0.744 0.71 ...
##   ..$ p    : num [1:15] 6.11e-10 9.38e-10 1.79e-07 1.29e-10 5.40e-03 ...
##  $ ci2   :'data.frame':  15 obs. of  5 variables:
##   ..$ lower: num [1:15] -0.926 -0.923 -0.885 -0.934 0.158 ...
##   ..$ r    : num [1:15] -0.852 -0.848 -0.776 -0.868 0.48 ...
##   ..$ upper: num [1:15] -0.716 -0.708 -0.586 -0.744 0.71 ...
##   ..$ p    : num [1:15] 6.11e-10 9.38e-10 1.79e-07 1.29e-10 5.40e-03 ...
##   ..$ p.adj: num [1:15] 7.34e-09 1.03e-08 1.25e-06 1.68e-09 1.25e-02 ...
##  $ ci.adj:'data.frame':  15 obs. of  2 variables:
##   ..$ lower.adj: num [1:15] -0.946 -0.944 -0.911 -0.953 0.107 ...
##   ..$ upper.adj: num [1:15] -0.624 -0.617 -0.49 -0.657 0.735 ...
##  $ stars : chr [1:6, 1:6] "1***" "-0.85***" "-0.85***" "-0.78***" ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:6] "mpg" "cyl" "disp" "hp" ...
##   .. ..$ : chr [1:6] "mpg" "cyl" "disp" "hp" ...
##  $ Call  : language psych::corr.test(x = corr_dat)
##  - attr(*, "class")= chr [1:2] "psych" "corr.test"
cor_mat$r # The first element of the list is a correlation matrix.
##             mpg        cyl       disp         hp         wt       gear
## mpg   1.0000000 -0.8521620 -0.8475514 -0.7761684 -0.8676594  0.4802848
## cyl  -0.8521620  1.0000000  0.9020329  0.8324475  0.7824958 -0.4926866
## disp -0.8475514  0.9020329  1.0000000  0.7909486  0.8879799 -0.5555692
## hp   -0.7761684  0.8324475  0.7909486  1.0000000  0.6587479 -0.1257043
## wt   -0.8676594  0.7824958  0.8879799  0.6587479  1.0000000 -0.5832870
## gear  0.4802848 -0.4926866 -0.5555692 -0.1257043 -0.5832870  1.0000000
cor_mat$p # The fourth element is a p value matrix.
##               mpg          cyl         disp           hp           wt        gear
## mpg  0.000000e+00 7.335225e-09 1.031836e-08 1.251485e-06 1.682146e-09 0.012519892
## cyl  6.112687e-10 0.000000e+00 2.704257e-11 3.477861e-08 9.740536e-07 0.012519892
## disp 9.380327e-10 1.802838e-12 0.000000e+00 6.428411e-07 1.711247e-10 0.003854368
## hp   1.787835e-07 3.477861e-09 7.142679e-08 0.000000e+00 2.487496e-04 0.493011907
## wt   1.293959e-10 1.217567e-07 1.222320e-11 4.145827e-05 0.000000e+00 0.002293300
## gear 5.400948e-03 4.173297e-03 9.635921e-04 4.930119e-01 4.586601e-04 0.000000000

# install.packages("corrplot")
library(corrplot)
corrplot(cor_mat$r, method = "number", type = "upper", p.mat = cor_mat$p) # Show cor values.
corrplot(cor_mat$r, method = "ellipse", type = "lower", p.mat = cor_mat$p) # Show ellipses.
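By default, corrplot marks non-significant cells with an X. As a variation (sig.level and insig are documented corrplot arguments), you can blank those cells out instead:

```r
library(corrplot)

# Recompute the correlation list so this sketch is self-contained.
cor_mat <- psych::corr.test(mtcars[, c("mpg", "cyl", "disp", "hp", "wt", "gear")])

# Blank out cells whose p-value is at or above .05 instead of drawing an X.
corrplot(cor_mat$r, method = "circle", type = "upper",
         p.mat = cor_mat$p, sig.level = 0.05, insig = "blank")
```

In this dataset, only the hp–gear cell should disappear, since its p-value (0.49) is the only one above .05.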

## psych

The psych package’s pairs.panels() function will give you correlation coefficients in the upper half, scatter plots in the lower half, and distribution density plots on the diagonal. The method option can be "pearson", "spearman", or "kendall". The plots are useful for detecting assumption violations.

library(psych)
pairs.panels(corr_dat) # input is a data frame.

## jmv

Another nice package for plotting is GGally. You can read about it at http://ggobi.github.io/ggally/articles/ggpairs.html. However, in this demonstration, we will call a plot from jmv::corrMatrix(). Because the jmv package implements GGally for its plotting option, this saves us time: we call just one function. The downside of using jmv is that you do not have many options to customize your plots.

corrMatrix(data = mtcars,
           vars = c("mpg", "cyl", "disp", "hp", "wt", "gear"),
           plots = TRUE,      # Scatter plots on the lower half.
           plotStats = TRUE,  # Correlation coefficients on the upper half.
           plotDens = TRUE)   # Density plots on the diagonal.
##
##  CORRELATION MATRIX
##
##  Correlation Matrix
##  ──────────────────────────────────────────────────────────────────────────────────────────────────────────
##                           mpg           cyl           disp          hp            wt            gear
##  ──────────────────────────────────────────────────────────────────────────────────────────────────────────
##    mpg     Pearson's r             —
##            p-value                 —
##
##    cyl     Pearson's r    -0.8521620             —
##            p-value        < .0000001             —
##
##    disp    Pearson's r    -0.8475514     0.9020329             —
##            p-value        < .0000001    < .0000001             —
##
##    hp      Pearson's r    -0.7761684     0.8324475     0.7909486             —
##            p-value         0.0000002    < .0000001    < .0000001             —
##
##    wt      Pearson's r    -0.8676594     0.7824958     0.8879799     0.6587479             —
##            p-value        < .0000001     0.0000001    < .0000001     0.0000415             —
##
##    gear    Pearson's r     0.4802848    -0.4926866    -0.5555692    -0.1257043    -0.5832870            —
##            p-value         0.0054009     0.0041733     0.0009636     0.4930119     0.0004587            —
##  ──────────────────────────────────────────────────────────────────────────────────────────────────────────