Load Data

Save a data file and an R script in the same directory. Set working directory to Source File Location. setwd(dirname(rstudioapi::getSourceEditorContext()$path)) Then, load the data.

solar_data <- read.csv("data/RES_STAT_Lab5_Data.csv")
head(solar_data) ## view the first part of the data frame.
##   id              user deci    sex age kno1 kno2 kno3 kno4 kno5 kno6 kno7 kno8           inno att1 att2 att3 att4 att5 att6 att7 att8
## 1  1 regular residence    2 female  43    1    1    1    1    1    1    1    1 early majority    2    3    3    1    3    3    3    3
## 2  2 regular residence    2 female  51    1    1    1    1    1    1    1    1       laggards    4    4    3    3    2    2    2    1
## 3  3 regular residence    2   male  28    4    4    4    4    2    3    4    2 early adopters    2    5    5    3    4    5    4    4
## 4  4 regular residence    2 female  32    1    1    1    1    1    1    1    1  late majority    3    4    4    4    5    5    4    3
## 5  5 regular residence    2   male  38    4    4    4    3    3    3    3    4     innovators    4    4    4    3    5    5    5    3
## 6  6 regular residence    2   male  51    1    1    1    1    1    1    1    1 early adopters    5    5    3    2    5    5    5    4
##              interest re_att1 re_att2 re_att3 re_att4 re_att5 re_att6 re_att7 re_att8 re_kno1 re_kno2 re_kno3 re_kno4 re_kno5 re_kno6 re_kno7
## 1        probably not       1       2       2       0       2       2       2       2       0       0       0       0       0       0       0
## 2      definitely not       3       3       2       2       1       1       1       0       0       0       0       0       0       0       0
## 3           likely to       1       4       4       2       3       4       3       3       3       3       3       3       1       2       3
## 4       probably will       2       3       3       3       4       4       3       2       0       0       0       0       0       0       0
## 5           likely to       3       3       3       2       4       4       4       2       3       3       3       2       2       2       2
## 6 definitely going to       4       4       2       1       4       4       4       3       0       0       0       0       0       0       0
##   re_kno8 attitude knowledge
## 1       0    1.625     0.000
## 2       0    1.625     0.000
## 3       1    3.000     2.375
## 4       0    3.000     0.000
## 5       3    3.125     2.500
## 6       0    3.250     0.000

Calculate z-scores

For each X, its standardized score, z, can be calculated as \[z = \frac{X-\bar{X}}{s}\]

Manual method

You can calculate z-score step by step. First, the top part of the equation \(X-\bar{X}\) is called mean centering. That is, the value X is subtracted by the mean of X, X - mean(X).
The mean of X is now the “center” (0) of the distribution. Each \(X-\bar{X}\) represents how far each X is from the \(\bar{X}\), i.e., a deviation score.

Then, the centered value is scaled to the SD of X, X - mean(X)/sd(X). When the score is scaled to SD. It means that one unit change in the z-score represents one SD change in the raw score.

To standardize a variable, you will calculate z-scores for each row in that column. Let put it the calculated values in z_kno_m variable.
z for z-scores; kno for knowledge; and m for manual calculation.

solar_data$z_kno_m <- (solar_data$knowledge - mean(solar_data$knowledge))/sd(solar_data$knowledge)
head(solar_data$z_kno_m)
## [1] -0.837499 -0.837499  1.532611 -0.837499  1.657354 -0.837499

scale() function

The scale function will center the value by substracting the column mean. Then it scales the centered value by the column standard deviation.
Let create a variable name z_kno_f. f for function.

solar_data$z_kno_f <- scale(solar_data$knowledge)
head(solar_data$z_kno_f)
##           [,1]
## [1,] -0.837499
## [2,] -0.837499
## [3,]  1.532611
## [4,] -0.837499
## [5,]  1.657354
## [6,] -0.837499

Are all values in the manual method equal to values from scale()?

all(solar_data$z_kno_m == solar_data$z_kno_f)
## [1] TRUE

Both methods lead to the same results.

Outliers

Use z-score to determine outliers

There are many methods to identify outliers. In this example, any values outside ±3 SD are considered outliers. Because z-scores are in a unit of SD, any z beyond ±3 will be marked as outliers.

outliers <- solar_data$z_kno_f > 3 | solar_data$z_kno_f < -3
solar_data[outliers, ]
##      id              user deci  sex age kno1 kno2 kno3 kno4 kno5 kno6 kno7 kno8           inno att1 att2 att3 att4 att5 att6 att7 att8
## 167 211 regular residence    2 male  47    5    5    5    5    5    5    5    5 early adopters    4    4    5    4    5    5    5    4
## 264 332 regular residence    2 male  30    5    5    5    5    5    5    5    5 early adopters    5    5    5    5    5    5    5    5
##          interest re_att1 re_att2 re_att3 re_att4 re_att5 re_att6 re_att7 re_att8 re_kno1 re_kno2 re_kno3 re_kno4 re_kno5 re_kno6 re_kno7 re_kno8
## 167     likely to       3       3       4       3       4       4       4       3       4       4       4       4       4       4       4       4
## 264 probably will       4       4       4       4       4       4       4       4       4       4       4       4       4       4       4       4
##     attitude knowledge  z_kno_m  z_kno_f
## 167      3.5         4 3.154265 3.154265
## 264      4.0         4 3.154265 3.154265

z-score by group

z_kno_f was calculated from the entire data set. This would be appropriate if we are looking at a homogenous group. However, it is not a case here. Our data contain subgroups of innovators inno. Therefore, we should find outilers within each group instead.

We will use ave() to calculate the z-score by group. ave is normally used to calculate means for subsets of X. However, you can specified the argument FUN = to other functions, such as sd() or scale().

The first argument in ave() is a numeric object to be calculated (knowledge). The second argument is a grouping variable (inno). The third argument FUN = is a function to by applied for each factor level (FUN = scale).

solar_data$z_kno_g <- ave(solar_data$knowledge, solar_data$inno, FUN = scale)
# wrong <- ave(solar_data$z_kno_f, solar_data$inno, FUN = scale)
head(solar_data$z_kno_g)
## [1] -0.9203839 -0.5926129  0.9015546 -0.7532102  1.4462915 -1.0470435
# head(wrong)

To check whether the code works correctly, we will calculate the mean and SD of the z-scores for each innovator group. The mean should be 0 and SD should be 1 for each group.

We will use tapply to apply mean() to each inno group. In tapply, the first argument is an input value. The second argument is a grouping variable (a factor). The third argument is a function to apply.

tapply(solar_data$z_kno_g, solar_data$inno, mean)
## early adopters early majority     innovators       laggards  late majority 
##  -6.455541e-17  -6.450725e-17   4.979358e-17   2.100642e-19  -2.301871e-17
# The values contain very small decimals. Let's round them up to make them easier to read.
round(tapply(solar_data$z_kno_g, solar_data$inno, mean))
## early adopters early majority     innovators       laggards  late majority 
##              0              0              0              0              0
# Then calculate the SD
tapply(solar_data$z_kno_g, solar_data$inno, sd)
## early adopters early majority     innovators       laggards  late majority 
##              1              1              1              1              1

With the mean = 0 and SD = 1 for each innovator group, it seems that the z-score-by-group code works correctly.

Outliers by group

Now we find outliers that have z-score beyond ±3. Because the z-scores were calculated for each innovator group, any values beyond ±3 are outliers for that group.

outliers_g <- solar_data$z_kno_g > 3 | solar_data$z_kno_g < -3
solar_data[outliers_g, ]
##      id            user deci    sex age kno1 kno2 kno3 kno4 kno5 kno6 kno7 kno8     inno att1 att2 att3 att4 att5 att6 att7 att8     interest
## 115 136 small residence    2 female  55    3    3    4    3    3    3    4    3 laggards    4    4    5    4    4    5    5    5 probably not
## 166 210 small residence    2   male  37    4    4    4    4    4    4    4    4 laggards    3    3    3    3    3    3    3    3    likely to
##     re_att1 re_att2 re_att3 re_att4 re_att5 re_att6 re_att7 re_att8 re_kno1 re_kno2 re_kno3 re_kno4 re_kno5 re_kno6 re_kno7 re_kno8 attitude
## 115       3       3       4       3       3       4       4       4       2       2       3       2       2       2       3       2      3.5
## 166       2       2       2       2       2       2       2       2       3       3       3       3       3       3       3       3      2.0
##     knowledge  z_kno_m  z_kno_f  z_kno_g
## 115      2.25 1.407868 1.407868 3.038717
## 166      3.00 2.156324 2.156324 4.249160

In this case, we will remove outliers from the data. We will choose rows that are NOT outliers !outliers_g.

solar_data_new <- solar_data[!outliers_g, ] # choose rows that are not outliers and choose all columns.
nrow(solar_data) # Number of observation in the original data. 
## [1] 304
nrow(solar_data_new) # The number of observation should be reduced by 2 cases.
## [1] 302

Descriptives for each group

We will use describe() from package psych to calculate means and SDs by each inno group. The first argument of by is an input data, knowledge. The second argument is a grouping factor, inno. The third argument is a function to apply, describe.

library(psych)
by(solar_data_new$knowledge, solar_data_new$inno, psych::describe)
## solar_data_new$inno: early adopters
##    vars  n mean   sd median trimmed  mad min max range skew kurtosis   se
## X1    1 43 1.28 1.22   1.12    1.15 1.67   0   4     4 0.58    -0.78 0.19
## -------------------------------------------------------------------------------------------------------------- 
## solar_data_new$inno: early majority
##    vars   n mean   sd median trimmed  mad min max range skew kurtosis   se
## X1    1 117 0.93 1.01   0.62    0.79 0.93   0 3.5   3.5 0.87    -0.36 0.09
## -------------------------------------------------------------------------------------------------------------- 
## solar_data_new$inno: innovators
##    vars  n mean   sd median trimmed mad min max range skew kurtosis   se
## X1    1 17 1.08 0.98   1.12    1.06 1.3   0 2.5   2.5  0.2    -1.66 0.24
## -------------------------------------------------------------------------------------------------------------- 
## solar_data_new$inno: laggards
##    vars  n mean  sd median trimmed mad min  max range skew kurtosis   se
## X1    1 46 0.27 0.4      0     0.2   0   0 1.25  1.25  1.3     0.37 0.06
## -------------------------------------------------------------------------------------------------------------- 
## solar_data_new$inno: late majority
##    vars  n mean   sd median trimmed  mad min max range skew kurtosis   se
## X1    1 79 0.71 0.94   0.25    0.54 0.37   0 3.5   3.5 1.37     0.85 0.11
# We will create an object from the code above for future use.
kno_table <- by(solar_data_new$knowledge, solar_data_new$inno, psych::describe)

The object kno_table from by() and describe() is a list. We will access its mean and SD values to calculate effect sizes.

Effect sizes

An effect size is \(\frac{X_1-X_2}{SD}\).

We will need means of the two groups and an SD.

kno_table$innovators$mean # mean of the innovators group
## [1] 1.080882
kno_table$laggards$mean # mean of the laggards group
## [1] 0.2690217
# For SD we will use the SD from the whole data set.
sd_all <- sd(solar_data_new$knowledge)
sd_all
## [1] 0.9942528

Now we can calculate an effect size for each comparison.

effect_size1 <- (kno_table$innovators$mean - kno_table$laggards$mean)/sd_all
effect_size1
## [1] 0.8165535
effect_size2 <- (kno_table$innovators$mean - kno_table$`early adopters`$mean)/sd_all
effect_size2
## [1] -0.1964092
effect_size3 <- (kno_table$`early majority`$mean - kno_table$laggards$mean)/sd_all
effect_size3
## [1] 0.659985
effect_size4 <- (kno_table$`early majority`$mean - kno_table$`late majority`$mean)/sd_all
effect_size4
## [1] 0.2176035