Chapter 2 Lab 2: Descriptive Statistics

Describing comic sensibility is near impossible. It’s sort of an abstract silliness, that sometimes the joke isn’t the star. —Dana Carvey

The purpose of this lab is to show you how to compute basic descriptive statistics, including measures of central tendency (mean, mode, median) and variation (range, variance, standard deviation).

2.1 General Goals

  1. Compute measures of central tendency using software
  2. Compute measures of variation using software
  3. Ask some questions of a data set using descriptive statistics

2.1.1 Important info

We will be using data from the gapminder project. You can download a small snippet of the data in .csv format from this link (note this dataset was copied from the gapminder library for R) gapminder.csv. If you are using R, then you can install the gapminder package. This method is described later in the R section.

2.2 R

2.2.1 Descriptives basics in R

We learned in lecture and from the textbook that data we want to use ask and answer questions often comes with loads of numbers. Too many numbers to look at all at once. That’s one reason we use descriptive statistics. To reduce the big set of numbers to one or two summary numbers that tell use something about all of the numbers. R can produce descriptive statistics for you in many ways. There are base functions for most of the ones that you want. We’ll go over some R basics for descriptive statistics, and then use our new found skills to ask some questions about real data.

2.2.1.1 Making numbers in R

In order to do descriptive statistics we need to put some numbers in a variable. You can also do this using the c() command, which stands for combine

my_numbers <- c(1,2,3,4)

There a few other handy ways to make numbers. We can use seq() to make a sequence. Here’s making the numbers from 1 to 100

one_to_one_hundred <- seq(1,100,1)

We can repeat things, using rep. Here’s making 10 5s, and 25 1s:

rep(10,5)
## [1] 10 10 10 10 10
rep(1,25)
##  [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
all_together_now <- c(rep(10,5),rep(1,25)) 

2.2.1.2 Sum

Let’s play with the number 1 to 100. First, let’s use the sum() function to add them up

one_to_one_hundred <- seq(1,100,1)
sum(one_to_one_hundred)
## [1] 5050

2.2.1.3 Length

We put 100 numbers into the variable one_to_one_hundred. We know how many numbers there are in there. How can we get R to tell us? We use length() for that.

length(one_to_one_hundred)
## [1] 100

2.2.2 Central Tendency

2.2.2.1 Mean

Remember the mean of some numbers is their sum, divided by the number of numbers. We can compute the mean like this:

sum(one_to_one_hundred)/length(one_to_one_hundred)
## [1] 50.5

Or, we could just use the mean() function like this:

mean(one_to_one_hundred)
## [1] 50.5

2.2.2.2 Median

The median is the number in the exact middle of the numbers ordered from smallest to largest. If there are an even number of numbers (no number in the middle), then we take the number in between the two (decimal .5). Use the median function. There’s only 3 numbers here. The middle one is 2, that should be the median

median(c(1,2,3))
## [1] 2

2.2.2.3 Mode

R does not a base function for the Mode. You would have to write one for yourself. Here is an example of writing your own mode function, and then using it. Note I searched how to do this on Google, and am using the mode defined by this answer on stack overflow

Remember, the mode is the most frequently occurring number in the set. Below 1 occurs the most, so the mode will be one.

my_mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

my_mode(c(1,1,1,1,1,1,1,2,3,4))
## [1] 1

2.2.3 Variation

We often want to know how variable the numbers are. We are going to look at descriptive statistics to describe this such as the range, variance, the standard deviation, and a few others.

First, let’s remind ourselves what variation looks like (it’s when the numbers are different). We will sample 100 numbers from a normal distribution (don’t worry about this yet), with a mean of 10, and a standard deviation of 5, and then make a histogram so we can see the variation around 10..

sample_numbers <- rnorm(100,10,5)
hist(sample_numbers)

2.2.3.1 range

The range is the minimum and maximum values in the set, we use the range function.

range(sample_numbers)
## [1] -2.591563 22.856565

2.2.3.2 var = variance

We can find the sample variance using var. Note, divides by (n-1)

var(sample_numbers)
## [1] 26.4788

2.2.3.3 sd = standard deviation

We find the sample standard deviation us SD. Note, divides by (n-1)

sd(sample_numbers)
## [1] 5.145755

Remember that the standard deviation is just the square root of the variance, see:

sqrt(var(sample_numbers))
## [1] 5.145755

2.2.3.4 All Descriptives

Let’s put all of the descriptives and other functions so far in one place:

sample_numbers <- rnorm(100,10,5)

sum(sample_numbers)
## [1] 1092.573
length(sample_numbers)
## [1] 100
mean(sample_numbers)
## [1] 10.92573
median(sample_numbers)
## [1] 10.90372
my_mode(sample_numbers)
## [1] 7.98613
range(sample_numbers)
## [1] -0.308353 24.749855
var(sample_numbers)
## [1] 25.30758
sd(sample_numbers)
## [1] 5.030664

2.2.4 Descriptives by conditions

Sometimes you will have a single variable with some numbers, and you can use the above functions to find the descriptives for that variable. Other times (most often in this course), you will have a big data frame of numbers, with different numbers in different conditions. You will want to find descriptive statistics for each the sets of numbers inside each of the conditions. Fortunately, this is where R really shines, it does it all for you in one go.

Let’s illustrate the problem. Here I make a date frame with 10 numbers in each condition. There are 10 conditions, each labelled, A, B, C, D, E, F, G, H, I, J.

scores <- rnorm(100,10,5)
conditions <- rep(c("A","B","C","D","E","F","G","H","I","J"), each =10)
my_df <- data.frame(conditions,scores)

If you look at the my_df data frame, you will see it has 100 rows, there are 10 rows for each condition with a label in the conditions column, and 10 scores for each condition in the scores column. What if you wanted to know the mean of the scores in each condition? You would want to find 10 means.

The slow way to do it would be like this:

mean(my_df[my_df$conditions=="A",]$scores)
## [1] 12.59648
mean(my_df[my_df$conditions=="B",]$scores)
## [1] 10.6128
mean(my_df[my_df$conditions=="C",]$scores)
## [1] 8.266488
# and then keep going

Nobody wants to do that! Not, me I stopped doing it that way, you should to.

2.2.4.1 group_by and summarise

We can easily do everything all at once using the group_by and summarise function from the dplyr package. Just watch

library(dplyr)

my_df %>%
  group_by(conditions) %>%
  summarise(means = mean(scores))
## # A tibble: 10 x 2
##    conditions means
##    <fct>      <dbl>
##  1 A          12.6 
##  2 B          10.6 
##  3 C           8.27
##  4 D           6.70
##  5 E          10.2 
##  6 F           9.12
##  7 G           8.21
##  8 H          10.1 
##  9 I          11.6 
## 10 J          10.2

A couple things now. First, the print out of this was ugly. Let’s fix that. we put the results of our code into a new variable, then we use knitr::kable to print it out nicely when we knit the document

summary_df <- my_df %>%
               group_by(conditions) %>%
               summarise(means = mean(scores))

knitr::kable(summary_df)
conditions means
A 12.596478
B 10.612800
C 8.266488
D 6.704745
E 10.170008
F 9.120068
G 8.205138
H 10.091734
I 11.555082
J 10.234129

2.2.4.2 multiple descriptives

The best thing about the dplyr method, is that we can add more than one function, and we’ll get more than one summary returned, all in a nice format, let’s add the standard deviation:

summary_df <- my_df %>%
               group_by(conditions) %>%
               summarise(means = mean(scores),
                         sds = sd(scores))

knitr::kable(summary_df)
conditions means sds
A 12.596478 9.352925
B 10.612800 4.476998
C 8.266488 4.555700
D 6.704745 5.011566
E 10.170008 3.875537
F 9.120068 5.636339
G 8.205138 5.865190
H 10.091734 5.256813
I 11.555082 4.874282
J 10.234129 3.873318

We’ll add the min and the max too:

summary_df <- my_df %>%
               group_by(conditions) %>%
               summarise(means = mean(scores),
                         sds = sd(scores),
                         min = min(scores),
                         max = max(scores))

knitr::kable(summary_df)
conditions means sds min max
A 12.596478 9.352925 -1.0708193 27.46081
B 10.612800 4.476998 2.2941837 18.44375
C 8.266488 4.555700 0.0053070 14.46609
D 6.704745 5.011566 0.8117465 14.21022
E 10.170008 3.875537 4.7036641 16.04143
F 9.120068 5.636339 1.1128198 21.33737
G 8.205138 5.865190 -2.4686723 14.91742
H 10.091734 5.256813 0.2377578 14.80595
I 11.555082 4.874282 3.3809716 17.21664
J 10.234129 3.873318 0.4334203 14.07733

2.2.5 Describing gapminder

Now that we know how to get descriptive statistics from R, we cam do this will some real data. Let’s quickly ask a few question about the gapminder data.

library(gapminder)
gapminder_df <- gapminder

Note: The above code will only work if you have installed the gapminder package. Make sure you are connected to the internet, then choose the Packages tab from the bottom right panel, and choose install. Thens search for gapminder, choose it, and install it.

2.2.5.1 What are some descriptive for Life expectancy by continent?

Copy the code from the last part of descriptives using dplyr, then change the names like this:

summary_df <- gapminder_df %>%
               group_by(continent) %>%
               summarise(means = mean(lifeExp),
                         sds = sd(lifeExp),
                         min = min(lifeExp),
                         max = max(lifeExp))

knitr::kable(summary_df)
continent means sds min max
Africa 48.86533 9.150210 23.599 76.442
Americas 64.65874 9.345088 37.579 80.653
Asia 60.06490 11.864532 28.801 82.603
Europe 71.90369 5.433178 43.585 81.757
Oceania 74.32621 3.795611 69.120 81.235

2.2.6 Generalization Exercise

(1 point - Pass/Fail)

Complete the generalization exercise described in your R Markdown document for this lab.

  1. What is the mean, standard deviation, minimum and maximum life expectancy for all the gapminder data (across all the years and countries). Hint: do not use group_by

  2. What is the mean, standard deviation, minimum and maximum life expectancy for all of the continents in 2007, the most recent year in the dataset. Hint: add another pipe using filter(year==2007) %>%

2.2.7 Writing assignment

(2 points - Graded)

Complete the writing assignment described in your R Markdown document for this lab. When you have finished everything. Knit the document and hand in your stuff (you can submit your .RMD file to blackboard if it does not knit.)

Your writing assignment is to answer these questions in full sentences using simple plain langauge:

  1. Define the mode.
  2. Explain what would need to happen in order for a set of numbers to have two modes
  3. Define the median
  4. Define the mean
  5. Define the range
  6. When calculating the standard deviation, explain what the difference scores represent
  7. Explain why the difference scores are squared when calculating the standard deviation
  8. If one set of numbers had a standard deviation of 5, and another had a standard deviation of 10, which set of numbers would have greater variance, explain why.

Rubric

General grading.

  • You will receive 0 points for missing answers (say, if you do not answer question c, then you will receive 0 out .25 points for that question)
  • You must write in complete sentences. Point form sentences will be given 0 points.
  • Completely incorrect answers will receive 0 points.
  • If your answer is generally correct but very difficult to understand and unclear you may receive half points for the question

2.3 Excel

How to do it in Excel

2.4 SPSS

In this lab, we will use SPSS to calculate a variety of descriptive statistics. SPSS allows us to specify which statistics we would like calculated and produce them all in one output table. Here, we will learn to:

  1. Calculate descriptive statistics
  2. Graph data using a histogram
  3. Editing graphs

Let’s begin with a short data set {x= 1, 1, 4, 1, 2, 5, 7}

Suppose we want to calculate the measures of central tendency (mean, median, and mode) as well as variability (range, standard deviation, and variance). First, we will have to enter our data into the SPSS spreadsheet. There are 7 measurements, so we will need 7 rows of data (I have also changed the name of this variable in Variable View to x:

2.4.1 Calculating Descriptive Statistics


From here, go to the top menu and choose Analyze, then Descriptive Statistics and then Frequencies:

A new window will ask you to specify for which variables you want descriptives statistics calculated. Use the arrow to move the x variable from the left-hand to the right-hand field.

Now, click the Statistics button. This will open a new window containing a list of statistics. You can choose as many as you want to be calculated. We will choose mean, median mode, range, standard deviation, and variance.

Then click Continue, and then OK. SPSS will produce a set of two output tables: one containing the descriptive statistics we have chosen, and the other a frequency table–a list of all the possible values in our data set and their corresponding frequencies.

2.4.2 Descriptive Statistics and Histograms


Now let’s use a real dataset to calculate the same measures of central tendency and variability as in the last example, but with the addition of a histogram to visualize a distribution to relate back to the descriptive statistics. Here is a link to the life expectancy dataset we used for our graphing tutorial. It is named life_expectancy.sav.

Suppose we wanted to know about life expectancy (around the world) in 2018. This will include calculating descriptive statistics, as well as graphing a histogram to examine the distribution of our data. SPSS often has neat shortcuts to graph data within other menu options. For example, in the process of asking SPSS to produce a table of descriptive statistics, we can also add a histogram to the output without having to go to the Graphs menu.

First, we go to Analyze, then Descriptive Statistics, then Frequencies.

A window will appear asking us to indicate which variable to use. We will scroll all the way down in the list of variables on the left, choose 2018 [V220], and then move it into the field on the right using the arrow.

As in the previous example, we will click the Statistics button next, and choose our measures of central tendency (mean, median, mode) as well as variability (range, standard deviation, variance).

Once we click continue, we are back to the previous window. Here, we can conveniently ask SPSS to insert a histogram. Let’s click on the Charts button:

This will open a new window that asks us to specify the type of chart we would like. Let’s select Histogram, and check off Show normal curve on histogram:

Click Continue, and then OK.

SPSS will produce a table of the requested descriptive statistics, a frequency table (we will ignore this for now because it is long and not relevant to this exercise), and finally, a histogram showing the distribution of life expectancy (in years) with a normal distribution superimposed for comparison.

Something to think about: What do the mean, median, and mode indicate about the shape of the distribution? Is it confirmed when we look at the histogram? How does the shape of this distribution compare to the symmetrical normal distribution which is superimposed over it?

2.4.3 Editing graphs


For future reference, we will take this opportunity to make edits to the appearance of this graph. It is helpful to know how to manipulate not just the superficial, cosmetic appearance of data visualizations, but also the components of graphs that communicate important information. One feature to consider is the scale of our axes.

To modify the scale in a graph, first double-click on the graph itself in the output window. A Chart Editor should pop up.

Now, hover your mouse over one of the values on the x-axis, and click down once. You will see all the x-axis values suddenly highlight:

Now that you know how to select the entire axis, double click on it. A properties window will pop up:

Notice that in the default tab, you can change the fill color of components of your graph. For now, we will select the Scale tab instead:

Notice in this tab, we can change the minimum, maximum, and increments of our scale. Lets change the major increment to 5:

Click Apply. Then Close. Now, close your Chart Editor window by simply x-ing it out. The SPSS output window will now contain your edited graph:

2.4.4 Practice Problems


  1. Using the life expectancy data set, produce a table of output showing the descriptive statistics (measures of central tendency and variability) for both years 1800 and 1934 (during the Great Depression).

  2. Plot histograms of life expectancy for both years. How are these distributions different? (Hint: Plot these on the same axes so that they are comparable).

2.5 JAMOVI

How to do it in JAMOVI