Star (-) Watch (-)

Introduction to R


One way to tidy data is to reshape it so that it adheres to the three rules of tidy data. While base R has several functions aimed at reshaping data, we will use the reshape2 package by Hadley Wickham, as it provides a simple and consistent set of functions to reshape data.


In the simplest terms, reshaping data is like doing a pivot table in excel, where you shuffle columns, rows and values. Let us start by tidying the pew dataset.

pew <- read.delim(
  file = "data/pew.txt",
  header = TRUE,
  stringsAsFactors = FALSE,
  check.names = F

We can tidy this data using the melt function in the reshape2 package.

pew_tidy <- melt(
  data = pew,
  id = "religion", = "income", = "frequency"

religion income frequency
Agnostic <$10k 27
Atheist <$10k 12
Buddhist <$10k 27
Catholic <$10k 418
Don’t know/refused <$10k 15
Evangelical Prot <$10k 575

Exercise 1

Now that you know how to melt a data frame, use the same idea to tidy the tb dataset. Think about how you can split gender and age after melting the data

tb <- read.csv(
  file = "data/tb.csv",
  header = TRUE, 
  stringsAsFactors = FALSE

Solution 1


# set column `new_sp` to NULL and clean up column names
tb$new_sp = NULL
names(tb) <- gsub("new_sp_", "", names(tb))

# Use na.rm = TRUE to remove missing observations
tb_tidy <- melt(
  data = tb,
  id = c("iso2", "year"), = "gender_age", = "cases",
  na.rm = TRUE

# split gender_age into gender and age group
tb_tidy <- mutate(tb_tidy,
  gender = sub("^([m|f])(.*)$", "\\1", gender_age),
  age = sub("^([m|f])(.*)$", "\\2", gender_age),
  gender_age = NULL
tb_tidy <- tb_tidy[c('iso2', 'year', 'gender', 'age', 'cases')]

There is one more step of cleaning required to assign the age groups more meaningful names. For example, 04 stands for 0-4.

Exercise 2

Use the same principles to clean the weather dataset.

weather <- read.delim(
 file = "data/weather.txt",
 stringsAsFactors = FALSE

Solution 2

We first melt the data and convert the day variable into numbers.

weather_tidy <- melt(
  data = weather,  
  id = 1:4, = "day", = "temparature",
  na.rm = TRUE
weather_tidy <- mutate(weather_tidy,
  day = sub("^d", "", day)  

We still have the issue of variables TMAX and TMIN being present in rows. We can fix this using the dcast function in reshape2. Type ?dcast to read up the documentation on this function.

weather_tidy2 <- dcast(
  data = weather_tidy,
  formula = id + year + month + day ~ element, 
  value.var = "temparature"

Exercise 3

Tidy the dataset billboards.csv! Note that this dataset suffers from messiness of other types. So you will need to think beyond what was done in the previous exercises.

Solution 3

We start by reading the dataset using read.csv. I don't like using periods in column names, and so let us convert periods to underscores.

billboards <- read.csv(
  file = "",
  stringsAsFactors = FALSE
names(billboards) <- gsub("\\.", "_", names(billboards))
billboards <- mutate(billboards,
  artist_inverted = iconv(artist_inverted, "MAC", "UTF-8"),
billboards_tidy <- melt(billboards, 
  id = 1:7, = "week", = "rank",
  na.rm = TRUE
billboards_tidy <- mutate(billboards_tidy,
  week = as.numeric(gsub("^x([[:digit:]]+).*", "\\1", week))