joelkuiper.eu

A very short introduction to R

Why R

A language that doesn’t affect the way you think about programming, is not worth knowing. — Alan Perlis, Epigrams on Programming

The R programming language is an interesting thing. Derived from S, it was designed with statistics in mind. It is often thought of as a statistical package, like SPSS, rather than a programming language, and is therefore mostly taught as part of an introductory statistics course.

In a sense it is helpful to think of it this way: R is an interactive statistics tool with a programming language attached. This makes it immensely powerful and expressive for statistics. My favorite example of this is The Zen of R: with just a single line you can do a complicated analysis.

Unfortunately it also has a reputation for being hard to learn. This is true, but it is true of a lot of worthwhile things; equating ease of learning with worth would be a mistake. Often the hardest tools are the most flexible and adaptable. Go watch Design, Composition and Performance by Rich Hickey (or Hammock Driven Development) if you’re bored sometime.

The basics

Code School does an excellent job at presenting the basics of R interactively, and I would recommend starting there if you have no experience at all. If you yearn for more after that then swirl also gives a very nice interactive way of learning R, from within R.

Some Exercises

Start an R session and load the iris example data set by typing data(iris). Show the data by typing iris. The data contains 50 samples of each of three species of Iris flowers, measured along four variables. It’s a classic.

Warning: some online tutorials recommend using attach; this is not advisable. attach has the side effect of altering the search path, which can easily lead to the wrong object of a particular name being found. For convenience you can sometimes use with instead.
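As a quick sketch of the safer alternative, with scopes the columns of a data frame to a single expression instead of putting them on the global search path:

```r
# with() evaluates an expression with the data frame's columns
# in scope, without altering the search path like attach() does
data(iris)
mean.sl <- with(iris, mean(Sepal.Length))
mean.sl
# [1] 5.843333
```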

Exercises

  1. Check the type of the iris data set (what can you tell about this type?)
  2. (a) Select only the rows of the Virginica flowers and assign them to virginica
     (b) Select only the Sepal.Length of the Virginica flowers and assign it
  3. Calculate the mean, median, variance and standard deviation of the Virginica Sepal.Length
  4. Select only the numerical columns
  5. Calculate the mean of all the numerical variables
  6. Can you find a clever way to do the same for all the flowers?
class(iris)
str(iris)

virginica <- iris[iris$Species == 'virginica',]

virginica.sl <- virginica$Sepal.Length
# Or both at the same time
virginica.sl <- iris[iris$Species == 'virginica', "Sepal.Length"]

mean(virginica.sl)
median(virginica.sl)
var(virginica.sl)
sd(virginica.sl)

# These are the same
virginica[1:4] # columns from 1 to 4
virginica[-5] # columns without the 5th
# only the numeric ones
virginica.num <- virginica[, sapply(virginica, is.numeric)]

# These are the same in this case,
# but radically different sometimes (see ?apply)
sapply(virginica.num, mean)
apply(virginica.num, 2, mean)

# Clever
sapply(levels(iris$Species), function(species) {
    apply(iris[iris$Species == species, -5], 2, mean)
})

# Better
aggregate(iris[,1:4], by=list("Species" = iris$Species), mean)

Usually the powerful summary function will give decent descriptive statistics and often works as a sanity check. Here is the result of summary(iris):

 Sepal.Length    Sepal.Width     Petal.Length    Petal.Width
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300
 Median :5.800   Median :3.000   Median :4.350   Median :1.300
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500

However, as Anscombe’s Quartet adequately shows, descriptive statistics are an incredibly poor way of gaining understanding of your data; visualizing it is almost always better. The plots below have nearly identical descriptive statistics and regression lines, yet look completely different.

R packages are required to have runnable examples; you can run them with example() or read them in the help pages via help().

data(anscombe)
example(anscombe)

anscombe.png

Exercises

  1. Plot the iris data set
  2. Plot the data without the species column, color coded by species
  3. (a) Plot the Petal.Length against the Petal.Width
     (b) Calculate and plot a linear model in the same graph (should look like the plot below)
plot(iris)
plot(iris[,-5], col=iris$Species)

# Species is a factor, which maps categorical values to integers;
# converting it to a numeric vector lets us add 1,
# which gives nicer colors
species <- as.numeric(iris$Species) + 1
plot(iris[,-5], col=species)

# plot is a generic function that dispatches on data type;
# for a data.frame it uses 'pairs'
plot(iris[,-5], col=species, lower.panel=NULL)
# The same as
pairs(iris[,-5], col=species, lower.panel=NULL)

# pch indicates the plot character (i.e. the symbol)
with(iris,
     plot(Petal.Length, Petal.Width, pch=as.numeric(Species)))

petal.lm <- lm(Petal.Width ~ Petal.Length, data=iris)
abline(petal.lm, col="red")

summary(petal.lm)

petal.png

It is clear that species matters for both the width and the length of the petal. As an exercise, can you find out whether the difference in sepal length between two species is significant?

x1 <- iris[as.numeric(iris$Species) == 1,] # setosa
x2 <- iris[as.numeric(iris$Species) == 2,] # versicolor
iris.test <- t.test(x1$Sepal.Length, x2$Sepal.Length)
print(iris.test)
paste("p-value:", iris.test$p.value)
# "p-value: 3.74674261398384e-17"

Ideally we would like to be able to visualize all four variables at once, unfortunately our brains are not great at visualizing high dimensional spaces. Luckily dimensionality reduction techniques such as Principal Component Analysis (PCA) allow us to project higher dimensional data onto lower dimensions, usually without too much loss of information.

Exercises

  1. Can you find a way of doing PCA on the iris data set? Tip: ?topic finds help pages of a topic (exact match), ??term searches more loosely
  2. Is there a way of knowing how many components are needed?
  3. Can you plot the PCA in 2D?
iris.pca <- princomp(iris[-5])
# check the sanity with summary
summary(iris.pca)
# plot the variances

plot(iris.pca, type = "l")
# looks like we only need two components

biplot(iris.pca)

# This also works for better colors
cols <- c('red', 'blue', 'green')
# plotting just the scores in 2D
with(iris
   , plot(iris.pca$scores[,c(1,2)]
        , col=cols[Species]
        , pch=as.numeric(Species)))

pca.png

The vegan package offers all sorts of interesting visualizations and analyses for multivariate data. Here the convex hull and centroid are visualized after PCA, but it’s mainly there to show that there is an R package for everything (a play on Apple’s 2009 “there’s an app for that” ad).

install.packages("vegan")
library("vegan")

iris.c <- scale(iris[ ,1:4])
pca <- rda(iris.c)

plot(pca, type = 'n', display = 'sites')

points(pca, display='sites', col = cols[iris$Species], pch = 16)

ordihull(pca, groups=iris$Species)
ordispider(pca, groups = iris$Species, label = TRUE)

vegan-pca.png

If after all the built-in functionality and R packages you still need or want more, you can, of course, write your own functions. R is a functional programming language inspired by Scheme. If you squint a bit it looks a lot like JavaScript (which is, incidentally, also inspired by Scheme).

Being a functional programming language means that functions are first-class values: they can be assigned to variables, passed as arguments, and returned from other functions. I’ll leave those exercises to someone else.
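Still, a minimal sketch of functions as values (the names make.adder and add5 are just illustrative):

```r
# A function that returns a function (a closure over n)
make.adder <- function(n) {
  function(x) x + n
}
add5 <- make.adder(5)
add5(10)
# [1] 15

# Functions as arguments: Filter is built in
Filter(function(x) x %% 2 == 0, 1:10)
# [1]  2  4  6  8 10
```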

Programming tips

Object oriented programming

R does not have a classical class-based or prototypical object orientation system, and there is no real inheritance to speak of. It does offer dynamic dispatch through the S3 and S4 systems, but in practice these work more like naming conventions. As an example of S3:

# Create a constructor
Employee <- function(name, email) {
    say <- function(msg)  {
        paste(name, " <", email, ">", " says: ", msg, sep="")
    }
    info <- list(name=name, email=email, say=say)
    class(info) <- "employee" # set the class
    # the return is usually optional,
    # R returns the result of the last call by default
    return(info)
}

print.employee <- function(x, ...) {
    cat(paste("Hello"
            , x$name
            , "your known email address is:"
            , x$email)
     , "\n")
}

jane <- Employee("Jane", "jane@example.com")

class(jane)
# [1] "employee"
jane$say("Hi!")
# "Jane <jane@example.com> says: Hi!"
print(jane)
# Hello Jane your known email address is: jane@example.com

By using the print.{class} naming convention, any object of class {class} is dispatched to that print method; print is therefore called a generic function. There are various built-in generics (plot being another obvious one). It is often convenient to use lists as the backing structure for your own data types: lists are named and can hold different data types, even functions.
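As a further sketch of how dispatch works, you can define your own generic with UseMethod and back it with a list (the class and function names here are made up for illustration):

```r
# Any named list can back an S3 class
circle <- structure(list(r = 2), class = "circle")

# A generic plus one method per class
area <- function(shape) UseMethod("area")
area.circle <- function(shape) pi * shape$r^2

area(circle)
# [1] 12.56637
```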

If you do “need” inheritance then the proto package provides prototypical inheritance, modeled on Self.

Loops vs apply

The general convention is that loops in R are a code smell. Thinking in loops makes it hard to use the vector, matrix (and tensor) based functions that R loves and has optimized. Furthermore loops, or rather the imperative programming style, will make you miss the potentially great abstractions of higher-order functions. Loops in R can also be really slow, especially when used improperly. That being said, they do exist and you can use them; they are not inherently bad.
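A small sketch of the contrast between the two styles:

```r
x <- 1:100

# Imperative style: fill a result vector element by element
squares <- numeric(length(x))
for (i in seq_along(x)) squares[i] <- x[i]^2

# Vectorized style: one expression, same result
squares2 <- x^2

identical(squares, squares2)
# [1] TRUE
```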

I wish I could say: “every time you see a loop you can also use an apply”. Unfortunately the base apply family (apply, sapply, lapply, mapply, tapply) is also a jungle. In some cases you can consider using plyr instead; in other cases a loop is simply the most convenient way of expressing something.
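For reference, a sketch of three common family members on the iris data:

```r
data(iris)

# lapply returns a list; sapply simplifies to a named vector
lapply(iris[1:2], mean)
sapply(iris[1:2], mean)

# tapply applies a function per group of a factor
tapply(iris$Sepal.Length, iris$Species, mean)
#     setosa versicolor  virginica
#      5.006      5.936      6.588
```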

Knowing when to use what is not something that can be summarized in a simple credo. What does help in learning “the R way” is becoming familiar with functional programming in general. Learn You a Haskell provides a nice introduction; The Little Schemer is also a classic. Besides a functional programming language, R is also an array programming language. While somewhat of a historical curiosity, learning some basic APL (or J; see “Ruins of forgotten empires”) might give interesting insights. The video below shows how thinking in terms of transformations on vectors and matrices differs from loops and assignments (imperative programming).

Performance

The performance of R is not great. If performance is a concern, consider writing the hot pieces in native code, C or Fortran, for which R offers excellent support. This can often result in speed-ups of up to 100x. To profile R code you can use Rprof.
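A minimal sketch of profiling with Rprof (the workload here is arbitrary):

```r
# Rprof samples the call stack into a file at regular intervals;
# summaryRprof aggregates the samples into a time-per-function table
tmp <- tempfile()
Rprof(tmp)
invisible(replicate(50, sort(runif(1e5))))
Rprof(NULL)
head(summaryRprof(tmp)$by.self)
```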

When working with large files it can help to memory-map them. The packages ff and bigmemory are designed for this.

R is also single-threaded. Packages exist that offer various forms of parallelism, but always use common sense when applying them.
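For example, the built-in parallel package offers drop-in parallel versions of lapply (note that mclapply forks processes and is effectively serial on Windows):

```r
library(parallel)

# Run the function on the inputs across 2 worker processes
res <- mclapply(1:4, function(i) i^2, mc.cores = 2)
unlist(res)
# [1]  1  4  9 16
```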

Do note that when I/O is a bottleneck, R offers integration with SQL databases; they are designed for fast and sane (indexed) data access!
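A sketch of what that integration can look like, assuming the DBI and RSQLite packages are installed (install.packages(c("DBI", "RSQLite"))):

```r
library(DBI)

# An in-memory SQLite database; a file path would persist it
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "iris", iris)

# Let the database do the grouping work
res <- dbGetQuery(con,
  'SELECT Species, AVG("Sepal.Length") AS avg_sl
   FROM iris GROUP BY Species')
res
dbDisconnect(con)
```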

Recommended packages

Everything by Hadley Wickham, the man is a genius: ggplot2, dplyr, plyr. Note that the consensus nowadays seems to be that data.table should be used instead of data.frame, making all the apply stuff, and even plyr, redundant. I have yet to try data.table, but “R the good parts” goes into considerable depth on why it is great.

Resources