# A very short introduction to R

## Why R

A language that doesn’t affect the way you think about programming is not worth knowing. — Alan Perlis, *Epigrams on Programming*

The R programming language is an interesting thing. Derived from S, it was designed with statistics in mind. It is often thought of as a statistical package, like SPSS, rather than a programming language, and is therefore mostly taught in combination with an introductory statistics course.

In a sense it is helpful to think of it this way: R is an interactive statistics tool with a programming language attached. This makes it immensely powerful and expressive for statistics. My favorite example of this is The Zen of R: with just a single line you can do a complicated analysis.
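To illustrate the kind of one-liner meant (a sketch of my own, not the original Zen of R example), a single line can fit a linear model and produce a full statistical report:

```r
# Fit a linear model on the built-in iris data and report
# coefficients, residuals, R-squared and p-values all at once
summary(lm(Sepal.Width ~ Petal.Width, data = iris))
```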

Unfortunately it also has a reputation of being hard to learn.
This is true, and it is true for a lot of things.
But equating *ease of mastery* with *worth learning* would be a mistake.
Often the hardest things are the most flexible or adaptable.
Go watch Design, Composition and Performance by Rich Hickey (or Hammock Driven Development) if you’re bored sometime.

## The basics

Code School does an excellent job at presenting the basics of R interactively, and I would recommend starting there if you have no experience at all. If you yearn for more after that then swirl also gives a very nice interactive way of learning R, from within R.

## Some Exercises

Start an R session and load the iris example data set by typing `data(iris)`. Show the data by typing `iris`. The data contains 50 samples of each of three species of Iris flowers, measured along four variables. It’s a classic.

**Warning**: some online tutorials recommend using `attach`; this is not advisable. `attach` has the side effect of altering the search path, which can easily lead to the wrong object of a particular name being found. You can sometimes use `with` instead for convenience.
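As a small sketch of the safer pattern, `with` evaluates an expression in the context of a data frame without touching the search path:

```r
data(iris)
# with() makes the columns of iris visible inside the expression,
# but only for the duration of the call; the search path is untouched
m1 <- with(iris, mean(Sepal.Length))
m2 <- mean(iris$Sepal.Length)  # the fully qualified equivalent
```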

**Exercises**

- Check the type of the iris data set (what can you tell about this type?)
- Select only the rows of the *virginica* flowers and assign them to `virginica`
- Select only the Sepal Length of the *virginica* flowers and assign it to a variable
- Calculate the mean, median, variance and standard deviation of the *virginica* Sepal Length
- Select only the numerical columns
- Calculate the mean of all the numerical variables
- Can you find a clever way to do the same for all the flowers?

```r
class(iris)
str(iris)

virginica <- iris[iris$Species == 'virginica', ]
virginica.sl <- virginica$Sepal.Length
# Or both at the same time
virginica.sl <- iris[iris$Species == 'virginica', "Sepal.Length"]

mean(virginica.sl)
median(virginica.sl)
var(virginica.sl)
sd(virginica.sl)

# These are the same
virginica[1:4]  # columns from 1 to 4
virginica[-5]   # columns without the 5th
# only the numeric ones
virginica.num <- virginica[, sapply(virginica, is.numeric)]

# These are the same in this case,
# but radically different sometimes (see ?apply)
sapply(virginica.num, mean)
apply(virginica.num, 2, mean)

# Clever
sapply(levels(iris$Species), function(species) {
  apply(iris[iris$Species == species, -5], 2, mean)
})
# Better
aggregate(iris[, 1:4], by = list("Species" = iris$Species), mean)
```

Usually the powerful `summary` function will give decent descriptive statistics and often works as a sanity check. Here is the result of `summary(iris)`:

| Sepal.Length | Sepal.Width | Petal.Length | Petal.Width |
|---|---|---|---|
| Min. :4.300 | Min. :2.000 | Min. :1.000 | Min. :0.100 |
| 1st Qu.:5.100 | 1st Qu.:2.800 | 1st Qu.:1.600 | 1st Qu.:0.300 |
| Median :5.800 | Median :3.000 | Median :4.350 | Median :1.300 |
| Mean :5.843 | Mean :3.057 | Mean :3.758 | Mean :1.199 |
| 3rd Qu.:6.400 | 3rd Qu.:3.300 | 3rd Qu.:5.100 | 3rd Qu.:1.800 |
| Max. :7.900 | Max. :4.400 | Max. :6.900 | Max. :2.500 |

However, as Anscombe’s Quartet adequately shows, descriptive statistics are an incredibly
poor way of gaining understanding of your data; visualizing it is always better. The plots
below have almost *identical* descriptive statistics and regression lines.

R packages are required to have runnable examples; you can run them with `example()` or read them in the `help()` pages.

```r
data(anscombe)
example(anscombe)
```

**Exercises**

- Plot the iris data set
- Plot the data without the species column, color coded by species
- Plot the `Petal.Length` against the `Petal.Width`
- Calculate and plot a linear model in the same graph (should look like the plot below)

```r
plot(iris)
plot(iris[, -5], col = iris$Species)
# Species is a factor, which maps categorical data to integers;
# by converting to a numerical vector we can add 1,
# which gives nicer colors
species <- as.numeric(iris$Species) + 1
plot(iris[, -5], col = species)
# Plot is a generic function which dispatches by data type,
# in this case data.frame, which uses 'pairs'
plot(iris[, -5], col = species, lower.panel = NULL)
# The same as
pairs(iris[, -5], col = species, lower.panel = NULL)

# pch indicates the plot character (i.e. the symbol)
with(iris, plot(Petal.Length, Petal.Width, pch = as.numeric(Species)))
petal.lm <- lm(Petal.Width ~ Petal.Length, data = iris)
abline(petal.lm, col = "red")
summary(petal.lm)
```

It is clear that species matters for both the width and the length of the petal.
As an **exercise**, can you find out whether the difference in sepal length between two species is significant?

```r
x1 <- iris[as.numeric(iris$Species) == 1, ]
x2 <- iris[as.numeric(iris$Species) == 2, ]
iris.test <- t.test(x1$Sepal.Length, x2$Sepal.Length)
print(iris.test)
paste("p-value:", iris.test$p.value)
```

p-value: 3.74674261398384e-17

Ideally we would like to be able to visualize all four variables at once, unfortunately our brains are not great at visualizing high dimensional spaces. Luckily dimensionality reduction techniques such as Principal Component Analysis (PCA) allow us to project higher dimensional data onto lower dimensions, usually without too much loss of information.

**Exercises**

- Can you find a way of doing PCA on the iris data set? *Tip:* `?topic` finds help pages for a topic (exact match), `??term` searches more loosely
- Is there a way of knowing how many components are needed?
- Can you plot the PCA in 2D?

```r
iris.pca <- princomp(iris[-5])
# check the sanity with summary
summary(iris.pca)
# plot the variances
plot(iris.pca, type = "l")
# looks like we only need two components
biplot(iris.pca)

# This also works for better colors
cols <- c('red', 'blue', 'green')
# plotting just the scores in 2D
with(iris, plot(iris.pca$scores[, c(1, 2)]
                , col = cols[Species]
                , pch = as.numeric(Species)))
```

The vegan package offers all sorts of interesting visualizations and analyses for multivariate data. Here a visualization of the convex hull and centroid is given after PCA, but it’s merely there to show that there is an R package for everything (“there’s an R package for that”, to paraphrase Apple’s 2009 ad).

```r
install.packages("vegan")
library("vegan")
iris.c <- scale(iris[, 1:4])
pca <- rda(iris.c)
plot(pca, type = 'n', display = 'sites')
points(pca, display = 'sites', col = cols[iris$Species], pch = 16)
ordihull(pca, groups = iris$Species)
ordispider(pca, groups = iris$Species, label = TRUE)
```

If, after all the built-in functionality and R packages, you still need or want more, then you can of course write your own functions. R is a functional programming language inspired by Scheme. If you squint a bit it looks a lot like JavaScript (which is, incidentally, also inspired by Scheme).

Being a functional programming language means that functions are first-class values: they can be stored in variables, passed as arguments, and returned from other functions. I’ll leave those exercises to someone else.
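A minimal sketch of what first-class functions buy you (`compose` is my own illustrative helper, not a base R function):

```r
# A function that takes two functions and returns a new one
compose <- function(f, g) function(x) f(g(x))

root.mean <- compose(sqrt, mean)
root.mean(c(1, 4, 9, 16))  # the square root of the mean: sqrt(7.5)
```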

## Programming tips

### Object oriented programming

R does not have a classical class-based or prototypical object orientation system, and there is no real inheritance in the classical sense. It does offer dynamic dispatch through the S3 and S4 systems, but in practice these work more like naming conventions. As an example of S3:

```r
# Create a constructor
Employee <- function(name, email) {
  say <- function(msg) {
    paste(name, " <", email, ">", " says: ", msg, sep = "")
  }
  info <- list(name = name, email = email, say = say)
  class(info) <- "employee"  # set the class
  # the return is usually optional,
  # R returns the result of the last call by default
  return(info)
}

print.employee <- function(x) {
  cat(paste("Hello"
            , x$name
            , "your known email address is:"
            , x$email)
      , "\n")
}

jane <- Employee("Jane", "jane@example.com")
class(jane)
# [1] "employee"
jane$say("Hi!")
# "Jane <jane@example.com> says: Hi!"
print(jane)
# Hello Jane your known email address is: jane@example.com
```

By using the `print.{entity}` notation, any object of class `{entity}` is dispatched to that print function. We therefore call the `print` method *generic*. There are various built-in generic functions (`plot` being another obvious one). It is often convenient to use lists as the backing format for your own data types: list elements are named and can hold different data types, even functions.
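The same convention lets you define your own generics; `area` and the `circle` class below are hypothetical, purely for illustration:

```r
# UseMethod makes 'area' a generic: it dispatches on the class of its argument
area <- function(shape) UseMethod("area")
area.circle <- function(shape) pi * shape$r^2
area.default <- function(shape) stop("no area method for this class")

circle <- structure(list(r = 2), class = "circle")
area(circle)  # dispatches to area.circle: pi * 4
```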

If you do “need” inheritance then the proto package provides prototypical inheritance, modeled on Self.

### Loops vs apply

The general convention is that loops in R are a code smell. Thinking in loops makes it hard to use the vector, matrix (and tensor) based functions that R loves and has optimized. Furthermore loops, or rather the imperative programming style, will make you miss the potentially great abstractions of higher-order functions. Loops in R can also be really slow, especially if not used properly. That being said, they do exist and you can use them. They are not inherently bad.

I wish I could say: “in every case where you see a loop you can also use an apply”.
Unfortunately the core functions `apply`, `sapply`, `lapply`, `mapply`, and `tapply` are also a jungle.
In some cases you can consider using plyr instead; in other cases a loop is simply the most convenient way of
expressing something.
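A small side-by-side sketch: the same computation as a loop, as an `sapply`, and fully vectorized (the last being the most idiomatic R):

```r
# imperative: preallocate, then fill in a loop
squares.loop <- numeric(10)
for (i in 1:10) squares.loop[i] <- i^2

# higher-order: map a function over the vector
squares.apply <- sapply(1:10, function(i) i^2)

# vectorized: usually the fastest and clearest
squares.vec <- (1:10)^2
```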

Knowing when to use what is not something that can be summarized in a simple credo. What does help in learning “the R way” is becoming familiar with functional programming in general. Learn You a Haskell provides a nice introduction; The Little Schemer is also a classic. Besides being a functional programming language, R is also an array programming language. While somewhat of a historical curiosity, learning some basic APL (or J; see “Ruins of forgotten empires”) might give interesting insights. The video below shows how thinking in terms of transformations on vectors and matrices compares to loops and assignments (imperative programming).

### Performance

The performance of R is not great. If performance is a concern, consider writing the hot pieces in native code, C or Fortran, for which R offers excellent support. This can often result in speed-ups of up to 100x. To profile R code you can use Rprof.
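A minimal sketch of using the built-in sampling profiler (the workload here is an arbitrary placeholder):

```r
prof.file <- tempfile()
Rprof(prof.file)                                  # start sampling
invisible(replicate(20, sum(sort(runif(1e6)))))   # some work to profile
Rprof(NULL)                                       # stop sampling
head(summaryRprof(prof.file)$by.self)             # where the time went
```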

When working with large files it can help to memory map them. The packages ff and bigmemory are designed for this.

R is also single-threaded; packages exist that offer various forms of parallelism, but always use common sense when applying them.

Do note that when I/O is a bottleneck, R offers integration with SQL databases, which are designed for fast and sane (indexed) data access!

### Recommended packages

Everything by Hadley Wickham, the man is a genius: ggplot2, dplyr, plyr.
Note that the consensus nowadays seems to be that data.table should be used instead of `data.frame`, and that it makes all the `apply` machinery, and even `plyr`, redundant.
I have yet to try `data.table`, but “R the good parts” goes into considerable depth on why it is great.

## Resources

- Google’s R Style guide, I might not always agree but it’s nice to have a guide
- The R Inferno, “If you are using R and you think you’re in hell, this is a map for you.” (very long list of gotchas)
- An introduction to R, the official introduction
- Another R tutorial
- Quick-R, good references
- R and Data mining, Examples and Case Studies
- Reference Card