Data visualization with R and ggplot2

This is an introduction to the ggplot2 package of R. It was used during workshop during Inspirational afternoon “Datavisualisation in Sociology and Social Sciences”, held on 12 December 2017 in Brussels.

Goal of the tutorial is to show how the ggplot2 package can be used to inspect and explore data in a visual way. This tutorial is not an introduction to R, RStudio or the tidyverse.

Getting started

The easiest way to get started with R and ggplot2 is to download and install RStudio. RStudio makes working with R much easier in many ways. To get started with R, I can highly recommend the free Datacamp course ‘Introduction to R’. Datacamp also has a course on RStudio, which you can start (but not finish) for free.

The data and the RStudio Notebook for this tutorial are on Github.

Loading the packages

Packages are extensions to the R language that make it easier to accomplish certain tasks in R. The ggplot2 package was created to help you make flexible data visualisations in R: you tell it what data to use, how to map variables to visual features, add extra layers to a visualisation and add styling to your chart.

So the first thing we do is loading the ggplot2 package:

library(ggplot2)

The data

The data for this demo are taken from this page and contains details on all people in Belgium that suffered injuries or died as a result of traffic accidents and the circumstances in which these accidents occured.

The data is offered as separate files, but I combined them, so you can load all the data in one step. I only kept the data of the 5 most recent years as to not make the data file too large. I also removed the columns containing text in French and only kept the Dutch descriptions to further reduce the file size.

With head, we can take a peek at the first 6 rows in the data.

vict <- read.csv("victims.csv")
head(vict)

##       DT_DAY DT_HOUR CD_DAY_OF_WEEK TX_DAY_OF_WEEK_DESCR_NL MS_VCT
## 1 2012-01-01      14              6                  zondag   0.00
## 2 2012-01-01      18              7                  zondag   1.30
## 3 2012-01-01      11              1                  zondag   1.00
## 4 2012-01-01      12              5                  zondag   1.49
## 5 2012-01-01      17              5                  zondag   0.00
## 6 2012-01-01       7              2                  zondag   1.23
##   MS_SLY_INJ MS_SERLY_INJ MS_MORY_INJ MS_DEAD MS_DEAD_30_DAYS CD_VCT_TYPE
## 1       0.00            0           0       0               0           1
## 2       1.30            0           0       0               0           1
## 3       1.00            0           0       0               0           1
## 4       1.49            0           0       0               0           1
## 5       0.00            0           0       0               0           1
## 6       1.23            0           0       0               0           1
##       TX_VCT_TYPE_DESCR_NL CD_ROAD_USR_TYPE  TX_ROAD_USR_TYPE_DESCR_NL
## 1 Bestuurder of voetganger                1               Personenauto
## 2 Bestuurder of voetganger                1               Personenauto
## 3 Bestuurder of voetganger               14 Motorfiets meer dan 400 cc
## 4 Bestuurder of voetganger                9            Landbouwtraktor
## 5 Bestuurder of voetganger                1               Personenauto
## 6 Bestuurder of voetganger                1               Personenauto
##   CD_ROAD_TYPE                  TX_ROAD_TYPE_DESCR_NL CD_LIGHT_COND
## 1            2 Gewestweg, provincieweg of gemeenteweg             1
## 2            2 Gewestweg, provincieweg of gemeenteweg             4
## 3            1                            Autosnelweg             1
## 4            2 Gewestweg, provincieweg of gemeenteweg             1
## 5            2 Gewestweg, provincieweg of gemeenteweg             2
## 6            2 Gewestweg, provincieweg of gemeenteweg             3
##                  TX_LIGHT_COND_DESCR_NL CD_COLL_TYPE
## 1                   Bij klaarlichte dag            5
## 2      Nacht, geen openbare verlichting            8
## 3                   Bij klaarlichte dag            8
## 4                   Bij klaarlichte dag            8
## 5                 Dageraad - schemering            4
## 6 Nacht, ontstoken openbare verlichting            4
##                   TX_COLL_TYPE_DESCR_NL CD_BUILD_UP_AREA
## 1                    Met een voetganger                1
## 2 E\303\251n bestuurder, geen hindernis                2
## 3 E\303\251n bestuurder, geen hindernis                2
## 4 E\303\251n bestuurder, geen hindernis                2
## 5                           Langs opzij                1
## 6                           Langs opzij                1
##   TX_BUILD_UP_AREA_DESCR_NL CD_AGE_CLS TX_AGE_CLS_DESCR_NL CD_MUNTY_REFNIS
## 1       Binnen bebouwde kom         91     75 jaar en meer           25121
## 2       Buiten bebouwde kom         41      35 tot 39 jaar           25120
## 3       Buiten bebouwde kom         30      25 tot 29 jaar           25120
## 4       Buiten bebouwde kom         43      35 tot 39 jaar           11002
## 5       Binnen bebouwde kom         76      65 tot 69 jaar           11002
## 6       Binnen bebouwde kom         52      45 tot 49 jaar           11002
##            TX_MUNTY_DESCR_NL CD_DSTR_REFNIS     TX_ADM_DSTR_DESCR_NL
## 1 Ottignies-Louvain-la-Neuve          25000    Arrondissement Nijvel
## 2                 Orp-Jauche          25000    Arrondissement Nijvel
## 3                 Orp-Jauche          25000    Arrondissement Nijvel
## 4                  Antwerpen          11000 Arrondissement Antwerpen
## 5                  Antwerpen          11000 Arrondissement Antwerpen
## 6                  Antwerpen          11000 Arrondissement Antwerpen
##   CD_PROV_REFNIS        TX_PROV_DESCR_NL CD_RGN_REFNIS TX_RGN_DESCR_NL
## 1          20002 Provincie Waals-Brabant          3000    Waals Gewest
## 2          20002 Provincie Waals-Brabant          3000    Waals Gewest
## 3          20002 Provincie Waals-Brabant          3000    Waals Gewest
## 4          10000     Provincie Antwerpen          2000   Vlaams Gewest
## 5          10000     Provincie Antwerpen          2000   Vlaams Gewest
## 6          10000     Provincie Antwerpen          2000   Vlaams Gewest
##   CD_SEX TX_SEX_DESCR_NL
## 1      M          Mannen
## 2      M          Mannen
## 3      F         Vrouwen
## 4      F         Vrouwen
## 5      M          Mannen
## 6      F         Vrouwen

As you can see, a lot of data is registered when someone gets involved in a traffic accident. Let’s see what we can learn from the data.

The first plot

Time to make the first plot. The ggplot syntax takes a little time to get used to, so let’s start small: let’s make a histogram of the age of the traffic accident victims.

NOTE: The data doesn’t contain the exact age of the accident victims. We are using a column in the data that contains a numerical code closely related to the actual age.

ggplot(vict, aes(CD_AGE_CLS)) +
  geom_histogram()

Nice: our first ggplot chart! Let’s explore what’s going on here:

vict is the table (‘dataframe’ in R terminology) that contains the data that we loaded in. We put it as the first parameter in the ggplot function.
the column containing the age of the victims is called CD_AGE_CLS. We put it in the aes() of the ggplot function. aes() stands for “aesthetics” and you could read the code as “map the variable CD_AGE_CLS to the aeshetics of the plot”
after the main ggplot function, we add the type of chart we would like to make. In the ggplot language, this is called a geometry. So we add a histogram geometry to the plot.

From the histogram, we see that the distribution of traffic victims is skewed towards younger people, with a peak around 30 years.

We also see a weird peak to the left of the histogram. We take a look at these records in the data by using `filter’ from the handy dplyr package.

library(dplyr)
head(filter(vict, CD_AGE_CLS < 3))

##       DT_DAY DT_HOUR CD_DAY_OF_WEEK TX_DAY_OF_WEEK_DESCR_NL MS_VCT
## 1 2012-01-01       3              2                  zondag   1.00
## 2 2012-01-01       9              1                  zondag   0.00
## 3 2012-01-01       9              4                  zondag   1.00
## 4 2012-01-01       6              7                  zondag   0.00
## 5 2012-01-01      12              6                  zondag   0.00
## 6 2012-01-01       7              1                  zondag   1.33
##   MS_SLY_INJ MS_SERLY_INJ MS_MORY_INJ MS_DEAD MS_DEAD_30_DAYS CD_VCT_TYPE
## 1       1.00            0           0       0               0           2
## 2       0.00            0           0       0               0           1
## 3       1.00            0           0       0               0           1
## 4       0.00            0           0       0               0           1
## 5       0.00            0           0       0               0           1
## 6       1.33            0           0       0               0           1
##       TX_VCT_TYPE_DESCR_NL CD_ROAD_USR_TYPE TX_ROAD_USR_TYPE_DESCR_NL
## 1                Passagier                2  Auto voor dubbel gebruik
## 2 Bestuurder of voetganger                1              Personenauto
## 3 Bestuurder of voetganger                1              Personenauto
## 4 Bestuurder of voetganger                1              Personenauto
## 5 Bestuurder of voetganger                1              Personenauto
## 6 Bestuurder of voetganger                1              Personenauto
##   CD_ROAD_TYPE                  TX_ROAD_TYPE_DESCR_NL CD_LIGHT_COND
## 1            2 Gewestweg, provincieweg of gemeenteweg             3
## 2            2 Gewestweg, provincieweg of gemeenteweg             1
## 3            1                            Autosnelweg             2
## 4            2 Gewestweg, provincieweg of gemeenteweg             3
## 5            2 Gewestweg, provincieweg of gemeenteweg             1
## 6            2 Gewestweg, provincieweg of gemeenteweg             3
##                  TX_LIGHT_COND_DESCR_NL CD_COLL_TYPE
## 1 Nacht, ontstoken openbare verlichting            4
## 2                   Bij klaarlichte dag            2
## 3                 Dageraad - schemering            3
## 4 Nacht, ontstoken openbare verlichting            6
## 5                   Bij klaarlichte dag            4
## 6 Nacht, ontstoken openbare verlichting            3
##                   TX_COLL_TYPE_DESCR_NL CD_BUILD_UP_AREA
## 1                           Langs opzij                2
## 2 Frontale botsing (of bij het kruisen)                1
## 3      Langs achteren (of naast elkaar)                2
## 4     Tegen een hindernis op de rijbaan                1
## 5                           Langs opzij                2
## 6      Langs achteren (of naast elkaar)                2
##   TX_BUILD_UP_AREA_DESCR_NL CD_AGE_CLS TX_AGE_CLS_DESCR_NL CD_MUNTY_REFNIS
## 1       Buiten bebouwde kom          2    Niet beschikbaar           11001
## 2       Binnen bebouwde kom          2    Niet beschikbaar           12002
## 3       Buiten bebouwde kom          2    Niet beschikbaar           11053
## 4       Binnen bebouwde kom          2    Niet beschikbaar           11050
## 5       Buiten bebouwde kom          2    Niet beschikbaar           11024
## 6       Buiten bebouwde kom          2    Niet beschikbaar           11004
##   TX_MUNTY_DESCR_NL CD_DSTR_REFNIS     TX_ADM_DSTR_DESCR_NL CD_PROV_REFNIS
## 1        Aartselaar          11000 Arrondissement Antwerpen          10000
## 2           Berlaar          12000  Arrondissement Mechelen          10000
## 3        Wuustwezel          11000 Arrondissement Antwerpen          10000
## 4          Wijnegem          11000 Arrondissement Antwerpen          10000
## 5           Kontich          11000 Arrondissement Antwerpen          10000
## 6          Boechout          11000 Arrondissement Antwerpen          10000
##      TX_PROV_DESCR_NL CD_RGN_REFNIS TX_RGN_DESCR_NL CD_SEX
## 1 Provincie Antwerpen          2000   Vlaams Gewest       
## 2 Provincie Antwerpen          2000   Vlaams Gewest       
## 3 Provincie Antwerpen          2000   Vlaams Gewest       
## 4 Provincie Antwerpen          2000   Vlaams Gewest       
## 5 Provincie Antwerpen          2000   Vlaams Gewest       
## 6 Provincie Antwerpen          2000   Vlaams Gewest       
##    TX_SEX_DESCR_NL
## 1 Niet beschikbaar
## 2 Niet beschikbaar
## 3 Niet beschikbaar
## 4 Niet beschikbaar
## 5 Niet beschikbaar
## 6 Niet beschikbaar

When you check the TX_AGE_CLS_DESCR_NL column of these records, you’ll notice they contain the value ‘Niet beschikbaar’ (‘Not available’). So some records have missing values and got assigned a value of 2 for the CD_AGE_CLS variable.

We’ll ignore that for now, but we’ve already discovered an inconsistency in the data by making a ggplot visulisation!

Man vs women

Let’s take the same histogram and introduce an extra variable: the gender of the victims. This is stored in the CD_SEX column.

ggplot(vict, aes(CD_AGE_CLS, fill = CD_SEX)) +
  geom_histogram()

We mapped the gender column to the fill colour aesthetic of the histogram. ggplot added a color legend for us, and as you can see we have one color for females, one for males and two extra colors. On inspecting the data, the two extra colours seem to represent missing values. We filter these records out:

vict <- filter(vict, CD_SEX != " " & CD_SEX != "")

We also got a warning: ggplot tells us that we are using the default of 30 bins, and that we could pick a better value for binwidth. Let’s do that, and set the width of the histogram bins to 1:

ggplot(vict, aes(CD_AGE_CLS, fill = CD_SEX)) +
  geom_histogram(binwidth = 1)

A pattern emerges: for values 22, 33, 44, … the the data contains a lot less records then for other values. We are going to ignore this inconsistency in the data too.

The stacking of the bars for men and women make comparisons difficult. Let’s make 2 separate histograms. We use `facet_wrap()’ for that and tell it to make seperate histograms for each sex:

ggplot(vict, aes(x = CD_AGE_CLS, fill = CD_SEX)) +
  geom_histogram(binwidth = 1) +
  facet_wrap(~CD_SEX)

Apart from the height of the histograms, the distributions seem to be quite similar between men and women. But using another geometry (the geom_density, which generates density estimates), we can highlight the relative differences:

ggplot(filter(vict, CD_SEX != " "), aes(x = CD_AGE_CLS, fill = CD_SEX)) +
  geom_density(alpha = 0.5)

The density plots are plotted on top of each other, but by using some transparency (the alpha in geom_density) we can see that girls and older women are more likely to get involved in accidents, while men between 18 and 65 are more likely to get involved in accidents then women of the same age.

Facetting is a very powerfull technique in visual data exploration. It allows you to spot relations accross multiple dimensions. Here we add the extra dimension of the type of victim (driver vs passanger).

ggplot(vict, aes(x = CD_AGE_CLS, fill = CD_SEX)) +
  geom_histogram(binwidth = 1) +
  facet_grid(CD_SEX ~ TX_VCT_TYPE_DESCR_NL)

Road user types

In the same way that ggplot does the counting for us when it generates a histogram, it also does the counting for us when making a bar chart. So we change the numerical variable we used for the histograms (victims’ age) to a categorical variable (the road user type):

ggplot(vict, aes(TX_ROAD_USR_TYPE_DESCR_NL)) +
  geom_bar()

That is not really readable, so let’s fix that by rotating the chart 90 degrees with coord_flip(), so we’ll have non-overlapping, horizontal category labels.

ggplot(vict, aes(TX_ROAD_USR_TYPE_DESCR_NL)) +
  geom_bar() +
  coord_flip()

‘Personenauto’ (passenger car) and ‘Fiets’ (bicycle) are the most common road user types involved in traffic accidents.

We want to sort the bars and also add data labels to further improve the legibility of the chart. The easiest way to do this, is to create a new dataframe containing the data we need. So we are going to take the counting out of the hands of ggplot and do it ourselves. We group the victims by road user type and then count the number of victims in each group.

vict.byusrtype <- group_by(vict, TX_ROAD_USR_TYPE_DESCR_NL) %>% summarise(total = n())
head(vict.byusrtype)

## # A tibble: 6 x 2
##   TX_ROAD_USR_TYPE_DESCR_NL total
##                      <fctr> <int>
## 1         Andere voetganger 12989
## 2       Andere weggebruiker   938
## 3  Auto voor dubbel gebruik 14379
## 4                   Autobus  1990
## 5                   Autocar   262
## 6        Bespannen voertuig    83

ggplot(vict.byusrtype, aes(x = TX_ROAD_USR_TYPE_DESCR_NL, y = total)) +
  geom_col()

Notice that we changed from geom_bar to geom_col. geom_bar does the counting for us, while with geom_col you can tell ggplot which numerical values to use to map to the height of the columns. That’s why in aes we specified to use the total for y (which determines the height of the columns).

Now let’s flip the chart again, reorder the bars by total and add the data labels with an new geometry: geom_text.

ggplot(vict.byusrtype, aes(x = reorder(TX_ROAD_USR_TYPE_DESCR_NL, total), y = total)) +
  geom_col() +
  geom_text(aes(label = total)) + 
  coord_flip()

Notice that geom_text also needs x and y coordinates to plot the data labels in the correct position. If these are not specified, geom_text “inherits” these from the aes specified in the main ggplot function. So geometry layers in a ggplot visualisation can all share the same aesthetics or they can use their own by overriding the “default” aesthetics.

Time

Let’s take a look at what time of the day people get involved in traffic accidents.

vict.byhour <- group_by(vict, DT_HOUR) %>% summarise(total = n())
ggplot(vict.byhour, aes(x = DT_HOUR, y = total)) +
  geom_point() +
  geom_line()

Morning and evening rush hours can be easily spotted. We used two new geometries here: geom_line and geom_point. Both inherit the x and y from the main aesthetics.

Now let’s compare weekdays with days in the weekend.

vict.byhourweekday <- group_by(vict, DT_HOUR, TX_DAY_OF_WEEK_DESCR_NL) %>% summarise(total = n())
ggplot(vict.byhourweekday, aes(x = DT_HOUR, y = total, color = TX_DAY_OF_WEEK_DESCR_NL)) +
  geom_line()

We can use another geometry (geom_area) and some facetting to more clearly show that the main differences between weekdays and weekends are the early morning hours (more victims in weekends) and the morning rush hour (less victims in weekends).

ggplot(vict.byhourweekday, aes(x = DT_HOUR, y = total, fill = TX_DAY_OF_WEEK_DESCR_NL)) +
  geom_area() +
  facet_wrap(~TX_DAY_OF_WEEK_DESCR_NL)

We can do the same for the 5 years where we have data for. In order to do this, we have to extract the year from the column that contains the date. We use mutate to calculate a new column.

vict.byhouryear <- mutate(vict, year = format(as.Date(DT_DAY, format="%Y-%m-%d"),"%Y"))
vict.byhouryear <- group_by(vict.byhouryear, DT_HOUR, year) %>% summarise(total = n())
ggplot(vict.byhouryear, aes(x = DT_HOUR, y = total, color = year)) + geom_line()

Patterns for all 5 years are similar, but we see a big shift downwards between 2012 and 2013.

One more: ligting conditions by hour of day.

vict.byhour <- group_by(vict, DT_HOUR, TX_LIGHT_COND_DESCR_NL) %>% summarise(total = n())
ggplot(vict.byhour, aes(x = DT_HOUR, y = total, fill = TX_LIGHT_COND_DESCR_NL)) +
  geom_area()

Note that stacking can make it hard to interpret trends in individual categories. The use of facets or other geometries (like geom_line) can overcome this.

Heatmap of road user types

Let’s combine the road user type and the hour of day to spot trends for each road user type category. We use geom_tile to do that.

vict.byhourtype <- group_by(vict, TX_ROAD_USR_TYPE_DESCR_NL, DT_HOUR) %>% summarise(total = n())
ggplot(vict.byhourtype, aes(x = DT_HOUR, y = TX_ROAD_USR_TYPE_DESCR_NL, fill = total)) +
  geom_tile()

Hm, that’s not very informative: we can only see some trends for cars and bicycles. Becaus of their much higher frequencies, they obscure the daily trend of the other road user types.

So we need relative numbers for each category. And let’s use the viridis color scale for the same plot:

vict.byhourtype <- group_by(vict, TX_ROAD_USR_TYPE_DESCR_NL, DT_HOUR) %>% summarise(total = n()) %>% mutate(relative = total/sum(total))
library(viridis)
ggplot(vict.byhourtype, aes(x = DT_HOUR, y = TX_ROAD_USR_TYPE_DESCR_NL, fill = relative)) +
  geom_tile() +
  scale_fill_viridis(direction = -1)

So people riding horses (‘Ruiter’) don’t get involved in accidents during the night, but especially around 4-5 pm :-)

Seasonality by road user types

We create a new variable containing the month and then plot the number of victims by road user type over the months. We facet the results and allow the Y axis to be independent (“free”) between the different plots.

vict.bymonthtype <- mutate(vict, month = format(as.Date(DT_DAY, format="%Y-%m-%d"),"%m")) %>% group_by(TX_ROAD_USR_TYPE_DESCR_NL, month) %>% summarise(total = n())
ggplot(vict.bymonthtype, aes(x = month, y = total, group = TX_ROAD_USR_TYPE_DESCR_NL)) +
  geom_line() +
  facet_wrap(~TX_ROAD_USR_TYPE_DESCR_NL, scales = "free")

Map

The municipality where accidents occured is also present in the data, so we can make a maps too.

We start with a shapefile (a geographical file format) of all the 589 Belgian municipalities and convert it to something ggplot2 can work with. We use the tidy function from the broom package. We also need some extra packages to load the shapefile correctly in R.

library(rgeos)
library(maptools)
library(broom)

##readShapeSpatial reads the shapefile and loads into R
mun <- readShapeSpatial("geodata/municip_simp.shp")

##The tidy function takes the loaded geodata and converts it into a dataframe that can be plugged into ggplot. We make sure that every municipality is identified by its unique code (the "NISCODE") and we make a new column in the dataframe (called "id") with that code. We will use this code to join data to the data frame with the municipalities
mun.tidy <- tidy(mun, region = "NISCODE") %>% mutate(id = as.numeric(id))

ggplot() + 
  geom_polygon(data = mun.tidy, aes(x = long, y = lat, group = group))

Now let’s calculate the total number of deadly victims of traffic accidents for every municipality and plot that on the map. We also set the the map to have equal coordinate scales for x and y (the map data uses the Lambert72 projections, which has meters as units for the coordinates).

NOTE: it is actually a bad idea to use absolute numbers on these kind of maps. Numbers should be scaled (eg to the population or the road length in every municipality).

##The CD_MUNTY_REFNIS contains the same codes to identify a municipality as the geographical data
vict.deads <- group_by(vict, CD_MUNTY_REFNIS) %>% summarise(deads = sum(MS_DEAD_30_DAYS))

##We change the column names of the new data frame. The column containing the municipality code becomes "id", so we can join the data to the map data
colnames(vict.deads) <- c("id", "deads")

mun.deads <- left_join(mun.tidy, vict.deads, by = "id")

ggplot() +
  geom_polygon(data = mun.deads, aes(x = long, y = lat, group = group, fill = deads)) +
  scale_fill_viridis(direction = -1)  +
  coord_equal()

We can also facet maps:

vict.deadsyear <- mutate(vict, year = format(as.Date(DT_DAY, format="%Y-%m-%d"),"%Y")) %>% group_by(CD_MUNTY_REFNIS, year) %>% summarise(deads = sum(MS_DEAD_30_DAYS))
colnames(vict.deadsyear) <- c("id", "year", "deads")
mun.deadsyear <- left_join(mun.tidy, vict.deadsyear, by = "id")
ggplot() + geom_polygon(data = mun.deadsyear, aes(x = long, y = lat, group = group, fill = deads)) +
  scale_fill_viridis(direction = -1)  +
  coord_equal() +
  facet_wrap(~year)

Wrap up

Once you went up the learning curve, ggplot2 can help you to take a look at your data from possible angles. By mapping variables to different aesthetics and by using different geometries, you’ll be able to spot anomalies, trends and outliers in your data.

ggplot2 can be extended with other packages (eg from the Tidyverse to slice and dice your data, or geospatial packages for making maps). There are also plugins for ggplot for working with specific data (like graph data), for animating your plots, making them interactive or for making specific visualisations (like alluvial diagrams).

ggplot output can also be styled very flexibly with plot theming. Generated visualisations can then be saved to different file formats with ggsave.

More resources

Data Visualization for Social Science: A practical introduction with R and ggplot2

Datacamp course ‘Data Visualization with ggplot2’

Data Visualization with ggplot2: Cheat Sheet