This is an introduction to the ggplot2 package for R. It was used in a workshop during the Inspirational afternoon “Datavisualisation in Sociology and Social Sciences”, held on 12 December 2017 in Brussels.

The goal of this tutorial is to show how the ggplot2 package can be used to inspect and explore data visually. This tutorial is not an introduction to R, RStudio or the tidyverse.

Getting started

The easiest way to get started with R and ggplot2 is to download and install RStudio. RStudio makes working with R much easier in many ways. To get started with R, I can highly recommend the free Datacamp course ‘Introduction to R’. Datacamp also has a course on RStudio, which you can start (but not finish) for free.

The data and the RStudio Notebook for this tutorial are on GitHub.

Loading the packages

Packages are extensions to the R language that make it easier to accomplish certain tasks in R. The ggplot2 package was created to help you make flexible data visualisations in R: you tell it what data to use and how to map variables to visual features, and then you add extra layers and styling to your chart.

So the first thing we do is load the ggplot2 package:

library(ggplot2)
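
If ggplot2 is not installed yet, library() stops with an error. A minimal sketch of a guard, assuming you have internet access; the repos URL is one commonly used CRAN mirror and is my choice, not something this tutorial prescribes:

```r
# Install ggplot2 only when it is missing, then load it.
if (!requireNamespace("ggplot2", quietly = TRUE)) {
  install.packages("ggplot2", repos = "https://cloud.r-project.org")
}
library(ggplot2)
```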

The data

The data for this demo are taken from this page and contain details on all people in Belgium who were injured or died as a result of traffic accidents, and on the circumstances in which these accidents occurred.

The data are offered as separate files, but I combined them so you can load everything in one step. I only kept the 5 most recent years so as not to make the data file too large. I also removed the columns containing text in French and only kept the Dutch descriptions to further reduce the file size.

With head, we can take a peek at the first 6 rows in the data.

vict <- read.csv("victims.csv")
head(vict)
##       DT_DAY DT_HOUR CD_DAY_OF_WEEK TX_DAY_OF_WEEK_DESCR_NL MS_VCT
## 1 2012-01-01      14              6                  zondag   0.00
## 2 2012-01-01      18              7                  zondag   1.30
## 3 2012-01-01      11              1                  zondag   1.00
## 4 2012-01-01      12              5                  zondag   1.49
## 5 2012-01-01      17              5                  zondag   0.00
## 6 2012-01-01       7              2                  zondag   1.23
##   MS_SLY_INJ MS_SERLY_INJ MS_MORY_INJ MS_DEAD MS_DEAD_30_DAYS CD_VCT_TYPE
## 1       0.00            0           0       0               0           1
## 2       1.30            0           0       0               0           1
## 3       1.00            0           0       0               0           1
## 4       1.49            0           0       0               0           1
## 5       0.00            0           0       0               0           1
## 6       1.23            0           0       0               0           1
##       TX_VCT_TYPE_DESCR_NL CD_ROAD_USR_TYPE  TX_ROAD_USR_TYPE_DESCR_NL
## 1 Bestuurder of voetganger                1               Personenauto
## 2 Bestuurder of voetganger                1               Personenauto
## 3 Bestuurder of voetganger               14 Motorfiets meer dan 400 cc
## 4 Bestuurder of voetganger                9            Landbouwtraktor
## 5 Bestuurder of voetganger                1               Personenauto
## 6 Bestuurder of voetganger                1               Personenauto
##   CD_ROAD_TYPE                  TX_ROAD_TYPE_DESCR_NL CD_LIGHT_COND
## 1            2 Gewestweg, provincieweg of gemeenteweg             1
## 2            2 Gewestweg, provincieweg of gemeenteweg             4
## 3            1                            Autosnelweg             1
## 4            2 Gewestweg, provincieweg of gemeenteweg             1
## 5            2 Gewestweg, provincieweg of gemeenteweg             2
## 6            2 Gewestweg, provincieweg of gemeenteweg             3
##                  TX_LIGHT_COND_DESCR_NL CD_COLL_TYPE
## 1                   Bij klaarlichte dag            5
## 2      Nacht, geen openbare verlichting            8
## 3                   Bij klaarlichte dag            8
## 4                   Bij klaarlichte dag            8
## 5                 Dageraad - schemering            4
## 6 Nacht, ontstoken openbare verlichting            4
##                   TX_COLL_TYPE_DESCR_NL CD_BUILD_UP_AREA
## 1                    Met een voetganger                1
## 2        Eén bestuurder, geen hindernis                2
## 3        Eén bestuurder, geen hindernis                2
## 4        Eén bestuurder, geen hindernis                2
## 5                           Langs opzij                1
## 6                           Langs opzij                1
##   TX_BUILD_UP_AREA_DESCR_NL CD_AGE_CLS TX_AGE_CLS_DESCR_NL CD_MUNTY_REFNIS
## 1       Binnen bebouwde kom         91     75 jaar en meer           25121
## 2       Buiten bebouwde kom         41      35 tot 39 jaar           25120
## 3       Buiten bebouwde kom         30      25 tot 29 jaar           25120
## 4       Buiten bebouwde kom         43      35 tot 39 jaar           11002
## 5       Binnen bebouwde kom         76      65 tot 69 jaar           11002
## 6       Binnen bebouwde kom         52      45 tot 49 jaar           11002
##            TX_MUNTY_DESCR_NL CD_DSTR_REFNIS     TX_ADM_DSTR_DESCR_NL
## 1 Ottignies-Louvain-la-Neuve          25000    Arrondissement Nijvel
## 2                 Orp-Jauche          25000    Arrondissement Nijvel
## 3                 Orp-Jauche          25000    Arrondissement Nijvel
## 4                  Antwerpen          11000 Arrondissement Antwerpen
## 5                  Antwerpen          11000 Arrondissement Antwerpen
## 6                  Antwerpen          11000 Arrondissement Antwerpen
##   CD_PROV_REFNIS        TX_PROV_DESCR_NL CD_RGN_REFNIS TX_RGN_DESCR_NL
## 1          20002 Provincie Waals-Brabant          3000    Waals Gewest
## 2          20002 Provincie Waals-Brabant          3000    Waals Gewest
## 3          20002 Provincie Waals-Brabant          3000    Waals Gewest
## 4          10000     Provincie Antwerpen          2000   Vlaams Gewest
## 5          10000     Provincie Antwerpen          2000   Vlaams Gewest
## 6          10000     Provincie Antwerpen          2000   Vlaams Gewest
##   CD_SEX TX_SEX_DESCR_NL
## 1      M          Mannen
## 2      M          Mannen
## 3      F         Vrouwen
## 4      F         Vrouwen
## 5      M          Mannen
## 6      F         Vrouwen

As you can see, a lot of data is registered when someone gets involved in a traffic accident. Let’s see what we can learn from the data.
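
Besides head(), a few base R functions give a quick overview of the size and column types of a data frame before you start plotting. The sketch below uses a tiny made-up data frame called toy (hypothetical values, with a handful of the real column names) so that it runs without the CSV file; with the real data you would pass vict instead.

```r
# Toy stand-in for the victims data: hypothetical values, real column names.
toy <- data.frame(
  DT_DAY     = c("2012-01-01", "2012-01-01", "2012-01-02"),
  DT_HOUR    = c(14, 18, 7),
  CD_AGE_CLS = c(91, 41, 2),
  CD_SEX     = c("M", "M", ""),
  stringsAsFactors = FALSE
)

dim(toy)                 # number of rows and columns
str(toy)                 # type and first values of every column
summary(toy$CD_AGE_CLS)  # range of the numeric age code
```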

The first plot

Time to make the first plot. The ggplot syntax takes a little time to get used to, so let’s start small: let’s make a histogram of the age of the traffic accident victims.

NOTE: The data doesn’t contain the exact age of the accident victims. We are using a column in the data that contains a numerical code closely related to the actual age.

ggplot(vict, aes(CD_AGE_CLS)) +
  geom_histogram()

Nice: our first ggplot chart! Let’s explore what’s going on here:
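
The same call can be written with every argument named, which makes its three parts easier to see: the data, the mapping of a column to an aesthetic, and the layer that draws something. This is a sketch on a small made-up data frame (toy is hypothetical; only the column name comes from the data above):

```r
library(ggplot2)

# Hypothetical age codes standing in for vict$CD_AGE_CLS.
toy <- data.frame(CD_AGE_CLS = c(2, 2, 25, 30, 30, 31, 45, 52, 76, 91))

p <- ggplot(data = toy,                      # 1. the data frame to plot
            mapping = aes(x = CD_AGE_CLS)) + # 2. map a column to the x axis
  geom_histogram(bins = 30)                  # 3. add a histogram layer

p  # printing the object draws the chart
```

geom_histogram() bins the x values and counts the records in each bin, so you never have to supply the y values yourself.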

From the histogram, we see that the distribution of traffic victims is skewed towards younger people, with a peak around 30 years.

We also see a weird peak at the left of the histogram. We can take a look at these records by using filter() from the handy dplyr package.

library(dplyr)
head(filter(vict, CD_AGE_CLS < 3))
##       DT_DAY DT_HOUR CD_DAY_OF_WEEK TX_DAY_OF_WEEK_DESCR_NL MS_VCT
## 1 2012-01-01       3              2                  zondag   1.00
## 2 2012-01-01       9              1                  zondag   0.00
## 3 2012-01-01       9              4                  zondag   1.00
## 4 2012-01-01       6              7                  zondag   0.00
## 5 2012-01-01      12              6                  zondag   0.00
## 6 2012-01-01       7              1                  zondag   1.33
##   MS_SLY_INJ MS_SERLY_INJ MS_MORY_INJ MS_DEAD MS_DEAD_30_DAYS CD_VCT_TYPE
## 1       1.00            0           0       0               0           2
## 2       0.00            0           0       0               0           1
## 3       1.00            0           0       0               0           1
## 4       0.00            0           0       0               0           1
## 5       0.00            0           0       0               0           1
## 6       1.33            0           0       0               0           1
##       TX_VCT_TYPE_DESCR_NL CD_ROAD_USR_TYPE TX_ROAD_USR_TYPE_DESCR_NL
## 1                Passagier                2  Auto voor dubbel gebruik
## 2 Bestuurder of voetganger                1              Personenauto
## 3 Bestuurder of voetganger                1              Personenauto
## 4 Bestuurder of voetganger                1              Personenauto
## 5 Bestuurder of voetganger                1              Personenauto
## 6 Bestuurder of voetganger                1              Personenauto
##   CD_ROAD_TYPE                  TX_ROAD_TYPE_DESCR_NL CD_LIGHT_COND
## 1            2 Gewestweg, provincieweg of gemeenteweg             3
## 2            2 Gewestweg, provincieweg of gemeenteweg             1
## 3            1                            Autosnelweg             2
## 4            2 Gewestweg, provincieweg of gemeenteweg             3
## 5            2 Gewestweg, provincieweg of gemeenteweg             1
## 6            2 Gewestweg, provincieweg of gemeenteweg             3
##                  TX_LIGHT_COND_DESCR_NL CD_COLL_TYPE
## 1 Nacht, ontstoken openbare verlichting            4
## 2                   Bij klaarlichte dag            2
## 3                 Dageraad - schemering            3
## 4 Nacht, ontstoken openbare verlichting            6
## 5                   Bij klaarlichte dag            4
## 6 Nacht, ontstoken openbare verlichting            3
##                   TX_COLL_TYPE_DESCR_NL CD_BUILD_UP_AREA
## 1                           Langs opzij                2
## 2 Frontale botsing (of bij het kruisen)                1
## 3      Langs achteren (of naast elkaar)                2
## 4     Tegen een hindernis op de rijbaan                1
## 5                           Langs opzij                2
## 6      Langs achteren (of naast elkaar)                2
##   TX_BUILD_UP_AREA_DESCR_NL CD_AGE_CLS TX_AGE_CLS_DESCR_NL CD_MUNTY_REFNIS
## 1       Buiten bebouwde kom          2    Niet beschikbaar           11001
## 2       Binnen bebouwde kom          2    Niet beschikbaar           12002
## 3       Buiten bebouwde kom          2    Niet beschikbaar           11053
## 4       Binnen bebouwde kom          2    Niet beschikbaar           11050
## 5       Buiten bebouwde kom          2    Niet beschikbaar           11024
## 6       Buiten bebouwde kom          2    Niet beschikbaar           11004
##   TX_MUNTY_DESCR_NL CD_DSTR_REFNIS     TX_ADM_DSTR_DESCR_NL CD_PROV_REFNIS
## 1        Aartselaar          11000 Arrondissement Antwerpen          10000
## 2           Berlaar          12000  Arrondissement Mechelen          10000
## 3        Wuustwezel          11000 Arrondissement Antwerpen          10000
## 4          Wijnegem          11000 Arrondissement Antwerpen          10000
## 5           Kontich          11000 Arrondissement Antwerpen          10000
## 6          Boechout          11000 Arrondissement Antwerpen          10000
##      TX_PROV_DESCR_NL CD_RGN_REFNIS TX_RGN_DESCR_NL CD_SEX
## 1 Provincie Antwerpen          2000   Vlaams Gewest       
## 2 Provincie Antwerpen          2000   Vlaams Gewest       
## 3 Provincie Antwerpen          2000   Vlaams Gewest       
## 4 Provincie Antwerpen          2000   Vlaams Gewest       
## 5 Provincie Antwerpen          2000   Vlaams Gewest       
## 6 Provincie Antwerpen          2000   Vlaams Gewest       
##    TX_SEX_DESCR_NL
## 1 Niet beschikbaar
## 2 Niet beschikbaar
## 3 Niet beschikbaar
## 4 Niet beschikbaar
## 5 Niet beschikbaar
## 6 Niet beschikbaar

When you check the TX_AGE_CLS_DESCR_NL column of these records, you’ll notice they contain the value ‘Niet beschikbaar’ (‘Not available’). So some records have missing values and got assigned a value of 2 for the CD_AGE_CLS variable.
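
To see how many records carry this placeholder, you could count the rows per age description. The sketch below assumes a made-up data frame toy with hypothetical values; the optional last step, which recodes the placeholder 2 to NA, is my addition and not something the tutorial does.

```r
library(dplyr)

# Hypothetical records: two of them have no age available.
toy <- data.frame(
  CD_AGE_CLS = c(2, 2, 30, 41, 91),
  TX_AGE_CLS_DESCR_NL = c("Niet beschikbaar", "Niet beschikbaar",
                          "25 tot 29 jaar", "35 tot 39 jaar",
                          "75 jaar en meer"),
  stringsAsFactors = FALSE
)

# Number of records per age description.
age_counts <- count(toy, TX_AGE_CLS_DESCR_NL)
age_counts

# Optional: turn the placeholder code into a real missing value.
toy_na <- mutate(toy, CD_AGE_CLS = na_if(CD_AGE_CLS, 2))
sum(is.na(toy_na$CD_AGE_CLS))
```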

We’ll ignore that for now, but we’ve already discovered an inconsistency in the data by making a ggplot visualisation!

Men vs women

Let’s take the same histogram and introduce an extra variable: the gender of the victims. This is stored in the CD_SEX column.

ggplot(vict, aes(CD_AGE_CLS, fill = CD_SEX)) +
  geom_histogram()

We mapped the gender column to the fill colour aesthetic of the histogram. ggplot added a colour legend for us, and as you can see we have one colour for females, one for males and two extra colours. On inspecting the data, the two extra colours turn out to represent missing values. We filter these records out:

vict <- filter(vict, CD_SEX != " " & CD_SEX != "")
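
A quick way to check what such a filter does is to tabulate the column before and after. The sketch below runs the same condition on a made-up data frame (toy and its values are hypothetical):

```r
library(dplyr)

# Hypothetical records: two rows have a blank or empty sex code.
toy <- data.frame(
  CD_AGE_CLS = c(30, 41, 2, 2, 52),
  CD_SEX     = c("M", "F", " ", "", "F"),
  stringsAsFactors = FALSE
)

table(toy$CD_SEX)  # four distinct values, two of them blank

# Same condition as above: keep rows whose CD_SEX is neither " " nor "".
toy_clean <- filter(toy, CD_SEX != " " & CD_SEX != "")

table(toy_clean$CD_SEX)  # only F and M remain
nrow(toy_clean)
```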

ggplot also printed a message: it tells us that we are using the default of 30 bins, and that we could pick a better value with binwidth. Let’s do that, and set the width of the histogram bins to 1:

ggplot(vict, aes(CD_AGE_CLS, fill = CD_SEX)) +
  geom_histogram(binwidth = 1)
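
To check what a given binwidth does, you can ask ggplot for the values it computed for the histogram layer with layer_data(). A sketch on made-up age codes (toy is hypothetical):

```r
library(ggplot2)

# Hypothetical age codes.
toy <- data.frame(CD_AGE_CLS = c(20, 20, 21, 23, 23, 23, 30))

p <- ggplot(toy, aes(CD_AGE_CLS)) +
  geom_histogram(binwidth = 1)

# layer_data() returns the data ggplot computed for the first layer:
# one row per bin, with its limits and the number of records in it.
bins <- layer_data(p, 1)
bins[, c("xmin", "xmax", "count")]

sum(bins$count)  # every record lands in exactly one bin
```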

A pattern emerges: for values 22, 33, 44, … the data contains far fewer records than for other values. We are going to ignore this inconsistency in the data too.

The stacking of the bars for men and women makes comparisons difficult. Let’s make 2 separate histograms. We use facet_wrap() for that and tell it to make a separate histogram for each sex:

ggplot(vict, aes(x = CD_AGE_CLS, fill = CD_SEX)) +
  geom_histogram(binwidth = 1) +
  facet_wrap(~CD_SEX)
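
By default facet_wrap() chooses the panel layout itself. One possible variation, which is my suggestion and not part of the original tutorial, is to set ncol = 1 so that the panels are stacked vertically and share the same horizontal position for every age code. A sketch on made-up data (toy is hypothetical):

```r
library(ggplot2)

# Hypothetical records: six men and six women.
toy <- data.frame(
  CD_AGE_CLS = c(20, 21, 21, 30, 30, 45, 22, 30, 31, 31, 52, 76),
  CD_SEX     = rep(c("M", "F"), each = 6)
)

p <- ggplot(toy, aes(x = CD_AGE_CLS, fill = CD_SEX)) +
  geom_histogram(binwidth = 1) +
  facet_wrap(~CD_SEX, ncol = 1)  # one column: panels stacked vertically

p
```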