5 Basics of the `ggplot2` package

In this chapter, we’ll explore the basics of the ggplot2 package. This package is one of the most popular R packages, and the de facto standard for creating publishable visualizations. A later chapter will present more details of its possibilities.

Whole books have been written about ggplot2 (e.g. ggplot2: Elegant Graphics for Data Analysis); these will not be repeated here. Instead, I have selected the minimal amount of information and examples to get you going in your own visualization endeavours in biomedical research. For that reason, this chapter only deals with the base ggplot() function and its most important usage scenarios.
In my opinion, you are best prepared when first learning the ggplot “language” structure, not the complete listing of possibilities. You can check these out later on your own. If you are interested in what the package has to offer, type help(package="ggplot2") on the console.

5.1 Getting started

Keep the goal in mind

You should always remember the purpose with which you create a plot:

Communicate results in a visual way. The audience consists of other professionals: fellow scientists, students, project managers, CEOs. The scope is in reports, publications, presentations etc. Your plots should be immaculately annotated - have a title and/or caption, axis labels with physical quantities (e.g. Temperature) and measurement units (e.g. Celsius), and a legend (if relevant).
Create a representation of data for visual inspection. The audience is yourself. This is especially important in Exploratory Data Analysis (EDA). You visualize your data in order to discover patterns, trends, outliers and to generate new questions and hypotheses. The biggest challenge is to select the correct, most appropriate visualization that keeps you moving on your research track.

Besides this, you should of course choose a relevant visualization for your data. For instance, a boxplot representing only a few data points is a poor choice, and so, almost always, is a scatterplot of millions of data points.

To help your imagination and see what is possible you should really browse through The R Graph Gallery. It has code for all the charts in the gallery.

A first plot

Install the ggplot2 package first, if not already installed.

install.packages("ggplot2")

After installing, you’ll need to load the package.

library(ggplot2)

Let’s dive right in and create a first plot, and walk through the different parts of this code.

ggplot(data = airquality, mapping = aes(x = Temp, y = Ozone)) + 
  geom_point()

## Warning: Removed 37 rows containing missing values (`geom_point()`).

Figure 5.1: A scatter plot visualizing Ozone as a function of Temperature

There are two chained function calls: ggplot() and geom_point(). They are chained using the + operator. The first function, ggplot(), creates the base layer of the plot. It receives the data and defines how it maps to the two axes. By itself, ggplot() will not display any of your data. It creates an empty plot where the axes are defined and have the correct scale:

ggplot(data = airquality, mapping = aes(x = Temp, y = Ozone))

Figure 5.2: An empty plot pane

The next function, geom_point(), builds on the base layer it receives via the + operator and adds a new layer to the plot, a data representation using points.
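
The layering can be made visible by storing the base plot in a variable and adding the geom afterwards. This is a minimal sketch; the variable names are illustrative, and it assumes the plot object exposes its layers as a list element named layers, which holds for the ggplot2 versions I am aware of.

```r
library(ggplot2)

# the base layer: data and axes, but no data representation yet
base_plot <- ggplot(data = airquality, mapping = aes(x = Temp, y = Ozone))
n_layers_base <- length(base_plot$layers)
n_layers_base

# adding a geom with + returns a new plot object with one extra layer
scatter_plot <- base_plot + geom_point()
n_layers_scatter <- length(scatter_plot$layers)
n_layers_scatter
```

The base plot is left unchanged by the + operation, so it can be reused as the starting point for several different plots.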

The geom_point() function encounters rows with missing data and issues a warning (Warning: Removed 37 rows...) but proceeds anyway. There are two ways to prevent this warning message. The first is to put a warning=FALSE statement in the RMarkdown chunk header. This is usually not a good idea: silently suppressing warnings hides how problems in the data were handled, and that hinders the reproducibility of your work. The second, better solution is to be explicit and remove the missing values yourself:

airqual <- na.omit(airquality)
#convert to use month labels instead of numbers
airqual$Month <- as.factor(month.abb[airqual$Month])

ggplot(data = airqual, mapping = aes(x = Temp, y = Ozone)) +
  geom_point()
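
It is worth checking what na.omit() actually removes, because it drops every row with a missing value in any column, not only in the two plotted columns. The sketch below uses only base R; the variable names are illustrative.

```r
# rows that geom_point() has to drop: a missing x or y value
n_missing_xy <- sum(is.na(airquality$Temp) | is.na(airquality$Ozone))
n_missing_xy

# rows that na.omit() drops: a missing value in any column
n_missing_any <- nrow(airquality) - nrow(na.omit(airquality))
n_missing_any

# a less strict alternative: only require complete Temp and Ozone values
airqual_xy <- airquality[complete.cases(airquality[, c("Temp", "Ozone")]), ]
nrow(airqual_xy)
```

The extra rows removed by na.omit() have a missing Solar.R value. This chapter keeps using the stricter airqual data frame because later examples also plot Solar.R.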

To obtain a similar plot as created above with “base” R, you would have done something like this:

with(airqual, plot(x = Temp, y = Ozone))

Figure 5.3: The same visualization with base R

You can immediately see why ggplot2 has become so popular. When creating more complex plots it becomes more obvious still, as shown below.

Adding a dimension using color

This plot shows the power of ggplot2: building complex visualizations with minimal code.

ggplot(data = airqual, mapping = aes(x = Temp, y = Ozone, color = Month)) +
  geom_point()

Figure 5.4: Ozone as function of Temp with plot symbols colored by Month

Inspecting and tuning the figure

What can you tell about the data and its measurements when looking at this plot?

Looking at the above plot, you should notice that

the temperature measurement is probably in degrees Fahrenheit. This should be apparent from the plot. The measurement unit for Ozone is missing. You should look both up; the datasets package doc says it is in Parts Per Billion (ppb).
temperature is lowest in the fifth month -probably May but once again you should make certain- and highest in months 8 and 9.
ozone levels seem positively correlated with temperature (or Month), but not in an obvious linear way
a detail: temperature is measured in whole degrees only. This will give plotting artifacts: discrete vertical lines of data points.

The plot below addresses the above issues to create a publication-ready figure. We’ll get to the details of this code as we proceed in this chapter. For now, the message is: be meticulous in constructing your plot.

ggplot(data = airqual, 
       mapping = aes(x = Temp, y = Ozone)) +
  geom_point(mapping = aes(color = Month)) + 
  geom_smooth(method = "loess", formula = y ~ x) + #the default formula, but prevents a printed message
  xlab(expression("Temperature " (degree~F))) +
  ylab("Ozone (ppb)")

Figure 5.5: Ozone level dependency on Temperature. Grey area: Loess smoother with 95% confidence interval. Source: R dataset “Daily air quality measurements in New York, May to September 1973.”
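
When the figure is used outside a document that supplies its own caption, the title and caption can be part of the plot itself. The sketch below uses labs() for this. The title and caption texts are my own wording, and the levels argument is an addition that keeps the months in calendar order in the legend (as.factor() alone sorts them alphabetically).

```r
library(ggplot2)

airqual <- na.omit(airquality)
# explicit levels keep the legend in calendar order: May, Jun, Jul, Aug, Sep
airqual$Month <- factor(month.abb[airqual$Month], levels = month.abb[5:9])

annotated <- ggplot(data = airqual, mapping = aes(x = Temp, y = Ozone)) +
  geom_point(mapping = aes(color = Month)) +
  labs(title = "Ozone level as a function of temperature",
       caption = "Source: R dataset airquality (New York, May to September 1973)",
       x = expression("Temperature " (degree~F)),
       y = "Ozone (ppb)")
annotated
```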

5.2 Overview of ggplot

5.2.1 ggplot2 and the theory of graphics

Philosophy of ggplot2

The author of ggplot2, Hadley Wickham, had a very clear goal in mind when he embarked on the development of this package:

“The emphasis in ggplot2 is reducing the amount of thinking time by making it easier to go from the plot in your brain to the plot on the page.” (Wickham, 2012)

The way this is achieved is through “The grammar of graphics”.

The grammar of graphics

The grammar of graphics tells us that a statistical graphic is a mapping from data to geometric objects (points, lines, bars) with aesthetic attributes (color, shape, size).

The plot may also contain statistical transformations of the data and is drawn on a specific coordinate system. Faceting -grid layout- can be used to generate the same plot for different subsets of the dataset. (Wickham, 2010)

5.2.2 Building plots with ggplot2

The layered plot architecture

A graph in ggplot2 is built using a few “layers”, or building blocks.

Table 5.1: The three core ggplot2 layers
ggplot2	description
data=	the data that you want to plot
aes()	mappings of data to position (axes), colors, sizes
geom_…..()	shapes (geometries) that will represent the data

First, there is the data layer - the input data that you want to visualize:

The data layer

Next, using the aes() function, the data is mapped to a coordinate system. This encompasses not only the xy-coordinates but also possible extra plot dimensions such as color and shape.

The data and aesthetic layers

As a third step, the data is visually represented in some way, using a geometry (dealt with by one of the many geom_....() functions). Examples of geometries are point for scatterplots, boxplot, line etc.

The data, aesthetic and geometry layers

At a minimum, these three layers are used in every plot you create.

Besides these fundamental aspects there are other elements you may wish to add or modify: axis labels, legend, titles, etc. These constitute additional, optional layers:

All layers

Except for Statistics and Coordinates, each of these layers will be discussed in detail in subsequent paragraphs.

“Tidy” the data

This is a very important aspect of plotting using ggplot2: getting the data in a form that ggplot2 can deal with. Sometimes it may be a bit challenging to get the data in such a format: some form of data mangling is often required. How to get your data like this is the topic of a later chapter, but here you’ll already see a little preview.

The ggplot() function expects its data to come in a tidy format. A dataset is considered tidy when it is formed according to these rules:

Each variable has its own column.
Each observation has its own row.
Each value has its own cell.

Want to know more about tidy data? Read the paper “Tidy Data” by Hadley Wickham (Journal of Statistical Software, 2014).

Here is an example dataset that requires some mangling, or tidying, to adhere to these rules.

##    patient sex dose10mg dose100mg
## 1      001   f       12        88
## 2      002   f       11        54
## 3      003   m       54        14
## 4      004   m       71        21
## 5      005   f       19        89
## 6      006   f       22        99
## 7      007   f       23        69
## 8      008   m       68        31
## 9      009   f       30        85
## 10     010   m       83        18
## 11     011   m       72        37
## 12     012   m       48        28
## 13     013   m       67        16
## 14     014   f       13        79
## 15     015   m       73        22
## 16     016   f       20        84
## 17     017   f       22        96
## 18     018   m       40        14
## 19     019   m       57        12
## 20     020   f       26        63
## 21     021   f       17        89
## 22     022   f       29        77
## 23     023   m       54        21
## 24     024   m       61        10
## 25     025   m       57        36
## 26     026   f       11        80

This dataset is not tidy because there is an independent variable -the dose- that should have its own column; its value is now buried inside two column headers (dose10mg and dose100mg). Also, there is actually a single variable -the response- that is now split into two columns. Thus, a row now contains two observations.

Suppose you want to plot the response as a function of the dose. That is not quite possible right now in ggplot2. This is because you want to do something like

ggplot(data=dose_response,
       mapping = aes(x = "<I want to get the dose levels here>",
                     y = "<I want to get the response here>")) +
    geom_boxplot()

The problem is you cannot specify the mapping in a straightforward manner. Note that in base R you would probably do this:

boxplot(dose_response$dose10mg, dose_response$dose100mg)

Figure 5.6: Selecting untidy data

So, we need to tidy this dataframe since the dose10mg and dose100mg columns actually describe the same variable (measurement) but with different conditions. Luckily, there is a very nice package that makes this quite easy: tidyr.

Tidying data using `tidyr::pivot_longer()`

## tidy
library(tidyr)
dose_response_tidy <- pivot_longer(data = dose_response,
                                   cols = c("dose10mg", "dose100mg"),
                                   names_pattern = "dose(\\d+)mg",
                                   names_to = "dose",
                                   values_to = "response")

DT::datatable(dose_response_tidy, 
              options = list(pageLength = 15,
                             dom = 'tpli'))

The data is tidy now, and ready for use within ggplot2.
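
To see what the reshaping does to the structure, here is the same call on a small data frame that holds only the first four patients from the table above. The names dose_small and dose_small_tidy are made up for this sketch.

```r
library(tidyr)

# the first four rows of the untidy table, typed in by hand
dose_small <- data.frame(
  patient = c("001", "002", "003", "004"),
  sex = c("f", "f", "m", "m"),
  dose10mg = c(12, 11, 54, 71),
  dose100mg = c(88, 54, 14, 21)
)

dose_small_tidy <- pivot_longer(data = dose_small,
                                cols = c("dose10mg", "dose100mg"),
                                names_pattern = "dose(\\d+)mg",
                                names_to = "dose",
                                values_to = "response")

dim(dose_small)        # one row per patient
dim(dose_small_tidy)   # one row per patient and dose
dose_small_tidy
```

Note that the dose column holds the captured digits as character strings ("10" and "100"). ggplot2 treats character columns as discrete, which is what a boxplot per dose level needs.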

We’ll explore the pivot_longer() function in detail in a later chapter when discussing the tidyr package.

Now, creating the plot in ggplot2 is a breeze:

dr_plot <- ggplot(dose_response_tidy, aes(x = dose, y = response))
dr_plot + 
    geom_boxplot()

Would you proceed with this hypothetical drug?

5.2.3 Inheritance of aesthetics

In the code that creates Figure 5.5 you see two calls to the aes() function: one in ggplot() and one in geom_point(). Look at the same code, but with the aesthetics combined into the main ggplot call.

ggplot(data = airqual, 
       mapping = aes(x = Temp, y = Ozone, color = Month)) +
  geom_point() + 
  geom_smooth(method = "loess", formula = y ~ x) + #the default formula, but prevents a printed message
  xlab(expression("Temperature " (degree~F))) +
  ylab("Ozone (ppb)")

Figure 5.7: Ozone level dependency on Temperature. Grey area: Loess smoother with 95% confidence interval. Source: R dataset “Daily air quality measurements in New York, May to September 1973.”

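The difference is caused by inheritance of aesthetics! Because color is now mapped in the main ggplot() call, geom_smooth() inherits it as well and draws a separate smoother for each month instead of a single one for all data.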

Like the main ggplot() function, every geom_ function accepts its own mapping = aes(...). The mapping is inherited from the ggplot() function so any aes(...) mapping defined in the main ggplot() call applies to all subsequent layers. However, you can specify your own “local” aesthetic mapping within a geom_xxxx(). Aesthetics defined within a geom_ function are scoped to that function call only.
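
The effect of inheritance can also be checked without looking at the picture. The sketch below builds both variants and counts the groups for which the smoother was computed, using ggplot2's layer_data() function. To keep it quiet on the small per-month groups it uses method = "lm" instead of the loess smoother from the figures; that swap is my choice, not part of the original plots.

```r
library(ggplot2)

airqual <- na.omit(airquality)
airqual$Month <- as.factor(month.abb[airqual$Month])

# color mapped globally: geom_smooth() inherits it and fits one line per month
p_global <- ggplot(airqual, aes(x = Temp, y = Ozone, color = Month)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x)

# color mapped locally: only geom_point() sees it, so there is a single line
p_local <- ggplot(airqual, aes(x = Temp, y = Ozone)) +
  geom_point(aes(color = Month)) +
  geom_smooth(method = "lm", formula = y ~ x)

# layer 2 is the smoother; count the groups it was computed for
n_groups_global <- length(unique(layer_data(p_global, 2)$group))
n_groups_local <- length(unique(layer_data(p_local, 2)$group))
c(global = n_groups_global, local = n_groups_local)
```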

In the plot below you see another example of how this works (it is not a nice plot any more).

ggplot(data = airqual, mapping = aes(x = Solar.R, y = Ozone)) +
    geom_smooth(aes(linetype = Month), method = "lm", formula = y ~ x) +
    geom_point(aes(color = Month), alpha = 0.7)

Any aesthetics defined outside the aes() function calls are static properties and will be dealt with in a literal manner.

Note that you can “override” global (ggplot()) aesthetics in geom_xxx() but this can give unexpected behavior, as seen in the paragraph on Color.

5.3 Aesthetics

After you obtain a tidy dataset and pass it to ggplot you must decide what the aesthetics are: the way the data are represented in your plot. Very roughly speaking, you could correlate the aesthetics to the dimensions of the data you want to visualize. For instance, given this chapter’s first example of the airquality dataset, the aesthetics were defined in three “dimensions”:

- dimension “X” for temperature
- dimension “Y” for Ozone
- dimension “color” for the month

Although color is used most often to represent an extra dimension in the data, other aesthetics you may consider are shape, size, line width, line type and faceting (making a grid of plots).

Colors

Colors can be defined in a variety of ways in ggplot2 (and R in general):

color name
existing color palette
custom color palette

Below is a panel displaying all named colors you can use in R

When you provide a literal (character) for the color aesthetic it will simply be that color. If you want to map a property (e.g. “Month”) to a range of colors, you should use a color palette. Since ggplot2 has built-in color palettes, you can simply use color=<my-third-dimension-variable>. This variable mapping to color can be either a factor (discrete scale) or numeric (continuous scale). The ggplot function will map the variable to the default color palette.

Be aware that there is a big difference in where you specify an aesthetic. When it should be mapped onto a variable (the values within a column) you should put it within the aes() call. When you want to specify a literal -static- aesthetic (e.g. color) you place it outside the aes() call. When you misplace the mapping you get strange behavior:

ggplot(iris, aes(Sepal.Length, Sepal.Width)) + 
    geom_point(aes(color = 'Green'))

This will not work either (not evaluated because it gives an error):

ggplot(iris, aes(Sepal.Length, Sepal.Width)) + 
    geom_point(color = Species)
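
For comparison, this is what the two correct placements look like: a literal color outside aes() and a variable inside it. The sketch uses layer_data() from ggplot2 to show which colors end up on the points; the plot names are illustrative.

```r
library(ggplot2)

# static: every point gets the literal color "green"
p_static <- ggplot(iris, aes(Sepal.Length, Sepal.Width)) +
  geom_point(color = "green")

# mapped: the values of Species are translated to colors through a scale
p_mapped <- ggplot(iris, aes(Sepal.Length, Sepal.Width)) +
  geom_point(aes(color = Species))

static_colors <- unique(layer_data(p_static)$colour)
mapped_colors <- unique(layer_data(p_mapped)$colour)
static_colors           # the literal value
length(mapped_colors)   # one color per species
```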

And when you specify it twice the most ‘specific’ will take precedence (but the legend label is incorrect here):

ggplot(data = airqual, 
       mapping = aes(x = Ozone, y = Solar.R, color = Month)) +
    geom_point(mapping = aes(color = Day))

Have a look at the paragraph “Inheritance of aesthetics” for more detail. Here are some ways to work with color palettes

The default palette

#store it in variable "sp" for re-use in subsequent chunks
sp <- ggplot(iris, aes(Sepal.Length, Sepal.Width)) + 
    geom_point(aes(color = Species))
sp

Manual palettes

You can specify your own colors using scale_color_manual() for scatter plots or scale_fill_manual() for boxplots and bar plots.

sp + scale_color_manual(values = c("#00AFBB", "#E7B800", "#FC4E07"))

Here the palette is defined using the hexadecimal notation: Each color can be specified as a mix of Red, Green, and Blue values in a range from 0 to 255. In hexadecimal notation these are the positions 1 and 2 (Red), 3 and 4 (Green) and 5 and 6 (Blue) after the hash sign (#). 00 equals zero and FF equals 255 (15 * 16 + 15). This is quite a universal encoding: a gazillion websites style their pages using this notation.
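
The arithmetic behind the notation can be verified in base R. This short sketch converts hexadecimal pairs to decimal numbers, splits the first palette color into its components, and converts the components back again.

```r
# hexadecimal pairs to decimal: AF = 10 * 16 + 15, BB = 11 * 16 + 11
hex_values <- strtoi(c("00", "AF", "BB", "FF"), base = 16L)
hex_values

# the red, green and blue components of the first palette color
components <- col2rgb("#00AFBB")
components

# and back from decimal components to hexadecimal notation
hex_color <- rgb(0, 175, 187, maxColorValue = 255)
hex_color
```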

Here is a nice set of colors:

custom_col <- c("#FFDB6D", "#C4961A", "#F4EDCA", 
                "#D16103", "#C3D7A4", "#52854C", "#4E84C4", "#293352")
show_palette(custom_col, cols=length(custom_col))

Here is a colorblind-friendly palette:

# The palette with grey:
cbp1 <- c("#999999", "#E69F00", "#56B4E9", "#009E73",
          "#F0E442", "#0072B2", "#D55E00", "#CC79A7")

show_palette(cbp1, cols=length(cbp1))

When you pass a palette that is longer than the number of levels in your factor, R will only use as many as required:

sp + scale_color_manual(values = cbp1)
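
With an unnamed palette the colors are handed out in the order of the factor levels. If you want to control which level gets which color, you can pass a named vector instead. A minimal sketch; the choice of colors from the colorblind-friendly palette is arbitrary.

```r
library(ggplot2)

# names are the factor levels, values the colors they should get
species_colors <- c(setosa = "#E69F00",
                    versicolor = "#56B4E9",
                    virginica = "#009E73")

sp_named <- ggplot(iris, aes(Sepal.Length, Sepal.Width)) +
  geom_point(aes(color = Species)) +
  scale_color_manual(values = species_colors)
sp_named
```

The mapping stays the same when the order of the levels changes or when a level is absent from a subset of the data.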

Shapes

These are the shapes available in ggplot2 (and base R as well).

shapes <- data.frame(
  shape = c(0:19, 22, 21, 24, 23, 20),
  x = 0:24 %/% 5,
  y = -(0:24 %% 5)
)
ggplot(shapes, aes(x, y)) + 
  geom_point(aes(shape = shape), size = 5, fill = "red") +
  geom_text(aes(label = shape), hjust = 0, nudge_x = 0.15) +
  scale_shape_identity() +
  #expand_limits(x = 4.1) +
  theme_void()

Warning: do not clutter your plot with too many dimensions/aesthetics!

Lines

Geoms that draw lines have a “linetype” parameter.

Legal values are the strings “blank”, “solid”, “dashed”, “dotted”, “dotdash”, “longdash”, and “twodash”. Alternatively, the numbers 0 to 6 can be used (0 for “blank”, 1 for “solid”, …).

You can set line type to a constant value. For this you use the linetype geom parameter. For instance, geom_line(data=d, mapping=aes(x=x, y=y), linetype=3) sets the line type of all lines in that layer to 3 (which corresponds to a dotted line). You can also use it dynamically, by mapping it to a variable within aes().

Here is an example where the female and male deaths in the UK for 72 successive months are plotted. The linetype = sex aesthetic could as well have been defined within the global ggplot call. It may be a bit more logical to specify it where it applies to the geom.

deaths <- data.frame(
    month = rep(1:72, times = 2), 
    sex = rep(factor(c("m", "f")), each = 72),
    deaths = c(mdeaths, fdeaths)
)

ggplot(data = deaths, mapping = aes(x = month, y = deaths)) +
    geom_line(aes(linetype = sex))

When you want to draw a separate line for each subject while the lines share type, color and width, you can use the group aesthetic in geom_line(). In the next example, each chicken gets its own line, and all the chickens on the same diet get the same colour:

ggplot(data = ChickWeight,
       mapping = aes(x = Time, y = weight, color = Diet)) +
    geom_line(aes(group=Chick)) +
    theme_classic()

Size

The size of the plotting symbol can also be used as an extra dimension in your visualization. Here is an example showing the solar radiation of the airquality data as third dimension.

ggplot(data = airqual, 
       mapping = aes(x = Wind, y = Ozone, size = Solar.R)) +
    geom_point(color = "red", alpha = 0.5) +
    labs(size = "Solar radiation (Lang)")

5.4 Geometries

What are geometries

Geometries are the ways data can be visually represented. Boxplot, scatterplot and histogram are a few examples. There are many geoms available in ggplot2; type geom_ in the console and you will get a listing. Even more are available outside the ggplot2 package. Here we’ll only explore the most used geoms in science.

Boxplot

Boxplot is one of the most-used data visualizations. It displays the five-number summary, containing from bottom to top: minimum, first quartile, median (= second quartile), third quartile, maximum. Outliers, usually defined as lying more than 1.5 * IQR below the first or above the third quartile, are displayed as separate points; the whiskers then end at the most extreme values that are not outliers. Some color was added in the example below.

dr_plot <- ggplot(dose_response_tidy, aes(x = dose, y = response))
dr_plot + geom_boxplot(fill='#E69F00')
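
The outlier rule is easy to reproduce by hand: a point counts as an outlier when it lies more than 1.5 * IQR beyond the edges of the box. The sketch below does this for a small made-up sample, assuming the quartiles are computed with the default settings of quantile().

```r
# a small made-up sample with one extreme value
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 30)

q <- quantile(x, probs = c(0.25, 0.5, 0.75))
iqr <- unname(q["75%"] - q["25%"])

# the fences: 1.5 * IQR below the first and above the third quartile
lower_fence <- unname(q["25%"]) - 1.5 * iqr
upper_fence <- unname(q["75%"]) + 1.5 * iqr

# values outside the fences are drawn as separate points
outliers <- x[x < lower_fence | x > upper_fence]
outliers

# the whiskers end at the most extreme values inside the fences
whiskers <- range(x[x >= lower_fence & x <= upper_fence])
whiskers
```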

Jitter

Jitter is a good alternative to boxplot when you have small sample sizes, or discrete measurements with many exact copies, resulting in much overlap. Use the width and height attributes to adjust the jittering.

dr_plot + geom_jitter(width = 0.1, height = 0)

Note that vertical jitter was set to zero because the y-axis values are already on a continuous scale. You should use vertical jittering only when these have discrete values that otherwise overlap too much.

Below, a split over the sexes is added. Suddenly, a dramatic dosage effect becomes apparent that was smoothed out when the two sexes were combined.

dr_plot + geom_jitter(width = 0.1, height = 0, aes(colour = sex))

Alternatively, use a grid of plots to emphasize the contrast further.

dr_plot + 
    geom_jitter(width = 0.1, height = 0, aes(colour = sex)) + 
    facet_wrap( . ~ sex)

Plot overlays: boxplot + jitter

This example shows how you can overlay plots on top of each other as much as you like. The order in which you define the layers is the order in which they are stacked on top of each other in the graph. You could use this as a feature:

library(gridExtra)

## 
## Attaching package: 'gridExtra'

## The following object is masked from 'package:dplyr':
## 
##     combine

dr_plot <- ggplot(dose_response_tidy, aes(x = dose, y = response))
p1 <- dr_plot + 
    geom_boxplot(fill='#E69F00') +
    geom_jitter(width = 0.1, height = 0, size = 2, alpha = 0.4)
p2 <- dr_plot + 
    geom_jitter(width = 0.1, height = 0, size = 2, alpha = 0.6) +
    geom_boxplot(fill='#E69F00') 
grid.arrange(p1, p2, nrow = 1) #create a panel of plots

The gridExtra package is discussed in a more complex setting below, in section “Advanced plotting aspects”.

Plot overlays: smooth + ribbon

Here is another pair of examples of overlays of different geoms. In the first, the original datapoints are included.

ggplot(mpg, aes(displ, hwy)) +
    geom_point(alpha = 0.4) +
    geom_smooth(se = FALSE, color = "darkgreen", method = "loess", formula = "y ~ x") +
    geom_ribbon(aes(ymin = 0,
                    ymax = predict(loess(hwy ~ displ))),
                alpha = 0.3, fill = 'green')

Note that the method = "loess", formula = "y ~ x" arguments to geom_smooth() are the defaults. However, if omitted they trigger a message (`geom_smooth()` using method = 'loess' and formula = 'y ~ x') that I do not like in my output. In this second example, the data points are omitted altogether, making the plot focus solely on the global trend.

ggplot(mpg, aes(displ, hwy)) +
    geom_smooth(se = FALSE, color = "darkgreen") +
    geom_ribbon(aes(ymin = 10,
                    ymax = predict(loess(hwy ~ displ))),
                alpha = 0.3, fill = 'green') +
    ylim(10, max(mpg$hwy))

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

Scatterplot: Points

The geom_point() function is used to create the good old scatterplot of which we have seen several examples already.

Line plots

When points can be logically connected it may be a good idea to use a line to visualize trends, as we have seen in the deaths plot in section Aesthetics.

If you want both lines and points you need to overlay them. In this example I take it a bit further by adding the dimension ‘activity’ to the points geom only. This is a typical case for geom_line since the measurements of the two beavers were taken sequentially, for that particular beaver.

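#'time' is coded as hhmm (e.g. 840 means 08:40); convert to decimal hours
hhmm_to_hours <- function(t) t %/% 100 + (t %% 100) / 60
b1_start <- hhmm_to_hours(beaver1[1, "time"])
b2_start <- hhmm_to_hours(beaver2[1, "time"])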

suppressMessages(library(dplyr))
#uses dplyr (later this course)
beaverA <- beaver1 %>% mutate(time_h = seq(from = b1_start, 
                                           to = b1_start + (nrow(beaver1)*10)/60, 
                                           length.out = nrow(beaver1)))
beaverB <- beaver2 %>% mutate(time_h = seq(from = b2_start, 
                                           to = b2_start + (nrow(beaver2)*10)/60, 
                                           length.out = nrow(beaver2)))
beavers_all <- bind_rows(beaverA, beaverB) %>%
    mutate(beaver = c(rep("1", nrow(beaverA)), rep("2", nrow(beaverB))),
           activity = factor(activ, levels = c(0, 1), labels = c("inactive", "active")))

ggplot(data = beavers_all, aes(x = time_h, y = temp)) +
    geom_line(aes(linetype = beaver)) +
    geom_point(aes(color = activity)) +
    xlab("time (h)") +
    ylab(expression('Temperature ('*~degree*C*')'))

Histograms

A histogram is a means to visualize the distribution of a dataset, as are boxplot (geom_boxplot()), violin plot (geom_violin()), density plot (geom_density()) and frequency polygon (geom_freqpoly()).

Here we look at the waiting times between eruptions of the Old Faithful geyser (the faithful dataset). A binwidth argument is used to adjust the number of bins. Alternatively, use the bins argument.

ggplot(data=faithful, mapping = aes(x = waiting)) + 
    geom_histogram(binwidth = 3)

There are some statistics available to adjust what is shown on the y axis. The default that is used by geom_histogram is after_stat(count), so if you don’t specify anything this will be used. But if you want it scaled to a maximum of 1, use after_stat(count / max(count)). The after_stat() function is a flag to ggplot2 that you want to use calculated aesthetics produced by the statistic. It replaces the older stat() function, which was deprecated in ggplot2 3.4.0. You can use any transformation of the statistic, e.g. y = after_stat(log2(count)).

ggplot(data=faithful, mapping = aes(x = waiting)) + 
    geom_histogram(binwidth = 3, aes(y = after_stat(count / max(count)))) + 
    ylab(label = "normalized proportion")

Alternatively, if you want percentages, you can use y = after_stat(count / sum(count) * 100).

ggplot(data=faithful, mapping = aes(x = waiting)) + 
    geom_histogram(binwidth = 3, mapping = aes(y = after_stat(count / sum(count) * 100))) + 
    ylab(label = "%")

Violin plot

A violin plot is a compact display of a continuous distribution. It is a blend of geom_boxplot() and geom_density(): a violin plot is a mirrored density plot displayed in the same way as a boxplot. It is not seen as often as it should be. An example explains it best.

ggplot(data=airqual, mapping = aes(x = Month, y = Temp, fill = Month)) +
    geom_violin() + theme(legend.position = "none")

Barplot

The bar plot is similar to a histogram in appearance, but quite different in intent. Where a histogram visualizes the density of a continuous variable, a bar plot tries to visualize the counts (or weights) of distinct groups.

If you don’t provide a weight aesthetic, geom_bar will count all occurrences of the different values in the provided x-axis variable. Here is an example.

ggplot(data = esoph,
       mapping = aes(x = alcgp)) +
    geom_bar(fill = "darkgreen") +
    theme_classic()

You can provide a weight aesthetic. In that case the counts will be replaced by the (summed) values found in that variable. Here is a small example where the ten subjects of the sleep dataset have been charted (the x axis), and the extra column provides the height of the bar, split over the two groups. I used the position argument to get side-by-side bars instead of bars stacked on top of each other.

ggplot(data = sleep, mapping = aes(ID)) +
    geom_bar(aes(weight = extra, fill = group), 
             position = position_dodge(width=0.7),
             alpha = 0.6, color = "black") +
    scale_fill_manual(values = c("darkblue", "red")) +
    labs(x = "student ID", y = "increase in sleep (h)", fill = "Drug") + 
    theme_classic()

The position = argument could have been "dodge" for simple side-by-side plotting of the bars.

Overview of the main geoms

There are many geoms and even more outside the ggplot2 package. Here is a small overview of some of them.

Table 5.2: Some more geoms
function	description
geom_abline()	Add reference lines to a plot, either horizontal, vertical, or diagonal
geom_bar()	A bar plot makes the height of the bar proportional to the number of cases in each group
geom_density()	Computes and draws kernel density estimate, which is a smoothed version of the histogram
geom_line()	Connects the observations in order of the variable on the x axis
geom_path()	Connects the observations in the order in which they appear in the data
geom_qq()	geom_qq and stat_qq produce quantile-quantile plots
geom_smooth()	Aids the eye in seeing patterns in the presence of overplotting
geom_violin()	A violin plot is a compact display of a continuous distribution. It is a blend of geom_boxplot() and geom_density()

If you want to know them all, type geom_ in the console and use tab completion to browse the list, then select the one that looks like the thing you want, or go to the tidyverse ggplot2 reference page.
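
If tab completion is not available, for instance in a script, the listing can also be produced with base R. A minimal sketch that lists everything the attached ggplot2 package exports under a name starting with geom_:

```r
library(ggplot2)

# all objects exported by ggplot2 whose name starts with "geom_"
geoms <- ls("package:ggplot2", pattern = "^geom_")
length(geoms)
head(geoms)
```

The exact number depends on the installed ggplot2 version. From there, the help page of a candidate can be opened with, for example, ?geom_violin.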

5.5 Faceting

Faceting is the process of splitting a plot into multiple panels with exactly the same coordinate system, where each panel shows a subset of the data. It can be applied to any geom. The Solar.R plot from the section on inheritance of aesthetics could be improved slightly with this technique.

ggplot(data = airqual, mapping = aes(x = Solar.R, y = Ozone)) +
    geom_smooth(aes(linetype = Month), method = "lm", formula = y ~ x) +
    geom_point(aes(color = Month), alpha = 0.7) + 
    facet_wrap(. ~ Month)
