Chapter 9 Exploratory Data Analysis
9.1 Introduction
This chapter introduces you to the concept of Exploratory Data Analysis (EDA). In EDA a dataset is analysed with the goal of assessing its main characteristics, including its quality and usability for subsequent statistical modelling and analysis. Data summary techniques and visualizations ore used in an EDA. There is no fixed set of activities; each dataset poses its own questions and challenges.
Datasets are often expensive and time-consuming to collect, and therefore they are usually collected with a specific goal in mind. The EDA you perform should always be with this goal on the horizon: is the dataset of sufficient quality to answer the scientific questions for which they were collected?
EDA outline
An EDA therefore may include:
- Dataset description: how were they collected, what data types do the variables have, what are the units and physical quantities, how are missing data encoded, etc.
- Data summary: number of cases, number of variables, basic statistical description (i.e. mean, median, sd etc.), variable distributions (normal or not, outliers, missing data, skewed distributions)
- Visual summaries using boxplots, histograms, density plots.
- Recoding or transformations of variables may be required to get better results (e.g. from numeric to factor or vice versa, or log transformation of exponential data)
- Exploration of variable relations/covariance. It is always interesting to know about correlations between variables, but this is especially the case when you have a dependent variable for which you wish to build a statistical model in a later stage of your analysis.
Note that your EDA results should always be accompanied by text that describes your results and discusses the implications of them. your figures should be well annotated with a caption and legend if relevant. An EDA is not a publication, but it will usually have a short introduction, a results section and a discussion. In contrast with a publication you usually do show the code. This ensures complete transparency and reproducibility.
In this chapter I will demonstrate a short EDA on the “Yeast dataset” from the UCI Machine Learning repository.
It is outside the scope of this short analysis to delve too deeply in the attribute background information, but you should realize that for a real analysis this is absolutely critical: without domain knowledge the you won’t have the insight to identify strange, worrying or surprising results.
9.2 EDA of the Yeast dataset
9.2.1 Introduction
The data were collected from the UCI Machine Learning website. To ensure their continued availability the files were copied to a personal repository.
The data were accompanied by a very short abstract: “Predicting the Cellular Localization Sites of Proteins”. The data description states there are 1484 instances, each with 8 attributes (variables) and a class label. The class label is the variable we wish to build a model for so this is the dependent variable.
The Attribute Information section describes these variables:
- Sequence Name: Accession number for the SWISS-PROT database
mcg
: McGeoch’s method for signal sequence recognition.gvh
: von Heijne’s method for signal sequence recognition.alm
: Score of the ALOM membrane spanning region prediction program.mit
: Score of discriminant analysis of the amino acid content of the N-terminal region (20 residues long) of mitochondrial and non-mitochondrial proteins.erl
: Presence of “HDEL” substring (thought to act as a signal for retention in the endoplasmic reticulum lumen). Binary attribute.pox
: Peroxisomal targeting signal in the C-terminus.vac
: Score of discriminant analysis of the amino acid content of vacuolar and extracellular proteins.nuc
: Score of discriminant analysis of nuclear localization signals of nuclear and non-nuclear proteins.
The sequence name is not interesting in this EDA: we are interested in patterns, not individuals. However, if there is an anomalous protein in the EDA it should be possible to retrace its origin so the identifier stays in the dataset.
There seem to be two attributes (mcg
and gvh
) measuring the same property - the presence of a signal sequence which is the N-terminal part of a protein signalling the cellular machinery the protein should be exported.
Very simply put, if this scoring system was perfect each of the attributes (except mcg
and gvh
) would unequivocally “assign” the protein in question to a cellular location: Cellular external matrix, membrane inserted, mitochondrial, endoplasmic reticulum lumen, peroxisomal, vacuolar or nuclear.
The tenth variable in the data is the dependent or explanatory variable. The yeast.names
file describes it like this:
Class Distribution. The class is the localization site. Please see Nakai &
Kanehisa referenced above for more details.
CYT (cytosolic or cytoskeletal) 463
NUC (nuclear) 429
MIT (mitochondrial) 244
ME3 (membrane protein, no N-terminal signal) 163
ME2 (membrane protein, uncleaved signal) 51
ME1 (membrane protein, cleaved signal) 44
EXC (extracellular) 37
VAC (vacuolar) 30
POX (peroxisomal) 20
ERL (endoplasmic reticulum lumen) 5
This tells me the different localizations are by no means equally distributed; there is an over representation of “cytosolic or cytoskeletal” and “nuclear” and a huge under representation of especially endoplasmic reticulum lumen proteins.
9.2.2 Data loading and prepping
Since I am going to rerun the code in this notebook often I am going to create a local copy, and load that one.
file_name <- "yeast.data"
yeast_data_url <- paste0("https://raw.githubusercontent.com/MichielNoback/datasets/master/UCI_yeast_protein_loc/", file_name)
yeast_local_location <- paste0("../", file_name)
#only download if not present
if (! file.exists(yeast_local_location)) {
download.file(url = yeast_data_url,
destfile = yeast_local_location)
}
yeast_data <- read.table(file = yeast_local_location,
sep = ",",
as.is = 1)
str(yeast_data)
## 'data.frame': 1484 obs. of 10 variables:
## $ V1 : chr "ADT1_YEAST" "ADT2_YEAST" "ADT3_YEAST" "AAR2_YEAST" ...
## $ V2 : num 0.58 0.43 0.64 0.58 0.42 0.51 0.5 0.48 0.55 0.4 ...
## $ V3 : num 0.61 0.67 0.62 0.44 0.44 0.4 0.54 0.45 0.5 0.39 ...
## $ V4 : num 0.47 0.48 0.49 0.57 0.48 0.56 0.48 0.59 0.66 0.6 ...
## $ V5 : num 0.13 0.27 0.15 0.13 0.54 0.17 0.65 0.2 0.36 0.15 ...
## $ V6 : num 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
## $ V7 : num 0 0 0 0 0 0.5 0 0 0 0 ...
## $ V8 : num 0.48 0.53 0.53 0.54 0.48 0.49 0.53 0.58 0.49 0.58 ...
## $ V9 : num 0.22 0.22 0.22 0.22 0.22 0.22 0.22 0.34 0.22 0.3 ...
## $ V10: Factor w/ 10 levels "CYT","ERL","EXC",..: 7 7 7 8 7 1 7 8 7 1 ...
The data seems to have been loaded correctly and, as expected since that was stated in the original data description, there are no missing data.
The column names were not defined in the data file so this will be fixed first. I will create a data frame that also holds the column descriptions. I have put the data in a small text file for easy loading and editing; the attribute descriptions copied/pasted into a text file and with find/replace converted in easy to load form, and a new label
column added for use in plotting. There were a few occurrences of the '
character which always corrupt data import into R. They were removed.
attribute_info <- read.table("data/yeast_attribute_info.txt",
sep = ":",
header = TRUE,
stringsAsFactors = FALSE)
#attach column names
(colnames(yeast_data) <- attribute_info$attribute)
## [1] "accno" "mcg" "gvh" "alm" "mit" "erl" "pox" "vac" "nuc"
## [10] "loc"
To make this info easily accessible, a function is created that can be used to fetch either the description or the label.
get_attribute_info <- function(attribute, resource = "label") {
if (! resource %in% c("label", "description")) {
stop(paste0("type ", resource, "is not an attribute resource"))
}
return(attribute_info[attribute_info$attribute == attribute, resource])
}
#test it
get_attribute_info("gvh")
## [1] "VonHeijne signal"
## [1] "Accession number for the SWISS-PROT database"
9.2.3 Data verification
The original data description stated that 1484 instances with 9 attributes + a class label are present.
## [1] 1484 10
This is correct. No missing data are supposed to be there:
## [1] 1484
Also correct.
The classes of the columns are also OK so the data is verified and found correct.
9.2.4 Attribute summaries
A first scan of the attributes.
## accno mcg gvh alm
## Length:1484 Min. :0.11 Min. :0.13 Min. :0.21
## Class :character 1st Qu.:0.41 1st Qu.:0.42 1st Qu.:0.46
## Mode :character Median :0.49 Median :0.49 Median :0.51
## Mean :0.50 Mean :0.50 Mean :0.50
## 3rd Qu.:0.58 3rd Qu.:0.57 3rd Qu.:0.55
## Max. :1.00 Max. :1.00 Max. :1.00
##
## mit erl pox vac
## Min. :0.000 Min. :0.500 Min. :0.0000 Min. :0.00
## 1st Qu.:0.170 1st Qu.:0.500 1st Qu.:0.0000 1st Qu.:0.48
## Median :0.220 Median :0.500 Median :0.0000 Median :0.51
## Mean :0.261 Mean :0.505 Mean :0.0075 Mean :0.50
## 3rd Qu.:0.320 3rd Qu.:0.500 3rd Qu.:0.0000 3rd Qu.:0.53
## Max. :1.000 Max. :1.000 Max. :0.8300 Max. :0.73
##
## nuc loc
## Min. :0.000 CYT :463
## 1st Qu.:0.220 NUC :429
## Median :0.220 MIT :244
## Mean :0.276 ME3 :163
## 3rd Qu.:0.300 ME2 : 51
## Max. :1.000 ME1 : 44
## (Other): 90
All these attributes seem to be in the range zero to one, except of course for the localization attribute and the accession number. This is not surprising since these attributes are all probabilities. The class distribution corresponds with the published data.
The
The pox
(Peroxisomal) and erl
( ER-retention) attributes, seem to have a really strange distribution and should be investigated further.
Here are all the numeric attributes in a single panel. I chose histogram over boxplot or density plot because it is more fine-grained than boxplot and very easy to interpret.
par(mfrow = c(4, 2))
for ( attrib in colnames(yeast_data[2:9])) {
hist(yeast_data[, attrib],
main = NULL,
xlab = paste0(get_attribute_info(attrib), " (prob)"))
}
Here, it can be seen that the attributes “ER-retention” and “Peroxisomal” are pretty much uninformative, a problem probably largely caused by the low abundance of proteins in these classes. Surprisingly enough, the “Vacuolar” property which is also low-abundant, does not have this extreme distribution.
What is also striking is that, although these attributes are probabilities of targeting signals, none (with the exception of the two low-abundance ones) show a more or less bi-modal distribution as you would naively expect.
9.2.5 Attribute relationships
Here is a quick scan of variable relationship, excluding the dependent variable. The pairs()
function is used for that. I excluded erl
and pox
because they do not add information to the picture.
pairs(yeast_data[c(2, 3, 4, 5, 8, 9)],
panel = "panel.smooth",
pch = 20,
cex = 0.35,
col = rgb(0.1, 0.4, 1, alpha = 0.4))
By adding the smoother it is made clear that the only pair which shows a clear correlation is the pair mcg
/ gvh
which actually predict the same property. Therefore it would have been very surprising indeed if they would not have a correlation. The strength of this correlation -the R-squared- is this
## [1] 0.3383
This is not very strong.
9.2.6 Correlations to the dependent variable
As a first investigation, numeric variables are split on the location variable.
library(RColorBrewer)
colors <- RColorBrewer::brewer.pal(8, name = "Accent")
#col = c("darkblue", "darkgreen", "magenta", )
col_i <- 0
for (name in names(yeast_data[2:9])) {
col_i <- col_i + 1
boxplot(yeast_data[, name] ~ yeast_data$loc,
xlab = NULL,
ylab = get_attribute_info(name),
col = colors[col_i])
}
Quite a few of the variables correlate pretty well with the dependent variable. These do so very well: alm
, mit
, erl
, pox
. Others are not: nuc
, vac
, and some are ambiguous - they discriminate, but not exclusively: mcg
, gvh
.
Here are some summary statistics for the numeric variables split on the dependent variable:
options(digits=2)
tmp <- t(aggregate( . ~ loc, data = yeast_data[, -1], FUN = mean))
knitr::kable(tmp)
## Warning in kable_pipe(x = structure(c("loc", "mcg", "gvh", "alm", "mit", : The
## table should have a header (column names)
loc | CYT | ERL | EXC | ME1 | ME2 | ME3 | MIT | NUC | POX | VAC |
mcg | 0.48 | 0.79 | 0.74 | 0.79 | 0.72 | 0.43 | 0.52 | 0.45 | 0.52 | 0.55 |
gvh | 0.47 | 0.77 | 0.72 | 0.76 | 0.60 | 0.49 | 0.53 | 0.46 | 0.51 | 0.53 |
alm | 0.54 | 0.48 | 0.49 | 0.38 | 0.41 | 0.36 | 0.52 | 0.53 | 0.51 | 0.47 |
mit | 0.23 | 0.34 | 0.29 | 0.31 | 0.28 | 0.21 | 0.40 | 0.23 | 0.25 | 0.20 |
erl | 0.50 | 1.00 | 0.50 | 0.50 | 0.51 | 0.50 | 0.50 | 0.50 | 0.50 | 0.50 |
pox | 0.0011 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0089 | 0.0000 | 0.4235 | 0.0000 |
vac | 0.50 | 0.55 | 0.46 | 0.51 | 0.51 | 0.51 | 0.50 | 0.49 | 0.50 | 0.53 |
nuc | 0.26 | 0.25 | 0.23 | 0.27 | 0.25 | 0.27 | 0.24 | 0.33 | 0.23 | 0.25 |
## Warning in kable_pipe(x = structure(c("loc", "mcg", "gvh", "alm", "mit", : The
## table should have a header (column names)
loc | CYT | ERL | EXC | ME1 | ME2 | ME3 | MIT | NUC | POX | VAC |
mcg | 0.48 | 0.78 | 0.77 | 0.78 | 0.76 | 0.43 | 0.51 | 0.45 | 0.48 | 0.56 |
gvh | 0.46 | 0.80 | 0.73 | 0.73 | 0.59 | 0.49 | 0.53 | 0.46 | 0.51 | 0.51 |
alm | 0.53 | 0.50 | 0.49 | 0.36 | 0.41 | 0.36 | 0.52 | 0.53 | 0.51 | 0.49 |
mit | 0.19 | 0.37 | 0.28 | 0.29 | 0.26 | 0.20 | 0.41 | 0.20 | 0.21 | 0.17 |
erl | 0.5 | 1.0 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 |
pox | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.5 | 0.0 |
vac | 0.51 | 0.54 | 0.48 | 0.53 | 0.52 | 0.51 | 0.50 | 0.50 | 0.51 | 0.53 |
nuc | 0.22 | 0.22 | 0.22 | 0.22 | 0.22 | 0.22 | 0.22 | 0.28 | 0.22 | 0.22 |