Chapter 9 Exploratory Data Analysis

9.1 Introduction

This chapter introduces you to the concept of Exploratory Data Analysis (EDA). In EDA a dataset is analysed with the goal of assessing its main characteristics, including its quality and usability for subsequent statistical modelling and analysis. Data summary techniques and visualizations ore used in an EDA. There is no fixed set of activities; each dataset poses its own questions and challenges.

Datasets are often expensive and time-consuming to collect, and therefore they are usually collected with a specific goal in mind. The EDA you perform should always be with this goal on the horizon: is the dataset of sufficient quality to answer the scientific questions for which they were collected?

EDA outline

An EDA therefore may include:

Dataset description: how were they collected, what data types do the variables have, what are the units and physical quantities, how are missing data encoded, etc.
Data summary: number of cases, number of variables, basic statistical description (i.e. mean, median, sd etc.), variable distributions (normal or not, outliers, missing data, skewed distributions)
Visual summaries using boxplots, histograms, density plots.
Recoding or transformations of variables may be required to get better results (e.g. from numeric to factor or vice versa, or log transformation of exponential data)
Exploration of variable relations/covariance. It is always interesting to know about correlations between variables, but this is especially the case when you have a dependent variable for which you wish to build a statistical model in a later stage of your analysis.

Note that your EDA results should always be accompanied by text that describes your results and discusses the implications of them. your figures should be well annotated with a caption and legend if relevant. An EDA is not a publication, but it will usually have a short introduction, a results section and a discussion. In contrast with a publication you usually do show the code. This ensures complete transparency and reproducibility.

In this chapter I will demonstrate a short EDA on the “Yeast dataset” from the UCI Machine Learning repository.

It is outside the scope of this short analysis to delve too deeply in the attribute background information, but you should realize that for a real analysis this is absolutely critical: without domain knowledge the you won’t have the insight to identify strange, worrying or surprising results.

9.2 EDA of the Yeast dataset

9.2.1 Introduction

The data were collected from the UCI Machine Learning website. To ensure their continued availability the files were copied to a personal repository.

The data were accompanied by a very short abstract: “Predicting the Cellular Localization Sites of Proteins”. The data description states there are 1484 instances, each with 8 attributes (variables) and a class label. The class label is the variable we wish to build a model for so this is the dependent variable.

The Attribute Information section describes these variables:

Sequence Name: Accession number for the SWISS-PROT database
mcg: McGeoch’s method for signal sequence recognition.
gvh: von Heijne’s method for signal sequence recognition.
alm: Score of the ALOM membrane spanning region prediction program.
mit: Score of discriminant analysis of the amino acid content of the N-terminal region (20 residues long) of mitochondrial and non-mitochondrial proteins.
erl: Presence of “HDEL” substring (thought to act as a signal for retention in the endoplasmic reticulum lumen). Binary attribute.
pox: Peroxisomal targeting signal in the C-terminus.
vac: Score of discriminant analysis of the amino acid content of vacuolar and extracellular proteins.
nuc: Score of discriminant analysis of nuclear localization signals of nuclear and non-nuclear proteins.

The sequence name is not interesting in this EDA: we are interested in patterns, not individuals. However, if there is an anomalous protein in the EDA it should be possible to retrace its origin so the identifier stays in the dataset.

There seem to be two attributes (mcg and gvh) measuring the same property - the presence of a signal sequence which is the N-terminal part of a protein signalling the cellular machinery the protein should be exported.

Very simply put, if this scoring system was perfect each of the attributes (except mcg and gvh) would unequivocally “assign” the protein in question to a cellular location: Cellular external matrix, membrane inserted, mitochondrial, endoplasmic reticulum lumen, peroxisomal, vacuolar or nuclear.

The tenth variable in the data is the dependent or explanatory variable. The yeast.names file describes it like this:

Class Distribution. The class is the localization site. Please see Nakai &
               Kanehisa referenced above for more details.
  CYT (cytosolic or cytoskeletal)                    463
  NUC (nuclear)                                      429
  MIT (mitochondrial)                                244
  ME3 (membrane protein, no N-terminal signal)       163
  ME2 (membrane protein, uncleaved signal)            51
  ME1 (membrane protein, cleaved signal)              44
  EXC (extracellular)                                 37
  VAC (vacuolar)                                      30
  POX (peroxisomal)                                   20
  ERL (endoplasmic reticulum lumen)                    5

This tells me the different localizations are by no means equally distributed; there is an over representation of “cytosolic or cytoskeletal” and “nuclear” and a huge under representation of especially endoplasmic reticulum lumen proteins.

9.2.2 Data loading and prepping

Since I am going to rerun the code in this notebook often I am going to create a local copy, and load that one.

file_name <- "yeast.data"
yeast_data_url <- paste0("https://raw.githubusercontent.com/MichielNoback/datasets/master/UCI_yeast_protein_loc/", file_name)
yeast_local_location <- paste0("../", file_name)

#only download if not present
if (! file.exists(yeast_local_location)) {
    download.file(url = yeast_data_url, 
                  destfile = yeast_local_location)
}

yeast_data <- read.table(file = yeast_local_location,
                         sep = ",",
                         as.is = 1)
str(yeast_data)

## 'data.frame':    1484 obs. of  10 variables:
##  $ V1 : chr  "ADT1_YEAST" "ADT2_YEAST" "ADT3_YEAST" "AAR2_YEAST" ...
##  $ V2 : num  0.58 0.43 0.64 0.58 0.42 0.51 0.5 0.48 0.55 0.4 ...
##  $ V3 : num  0.61 0.67 0.62 0.44 0.44 0.4 0.54 0.45 0.5 0.39 ...
##  $ V4 : num  0.47 0.48 0.49 0.57 0.48 0.56 0.48 0.59 0.66 0.6 ...
##  $ V5 : num  0.13 0.27 0.15 0.13 0.54 0.17 0.65 0.2 0.36 0.15 ...
##  $ V6 : num  0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
##  $ V7 : num  0 0 0 0 0 0.5 0 0 0 0 ...
##  $ V8 : num  0.48 0.53 0.53 0.54 0.48 0.49 0.53 0.58 0.49 0.58 ...
##  $ V9 : num  0.22 0.22 0.22 0.22 0.22 0.22 0.22 0.34 0.22 0.3 ...
##  $ V10: Factor w/ 10 levels "CYT","ERL","EXC",..: 7 7 7 8 7 1 7 8 7 1 ...

The data seems to have been loaded correctly and, as expected since that was stated in the original data description, there are no missing data.

The column names were not defined in the data file so this will be fixed first. I will create a data frame that also holds the column descriptions. I have put the data in a small text file for easy loading and editing; the attribute descriptions copied/pasted into a text file and with find/replace converted in easy to load form, and a new label column added for use in plotting. There were a few occurrences of the ' character which always corrupt data import into R. They were removed.

attribute_info <- read.table("data/yeast_attribute_info.txt", 
                             sep = ":",
                             header = TRUE,
                             stringsAsFactors = FALSE)
#attach column names
(colnames(yeast_data) <- attribute_info$attribute)

##  [1] "accno" "mcg"   "gvh"   "alm"   "mit"   "erl"   "pox"   "vac"   "nuc"  
## [10] "loc"

To make this info easily accessible, a function is created that can be used to fetch either the description or the label.

get_attribute_info <- function(attribute, resource = "label") {
    if (! resource %in% c("label", "description")) {
        stop(paste0("type ", resource, "is not an attribute resource"))
    }
    return(attribute_info[attribute_info$attribute == attribute, resource])
}

#test it
get_attribute_info("gvh")

## [1] "VonHeijne signal"

get_attribute_info("accno", "description")

## [1] "Accession number for the SWISS-PROT database"

9.2.3 Data verification

The original data description stated that 1484 instances with 9 attributes + a class label are present.

dim(yeast_data)

## [1] 1484   10

This is correct. No missing data are supposed to be there:

sum(complete.cases(yeast_data))

## [1] 1484

Also correct.

The classes of the columns are also OK so the data is verified and found correct.

9.2.4 Attribute summaries

A first scan of the attributes.

summary(yeast_data)

##     accno                mcg            gvh            alm      
##  Length:1484        Min.   :0.11   Min.   :0.13   Min.   :0.21  
##  Class :character   1st Qu.:0.41   1st Qu.:0.42   1st Qu.:0.46  
##  Mode  :character   Median :0.49   Median :0.49   Median :0.51  
##                     Mean   :0.50   Mean   :0.50   Mean   :0.50  
##                     3rd Qu.:0.58   3rd Qu.:0.57   3rd Qu.:0.55  
##                     Max.   :1.00   Max.   :1.00   Max.   :1.00  
##                                                                 
##       mit             erl             pox              vac      
##  Min.   :0.000   Min.   :0.500   Min.   :0.0000   Min.   :0.00  
##  1st Qu.:0.170   1st Qu.:0.500   1st Qu.:0.0000   1st Qu.:0.48  
##  Median :0.220   Median :0.500   Median :0.0000   Median :0.51  
##  Mean   :0.261   Mean   :0.505   Mean   :0.0075   Mean   :0.50  
##  3rd Qu.:0.320   3rd Qu.:0.500   3rd Qu.:0.0000   3rd Qu.:0.53  
##  Max.   :1.000   Max.   :1.000   Max.   :0.8300   Max.   :0.73  
##                                                                 
##       nuc             loc     
##  Min.   :0.000   CYT    :463  
##  1st Qu.:0.220   NUC    :429  
##  Median :0.220   MIT    :244  
##  Mean   :0.276   ME3    :163  
##  3rd Qu.:0.300   ME2    : 51  
##  Max.   :1.000   ME1    : 44  
##                  (Other): 90

All these attributes seem to be in the range zero to one, except of course for the localization attribute and the accession number. This is not surprising since these attributes are all probabilities. The class distribution corresponds with the published data.

The The pox (Peroxisomal) and erl ( ER-retention) attributes, seem to have a really strange distribution and should be investigated further.

Here are all the numeric attributes in a single panel. I chose histogram over boxplot or density plot because it is more fine-grained than boxplot and very easy to interpret.

par(mfrow = c(4, 2))

for ( attrib in colnames(yeast_data[2:9])) {
    hist(yeast_data[, attrib],
         main = NULL,
         xlab = paste0(get_attribute_info(attrib), " (prob)"))
}

Figure 9.1: Distributions of the numeric variables

#reset par()
par(mfrow = c(1, 1))

Here, it can be seen that the attributes “ER-retention” and “Peroxisomal” are pretty much uninformative, a problem probably largely caused by the low abundance of proteins in these classes. Surprisingly enough, the “Vacuolar” property which is also low-abundant, does not have this extreme distribution.

What is also striking is that, although these attributes are probabilities of targeting signals, none (with the exception of the two low-abundance ones) show a more or less bi-modal distribution as you would naively expect.

9.2.5 Attribute relationships

Here is a quick scan of variable relationship, excluding the dependent variable. The pairs() function is used for that. I excluded erl and pox because they do not add information to the picture.

pairs(yeast_data[c(2, 3, 4, 5, 8, 9)],
      panel = "panel.smooth",
      pch = 20, 
      cex = 0.35,
      col = rgb(0.1, 0.4, 1, alpha = 0.4))

Figure 9.2: Relations between the numeric variables

By adding the smoother it is made clear that the only pair which shows a clear correlation is the pair mcg / gvh which actually predict the same property. Therefore it would have been very surprising indeed if they would not have a correlation. The strength of this correlation -the R-squared- is this

lin_mod <- lm(yeast_data$mcg ~ yeast_data$gvh)
summary(lin_mod)$r.squared

## [1] 0.3383

This is not very strong.

9.2.6 Correlations to the dependent variable

As a first investigation, numeric variables are split on the location variable.

library(RColorBrewer)
colors <- RColorBrewer::brewer.pal(8, name = "Accent")
#col = c("darkblue", "darkgreen", "magenta", )
col_i <- 0
for (name in names(yeast_data[2:9])) {
    col_i <- col_i + 1
    boxplot(yeast_data[, name] ~ yeast_data$loc,
            xlab = NULL,
            ylab = get_attribute_info(name),
            col = colors[col_i])
}

Figure 9.3: Correlations with the dependent variable

Figure 9.4: Correlations with the dependent variable

Figure 9.5: Correlations with the dependent variable

Figure 9.6: Correlations with the dependent variable

Figure 9.7: Correlations with the dependent variable

Figure 9.8: Correlations with the dependent variable

Figure 9.9: Correlations with the dependent variable

Figure 9.10: Correlations with the dependent variable

Quite a few of the variables correlate pretty well with the dependent variable. These do so very well: alm, mit, erl, pox. Others are not: nuc, vac, and some are ambiguous - they discriminate, but not exclusively: mcg, gvh.

Here are some summary statistics for the numeric variables split on the dependent variable:

options(digits=2)

tmp <- t(aggregate( . ~ loc, data = yeast_data[, -1], FUN = mean))
knitr::kable(tmp)

## Warning in kable_pipe(x = structure(c("loc", "mcg", "gvh", "alm", "mit", : The
## table should have a header (column names)

loc	CYT	ERL	EXC	ME1	ME2	ME3	MIT	NUC	POX	VAC
mcg	0.48	0.79	0.74	0.79	0.72	0.43	0.52	0.45	0.52	0.55
gvh	0.47	0.77	0.72	0.76	0.60	0.49	0.53	0.46	0.51	0.53
alm	0.54	0.48	0.49	0.38	0.41	0.36	0.52	0.53	0.51	0.47
mit	0.23	0.34	0.29	0.31	0.28	0.21	0.40	0.23	0.25	0.20
erl	0.50	1.00	0.50	0.50	0.51	0.50	0.50	0.50	0.50	0.50
pox	0.0011	0.0000	0.0000	0.0000	0.0000	0.0000	0.0089	0.0000	0.4235	0.0000
vac	0.50	0.55	0.46	0.51	0.51	0.51	0.50	0.49	0.50	0.53
nuc	0.26	0.25	0.23	0.27	0.25	0.27	0.24	0.33	0.23	0.25

tmp <- t(aggregate( . ~ loc, data = yeast_data[, -1], FUN = median))
knitr::kable(tmp)

## Warning in kable_pipe(x = structure(c("loc", "mcg", "gvh", "alm", "mit", : The
## table should have a header (column names)

loc	CYT	ERL	EXC	ME1	ME2	ME3	MIT	NUC	POX	VAC
mcg	0.48	0.78	0.77	0.78	0.76	0.43	0.51	0.45	0.48	0.56
gvh	0.46	0.80	0.73	0.73	0.59	0.49	0.53	0.46	0.51	0.51
alm	0.53	0.50	0.49	0.36	0.41	0.36	0.52	0.53	0.51	0.49
mit	0.19	0.37	0.28	0.29	0.26	0.20	0.41	0.20	0.21	0.17
erl	0.5	1.0	0.5	0.5	0.5	0.5	0.5	0.5	0.5	0.5
pox	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.5	0.0
vac	0.51	0.54	0.48	0.53	0.52	0.51	0.50	0.50	0.51	0.53
nuc	0.22	0.22	0.22	0.22	0.22	0.22	0.22	0.28	0.22	0.22