16. Next steps: pandas#

Classical programming revolves around building applications. This is in contrast with data science, where building analysis pipelines of datasets is the main focus.
In classical programming you mainly write .py source files. In data science the notebook is the predominant form.
This chapter deals with an essential library if you want to embark in data science: pandas, the core library in working with tabular data.

In the chapter Reading and writing files you have seen this code snippet for reading data from a csv (for ‘comma-separated values’) file:

file = open("data/lengths.csv", "r")
for line in file:
    print(line.strip().split(','))
['1', 'm', '180']
['2', 'm', '188']
['3', 'f', '178']
['4', 'f', '182']
['5', 'f', '172']
['6', 'm', '189']

In data science, this kind of tabular text data is ubiquitous. We have all seen data in this form, at least if you ever worked in Excel.
Without thinking about it, we adhere to some data organization rules that everybody uses. These rules are

  1. Examples (also called cases, or subjects) are organized in rows

  2. Variables (or measurements) ar organized in columns

  3. Each column only has a single data type

  4. All data in a row belong to exactly one example

  5. Each example has exactly the same variables (or a ‘missing data’ encoding when it is absent)

So, in the small dataset above, the first line represents a subject with ID 1 who has sex m (male) and a height of 180 centimetres.
Data files usually also have a header line containing the column names, as in this example:

file = open("data/placebo_drug_test.csv", "r")
for line in file:
    print(line.strip().split(','))
['Patient', 'Placebo', 'Valproate']
['p1', '37', '5']
['p2', '52', '22']
['p3', '63', '41']
['p4', '2', '4']
['p5', '25', '32']
['p6', '29', '20']
['p7', '15', '10']
['p8', '52', '25']
['p9', '19', '17']
['p10', '12', '14']
['p11', '7', '8']
['p12', '9', '8']
['p13', '65', '30']
['p14', '52', '22']
['p15', '6', '11']
['p16', '17', '1']
['p17', '54', '31']
['p18', '27', '15']
['p19', '36', '13']
['p20', '5', '5']

The idea here is the same: Each patient is represented by a row, with an identifier in column 1, a measurement of some response measured with a Placebo and with Valproate, a drug primarily used to treat epilepsy and bipolar disorder and prevent migraine headaches.

What if you wanted to analyse and visualize such a dataset: calculate means, standard deviations, correlations. Build models and apply these to new data.
All this is quite possible in basic Python but you would be crazy to do so. The next few paragraphs show you why.

Let’s start with the same drug treatment data.

import pandas as pd
drug_data = pd.read_csv('data/placebo_drug_test.csv', index_col = 0)
drug_data.head(10)
Placebo Valproate
Patient
p1 37 5
p2 52 22
p3 63 41
p4 2 4
p5 25 32
p6 29 20
p7 15 10
p8 52 25
p9 19 17
p10 12 14

The first line import pandas as pd imports the package with name pd.
The second line reads in the data file using pd.read_csv(). Note that the first column is marked as index, which represent the names of the rows. If omitted, pandas will create an index starting at 0.
In the last line, the first ten rows of the data are displayed. The head() method defaults to 6 if you omit the parameter.
Data frames can also be created as literals in several ways; here is one of them, using a dict with column names as keys and column as lists:

pd.DataFrame( 
      {'A': [30, 7, 10],
       'B': ['foo', 'bar', 'baz'],
       'C': [6.45, 9.21, 2.33]},
      index=['one', 'two', 'three'])
A B C
one 30 foo 6.45
two 7 bar 9.21
three 10 baz 2.33

If you provide lists of unequal length you will get errors looking like this:

ValueError                                Traceback (most recent call last)
Cell In[133], line 1
----> 1 pd.DataFrame( 
      2       {'A': [30, 7, 10],
      3        'B': ['foo', 'bar', 'baz'],
      4        'C': [6.45, 9.21, 2.33, 0.21]},
      5       index=['one', 'two', 'three'])


<<ROWS OMITTED>>

ValueError: Length of values (4) does not match length of index (3)

Next, let’s look at some functions used to get information about the pandas data frame.

The first is describe() which gives some summary statistics of the columns.

drug_data.describe()
Placebo Valproate
count 20.000000 20.000000
mean 29.200000 16.700000
std 20.829129 10.945319
min 2.000000 1.000000
25% 11.250000 8.000000
50% 26.000000 14.500000
75% 52.000000 22.750000
max 65.000000 41.000000

If you want to know which types, access the dtypes property (note it is not a method call!)

drug_data.dtypes
Placebo      int64
Valproate    int64
dtype: object

16.1. Selecting (and setting) data#

DataFrame provides several ways to access (get or set) positions, columns, rows or partial DataFrames. These are

  • []: slicing the data frame itself

  • pandas.DataFrame.at: Access a single value for a row/column label pair.

  • pandas.DataFrame.iat: Access a single value for a row/column pair by integer position.

  • pandas.DataFrame.loc: Access a group of rows and columns by label(s) or a boolean array.

  • pandas.DataFrame.iloc: Purely integer-location based indexing for selection by position.

As you can read, the last four serve the same purpose but with differing ways to specify indices (coordinates). They are not used as method calls, but similar to list indexing. Selections can return a DataFrame or a Series (which is list-ish in nature), depending on the result.

Below are a few examples with basic slicing.

This first one gives Placebo as a pandas.Series object (and chained to Series.head() call). This already demonstrates how chaining is an important aspect of working with data frames.

drug_data['Placebo'].head()
Patient
p1    37
p2    52
p3    63
p4     2
p5    25
Name: Placebo, dtype: int64

Since this is a Series, you can proceed with label-based selections as well.

drug_data['Placebo']['p1':'p3']
Patient
p1    37
p2    52
p3    63
Name: Placebo, dtype: int64

By providing a list of column labels, this selects multiple columns (in arbitrary order). The result is a pandas.DataFrame. The result (tmp) is used to demonstrate how a column can be deleted.

tmp = drug_data[['Valproate','Placebo']]
print(tmp.head())
del(tmp['Placebo'])
tmp.head()
         Valproate  Placebo
Patient                    
p1               5       37
p2              22       52
p3              41       63
p4               4        2
p5              32       25
Valproate
Patient
p1 5
p2 22
p3 41
p4 4
p5 32

Here is the at[] way to select or set single values. The last line uses at[] to set a value at a position.

print(drug_data.at['p1', 'Valproate']) # Valproate measurement for Patient 1
print(drug_data.at['p5', 'Placebo']) # Placebo measurement for Patient 5
drug_data.at['p5', 'Placebo'] = 28
drug_data.head()
5
25
Placebo Valproate
Patient
p1 37 5
p2 52 22
p3 63 41
p4 2 4
p5 28 32

Note that the same can be achieved using selection chaining:

drug_data['Placebo']['p1']
37

Here are the same selections with iat[]. With iat, indexing is all zero-based, as always with Python.

print(drug_data.iat[0, 1]) # Valproate measurement for Patient 1
print(drug_data.iat[4, 0]) # Placebo measurement for Patient 5
drug_data.iat[4, 0] = 25 #reset to original value
drug_data.head()
5
28
Placebo Valproate
Patient
p1 37 5
p2 52 22
p3 63 41
p4 2 4
p5 25 32

The difference between at[]/iat[] and loc[]/iloc[] is that the first pair only accesses single values. Tha latter can access series and ranges. Here are a few examples.

A single row name returns a pandas.Series object, while multiple rows returns a DataFrame. Here are a few examples.

Single value:

drug_data.loc['p1', 'Valproate'] # same as drug_data.at['p1', 'Valproate']
5

Single row, returns a Series

drug_data.loc['p1']
Placebo      37
Valproate     5
Name: p1, dtype: int64

Single column, returns a Series.

drug_data.loc[:, 'Valproate'].head() #same as drug_data['Valproate']
Patient
p1     5
p2    22
p3    41
p4     4
p5    32
Name: Valproate, dtype: int64

Multiple rows, single column.

drug_data.loc['p2':'p4', 'Valproate']
Patient
p2    22
p3    41
p4     4
Name: Valproate, dtype: int64

Multiple rows, multiple columns as column range.

drug_data.loc['p2':'p5', 'Placebo':'Valproate']
Placebo Valproate
Patient
p2 52 22
p3 63 41
p4 2 4
p5 25 32

Or, columns in arbitrary order:

drug_data.loc['p2':'p5', ['Valproate', 'Placebo']] # same as drug_data[['Valproate', 'Placebo']]['p2':'p5']
Valproate Placebo
Patient
p2 22 52
p3 41 63
p4 4 2
p5 32 25

All of the above selection methods can also be used to assign values.

16.2. Summary statistics#

Summary statistics is where you usually begin. There are many functions available

drug_data['Placebo'].mean()
drug_data.mean()
Placebo      29.2
Valproate    16.7
dtype: float64

This section only skimmed te possibilities of the pandas DataFrame class. Many dedicated books and tutorials exist, and a complete reference of all properties and functions can be found here.