16. Next steps: pandas#
Classical programming revolves around building applications. This is in contrast with data science, where building analysis pipelines of datasets is the main focus.
In classical programming you mainly write .py
source files. In data science the notebook is the predominant form.
This chapter deals with an essential library if you want to embark in data science: pandas
, the core library in working with tabular data.
In the chapter Reading and writing files you have seen this code snippet for reading data from a csv (for ‘comma-separated values’) file:
file = open("data/lengths.csv", "r")
for line in file:
print(line.strip().split(','))
['1', 'm', '180']
['2', 'm', '188']
['3', 'f', '178']
['4', 'f', '182']
['5', 'f', '172']
['6', 'm', '189']
In data science, this kind of tabular text data is ubiquitous. We have all seen data in this form, at least if you ever worked in Excel.
Without thinking about it, we adhere to some data organization rules that everybody uses. These rules are
Examples (also called cases, or subjects) are organized in rows
Variables (or measurements) ar organized in columns
Each column only has a single data type
All data in a row belong to exactly one example
Each example has exactly the same variables (or a ‘missing data’ encoding when it is absent)
So, in the small dataset above, the first line represents a subject with ID 1
who has sex m
(male) and a height of 180
centimetres.
Data files usually also have a header line containing the column names, as in this example:
file = open("data/placebo_drug_test.csv", "r")
for line in file:
print(line.strip().split(','))
['Patient', 'Placebo', 'Valproate']
['p1', '37', '5']
['p2', '52', '22']
['p3', '63', '41']
['p4', '2', '4']
['p5', '25', '32']
['p6', '29', '20']
['p7', '15', '10']
['p8', '52', '25']
['p9', '19', '17']
['p10', '12', '14']
['p11', '7', '8']
['p12', '9', '8']
['p13', '65', '30']
['p14', '52', '22']
['p15', '6', '11']
['p16', '17', '1']
['p17', '54', '31']
['p18', '27', '15']
['p19', '36', '13']
['p20', '5', '5']
The idea here is the same: Each patient is represented by a row, with an identifier in column 1, a measurement of some response measured with a Placebo and with Valproate, a drug primarily used to treat epilepsy and bipolar disorder and prevent migraine headaches.
What if you wanted to analyse and visualize such a dataset: calculate means, standard deviations, correlations. Build models and apply these to new data.
All this is quite possible in basic Python but you would be crazy to do so. The next few paragraphs show you why.
Let’s start with the same drug treatment data.
import pandas as pd
drug_data = pd.read_csv('data/placebo_drug_test.csv', index_col = 0)
drug_data.head(10)
Placebo | Valproate | |
---|---|---|
Patient | ||
p1 | 37 | 5 |
p2 | 52 | 22 |
p3 | 63 | 41 |
p4 | 2 | 4 |
p5 | 25 | 32 |
p6 | 29 | 20 |
p7 | 15 | 10 |
p8 | 52 | 25 |
p9 | 19 | 17 |
p10 | 12 | 14 |
The first line import pandas as pd
imports the package with name pd
.
The second line reads in the data file using pd.read_csv()
. Note that the first column is marked as index
, which represent the names of the rows. If omitted, pandas will create an index starting at 0.
In the last line, the first ten rows of the data are displayed. The head()
method defaults to 6 if you omit the parameter.
Data frames can also be created as literals in several ways; here is one of them, using a dict with column names as keys and column as lists:
pd.DataFrame(
{'A': [30, 7, 10],
'B': ['foo', 'bar', 'baz'],
'C': [6.45, 9.21, 2.33]},
index=['one', 'two', 'three'])
A | B | C | |
---|---|---|---|
one | 30 | foo | 6.45 |
two | 7 | bar | 9.21 |
three | 10 | baz | 2.33 |
If you provide lists of unequal length you will get errors looking like this:
ValueError Traceback (most recent call last)
Cell In[133], line 1
----> 1 pd.DataFrame(
2 {'A': [30, 7, 10],
3 'B': ['foo', 'bar', 'baz'],
4 'C': [6.45, 9.21, 2.33, 0.21]},
5 index=['one', 'two', 'three'])
<<ROWS OMITTED>>
ValueError: Length of values (4) does not match length of index (3)
Next, let’s look at some functions used to get information about the pandas data frame.
The first is describe()
which gives some summary statistics of the columns.
drug_data.describe()
Placebo | Valproate | |
---|---|---|
count | 20.000000 | 20.000000 |
mean | 29.200000 | 16.700000 |
std | 20.829129 | 10.945319 |
min | 2.000000 | 1.000000 |
25% | 11.250000 | 8.000000 |
50% | 26.000000 | 14.500000 |
75% | 52.000000 | 22.750000 |
max | 65.000000 | 41.000000 |
If you want to know which types, access the dtypes
property (note it is not a method call!)
drug_data.dtypes
Placebo int64
Valproate int64
dtype: object
16.1. Selecting (and setting) data#
DataFrame provides several ways to access (get or set) positions, columns, rows or partial DataFrames. These are
[]
: slicing the data frame itselfpandas.DataFrame.at
: Access a single value for a row/column label pair.pandas.DataFrame.iat
: Access a single value for a row/column pair by integer position.pandas.DataFrame.loc
: Access a group of rows and columns by label(s) or a boolean array.pandas.DataFrame.iloc
: Purely integer-location based indexing for selection by position.
As you can read, the last four serve the same purpose but with differing ways to specify indices (coordinates). They are not used as method calls, but similar to list indexing. Selections can return a DataFrame or a Series (which is list-ish in nature), depending on the result.
Below are a few examples with basic slicing.
This first one gives Placebo as a pandas.Series object (and chained to Series.head()
call). This already demonstrates how chaining is an important aspect of working with data frames.
drug_data['Placebo'].head()
Patient
p1 37
p2 52
p3 63
p4 2
p5 25
Name: Placebo, dtype: int64
Since this is a Series, you can proceed with label-based selections as well.
drug_data['Placebo']['p1':'p3']
Patient
p1 37
p2 52
p3 63
Name: Placebo, dtype: int64
By providing a list of column labels, this selects multiple columns (in arbitrary order). The result is a pandas.DataFrame. The result (tmp
) is used to demonstrate how a column can be deleted.
tmp = drug_data[['Valproate','Placebo']]
print(tmp.head())
del(tmp['Placebo'])
tmp.head()
Valproate Placebo
Patient
p1 5 37
p2 22 52
p3 41 63
p4 4 2
p5 32 25
Valproate | |
---|---|
Patient | |
p1 | 5 |
p2 | 22 |
p3 | 41 |
p4 | 4 |
p5 | 32 |
Here is the at[]
way to select or set single values. The last line uses at[]
to set a value at a position.
print(drug_data.at['p1', 'Valproate']) # Valproate measurement for Patient 1
print(drug_data.at['p5', 'Placebo']) # Placebo measurement for Patient 5
drug_data.at['p5', 'Placebo'] = 28
drug_data.head()
5
25
Placebo | Valproate | |
---|---|---|
Patient | ||
p1 | 37 | 5 |
p2 | 52 | 22 |
p3 | 63 | 41 |
p4 | 2 | 4 |
p5 | 28 | 32 |
Note that the same can be achieved using selection chaining:
drug_data['Placebo']['p1']
37
Here are the same selections with iat[]
. With iat
, indexing is all zero-based, as always with Python.
print(drug_data.iat[0, 1]) # Valproate measurement for Patient 1
print(drug_data.iat[4, 0]) # Placebo measurement for Patient 5
drug_data.iat[4, 0] = 25 #reset to original value
drug_data.head()
5
28
Placebo | Valproate | |
---|---|---|
Patient | ||
p1 | 37 | 5 |
p2 | 52 | 22 |
p3 | 63 | 41 |
p4 | 2 | 4 |
p5 | 25 | 32 |
The difference between at[]
/iat[]
and loc[]
/iloc[]
is that the first pair only accesses single values. Tha latter can access series and ranges. Here are a few examples.
A single row name returns a pandas.Series
object, while multiple rows returns a DataFrame. Here are a few examples.
Single value:
drug_data.loc['p1', 'Valproate'] # same as drug_data.at['p1', 'Valproate']
5
Single row, returns a Series
drug_data.loc['p1']
Placebo 37
Valproate 5
Name: p1, dtype: int64
Single column, returns a Series.
drug_data.loc[:, 'Valproate'].head() #same as drug_data['Valproate']
Patient
p1 5
p2 22
p3 41
p4 4
p5 32
Name: Valproate, dtype: int64
Multiple rows, single column.
drug_data.loc['p2':'p4', 'Valproate']
Patient
p2 22
p3 41
p4 4
Name: Valproate, dtype: int64
Multiple rows, multiple columns as column range.
drug_data.loc['p2':'p5', 'Placebo':'Valproate']
Placebo | Valproate | |
---|---|---|
Patient | ||
p2 | 52 | 22 |
p3 | 63 | 41 |
p4 | 2 | 4 |
p5 | 25 | 32 |
Or, columns in arbitrary order:
drug_data.loc['p2':'p5', ['Valproate', 'Placebo']] # same as drug_data[['Valproate', 'Placebo']]['p2':'p5']
Valproate | Placebo | |
---|---|---|
Patient | ||
p2 | 22 | 52 |
p3 | 41 | 63 |
p4 | 4 | 2 |
p5 | 32 | 25 |
All of the above selection methods can also be used to assign values.
16.2. Summary statistics#
Summary statistics is where you usually begin. There are many functions available
drug_data['Placebo'].mean()
drug_data.mean()
Placebo 29.2
Valproate 16.7
dtype: float64
This section only skimmed te possibilities of the pandas DataFrame class. Many dedicated books and tutorials exist, and a complete reference of all properties and functions can be found here.