{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Next steps: pandas\n",
"Classical programming revolves around building applications. This is in contrast with data science, where building analysis pipelines of datasets is the main focus. \n",
"In classical programming you mainly write `.py` source files. In data science the notebook is the predominant form. \n",
"This chapter deals with an essential library if you want to embark in data science: `pandas`, the core library in working with tabular data. \n",
"\n",
"In the chapter {doc}`Reading and writing files <05_reading_writing_files>` you have seen this code snippet for reading data from a csv (for 'comma-separated values') file:"
]
},
{
"cell_type": "code",
"execution_count": 112,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"['1', 'm', '180']\n",
"['2', 'm', '188']\n",
"['3', 'f', '178']\n",
"['4', 'f', '182']\n",
"['5', 'f', '172']\n",
"['6', 'm', '189']\n"
]
}
],
"source": [
"file = open(\"data/lengths.csv\", \"r\")\n",
"for line in file:\n",
" print(line.strip().split(','))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In data science, this kind of tabular text data is ubiquitous. We have all seen data in this form, at least if you ever worked in Excel. \n",
"Without thinking about it, we adhere to some data organization rules that everybody uses. These rules are \n",
"\n",
"1. _Examples_ (also called cases, or subjects) are organized in rows\n",
"2. _Variables_ (or measurements) ar organized in columns\n",
"3. Each column only has a _single data type_\n",
"4. All data in a row belong to exactly one example\n",
"5. Each example has exactly the same variables (or a 'missing data' encoding when it is absent)\n",
"\n",
"So, in the small dataset above, the first line represents a subject with ID `1` who has sex `m` (male) and a height of `180` centimetres. \n",
"Data files usually also have a header line containing the column names, as in this example: "
]
},
{
"cell_type": "code",
"execution_count": 113,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"['Patient', 'Placebo', 'Valproate']\n",
"['p1', '37', '5']\n",
"['p2', '52', '22']\n",
"['p3', '63', '41']\n",
"['p4', '2', '4']\n",
"['p5', '25', '32']\n",
"['p6', '29', '20']\n",
"['p7', '15', '10']\n",
"['p8', '52', '25']\n",
"['p9', '19', '17']\n",
"['p10', '12', '14']\n",
"['p11', '7', '8']\n",
"['p12', '9', '8']\n",
"['p13', '65', '30']\n",
"['p14', '52', '22']\n",
"['p15', '6', '11']\n",
"['p16', '17', '1']\n",
"['p17', '54', '31']\n",
"['p18', '27', '15']\n",
"['p19', '36', '13']\n",
"['p20', '5', '5']\n"
]
}
],
"source": [
"file = open(\"data/placebo_drug_test.csv\", \"r\")\n",
"for line in file:\n",
" print(line.strip().split(','))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The idea here is the same: Each patient is represented by a row, with an identifier in column 1, a measurement of some response measured with a Placebo and with Valproate, a drug primarily used to treat epilepsy and bipolar disorder and prevent migraine headaches. \n",
"\n",
"What if you wanted to analyse and visualize such a dataset: calculate means, standard deviations, correlations. Build models and apply these to new data. \n",
"All this is quite possible in basic Python but you would be crazy to do so. The next few paragraphs show you why. \n",
"\n",
"Let's start with the same drug treatment data."
]
},
{
"cell_type": "code",
"execution_count": 114,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Placebo | \n",
" Valproate | \n",
"
\n",
" \n",
" Patient | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" p1 | \n",
" 37 | \n",
" 5 | \n",
"
\n",
" \n",
" p2 | \n",
" 52 | \n",
" 22 | \n",
"
\n",
" \n",
" p3 | \n",
" 63 | \n",
" 41 | \n",
"
\n",
" \n",
" p4 | \n",
" 2 | \n",
" 4 | \n",
"
\n",
" \n",
" p5 | \n",
" 25 | \n",
" 32 | \n",
"
\n",
" \n",
" p6 | \n",
" 29 | \n",
" 20 | \n",
"
\n",
" \n",
" p7 | \n",
" 15 | \n",
" 10 | \n",
"
\n",
" \n",
" p8 | \n",
" 52 | \n",
" 25 | \n",
"
\n",
" \n",
" p9 | \n",
" 19 | \n",
" 17 | \n",
"
\n",
" \n",
" p10 | \n",
" 12 | \n",
" 14 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Placebo Valproate\n",
"Patient \n",
"p1 37 5\n",
"p2 52 22\n",
"p3 63 41\n",
"p4 2 4\n",
"p5 25 32\n",
"p6 29 20\n",
"p7 15 10\n",
"p8 52 25\n",
"p9 19 17\n",
"p10 12 14"
]
},
"execution_count": 114,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import pandas as pd\n",
"drug_data = pd.read_csv('data/placebo_drug_test.csv', index_col = 0)\n",
"drug_data.head(10)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The first line `import pandas as pd` imports the package with name `pd`. \n",
"The second line reads in the data file using `pd.read_csv()`. Note that the first column is marked as `index`, which represent the names of the rows. If omitted, pandas will create an index starting at 0. \n",
"In the last line, the first ten rows of the data are displayed. The `head()` method defaults to 6 if you omit the parameter. \n",
"Data frames can also be created as literals in several ways; here is one of them, using a dict with column names as keys and column as lists: "
]
},
{
"cell_type": "code",
"execution_count": 134,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" A | \n",
" B | \n",
" C | \n",
"
\n",
" \n",
" \n",
" \n",
" one | \n",
" 30 | \n",
" foo | \n",
" 6.45 | \n",
"
\n",
" \n",
" two | \n",
" 7 | \n",
" bar | \n",
" 9.21 | \n",
"
\n",
" \n",
" three | \n",
" 10 | \n",
" baz | \n",
" 2.33 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" A B C\n",
"one 30 foo 6.45\n",
"two 7 bar 9.21\n",
"three 10 baz 2.33"
]
},
"execution_count": 134,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pd.DataFrame( \n",
" {'A': [30, 7, 10],\n",
" 'B': ['foo', 'bar', 'baz'],\n",
" 'C': [6.45, 9.21, 2.33]},\n",
" index=['one', 'two', 'three'])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you provide lists of unequal length you will get errors looking like this:\n",
"\n",
"```traceback\n",
"ValueError Traceback (most recent call last)\n",
"Cell In[133], line 1\n",
"----> 1 pd.DataFrame( \n",
" 2 {'A': [30, 7, 10],\n",
" 3 'B': ['foo', 'bar', 'baz'],\n",
" 4 'C': [6.45, 9.21, 2.33, 0.21]},\n",
" 5 index=['one', 'two', 'three'])\n",
"\n",
"\n",
"<>\n",
"\n",
"ValueError: Length of values (4) does not match length of index (3)\n",
"```\n",
"\n",
"Next, let's look at some functions used to get information about the pandas data frame. \n",
"\n",
"The first is `describe()` which gives some summary statistics of the columns."
]
},
{
"cell_type": "code",
"execution_count": 116,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Placebo | \n",
" Valproate | \n",
"
\n",
" \n",
" \n",
" \n",
" count | \n",
" 20.000000 | \n",
" 20.000000 | \n",
"
\n",
" \n",
" mean | \n",
" 29.200000 | \n",
" 16.700000 | \n",
"
\n",
" \n",
" std | \n",
" 20.829129 | \n",
" 10.945319 | \n",
"
\n",
" \n",
" min | \n",
" 2.000000 | \n",
" 1.000000 | \n",
"
\n",
" \n",
" 25% | \n",
" 11.250000 | \n",
" 8.000000 | \n",
"
\n",
" \n",
" 50% | \n",
" 26.000000 | \n",
" 14.500000 | \n",
"
\n",
" \n",
" 75% | \n",
" 52.000000 | \n",
" 22.750000 | \n",
"
\n",
" \n",
" max | \n",
" 65.000000 | \n",
" 41.000000 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Placebo Valproate\n",
"count 20.000000 20.000000\n",
"mean 29.200000 16.700000\n",
"std 20.829129 10.945319\n",
"min 2.000000 1.000000\n",
"25% 11.250000 8.000000\n",
"50% 26.000000 14.500000\n",
"75% 52.000000 22.750000\n",
"max 65.000000 41.000000"
]
},
"execution_count": 116,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"drug_data.describe()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you want to know which types, access the `dtypes` property (note it is not a method call!)"
]
},
{
"cell_type": "code",
"execution_count": 117,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Placebo int64\n",
"Valproate int64\n",
"dtype: object"
]
},
"execution_count": 117,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"drug_data.dtypes"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Selecting (and setting) data\n",
"\n",
"DataFrame provides several ways to access (get or set) positions, columns, rows or partial DataFrames. These are \n",
"\n",
"- `[]`: slicing the data frame itself\n",
"- `pandas.DataFrame.at`: Access a single value for a row/column label pair.\n",
"- `pandas.DataFrame.iat`: Access a single value for a row/column pair by integer position.\n",
"- `pandas.DataFrame.loc`: Access a group of rows and columns by label(s) or a boolean array.\n",
"- `pandas.DataFrame.iloc`: Purely integer-location based indexing for selection by position.\n",
"\n",
"As you can read, the last four serve the same purpose but with differing ways to specify indices (coordinates).\n",
"They are not used as method calls, but similar to list indexing.\n",
"Selections can return a DataFrame or a Series (which is list-ish in nature), depending on the result. \n",
"\n",
"Below are a few examples with basic slicing. \n",
"\n",
"This first one gives Placebo as a pandas.Series object (and chained to `Series.head()` call). This already demonstrates how chaining is an important aspect of working with data frames."
]
},
{
"cell_type": "code",
"execution_count": 118,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Patient\n",
"p1 37\n",
"p2 52\n",
"p3 63\n",
"p4 2\n",
"p5 25\n",
"Name: Placebo, dtype: int64"
]
},
"execution_count": 118,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"drug_data['Placebo'].head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Since this is a Series, you can proceed with label-based selections as well."
]
},
{
"cell_type": "code",
"execution_count": 119,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Patient\n",
"p1 37\n",
"p2 52\n",
"p3 63\n",
"Name: Placebo, dtype: int64"
]
},
"execution_count": 119,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"drug_data['Placebo']['p1':'p3']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"By providing a list of column labels, this selects multiple columns (in arbitrary order). The result is a pandas.DataFrame. The result (`tmp`) is used to demonstrate how a column can be deleted."
]
},
{
"cell_type": "code",
"execution_count": 120,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" Valproate Placebo\n",
"Patient \n",
"p1 5 37\n",
"p2 22 52\n",
"p3 41 63\n",
"p4 4 2\n",
"p5 32 25\n"
]
},
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Valproate | \n",
"
\n",
" \n",
" Patient | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" p1 | \n",
" 5 | \n",
"
\n",
" \n",
" p2 | \n",
" 22 | \n",
"
\n",
" \n",
" p3 | \n",
" 41 | \n",
"
\n",
" \n",
" p4 | \n",
" 4 | \n",
"
\n",
" \n",
" p5 | \n",
" 32 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Valproate\n",
"Patient \n",
"p1 5\n",
"p2 22\n",
"p3 41\n",
"p4 4\n",
"p5 32"
]
},
"execution_count": 120,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"tmp = drug_data[['Valproate','Placebo']]\n",
"print(tmp.head())\n",
"del(tmp['Placebo'])\n",
"tmp.head()\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here is the `at[]` way to select or set single values. The last line uses `at[]` to set a value at a position."
]
},
{
"cell_type": "code",
"execution_count": 121,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"5\n",
"25\n"
]
},
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Placebo | \n",
" Valproate | \n",
"
\n",
" \n",
" Patient | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" p1 | \n",
" 37 | \n",
" 5 | \n",
"
\n",
" \n",
" p2 | \n",
" 52 | \n",
" 22 | \n",
"
\n",
" \n",
" p3 | \n",
" 63 | \n",
" 41 | \n",
"
\n",
" \n",
" p4 | \n",
" 2 | \n",
" 4 | \n",
"
\n",
" \n",
" p5 | \n",
" 28 | \n",
" 32 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Placebo Valproate\n",
"Patient \n",
"p1 37 5\n",
"p2 52 22\n",
"p3 63 41\n",
"p4 2 4\n",
"p5 28 32"
]
},
"execution_count": 121,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"print(drug_data.at['p1', 'Valproate']) # Valproate measurement for Patient 1\n",
"print(drug_data.at['p5', 'Placebo']) # Placebo measurement for Patient 5\n",
"drug_data.at['p5', 'Placebo'] = 28\n",
"drug_data.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that the same can be achieved using selection chaining:"
]
},
{
"cell_type": "code",
"execution_count": 122,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"37"
]
},
"execution_count": 122,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"drug_data['Placebo']['p1']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here are the same selections with `iat[]`. With `iat`, indexing is all zero-based, as always with Python. "
]
},
{
"cell_type": "code",
"execution_count": 123,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"5\n",
"28\n"
]
},
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Placebo | \n",
" Valproate | \n",
"
\n",
" \n",
" Patient | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" p1 | \n",
" 37 | \n",
" 5 | \n",
"
\n",
" \n",
" p2 | \n",
" 52 | \n",
" 22 | \n",
"
\n",
" \n",
" p3 | \n",
" 63 | \n",
" 41 | \n",
"
\n",
" \n",
" p4 | \n",
" 2 | \n",
" 4 | \n",
"
\n",
" \n",
" p5 | \n",
" 25 | \n",
" 32 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Placebo Valproate\n",
"Patient \n",
"p1 37 5\n",
"p2 52 22\n",
"p3 63 41\n",
"p4 2 4\n",
"p5 25 32"
]
},
"execution_count": 123,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"print(drug_data.iat[0, 1]) # Valproate measurement for Patient 1\n",
"print(drug_data.iat[4, 0]) # Placebo measurement for Patient 5\n",
"drug_data.iat[4, 0] = 25 #reset to original value\n",
"drug_data.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The difference between `at[]`/`iat[]` and `loc[]`/`iloc[]` is that the first pair only accesses single values. Tha latter can access series and ranges. Here are a few examples.\n",
"\n",
"A single row name returns a `pandas.Series` object, while multiple rows returns a DataFrame. Here are a few examples. \n",
"\n",
"Single value:"
]
},
{
"cell_type": "code",
"execution_count": 124,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"5"
]
},
"execution_count": 124,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"drug_data.loc['p1', 'Valproate'] # same as drug_data.at['p1', 'Valproate']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Single row, returns a Series"
]
},
{
"cell_type": "code",
"execution_count": 125,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Placebo 37\n",
"Valproate 5\n",
"Name: p1, dtype: int64"
]
},
"execution_count": 125,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"drug_data.loc['p1']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Single column, returns a Series."
]
},
{
"cell_type": "code",
"execution_count": 126,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Patient\n",
"p1 5\n",
"p2 22\n",
"p3 41\n",
"p4 4\n",
"p5 32\n",
"Name: Valproate, dtype: int64"
]
},
"execution_count": 126,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"drug_data.loc[:, 'Valproate'].head() #same as drug_data['Valproate']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Multiple rows, single column."
]
},
{
"cell_type": "code",
"execution_count": 127,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Patient\n",
"p2 22\n",
"p3 41\n",
"p4 4\n",
"Name: Valproate, dtype: int64"
]
},
"execution_count": 127,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"drug_data.loc['p2':'p4', 'Valproate']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Multiple rows, multiple columns as column range."
]
},
{
"cell_type": "code",
"execution_count": 128,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Placebo | \n",
" Valproate | \n",
"
\n",
" \n",
" Patient | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" p2 | \n",
" 52 | \n",
" 22 | \n",
"
\n",
" \n",
" p3 | \n",
" 63 | \n",
" 41 | \n",
"
\n",
" \n",
" p4 | \n",
" 2 | \n",
" 4 | \n",
"
\n",
" \n",
" p5 | \n",
" 25 | \n",
" 32 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Placebo Valproate\n",
"Patient \n",
"p2 52 22\n",
"p3 63 41\n",
"p4 2 4\n",
"p5 25 32"
]
},
"execution_count": 128,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"drug_data.loc['p2':'p5', 'Placebo':'Valproate']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Or, columns in arbitrary order:"
]
},
{
"cell_type": "code",
"execution_count": 129,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Valproate | \n",
" Placebo | \n",
"
\n",
" \n",
" Patient | \n",
" | \n",
" | \n",
"
\n",
" \n",
" \n",
" \n",
" p2 | \n",
" 22 | \n",
" 52 | \n",
"
\n",
" \n",
" p3 | \n",
" 41 | \n",
" 63 | \n",
"
\n",
" \n",
" p4 | \n",
" 4 | \n",
" 2 | \n",
"
\n",
" \n",
" p5 | \n",
" 32 | \n",
" 25 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Valproate Placebo\n",
"Patient \n",
"p2 22 52\n",
"p3 41 63\n",
"p4 4 2\n",
"p5 32 25"
]
},
"execution_count": 129,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"drug_data.loc['p2':'p5', ['Valproate', 'Placebo']] # same as drug_data[['Valproate', 'Placebo']]['p2':'p5']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"All of the above selection methods can also be used to assign values."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Summary statistics\n",
"Summary statistics is where you usually begin. There are many functions available\n"
]
},
{
"cell_type": "code",
"execution_count": 130,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Placebo 29.2\n",
"Valproate 16.7\n",
"dtype: float64"
]
},
"execution_count": 130,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"drug_data['Placebo'].mean()\n",
"drug_data.mean()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This section only skimmed te possibilities of the pandas DataFrame class. Many dedicated books and tutorials exist, and a complete reference of all properties and functions can be found [here](https://pandas.pydata.org/docs/reference/frame.html)."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "education",
"language": "python",
"name": "education"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.16"
}
},
"nbformat": 4,
"nbformat_minor": 2
}