{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Next steps: pandas\n",
    "Classical programming revolves around building applications. This is in contrast with data science, where building analysis pipelines of datasets is the main focus.  \n",
    "In classical programming you mainly write `.py` source files. In data science the notebook is the predominant form.  \n",
    "This chapter deals with an essential library if you want to embark in data science: `pandas`, the core library in working with tabular data. \n",
    "\n",
    "In the chapter {doc}`Reading and writing files <05_reading_writing_files>` you have seen this code snippet for reading data from a csv (for 'comma-separated values') file:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 112,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['1', 'm', '180']\n",
      "['2', 'm', '188']\n",
      "['3', 'f', '178']\n",
      "['4', 'f', '182']\n",
      "['5', 'f', '172']\n",
      "['6', 'm', '189']\n"
     ]
    }
   ],
   "source": [
    "file = open(\"data/lengths.csv\", \"r\")\n",
    "for line in file:\n",
    "    print(line.strip().split(','))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In data science, this kind of tabular text data is ubiquitous. We have all seen data in this form, at least if you ever worked in Excel.  \n",
    "Without thinking about it, we adhere to some data organization rules that everybody uses. These rules are  \n",
    "\n",
    "1. _Examples_ (also called cases, or subjects) are organized in rows\n",
    "2. _Variables_ (or measurements) ar organized in columns\n",
    "3. Each column only has a _single  data type_\n",
    "4. All data in a row belong to exactly one example\n",
    "5. Each example has exactly the same variables (or a 'missing data' encoding when it is absent)\n",
    "\n",
    "So, in the small dataset above, the first line represents a subject with ID `1` who has sex `m` (male) and a height of `180` centimetres.  \n",
    "Data files usually also have a header line containing the column names, as in this example:  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 113,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['Patient', 'Placebo', 'Valproate']\n",
      "['p1', '37', '5']\n",
      "['p2', '52', '22']\n",
      "['p3', '63', '41']\n",
      "['p4', '2', '4']\n",
      "['p5', '25', '32']\n",
      "['p6', '29', '20']\n",
      "['p7', '15', '10']\n",
      "['p8', '52', '25']\n",
      "['p9', '19', '17']\n",
      "['p10', '12', '14']\n",
      "['p11', '7', '8']\n",
      "['p12', '9', '8']\n",
      "['p13', '65', '30']\n",
      "['p14', '52', '22']\n",
      "['p15', '6', '11']\n",
      "['p16', '17', '1']\n",
      "['p17', '54', '31']\n",
      "['p18', '27', '15']\n",
      "['p19', '36', '13']\n",
      "['p20', '5', '5']\n"
     ]
    }
   ],
   "source": [
    "file = open(\"data/placebo_drug_test.csv\", \"r\")\n",
    "for line in file:\n",
    "    print(line.strip().split(','))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The idea here is the same: Each patient is represented by a row, with an identifier in column 1, a measurement of some response measured with a Placebo and with Valproate, a drug primarily used to treat epilepsy and bipolar disorder and prevent migraine headaches.  \n",
    "\n",
    "What if you wanted to analyse and visualize such a dataset: calculate means, standard deviations, correlations. Build models and apply these to new data.  \n",
    "All this is quite possible in basic Python but you would be crazy to do so. The next few paragraphs show you why.  \n",
    "\n",
    "Let's start with the same drug treatment data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 114,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Placebo</th>\n",
       "      <th>Valproate</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Patient</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>p1</th>\n",
       "      <td>37</td>\n",
       "      <td>5</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>p2</th>\n",
       "      <td>52</td>\n",
       "      <td>22</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>p3</th>\n",
       "      <td>63</td>\n",
       "      <td>41</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>p4</th>\n",
       "      <td>2</td>\n",
       "      <td>4</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>p5</th>\n",
       "      <td>25</td>\n",
       "      <td>32</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>p6</th>\n",
       "      <td>29</td>\n",
       "      <td>20</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>p7</th>\n",
       "      <td>15</td>\n",
       "      <td>10</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>p8</th>\n",
       "      <td>52</td>\n",
       "      <td>25</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>p9</th>\n",
       "      <td>19</td>\n",
       "      <td>17</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>p10</th>\n",
       "      <td>12</td>\n",
       "      <td>14</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "         Placebo  Valproate\n",
       "Patient                    \n",
       "p1            37          5\n",
       "p2            52         22\n",
       "p3            63         41\n",
       "p4             2          4\n",
       "p5            25         32\n",
       "p6            29         20\n",
       "p7            15         10\n",
       "p8            52         25\n",
       "p9            19         17\n",
       "p10           12         14"
      ]
     },
     "execution_count": 114,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import pandas as pd\n",
    "drug_data = pd.read_csv('data/placebo_drug_test.csv', index_col = 0)\n",
    "drug_data.head(10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The first line `import pandas as pd` imports the package with name `pd`.  \n",
    "The second line reads in the data file using `pd.read_csv()`. Note that the first column is marked as `index`, which represent the names of the rows. If omitted, pandas will create an index starting at 0.  \n",
    "In the last line, the first ten rows of the data are displayed. The `head()` method defaults to 6 if you omit the parameter.    \n",
    "Data frames can also be created as literals in several ways; here is one of them, using a dict with column names as keys and column as lists:  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 134,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>A</th>\n",
       "      <th>B</th>\n",
       "      <th>C</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>one</th>\n",
       "      <td>30</td>\n",
       "      <td>foo</td>\n",
       "      <td>6.45</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>two</th>\n",
       "      <td>7</td>\n",
       "      <td>bar</td>\n",
       "      <td>9.21</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>three</th>\n",
       "      <td>10</td>\n",
       "      <td>baz</td>\n",
       "      <td>2.33</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "        A    B     C\n",
       "one    30  foo  6.45\n",
       "two     7  bar  9.21\n",
       "three  10  baz  2.33"
      ]
     },
     "execution_count": 134,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pd.DataFrame( \n",
    "      {'A': [30, 7, 10],\n",
    "       'B': ['foo', 'bar', 'baz'],\n",
    "       'C': [6.45, 9.21, 2.33]},\n",
    "      index=['one', 'two', 'three'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If you provide lists of unequal length you will get errors looking like this:\n",
    "\n",
    "```traceback\n",
    "ValueError                                Traceback (most recent call last)\n",
    "Cell In[133], line 1\n",
    "----> 1 pd.DataFrame( \n",
    "      2       {'A': [30, 7, 10],\n",
    "      3        'B': ['foo', 'bar', 'baz'],\n",
    "      4        'C': [6.45, 9.21, 2.33, 0.21]},\n",
    "      5       index=['one', 'two', 'three'])\n",
    "\n",
    "\n",
    "<<ROWS OMITTED>>\n",
    "\n",
    "ValueError: Length of values (4) does not match length of index (3)\n",
    "```\n",
    "\n",
    "Next, let's look at some functions used to get information about the pandas data frame.  \n",
    "\n",
    "The first is `describe()` which gives some summary statistics of the columns."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 116,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Placebo</th>\n",
       "      <th>Valproate</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>count</th>\n",
       "      <td>20.000000</td>\n",
       "      <td>20.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>mean</th>\n",
       "      <td>29.200000</td>\n",
       "      <td>16.700000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>std</th>\n",
       "      <td>20.829129</td>\n",
       "      <td>10.945319</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>min</th>\n",
       "      <td>2.000000</td>\n",
       "      <td>1.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>25%</th>\n",
       "      <td>11.250000</td>\n",
       "      <td>8.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>50%</th>\n",
       "      <td>26.000000</td>\n",
       "      <td>14.500000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>75%</th>\n",
       "      <td>52.000000</td>\n",
       "      <td>22.750000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>max</th>\n",
       "      <td>65.000000</td>\n",
       "      <td>41.000000</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "         Placebo  Valproate\n",
       "count  20.000000  20.000000\n",
       "mean   29.200000  16.700000\n",
       "std    20.829129  10.945319\n",
       "min     2.000000   1.000000\n",
       "25%    11.250000   8.000000\n",
       "50%    26.000000  14.500000\n",
       "75%    52.000000  22.750000\n",
       "max    65.000000  41.000000"
      ]
     },
     "execution_count": 116,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "drug_data.describe()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If you want to know which types, access the `dtypes` property (note it is not a method call!)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 117,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Placebo      int64\n",
       "Valproate    int64\n",
       "dtype: object"
      ]
     },
     "execution_count": 117,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "drug_data.dtypes"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Selecting (and setting) data\n",
    "\n",
    "DataFrame provides several ways to access (get or set) positions, columns, rows or  partial DataFrames. These are \n",
    "\n",
    "- `[]`: slicing the data frame itself\n",
    "- `pandas.DataFrame.at`: Access a single value for a row/column label pair.\n",
    "- `pandas.DataFrame.iat`: Access a single value for a row/column pair by integer position.\n",
    "- `pandas.DataFrame.loc`: Access a group of rows and columns by label(s) or a boolean array.\n",
    "- `pandas.DataFrame.iloc`: Purely integer-location based indexing for selection by position.\n",
    "\n",
    "As you can read, the last four serve the same purpose but with differing ways to specify indices (coordinates).\n",
    "They are not used as method calls, but similar to list indexing.\n",
    "Selections can return a DataFrame or a Series (which is list-ish in nature), depending on the result.  \n",
    "\n",
    "Below are a few examples with basic slicing.  \n",
    "\n",
    "This first one gives Placebo as a pandas.Series object (and chained to `Series.head()` call). This already demonstrates how chaining is an important aspect of working with data frames."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 118,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Patient\n",
       "p1    37\n",
       "p2    52\n",
       "p3    63\n",
       "p4     2\n",
       "p5    25\n",
       "Name: Placebo, dtype: int64"
      ]
     },
     "execution_count": 118,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "drug_data['Placebo'].head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Since this is a Series, you can proceed with label-based selections as well."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 119,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Patient\n",
       "p1    37\n",
       "p2    52\n",
       "p3    63\n",
       "Name: Placebo, dtype: int64"
      ]
     },
     "execution_count": 119,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "drug_data['Placebo']['p1':'p3']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "By providing a list of column labels, this selects multiple columns (in arbitrary order). The result is a pandas.DataFrame. The result (`tmp`) is used to demonstrate how a column can be deleted."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 120,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "         Valproate  Placebo\n",
      "Patient                    \n",
      "p1               5       37\n",
      "p2              22       52\n",
      "p3              41       63\n",
      "p4               4        2\n",
      "p5              32       25\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Valproate</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Patient</th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>p1</th>\n",
       "      <td>5</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>p2</th>\n",
       "      <td>22</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>p3</th>\n",
       "      <td>41</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>p4</th>\n",
       "      <td>4</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>p5</th>\n",
       "      <td>32</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "         Valproate\n",
       "Patient           \n",
       "p1               5\n",
       "p2              22\n",
       "p3              41\n",
       "p4               4\n",
       "p5              32"
      ]
     },
     "execution_count": 120,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tmp = drug_data[['Valproate','Placebo']]\n",
    "print(tmp.head())\n",
    "del(tmp['Placebo'])\n",
    "tmp.head()\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here is the `at[]` way to select or set single values. The last line uses `at[]` to set a value at a position."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 121,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "5\n",
      "25\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Placebo</th>\n",
       "      <th>Valproate</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Patient</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>p1</th>\n",
       "      <td>37</td>\n",
       "      <td>5</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>p2</th>\n",
       "      <td>52</td>\n",
       "      <td>22</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>p3</th>\n",
       "      <td>63</td>\n",
       "      <td>41</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>p4</th>\n",
       "      <td>2</td>\n",
       "      <td>4</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>p5</th>\n",
       "      <td>28</td>\n",
       "      <td>32</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "         Placebo  Valproate\n",
       "Patient                    \n",
       "p1            37          5\n",
       "p2            52         22\n",
       "p3            63         41\n",
       "p4             2          4\n",
       "p5            28         32"
      ]
     },
     "execution_count": 121,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "print(drug_data.at['p1', 'Valproate']) # Valproate measurement for Patient 1\n",
    "print(drug_data.at['p5', 'Placebo']) # Placebo measurement for Patient 5\n",
    "drug_data.at['p5', 'Placebo'] = 28\n",
    "drug_data.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note that the same can be achieved using selection chaining:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 122,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "37"
      ]
     },
     "execution_count": 122,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "drug_data['Placebo']['p1']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here are the same selections with `iat[]`. With `iat`, indexing is all zero-based, as always with Python. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 123,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "5\n",
      "28\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Placebo</th>\n",
       "      <th>Valproate</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Patient</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>p1</th>\n",
       "      <td>37</td>\n",
       "      <td>5</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>p2</th>\n",
       "      <td>52</td>\n",
       "      <td>22</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>p3</th>\n",
       "      <td>63</td>\n",
       "      <td>41</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>p4</th>\n",
       "      <td>2</td>\n",
       "      <td>4</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>p5</th>\n",
       "      <td>25</td>\n",
       "      <td>32</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "         Placebo  Valproate\n",
       "Patient                    \n",
       "p1            37          5\n",
       "p2            52         22\n",
       "p3            63         41\n",
       "p4             2          4\n",
       "p5            25         32"
      ]
     },
     "execution_count": 123,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "print(drug_data.iat[0, 1]) # Valproate measurement for Patient 1\n",
    "print(drug_data.iat[4, 0]) # Placebo measurement for Patient 5\n",
    "drug_data.iat[4, 0] = 25 #reset to original value\n",
    "drug_data.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The difference between `at[]`/`iat[]` and `loc[]`/`iloc[]` is that the first pair only accesses single values. Tha latter can access series and ranges. Here are a few examples.\n",
    "\n",
    "A single row name returns a `pandas.Series` object, while multiple rows returns a DataFrame. Here are a few examples.  \n",
    "\n",
    "Single value:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 124,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "5"
      ]
     },
     "execution_count": 124,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "drug_data.loc['p1', 'Valproate'] # same as drug_data.at['p1', 'Valproate']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Single row, returns a Series"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 125,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Placebo      37\n",
       "Valproate     5\n",
       "Name: p1, dtype: int64"
      ]
     },
     "execution_count": 125,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "drug_data.loc['p1']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Single column, returns a Series."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 126,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Patient\n",
       "p1     5\n",
       "p2    22\n",
       "p3    41\n",
       "p4     4\n",
       "p5    32\n",
       "Name: Valproate, dtype: int64"
      ]
     },
     "execution_count": 126,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "drug_data.loc[:, 'Valproate'].head() #same as drug_data['Valproate']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Multiple rows, single column."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 127,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Patient\n",
       "p2    22\n",
       "p3    41\n",
       "p4     4\n",
       "Name: Valproate, dtype: int64"
      ]
     },
     "execution_count": 127,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "drug_data.loc['p2':'p4', 'Valproate']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Multiple rows, multiple columns as column range."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 128,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Placebo</th>\n",
       "      <th>Valproate</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Patient</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>p2</th>\n",
       "      <td>52</td>\n",
       "      <td>22</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>p3</th>\n",
       "      <td>63</td>\n",
       "      <td>41</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>p4</th>\n",
       "      <td>2</td>\n",
       "      <td>4</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>p5</th>\n",
       "      <td>25</td>\n",
       "      <td>32</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "         Placebo  Valproate\n",
       "Patient                    \n",
       "p2            52         22\n",
       "p3            63         41\n",
       "p4             2          4\n",
       "p5            25         32"
      ]
     },
     "execution_count": 128,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "drug_data.loc['p2':'p5', 'Placebo':'Valproate']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Or, columns in arbitrary order:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 129,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Valproate</th>\n",
       "      <th>Placebo</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Patient</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>p2</th>\n",
       "      <td>22</td>\n",
       "      <td>52</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>p3</th>\n",
       "      <td>41</td>\n",
       "      <td>63</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>p4</th>\n",
       "      <td>4</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>p5</th>\n",
       "      <td>32</td>\n",
       "      <td>25</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "         Valproate  Placebo\n",
       "Patient                    \n",
       "p2              22       52\n",
       "p3              41       63\n",
       "p4               4        2\n",
       "p5              32       25"
      ]
     },
     "execution_count": 129,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "drug_data.loc['p2':'p5', ['Valproate', 'Placebo']] # same as drug_data[['Valproate', 'Placebo']]['p2':'p5']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "All of the above selection methods can also be used to assign values."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Summary statistics\n",
    "Summary statistics is where you usually begin. There are many functions available\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 130,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Placebo      29.2\n",
       "Valproate    16.7\n",
       "dtype: float64"
      ]
     },
     "execution_count": 130,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "drug_data['Placebo'].mean()\n",
    "drug_data.mean()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This section only skimmed te possibilities of the pandas DataFrame class. Many dedicated books and tutorials exist, and a complete reference of all properties and functions can be found [here](https://pandas.pydata.org/docs/reference/frame.html)."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "education",
   "language": "python",
   "name": "education"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.16"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}