6. Reading and writing files

In most data science scripts and analysis workflows, data enters via files; to be more precise, via text files.
Fortunately, reading from a file is really simple in Python. Unfortunately, you will need to know something about file paths on Linux, macOS and Windows, so that is where we’ll start.

6.1. File paths

6.1.1. Path separators

There is a big distinction between file paths on Windows and on Linux (and macOS): on Windows, file paths start with the drive (e.g. C:) and the directory (folder) separator is a backslash \. On Unix-like systems (Linux, macOS), all paths start at the root, /, and the separator is also the forward slash. If you want to work with paths in your programs, you should therefore avoid typing these separators as literal characters and ask the OS to provide them for you:

import os.path
print(os.path.sep)

folders = [os.path.sep, 'users', 'Michiel', 'projects', 'python']
print(os.path.join(*folders))
/
/users/Michiel/projects/python

The os.path.join() function uses the separator as defined in os.path.sep.
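
To see what the join would look like on the other kind of system without switching machines, the sketch below calls the two standard-library modules that os.path maps to: posixpath (Linux, macOS) and ntpath (Windows). The folder names are the illustrative ones from the example above, and the C: drive is an assumption.

```python
import ntpath      # the implementation behind os.path on Windows
import posixpath   # the implementation behind os.path on Linux and macOS

folders = ['users', 'Michiel', 'projects', 'python']

# Unix-like systems: start at the root and join with forward slashes
unix_path = posixpath.join(posixpath.sep, *folders)
print(unix_path)      # /users/Michiel/projects/python

# Windows: start at a drive root and join with backslashes
windows_path = ntpath.join('C:' + ntpath.sep, *folders)
print(windows_path)   # C:\users\Michiel\projects\python
```

In normal code you would simply use os.path, which picks the right one of these two modules for the system the program runs on.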

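Note that when you type literal backslash characters, as in Windows paths, they need to be escaped, because a backslash gives special meaning to the character that comes after it. In the example below, the final backslash escapes the closing quote, so the string literal never ends and Python reports an error: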

windows_path = 'C:\Users\Michiel\Projects\Python\'
  Cell In[7], line 1
    windows_path = 'C:\Users\Michiel\Projects\Python\'
                                                      ^
SyntaxError: EOL while scanning string literal

The correct way to deal with that is to escape each backslash with a second backslash:

windows_path = 'C:\\Users\\Michiel\\Projects\\Python\\'
windows_path
'C:\\Users\\Michiel\\Projects\\Python\\'
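
Raw string literals are an alternative that is not used in the text above: prefixing the literal with r tells Python to keep every backslash as it is. A minimal sketch; note that even a raw string cannot end in a single backslash, so the trailing separator is left off here.

```python
# A raw string (prefix r) keeps every backslash literally
raw_path = r'C:\Users\Michiel\Projects\Python'

# The same path written with escaped backslashes
escaped_path = 'C:\\Users\\Michiel\\Projects\\Python'

print(raw_path == escaped_path)   # True
print(raw_path)                   # C:\Users\Michiel\Projects\Python
```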

Note that it is better not to type the separators yourself at all: use os.path.join() as shown above, always use the forward slash (which works most of the time), or build the path like this:

os.path.sep.join(['C:', 'Users', 'Michiel', 'Projects', 'Python'])
'C:/Users/Michiel/Projects/Python'

6.1.2. Absolute and relative paths

File system locations can be defined in two ways: absolute and relative.
The above examples were all absolute because the paths were all defined starting at the root of the file system, which is a drive designation (Windows) or a forward slash (Unix-like systems).

Relative paths, on the other hand, specify locations relative to the current working directory. By default this is the directory from which the running piece of code (script, notebook) was started, or it is the location you have specified using

os.chdir('/users/michiel/projects/project1')

and which you can verify with

os.getcwd()
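
The two calls can be tried safely with a throwaway folder. The sketch below uses a temporary directory from the tempfile module as a stand-in for a real project folder (an assumption made so that the example runs anywhere) and restores the original working directory afterwards.

```python
import os
import tempfile

start_dir = os.getcwd()              # remember where we started

# A throwaway directory stands in for a real project folder
with tempfile.TemporaryDirectory() as project_dir:
    os.chdir(project_dir)            # change the working directory
    current_dir = os.getcwd()        # verify the change
    # realpath() resolves symbolic links, so both names can be compared
    changed = os.path.realpath(current_dir) == os.path.realpath(project_dir)
    os.chdir(start_dir)              # go back before the folder is deleted

print(changed)                       # True
print(os.getcwd() == start_dir)      # True
```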

Suppose I have a directory structure that looks like this.

users
    /michiel
        /projects
            /project1
                /script1.py
                /data
                    /users.txt
            /project2
                /birds.csv

and my current working directory is /users/michiel/projects/project1 (because I am working on script1.py).

The different ways to specify the location of the data file users.txt are

  • absolute: /users/michiel/projects/project1/data/users.txt

  • relative to current dir: data/users.txt

  • relative to current dir: ./data/users.txt

The ./ makes it explicit that the start is the current directory.

The different ways to specify the location of the data file birds.csv are

  • absolute: /users/michiel/projects/project2/birds.csv

  • relative to current dir: ../project2/birds.csv

The double dot .. says “go up one directory and work from there”.
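
How the relative forms map onto absolute locations can be checked with a few lines of code. The sketch below uses posixpath so that it gives the same result on every operating system; resolve() is a hypothetical helper written for this illustration, not a library function.

```python
import posixpath  # Unix-style path handling, available on every platform

working_dir = '/users/michiel/projects/project1'

def resolve(relative_path, start=working_dir):
    """Turn a relative path into an absolute one, starting from `start`."""
    # join() glues the two parts together, normpath() removes ./ and resolves ../
    return posixpath.normpath(posixpath.join(start, relative_path))

print(resolve('data/users.txt'))         # /users/michiel/projects/project1/data/users.txt
print(resolve('./data/users.txt'))       # /users/michiel/projects/project1/data/users.txt
print(resolve('../project2/birds.csv'))  # /users/michiel/projects/project2/birds.csv
```

For real files, os.path.abspath() performs the same conversion relative to the actual current working directory.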

Linux users may know the tilde ~, which specifies the current user’s home folder: ~/projects/project2/birds.csv. This does not work as-is in Python, because the tilde is not expanded automatically. Instead, you can use os.path.expanduser('~') to plug the home folder into your file location:

os.path.join(os.path.expanduser('~'), 'projects', 'project2', 'birds.csv')
'/Users/michielnoback/projects/project2/birds.csv'
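
As a small variation, expanduser() also accepts a complete path that starts with a tilde, which saves the separate join. The sketch below compares both forms; the outcome depends on the home folder of whoever runs it, so no literal path is shown.

```python
import os.path

# expanduser() replaces a leading ~ with the home folder of the current user
expanded = os.path.expanduser('~/projects/project2/birds.csv')

# The same location, built piece by piece as in the example above
joined = os.path.join(os.path.expanduser('~'), 'projects', 'project2', 'birds.csv')

# normpath() evens out separator differences (relevant on Windows)
same_location = os.path.normpath(expanded) == os.path.normpath(joined)
print(same_location)   # True
```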

A last note on file and folder names. Although a wide range of characters is allowed in file paths, when working in data science it is highly advisable to refrain from using “funny” characters (spaces, quotes, brackets, emoji) in file paths. They cause errors and are often hard to type in a string variable. So instead of this:

/homes/michiel/projects/het 'tisser (best) koud! 😀

use this:

/homes/michiel/projects/het_is_er_koud

6.1.3. Reading from file

If you have structured data in the form of csv, tsv, xml or Excel, the Python ecosystem provides a wealth of dedicated data reading functions. If you are going to work a lot with Excel-style data (data organized with examples in rows and variables in columns), it is recommended to have a look at the Pandas library (we’ll have a peek at that at the end of this chapter). In this chapter, however, we are going to check out the basics of file reading and writing, I/O in short.

Suppose we have some data file named lengths.csv which contains the body lengths (in centimeters) of a sample of male and female subjects:

1,m,180
2,m,188
3,f,178
4,f,182
5,f,172
6,m,189

This file, in csv (Comma-Separated Values) format, can be found at ./data/lengths.csv.

To read this data in the simplest way possible, we can read its contents in one operation:

print(os.getcwd())
file = open("data/lengths.csv", "r")
data = file.read()
print(data)
print(type(data))
/Users/michielnoback/git_projects/python_intro
1,m,180
2,m,188
3,f,178
4,f,182
5,f,172
6,m,189

<class 'str'>

The statement

file = open("data/lengths.csv", "r")

opens the file in read mode (the second argument is the mode argument, which defaults to 'r', so it could have been omitted). The function returns a stream, or handle, on the file, not the actual contents yet.

Reading the contents happens with the file.read() method call.
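
That the stream is something different from the contents shows when read() is called twice: the first call consumes everything up to the end of the file, so the second call has nothing left to return. The sketch below demonstrates this with a small temporary stand-in for lengths.csv (the temporary folder is an assumption to keep the example self-contained; the with syntax and writing are covered later in this chapter).

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    # Create a small stand-in for data/lengths.csv
    path = os.path.join(tmp_dir, 'lengths.csv')
    with open(path, 'w') as out:
        out.write('1,m,180\n2,m,188\n3,f,178\n')

    file = open(path)          # mode defaults to 'r'
    first = file.read()        # reads everything up to the end of the file
    second = file.read()       # the stream is now at the end: nothing left
    file.close()

print(repr(first))    # '1,m,180\n2,m,188\n3,f,178\n'
print(repr(second))   # ''
```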

6.2. Iterating contents

Usually you want to iterate over file contents line by line, without the need to store it all in memory. This is done by applying a for loop to the file stream:

file = open("data/lengths.csv", "r")
for line in file:
    print(line.strip().split(',')) # of course you want to split the data to separate values
['1', 'm', '180']
['2', 'm', '188']
['3', 'f', '178']
['4', 'f', '182']
['5', 'f', '172']
['6', 'm', '189']

The file stream object returned by the open() function supports iteration. Note that line endings are data in the file and are included when reading the lines. To remove any leading and trailing whitespace, including the line ending, we use the strip() method.
To only remove characters at the end, use rstrip(), with an optional argument specifying which characters to strip off.
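
Splitting is usually only the first step, because everything read from a text file is a string. The sketch below converts the numeric columns to integers and shows the difference between strip() and rstrip(). It uses io.StringIO as a stand-in for the opened file (an assumption to keep the example self-contained; it supports the same iteration), and the variable names are illustrative.

```python
import io

# io.StringIO behaves like an opened text file; it stands in for lengths.csv
file = io.StringIO('1,m,180\n2,m,188\n3,f,178\n')

records = []
for line in file:
    subject_id, sex, length = line.rstrip('\n').split(',')
    # everything read from a text file is a string; convert the numbers
    records.append((int(subject_id), sex, int(length)))

print(records)   # [(1, 'm', 180), (2, 'm', 188), (3, 'f', 178)]

padded = '  1,m,180 \n'
print(repr(padded.strip()))        # '1,m,180'     whitespace removed on both sides
print(repr(padded.rstrip('\n')))   # '  1,m,180 '  only the newline at the end removed
```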

6.3. Closing files

It is good practice to close streams to files that you open. In read mode this is not essential, but in write mode it is, because written data may not reach the file until the stream is closed. You do this using the close() method. The above fragment is better like this:

file = open("data/lengths.csv", "r")
for line in file:
    print(line.strip())
file.close()            # explicitly closing resources is always a good idea
1,m,180
2,m,188
3,f,178
4,f,182
5,f,172
6,m,189
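
Whether a stream is still open can be checked with its closed attribute. The sketch below, which uses a temporary file as an assumption to stay self-contained, shows the attribute before and after close(), and shows that a closed stream refuses further use.

```python
import os
import tempfile

# Create an empty temporary file so there is something to open
descriptor, path = tempfile.mkstemp(suffix='.txt')
os.close(descriptor)

file = open(path, 'w')
file.write('1,m,180\n')
was_open = not file.closed      # True: the stream is still open
file.close()
is_closed = file.closed         # True: close() has released the file

write_failed = False
try:
    file.write('2,m,188\n')     # writing to a closed stream is an error
except ValueError:
    write_failed = True

os.remove(path)                 # clean up the temporary file
print(was_open, is_closed, write_failed)   # True True True
```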

6.4. The best way: using with

Since programmers forget to close their files all the time, the “with open” syntax was introduced. It closes the file automatically when the block ends, even if an error occurs inside it. If you simply always use this form you will never go wrong.

with open("data/lengths.csv", "r") as file:
    for line in file:
        print(line.strip())
# no need to close since that is assured by using with
1,m,180
2,m,188
3,f,178
4,f,182
5,f,172
6,m,189
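
The sketch below illustrates what with does when something goes wrong halfway through a file. The data file with a faulty value is made up for the occasion and written to a temporary folder (both are assumptions for this illustration).

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    # A made-up data file in which the second length is not a number
    path = os.path.join(tmp_dir, 'lengths.csv')
    with open(path, 'w') as out:
        out.write('1,m,180\n2,m,not_a_number\n')

    error_seen = False
    try:
        with open(path) as file:
            for line in file:
                length = int(line.strip().split(',')[2])  # fails on line 2
    except ValueError:
        error_seen = True

    closed_after_error = file.closed   # the with block closed the file anyway

print(error_seen, closed_after_error)  # True True
```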

6.5. Writing to file

To open a stream for writing you need to set the mode to one of these:

  • “w” (open for writing, truncating the file first)

  • “a” (open for writing, appending to the end of the file if it exists).

When writing to file, using the with syntax is the best way.

my_data = ["Better safe\n", "than sorry\n"]  # note the newlines already present!

with open("data/saying.txt", "w") as sayings:
    for line in my_data:
        sayings.write(line)

# or, in one operation
# with open("data/saying.txt", "w") as sayings:
#     sayings.writelines(my_data)

Both operations will result in a file with these contents:

Better safe
than sorry

And no matter how often the code is run, the same file will be created. If the mode "a" had been used, the saying would be added to the file every time the code was run.
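
The difference between the two modes can be verified by running the same write operation twice in each mode and counting the resulting lines. In the sketch below, write_twice() is a hypothetical helper, and the file names and the temporary folder are assumptions for this illustration.

```python
import os
import tempfile

lines = ["first line\n", "second line\n"]

def write_twice(path, mode):
    """Run the same write operation twice and return the resulting lines."""
    for _ in range(2):
        with open(path, mode) as out:
            out.writelines(lines)
    with open(path) as result:
        return result.readlines()

with tempfile.TemporaryDirectory() as tmp_dir:
    truncated = write_twice(os.path.join(tmp_dir, 'w.txt'), 'w')
    appended = write_twice(os.path.join(tmp_dir, 'a.txt'), 'a')

print(len(truncated))   # 2: "w" starts with an empty file on every run
print(len(appended))    # 4: "a" adds to what is already there
```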