12. Data classes

12.1. Data models the easy way

Suppose we have a data model involving users and their roles.

User has

  • email

  • roles

  • street

  • number

  • city

Naively, many (non-OO) programmers will start out with something like this:

### This list of tuples holds all information of users: 
# email at [0]
# roles at [1]
# street at [2]
# number at [3]
# city at [4]
users = [('henk@example.com', ['scrum master', 'team leader'], 'Square', 31, 'Krowing'), 
         ('ian@example.com', ['programmer', 'designer'], 'Maple Street', 98, 'Peelsing')]

Yes, I have seen this type of structure - many times!

OK, maybe some student is a little bit more aware of data structures and you get this:

users = [{'email': 'henk@example.com', 
          'roles': ['scrum master', 'team leader'],
          'street': 'Square',
          'number': 31,
          'city': 'Krowing'}, 
         {'email': 'ian@example.com', 
          'roles': ['programmer', 'designer'],
          'street': 'Maple Street',
          'number': 98,
          'city': 'Peelsing'}]

Still, you have to inspect the fields carefully to know how to access them, and you must stay alert for typos in the keys when accessing the data.
Yet another programmer has read about the OO API of Python and comes up with this:

class User1(object):
    def __init__(self, email, roles, street, number, city):
        self.email = email
        self.roles = roles
        self.street = street
        self.number = number
        self.city = city
    
    def __str__(self):
        return f'{self.email} lives at {self.street} {self.number} in {self.city} and has the following roles: {self.roles}'
    
    def __repr__(self):
        return f'User({self.email}, {self.roles}, {self.street}, {self.number}, {self.city})'
    
    def __eq__(self, other):
        return self.email == other.email and self.street == other.street and self.number == other.number and self.city == other.city and self.roles == other.roles
    
    # more hooks implemented

So we have a nice class that models user objects. Here we create a few and put them in a list.

users = [User1('henk@example.com', ['scrum master', 'team leader'], 'Square', 31, 'Krowing'), 
         User1('ian@example.com', ['programmer', 'designer'], 'Maple Street', 98, 'Peelsing')]
users
[User(henk@example.com, ['scrum master', 'team leader'], Square, 31, Krowing),
 User(ian@example.com, ['programmer', 'designer'], Maple Street, 98, Peelsing)]

The alert reader, knowing a bit about the Single Responsibility Principle (SRP), might suggest splitting this into two entities: Address and User. After all, addresses can be used outside the scope of user instances, and by separating them, the code becomes simpler and easier to maintain and extend.

class Address2(object):
    def __init__(self, street, number, city):
        self.street = street
        self.number = number
        self.city = city
    
    def __str__(self):
        return f'{self.street} {self.number}, {self.city}'
    
    def __repr__(self):
        return f'Address({self.street}, {self.number}, {self.city})'
    
    def __eq__(self, other):
        return self.street == other.street and self.number == other.number and self.city == other.city

class User2(object):
    def __init__(self, email, roles, address):
        self.email = email
        self.roles = roles
        self.address = address
    
    def __str__(self):
        return f'{self.email} lives at {self.address} and has the following roles: {self.roles}'
    
    def __repr__(self):
        return f'User({self.email}, {self.roles}, {self.address})'
    
    def __eq__(self, other):
        return self.email == other.email and self.address == other.address and self.roles == other.roles

users = [User2('henk@example.com', ['scrum master', 'team leader'], Address2('Square', 31, 'Krowing')), 
         User2('ian@example.com', ['programmer', 'designer'], Address2('Maple Street', 98, 'Peelsing'))]
for user in users:
    print(user)
henk@example.com lives at Square 31, Krowing and has the following roles: ['scrum master', 'team leader']
ian@example.com lives at Maple Street 98, Peelsing and has the following roles: ['programmer', 'designer']

But wow, that is a lot of boilerplate code, almost as much as pre-records Java!

Boilerplate

Boilerplate code consists of sections of code that are repeated in multiple places with little to no variation.
You need to write a lot of it to accomplish only minor functionality.

Wouldn’t it be wonderful if there was some way to get rid of this boilerplate, and simply focus on what is being modelled: a User and an Address, both with some properties?

In Java nowadays we simply write this:

record Address(String street, int number, String city){
    // really, there's nothing here anymore!
    // we get a constructor, equals(), hashCode, 
    // toString(), all free
    // of charge from the compiler
}

As is typical for Python, there are not one, not two, but at least three different ways to get to more concise data classes.
Let’s start with the most well-known (but least versatile), namedtuple.

12.2. Option one: collections.namedtuple

from collections import namedtuple

Address3 = namedtuple('Address3', ['street', 'number', 'city'])
User3 = namedtuple('User3', ['email', 'roles', 'address'])

user = User3('henk@example.com', ['scrum master', 'team leader'], 
              Address3('Square', 31, 'Krowing'))
print(user)
User3(email='henk@example.com', roles=['scrum master', 'team leader'], address=Address3(street='Square', number=31, city='Krowing'))

This gives you the most basic data class. The class that namedtuple generates is a subclass of tuple, and the one-line definition leaves no room for extra methods such as addRole(role); for that you would have to subclass the generated class.
However, you get everything tuple has, like slicing and the in operator:

nt_user = User3('henk@example.com', ['scrum master', 'team leader'], Address3('Square', 31, 'Krowing'))
nt_user[1:3]
(['scrum master', 'team leader'],
 Address3(street='Square', number=31, city='Krowing'))
'henk@example.com' in nt_user
True

Moreover, you get to access properties using the dot operator:

print(nt_user.address)
print(nt_user.address.city)
Address3(street='Square', number=31, city='Krowing')
Krowing
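If extra behaviour is needed after all, one possible workaround is to subclass the class that namedtuple generates. The sketch below uses hypothetical names (BaseUser, UserWithRoles, with_role) that are not part of the examples above. Because tuples are immutable, the method returns a new instance instead of changing the existing one.

```python
from collections import namedtuple

# Hypothetical names, for illustration only.
BaseUser = namedtuple('BaseUser', ['email', 'roles'])


class UserWithRoles(BaseUser):
    __slots__ = ()  # avoid a per-instance __dict__, as with plain tuples

    def with_role(self, role):
        """Return a copy of this user with one extra role."""
        return self._replace(roles=self.roles + [role])


henk = UserWithRoles('henk@example.com', ['scrum master'])
promoted = henk.with_role('team leader')
print(promoted.roles)  # ['scrum master', 'team leader']
print(henk.roles)      # ['scrum master']
```

The original object is left untouched, which fits the tuple semantics of the class.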

If you omit one of the parameter values, you get an error:

address = Address3('Square', 31)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[21], line 1
----> 1 address = Address3('Square', 31)

TypeError: Address3.__new__() missing 1 required positional argument: 'city'

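Default values are specified with the defaults argument, an iterable whose values are applied to the rightmost fields. Beware of passing a dictionary: it is iterated by key, so in the example below the default for city becomes the string 'city' instead of 'Groningen':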

#Address3 = namedtuple('Address3', ['street', 'number', 'city'], defaults=[None, None, 'Groningen'])
Address3 = namedtuple('Address3', ['street', 'number', 'city'], defaults={'city': 'Groningen'})
address = Address3('Square', 31)
print(address)
address2 = Address3('Square', 31, 'Groningen')
print(address2)
#address3 = Address3('Square') #fails with dict, passes with list!
#print(address3)
Address3(street='Square', number=31, city='city')
Address3(street='Square', number=31, city='Groningen')
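Since the values in defaults are matched to the rightmost fields, a one-element sequence is enough to give only the last field a default. A minimal sketch, using the hypothetical class name AddressWithDefault:

```python
from collections import namedtuple

# Hypothetical class name. One default value: it applies to the LAST field only.
AddressWithDefault = namedtuple(
    'AddressWithDefault', ['street', 'number', 'city'], defaults=['Groningen'])

# _field_defaults maps field names to their default values.
field_defaults = dict(AddressWithDefault._field_defaults)
print(field_defaults)  # {'city': 'Groningen'}

address = AddressWithDefault('Square', 31)
print(address)
# AddressWithDefault(street='Square', number=31, city='Groningen')
```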

12.3. Option two: dataclasses.dataclass

When I first encountered the @dataclass decorator I was really enthusiastic. Here is Address again, now as @dataclass:

from dataclasses import dataclass

@dataclass
class Address4:
    street: str
    number: int
    city: str

Great, now we have a dataclass that represents an address. Let’s inspect its behavior:

address1 = Address4('Square', 31, 'Krowing')
print(address1)

address2 = Address4('Square', 31, 'Krowing')
print(address1 == address2)
Address4(street='Square', number=31, city='Krowing')
True

The @dataclass decorator takes quite a few arguments:

@dataclass(*, init=True, repr=True, eq=True, 
           order=False, unsafe_hash=False, frozen=False)

The first three default to True, the others to False.
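To get a feeling for what the non-default arguments do, here is a small sketch that switches on order and frozen. The class name SortableAddress and its field order are assumptions made for this illustration.

```python
from dataclasses import dataclass, FrozenInstanceError


@dataclass(order=True, frozen=True)
class SortableAddress:  # hypothetical class, for illustration only
    city: str
    street: str
    number: int


addresses = [SortableAddress('Peelsing', 'Maple Street', 98),
             SortableAddress('Krowing', 'Square', 31)]

# order=True generates comparison methods that compare the fields
# in definition order, as if they were a tuple.
first = sorted(addresses)[0]
print(first.city)  # Krowing

# frozen=True makes assignment to a field raise an exception.
try:
    first.number = 32
except FrozenInstanceError:
    print('frozen: assignment rejected')

# With eq=True and frozen=True, instances are hashable as well.
unique = {first, SortableAddress('Krowing', 'Square', 31)}
print(len(unique))  # 1
```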

Defining defaults is simple - as long as they are not container (mutable) types:

@dataclass
class Address5:
    street: str
    number: int
    city: str = 'Groningen'
    
address1 = Address5('Square', 31)
address2 = Address5('Square', 31, 'Groningen')
address1 == address2
True

Note that arguments with defaults should come after the ones without, just like regular method definitions.
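The ordering rule is enforced at the moment the class is defined. A minimal sketch with a hypothetical class BrokenAddress, in which a field without a default follows one with a default:

```python
from dataclasses import dataclass

try:
    @dataclass
    class BrokenAddress:  # hypothetical name, for illustration only
        street: str
        city: str = 'Groningen'
        number: int  # no default, but follows a field that has one
except TypeError as error:
    # The decorator refuses to generate __init__ for this field order.
    message = str(error)
    print(message)
```

The exact wording of the message may differ between Python versions, so it is only printed here.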
But how about the User class? Is it as simple?

@dataclass
class User4:
    email: str
    roles: list
    address: Address4

user = User4('henk@example.com', ['scrum master', 'team leader'], Address4('Square', 31, 'Krowing'))
print(user)
User4(email='henk@example.com', roles=['scrum master', 'team leader'], address=Address4(street='Square', number=31, city='Krowing'))

It seems really simple, but when you want to define default values for mutable types it gets harder:

@dataclass
class User5:
    email: str
    address: Address4
    roles: list = []

user1 = User5('henk@example.com', Address4('Square', 31, 'Krowing'), ['scrum master', 'team leader'])
print(user1)
ValueError                                Traceback (most recent call last)
Cell In[14], line 2
      1 @dataclass
----> 2 class User5:
##many lines of Traceback omitted
ValueError: mutable default <class 'list'> for field roles is not allowed: use default_factory

Mutable collection types such as list are not allowed as default values (a sensible restriction, as I discovered recently 😅); we will see more on that later.
To specify collection defaults, you need to use a default_factory argument:

from dataclasses import dataclass, field

@dataclass
class User5:
    email: str
    address: Address5
    roles: list = field(default_factory=list)

user = User5('henk@example.com', Address5('Square', 31, 'Krowing'))
print(user)
User5(email='henk@example.com', address=Address5(street='Square', number=31, city='Krowing'), roles=[])
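The point of default_factory is that the factory is called once per instance, so no two objects share the same list. A small sketch to check this, using a hypothetical class Member:

```python
from dataclasses import dataclass, field


@dataclass
class Member:  # hypothetical class, for illustration only
    email: str
    roles: list = field(default_factory=list)  # list() is called for every new instance


henk = Member('henk@example.com')
ian = Member('ian@example.com')

henk.roles.append('scrum master')
print(henk.roles)               # ['scrum master']
print(ian.roles)                # []
print(henk.roles is ian.roles)  # False
```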

Of course, you can specify your own default factory:

def homeless():
    return Address5('UnderTheBridge', 0, 'Knowhere')

@dataclass
class User6:
    email: str
    address: Address5 = field(default_factory=homeless)
    roles: list = field(default_factory=list)

user = User6('john_doe@example.com', roles = ['bum'])
print(user)
User6(email='john_doe@example.com', address=Address5(street='UnderTheBridge', number=0, city='Knowhere'), roles=['bum'])

Note that the field() function accepts many more arguments, just RTFM.
But what is really funny is that the type annotations here are, like all type hints in Python, merely hints: nothing is checked at runtime, so this is accepted without complaint:

@dataclass
class Address6:
    street: str
    number: int
    city: str = 'Groningen'

address = Address6(3.1415, ('a', 'b'), True)
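If runtime checks are wanted, one option is to add them by hand in __post_init__, which the generated __init__ calls after assigning the fields. The sketch below is only one possible approach; the class name CheckedAddress and the wording of the error message are made up for this example.

```python
from dataclasses import dataclass


@dataclass
class CheckedAddress:  # hypothetical class, for illustration only
    street: str
    number: int
    city: str = 'Groningen'

    def __post_init__(self):
        # Hand-written check: compare each field value with the type we expect.
        expected = {'street': str, 'number': int, 'city': str}
        for name, expected_type in expected.items():
            value = getattr(self, name)
            if not isinstance(value, expected_type):
                raise TypeError(
                    f'{name} must be {expected_type.__name__}, '
                    f'got {type(value).__name__}')


valid = CheckedAddress('Square', 31)
print(valid)  # CheckedAddress(street='Square', number=31, city='Groningen')

try:
    CheckedAddress(3.1415, ('a', 'b'), True)
except TypeError as error:
    rejected = str(error)
print(rejected)  # street must be str, got float
```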

Classes defined with @dataclass are not subclasses of tuple, so they don’t inherit all the nice tuple functionality.
All they have are the methods that @dataclass generates, as selected by the decorator arguments.
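When tuple-like behaviour is needed occasionally, the dataclasses module offers the helper functions astuple() and asdict() to convert an instance. A short sketch with a hypothetical class PlainAddress:

```python
from dataclasses import dataclass, astuple, asdict


@dataclass
class PlainAddress:  # hypothetical class, for illustration only
    street: str
    number: int
    city: str


address = PlainAddress('Square', 31, 'Krowing')

# Convert to a plain tuple to get slicing and membership tests.
as_tuple = astuple(address)
print(as_tuple[1:])           # (31, 'Krowing')
print('Krowing' in as_tuple)  # True

# Or to a dictionary keyed by field name.
as_dict = asdict(address)
print(as_dict)  # {'street': 'Square', 'number': 31, 'city': 'Krowing'}
```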

12.4. Option three: typing.NamedTuple

This is my favourite. It is most like the Java record type, and well, I simply love Java.

from typing import NamedTuple

class Address7(NamedTuple):
    street: str
    number: int
    city: str = 'Groningen'

address = Address7('Square', 31, 'Krowing')
print(address)
#still no runtime type checking...
address = Address7(3.1415, ('a', 'b'), True)
address
Address7(street='Square', number=31, city='Krowing')
Address7(street=3.1415, number=('a', 'b'), city=True)

Although Address7 seems to extend NamedTuple, it does not: “typing.NamedTuple uses metaclass functionality to customize the creation of the user’s class”. The result is a subclass of tuple:

print(issubclass(Address7, tuple))
True

Again, because it is a proper subclass of tuple, it has all functionality that tuple has.

address = Address7('Square', 31, 'Krowing')
print(31 in address)
print(address[2])
True
Krowing
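Unlike the collections.namedtuple call, the class syntax leaves room for methods of your own. The following sketch assumes a hypothetical class LabelledAddress with a made-up label() method, and also shows unpacking and the _asdict() helper that named tuples provide.

```python
from typing import NamedTuple


class LabelledAddress(NamedTuple):  # hypothetical class, for illustration only
    street: str
    number: int
    city: str = 'Groningen'

    def label(self) -> str:
        """Format the address on a single line."""
        return f'{self.street} {self.number}, {self.city}'


address = LabelledAddress('Square', 31)
print(address.label())  # Square 31, Groningen

# Tuple unpacking works as for any tuple.
street, number, city = address
print(city)  # Groningen

# _asdict() gives a dictionary keyed by field name.
as_dict = address._asdict()
print(as_dict)  # {'street': 'Square', 'number': 31, 'city': 'Groningen'}
```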

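Safe mutable defaults are not possible, however. There is no default_factory here, so a mutable default is created once, when the class is defined, and shared by all instances: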

def homeless():
    return Address7('UnderTheBridge', 0, 'Knowhere')

class User7(NamedTuple):
    email: str
    roles: list = list() ## default value will be a static property!
    address: Address7 = homeless()

user1 = User7('henk@example.com')
user2 = User7('henk@example.com')

user1.roles.append('scrum master')

print(user1)
print(user2.roles)
print(id(user1.address))
print(id(user2.address))
User7(email='henk@example.com', roles=['scrum master'], address=Address7(street='UnderTheBridge', number=0, city='Knowhere'))
['scrum master']
4384145296
4384145296
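One way around the shared default, sketched here under the assumption that a separate factory method is acceptable, is to leave the field without a default and build the list in a classmethod. The names Account and create are hypothetical.

```python
from typing import NamedTuple, Optional


class Account(NamedTuple):  # hypothetical class, for illustration only
    email: str
    roles: list

    @classmethod
    def create(cls, email: str, roles: Optional[list] = None) -> 'Account':
        # Build a fresh list on every call instead of sharing one default object.
        return cls(email, list(roles) if roles is not None else [])


henk = Account.create('henk@example.com')
ian = Account.create('ian@example.com')

henk.roles.append('scrum master')
print(henk.roles)  # ['scrum master']
print(ian.roles)   # []
```

Another option is to use an immutable default such as an empty tuple for roles, at the cost of losing append().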

By the way, like tuples, these classes are immutable, as they should be. This will give an error:

user1.address.city = Address7('Willow Str.', 3, 'Gneait')
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[68], line 1
----> 1 user1.address.city = Address7('Willow Str.', 3, 'Gneait')

AttributeError: can't set attribute
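To "change" a named tuple you create a modified copy instead, for which the _replace() method can be used. A minimal sketch with a hypothetical class HomeAddress:

```python
from typing import NamedTuple


class HomeAddress(NamedTuple):  # hypothetical class, for illustration only
    street: str
    number: int
    city: str


old = HomeAddress('Square', 31, 'Krowing')

# _replace() returns a new instance; the original stays as it was.
new = old._replace(street='Willow Str.', number=3, city='Gneait')
print(old)  # HomeAddress(street='Square', number=31, city='Krowing')
print(new)  # HomeAddress(street='Willow Str.', number=3, city='Gneait')
```

Keep in mind that this immutability is shallow: a list stored in a field can still be modified in place, as the roles example above showed.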