12. Data classes
12.1. Data models the easy way
Suppose we have some data model involving users and their roles.
User has
email
roles
street
number
city
Naively, many (non-OO) programmers will start out with something like this:
### This list of tuples holds all information of users:
# email at [0]
# roles at [1]
# street at [2]
# number at [3]
# city at [4]
users = [('henk@example.com', ['scrum master', 'team leader'], 'Square', 31, 'Krowing'),
         ('ian@example.com', ['programmer', 'designer'], 'Maple Street', 98, 'Peelsing')]
Yes, I have seen this type of structure, many times!
OK, maybe some student is a little bit more aware of data structures and you get this:
users = [{'email': 'henk@example.com',
'roles': ['scrum master', 'team leader'],
'street': 'Square',
'number': 31,
'city': 'Krowing'},
{'email': 'ian@example.com',
'roles': ['programmer', 'designer'],
'street': 'Maple Street',
'number': 98,
'city': 'Peelsing'}]
Still, you need to inspect the structure carefully to know which fields exist and how to access them, and you have to stay very alert for typos in the keys when accessing the data.
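To make the typo problem concrete, here is a minimal sketch with a shortened, hypothetical user record of the same shape as the dictionaries above. Nothing warns about the misspelled keys until the lines actually run:

```python
# hypothetical record, same shape as the dictionaries above
user = {'email': 'henk@example.com', 'city': 'Krowing'}

# a misspelled key is only detected when this line actually runs
try:
    print(user['emial'])
except KeyError as error:
    print(f'no such key: {error}')

# .get() is even more forgiving: the typo silently yields None
print(user.get('ctiy'))
```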
Yet another programmer has read about this OO API of Python and comes up with this:
class User1(object):
    def __init__(self, email, roles, street, number, city):
        self.email = email
        self.roles = roles
        self.street = street
        self.number = number
        self.city = city
    def __str__(self):
        return f'{self.email} lives at {self.street} {self.number} in {self.city} and has the following roles: {self.roles}'
    def __repr__(self):
        return f'User({self.email}, {self.roles}, {self.street}, {self.number}, {self.city})'
    def __eq__(self, other):
        return self.email == other.email and self.street == other.street and self.number == other.number and self.city == other.city and self.roles == other.roles
    # more hooks implemented
So we have a nice class that models user objects. Here we create a few and put them in a list.
users = [User1('henk@example.com', ['scrum master', 'team leader'], 'Square', 31, 'Krowing'),
User1('ian@example.com', ['programmer', 'designer'], 'Maple Street', 98, 'Peelsing')]
users
[User(henk@example.com, ['scrum master', 'team leader'], Square, 31, Krowing),
User(ian@example.com, ['programmer', 'designer'], Maple Street, 98, Peelsing)]
The alert reader, knowing a bit about the Single Responsibility Principle (SRP), might suggest splitting this into two entities: Address and User. After all, addresses can be used outside the scope of user instances, and by separating them, the code becomes simpler and easier to maintain and extend.
class Address2(object):
    def __init__(self, street, number, city):
        self.street = street
        self.number = number
        self.city = city
    def __str__(self):
        return f'{self.street} {self.number}, {self.city}'
    def __repr__(self):
        return f'Address({self.street}, {self.number}, {self.city})'
    def __eq__(self, other):
        return self.street == other.street and self.number == other.number and self.city == other.city
class User2(object):
    def __init__(self, email, roles, address):
        self.email = email
        self.roles = roles
        self.address = address
    def __str__(self):
        return f'{self.email} lives at {self.address} and has the following roles: {self.roles}'
    def __repr__(self):
        return f'User({self.email}, {self.roles}, {self.address})'
    def __eq__(self, other):
        return self.email == other.email and self.address == other.address and self.roles == other.roles
users = [User2('henk@example.com', ['scrum master', 'team leader'], Address2('Square', 31, 'Krowing')),
User2('ian@example.com', ['programmer', 'designer'], Address2('Maple Street', 98, 'Peelsing'))]
for user in users:
    print(user)
henk@example.com lives at Square 31, Krowing and has the following roles: ['scrum master', 'team leader']
ian@example.com lives at Maple Street 98, Peelsing and has the following roles: ['programmer', 'designer']
But wow, that is a lot of boilerplate code, almost as much as in Java before it had records!
Boilerplate
Boilerplate code consists of sections of code that are repeated in multiple places with little to no variation.
You need to write a lot of boilerplate code to accomplish only minor functionality.
Wouldn’t it be wonderful if there was some way to get rid of this boilerplate, and simply focus on what is being modelled: a User and an Address, both with some properties?
In Java nowadays we simply write this:
record Address(String street, int number, String city){
// really, there's nothing here anymore!
// we get a constructor, equals(), hashCode,
// toString(), all free
// of charge from the compiler
}
As is typical for Python, there are not one, not two, but at least three different ways to get to more concise data classes.
Let’s start with the most well-known (but least versatile), namedtuple.
12.2. Option one: collections.namedtuple
from collections import namedtuple
Address3 = namedtuple('Address3', ['street', 'number', 'city'])
User3 = namedtuple('User3', ['email', 'roles', 'address'])
user = User3('henk@example.com', ['scrum master', 'team leader'],
Address3('Square', 31, 'Krowing'))
print(user)
User3(email='henk@example.com', roles=['scrum master', 'team leader'], address=Address3(street='Square', number=31, city='Krowing'))
This gives you the most basic data class. The generated class is a subclass of tuple. You cannot add extra methods, such as for instance addRole(role), in the namedtuple() call itself; that requires subclassing the generated class.
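If methods are needed after all, one option is to subclass the class that namedtuple() generates. A minimal sketch, in which UserBase, UserWithRoles and add_role are hypothetical names chosen for illustration:

```python
from collections import namedtuple

UserBase = namedtuple('UserBase', ['email', 'roles'])

class UserWithRoles(UserBase):
    __slots__ = ()  # no per-instance __dict__, so instances stay as light as tuples

    def add_role(self, role):
        # tuples are immutable, so return a new instance instead of changing self
        return self._replace(roles=self.roles + [role])

user = UserWithRoles('henk@example.com', ['scrum master'])
promoted = user.add_role('team leader')
print(promoted)
print(user.roles)  # the original instance is unchanged
```

Because the instance itself cannot be modified, add_role returns a new object; whether that is acceptable depends on how the class is used.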
However, you get everything tuple has, like slicing and in:
nt_user = User3('henk@example.com', ['scrum master', 'team leader'], Address3('Square', 31, 'Krowing'))
nt_user[1:3]
(['scrum master', 'team leader'],
Address3(street='Square', number=31, city='Krowing'))
'henk@example.com' in nt_user
True
Moreover, you get to access properties using the dot operator:
print(nt_user.address)
print(nt_user.address.city)
Address3(street='Square', number=31, city='Krowing')
Krowing
If you omit one of the parameter values, you get an error:
address = Address3('Square', 31)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[21], line 1
----> 1 address = Address3('Square', 31)
TypeError: Address3.__new__() missing 1 required positional argument: 'city'
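Default values are specified with the defaults argument: an iterable of values that is matched to the rightmost fields. Note that a dictionary does not work here: iterating over a dict yields its keys, so defaults={'city': 'Groningen'} would silently make the string 'city' the default.
# a single default is matched to the rightmost field, city
Address3 = namedtuple('Address3', ['street', 'number', 'city'], defaults=['Groningen'])
# to give every field a default, pass one value per field:
#Address3 = namedtuple('Address3', ['street', 'number', 'city'], defaults=[None, None, 'Groningen'])
address = Address3('Square', 31)
print(address)
address2 = Address3('Square', 31, 'Groningen')
print(address2)
#address3 = Address3('Square') #fails with defaults=['Groningen'], passes with the three-element list!
#print(address3)
Address3(street='Square', number=31, city='Groningen')
Address3(street='Square', number=31, city='Groningen')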
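A quick way to check which defaults a namedtuple class actually received is its _field_defaults attribute. The sketch below uses two hypothetical class names, WithList and WithDict, to compare a list and a dict passed as defaults:

```python
from collections import namedtuple

fields = ['street', 'number', 'city']

# one default value: it is applied to the rightmost field, city
WithList = namedtuple('WithList', fields, defaults=['Groningen'])
print(WithList._field_defaults)   # {'city': 'Groningen'}

# a dict is iterated over its keys, so the default becomes the string 'city'
WithDict = namedtuple('WithDict', fields, defaults={'city': 'Groningen'})
print(WithDict._field_defaults)   # {'city': 'city'}
```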
12.3. Option two: dataclasses.dataclass
When I first encountered the @dataclass decorator I was really enthusiastic. Here is Address again, now as a @dataclass:
from dataclasses import dataclass
@dataclass
class Address4:
    street: str
    number: int
    city: str
Great, now we have a dataclass that represents an address. Let’s inspect its behavior:
address1 = Address4('Square', 31, 'Krowing')
print(address1)
address2 = Address4('Square', 31, 'Krowing')
print(address1 == address2)
Address4(street='Square', number=31, city='Krowing')
True
The @dataclass decorator takes quite a few arguments:
@dataclass(*, init=True, repr=True, eq=True,
order=False, unsafe_hash=False, frozen=False)
The first three default to True, the others to False. Python 3.10 and later accept a few more, such as slots and kw_only.
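As an illustration of what two of these switches do, here is a small sketch with order and frozen turned on. FrozenAddress is a hypothetical class, not one of the chapter's examples:

```python
from dataclasses import dataclass, FrozenInstanceError

@dataclass(order=True, frozen=True)
class FrozenAddress:
    city: str
    street: str
    number: int

a = FrozenAddress('Krowing', 'Square', 31)
b = FrozenAddress('Peelsing', 'Maple Street', 98)

# order=True generates <, <=, > and >=; fields are compared in definition order
print(a < b)                    # True
print(sorted([b, a])[0] == a)   # True

# frozen=True turns attribute assignment into an error
try:
    a.number = 32
except FrozenInstanceError:
    print('frozen instances cannot be modified')

# eq=True combined with frozen=True also makes instances hashable
print(len({a, FrozenAddress('Krowing', 'Square', 31)}))   # 1
```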
Defining defaults is simple - as long as they are not container (mutable) types:
@dataclass
class Address5:
    street: str
    number: int
    city: str = 'Groningen'
address1 = Address5('Square', 31)
address2 = Address5('Square', 31, 'Groningen')
address1 == address2
True
Note that fields with defaults must come after the ones without, just like parameters in a regular function definition.
But how about the User class? Is it as simple?
@dataclass
class User4:
    email: str
    roles: list
    address: Address4
user = User4('henk@example.com', ['scrum master', 'team leader'], Address4('Square', 31, 'Krowing'))
print(user)
User4(email='henk@example.com', roles=['scrum master', 'team leader'], address=Address4(street='Square', number=31, city='Krowing'))
It seems really simple, but when you want to define default values for mutable types it gets harder:
@dataclass
class User5:
    email: str
    address: Address4
    roles: list = []
# never reached: the class definition above already raises the error
user1 = User5('henk@example.com', Address4('Square', 31, 'Krowing'), ['scrum master', 'team leader'])
print(user1)
ValueError Traceback (most recent call last)
Cell In[14], line 2
1 @dataclass
----> 2 class User5:
##many lines of Traceback omitted
ValueError: mutable default <class 'list'> for field roles is not allowed: use default_factory
You are not allowed to define collection (mutable) types as default values (which is a good idea, as I discovered recently 😅); we will see more on that later.
To specify collection defaults, you need to use a default_factory argument:
from dataclasses import dataclass, field
@dataclass
class User5:
    email: str
    address: Address5
    roles: list = field(default_factory=list)
user = User5('henk@example.com', Address5('Square', 31, 'Krowing'))
print(user)
User5(email='henk@example.com', address=Address5(street='Square', number=31, city='Krowing'), roles=[])
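The point of default_factory is that the factory is called once per instance, so no two objects share the same list. A minimal sketch, with Member as a hypothetical stand-in for the user class:

```python
from dataclasses import dataclass, field

@dataclass
class Member:   # hypothetical, reduced to two fields
    email: str
    roles: list = field(default_factory=list)

first = Member('henk@example.com')
second = Member('ian@example.com')
first.roles.append('scrum master')

print(first.roles)                   # ['scrum master']
print(second.roles)                  # []
print(first.roles is second.roles)   # False
```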
Of course, you can specify your own default factory:
def homeless():
    return Address5('UnderTheBridge', 0, 'Knowhere')
@dataclass
class User6:
    email: str
    address: Address5 = field(default_factory=homeless)
    roles: list = field(default_factory=list)
user = User6('john_doe@example.com', roles = ['bum'])
print(user)
User6(email='john_doe@example.com', address=Address5(street='UnderTheBridge', number=0, city='Knowhere'), roles=['bum'])
Note that the field() function accepts many more arguments; see the dataclasses documentation.
But what’s really funny is that the type annotations here are, like type hints anywhere else in Python, merely hints: nothing is checked at runtime…
@dataclass
class Address6:
    street: str
    number: int
    city: str = 'Groningen'
address = Address6(3.1415, ('a', 'b'), True)
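If runtime checks are wanted, one possible approach is to add them yourself in a __post_init__ method, which the generated __init__ calls after setting the fields. The sketch below is an assumption about how one might do this, not something @dataclass provides; CheckedAddress is a hypothetical class, and the simple isinstance test only works when the annotations are plain classes:

```python
from dataclasses import dataclass, fields

@dataclass
class CheckedAddress:
    street: str
    number: int
    city: str = 'Groningen'

    def __post_init__(self):
        # called by the generated __init__ once all fields are set
        for f in fields(self):
            value = getattr(self, f.name)
            # works for plain classes only, not for annotations like list[str]
            if not isinstance(value, f.type):
                raise TypeError(
                    f'{f.name} should be {f.type.__name__}, '
                    f'got {type(value).__name__}')

print(CheckedAddress('Square', 31))
try:
    CheckedAddress(3.1415, ('a', 'b'), True)
except TypeError as error:
    print(error)
```

For anything beyond simple cases, a static type checker or a dedicated validation library is probably the more robust choice.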
Classes defined with @dataclass are not subclasses of tuple, so they don’t inherit all the nice tuple functionality.
All they get are the methods that the @dataclass decorator generates, as selected by its arguments.
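When tuple-like behaviour is occasionally needed, the dataclasses module offers astuple() and asdict() to convert an instance on demand. A short sketch, with PlainAddress as a hypothetical class:

```python
from dataclasses import dataclass, astuple, asdict

@dataclass
class PlainAddress:   # hypothetical
    street: str
    number: int
    city: str

address = PlainAddress('Square', 31, 'Krowing')
as_tuple = astuple(address)

print(as_tuple)          # ('Square', 31, 'Krowing')
print(as_tuple[1:])      # (31, 'Krowing')
print(31 in as_tuple)    # True
print(asdict(address))   # {'street': 'Square', 'number': 31, 'city': 'Krowing'}
```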
12.4. Option three: typing.NamedTuple
This is my favourite. It is most like the Java record type, and well, I simply love Java.
from typing import NamedTuple
class Address7(NamedTuple):
    street: str
    number: int
    city: str = 'Groningen'
address = Address7('Square', 31, 'Krowing')
print(address)
#still no runtime type checking...
address = Address7(3.1415, ('a', 'b'), True)
address
Address7(street='Square', number=31, city='Krowing')
Address7(street=3.1415, number=('a', 'b'), city=True)
Although Address7 seems to extend NamedTuple, it does not. “typing.NamedTuple uses metaclass functionality to customize the creation of the user’s class”; what you get is a subclass of tuple:
print(issubclass(Address7, tuple))
True
Again, because it is a proper subclass of tuple, it has all functionality that tuple has.
address = Address7('Square', 31, 'Krowing')
print(31 in address)
print(address[2])
True
Krowing
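Defining defaults of mutable types is possible, but it does not do what you probably want: the default object is created once, when the class is defined, and is then shared by all instances:
def homeless():
    return Address7('UnderTheBridge', 0, 'Knowhere')
class User7(NamedTuple):
    email: str
    roles: list = list() ## created once at class definition and shared by all instances!
    address: Address7 = homeless()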
user1 = User7('henk@example.com')
user2 = User7('henk@example.com')
user1.roles.append('scrum master')
print(user1)
print(user2.roles)
print(id(user1.address))
print(id(user2.address))
User7(email='henk@example.com', roles=['scrum master'], address=Address7(street='UnderTheBridge', number=0, city='Knowhere'))
['scrum master']
4384145296
4384145296
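One way to sidestep the shared default is to use an immutable type for it, for example a tuple instead of a list. This is a sketch of that idea rather than a general recipe; TupleRolesUser is a hypothetical class:

```python
from typing import NamedTuple

class TupleRolesUser(NamedTuple):
    email: str
    roles: tuple = ()   # immutable, so sharing the default object is harmless

user1 = TupleRolesUser('henk@example.com')
user2 = TupleRolesUser('ian@example.com', roles=('programmer', 'designer'))

print(user1.roles)                      # ()
print(user2.roles)                      # ('programmer', 'designer')
# there is no append(), so the shared default cannot be changed by accident
print(hasattr(user1.roles, 'append'))   # False
```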
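By the way, like tuples, instances of these classes are immutable, as they should be (although a mutable field, such as the roles list above, can still be changed in place). This will give an error: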
user1.address.city = Address7('Willow Str.', 3, 'Gneait')
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
Cell In[68], line 1
----> 1 user1.address.city = Address7('Willow Str.', 3, 'Gneait')
AttributeError: can't set attribute
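To "change" a field anyway, the usual route is to build a modified copy with the _replace() method that named tuples provide. A small sketch with hypothetical Place and Person classes, including a nested replacement:

```python
from typing import NamedTuple

class Place(NamedTuple):    # hypothetical
    street: str
    number: int
    city: str

class Person(NamedTuple):   # hypothetical
    email: str
    address: Place

person = Person('henk@example.com', Place('Square', 31, 'Krowing'))

# _replace returns a new object; the original stays untouched
moved = person._replace(address=person.address._replace(city='Gneait'))

print(moved.address.city)    # Gneait
print(person.address.city)   # Krowing
```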