Implementing Dataclasses from scratch

This is sort of an afterthought to the MetaProgramming article.

I realized that while I teased a lot, I never really implemented anything non-trivial. So that’s what we are going to do today: we’re going to implement a dataclass-like module to help simplify our class definitions.

The Problem

Say you have a Person class that has a name and an age.

class Person:
    def __init__(self, name, age):
        self.name = name
        self.age = age

Just look at the code duplication! The words “name” and “age” appear 6 times between them. Sure, this isn’t so bad here, but it’s a pain to maintain when you have a dozen or so attributes.

The Solution

The right way to do this is to use a dataclass.

from dataclasses import dataclass


@dataclass
class Person:
    name: str
    age: int
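For reference, here is a small sketch of what the standard decorator gives us. It generates __init__, __repr__ and __eq__ from the annotations. The default of 18 is only there for illustration. Note that the annotations are not checked at runtime, which is one of the things our own version adds.

```python
from dataclasses import dataclass


@dataclass
class Person:
    name: str
    age: int = 18  # illustrative default, used when age is omitted


p = Person("Shashwat")
print(p)  # the generated __repr__ lists every field

# Annotations are not enforced: this is accepted without any error.
q = Person("Shashwat", age="25")
print(type(q.age).__name__)
```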

But we are not going to do that today. Instead, we’ll try to implement our own version of dataclasses. Of course, the real dataclasses support lots of features, but we’re going to focus on the basics. Here’s what our dataclass will support:

Runtime type checking
Default values
Mixed use of keyword and positional arguments

Enough talking. Now let’s get to the code.

The Implementation

Note: Don’t expect this to be a step-by-step guide. It’s just a quick and dirty implementation that I hacked together in an hour and wanted to share.

Let’s start by taking a look at our descriptor:

from typing import Any, Callable


class NoDefault:
    ...


NO_DEFAULT = NoDefault()


class Descriptor:
    def __init__(self, name, kind, default: Callable | NoDefault):
        self.name = name
        self.kind = kind
        self.default = default

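    def __get__(self, instance, owner=None):
        if instance is None:
            # Accessed on the class (e.g. Person.name): return the descriptor.
            return self
        if self.name in instance.__dict__:
            return instance.__dict__[self.name]
        if self.default is not NO_DEFAULT:
            # Nothing stored yet, so fall back to the default factory.
            return self.default()  # type: ignore
        raise AttributeError(f"'{self.name}' has not been set")

    def __set__(self, instance, value):
        if isinstance(value, self.kind):
            instance.__dict__[self.name] = value
            return
        # Adjacent string literals keep source indentation out of the message.
        raise TypeError(
            f"Expected '{self.kind.__name__}' "
            f"got '{type(value).__name__}' for {self.name}"
        )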

Descriptors are outside the scope of this article, but I’ll give the gist for those unaware. Descriptors allow you to (as is commonly said) “own the dot”.

Say you write obj.x. Normally, this would look up x in the object’s dictionary. But if x is a descriptor on the class (strictly, a data descriptor like ours, which defines both __get__ and __set__), Python calls its __get__ method instead.
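To make “owning the dot” concrete, here is a tiny sketch that is separate from our dataclass code. The names Loud and Demo are made up for this illustration. The descriptor simply records each read and write in a list.

```python
class Loud:
    """Hypothetical descriptor that records every access to the attribute x."""

    def __init__(self):
        self.log = []

    def __get__(self, instance, owner=None):
        if instance is None:  # accessed on the class, not on an instance
            return self
        self.log.append("get")
        return instance.__dict__.get("x")

    def __set__(self, instance, value):
        self.log.append("set")
        instance.__dict__["x"] = value


class Demo:
    x = Loud()  # the descriptor lives on the class


d = Demo()
d.x = 42     # routed through Loud.__set__
value = d.x  # routed through Loud.__get__
print(value, Demo.x.log)
```

Even though the value ends up in the instance dictionary, every access still goes through the descriptor first.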

What our descriptor does is straightforward. Let’s take a look at the two methods in it.

In the __get__ method, it checks if the attribute x is in the object’s dictionary. If it is, it returns the value. This means the user has already set the value. If it isn’t, it checks if the user set a default value. If they did, it returns the default value.

For the __set__ method, it checks if the value is of the correct type. If it is, it sets the value in the object’s dictionary. If it isn’t, it raises a TypeError.
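If you want to poke at this behaviour on its own, here is a condensed variant of the descriptor that can be used without the rest of the machinery. It is only a sketch: the Point class is made up, the sentinel is a plain object(), and I have added a guard for access through the class.

```python
NO_DEFAULT = object()  # stand-in sentinel for this sketch


class Descriptor:
    """Condensed variant of the descriptor above, for experimenting."""

    def __init__(self, name, kind, default=NO_DEFAULT):
        self.name, self.kind, self.default = name, kind, default

    def __get__(self, instance, owner=None):
        if instance is None:
            return self
        if self.name not in instance.__dict__ and self.default is not NO_DEFAULT:
            return self.default()
        return instance.__dict__[self.name]

    def __set__(self, instance, value):
        if not isinstance(value, self.kind):
            raise TypeError(
                f"Expected '{self.kind.__name__}' "
                f"got '{type(value).__name__}' for {self.name}"
            )
        instance.__dict__[self.name] = value


class Point:
    # Without a metaclass we have to repeat the attribute name by hand.
    x = Descriptor("x", int, default=lambda: 0)


pt = Point()
before = pt.x      # nothing stored yet, so the default factory is used
pt.x = 5           # passes the isinstance check and is stored
try:
    pt.x = "five"  # wrong type, rejected by __set__
except TypeError as exc:
    message = str(exc)
print(before, pt.x, message)
```

Having to write the name "x" twice is exactly the kind of repetition the metaclass removes later on.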

Now that we have a descriptor, let’s define a way to mark which attributes our dataclass should manage. We do this by creating a Type class.

class Type:
    def __init__(self, kind, **kwargs) -> None:
        self.kind = kind
        if "default" in kwargs:
            self.default = kwargs["default"]
        else:
            self.default = NO_DEFAULT

All it does is take in the kind of the attribute and the default value. The kind can be a type. For example, if we want to create a str attribute, we would do Type(str).

default is a function that, when called, returns the default value. For example, if we want to create a str attribute with a default value of “Hello”, we would do Type(str, default=lambda: "Hello").
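Why a function rather than a plain value? One good reason is mutable defaults, which is also why the standard library offers default_factory. The sketch below uses a condensed Type with a stand-in sentinel, and the variable names are just for illustration.

```python
NO_DEFAULT = object()  # stand-in sentinel for this sketch


class Type:
    def __init__(self, kind, **kwargs):
        self.kind = kind
        self.default = kwargs.get("default", NO_DEFAULT)


required = Type(str)             # no default: the value must be supplied
greeting = Type(str, default=lambda: "Hello")
tags = Type(list, default=list)  # any zero-argument callable works

# Each call to the factory builds a fresh value, so instances never
# share one mutable default object.
first, second = tags.default(), tags.default()
first.append("python")
print(required.default is NO_DEFAULT, greeting.default(), first, second)
```

One thing to keep in mind: our __get__ calls the factory on every read until a value has been set, so mutating a defaulted list in place would not stick.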

And now finally, we can create our metaclass!

class MetaClass(type):
    def __new__(cls, name, bases, attrs):
        new_attrs = {}
        for key, value in attrs.items():
            if isinstance(value, Type):
                new_attrs[key] = Descriptor(
                    key,
                    value.kind,
                    value.default
                )
            else:
                new_attrs[key] = value
        return super().__new__(cls, name, bases, new_attrs)

    def __call__(self, *args: Any, **kwargs: Any) -> Any:
        params = {
            key: value
            for key, value in self.__dict__.items()
            if isinstance(value, Descriptor)
        }
        if (len(args) + len(kwargs)) > len(params):
            raise TypeError("Too many arguments")
        items = iter(params.items())
        for arg in args:
            key, _ = next(items)
            if key in kwargs:
                raise TypeError(
                    f"Duplicate argument - {key} "
                    "passed both by position and by name"
                )
            kwargs[key] = arg

        for param in items:
            key, value = param
            if key not in kwargs and value.default is NO_DEFAULT:
                raise TypeError(f"Missing argument - {key}")

        def __init__(instance, *vargs, **kvargs):
            for key, value in kwargs.items():
                setattr(instance, key, value)

        self.__init__ = __init__
        return super().__call__()

A metaclass is a class that is used to create classes. The new class must be returned by the metaclass’s __new__ method. In it, we loop over the attributes of the class.

for key, value in attrs.items():
    if isinstance(value, Type):
        new_attrs[key] = Descriptor(
            key,
            value.kind,
            value.default
        )
    else:
        new_attrs[key] = value

If the attribute is a Type instance, we create a Descriptor instance and add it to the new class. Remember - even though descriptors belong to the class, they intercept attribute accesses on the actual instance.

If it is not a Type instance, we copy it into the new attribute dictionary unchanged.

Then, we call the super class’s __new__ method, passing in the attributes that we created.

return super().__new__(cls, name, bases, new_attrs)
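If metaclasses are new to you, this stripped-down sketch shows the same hook in isolation. Marker, Replaced and SwapMarkers are made-up names that play the roles of Type, Descriptor and our metaclass.

```python
class Marker:
    """Hypothetical placeholder, playing the role of Type."""


class Replaced:
    """Hypothetical replacement, playing the role of Descriptor."""

    def __init__(self, name):
        self.name = name


class SwapMarkers(type):
    def __new__(cls, name, bases, attrs):
        # attrs is the dict built from the class body; rewrite it first.
        new_attrs = {
            key: Replaced(key) if isinstance(value, Marker) else value
            for key, value in attrs.items()
        }
        return super().__new__(cls, name, bases, new_attrs)


class Example(metaclass=SwapMarkers):
    field = Marker()
    constant = 3


print(type(Example).__name__, type(vars(Example)["field"]).__name__)
```

The key is passed along for free, so the replacement object learns its own attribute name without us typing it twice.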

Next, we create the __call__ method. This method is called when an instance of the class is created. For example, if we do Person(), this method is called.
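A quick way to convince yourself of this is a metaclass whose __call__ does nothing but count. Tracing and Widget are hypothetical names used only for this sketch.

```python
class Tracing(type):
    """Hypothetical metaclass that counts how often its classes are called."""

    def __call__(cls, *args, **kwargs):
        # Runs for Widget(...), before Widget.__new__ and Widget.__init__.
        cls.created = getattr(cls, "created", 0) + 1
        return super().__call__(*args, **kwargs)


class Widget(metaclass=Tracing):
    def __init__(self, label):
        self.label = label


a = Widget("first")
b = Widget("second")
print(Widget.created, a.label, b.label)
```

Because __call__ sees the arguments before __init__ does, it is a convenient place to validate or rearrange them.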

In our __call__, we first do some basic housekeeping. We ensure that we were not given too many arguments, and then match each positional argument to an attribute name in declaration order.

params = {
    key: value
    for key, value in self.__dict__.items()
    if isinstance(value, Descriptor)
}
if (len(args) + len(kwargs)) > len(params):
    raise TypeError("Too many arguments")
items = iter(params.items())
for arg in args:
    key, _ = next(items)
    if key in kwargs:
        raise TypeError(
            f"Duplicate argument - {key} "
            "passed both by position and by name"
        )
    kwargs[key] = arg

for param in items:
    key, value = param
    if key not in kwargs and value.default is NO_DEFAULT:
        raise TypeError(f"Missing argument - {key}")

We also take care of missing and duplicate arguments. If a default value is set, however, a missing argument is not an error.
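The same bookkeeping can be pulled out into a plain function, which makes it easier to test. This is only a sketch: bind is a hypothetical helper, names and defaults are inputs I made up to stand in for the descriptors, and the check for unknown keywords is an extra that the code above does not perform.

```python
def bind(names, defaults, args, kwargs):
    """Map positional and keyword arguments onto field names.

    names is the ordered list of field names and defaults is the set of
    names that have a default (both are hypothetical inputs).
    """
    if len(args) + len(kwargs) > len(names):
        raise TypeError("Too many arguments")
    bound = dict(kwargs)  # copy, so the caller's dict is left alone
    for key, arg in zip(names, args):
        if key in bound:
            raise TypeError(
                f"Duplicate argument - {key} passed both by position and by name"
            )
        bound[key] = arg
    unknown = set(bound) - set(names)
    if unknown:  # extra check, not part of the original implementation
        raise TypeError(f"Unexpected argument - {sorted(unknown)[0]}")
    for key in names:
        if key not in bound and key not in defaults:
            raise TypeError(f"Missing argument - {key}")
    return bound


fields = ["name", "age"]
print(bind(fields, {"age"}, ("Shashwat",), {}))
print(bind(fields, {"age"}, ("Shashwat",), {"age": 20}))
```

Without the extra check, a call such as Person("Shashwat", foo=1) would slip through and quietly set an attribute named foo.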

And of course, we must set the __init__ method of the class. While we have all the arguments we need, we have to make sure when the __init__ method is called, it sets the values.

def __init__(instance, *vargs, **kvargs):
    for key, value in kwargs.items():
        setattr(instance, key, value)

self.__init__ = __init__
return super().__call__()

Yes, that’s perfectly legal. You can set __init__ on a class (which in this case is self).
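If that still looks suspicious, here it is in isolation. Plain and init are throwaway names for this sketch.

```python
class Plain:
    pass


def init(instance, value=0):
    instance.value = value


# Functions assigned to a class become methods, and __init__ is no exception.
Plain.__init__ = init

obj = Plain(7)
print(obj.value)
```

Keep in mind that the function is stored on the class, not on the instance, so our __call__ replaces it on every instantiation. That is fine for a demo, but it could misbehave if two threads created instances at the same time.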

And that’s it! I know that was a lot of work, but I think it’s worth it.

Using it

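Let’s consider the same example as before. MyDataClass isn’t shown above; it only needs to be an empty base class that uses our metaclass, for example class MyDataClass(metaclass=MetaClass): pass.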

class Person(MyDataClass):
    name = Type(str)
    age = Type(int, default=lambda: 18)

So we have a class called Person that has a name attribute (which is a str) and an age attribute (which is an int with a default value of 18).

Now, let’s create a new instance of Person.

>>> p = Person()
Traceback (most recent call last):
  File "<pyshell#2>", line 1, in <module>
    p = Person()
  File "/mnt/Programming/Python/lib.py", line 70, in __call__
    raise TypeError(f"Missing argument - {key}")
TypeError: Missing argument - name
>>> p = Person("Shashwat")
>>> p.name
'Shashwat'
>>> p.age
18

Looks good so far! Let’s try the type checking.

>>> p = Person("Shashwat", age="25")
Traceback (most recent call last):
  File "<pyshell#6>", line 1, in <module>
    p = Person("Shashwat", age="25")
  File "/mnt/Programming/Python/lib.py", line 77, in __call__
    return super().__call__()
  File "/mnt/Programming/Python/lib.py", line 74, in __init__
    setattr(instance, key, value)
  File "/mnt/Programming/Python/lib.py", line 26, in __set__
    raise TypeError(
TypeError: Expected 'int' got 'str' for age

We can also set the attributes after the instance has been created.

>>> p = Person("Shashwat", 20)
>>> p.age
20
>>> p.age = 10
>>> p.age
10
>>> p.age = 10.5
Traceback (most recent call last):
  File "<pyshell#11>", line 1, in <module>
    p.age = 10.5
  File "/mnt/Programming/Python/lib.py", line 26, in __set__
    raise TypeError(
TypeError: Expected 'int' got 'float' for age

And we have validation for the __init__ arguments as well.

>>> p = Person("Shashwat", name="Another")
Traceback (most recent call last):
  File "<pyshell#14>", line 1, in <module>
    p = Person("Shashwat", name="Another")
  File "/mnt/Programming/Python/lib.py", line 62, in __call__
    raise TypeError(
TypeError: Duplicate argument - name passed both by position and by name
>>> p = Person("Shashwat", 10, 20)
Traceback (most recent call last):
  File "<pyshell#15>", line 1, in <module>
    p = Person("Shashwat", 10, 20)
  File "/mnt/Programming/Python/lib.py", line 57, in __call__
    raise TypeError("Too many arguments")
TypeError: Too many arguments

That’s a lot of neat functionality for a class that is only 3 lines long.

If your head hasn’t exploded yet, you can try using inheritance to see if this works there as well!
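A hint for that experiment: our __call__ reads self.__dict__, which only holds attributes defined directly on the class, so fields declared on a parent would not be picked up. One possible approach is to walk the MRO instead. The sketch below uses a stand-in Descriptor class and a hypothetical helper named collect_fields.

```python
class Descriptor:
    """Stand-in for the real descriptor; only its type matters here."""


def collect_fields(cls):
    """Gather descriptors from cls and its bases, parents first."""
    fields = {}
    for klass in reversed(cls.__mro__):
        for key, value in vars(klass).items():
            if isinstance(value, Descriptor):
                fields[key] = value  # a subclass may override a parent field
    return fields


class Base:
    name = Descriptor()


class Child(Base):
    age = Descriptor()


own = [k for k, v in vars(Child).items() if isinstance(v, Descriptor)]
print(own, list(collect_fields(Child)))
```

Walking the bases in reverse keeps the parent’s fields first, so positional arguments would line up with the order in which the fields were declared.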

Conclusion

If you’ve never heard of metaclasses or descriptors, this might feel weird and confusing. And you might wonder why you’d want to use them at all. After all, Python has dataclasses, so why reinvent the wheel?

The answer to that is obvious: you should never create your own version of dataclasses. But the concepts we’ve discussed today are quite useful, and you’ll find these techniques used in a lot of frameworks to make things easier.

Take this example from the Flask-SQLAlchemy documentation:

class User(db.Model):
    id = db.Column(db.Integer, primary_key=True)
    username = db.Column(db.String(80), unique=True, nullable=False)
    email = db.Column(db.String(120), unique=True, nullable=False)

The reason it’s possible to define your models in such a concise manner is that the library itself is using descriptors and metaclasses to do the heavy lifting.

Using them in user code is probably something you’d never want to do, but if you’re writing your own framework, they come in incredibly handy.

Resources

Here are some things you might find helpful if you want to learn more about what we’ve discussed:

You can find the entire code for this article here.

Hope you learned something useful!