Keeping data integrity with Django migrations

Django built-in migrations were designed as a way of propagating the changes you make on the models to the database. Migrations exist mostly to keep your code in sync with your schema and to provide a good way of versioning it. We just need a couple of commands to get our apps up and running with an updated version of our database, or not.

Unfortunately, there are some situations when applying a migration by itself is not enough. What would you do if you needed to add a required field to an existing table?

How about changing a many-to-many relationship to just a foreign key without losing pre-existing data? It may be painful and critical to make these changes without the right tool; that’s why we have Data Migrations! In this post I’ll use data migrations to approach these two scenarios.

First Scenario

Let’s say you have to design a tracking system for a company that sells computers. This system should register the computers bought by each customer. We can migrate the following models:

GitHub

from django.db import models

# Create your models here.
class Client(models.Model):
    name = models.CharField(max_length=255)

class Computer(models.Model):
    brand = models.CharField(max_length=255)
    bought_by = models.ForeignKey(Client, related_name='computers')

Now, for some reason, we’ll need to fetch a list with the information of how many computers each customer has bought.

Somehow, this is incredibly expensive to compute and your queries would take extra time if you use annotations with a Count aggregation.

One possible solution would be to add a column on the Customer model to keep track of the amount of computers sold to that specific customer. To keep data integrity, we’ll make this field required for all new customers.

To provide the initial data to  this new computed required field on your model, you can use a data migration. It basically works by executing a custom function in between schema migrations.

Step 1. First we add the field to our model as a non-required field and run the makemigration and migrate commands:

GitHub

from django.db import models

# Create your models here.
class Client(models.Model):
    name = models.CharField(max_length=255)
    bought = models.IntegerField(null=False)

class Computer(models.Model):
    brand = models.CharField(max_length=255)
    bought_by = models.ForeignKey(Client, related_name='computers')

Step 2. Then, we’ll add a blank migration file; here we’ll add the custom function to perform the data migration. We use the following command:

python manage.py makemigrations --empty store --name prepopulate_bought_field

Note: store is the Django app you’re working on and you can use the –name parameter to get more informative names for your migrations.

Step 3. Now you’ll need to write the function for this migration (this is called the forwards function):

# -*- coding: utf-8 -*-
from __future__ import unicode_literals

from django.db import migrations

def add_bought(apps, schema_editor):
    Client = apps.get_model('store', 'Client')
    for c in Client.objects.all():
        c.bought = c.computers.count()
        c.save()

class Migration(migrations.Migration):
    dependencies = [
        ('store', '0002_client_bought'),
    ]

    operations = [
        migrations.RunPython(add_bought, reverse_code=migrations.RunPython.noop)
    ]

Note that for using our Customer model, we can’t just simply import it; this would get the most recent model that could be already changed by someone else (e.g. adding a field). To avoid this, we use apps.get_model to fetch the current migration version.

This operation can also have a rollback (or reverse function) if you want or need to, simply add it as the RunPython’s parameter reverse_code – see more here. In this example, we use migrations.RunPython.noop to make the migration reversible without executing any code on the rollback.

Step 4. After computing and storing all data, we can change the bought field to required in our model by setting its null property to False (it’s also possible to  do this change on the data migration file if you want to save a migration). This step will generate another migration and you may see something like this in your screen after running makemigrations command:

You are trying to change the nullable field 'bought' on customer to non-nullable without a default; we can't do that (the database needs something to populate existing rows).
Please select a fix:
1) Provide a one-off default now (will be set on all existing rows with a null value for this column)
2) Ignore for now, and let me handle existing rows with NULL myself (e.g. because you added a RunPython or RunSQL operation to handle NULL values in a previous data migration)
3) Quit, and let me add a default in models.py
Select an option:

You only need to select 2 because that’s exactly what you did: you added a RunPython operation to handle null values in a previous data migration.

Second Scenario

Now, say we have an address list and we thought it would be nice to have a many-to-many relationship so we wouldn’t need to repeat rows in address table. A few days later you realize that when users change their addresses, it changes for other users with the same address too. You now want to store a full address for each customer, even if it comes with some cost.

The current state of our models looks like this:

GitHub

from django.db import models

# Create your models here.
class Address(models.Model):
    street_name = models.CharField(max_length=255)

class Client(models.Model):
    name = models.CharField(max_length=255)
    addresses = models.ManyToManyField(Address)

For this data migration, we’ll need an intermediary model to store addresses data before deleting it! We change our models to something like this:

GitHub

from django.db import models

# Create your models here.
class Address(models.Model):
    street_name = models.CharField(max_length=255)

class Client(models.Model):
    name = models.CharField(max_length=255)
    addresses = models.ManyToManyField(Address)

class NewAddress(models.Model):
    street_name = models.CharField(max_length=255)
    client = models.ForeignKey(Client)

Similarly to what we’ve done in the previous scenario, we create a migration and write write a custom function (or forwards_func) to move all data that was store on Address table to NewAddress table.

It looks like this:

GitHub

# -*- coding: utf-8 -*-
from __future__ import unicode_literals

from django.db import migrations, models
import django.db.models.deletion

def migrate_data(apps, schema_editor):
    Client = apps.get_model('address_book', 'Client')
    NewAddress = apps.get_model('address_book', 'NewAddress')

    for c in Client.objects.all():
        for a in c.addresses.all():
            NewAddress.objects.create(street_name=a.street_name, client=c)

class Migration(migrations.Migration):

    dependencies = [
        ('address_book', '0001_initial'),
    ]

    operations = [
        migrations.CreateModel(
            name='NewAddress',
            fields=[
                ('id', models.AutoField(auto_created=True, primary_key=True, serialize=False, verbose_name='ID')),
                ('street_name', models.CharField(max_length=255)),
                ('client', models.ForeignKey(on_delete=django.db.models.deletion.CASCADE, to='address_book.Client')),
            ],
        ),
        migrations.RunPython(migrate_data, reverse_code=migrations.RunPython.noop)
    ]

We can now remove our old Address model, then run makemigrations and rename your NewAddress model to Address. It’s important to run makemigrations in between these two operations, otherwise Django will treat NewAddress as if it was the old Address.

Note that even using migrations.RunPython.noop, this migration is not reversible because we’re deleting and renaming models. If you need it to be reversible, you should write another custom function that does the rollback.

Conclusions

These operations can be very harmful to your data if used improperly, so it’s always a good idea to have a database dump before doing something like this in production. These scenarios we used as example are a bit silly, but it’s enough to get the idea of how to apply data migrations to solve data consistency issues.

You can also make fixtures for loading some initial data to your apps or just importing a text file on any of your data migrations and running a forwards function to get them in your database.

As Sir Oliver Wendell Holmes would say, The young man knows the rules, but the old man knows the exceptions. So if you know any exceptions, please let us know, help is very much appreciated.

The code used in this blogpost is available on GitHub.

Good references

About the author.