Save Pandas DataFrame as Django Model
TL;DR: use bulk_create() coupled with a generator expression to speed up loading.
Initially, I started by converting the data frame to Model objects row by row and saving them. While easy, it's also arguably the slowest way to load records, because each save() call performs a commit. Then I looked up how to save pandas data frames and found quite a few ways…
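For reference, the row-by-row version looks something like this (a sketch; Entry stands in for whatever model the rows are loaded into):

```python
def save_slow(df):
    # to_dict('records') yields one dict per dataframe row;
    # each save() issues its own INSERT and its own commit
    for item in df.to_dict('records'):
        Entry(**item).save()
```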
Manual commit
The first step would be to tweak the transaction handling to commit manually, as described here:

```python
from django.db import transaction

@transaction.commit_manually
def save(df):
    ...
    for item in df.to_dict('records'):
        entry = Entry(**item)
        entry.save()
    transaction.commit()
```
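Note that commit_manually was removed in Django 1.8; on current versions the same one-commit-at-the-end behaviour comes from transaction.atomic. A minimal sketch:

```python
from django.db import transaction

def save(df):
    # All saves commit together when the atomic block exits
    with transaction.atomic():
        for item in df.to_dict('records'):
            Entry(**item).save()
```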
SQL
Another solution would be to use raw SQL. Here, you need to replicate the table fields by tweaking the data frame.

```python
from django.conf import settings
from sqlalchemy import create_engine

user = settings.DATABASES['default']['USER']
password = settings.DATABASES['default']['PASSWORD']
database_name = settings.DATABASES['default']['NAME']

database_url = 'postgresql://{user}:{password}@localhost:5432/{database_name}'.format(
    user=user,
    password=password,
    database_name=database_name,
)

engine = create_engine(database_url, echo=False)
df.to_sql(model._meta.db_table, con=engine)
```
Where:

- model._meta.db_table is the table where we want to save
- database_url is an SQLAlchemy database URL (above configured for PostgreSQL)
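One detail to watch: to_sql fails by default if the table already exists, and it writes the dataframe index as an extra column. In practice the call will likely need a couple of extra arguments:

```python
# Append to the existing table and skip the dataframe index column
df.to_sql(model._meta.db_table, con=engine, if_exists='append', index=False)
```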
This option has a few drawbacks:

- The URL needs to be manually created. If you have different DB engines for dev/test, you'll need to cater for that (see the sketch after this list)
- You'll need to add a pk column with incremental values
- The code needs to be changed for each migration. Ouch!
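One way to handle the first drawback is to derive the URL from the configured engine. This is only a sketch, assuming just PostgreSQL and SQLite are in play and that HOST and PORT are set explicitly in settings:

```python
from django.conf import settings

def database_url():
    db = settings.DATABASES['default']
    # str.format ignores the extra keys in the settings dict
    if 'postgresql' in db['ENGINE']:
        return 'postgresql://{USER}:{PASSWORD}@{HOST}:{PORT}/{NAME}'.format(**db)
    if 'sqlite' in db['ENGINE']:
        return 'sqlite:///{NAME}'.format(**db)
    raise ValueError('Unsupported engine: %s' % db['ENGINE'])
```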
Using bulk_create()
More recent versions of Django have added bulk_create(), which takes an iterable of model objects and creates them all at once. The code would be something like:

```python
Model.objects.bulk_create(
    Model(**vals) for vals in df.to_dict('records')
)
```
The operation works like this:

- Convert the dataframe into a list of dicts
- Create a Model object for each item in the list
- Perform the bulk_create() on the list
This approach has some disadvantages too:
- It creates objects (not updates) so you can end up with duplicates if you’re not careful
- You need to make sure the dict generated by a row can be used to create the model object
- It’s slower than the SQL method above
- You might end up with lots of memory allocated if the dataframe is very large
I find these disadvantages minor compared to the simplicity of the approach, and the duplicate and memory issues can be softened with batching, as sketched below.
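A sketch of that batching, using islice to keep only one batch of objects in memory at a time; ignore_conflicts needs Django 2.2+ and a backend that supports it:

```python
from itertools import islice

def save_in_batches(df, batch_size=1000):
    rows = (Model(**vals) for vals in df.to_dict('records'))
    while True:
        batch = list(islice(rows, batch_size))
        if not batch:
            break
        # ignore_conflicts skips rows that would violate unique constraints
        Model.objects.bulk_create(batch, ignore_conflicts=True)
```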
Adapting the DataFrame to Match the Model
My approach to making sure the dict generated from a dataframe's row can be used to create a Model object is:

- Note the model's fields which are not nullable
- Rename the existing columns to match the field names
- Convert timestamp values to datetime values, as shown below
- Create any missing columns

```python
df['timestamp'] = pd.to_datetime(df['timestamp'], unit='s')
```
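Putting those steps together, a minimal sketch (the column and field names here are made up for illustration):

```python
import pandas as pd

def adapt(df):
    # Rename dataframe columns to the model's field names
    df = df.rename(columns={'ts': 'timestamp', 'val': 'value'})
    # Unix epoch seconds to datetime, as a DateTimeField expects
    df['timestamp'] = pd.to_datetime(df['timestamp'], unit='s')
    # Create missing columns, giving non-nullable fields a default
    if 'notes' not in df.columns:
        df['notes'] = ''
    return df
```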