Add XML to PostgreSQL from Python

Posted by in Software

One of the projects I worked on was to import a large number of XML files in a PostgreSQL database (as XML files). I chose python to do it. Here are the steps:


The data source is the SCOPUS database which has a silly number of entries (approx 20,000,000), most containing 2 XML files (article and citations). The publications with no citations don’t have the citations.xml file.

Therefore, initial database has a single table named scpous_data with the following fields:

  1. id – string of 20 characters. This is the primary key too
  2. article – the place where we store the article XML
  3. citations – the place where we store the citations xml

The exercise was to fill up the table with data.


I’m connecting to the DB using psycopg2. Since this was a test to see the speed of importing, I chose to hard-code the connection parameters:

Insertion statements

Once the tests of connecting to the DB were done, I’ve built the insertion commands:

Inserting a record

Once the upsert string is defined (and correct), we can design the insert function:

This method will be quite slow because we perform the commit. A better approach is to batch several inserts together and *then* commit.

Load records

To load a record from the physical medium into the DB, I defined a simple loader function:

This takes advantage the last component of the path is the article ID. Otherwise, I’d have to look for the ID in the XML file itself (or extract it from the XML file’s name).

To load the sample onto the disk, I just did an os.walk():

Then, I wrapped everything in a main() function:

Bonus: XML query through python

PostgreSQL is almost as nice as Oracle when it comes to XML, but not nearly as quick (or so they say). However, you can do nice queries in XML files and get the results in python as from any query. Below are a couple of examples.

First one gets the eid of the citing document from the citedby column and presents the result as a list:

Second example gets all the references from an article XML. Since the article XML has namespaces, they must be specified.


This proved to work but it was quite slow.

PostgreSQL is not as powerful as Oracle with XML (Oracle offers e.g. indexing), so queries would be quite slow.

To speed up queries, the data would need to be transformed in a relational model, so the effort for loading XML (and practically duplicating the data) is not really justified.

I attempted to tweak/fix the loading process via:

  • cursor.setinputsizes(cx_Oracle.CLOB, cx_Oracle.CLOB, 20)
  • cursor.prepare(UPDATE_STATEMENT)


A little experiment: If you find this post and ad below useful, please check the ad out 🙂