Use XPath in PL/PGSQL for Batch Processing

Posted by in Software, Tutorial

A while ago I had to process XML loaded in a PostgreSQL database. The task was to extract references from a publication structure. Fortunately, PostgreSQL allows you to use XPath in plpgsql, but there are some things you need to be aware. First, you need to note all namespaces:

Then, you can use the xpath() function:

…but that’s no fun, particularly when you are almost guaranteed to get multiple hits. It’s better to use:

A full implementation looks like this:

It gets all XML documents stored…read more

Migrate PostgreSQL Between Two Machines

Posted by in IT, Tutorial

TL;DR: Once you prepare the connection, pg_dump is your friend. Recently, we had to migrate a large-ish DB (1.5 TB) from a Windows installation to a Linux one in order to accommodate a series of third party extensions. We have split the process in three: Prepare the migration Migrate users Migrate the database Wrap up Let’s say we have the two machines: SourceMachine – the machine containing the source DB (the Windows machine in my case) DestinationMachine – the destination machine Note: All PostgreSQL commands are symmetrical in the sense…read more

Add XML to PostgreSQL from Python

Posted by in Software

One of the projects I worked on was to import a large number of XML files in a PostgreSQL database (as XML files). I chose python to do it. Here are the steps: Database The data source is the SCOPUS database which has a silly number of entries (approx 20,000,000), most containing 2 XML files (article and citations). The publications with no citations don’t have the citations.xml file. Therefore, initial database has a single table named scpous_data with the following fields: id – string of 20 characters. This is the…read more

Add Primary Key Column to PostgreSQL Table

Posted by in Software

The other day I got a question: How to add a numeric PK column to an already existent table? A quick look on the net showed that it is indeed possible to add an auto-incrementing PK to an already existing table (if you don’t have already a PK that is!). The code is quite simple too! The statement:

will add an autoincrementing column named id to my table_without_pk. HTH, PS: Thanks to a couple of colleagues of mine for asking the question and working together to finding the answer…read more

Fuzzy String Matching in PostgreSQL

Posted by in IT

A database I am working on has a lot of records (40M+) and the guys mining it are currently using python to process it. This is very powerful (nltk) but also quite slow. One of the guys has found that you can do fuzzy string matching straight in Postgres and the feature is quite advanced! fuzzystrmatch is a PostgreSQL module providing functions to determine similarities and distances between strings. pg_trgm is a PostgreSQL module to determine the similarity of ASCII alphanumeric text based on trigram matching. This module offers indexing…read more

Postgres: Show Schema Sizes in a Database

Posted by in Software

While loading a large database from files to a Postgres schema, I found very useful to know how much space on disk it actually occupies (just in case I need to re-size the disk). I found this answer online:

Which results in a list like: HTH, PS: All credit goes to this StackOverflow answer! A little experiment: If you find this post and ad below useful, please check the ad out :-)

Load a CSV File with Header in Postgres via Psycopg

Posted by in Software

My task is to create a script to load a bunch of files into a DB model. The files are CSV, with quoted strings and headers. Postgres has a COPY command which I find fabulous for loading data. unfortunately, the psycopg implementations copy_from() , copy_to() and copy_expert() have very few examples and I found their usage a bit of a challenge. Solution Initially, I thought of opening the file, parsing it myself and perform a copy_from() like in the other script dealing with XML (aka lots of work). I had…read more

Insert in PostgreSQL in (Almost) Constant Time

Posted by in Software

One of the things I did was to write a script to insert records in a PostgreSQL database. The records have with a quite simple structure: id – char(20) content – XML field aside – XML field I was working with a sample of 10,000 entries out of 40 million and the script went through several transformations: Version 1 – synchronised INSERT, pretty-printed DOM(*) Version 2 – synchronised INSERT, CR/LF-ed DOM(*) ( xml_source_string.replace("><", "> <") Version 3 – buffered INSERT (commit every e.g. 1000 entries), CR/LF-ed DOM (*) – PostgreSQL XML type…read more

PostgreSQL: Call a Stored Procedure for Every Row

Posted by in Software

I guess that newbies in PostgreSQL like me have first question when they create a store procedure: How do I call it? The answer is simple; you just execute

. Then, usually immediately the second question pops up: I wrote a procedure which I need to call for each row in a table. How do I do that? The answer is, just as simple; you execute

. A Short Explanation Let’s assume a table:

and a procedure:

The procedure is designed to be called for each row…read more