Use XPath in PL/PGSQL for Batch Processing
2 min read

Use XPath in PL/PGSQL for Batch Processing

Use XPath in PL/PGSQL for Batch Processing

A while ago I had to process XML loaded in a PostgreSQL database. The task was to extract references from a publication structure. Fortunately, PostgreSQL allows you to use XPath in plpgsql, but there are some things you need to be aware.

First, you need to note all namespaces:

nsarray := ARRAY [
    ARRAY['xocs','http://www.elsevier.com/xml/xocs/dtd'],
    ARRAY['ce',  'http://www.elsevier.com/xml/ani/common'],
    ARRAY['ait', 'http://www.elsevier.com/xml/ani/ait'],
    ARRAY['cto', 'http://www.elsevier.com/xml/cto/dtd'],
    ARRAY['xsi', 'http://www.w3.org/2001/XMLSchema-instance']
];

Then, you can use the xpath() function:

xpath(
    '//xocs:doc/xocs:item/item/bibrecord/tail/bibliography/reference/@id',
    entry,
    nsarray
)

...but that's no fun, particularly when you are almost guaranteed to get multiple hits. It's better to use:

FOR xp in SELECT * from unnest(xpath(
    '//xocs:doc/xocs:item/item/bibrecord/tail/bibliography/reference/@id',
    entry.article, nsarray)) LOOP

    -- Do something with the record
    --
    refids := refids + 1;
END LOOP;

A full implementation looks like this:

CREATE OR REPLACE FUNCTION cs_refresh_mviews() RETURNS integer AS $$
DECLARE
    entry RECORD;
    xp RECORD;
    count INTEGER;
    refids INTEGER;
    nsarray varchar array;
begin

    nsarray := ARRAY [
        ARRAY['xocs','http://www.elsevier.com/xml/xocs/dtd'],
        ARRAY['ce',  'http://www.elsevier.com/xml/ani/common'],
        ARRAY['ait', 'http://www.elsevier.com/xml/ani/ait'],
        ARRAY['cto', 'http://www.elsevier.com/xml/cto/dtd'],
        ARRAY['xsi', 'http://www.w3.org/2001/XMLSchema-instance']
    ];
    count := 0;
    refids := 0;
    FOR entry IN SELECT * FROM "ERC_ERIS_XML_DATA" LOOP

        FOR xp in SELECT * from unnest(xpath(
          '//xocs:doc/xocs:item/item/bibrecord/tail/bibliography/reference/@id',
            entry.article, nsarray)) LOOP
            refids := refids + 1;
        END LOOP;

        count := count + 1;

        IF (count % 100) = 0 THEN
        raise notice 'ID: (%)', entry.id;
    END IF;
    END LOOP;
    raise notice 'Rows processed: (%)', count;
    raise notice 'REFS processed: (%)', refids;

    RETURN 1;
end;
$$ LANGUAGE plpgsql;

select cs_refresh_mviews() as output;

It gets all XML documents stored in a table with the column name article. For each document, get the references and increment a value. At the end we'll get messages on how many items we had in the table and how many references.

For more information, go to the documentation.