Skip to content

Use XPath in PL/PGSQL for Batch Processing

March 5th, 2016 - SoftwareTutorial

A while ago I had to process XML loaded in a PostgreSQL database. The task was to extract references from a publication structure. Fortunately, PostgreSQL allows you to use XPath in plpgsql, but there are some things you need to be aware.

First, you need to note all namespaces:

nsarray := ARRAY [
    ARRAY['xocs','http://www.elsevier.com/xml/xocs/dtd'],
    ARRAY['ce',  'http://www.elsevier.com/xml/ani/common'],
    ARRAY['ait', 'http://www.elsevier.com/xml/ani/ait'],
    ARRAY['cto', 'http://www.elsevier.com/xml/cto/dtd'],
    ARRAY['xsi', 'http://www.w3.org/2001/XMLSchema-instance']
];

Then, you can use the xpath() function:

xpath(
    '//xocs:doc/xocs:item/item/bibrecord/tail/bibliography/reference/@id',
    entry,
    nsarray
)

…but that’s no fun, particularly when you are almost guaranteed to get multiple hits. It’s better to use:

FOR xp in SELECT * from unnest(xpath(
    '//xocs:doc/xocs:item/item/bibrecord/tail/bibliography/reference/@id',
    entry.article, nsarray)) LOOP

    -- Do something with the record
    --
    refids := refids + 1;
END LOOP;

A full implementation looks like this:

CREATE OR REPLACE FUNCTION cs_refresh_mviews() RETURNS integer AS $$
DECLARE
    entry RECORD;
    xp RECORD;
    count INTEGER;
    refids INTEGER;
    nsarray varchar array;
begin

    nsarray := ARRAY [
        ARRAY['xocs','http://www.elsevier.com/xml/xocs/dtd'],
        ARRAY['ce',  'http://www.elsevier.com/xml/ani/common'],
        ARRAY['ait', 'http://www.elsevier.com/xml/ani/ait'],
        ARRAY['cto', 'http://www.elsevier.com/xml/cto/dtd'],
        ARRAY['xsi', 'http://www.w3.org/2001/XMLSchema-instance']
    ];
    count := 0;
    refids := 0;
    FOR entry IN SELECT * FROM "ERC_ERIS_XML_DATA" LOOP

        FOR xp in SELECT * from unnest(xpath(
          '//xocs:doc/xocs:item/item/bibrecord/tail/bibliography/reference/@id',
            entry.article, nsarray)) LOOP
            refids := refids + 1;
        END LOOP;

        count := count + 1;

        IF (count % 100) = 0 THEN
        raise notice 'ID: (%)', entry.id;
    END IF;
    END LOOP;
    raise notice 'Rows processed: (%)', count;
    raise notice 'REFS processed: (%)', refids;

    RETURN 1;
end;
$$ LANGUAGE plpgsql;

select cs_refresh_mviews() as output;

It gets all XML documents stored in a table with the column name article. For each document, get the references and increment a value. At the end we’ll get messages on how many items we had in the table and how many references.

For more information, go to the documentation.

Share on
Reddit
Linked in
Whatsapp

A little experiment: