TL;DR: Here I present how to insert data with python in the table created at the previous step.
Table of Contents
- Part 1 - Prerequisites
- Part 2 - Create a table and perform a SELECT
- Part 3 - User Python to insert data
- Part 4 - references
In the previous entry, I've finished by creating the table, loading an entry manually and query it. Everything was done via SQL, so repeating the same process for a large number of XMLs is not really an idea I'd entertain too much. As my language of choice is python, I've decided to write a script to scan directories for the relevant files and upload them accordingly.
To make the script more versatile, I'm using argparse:
import argparse #... def arguments(): parser = argparse.ArgumentParser( description='Compute sizes for pattern-based items.' ) parser.add_argument( '-d', '--dir', action='store', type=str, default='.', help='the directory (default .)' ) args = parser.parse_args() return args
This will allow to specify a custom starting folder, in case I want to import more/all records later.
The Database Connection
Connecting to the Oracle DB in python is easy if you have all credentials:
def connect(): dsn = cx_Oracle.makedsn('server.na.me', port, 'SID') connection = cx_Oracle.connect(user='user', password='password', dsn=dsn) return connection
I've tried before using cx_Oracle.Connection() but it didn't work. This works for meTM.
Walk the files
My sample structure is:
+- Samples +- 00000000000000000001 +- 00000000000000000001.xml \- citedby.xml +- 00000000000000000002 +- 00000000000000000002.xml \- citedby.xml ... \- 00000000000000000999 +- 00000000000000000999.xml \- citedby.xml
So, to accommodate even more layers in the future (e.g. a per-year grouping), The code looks like this:
for root, dirs, files in os.walk(args.dir): # Only care about leaf folders. # if len(dirs) == 0: identifier = os.path.split(root)[-1] article = open(os.path.join(root, "%s.xml" % identifier)).read().strip() # the citations file is optional # try: citations = open(os.path.join(root, "citedby.xml")).read().strip() except IOError: citations = '<?xml version="1.0" encoding="UTF-8"?><cited-by><count>0</count></cited-by>'
This code will take care of reading the XML files and give me the identifier, article and citations strings.
Insert the XML into Oracle DB
Now that we have the table created in the previous article, we can connect to the DB from python and that we are able to scan the directories and load the files, the next step is to insert the read contents into the DB.
Prepare the cursor
To do this, we need to first create a cursor:
connection = connect() cursor = connection.cursor() cursor.setinputsizes(20, cx_Oracle.CLOB, cx_Oracle.CLOB) cursor.prepare('insert into ARTICLES_XML_DATA(id, article, citedby) ' + 'values (:1, xmltype(:2), xmltype(:3))')
The cursor.setinputsizes() function gives some hints as to the size and type (at some extent) of the data that will be loaded. In our case, we'll need the first field to be of a length of 20 (characters) and the subsequent fields are of CLOB types (therefore of variable length).
I chose to use the couple cursor.prepare(), cursor.insert(None, data) rather than cursor.insert('sql string', data) command directly, just to have the code clearer. The statement would be the same.
Important! If you're getting errors trying to insert XMLTYPE fields, the above "xmltype(:var)" is what you're looking for. by default, the above bound variables would be VARCHAR(20), CLOB and CLOB respectively while we want VARCHAR(20), XMLTYPE and XMLTYPE. Oracle offers a cast-like function.
Loading the data
Once our cursor is prepared, all we need to do is to use it. So, inside the loop, just after the citations processing, insert:
# Add the data to the DB try: cursor.execute(None, (identifier, article, citations)) connection.commit() except cx_Oracle.IntegrityError, e: if ("%s" % e.message).startswith('ORA-00001:'): print >> sys.stderr, "Entry already there: ", identifier else: raise e except cx_Oracle.DatabaseError, e: if ("%s" % e.message).startswith('ORA-31011:'): print >> sys.stderr, "Parsing failed: ", identifier else: raise e
I've complicated this a bit to handle duplicated records and output XML parsing issues. The simple variant is:
# Add the data to the DB # cursor.execute(None, (identifier, article, citations)) connection.commit()
Once you're done with the loop close the cursor via
cursor.close() and when you're finished with the whole processing, close the DB connection too (
Loading data from python into Oracle DB is doable for exotic types like XMLTYPE and the combination is very powerful. One can combine these technologies with e.g. django and REST to get very interesting results :)