Synchronizing Plone Content with an ODBC database using pyodbc

by Martin Preisler last modified Dec 30, 2008 03:03 PM
Describes how to synchronize via ODBC in Microsoft Windows systems. The library used for getting the data via ODBC is pyodbc.

What exactly do you mean by synchronization?

Imagine that you have a working intranet Plone-based application and you want certain data, for example a list of customers with addresses, to be grabbed from existing database system based on MS Access (or any other ODBC capable database).

One way to do that is to execute a method of the folder holding the Plone customer entry objects. The method can grab the whole list of customers via ODBC, and iterate over them checking whether customer of such id already exists or not. If it exists, just update the data; if it doesn't exist, create it from scratch via self.invokeFactory( "Customer", customer_id ).

Please note that even though customer's name (title) can be changed, the object's id (or any other credential you know is immutable) must not. That's because we need a way to determine that such customer is already in the database in order to avoid creating duplicate instances for the customer. Also, deleting every customer in the list and adding them from scratch isn't a good option: it's too slow and it breaks references and integrity (your ZODB will grow into gigabytes very quickly unless you do packing).

Sound good, how do you do it then?

The idea is pretty simple, first get all entries we already have in our folder and then just get all rows via ODBC and iterate over them. In each iteration, iterate over data we already have and find if there is a matching customer; if so, update it's data, if none is found, create a new one and update it's data.

Is it multi-platform?

Of course, you're probably not running MS Access or SQL Server on Linux or anything else non-MS. But, you may of course connect remotely from UNIX machine to a Windows one holding the ODBC database.

How does it perform?

Performance for this solution isn't excellent, but I found it acceptable if done at night. I synchronize 300 entries, each with 20+ columns and it takes 8-15 minutes each time it's run. Machine load is very low during that time, I am not sure what slows it down. (ODBC Tracing was enabled, with disabled tracing it takes barely a minute.)

Let's get started

Installing pyodbc

Head over http://pyodbc.sourceforge.net and download it, place it into your site-packages directory in either global Python path or bundled Python if you installed Plone/Zope via the unified installer.

Adding ODBC system source

Start->Control Panel->Administrative tools->Data sources (ODBC)

Switch to System DSN tab and add source to your MS Access (or any other ODBC capable) database. Select default user name and password to make things easier (this is a security risk, tighten this up for a non-development server

).

 

Querying the data from python

ODBC accepts SQL commands, so to get the data you have to perform SQL query.

Insert this method into your folderish container you want synchronized (CustomerList in my case):

    def synchronize( self ):
        ret = ""
        
        contents = self.listFolderContents()

        ret = ret + "\nConnecting to database via ODBC"
        
        import pyodbc
        connection = pyodbc.connect( "DSN=INSERT_YOUR_DSN_HERE;Trusted_Connection=yes" )
        cursor = connection.cursor()

        ret = ret + "\nExecuting query - SELECT * FROM Customers ORDER BY Name"
        cursor.execute( "SELECT * FROM Customers ORDER BY Name" )      
        ret = ret + "\nQuery executed, iterating over the resulting rows"

        plone_utils = getToolByName( self, 'plone_utils', None )

        for row in cursor:
            ret = ret + "\nI am at entry:" + str( getattr( row, "Name" ) )
              
            existing = False
            existing_zodb_object = None
            for entry in contents:
                # if the entry is already there, just edit it's properties
                if entry.getName() == getattr( row, "Name" ):
                    existing = True
                    existing_zodb_object = entry
                    break

            zodb_object = None
            if existing:
                ret = ret + "\n ... entry is already there, updating it's properties"
                zodb_object = existing_zodb_object
            else:
                ret = ret + "\n ... entry isn't in the system, creating it and updating it's properties"
                new_id = plone_utils.normalizeString( getattr( row, "Name" ) )
                zodb_object = self.invokeFactory( "Customer", new_id )
                zodb_object = getattr( self, new_id )
                zodb_object.setName( getattr( row, "Name" ) )
            
            # now set your properties to zope object
            zodb_object.setTitle( getattr( row, "Name" ) )
            zodb_object.reindexObject()
            
            
        connection.close()
        ret = ret + "\nSynchronization ran successfully!"
        return ret

Why did I choose pyodbc?

It was the only free and up-to-date solution, mxODBC is commercial, and other packages like Zope ODBC adapter are outdated or don't suit my needs. And because "it just works" (tm).

Improvements

This is just a code snippet to point you to the right direction. If you're database is substantial, you'll need to make improvements. For example, this script loads all the objects in the folder into memory at once. This is a bad practice performance wise, but it doesn't matter for my application, because I only do synchronization once a day. But if you want better performance or have a large number of items in your database, use portal_catalog, index the immutable property, and use that index to find objects.