#1: Avoid duplicate entries at import
Add a check to the import process that determines, folder-wide or portal-wide, whether each bibliographical reference about to be imported duplicates an existing one, and if so lets the user decide which action to take.
- Proposed by: David Convent
- Seconded by: Raphael Ritz
- Proposal type: User interface, Architecture
- Assigned to release:
- State: in-progress
Motivation
In a system where bibliographical references are added by importing them from (large) files, the risk of creating duplicate entries is high. We want to reduce that risk.
It would probably be difficult to prevent people from creating duplicates when they add bibliographical references manually, but we want to improve the control at import time.
Assumptions
The deadline for implementing the functionality is set to July 15th (2005), so we don't have much time. Therefore we have to think in terms of basic, extensible functionality rather than comprehensive use cases.
We need the process to be as general as possible. It has to work independently of how specific parsers operate. Parsers are modules that can be added to the bibliography tool: we developed specific ones for our needs, and we are surely not, or will not remain, the only ones to do so.
In the current version of the product, an import report is written to a property of the folder into which the data was imported. This report has to reflect the changes made to implement the duplicate check flow.
Proposal
Basically, the user will choose the file to import and its corresponding format from the import form, where he will also have to specify whether the imported entries should be checked for duplicates folder-wide or portal-wide.
Currently, when the form is submitted, the import script asks the bibliography tool to use the appropriate parser and return all entries as a list of dictionaries. The idea is to keep this behavior: once it has the entries to import in a format that is easy to manipulate, the script can check for each entry whether it should be considered as already present in the system.
The check for existing entries must be performed against several criteria: for instance, we want the tool to check whether the bibliography type (article, book, etc.), the authors, the title and the publication year of an entry match any entry already in the system.
The criteria list will be defined portal-wide (at least in the first implementation) and must be editable from the configuration panel of the bibliography tool, so that the portal manager can change it without touching the code. Criteria should be editable for every bibliography reference type.
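As an illustration of the check described above, here is a minimal sketch in plain Python. It assumes that criteria are stored as a list of field names per reference type; the type names, the field names (reference_type, authors, title, publication_year) and the helper names are hypothetical and not taken from the product.

```python
# Hypothetical per-type criteria, as the portal manager might configure them.
DUPLICATE_CRITERIA = {
    "ArticleReference": ("authors", "title", "publication_year"),
    "BookReference": ("title", "publication_year"),
}


def normalize(value):
    """Compare strings case-insensitively and ignore surrounding whitespace."""
    if isinstance(value, str):
        return value.strip().lower()
    return value


def is_duplicate(entry, existing, criteria=DUPLICATE_CRITERIA):
    """Return True if `entry` matches `existing` on every configured field."""
    ref_type = entry.get("reference_type")
    if ref_type != existing.get("reference_type"):
        return False
    fields = criteria.get(ref_type, ())
    if not fields:
        # No criteria configured for this type: never flag a duplicate.
        return False
    return all(
        normalize(entry.get(field)) == normalize(existing.get(field))
        for field in fields
    )
```

Keeping the criteria as data rather than code is what would let a configuration panel edit them; the normalization step is an extra assumption and could be dropped or made configurable too.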
If no existing entry matches the criteria (in the current bibliography folder or in the portal, depending on what was chosen in the import form), a new entry is added to the current folder and the import process continues.
But if one or more existing references match the search criteria, we want the system to ask the user what to do.
We have defined four required options so far; a fifth one would be nice to have (it has to be scheduled depending on the time left for coding).
Must have:
- delay import: The user is not in a position to take a decision; he wants to keep the new entry data and be able to come back to it later.
- skip creation: The user decides that the entry that is about to be imported is a duplicate; he doesn't want it to be imported.
- replace entry: The user decides that the new entry is an update of the existing one. He wants all values of the old one to be replaced by the new ones. (In this case the old entry must keep its UID.)
- force creation: The user decides that the entry that is about to be imported is not a duplicate; he wants it to be imported as a new entry.
Nice to have:
- update entry: The user decides that some (choosable) fields of the existing entry must be updated, but others should be kept in the old state.
Once all entries have been added, skipped, replaced or delayed, the flow is finished.
Implementation
We came across two main possible approaches for implementing the improvement:
The first one is to have the system ask the user what action to take while the file is being imported (during the import flow). The other one is to let the import flow run to completion and then ask the user what to do with the entries flagged as possible duplicates.
Even though the first option has good points, we ultimately think that the second one is best.
The first approach (the action on a duplicate is taken while the file is being imported) is attractive in terms of user interface simplicity, but it has serious drawbacks that have to be highlighted.
One problem is obvious when importing a file that contains a large number of entries that the system will consider duplicates. If the user gets disconnected, he risks not being able to resume his import at the point where he left off. If he then tries to re-import the same file, all entries already imported will match the ones created during the first upload, and he will have to choose to skip the import for each of them before reaching the point where the first import was interrupted.
With the second solution, I think we can build a system that is flexible enough. Here is how I see it:
Bibliography containers should have an attribute called _duplicates (a simple attribute should be enough; there is no need to play with the properties of a Zope PropertyManager) that stores a list of dictionaries holding the duplicate entries until the user has chosen an action, plus a simple API to manipulate that attribute (with security). As formatted bibliographical references are transformed by the parser into Python dictionaries during the import process, it will be very easy to add them to _duplicates.
So for now we have this: the import script asks the bibliography tool to use the correct parser to get all entries as a list of dictionaries. Then it iterates over that list and, for each entry, checks whether any entry (in the current folder or in the portal) matches the duplicate criteria. If not, it calls the import method of the folder and continues; if so, it adds the entry to the _duplicates list of the current bibliography folder.
In the _duplicates list, every dictionary has a value holding the UID(s) (Archetypes UID) of the matching reference(s), so it will be easy to find them again.
Once the file is processed, the user is redirected to a page where he can take a decision for each duplicate. Ideally, it should be possible to organize this in batch mode.
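The import loop and the _duplicates storage described above could be sketched as follows. This is only an illustration in plain Python, not the CMFBib code: BibliographyFolder is a stand-in class, and add_entry, find_matches, import_entries, the matching_uids key and the two fields compared are all hypothetical names chosen for the example.

```python
class BibliographyFolder:
    """Stand-in for a bibliography container; not the real CMFBib class."""

    def __init__(self):
        self.entries = {}      # UID -> entry dict
        self._duplicates = []  # pending entries awaiting a user decision
        self._next_uid = 1

    def add_entry(self, entry):
        """Store a copy of `entry` under a fresh (fake) UID and return it."""
        uid = "uid-%d" % self._next_uid
        self._next_uid += 1
        self.entries[uid] = dict(entry)
        return uid


def find_matches(entry, folder, fields=("title", "publication_year")):
    """Return the UIDs of stored entries equal to `entry` on all `fields`."""
    return [
        uid
        for uid, existing in folder.entries.items()
        if all(entry.get(f) == existing.get(f) for f in fields)
    ]


def import_entries(parsed_entries, folder):
    """Import non-matching entries; park matching ones in `_duplicates`."""
    report = {"imported": 0, "duplicates": 0}
    for entry in parsed_entries:
        matches = find_matches(entry, folder)
        if matches:
            pending = dict(entry)
            pending["matching_uids"] = matches  # remember what it collided with
            folder._duplicates.append(pending)
            report["duplicates"] += 1
        else:
            folder.add_entry(entry)
            report["imported"] += 1
    return report
```

The returned counts are one way the existing import report could reflect the new flow. Note that in this sketch an entry can also collide with one imported earlier from the same file, since matching runs against the folder as it grows.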
For every entry in the _duplicates list:
- If the import is delayed, nothing is done.
- If the import is skipped, the dictionary is deleted from the _duplicates list.
- If the import should replace the old one (assuming the user has edit permissions for the old one), the old one is updated and its UID is removed from the duplicate dictionary. If the UID list is then empty, the dictionary is removed from the _duplicates list.
- If the import is forced, a new entry is added to the folder and the corresponding dictionary is removed from the _duplicates list.
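The four cases above can be sketched as a single function. This is a minimal illustration under stated assumptions: entries is a plain dict mapping UID to entry data, each pending dictionary carries its matching UIDs under a hypothetical matching_uids key, the action names are illustrative, and permission checks are left out.

```python
def resolve_duplicate(entries, duplicates, pending, action, new_uid=None):
    """Apply one user decision to `pending`, an item of `duplicates`.

    `entries` maps UID -> entry dict. `action` is one of "delay", "skip",
    "replace" or "force"; these names are illustrative only.
    """
    # The data to store, without the bookkeeping key.
    data = {k: v for k, v in pending.items() if k != "matching_uids"}
    if action == "delay":
        return  # keep the pending entry untouched for a later decision
    if action == "skip":
        duplicates.remove(pending)
    elif action == "replace":
        # Overwrite one matching entry in place, so its UID is preserved.
        uid = pending["matching_uids"].pop(0)
        entries[uid] = data
        if not pending["matching_uids"]:
            duplicates.remove(pending)
    elif action == "force":
        if new_uid is None:
            raise ValueError("force needs a UID for the new entry")
        entries[new_uid] = data
        duplicates.remove(pending)
    else:
        raise ValueError("unknown action: %r" % (action,))
```

In a real container the UID for a forced entry would come from the framework rather than from the caller; passing it in simply keeps the sketch self-contained.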
One advantage of this system is that it allows the user to delay his choice. If he can't make his choice or if he gets 'disconnected' from the portal (for any reason), he can get back to it later, and a portlet will inform him that there are some duplicate entries left in a bibliography folder where he imported references, with a link to the same page he was redirected to after importing the references.
It also has the advantage of a complete rollback if an error occurs while parsing/importing the data from the file into the Zope system.
Deliverables
A configuration page in the bibliography tool that lets portal administrators choose, for each bibliographical reference type, the criteria used to determine whether an entry about to be imported matches any existing one(s) (folder-wide or portal-wide).
A page from which duplicate entries can be managed (skip, delay, replace, etc.).
A portlet that tells users whether there are duplicate entries (that they have access to) and where they can manage them (take the decision on what to do with them).
Progress log
Update as of May 30, 2006: we have it basically working in CMFBib trunk. The only thing still missing is more
robust backwards compatibility. The one I've implemented so far turned out to be too simple-minded :-(
Participants
Raphael Ritz
David Convent
Denis Frère
Alain Spineux

Two points, however, I would like to note:
1. Storage: for the intermediate storage of the duplicates I propose to use Zope 3 style
annotations, i.e., use a dictionary called '__annotations__' (accessible via 'getAnnotations')
and use this dict as storage for the 'duplicates', accessed then as
self.getAnnotations()['duplicates']
2. The criteria according to which the checking is done should be configurable on the tool
and easily extensible, e.g., by allowing custom (skin) methods to be added. These criteria
should implement a defined interface, which could be as simple as taking exactly one
input argument (the entry dict) and returning true or false (simple) or the matching UID(s)
(recommended).
    # Example of a custom criterion returning a list of matching UIDs.
    # Assumes it runs as a skin script, where `context` is available.
    def checkPMID(entry):
        pmid = entry.get('PMID', None)
        if pmid is None:
            return []
        catalog = context.portal_catalog
        results = catalog(PMID=pmid)
        return [r.UID for r in results]  # assuming UID is in the catalog metadata
It should then be possible to define a policy like
match = checkPMID OR checkAuthors AND checkTitle AND checkYear ...
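A minimal sketch of how such a policy could be composed, assuming every criterion takes the entry dict and returns a list of matching UIDs: OR then becomes a union and AND an intersection. The helpers any_of, all_of and by_field are hypothetical, and the CATALOG dict is only a toy stand-in for a catalog query. The grouping below reads the policy with AND binding tighter than OR, which is also an assumption.

```python
# Toy stand-in for the catalog: UID -> indexed values.
CATALOG = {
    "u1": {"PMID": "1", "authors": "A", "title": "T", "year": "2005"},
    "u2": {"authors": "B", "title": "T2", "year": "2004"},
}


def by_field(field):
    """Build a criterion that matches stored entries on a single field."""
    def check(entry):
        value = entry.get(field)
        if value is None:
            return []
        return [uid for uid, rec in CATALOG.items() if rec.get(field) == value]
    return check


def any_of(*criteria):
    """OR: UIDs matched by at least one criterion, without repeats."""
    def check(entry):
        seen = []
        for criterion in criteria:
            for uid in criterion(entry):
                if uid not in seen:
                    seen.append(uid)
        return seen
    return check


def all_of(*criteria):
    """AND: UIDs matched by every criterion."""
    def check(entry):
        results = [criterion(entry) for criterion in criteria]
        if not results:
            return []
        common = set(results[0]).intersection(*results[1:])
        return [uid for uid in results[0] if uid in common]
    return check


# match = checkPMID OR (checkAuthors AND checkTitle AND checkYear)
match = any_of(
    by_field("PMID"),
    all_of(by_field("authors"), by_field("title"), by_field("year")),
)
```

Because the combinators return functions with the same one-argument signature as the criteria themselves, a composed policy can be used anywhere a single criterion is expected.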
Raphael