#37 — Medline import fails

State: Unconfirmed
Version: 0.8.0
Area: Functionality
Issue type: Bug
Severity: Medium
Submitted by: unset
Submitted on: Sep 03, 2007
Responsible:
Target release:
Last modified on Jan 08, 2009 by Matthew Wilkes
Until recently we could import Pubmed citations by pasting the Medline-formatted entry into the 'import' tab of a bibliography folder. This has suddenly stopped working, perhaps because Pubmed made a slight change to its format.

Details:

No import happens but the following traceback appears in the Error log of the site:

Traceback (innermost last):
  Module ZPublisher.Publish, line 115, in publish
  Module ZPublisher.mapply, line 88, in mapply
  Module ZPublisher.Publish, line 41, in call_object
  Module Products.CMFFormController.FSControllerPageTemplate, line 96, in __call__
  Module Products.CMFFormController.BaseControllerPageTemplate, line 39, in _call
  Module Products.CMFFormController.ControllerBase, line 243, in getNext
  Module Products.CMFFormController.Actions.TraverseTo, line 36, in __call__
  Module ZPublisher.mapply, line 88, in mapply
  Module ZPublisher.Publish, line 41, in call_object
  Module Products.CMFFormController.FSControllerPythonScript, line 107, in __call__
  Module Products.CMFFormController.Script, line 141, in __call__
  Module Products.CMFCore.FSPythonScript, line 108, in __call__
  Module Shared.DC.Scripts.Bindings, line 311, in __call__
  Module Shared.DC.Scripts.Bindings, line 348, in _bindAndExec
  Module Products.CMFCore.FSPythonScript, line 164, in _exec
  Module None, line 65, in bibliography_import
   - <FSControllerPythonScript at /sfiles/bibliography_import used for /sfiles/mycoplasma/myco_bibliography>
   - Line 65
AttributeError: 'str' object has no attribute 'get'
Steps to reproduce:
1) Select an article from Pubmed and choose the Medline view
2) Copy and paste the Medline view into the import tab of a bibliography folder
3) Select the 'Medline' format and click 'import'
Added by unset on Sep 27, 2007 02:46 PM
I had a deeper look at this issue, and it turns out that Pubmed has indeed changed the Medline format:
The Medline parser in CMFBibliography expects field markers like 'PMID-' or 'AU -' at the start of a line. In the current version, Medline instead puts a tab between the field tag and the '-'. I've adapted tool/parsers/medline.py to be more permissive and created a patch for it. Paste the following lines into a new file medline.py.patch and apply it with 'patch -p0 < medline.py.patch'.

--- __medline.py 2005-06-01 13:40:06.000000000 +0200
+++ medline.py 2007-09-27 16:32:26.000000000 +0200
@@ -17,6 +17,10 @@

 import re

+def extractKey(rawkey):
+    """adapt to new Pubmed format"""
+    return rawkey.split('-')[0].strip()
+

 class MedlineParser(BibliographyParser):
     """
@@ -34,7 +38,7 @@
                  id = 'medline',
                  title = "Medline parser",
                  delimiter = '\n\n',
-                 pattern = r'(^.{0,4}- )',
+                 pattern = r'(^.{0,5}- )',
                  flag = re.M):
         """
         initializes including the regular expression patterns
@@ -54,11 +58,12 @@
         # vanilla test for 'PMID- ' in the sub-string 'source[0, 100]'
         ## rr: can definitively be improved

-        if source.find('PMID- ', 0, 1000) > -1:
+        if source.find('PMID', 0, 1000) > -1:
             return 1
         else:
             return 0

+
     def parseEntry(self, entry):
         """
         parses a single entry
@@ -71,7 +76,7 @@
         tokens = self.pattern.split(entry)

         checkAU = 0
-        if 'FAU - ' not in tokens:
+        if 'FAU\t-' not in tokens:
             checkAU = 1

         nested = []
@@ -81,24 +86,25 @@
         # some defaults
         result['note'] = 'automatic medline import'

-        for key, value in nested:
-            if key == 'PT - ' and value.find('Journal Article')> -1:
+        for k, value in nested:
+            key = extractKey(k)
+            if key == 'PT' and value.find('Journal Article')> -1:
                 result['publication_type'] = 'ArticleReference'
-            elif key == 'TI - ':
+            elif key == 'TI':
                 title = value.replace('\n', ' ').replace(' ', '').strip()
                 result['title'] = title
-            elif key == 'AB - ':
+            elif key == 'AB':
                 tmp = value.replace('\n', ' ').replace(' ', '')
                 result['abstract'] = tmp.replace(' ', '').replace(' ', '')
-            elif key == 'PMID- ': result['pmid'] = str(value).strip()
-            elif key == 'TA - ': result['journal'] = str(value).strip()
-            elif key == 'VI - ': result['volume'] = str(value).strip()
-            elif key == 'IP - ': result['number'] = str(value).strip()
-            elif key == 'PG - ': result['pages'] = str(value).strip()
-            elif key == 'DP - ':
+            elif key == 'PMID': result['pmid'] = str(value).strip()
+            elif key == 'TA': result['journal'] = str(value).strip()
+            elif key == 'VI': result['volume'] = str(value).strip()
+            elif key == 'IP': result['number'] = str(value).strip()
+            elif key == 'PG': result['pages'] = str(value).strip()
+            elif key == 'DP':
                 result['publication_year'] = value[:4]
                 result['publication_month'] = value[5:].replace('\n','').replace('\r','')
-            elif key == 'FAU - ':
+            elif key == 'FAU':
                 raw = value.replace('\n', '').split(', ')
                 lname = raw[0]
                 fnames = raw[1].split(' ',1)
@@ -113,7 +119,7 @@
                          }
                 result.setdefault('authors',[]).append(adict)

-            elif checkAU and key == 'AU - ':
+            elif checkAU and key == 'AU':
                 raw = value.replace('\n', '').split()
                 lname = raw[0]
                 fnames = raw[1]
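
For anyone who wants to see the effect of the two changes outside of Plone, here is a minimal standalone sketch of the idea behind the patch (widen the marker pattern by one character, then reduce each marker to its bare tag). It is not the CMFBibliography code: `parse_fields`, `extract_key`, and both sample records are made up for illustration, and the tab-separated sample is only my assumption about what the changed Pubmed output looks like.

```python
import re

# Pattern used before the patch: up to 4 characters, then "- ", at line start.
OLD_PATTERN = re.compile(r'(^.{0,4}- )', re.M)
# Pattern from the patch: one extra character leaves room for a tab.
NEW_PATTERN = re.compile(r'(^.{0,5}- )', re.M)


def extract_key(rawkey):
    """Reduce a raw field marker (tag, padding, dash) to its bare tag."""
    return rawkey.split('-')[0].strip()


def parse_fields(entry, pattern=NEW_PATTERN):
    """Return a list of (tag, value) pairs for one record (hypothetical helper)."""
    tokens = pattern.split(entry)
    # tokens[0] is whatever precedes the first marker; after that,
    # markers and values alternate because the pattern has one capture group.
    pairs = zip(tokens[1::2], tokens[2::2])
    return [(extract_key(marker), value.strip()) for marker, value in pairs]


# Invented sample records, not real Pubmed output.
old_style = "PMID- 12345\nTI  - An example title\nAU  - Doe J"
tab_style = "PMID\t- 12345\nTI\t- An example title\nAU\t- Doe J"

if __name__ == "__main__":
    print(parse_fields(old_style))
    print(parse_fields(tab_style))
    # With the old pattern, the 'PMID<tab>- ' marker is one character too long
    # to match, so the PMID field is never split off.
    print(parse_fields(tab_style, OLD_PATTERN))
```

Comparing bare tags instead of full markers means the parser no longer cares whether the tag is padded with spaces or followed by a tab, which is the point of the `extractKey` helper in the patch.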
Added by unset on Sep 27, 2007 02:49 PM
I had a deeper look at this issue, and it turns out that Pubmed has indeed changed the Medline format:
The Medline parser in CMFBibliography expects field markers like 'PMID-' or 'AU -' at the start of a line. In the current version, Medline instead puts a tab between the field tag and the '-'. I've adapted tool/parsers/medline.py to be more permissive and created a patch for it. Apply it with 'patch -p0 < medline.py.patch'. I am also attaching the modified medline.py (based on branch 0.8).

