collective.simserver

by Christian Ledermann last modified Feb 20, 2012 08:16 AM

Semantic indexing for Plone. Related items are a powerful feature, but content managers mostly fail to maintain them unless they are very committed and know their content very well. This is where simserver comes to the rescue and sets them automatically.

Project Description

What is a document similarity service?

Conceptually, a service that lets you:

  • train a semantic model from a corpus of plain texts (no manual annotation and mark-up needed)
  • index arbitrary documents using this semantic model
  • query the index for similar documents (the query can be either the uid of a document already in the index, or an arbitrary text)
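
To make the three operations concrete, here is a toy sketch in plain Python (standard library only). It is not how simserver or gensim work internally: it swaps the semantic model for simple TF-IDF weighting with cosine similarity, and the class and method names (ToySimilarityService, train, index, find_similar) are made up for this illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def _cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(word, 0.0) for word, weight in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToySimilarityService:
    """Hypothetical stand-in for a similarity service (TF-IDF, not LSI)."""

    def __init__(self):
        self.idf = {}   # word -> inverse document frequency, set by train()
        self.docs = {}  # uid -> weighted vector, filled by index()

    def train(self, texts):
        """Learn word weights from a corpus; retraining drops the index."""
        texts = list(texts)
        doc_freq = Counter()
        for text in texts:
            doc_freq.update(set(tokenize(text)))
        self.idf = {word: math.log((1 + len(texts)) / (1 + df)) + 1.0
                    for word, df in doc_freq.items()}
        self.docs = {}

    def _vector(self, text):
        """Turn text into a sparse TF-IDF vector using the trained weights."""
        counts = Counter(tokenize(text))
        return {w: c * self.idf[w] for w, c in counts.items() if w in self.idf}

    def index(self, documents):
        """Index a {uid: text} mapping; an existing uid is overwritten."""
        for uid, text in documents.items():
            self.docs[uid] = self._vector(text)

    def find_similar(self, query, max_results=10):
        """Query by the uid of an indexed document or by arbitrary text."""
        vec = self.docs[query] if query in self.docs else self._vector(query)
        scores = [(uid, _cosine(vec, other))
                  for uid, other in self.docs.items() if uid != query]
        scores.sort(key=lambda pair: pair[1], reverse=True)
        return scores[:max_results]


corpus = {
    "doc-1": "Plone is a content management system written in Python",
    "doc-2": "Python frameworks such as Pyramid serve web content",
    "doc-3": "A recipe for apple pie with cinnamon and sugar",
}
service = ToySimilarityService()
service.train(corpus.values())
service.index(corpus)
print(service.find_similar("doc-1"))           # query by uid
print(service.find_similar("apple cinnamon"))  # query by free text
```

In this toy corpus the uid query ranks doc-2 first and the free-text query ranks doc-3 first. A real deployment would train on thousands of documents and use a semantic model rather than raw word overlap.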

What is it good for?

Digital libraries of (mostly) text documents. More generally, it helps you annotate, organize and navigate documents in a more abstract way, compared to plain keyword search.

  • Enhance the UX by linking content to related content the user might also be interested in
  • Easy way to tag documents
  • SEO: improve the PageRank of your site

The package

The Plone product actually consists of two packages:

1) collective.simserver.core
https://github.com/cleder/collective.simserver.core
provides the common core functionality, such as an abstracted call interface, training of the corpus, and indexing.

2) collective.simserver.related
https://github.com/cleder/collective.simserver.related
provides a form to query the simserver for similar items and set them as related items, and a simserver collection that queries the simserver for all documents related to that collection.

Plone communicates with the simserver via HTTP. For the Plone products to work you will also need restsims (https://github.com/cleder/restsims), which is a small Pyramid wrapper around the Document Similarity Server itself. Simserver is built on Gensim. Gensim is a free Python framework designed to automatically extract semantic topics from documents, as efficiently (computer-wise) and painlessly (human-wise) as possible.

Gensim aims at processing raw, unstructured digital texts (“plain text”). The algorithms in gensim, such as Latent Semantic Analysis, Latent Dirichlet Allocation or Random Projections, discover the semantic structure of documents by examining statistical word co-occurrence patterns within a corpus of training documents. These algorithms are unsupervised, which means no human input is necessary; you only need a corpus of plain text documents.
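
As a rough intuition for what "co-occurrence patterns" means, the sketch below counts how often two words appear in the same document. This is only the raw statistic that such models start from, not Latent Semantic Analysis itself; the function name and the sample texts are invented for illustration.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(documents):
    """Count, for every pair of distinct words, the documents containing both."""
    pairs = Counter()
    for text in documents:
        # Sorting makes each pair appear in one canonical (alphabetical) order.
        words = sorted(set(text.lower().split()))
        pairs.update(combinations(words, 2))
    return pairs


docs = [
    "plone content editor",
    "plone content workflow",
    "apple pie recipe",
]
counts = cooccurrence_counts(docs)
print(counts[("content", "plone")])  # words that share two documents
print(counts[("apple", "plone")])    # words that never appear together
```

Words that keep turning up together ("plone" and "content") end up close to each other in the semantic model, while words that never meet ("apple" and "plone") do not.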

Installing RestSims

You may want to install the server in a clean and repeatable way using a virtual environment and buildout.

$ mkdir simserver
$ virtualenv --python=python2.7 simserver/
$ cd simserver
$ mkdir buildout-cache
$ mkdir buildout-cache/eggs
$ mkdir buildout-cache/downloads
$ mkdir src # for LAPACK, BLAS and restsims
$ mkdir var
$ mkdir var/corpus # we will need the corpus
$ mkdir var/index # and index directories for the export from Plone
$ wget http://python-distribute.org/bootstrap.py

restsims comes with an example buildout.cfg file in the tarball, or you can find it on GitHub: https://github.com/cleder/restsims/blob/master/buildout.cfg

Copy buildout.cfg into the simserver directory.

$ bin/python bootstrap.py

You can try to easy_install numpy and scipy:

$ bin/easy_install numpy
$ bin/easy_install scipy

This never worked for me, so I installed them from source (http://www.scipy.org/Download) as documented in http://scipy.org/Installing_SciPy/BuildingGeneral. The buildout assumes that LAPACK and BLAS are installed in the src directory.

Test whether numpy and scipy are installed correctly:

$ bin/python
>>> import numpy
>>> import scipy
>>>

Run buildout:

$ bin/buildout

The buildout will take a while and install all the dependencies for the server. Start the server with:

$ bin/pserve src/restsims/development.ini

Now you can access the server at http://localhost:6543/

You will see an interface like this:

Interact with the simserver

Result: {'status': 'OK', 'response': "SessionServer(\n\tstable=SimServer(loc='/tmp/simserver/var/b', fresh=SimIndex(221 docs, 221 real size), opt=SimIndex(4337 docs, 8023 real size), model=SimModel(method=lsi, dict=Dictionary(50000 unique tokens)), buffer=SqliteDict(/tmp/sqldictb3fe39))\n\tsession=None\n)"}

[Submit] [Cancel]
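
Before wiring Plone to the server, it can be handy to confirm from Python that something answers at that address. The sketch below only checks that the root URL returns an HTTP response; it assumes nothing about the restsims request format, and the helper name is_reachable is made up for this example.

```python
import http.client
import urllib.error
import urllib.request


def is_reachable(url, timeout=5.0):
    """Return True if an HTTP server answers at `url`, whatever the status."""
    # An empty ProxyHandler bypasses any proxy configured in the environment,
    # which is what you want when talking to a server on localhost.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server answered, just not with a 2xx status code.
        return True
    except (OSError, http.client.HTTPException):
        # Connection refused, timeout, DNS failure or malformed response.
        return False


if __name__ == "__main__":
    # Default address from the development configuration.
    print(is_reachable("http://localhost:6543/"))
```

If this prints False, check that bin/pserve is still running and that the port matches your ini file.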



Configuration

The configuration is done in two ini files: one for development and one for production.

Things you might want to change:

At the beginning of the file:

port = 6543

is the port restsims listens on.

At the end of the file:

[simserver]
path=/tmp/simserver/

is the location of the simserver index.
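
If you want to double-check these two values from a script, Python's standard configparser can read the file. The sketch parses an inline sample rather than the real file; the [server:main] section name is an assumption based on the usual Pyramid layout, so compare it with your own development.ini.

```python
from configparser import ConfigParser

# Inline sample; the [server:main] section name is an assumption.
SAMPLE_INI = """
[server:main]
port = 6543

[simserver]
path = /tmp/simserver/
"""


def read_settings(ini_text):
    """Return (port, index_path) from restsims-style ini text."""
    # interpolation=None keeps '%' characters in other options literal.
    parser = ConfigParser(interpolation=None)
    parser.read_string(ini_text)
    port = parser.getint("server:main", "port")
    path = parser.get("simserver", "path")
    return port, path


print(read_settings(SAMPLE_INI))
```

To inspect a real file, read its contents with open(...).read() and pass the text to read_settings.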

 

Installing the Plone products with buildout

Add collective.simserver.related to the eggs section of your buildout (this will pull in collective.simserver.core as well):

eggs =
#    collective.simserver.core
    collective.simserver.related

Activate the product(s) in the add-ons section of your site setup. This will install a Simserver control panel in your site setup.

Configuration

The control panel lets you configure the basic settings:

Simserver Settings

How to connect to your simserver

  • Path to export the corpus to
  • Find the collection which provides the items to be exported as the corpus [Select a collection]
  • Name of the simserver to connect to
  • URL of the server (e.g. http://localhost:6543/)
  • Minimal score
  • Maximum number of results returned
  • Index content upon creation (only the content types below)
  • Content types to be indexed upon creation (only applicable if 'Index automatically' is enabled) [select the types to index]
  • Automatically assign the n most similar items as related content (0 = disable, only applicable if 'Index automatically' is enabled)

[Train] [Save] [Cancel]

Disable 'Index automatically' for now.
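
How 'Minimal score', 'Maximum number of results returned' and the automatic 'n most similar items' setting interact can be pictured with a small Python sketch. It assumes results arrive as (uid, score) pairs; the function names and sample data are hypothetical and not taken from collective.simserver.

```python
def filter_results(results, min_score, max_results):
    """Keep the best `max_results` hits whose score is at least `min_score`."""
    kept = [(uid, score) for uid, score in results if score >= min_score]
    kept.sort(key=lambda hit: hit[1], reverse=True)
    return kept[:max_results]


def auto_related(results, n, min_score=0.0):
    """Pick uids to assign as related items; n == 0 disables the feature."""
    if n <= 0:
        return []
    return [uid for uid, _ in filter_results(results, min_score, n)]


hits = [("uid-a", 0.91), ("uid-b", 0.42), ("uid-c", 0.77), ("uid-d", 0.15)]
print(filter_results(hits, min_score=0.4, max_results=2))
print(auto_related(hits, n=0))
```

A higher minimal score gives fewer but more relevant suggestions; a lower one fills the related items list more often at the cost of weaker matches.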

Getting Started

Train your dragon

To be able to extract information you first need to build a corpus. The service indexes documents in a semantic representation, so we must first teach it how to convert between plain text and semantics.

For the semantic model to make sense, it has to be trained on a corpus that is:

  • Reasonably similar to (or the same as, or a subset of) the documents you want to index later. Training on a corpus of recipes in French when all indexed documents will be about programming in English will not help.
  • Reasonably large (at least thousands of documents), so that the statistical analysis has a chance to kick in.

Note that each time you train the corpus the index is destroyed and you must reindex all documents.

  1. To train the index you first have to create a collection. The collection could, for example, return all Pages and Files in the published state. It is entirely up to you to decide what makes a good and relevant corpus.
  2. Assign the collection as the Corpus Collection in the control panel.
  3. Click Train.

You will be presented with a form to train and index your corpus:

SessionServer( stable=SimServer(loc='/tmp/simserver/var/b', fresh=SimIndex(221 docs, 221 real size), opt=SimIndex(4337 docs, 8023 real size), model=SimModel(method=lsi, dict=Dictionary(50000 unique tokens)), buffer=SqliteDict(/tmp/sqldictb3fe39)) session=None )

Train on a corpus

To be able to extract information from the simserver you need to build a corpus. The service indexes documents in a semantic representation, so we must first teach it how to convert between plain text and semantics.

  • Export all the documents to the filesystem for later processing
  • Send the documents directly to the simserver

[Train] [Cancel]


 

You may either train the corpus directly, which will consume vast amounts of memory, or export it to the file system. If you choose to export the files (recommended for bigger sites), go to the export directory and archive them, e.g.:

$ cd /home/username/simserver/var/corpus
$ tar cfvz ../corpus.tgz *

You can then upload the file corpus.tgz into your simserver, select Train a corpus of documents and submit the form.
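
The same archive can be produced from Python with the standard tarfile module, which may be convenient in a scheduled export script. The function name is made up, and the paths in the usage comment are the example paths from the shell commands above.

```python
import os
import tarfile


def archive_corpus(corpus_dir, archive_path):
    """Pack every visible entry in `corpus_dir` into a gzip-compressed tarball."""
    with tarfile.open(archive_path, "w:gz") as archive:
        for name in sorted(os.listdir(corpus_dir)):
            if name.startswith("."):
                continue  # the shell glob `*` skips hidden files as well
            # arcname keeps paths relative, like running tar inside the directory.
            archive.add(os.path.join(corpus_dir, name), arcname=name)
    return archive_path


# Example usage:
# archive_corpus("/home/username/simserver/var/corpus",
#                "/home/username/simserver/var/corpus.tgz")
```

Writing the archive outside the corpus directory, as the tar command does, avoids packing the tarball into itself.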

Next you have to index your documents.

To index documents you can use any collection in your Plone site. You have a new menu item, Index items similarity, in the Actions menu, which will open the form:


Info

SessionServer( stable=SimServer(loc='/home/lusername/simserver/var/b', fresh=SimIndex(221 docs, 221 real size), opt=SimIndex(4337 docs, 8023 real size), model=SimModel(method=lsi, dict=Dictionary(50000 unique tokens)), buffer=SqliteDict(/tmp/sqldictb3fe39)) session=None )

Index documents

To be able to extract information from the simserver you need to index your documents. When you pass documents that have the same uid as some already indexed document, the indexed document is overwritten by the new input. You don’t have to index all documents first to start querying, indexing can be incremental.

  • Export all the documents to the filesystem for later processing
  • Send the documents directly to the simserver
  • Send n documents at a time (saves RAM but is slower; only applicable for online indexing)
  • Send only documents not yet indexed to the simserver

[Index] [Remove from Index] [Cancel]


 

If you exported the files for the corpus earlier, you can reuse the same archive to index the documents.

If you opt to send your documents directly to the simserver, choose a reasonable chunk size.
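
Chunking itself is simple; a sketch of splitting a list of documents into batches of n follows. The send step is left as a placeholder because the restsims request format is not described here, and chunked is a hypothetical helper name.

```python
from itertools import islice


def chunked(items, size):
    """Yield successive lists of at most `size` items from any iterable."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


documents = ["doc-%d" % i for i in range(7)]
for batch in chunked(documents, 3):
    # Placeholder: a real client would send `batch` to the simserver here.
    print(len(batch), batch)
```

Smaller batches keep memory use low on both sides but mean more HTTP round trips, which is the trade-off the form describes.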

Create related Items for your existing content

You can get suggestions for related items on every content item.

You have a new menu item, Relate similar Items, in the Actions menu, which will open a form like this:

Black Sea Environmental Management

The objective of this project is training of officers in ODS monitoring and control, as well as establishment, operation and enforcement of licensing systems to enable compliance with the Montreal Protocol trade and licensing provisions and Decisions IX/8 and IX/9 of the September 1997 Meeting of the Parties.

[Save] [Cancel]

 

Check the items you think are most relevant and save them as related items.

To relate items in bulk you can use any Plone collection.

You have a new menu item, Set related Items, in the Actions menu, which will open the form:

Set related items

Set similar items as related items

  • Minimal score an item must have to be set as related
  • Maximum number of related items to be set on an object
  • Same query: relate only to results that appear in this topic

[Update] [Cancel]

 

If you check Same query, the items will only be related to other items that match the same criteria, i.e. you will relate only to other items that appear in this topic. For example, you may relate News and Events only to other News and Events.
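
The effect of the Same query restriction can be sketched as a filter over the similarity results: only hits whose uid is also in the collection survive. The function name, the (uid, score) result shape and the sample uids are assumptions for illustration, not the package's actual code.

```python
def related_within_collection(results, collection_uids, min_score, max_related):
    """Keep the best hits that also belong to the collection."""
    allowed = set(collection_uids)
    hits = [(uid, score) for uid, score in results
            if uid in allowed and score >= min_score]
    hits.sort(key=lambda hit: hit[1], reverse=True)
    return [uid for uid, _ in hits[:max_related]]


# Similar items for one news item; only news and events are in the collection.
results = [("news-2", 0.83), ("page-7", 0.80), ("event-4", 0.61), ("news-9", 0.20)]
collection = ["news-1", "news-2", "news-9", "event-4"]
print(related_within_collection(results, collection, min_score=0.5, max_related=3))
```

Here page-7 scores well but is dropped because it is not part of the collection, and news-9 is dropped because it falls below the minimal score.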

You may want to repeat this with various collections. 

Enable automatic creation of related items

After the initial creation of the related items you may want to enable the automatic creation of relations again. Remind your content editors that these are mere suggestions, not the be-all and end-all, and that they can easily improve them by manually adding related items (with some help).

 

Self-Certification

[ ] Internationalized

[ ] Unit tests

[ ] End-user documentation

[ ] Internal documentation (documentation, interfaces, etc.)

[ ] Existed and maintained for at least 6 months

[X] Installs and uninstalls cleanly

[X] Code structure follows best practice

Current Release

No stable release available yet.

If you are interested in getting the source code of this project, you can get it from the Code repository.
