Semantic indexing for Plone. Related items are a powerful feature, but content managers (unless they are very committed and know their content very well) mostly fail to maintain them. Here simserver comes to the rescue and creates them automatically.
What is a document similarity service?
Conceptually, a service that lets you:
- train a semantic model from a corpus of plain texts (no manual annotation and mark-up needed)
- index arbitrary documents using this semantic model
- query the index for similar documents (the query can be either the UID of a document already in the index, or an arbitrary text)
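The three steps above can be illustrated with a toy sketch. It is not the simserver or gensim API: the class `ToySimilarityIndex` and its method names are hypothetical, and it uses plain TF-IDF with cosine similarity instead of the LSI model that simserver builds. It only shows how training, indexing and querying relate to each other.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


class ToySimilarityIndex:
    """Hypothetical stand-in for the train / index / query workflow (TF-IDF, not LSI)."""

    def __init__(self):
        self.idf = {}      # term -> weight, learned by train()
        self.vectors = {}  # uid -> normalised term vector, filled by index()

    def train(self, corpus):
        """Learn term weights from an iterable of plain texts."""
        texts = list(corpus)
        doc_freq = Counter()
        for text in texts:
            doc_freq.update(set(tokenize(text)))
        self.idf = {
            term: math.log((1 + len(texts)) / (1 + df)) + 1.0
            for term, df in doc_freq.items()
        }
        self.vectors = {}  # retraining invalidates the existing index

    def _vectorize(self, text):
        counts = Counter(t for t in tokenize(text) if t in self.idf)
        vec = {t: n * self.idf[t] for t, n in counts.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        return {t: w / norm for t, w in vec.items()} if norm else {}

    def index(self, documents):
        """Index a mapping of uid -> plain text using the trained weights."""
        for uid, text in documents.items():
            self.vectors[uid] = self._vectorize(text)

    def find_similar(self, query, limit=5):
        """Query by the uid of an indexed document or by arbitrary text."""
        if query in self.vectors:
            query_vec, exclude = self.vectors[query], query
        else:
            query_vec, exclude = self._vectorize(query), None
        scores = [
            (uid, sum(w * vec.get(t, 0.0) for t, w in query_vec.items()))
            for uid, vec in self.vectors.items()
            if uid != exclude
        ]
        scores.sort(key=lambda pair: pair[1], reverse=True)
        return scores[:limit]


docs = {
    "plone-install": "Install Plone with buildout and configure the buildout eggs",
    "plone-addons": "Plone add-ons are installed by adding eggs to buildout",
    "recipe": "Bake bread using flour water yeast",
}
toy = ToySimilarityIndex()
toy.train(docs.values())
toy.index(docs)
print(toy.find_similar("plone-install"))        # query by uid
print(toy.find_similar("eggs in my buildout"))  # query by arbitrary text
```

As in the real service, retraining in this sketch throws away the index, so documents have to be indexed again afterwards.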
What is it good for?
Digital libraries of (mostly) text documents. More generally, it helps you annotate, organize and navigate documents in a more abstract way, compared to plain keyword search.
- Enhance the UX by linking content to related content the user might also be interested in
- Easy way to tag documents
- SEO: improve the PageRank of your site
The Plone product actually consists of two products:
- collective.simserver.core provides the common core functionality, such as an abstracted call interface, training of the corpus and indexing
- collective.simserver.related provides a form to query the simserver for similar items and set them as related items, and a simserver collection that queries the simserver for all documents related to this collection
Plone communicates with the simserver via HTTP. For the Plone products to work you will also need restsims (https://github.com/cleder/restsims), which is a small Pyramid wrapper around the Document Similarity Server itself. Simserver is built on Gensim. Gensim is a free Python framework designed to automatically extract semantic topics from documents, as efficiently (computer-wise) and painlessly (human-wise) as possible.
Gensim aims at processing raw, unstructured digital texts (“plain text”). The algorithms in gensim, such as Latent Semantic Analysis, Latent Dirichlet Allocation or Random Projections, discover the semantic structure of documents by examining statistical word co-occurrence patterns within a corpus of training documents. These algorithms are unsupervised, which means no human input is necessary: you only need a corpus of plain text documents.
You may want to install the server in a clean and repeatable way using virtual environment and buildout.
$ mkdir simserver
$ virtualenv --python=bin/python2.7 simserver/
$ mkdir buildout-cache
$ mkdir buildout-cache/eggs
$ mkdir buildout-cache/downloads
$ mkdir src # for LAPACK, BLAS and restsims
$ mkdir var
$ mkdir var/corpus # we will need the corpus
$ mkdir var/index # and index directories for the export from Plone
$ wget http://python-distribute.org/bootstrap.py
restsims comes with an example buildout.cfg file in the tarball, or you can find it on GitHub: https://github.com/cleder/restsims/blob/master/buildout.cfg
Copy buildout.cfg into the simserver directory.
$ bin/python bootstrap.py
You can try to easy_install numpy and scipy:
$ bin/easy_install numpy
$ bin/easy_install scipy
This never worked for me, so I installed them from source (http://www.scipy.org/Download) as documented in http://scipy.org/Installing_SciPy/BuildingGeneral. The buildout assumes that LAPACK and BLAS are installed in the src directory.
Test if numpy and scipy are installed correctly:
$ bin/python
>>> import numpy
>>> import scipy
>>>
The buildout will take a while and install all the dependencies for the server. Start the server with:
$ bin/pserve src/restsims/development.ini
Now you can access the server at http://localhost:6543/.
Interact with the simserver
The configuration is done in two ini files, one for development and another for production.
Things you might want to change:
At the beginning of the file:
port = 6543
is the port restsims listens on.
At the end of the file
is the location of the simserver index
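If you want to check which port a given ini file configures before starting the server, a small script can read it. This is a minimal sketch, assuming the port lives in a `[server:main]` section (the usual convention for Pyramid ini files, not confirmed by the text above); the helper name `read_listen_port` and the sample ini content are made up for illustration. It is written for Python 3, independent of the Python 2.7 environment the server runs in.

```python
import configparser

# Sample content; the section name is an assumption, only "port = 6543"
# is taken from the text above.
SAMPLE_INI = """\
[server:main]
host = 0.0.0.0
port = 6543
"""


def read_listen_port(ini_text, section="server:main", default=6543):
    """Return the configured port, or `default` if section or option is missing."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(ini_text)
    return parser.getint(section, "port", fallback=default)


print(read_listen_port(SAMPLE_INI))
```

To use it on a real file, read the file contents first, for example with `open("src/restsims/development.ini").read()`, and pass the text to the helper.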
Installing the Plone products with buildout
Add collective.simserver.related to the eggs section of your buildout (this will pull in collective.simserver.core as well)
eggs =
    # collective.simserver.core
    collective.simserver.related
Activate the product(s) in the add-ons section of your site setup. This will install a Simserver control panel in your site setup.
The control panel lets you configure the basic settings:
How to connect to your simserver
Disable Index automatically for now
Train your dragon
To be able to extract information you first need to build a corpus. The service indexes documents in a semantic representation so we must teach the service how to convert between plain text and semantics first.
For the semantic model to make sense, it has to be trained on a corpus that is:
- Reasonably similar to (or the same as/ a subset of) the documents you want to index later. Training on a corpus of recipes in French when all indexed documents will be about programming in English will not help.
- Reasonably large (at least thousands of documents), so that the statistical analysis has a chance to kick in.
Note that each time you train the corpus the index is destroyed and you must reindex all documents.
- To train the index you first have to create a collection. The collection could e.g. return all Pages and Files in the published state. It is entirely up to you what you think makes a good and relevant corpus.
- Assign the collection as the Corpus Collection in the control panel
- Click on train
You will be presented with a Form to train and index your corpus:
- SessionServer( stable=SimServer(loc='/tmp/simserver/var/b', fresh=SimIndex(221 docs, 221 real size), opt=SimIndex(4337 docs, 8023 real size), model=SimModel(method=lsi, dict=Dictionary(50000 unique tokens)), buffer=SqliteDict(/tmp/sqldictb3fe39)) session=None )
Train on a corpus
You may either train the corpus directly, which will consume a vast amount of memory, or export it to the file system. If you choose to export the files (recommended for bigger sites), go to the export directory and archive them, e.g.
$ cd /home/username/simserver/var/corpus
$ tar cfvz ../corpus.tgz *
You can then upload the file corpus.tgz into your simserver, select Train a corpus of documents and submit the form.
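The archive step can also be scripted, for example when the export runs on a schedule. The following is a minimal sketch using only the Python standard library; the function name `archive_corpus` is hypothetical and not part of restsims or the Plone products. It mirrors the `tar cfvz ../corpus.tgz *` call above by adding every non-hidden entry of the export directory at the top level of the archive.

```python
import tarfile
from pathlib import Path


def archive_corpus(corpus_dir, archive_path):
    """Pack all non-hidden entries of corpus_dir into a gzipped tar archive.

    Returns the list of archived names, in sorted order.
    """
    corpus_dir = Path(corpus_dir)
    # Collect the entries first so the archive itself is never picked up,
    # and skip dotfiles like the shell glob "*" does.
    entries = sorted(
        p for p in corpus_dir.iterdir() if not p.name.startswith(".")
    )
    with tarfile.open(str(archive_path), "w:gz") as tar:
        for path in entries:
            # arcname keeps the names relative, as after "cd var/corpus"
            tar.add(str(path), arcname=path.name)
    return [p.name for p in entries]


# Example call, using the paths from the shell commands above:
# archive_corpus("/home/username/simserver/var/corpus",
#                "/home/username/simserver/var/corpus.tgz")
```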
Next you have to index your documents.
To index documents you can use any collection in your Plone site. You have a new menu item Index items similarity in the Actions menu, which will open the form:
- SessionServer( stable=SimServer(loc='/home/lusername/simserver/var/b', fresh=SimIndex(221 docs, 221 real size), opt=SimIndex(4337 docs, 8023 real size), model=SimModel(method=lsi, dict=Dictionary(50000 unique tokens)), buffer=SqliteDict(/tmp/sqldictb3fe39)) session=None )
If you exported the files for the corpus earlier, you may reuse this file to index the documents as well.
If you opt to send your documents directly to the simserver, choose a reasonable chunk size.
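What a chunk size does can be shown with a short sketch: instead of sending all documents in one request, they are split into batches of at most that many items. The helper `chunked` below is a generic illustration and an assumption about the idea, not the code the Plone product uses.

```python
from itertools import islice


def chunked(items, chunk_size):
    """Yield successive lists of at most chunk_size items from any iterable."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


# Seven documents sent in batches of three
for batch in chunked(["doc-%d" % i for i in range(7)], 3):
    print(batch)
```

A smaller chunk size keeps each request and its memory footprint small, at the price of more round trips to the simserver.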
You can get suggestions for related items on every content item
Check the items you think are most relevant and save them as related items
To relate items in bulk you can use any Plone collection
You have a new menu item Set related Items in the Actions menu, which will open the form:
Set related items
If you check Same query, the items will only be related to other items that match the same criteria, i.e. you will relate only to other items that appear in this topic. For example, you may relate News and Events to other News and Events.
You may want to repeat this with various collections.
Enable automatic creation of related items
After the initial creation of the related items you may want to enable the automatic creation of relations again. Remind your content editors that these are mere suggestions, not the be-all and end-all, and that they can easily improve them by manually adding related items (with some help).
[ ] Internationalized
[ ] Unit tests
[ ] End-user documentation
[ ] Internal documentation (documentation, interfaces, etc.)
[ ] Existed and maintained for at least 6 months
[X] Installs and uninstalls cleanly
[X] Code structure follows best practice
No stable release available yet.