ExternalSiteCatalog

Index and search external sites in a Plone site.

Current release
ExternalSiteCatalog 1.2.0

Released Dec 12, 2006 — tested with Plone 2.5.1, Plone 2.1.4, Plone 2.0.5


Project Description

ExternalSiteCatalog

  ExternalSiteCatalog is a web crawler that indexes external sites and makes
  them searchable in Plone. You specify the sites to index in a Plone configlet,
  and either index them directly from Plone or let a scheduler do the job. Have
  a look at the screenshots in the doc folder of the product to get a first
  impression of what it looks like. Searching the external sites is done in a
  special portlet that is installed with ExternalSiteCatalog. External sites are
  not searchable through the normal Plone catalog; they are only available in a
  separate catalog in the portal_externalcatalog tool.

  Please send feedback to support@ingeniweb.com. The current maintainer
  is Maik Röder, maik.roeder@ingeniweb.com.

  Direct indexing

    All external sites are configured in the ExternalSiteCatalog configlet.
    To index an external site immediately, define all its parameters,
    select the site, and click the "index" button in the configlet.

  Crawling external sites regularly

    If you want external sites to be crawled regularly, for example every day
    or every month, some extra work is needed. Install PloneMaintenance 1.3 or
    later, following its installation instructions, and make sure that
    PloneMaintenance itself is called regularly from cron or from one of the
    Zope schedulers.

    The portal_maintenance tool in the Zope Management Interface contains a
    lot of useful information.
   
  Console indexing utility
 
    The console indexing utility is a bit more complicated than doing everything
    from Plone. You may not have the resources to administer an external
    utility, but if you do, this tool lets you decouple the long-running work
    of fetching external sites from Zope. No Zope thread is blocked for a long
    time crawling external sites. The external tool calls Zope only when a page
    should be indexed, so there is still some load on the Zope server.

    Please note that this external tool does not use the information
    entered in the Plone configlet; it is completely independent. It also does
    not use PloneMaintenance, so it is up to you to configure it and
    call it with a scheduler like cron.
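
    The decoupling described above can be pictured with a small sketch. It is
    not the code of the console utility or of HarvestMan: the names 'crawl',
    'fetch' and 'index_page' are hypothetical, and the in-memory 'site'
    dictionary only stands in for real HTTP requests. The point is that all
    fetching and link following happens outside Zope, and the server is only
    contacted through one short call per page.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect the href values of all <a> tags in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.links.append(value)


def crawl(start_url, fetch, index_page, max_pages=50):
    """Crawl one host outside Zope and hand each page to index_page.

    fetch(url) returns the HTML of a page, or None on failure.
    index_page(url, html) is a hypothetical stand-in for the short
    call into Zope that indexes a single page.
    """
    host = urlparse(start_url).netloc
    queue, seen, indexed = [start_url], {start_url}, 0
    while queue and indexed < max_pages:
        url = queue.pop(0)
        html = fetch(url)
        if html is None:
            continue
        index_page(url, html)  # the only step that would touch Zope
        indexed += 1
        collector = LinkCollector()
        collector.feed(html)
        for href in collector.links:
            target = urljoin(url, href).split('#')[0]
            # Stay on the starting host and never queue a page twice.
            if urlparse(target).netloc == host and target not in seen:
                seen.add(target)
                queue.append(target)
    return indexed


# Illustration only: a two-page site held in memory instead of real HTTP.
site = {
    'http://example.org/': '<a href="/about">About</a> '
                           '<a href="http://other.example/">Elsewhere</a>',
    'http://example.org/about': '<a href="/">Home</a>',
}
indexed_urls = []
count = crawl('http://example.org/', site.get,
              lambda url, html: indexed_urls.append(url))
```

    In this sketch the slow part (fetching and parsing) never holds a Zope
    thread; only 'index_page' would.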

    To avoid running long Zope transactions while browsing an
    external site, the indexing is driven from the console utility
    '.../ExternalSiteCatalog/bin/indexexternalsite.py'. Just cd there and type
    this for hints::
   
      $ python indexexternalsite.py -h
 
  Querying an ExternalSiteCatalog object

    Querying an ExternalSiteCatalog works just like querying a regular
    ZCatalog. Its indexes are:
   
    * 'PrincipiaSearchSource' (ZCTextIndex, HTML friendly)
   
    * 'hostname' (FieldIndex)
   
    Its metadata are:
   
    * 'url', the URL to the page
   
    * 'title', HTML title of the page when found

    When queried, an ExternalSiteCatalog acts just like a ZCatalog with the
    traditional 'PrincipiaSearchSource' ZCTextIndex and the 'url' and 'title'
    metadata. Note that 'title' may be empty, since external pages may not
    have a title.
 
    This is the simplest template for querying an ExternalSiteCatalog and
    displaying results::
 
      <html>
        <head>
          <title>Searching other sites</title>
        </head>
        <body tal:define="catalog nocall: container/yourExternalSiteCatalog">
 
          <h2>Searching</h2>
         
          <form action="#"
                tal:attributes="action template/absolute_url">
            Search:
            <input type="text" name="PrincipiaSearchSource" />
            <br />
            In:
            <select name="hostname">
              <option value="">--Any--</option>
              <option tal:repeat="item python:catalog.uniqueValuesFor('hostname')"
                      tal:content="item">
                      option
              </option>
            </select>
            <br />
            <input type="submit" value="Search" />
          </form>
 
          <hr />
     
          <h2>Results</h2>
 
          <tal:block define="pss request/form/PrincipiaSearchSource | nothing;
                             hostname request/form/hostname | nothing;
                             results python: pss and catalog(PrincipiaSearchSource=pss, hostname=hostname) or []">
            <div tal:repeat="result results">
              <a href="#"
                 tal:content="python: result.title or result.url"
                 tal:attributes="href result/url">
               Some result
              </a>
            </div>
          </tal:block>
        </body>
      </html>
 
    Of course, you should replace 'yourExternalSiteCatalog' with the name of
    your own ExternalSiteCatalog. This sample is also easy to translate to DTML.
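
    The same query can also be expressed in Python. The following is only a
    sketch: the index names ('PrincipiaSearchSource', 'hostname') and the
    metadata names ('url', 'title') come from the description above, while
    'search_external_sites', 'FakeCatalog' and 'FakeRecord' are hypothetical
    names. The fake classes exist only so the example runs outside Zope; in a
    real site you would pass your own ExternalSiteCatalog object instead.

```python
def search_external_sites(catalog, text, hostname=None):
    """Return (label, url) pairs for the pages matching `text`.

    `catalog` is called with index names as keyword arguments, as in
    the page template above, and is expected to return records that
    expose the `url` and `title` metadata.
    """
    if not text:
        return []
    query = {'PrincipiaSearchSource': text}
    if hostname:
        # Only restrict by host when one was actually chosen.
        query['hostname'] = hostname
    links = []
    for record in catalog(**query):
        # 'title' may be empty, so fall back to the URL as link text.
        label = record.title or record.url
        links.append((label, record.url))
    return links


# --- Hypothetical stand-ins so the sketch can run outside Zope ---

class FakeRecord:
    def __init__(self, url, title, hostname, body):
        self.url = url
        self.title = title
        self.hostname = hostname
        self.body = body


class FakeCatalog:
    """Very rough imitation: substring match on body, exact match on host."""

    def __init__(self, records):
        self.records = records

    def __call__(self, **query):
        text = query.get('PrincipiaSearchSource', '').lower()
        host = query.get('hostname')
        return [r for r in self.records
                if text in r.body.lower()
                and (host is None or r.hostname == host)]


catalog = FakeCatalog([
    FakeRecord('http://example.org/a', 'Page A', 'example.org',
               'Notes about the crawler'),
    FakeRecord('http://example.net/b', '', 'example.net',
               'More crawler notes'),
])
results = search_external_sites(catalog, 'crawler')
```

    Leaving 'hostname' out of the query when no host is selected keeps the
    search across all indexed sites, which is what the "--Any--" option in
    the form is meant to do.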
 
  Requirements
 
    Zope 2.7+

    PloneMaintenance 1.3

  Credits
 
    * The "Ingeniweb":http://www.ingeniweb.com team.

    * Using a customized version of the fantastic "HarvestMan":http://harvestman.freezope.org (thanks to Anand B Pillai).

    * Amine Mohamed Soulaymani ("rawhead@hotmail.fr":mailto:rawhead@hotmail.fr).

    * Maik Röder - Direct indexing in Plone, configlet, unit tests, functional tests, cleanup