TextIndexNG

TextIndexNG is a new fulltext index for Zope and the most feature-complete solution for fulltext indexing under Zope.

Current release

No stable release available yet.

Project Description

Features

  • DocumentConverters
  • StemmerSupport for 13 languages
  • SimilaritySearch for English text (based on the Levenshtein distance)
  • NearSearch
  • PluggableParsers
  • extended StopWords support
  • full integration in ZCatalog
  • TestFunctionality through ZMI
  • ExtensibleArchitecture
  • MoreEfficient than the current !TextIndex
  • full globbing support (wildcard search)
  • NormalizationSupport (e.g. reducing accented characters to their base form)
  • full UnicodeAwareness
  • Relevance ranking of search results added. Searches are now ranked using an extended cosine measure. The cosine measure is based on a vector model and calculates each document's "score" from the frequency of the query terms in the documents of the result set.
  • Much faster phrase/near search: the old implementation of TextIndexNG had to perform a very expensive job at query time when phrase/near search was performed. Re-using the !WidCode module of !ZCTextIndex made this operation less expensive.
  • Left-truncation added: TextIndexNG can be configured at creation time to support left-truncation (meaning you can search for "*suffix"). Left-truncation is optional because this feature requires a second, reversed index inside the lexicon and much more memory!
  • optional auto-expansion support: This optional feature allows you to get better search results when some of the query terms could not be found. The index expands a query term "foo" to "foo*" if there was no hit for "foo". This expansion is currently global for the index. This feature will be available on a per-query basis in a later version. (Auto-expansion will be extended in a later version to search for similar terms.)
  • improved HTML converter: now using Chris Withers' "Strip-o-Gram" module instead of the Strip-Tag-Parser
  • added converter for text/sgml
  • Similarity search (soundex, metaphone, doublemetaphone) dropped and replaced with a more general, language-independent approach using the Levenshtein distance.
  • range searches like "Fi..Foo"
  • substring searches "substring"
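
To make the Levenshtein-based similarity search concrete, here is a minimal sketch in plain Python. It is not TextIndexNG's own code: the function names `levenshtein` and `similar_terms`, the `max_distance` parameter, and the idea of scanning a plain list of words are assumptions made for illustration only.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    # Keep the shorter string in the inner loop to limit row length.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def similar_terms(term, lexicon, max_distance=1):
    """Hypothetical helper: words within max_distance edits of term."""
    return sorted(w for w in lexicon if levenshtein(term, w) <= max_distance)
```

Because the distance only compares characters, it needs no language-specific rules, which is what makes it more general than soundex or metaphone.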
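
The memory cost of left-truncation can be illustrated with a toy lexicon that keeps a second, reversed word list. This is only a sketch of the general technique; the class `Lexicon` and its methods are hypothetical and do not reflect TextIndexNG's internal data structures.

```python
import bisect


class Lexicon:
    """Toy lexicon holding each word twice: as-is and reversed."""

    def __init__(self, words):
        unique = set(words)
        self.forward = sorted(unique)
        # The second index: every word stored back to front.
        self.backward = sorted(w[::-1] for w in unique)

    @staticmethod
    def _prefix_matches(sorted_words, prefix):
        # Binary search for the first candidate, then walk while it matches.
        matches = []
        i = bisect.bisect_left(sorted_words, prefix)
        while i < len(sorted_words) and sorted_words[i].startswith(prefix):
            matches.append(sorted_words[i])
            i += 1
        return matches

    def right_truncation(self, prefix):
        """Answer a query such as "prefix*"."""
        return self._prefix_matches(self.forward, prefix)

    def left_truncation(self, suffix):
        """Answer a query such as "*suffix" via the reversed list."""
        hits = self._prefix_matches(self.backward, suffix[::-1])
        return sorted(w[::-1] for w in hits)
```

A suffix query becomes a prefix query on the reversed list, so it stays a cheap sorted lookup, at the price of storing every word a second time.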
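
The auto-expansion rule (retry "foo" as "foo*" when "foo" has no hit) can be sketched as follows. The dictionary-based index and the `search` function with its `auto_expand` flag are assumptions for illustration; in TextIndexNG the option is set globally on the index, as described above.

```python
def search(index, term, auto_expand=True):
    """Look up term; if nothing matches, retry it as the prefix query term*.

    index is a hypothetical mapping from word to a set of document ids.
    """
    hits = set(index.get(term, ()))
    if hits or not auto_expand:
        return hits
    # No exact hit: collect documents for every word starting with term.
    for word, doc_ids in index.items():
        if word.startswith(term):
            hits |= doc_ids
    return hits
```

For example, with `index = {"foobar": {1}, "food": {2, 3}}`, calling `search(index, "foo")` falls back to the prefix query and returns the documents of both words.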
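
As background for the relevance ranking, the sketch below computes the plain cosine measure between a query and a document, both represented as term-frequency vectors. The "extended" variant used by TextIndexNG is not specified here, so this shows only the basic vector-model idea; `cosine_score` is a hypothetical name.

```python
import math
from collections import Counter


def cosine_score(query_terms, document_terms):
    """Plain cosine similarity between two bags of words."""
    query = Counter(query_terms)
    document = Counter(document_terms)
    # Dot product over the terms the two vectors share.
    dot = sum(query[t] * document[t] for t in query if t in document)
    norm_q = math.sqrt(sum(c * c for c in query.values()))
    norm_d = math.sqrt(sum(c * c for c in document.values()))
    if norm_q == 0 or norm_d == 0:
        return 0.0  # an empty query or document matches nothing
    return dot / (norm_q * norm_d)
```

Documents in which the query terms make up a larger share of the text score closer to 1, and documents sharing no term with the query score 0.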