Lextek International Onix Full Text Indexing and Retrieval Toolkit

 

About Words

Onix allows you to define what a word is in the text you index.  A word can be, for example, a simple sequence of the characters a-z, or it can contain upper ASCII/ANSI extended characters, Unicode, or any other sequence of binary data.  Onix is completely character set independent.  During the indexing process you only need to specify what the binary data is and how long it is, and Onix will do the rest.

Some people want to "normalize" words before they are indexed.  This is typically a very good idea, especially to the extent of converting the words to either all uppercase or all lowercase characters.  Beyond this, some developers choose to "stem" the words as they are indexed and then again before the query is passed into the query processing routines.  Stemming normalizes all forms of a word into a standardized form, which may or may not be a real word.  Onix currently includes the Porter Stemmer for the English language as part of its toolkit.  The Porter stemming algorithm is considered by many to be one of the best stemming algorithms developed for English.  Stemming has its share of advantages and disadvantages, and only your application will dictate whether it is best to stem words before you index them.  Keep in mind, however, that if you stem a word before you index it, you must also stem the search terms before you conduct a search.  For a more detailed discussion of stemming, see the documentation for ixStemEnglishWord().
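The case normalization described above can be sketched as follows. This is only an illustration under stated assumptions: normalize_word() is a hypothetical helper, not part of Onix, and it handles plain ASCII only. The same function would be applied to each word before indexing and to each search term before querying, so that both sides agree. Stemming is left out because the signature of ixStemEnglishWord() is not shown here.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper (not part of Onix): copies an ASCII word into `out`,
 * converting it to lowercase. Returns 0 on success, or -1 if `out` is too
 * small to hold the word plus its terminating zero. */
int normalize_word(const char *in, char *out, size_t out_size)
{
    size_t i;

    if (out_size == 0)
        return -1;

    for (i = 0; in[i] != '\0'; i++) {
        if (i + 1 >= out_size)      /* keep room for the terminator */
            return -1;
        out[i] = (char)tolower((unsigned char)in[i]);
    }
    out[i] = '\0';
    return 0;
}
```

An application using extended character sets or Unicode would need its own case-folding rules in place of tolower().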

To handle the possibility that a word in the index may reflect any character set or may contain any sequence of binary characters, the query processing routines use a standard format for passing in query terms and operands: the query terms are passed in a 7-bit hexadecimal textual format. For example, the word "whale" would look like "0x7768616c65" in a zero-terminated "C" style string. Onix provides utility functions such as ixConvertQuery(), which will convert most queries to this normalized form. However, depending on an application's needs, a developer might need to write their own conversion function. Utility functions such as ixCharToHex() will assist in preparing queries for final processing by the query engine.
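For developers who do need their own conversion function, the hexadecimal format can be sketched as below. The function term_to_hex() is a hypothetical stand-in written for illustration; it is not ixCharToHex() or ixConvertQuery(), whose signatures are not shown here. It assumes the format is "0x" followed by two lowercase hex digits per byte, as in the "whale" example.

```c
#include <stddef.h>

/* Hypothetical helper (not part of Onix): writes `len` bytes of `data` into
 * `out` as "0x" followed by two lowercase hex digits per byte, zero
 * terminated. Returns 0 on success, or -1 if `out` is too small. */
int term_to_hex(const unsigned char *data, size_t len,
                char *out, size_t out_size)
{
    static const char digits[] = "0123456789abcdef";
    size_t i, pos = 0;

    /* Need room for "0x", 2 characters per byte, and the terminator. */
    if (out_size < 3 || len > (out_size - 3) / 2)
        return -1;

    out[pos++] = '0';
    out[pos++] = 'x';
    for (i = 0; i < len; i++) {
        out[pos++] = digits[data[i] >> 4];    /* high nibble */
        out[pos++] = digits[data[i] & 0x0F];  /* low nibble  */
    }
    out[pos] = '\0';
    return 0;
}
```

Because the input is passed as a byte pointer with an explicit length, the same routine works for terms that contain embedded zero bytes or non-ASCII data.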

There are a number of tricks you can use during the indexing process to make your index more useful and easier to manage. Some of these are covered below:

 

Prefixing Words

Often it is desirable to be able to search specific sets of words. For example, in some documents you might want to search for the word "mike" and have it refer to a name rather than a microphone. To index certain sets of words so that you can later search on them, simply prefix each such word during the indexing process with a uniquely identifiable prefix. For example, you could prefix the name "Mike" with the string "Name:", making an indexed term of "Name:Mike". The prefix "Name:" would likewise be added to the other names encountered during the indexing process, making other index terms such as "Name:Bob", "Name:Jones", "Name:Henry", "Name:Jesse", "Name:Sarah", etc. When searching for a name, simply add the appropriate prefix to the name itself and search on the combined term. This is one of several ways that fielded searches can be accomplished with Onix, and for many applications it is perhaps the most efficient in terms of query time. Note that if you prefix some of your words, you should also prefix your normal words (preferably with a different prefix) to ensure that all the words of each type are grouped together alphabetically within the index.
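Building a prefixed term is a simple string concatenation done identically at indexing time and at query time. The sketch below assumes a hypothetical helper, make_prefixed_term(), which is not part of Onix; the prefix "Text:" for normal words is likewise just an example name.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical helper (not part of Onix): writes prefix + word into `out`.
 * Returns 0 on success, or -1 if the result does not fit in `out`. */
int make_prefixed_term(const char *prefix, const char *word,
                       char *out, size_t out_size)
{
    int n = snprintf(out, out_size, "%s%s", prefix, word);

    /* snprintf returns the length it wanted to write; a value >= out_size
     * means the term was truncated. */
    return (n >= 0 && (size_t)n < out_size) ? 0 : -1;
}
```

For example, make_prefixed_term("Name:", "Mike", buf, sizeof buf) fills buf with "Name:Mike", and make_prefixed_term("Text:", "mike", buf, sizeof buf) would produce the term for the ordinary word.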

 

Words as Sets

Words can be used to delimit sets within the index. For example, in a products database, one might want to separate books from movies, from magazines, from software, etc. Or, in a book, one might want to delineate the paragraphs (where each paragraph is a record) that belong to Chapter1, Chapter2, Chapter3, etc. To create sets of records, index a "hidden word" during the indexing process: a word that does not occur in your text and that uniquely identifies the set you are adding the record to. For example, if you are indexing a book and each paragraph is a record, you could index the word "Set:Chapter1". Then, if a user wants to search only Chapter 1, the search term "Set:Chapter1" can be boolean ANDed with the rest of the user's query, which will limit the search to only those records in Chapter 1.
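The effect of ANDing a hidden set word with a query can be illustrated with a toy in-memory model. Everything here is an assumption made for illustration: the record structure and both functions are hypothetical and do not reflect Onix's index format, API, or query syntax. The sketch only shows why a record must carry the set term to match.

```c
#include <stddef.h>
#include <string.h>

#define MAX_TERMS 8

/* Toy record (hypothetical): the terms indexed for one paragraph,
 * including its hidden set term such as "Set:Chapter1". */
struct record {
    const char *terms[MAX_TERMS];
    size_t term_count;
};

/* Returns 1 if the record contains the given term, 0 otherwise. */
static int record_has_term(const struct record *r, const char *term)
{
    size_t i;

    for (i = 0; i < r->term_count; i++) {
        if (strcmp(r->terms[i], term) == 0)
            return 1;
    }
    return 0;
}

/* Counts the records that contain BOTH the set term and the search word,
 * which is the boolean AND of the two terms. */
size_t count_matches_in_set(const struct record *records, size_t n,
                            const char *set_term, const char *word)
{
    size_t i, count = 0;

    for (i = 0; i < n; i++) {
        if (record_has_term(&records[i], set_term) &&
            record_has_term(&records[i], word))
            count++;
    }
    return count;
}
```

A paragraph from Chapter 2 that contains the search word is excluded because it lacks "Set:Chapter1", even though the word itself matches.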