Indexing is the heart of the process for making your text(s) searchable. During the indexing process, Onix assigns a record number and word number to every word indexed. This information is compiled along with a list of words to create an index much like the index in the back of a book. This index makes searching much faster than performing a simple scan of the text.
To begin indexing, first start an indexing session by calling ixStartIndexingSession(). From this point on, building an index with Onix is simple: call ixIndexWord() for every word in a document or record, and call ixIncrementRecord() when you reach the end of that record or document. You may then continue with the next record or document.
Note: Call ixIncrementRecord() only if there is more data to index; do not call it immediately before ixEndIndexingSession(). ixEndIndexingSession() calls ixIncrementRecord() implicitly, because the end of an indexing session is taken to be the end of a record as well.
What is a record? A record is simply a logical chunk of text, and it is the unit of granularity for the index. To return to the book analogy: just as a book's index shows which pages a given word appears on, Onix's index shows which records a word appears in. Unlike a page, however, a record can be any size. For some applications, it is useful to index each paragraph of a text as a separate record, so that a search reports which paragraphs a given term or set of terms appears in. Other applications, such as web search tools, need a larger unit. In that case, the host application treats a file (or page) as a record, and a search reports which files a word appears in.
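To make the effect of record granularity concrete, here is a minimal sketch in plain C. It does not use Onix at all: first_paragraph_record() and first_file_record() are hypothetical helpers, paragraphs are assumed to be separated by a blank line, records are numbered from 1 for illustration, and matching is a plain substring test for brevity.

```c
#include <assert.h>
#include <string.h>

/* Return 1 if `word` occurs in the first `len` bytes of `s` (substring match). */
static int span_contains(const char *s, size_t len, const char *word) {
    size_t wlen = strlen(word);
    if (wlen == 0 || wlen > len) return 0;
    for (size_t i = 0; i + wlen <= len; i++)
        if (memcmp(s + i, word, wlen) == 0) return 1;
    return 0;
}

/* Paragraph granularity: each blank-line-separated paragraph is one record.
   Returns the 1-based record number of the first match, or 0 if none. */
int first_paragraph_record(const char *text, const char *word) {
    int record = 1;
    const char *start = text;
    for (;;) {
        const char *end = strstr(start, "\n\n");
        size_t len = end ? (size_t)(end - start) : strlen(start);
        if (span_contains(start, len, word)) return record;
        if (end == NULL) return 0;   /* last paragraph checked, no match */
        start = end + 2;             /* skip the blank line */
        record++;
    }
}

/* File granularity: each whole file is one record.
   Returns the 1-based record number of the first match, or 0 if none. */
int first_file_record(const char *const files[], int nfiles, const char *word) {
    for (int i = 0; i < nfiles; i++)
        if (strstr(files[i], word) != NULL) return i + 1;
    return 0;
}
```

With paragraph records, a hit in the second paragraph of a file is reported as record 2; with file records, the same hit is reported only as the record number of the file that contains it.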
Given an already opened index, pseudocode for the indexing process is as follows:
IndexingEngine = ixStartIndexingSession();
while (NotDone) {
    for (EveryWordInTheCurrentDocument) {
        ixIndexWord(Word);
    }
    if (MoreDataToIndex) {
        ixIncrementRecord();    // not called after the last record
    }
}
ixEndIndexingSession();         // implicitly ends the last record
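The same control flow can be sketched as compilable C. The mock_* functions below are hypothetical stand-ins, not the Onix API: the real calls take index handles and other arguments that this section does not describe. The mock simply records a record number and a word number for every word (both numbered from 1 here, which is an assumption) and counts how many records are closed.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_POSTINGS 64

typedef struct {
    const char *word;
    int record;     /* record the word was indexed under */
    int position;   /* word number within that record */
} Posting;

typedef struct {
    Posting postings[MAX_POSTINGS];
    int count;            /* postings stored so far */
    int current_record;   /* record currently being filled */
    int next_position;    /* word number for the next word */
    int closed_records;   /* records that have been ended */
} MockSession;

/* Stand-in for ixStartIndexingSession(). */
static void mock_start_session(MockSession *s) {
    s->count = 0;
    s->current_record = 1;
    s->next_position = 1;
    s->closed_records = 0;
}

/* Stand-in for ixIndexWord(): tag the word with record and word number. */
static void mock_index_word(MockSession *s, const char *word) {
    assert(s->count < MAX_POSTINGS);
    Posting *p = &s->postings[s->count++];
    p->word = word;
    p->record = s->current_record;
    p->position = s->next_position++;
}

/* Stand-in for ixIncrementRecord(): end the current record. */
static void mock_increment_record(MockSession *s) {
    s->closed_records++;
    s->current_record++;
    s->next_position = 1;
}

/* Stand-in for ixEndIndexingSession(): ends the last record implicitly. */
static void mock_end_session(MockSession *s) {
    mock_increment_record(s);
}

/* Index ndocs documents, each a NULL-terminated array of words.
   Returns the number of records that were closed. */
int index_documents(MockSession *s, const char *const *docs[], int ndocs) {
    assert(ndocs > 0);
    mock_start_session(s);
    for (int d = 0; d < ndocs; d++) {
        for (const char *const *w = docs[d]; *w != NULL; w++)
            mock_index_word(s, *w);
        if (d + 1 < ndocs)              /* more data: end this record */
            mock_increment_record(s);
    }
    mock_end_session(s);                /* closes the final record */
    return s->closed_records;
}

static const char *const doc0[] = {"the", "quick", "fox", NULL};
static const char *const doc1[] = {"the", "lazy", "dog", NULL};
static const char *const *sample_docs[] = {doc0, doc1};

/* Usage: index the two sample documents above. */
int index_sample_documents(MockSession *s) {
    return index_documents(s, sample_docs, 2);
}
```

Calling index_sample_documents() closes exactly two records: one through the explicit increment between the documents and one through the end of the session. An extra increment before ending the session would leave a third, empty record in this mock.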
About Stemming
Some people choose to "stem" words as they index them. Stemming reduces a word to a normalized form so that all inflections of the word are indexed as a single term. Thus the words "run", "running", "runs", etc. are all indexed as the same term, allowing a user to find all forms of a word with a single query. (Irregular forms such as "ran" are generally not conflated by a suffix-stripping stemmer.) Keep in mind, however, that a stemmer does not always generate a real word, and that is not its goal. The goal is to have all forms of a word indexed as the same term. So if you are not showing your wordlist to your users, stemming is a good way to go. If you are, a stemmed wordlist is likely to confuse them unless it comes with an explanation. Onix includes a copy of the Porter stemmer, which has been found to be one of the best and fastest stemming algorithms available for English.
Note: As you might expect, to search an index that contains stemmed terms, the query terms must be stemmed as well.
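The sketch below illustrates why the query has to go through the same stemmer as the indexed words. toy_stem() is a deliberately tiny suffix stripper written for this example; it is not the Porter algorithm and not the stemmer shipped with Onix, and the TermList type and its functions are likewise hypothetical.

```c
#include <assert.h>
#include <string.h>

#define MAX_TERMS 16
#define TERM_SIZE 32

/* Toy stemmer: strips "ing" (undoing a doubled consonant) or a final "s". */
void toy_stem(const char *word, char *out, size_t out_size) {
    size_t len = strlen(word);
    assert(len < out_size);
    strcpy(out, word);
    if (len > 5 && strcmp(out + len - 3, "ing") == 0) {
        len -= 3;                                   /* "running" -> "runn" */
        if (len >= 2 && out[len - 1] == out[len - 2])
            len--;                                  /* "runn" -> "run" */
    } else if (len > 3 && out[len - 1] == 's' && out[len - 2] != 's') {
        len--;                                      /* "runs" -> "run" */
    }
    out[len] = '\0';
}

typedef struct {
    char terms[MAX_TERMS][TERM_SIZE];
    int count;
} TermList;

/* Index time: store the stemmed form of each word. */
void add_word(TermList *list, const char *word) {
    assert(list->count < MAX_TERMS);
    toy_stem(word, list->terms[list->count++], TERM_SIZE);
}

/* Exact lookup of a term as given, with no stemming applied. */
int contains_term(const TermList *list, const char *term) {
    for (int i = 0; i < list->count; i++)
        if (strcmp(list->terms[i], term) == 0) return 1;
    return 0;
}

/* Query time: stem the query with the same function, then look it up. */
int search(const TermList *list, const char *query) {
    char stemmed[TERM_SIZE];
    toy_stem(query, stemmed, sizeof stemmed);
    return contains_term(list, stemmed);
}
```

After add_word() stores "running" as "run", an unstemmed lookup for "runs" through contains_term() misses, while search() stems the query to "run" first and finds it.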
About Changes To The Index
Most of the indexing process runs independently of the index itself, so it is perfectly safe to access the index while you are indexing. Once you call ixEndIndexingSession(), however, changes to the index begin, and you should avoid accessing the index from any other process or thread that may be active.
Because ixEndIndexingSession() may take a while to run, two other functions are provided. The first, ixFinalProcessIndex(), completes the processing of the temporary files generated during indexing, including the index compression. This work is also completely independent of the index, which remains safe to access. After calling ixFinalProcessIndex(), call ixMakeIndexActive() to bring the new index data into the index. During this call the index must not be accessed; otherwise a reader may see data it does not expect.
How fast is ixMakeIndexActive()? With a distributed index, it should take only about 10-20 ms. Without a distributed index, it takes as long as copying the new index data into the index takes, so it is about as fast as your hard drive and OS allow.
Note: After changes have been made to the index, any other index managers that are accessing the index need to reload it, which may be done by calling ixReloadIndex().
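The ordering described above can be sketched with a small mock. Nothing here is Onix code: the MockIndex type, the mock_* functions, and the readers_paused gate are all hypothetical, and how the host application actually keeps readers out (a lock, a flag, a maintenance window) is up to that application.

```c
#include <assert.h>

typedef struct {
    int readers_paused;   /* host application's own gate (hypothetical) */
    int pending_ready;    /* new index data has been processed */
    int active_version;   /* version of the data visible in the index */
    int loaded_version;   /* version another index manager has loaded */
} MockIndex;

/* Stand-in for ixFinalProcessIndex(): readers may keep using the index. */
static void mock_final_process(MockIndex *ix) {
    ix->pending_ready = 1;
}

/* Stand-in for ixMakeIndexActive(): the index must not be accessed. */
static void mock_make_active(MockIndex *ix) {
    assert(ix->pending_ready);    /* final processing must come first */
    assert(ix->readers_paused);   /* no readers during the switch */
    ix->active_version++;
    ix->pending_ready = 0;
}

/* Stand-in for ixReloadIndex() in another index manager. */
static void mock_reload(MockIndex *ix) {
    ix->loaded_version = ix->active_version;
}

/* Publish new index data while keeping the no-access window short. */
int publish_new_index(MockIndex *ix) {
    mock_final_process(ix);    /* slow step, index still readable */
    ix->readers_paused = 1;    /* keep readers out for the switch only */
    mock_make_active(ix);
    ix->readers_paused = 0;
    mock_reload(ix);           /* other managers pick up the new data */
    return ix->active_version;
}
```

The point of the split is that only the short mock_make_active() step runs while readers are paused; the expensive processing happens beforehand while the index is still in use.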
About Distributed Indexes
If you have decided to use a distributed index, make sure that you call ixSetFinalIndexDataPosition() at some point during the indexing session. This function tells Onix which file to save the newly created index data to.
See also: ixStartIndexingSession, ixEndIndexingSession, ixIndexWord, ixIndexWordSpecial, ixIncrementRecord, ixStemEnglishWord, ixFinalProcessIndex, ixMakeIndexActive, ixSetFinalIndexDataPosition