ixProcessQuery
OnixQueryVectorT ixProcessQuery(OnixIndexManagerT IndexManager, UCharT *RankedQueryString, UCharT *BooleanQueryString, StatusCodeT *Status)
IndexManager -- Index manager created by a call to ixCreateIndexManager(). It must have an open index with a retrieval session in progress.
RankedQueryString -- A NULL-terminated string containing the query terms by which the results will be sorted (ranked).
BooleanQueryString -- A NULL-terminated string containing the boolean query. The query terms must be represented in hexadecimal.
Status -- Status value of type StatusCodeT.
OnixQueryVectorT (contains the results of the search)
ixProcessQuery queries the index currently associated with IndexManager. To search an index you must first open it with ixOpenIndex() and then begin a retrieval session with ixStartRetrievalSession(). ixProcessQuery takes two strings with which to query the index. The first is a ranked query, which attempts to determine which records are most relevant to your search; not every word in the query is guaranteed to be present in returned records. The second is a boolean query, which lets you specify which words must be in returned records. Both queries can be passed to ixProcessQuery together. In that case the query processor returns only those records which satisfy the boolean query, ranked according to relevance. Pass NULL in place of a query you do not wish to use. (For instance, if you don't want a ranked query, pass NULL for RankedQueryString.)
Query Terms, and Character Sets
Onix is character set independent: query terms are represented in hexadecimal, so any string of binary characters can be both indexed and searched. You can work with any character set and decide for yourself how specific characters are handled. For instance, many indexers cannot handle NULL characters because they assume those represent the end of a string. By storing query terms in hexadecimal you have complete freedom over the contents of your search terms.
To generate a query you must convert your query terms to hexadecimal. A hexadecimal term starts with the text "0x" followed by a hexadecimal number for each byte. The hexadecimal characters are placed consecutively in the string. A space character represents the end of the term. For example "ahab & whale" should be passed to ixProcessQuery as "0x61686162 & 0x7768616c65". The function ixConvertQuery can be used to automatically convert a query of the form "ahab & whale" to the form "0x61686162 & 0x7768616c65". ixConvertQuery converts many queries but if your query contains extended characters, Unicode, or boolean operators as part of the query terms, it is advisable that you write a conversion function which is specific to your application's requirements.
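If you do write your own conversion function, the per-term encoding described above can be sketched as follows. This is a minimal illustration, not part of the Onix API: the name hex_encode_term and its parameters are hypothetical. It takes an explicit length so that terms containing NULL bytes can be encoded as well.

```c
#include <stddef.h>

/* Hypothetical helper (not an Onix function): writes "0x" followed by two
 * lowercase hexadecimal digits for each byte of term[0..len-1] into out.
 * Returns 0 on success, or -1 if out is too small to hold the result. */
int hex_encode_term(const unsigned char *term, size_t len,
                    char *out, size_t out_size)
{
    static const char digits[] = "0123456789abcdef";
    size_t needed = 2 + 2 * len + 1;   /* "0x" + hex digits + terminator */
    size_t i;

    if (out_size < needed)
        return -1;

    out[0] = '0';
    out[1] = 'x';
    for (i = 0; i < len; i++) {
        out[2 + 2 * i]     = digits[term[i] >> 4];    /* high nibble */
        out[2 + 2 * i + 1] = digits[term[i] & 0x0f];  /* low nibble  */
    }
    out[2 + 2 * len] = '\0';
    return 0;
}
```

Encoding the four bytes of "ahab" this way gives "0x61686162", matching the example above. Operators and the spaces between terms are not encoded; they are placed between the encoded terms as-is.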
All query terms need to be represented in the same form they were indexed. In other words, queries are case sensitive and so query terms need to be passed in the same format (upper or lower case) in which they were indexed. For instance if you indexed a word as "Ahab" and search on "ahab" the query will not find "Ahab". This means that some care should be taken in determining how you index your documents. Most applications tend to convert words to lower case before indexing them. Some applications index both the lowercase and uppercase forms of the word, however.
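For applications that index in lower case, the same folding has to be applied to each query term before it is converted to hexadecimal. A minimal sketch, assuming ASCII terms and the default C locale (the name lowercase_term is hypothetical and not part of Onix):

```c
#include <ctype.h>

/* Hypothetical helper (not an Onix function): lowercases a NULL-terminated
 * term in place so that it matches terms that were lowercased at index time.
 * In the default C locale only the ASCII letters A-Z are affected. */
void lowercase_term(char *term)
{
    for (; *term != '\0'; term++)
        *term = (char)tolower((unsigned char)*term);
}
```

Terms in other character sets need a folding rule suited to that character set. Whatever rule is chosen must be identical at index time and at query time.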
Ranked Queries
Onix supplies several different ranking schemes. This allows you to find a ranking scheme that works the most intuitively for your application and your customer's demands. The ranking scheme that Onix uses is specified when the index is created as one of the index creation parameters passed into ixSetIndexCreationParams(). The query processor finds records most relevant to the query terms that you supply. It then uses the ranking scheme you specify to determine how relevant each record is. When the records are returned in the result vector they are sorted by this relevance.
It is important to remember that not every returned record will necessarily contain all the terms in your query. Records missing terms will be less relevant than those records containing all terms, but may still be returned. To specify that returned records must contain a term, prefix that term with the + sign. To specify that a term must not be in returned records, prefix that term with the - sign. You can achieve the same functionality by combining a ranked query with a boolean query. However, using only the ranked query is faster.
Boolean Queries
Unlike ranked queries, which return the most relevant records matching a search, boolean queries return all the records which satisfy the query. Used in conjunction with a ranked query, a boolean query returns the top ranked documents which satisfy the boolean expression. Boolean expressions are composed of query terms (operands) and query operators, which let the user specify such things as phrase searching, boolean ANDs, ORs, NOTs, word proximity searching, etc. Parentheses may be used to group query operations to specify the order in which they must be executed.
The boolean style operators currently supported by Onix are as follows:
Boolean Operator -- Operator Name
& -- Boolean AND
| -- Boolean OR
! -- Boolean NOT
" " -- Phrase
^ -- Exclusive OR
w: -- Within (Word Proximity)
M -- Member
( ) -- Parentheses
Boolean Operators
The currently supported boolean operators are "&" (AND), "|" (OR), "!" (NOT), and "^" (EXCLUSIVE OR). Boolean operators take two operands (search terms), one on the left side of the operator and one on the right. The boolean operators work as follows:
& -- AND. Finds records which have both operands. For example, the query
cat & dog
finds records which have both the word "cat" AND the word "dog".
| -- OR. Finds records which have either operand. For example, the query
cat | dog
finds records which have the word "cat" OR the word "dog".
! -- NOT. Finds records which have the left operand AND NOT the right operand. For example, the query:
cat ! dog
finds records which have the word "cat" AND NOT the word "dog".
^ -- EXCLUSIVE OR. Finds records which have either operand but not both. For example, the query:
cat ^ dog
finds records which have either "cat" or "dog" but not both "cat" and "dog".
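The four operators behave like set operations on the records that contain each operand. The sketch below illustrates only those semantics; it says nothing about how Onix evaluates queries internally. RecordSetT and the op_ functions are illustrative names, with bit i of a mask standing for "record i contains the term".

```c
#include <stdint.h>

/* Illustrative type (not an Onix type): bit i is set when record i
 * contains the term in question. */
typedef uint32_t RecordSetT;

/* left & right : records containing both operands */
RecordSetT op_and(RecordSetT left, RecordSetT right) { return left & right; }

/* left | right : records containing either operand */
RecordSetT op_or(RecordSetT left, RecordSetT right) { return left | right; }

/* left ! right : records containing left AND NOT right */
RecordSetT op_not(RecordSetT left, RecordSetT right) { return left & ~right; }

/* left ^ right : records containing exactly one of the operands */
RecordSetT op_xor(RecordSetT left, RecordSetT right) { return left ^ right; }
```

For instance, if "cat" occurs in records 1 and 2 and "dog" occurs in records 2 and 3, then "cat & dog" selects record 2, "cat | dog" selects records 1, 2 and 3, "cat ! dog" selects record 1, and "cat ^ dog" selects records 1 and 3.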
Phrase and Word Proximity
Besides processing boolean AND, OR, and NOT, Onix also supports phrase searching. Simply put your words (in their hexadecimal form) in quotes. For example, to search for white whale, simply search for:
"white whale"
or (using the hexadecimal form the query processor actually takes):
"0x7768697465 0x7768616C65"
The rest of the examples will use normal English words for clarity but keep in mind that the query processor takes the query itself in a hexadecimal representation.
You can specify word proximity by using the w: operator. The w: operator takes a parameter: the maximum distance, measured in words, between the first word and the second word. For example, A w:5 B will find all instances where A is within 5 words of B. Note the colon (:) which is used to separate the operator from its parameter. You can let your users use the NEAR operator by calling ixLongQueryFormToShortQueryForm(). This function converts such things as NEAR to an equivalent query of the form A w:n B.
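A proximity query of the form A w:n B can be assembled from two terms that are already in hexadecimal form. The following is a small sketch; build_proximity_query is a hypothetical helper, not an Onix function.

```c
#include <stdio.h>

/* Hypothetical helper (not an Onix function): formats "<left> w:<n> <right>"
 * into out, where left and right are terms already in hexadecimal form.
 * Returns 0 on success, or -1 if the result does not fit in out. */
int build_proximity_query(const char *left, unsigned distance,
                          const char *right, char *out, size_t out_size)
{
    int n = snprintf(out, out_size, "%s w:%u %s", left, distance, right);

    /* snprintf returns the length it wanted to write; a value at or above
     * out_size means the output was truncated. */
    return (n >= 0 && (size_t)n < out_size) ? 0 : -1;
}
```

Called with "0x61686162", 5 and "0x7768616c65", this produces "0x61686162 w:5 0x7768616c65", which asks for "ahab" within 5 words of "whale".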
Parentheses
Parentheses may be used to specify the order of evaluation of terms in a query. For example, with the query "white & (whale | ahab)", the query processor will first perform a boolean OR on "whale" and "ahab", and then perform a boolean AND on the result with "white".
General Searching
If words are simply separated by spaces, they are ANDed together. So for example, the query
cat dog jane
would find records where all the words, "cat" and "dog" and "jane" occur.
Field Searches
The M operator allows you to specify that a term or series of terms are "members" of a particular field. The M operator can operate on either a single term or, optionally, on a series of ANDs. For example, you can search for:
bob M name
which specifies that you are looking for "bob" in a field named "name". (Or in other words, "bob" is a member of the field "name".) With boolean ANDs, you can perform a series of ANDs within a given field. This is done by putting the series of ANDs within parenthesis followed by the M operator. For example:
(bob & casey & jones) M name
specifies that you are looking for a record where the words "bob" and "casey" and "jones" are all part of the name field.
Field searches can also be implemented by prefixing the terms that are being indexed with a unique prefix that specifies the field. When searching in a field, this same prefix can be prefixed to the search term to specify the field.
For example, the "Name" field can have each word prefixed with the word "Name:" as in:
Name:Jones
Name:Henry
Name:Smith
etc....
The same can be done for the other fields. This can be faster than using the member (M) operator by a reasonable margin on large indexes. The field prefix makes the index entries for the various words significantly shorter, and the search engine does not need to decode as much field positioning data during a query.
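With the prefix approach, the search term is the prefix and the word joined into a single term, which is then converted to hexadecimal like any other term. A minimal sketch, assuming the "Name:" prefix from the example above (encode_prefixed_term and append_hex are hypothetical helpers, not Onix functions):

```c
#include <stddef.h>
#include <string.h>

/* Appends two lowercase hex digits per byte of src to out at *pos. */
static void append_hex(const char *src, char *out, size_t *pos)
{
    static const char digits[] = "0123456789abcdef";

    for (; *src != '\0'; src++) {
        unsigned char b = (unsigned char)*src;
        out[(*pos)++] = digits[b >> 4];
        out[(*pos)++] = digits[b & 0x0f];
    }
}

/* Hypothetical helper (not an Onix function): encodes prefix immediately
 * followed by word as one hexadecimal term, e.g. "Name:" + "Jones".
 * Returns 0 on success, or -1 if out is too small. */
int encode_prefixed_term(const char *prefix, const char *word,
                         char *out, size_t out_size)
{
    size_t pos = 0;
    size_t needed = 2 + 2 * (strlen(prefix) + strlen(word)) + 1;

    if (out_size < needed)
        return -1;

    out[pos++] = '0';
    out[pos++] = 'x';
    append_hex(prefix, out, &pos);
    append_hex(word, out, &pos);
    out[pos] = '\0';
    return 0;
}
```

The prefix and the word must be written exactly as they were at index time, including case, since the combined string is matched as a single term.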
Natural Language Searching
One can allow people to write queries just as if they were asking the question of a human. Typically the way this is handled is to remove the stop words from the query. For example for the query "Where is Angkor Wat?" one would remove the words "where" and "is" since they are both stop words leaving the query as: "Angkor Wat" which is then run as the ranked query. Onix will then find the most closely matching records and return those. Even if after removing all the stop words from the query there are still a few words left which are not what one would consider key words, that is o.k., as the relevancy ranking algorithms will typically still be able to detect which words are the most important to the user and return the proper set of records.
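The stop word removal step might look something like the following sketch. The stop list here is a tiny illustrative one and the function names are hypothetical; a real application would use a stop list suited to its own language and documents. Words are taken to be runs of ASCII letters and digits, and punctuation is dropped.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Illustrative stop list only; not supplied by Onix. */
static const char *const STOP_WORDS[] = { "where", "is", "the", "a", "of", "what" };

/* Returns 1 if word[0..len-1] matches a stop word, ignoring ASCII case. */
static int is_stop_word(const char *word, size_t len)
{
    size_t i, j;

    for (i = 0; i < sizeof STOP_WORDS / sizeof STOP_WORDS[0]; i++) {
        const char *s = STOP_WORDS[i];

        if (strlen(s) != len)
            continue;
        for (j = 0; j < len; j++) {
            if (tolower((unsigned char)word[j]) != s[j])
                break;
        }
        if (j == len)
            return 1;
    }
    return 0;
}

/* Hypothetical helper: copies the words of query that are not stop words
 * into out, separated by single spaces. Returns 0 on success, or -1 if
 * out is too small. */
int strip_stop_words(const char *query, char *out, size_t out_size)
{
    size_t pos = 0;
    const char *p = query;

    if (out_size == 0)
        return -1;

    while (*p != '\0') {
        const char *start;
        size_t len;

        while (*p != '\0' && !isalnum((unsigned char)*p))
            p++;                          /* skip spaces and punctuation */
        start = p;
        while (*p != '\0' && isalnum((unsigned char)*p))
            p++;                          /* consume one word */
        len = (size_t)(p - start);

        if (len == 0 || is_stop_word(start, len))
            continue;

        /* Room for an optional separator, the word, and the terminator. */
        if (pos + (pos > 0 ? 1 : 0) + len + 1 > out_size)
            return -1;
        if (pos > 0)
            out[pos++] = ' ';
        memcpy(out + pos, start, len);
        pos += len;
    }
    out[pos] = '\0';
    return 0;
}
```

Applied to "Where is Angkor Wat?", this leaves "Angkor Wat". The remaining words still have to be put into the case used at index time and converted to hexadecimal before being passed as the ranked query.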
Wildcards
In addition, wildcards may be used in the query to specify a class of terms. For example "whal*" will match "whale", "whales", "whaling", etc. The following wildcards apply.
* -- Match any sequence of one or more characters
? -- Match any character
\ -- Escape Character
It is important to note that each byte of a hexadecimal term is written as two characters (both in the range 0-9, A-F). This means that a wildcard character in a query term takes the place of a pair of hexadecimal characters. For example, the wildcarded query term "whal*" is 0x7768616c* (the * replacing the "65"). If you need to search for a term which contains a "*", "?", or "\", prefix that character with the escape character "\". The escape character tells the wildcard pattern matcher to accept the next character literally.
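A term containing wildcards can be converted by encoding ordinary bytes as hexadecimal pairs and copying the wildcard characters through unchanged. The sketch below does only that. It is a hypothetical helper, not an Onix function, and it deliberately does not handle the escape character, because the exact escaped form in a hexadecimal term is not spelled out here.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not an Onix function): hex-encodes a NULL-terminated
 * term, copying unescaped '*' and '?' through as wildcards. Escaped
 * characters are NOT handled by this sketch.
 * Returns 0 on success, or -1 if out is too small. */
int hex_encode_wildcard_term(const char *term, char *out, size_t out_size)
{
    static const char digits[] = "0123456789abcdef";
    size_t pos = 0;

    /* Worst case: every input byte becomes two output characters. */
    if (out_size < 2 + 2 * strlen(term) + 1)
        return -1;

    out[pos++] = '0';
    out[pos++] = 'x';
    for (; *term != '\0'; term++) {
        unsigned char b = (unsigned char)*term;

        if (b == '*' || b == '?') {
            out[pos++] = (char)b;         /* wildcard replaces a hex pair */
        } else {
            out[pos++] = digits[b >> 4];
            out[pos++] = digits[b & 0x0f];
        }
    }
    out[pos] = '\0';
    return 0;
}
```

For "whal*" this yields "0x7768616c*", the form given above.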
ixProcessQuery returns a query vector which contains the results of the query. A query vector is a list of "hits", or records which match a search. You can view the search results with the functions ixVectorCurrentHit(), ixVectorNextHit(), and ixVectorPreviousHit(), and find out how many records match your query with ixNumHits().
NOTE: The returned query vector needs to be disposed of after you are finished using it. You can do this by calling the function ixDeleteResultVector.
ixVectorNextHit, ixVectorCurrentHit, ixVectorPreviousHit, ixNumHits