Understanding the Phase 1 design - March 2nd, 2012

*Section 5: Redesign Database Tables*

To address the problem of merging scores from different sections of a profile, the database was redesigned. In the new design, each profile table is created by denormalizing the records for that profile type. The new design is presented in Figure 9. !Picture11.png|border=1,width=881,height=903!


*Section 6: Match Keywords and Extract Records*

When a search keyword is submitted, Lucene searches the index files to find documents that contain the keyword in a field value. It uses a default scoring mechanism based on the tf-idf algorithm to score each matching document, and the document id and the score are returned as the result. When a single index is used for searching, the results are sorted by descending document score, placing the highest-scoring document at the top.
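As a minimal sketch of this flow (the index path below is an assumption for illustration):

{code:php}
<?php
require_once 'Zend/Search/Lucene.php';

// Open an existing index and search it for a keyword.
$index = Zend_Search_Lucene::open('data/index/facultyIndex');

// find() returns Zend_Search_Lucene_Search_QueryHit objects, already
// sorted by descending score within this single index.
$hits = $index->find('nano');

foreach ($hits as $hit) {
    // Each hit carries the internal document id and its tf-idf based score.
    echo $hit->id . ' : ' . $hit->score . PHP_EOL;
}
{code}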

The creation of an index file is an important factor in search performance. A text analyzer is used to analyze the input text and create the document that is added to the index. An analyzer in Lucene is an object that shapes documents before they are inserted into the search index. It is also used to shape queries.

Java's implementation of Lucene includes an analyzer called "StandardAnalyzer". This analyzer takes a document, stems its words, filters out stop words, and converts them to lowercase. A programmer looking to implement a full-text search would be wise to use this analyzer, because it produces highly relevant results. During the indexing process, each document is placed in the index in its analyzed form (assuming the field is set to be tokenized). So a sentence such as:

_Knuth has been called the father of the analysis of algorithms, contributing to the development of, and systematizing formal mathematical techniques for, the rigorous analysis of the computational complexity of algorithms, and in the process popularizing asymptotic notation._

Would end up inside the index as:

_knuth ha been call father analysi algorithm contribut develop systemat formal mathemat techniqu rigor analysi comput complex algorithm process popular asymptot notat_

In this example, words such as "the" and "and" were filtered out, and words such as "development" were stemmed to their roots. This is especially useful for searching. Suppose a user searches for "analyzing computation". If the programmer had not used a stemming analyzer, the search would return zero hits, because the word "analysis" does not match "analyzing" under a non-stemming analyzer. Performing modifications such as these is the major role of an analyzer. Each analyzer usually serves a different purpose and makes use of the different filters available to it.

Zend's Lucene port comes with a set of basic analyzers and filters, including tools for the lower-casing and stop-word filtering needed for a good search. A PHP version of the StandardAnalyzer *is not* currently packaged with Zend_Search_Lucene. One has been developed by [Kenneth Katzgrau|http://codefury.net/projects/StandardAnalyzer/] as an open-source module. The above example is taken from his explanation in [http://devzone.zend.com/1303/implementing-a-stemming-analyzer-for-zend_search_lucene/]. This analyzer is for the English language and performs the following functions:
* Word stemming
* Stop word filtering
* Lower-casing

The Donald Knuth example above was generated using this analyzer.

This analyzer is used to create the index files from the profile tables. At search time, the same analyzer is used again to find the matching index documents containing the keyword.
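A sketch of registering it as the default analyzer follows; the include path and class name are taken from the Katzgrau project layout and may differ per installation:

{code:php}
<?php
require_once 'Zend/Search/Lucene.php';
require_once 'StandardAnalyzer/Analyzer/Standard/English.php';

// Register the stemming analyzer as the default, so it is used both when
// adding documents to an index and when parsing search queries.
Zend_Search_Lucene_Analysis_Analyzer::setDefault(
    new StandardAnalyzer_Analyzer_Standard_English()
);
{code}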

When creating an index, only the id fields of the documents are stored as keywords; all other text fields are specified as 'unstored' to reduce the index file size. For example, when creating 'facultyIndex', each record of the FacultyProfile table is treated as a document, but only the 'id', 'profileId', and 'profileTypeId' columns are stored in the index file as keywords. When a matching record is found, we get the id of the record using a method of the matched document and can retrieve the whole record from the database table using that id.
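The sketch below illustrates this indexing scheme; the 'researchArea' column is a hypothetical example, and $db is assumed to be a configured Zend_Db adapter:

{code:php}
<?php
require_once 'Zend/Search/Lucene.php';

$index = Zend_Search_Lucene::create('data/index/facultyIndex');

foreach ($db->fetchAll('SELECT * FROM FacultyProfile') as $row) {
    $doc = new Zend_Search_Lucene_Document();

    // Stored keyword fields: just enough to locate the database record later.
    $doc->addField(Zend_Search_Lucene_Field::keyword('id', $row['id']));
    $doc->addField(Zend_Search_Lucene_Field::keyword('profileId', $row['profileId']));
    $doc->addField(Zend_Search_Lucene_Field::keyword('profileTypeId', $row['profileTypeId']));

    // Unstored field (hypothetical column): indexed and searchable,
    // but its text is not kept in the index file.
    $doc->addField(Zend_Search_Lucene_Field::unStored('researchArea', $row['researchArea']));

    $index->addDocument($doc);
}
$index->commit();

// Later, a hit's stored 'id' field leads back to the full database record:
// $dbId = $hit->getDocument()->getFieldValue('id');
// $row  = $db->fetchRow('SELECT * FROM FacultyProfile WHERE id = ?', $dbId);
{code}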

*Section 7: Ranking Algorithm*

Zend Lucene uses a scoring algorithm based on tf-idf and ranks the results in descending order of score: the document with the highest score is placed first, then the second-highest, and so on. This ranking is done on a per-index basis. If there are two index files and file-1 is used before file-2, the result set will contain the matching documents from file-1 first, and then those from file-2. But we want the final result sorted only by document score, regardless of which index a document belongs to.

After careful observation of the objects returned by the root index and the child indexes, it was found that a document's number in a child index differs from its number in the root index. For example, if index-1 has 20 documents and index-2 has 10 documents, and they are added to the root index in that order (index-1 first, then index-2), the root index will contain 30 documents numbered 0 to 29. If the query "nano" has 3 hits from index-1 (say, its 1st, 3rd, and 6th documents) and 2 hits from index-2 (its 2nd and 4th documents), the hit list from index-1 will contain document ids 0, 2, and 5, and the hit list from index-2 will contain document ids 1 and 3. But document ids 1 and 3 of index-2 correspond to document ids 21 and 23 of the root index.

To overcome this problem, a new proxy index is created over the root index, which maps the document ids of the individual indexes to the corresponding document ids of the root index. The Zend_Search_Lucene_Proxy class is used for this, extended as Nist_Search_Proxy with the required functionality customized.
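The id mapping might look like the hypothetical sketch below; the actual Nist_Search_Proxy implementation is not shown here, so the $_offset field and mapToRootId() method are assumptions:

{code:php}
<?php
require_once 'Zend/Search/Lucene.php';

class Nist_Search_Proxy extends Zend_Search_Lucene_Proxy
{
    // Number of documents added to the root index before this child index
    // (20 for index-2 in the example above).
    private $_offset;

    public function __construct(Zend_Search_Lucene_Interface $index, $offset)
    {
        parent::__construct($index);
        $this->_offset = $offset;
    }

    // Translate a child-local document id into the root index's id space,
    // e.g. child ids 1 and 3 of index-2 become root ids 21 and 23.
    public function mapToRootId($childDocId)
    {
        return $childDocId + $this->_offset;
    }
}
{code}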

There is also the issue of ordering the documents regardless of index. To do this, the hit-list array in the returned object is sorted in descending order of score, which rearranges all the documents. The relevant records are then fetched from the database and presented in the UI.
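A sketch of this merge-and-sort step, assuming $hitsFromIndex1 and $hitsFromIndex2 hold the hit lists returned by find():

{code:php}
<?php
// Merge the hit lists from both indexes, then re-sort purely by score so
// the final ordering no longer depends on which index a document came from.
$allHits = array_merge($hitsFromIndex1, $hitsFromIndex2);

usort($allHits, function ($a, $b) {
    if ($a->score == $b->score) {
        return 0;
    }
    return ($a->score > $b->score) ? -1 : 1; // descending score
});
{code}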

The default scoring provided by Zend Lucene uses the tf-idf algorithm; 'tf' stands for term frequency, and 'idf' stands for inverse document frequency. The scoring formula is the same as in Java Lucene:

score(q,d) = sum( tf(t in d) * idf(t)
                  * getBoost(t.field in d)
                  * lengthNorm(t.field in d) )
             * coord(q,d) * queryNorm(q)

The factors of the formula are briefly explained below:

*tf(t in d)* \- Zend_Search_Lucene_Search_Similarity::tf($freq) - a score factor based on the frequency of a term or phrase in a document.

*idf(t)* \- Zend_Search_Lucene_Search_Similarity::idf($input, $reader) - a score factor for a simple term with the specified index.

*getBoost(t.field in d)* \- the boost factor for the term field.

*lengthNorm(t.field in d)* \- Zend_Search_Lucene_Search_Similarity::lengthNorm($fieldName, $numTokens) - the normalization value for a field, given the total number of terms contained in the field. This value is stored within the index. These values, together with field boosts, are stored in the index and multiplied into the score for hits on each field by the search code.

Matches in longer fields are less precise, so implementations of this method usually return smaller values when $numTokens is large and larger values when $numTokens is small.

*coord(q,d)* \- Zend_Search_Lucene_Search_Similarity::coord($overlap, $maxOverlap) - a score factor based on the fraction of all query terms that a document contains. The presence of a large portion of the query terms indicates a better match with the query, so implementations of this method usually return larger values when the ratio between these parameters is large and smaller values when the ratio between them is small.

*queryNorm(q)* \- the normalization value for a query given the sum of the squared weights of each of the query terms. This value is then multiplied into the weight of each query term. This does not affect ranking, but rather just attempts to make scores from different queries comparable.

The scoring algorithm can be customized by using one's own similarity class. This is done by extending the class Zend_Search_Lucene_Search_Similarity and overriding its methods. While this was tested to observe the changes, no customized formula is being used for the phase 1 prototype; the default scoring algorithm is currently in use.
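A minimal sketch of such a customization, extending the default implementation class; the modified tf() below is only an illustration, not a formula used in the prototype:

{code:php}
<?php
require_once 'Zend/Search/Lucene.php';

class My_Search_Similarity extends Zend_Search_Lucene_Search_Similarity_Default
{
    // The default tf() returns sqrt($freq); dampen high term frequencies
    // further with a logarithm instead.
    public function tf($freq)
    {
        return log(1 + $freq);
    }
}

// Register the custom similarity so subsequent scoring uses it.
Zend_Search_Lucene_Search_Similarity::setDefault(new My_Search_Similarity());
{code}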

*Outstanding Issues*

The objective of the phase 1 design was to clarify the searching mechanism for a subset of data. The final system will have data from different institutions, not just from UTA. In addition, Apache *Solr*, which uses Apache Lucene at its core, is nowadays a popular tool for implementing search engines, whereas we used Zend Lucene for this phase. Hence, more research should be done on the following outstanding issues in upcoming stages:
* Scalability of this design to multiple institutions, not just UTA
* Getting data from other institutions and updating the database
* Use of Apache Solr instead of Zend Lucene

*Conclusion*

In this design phase, the working principles of the Lucene search engine were clarified, and a better approach to designing the database and indexes was identified. Research was done to find a better text-analysis technique, and a good open-source implementation was found, which is being used to create the prototype. The scoring and ranking mechanism of Zend Lucene was studied, and the methods that can be used for customization were noted.