*Introduction:*

This document compares the current test search system (*nist*), developed during “Understanding Phase 1 design”, with the existing multiple search client tool (*multipleSearch*). The current system can be accessed at [http://code.uta.edu/mdsalam/nist/testsearch/view|http://code.uta.edu/mdsalam/nist/testsearch/view] (please use Firefox or Chrome to access the page).

The existing search tool can be accessed at

[http://www.uta.edu/research/collaborate/MultipleSearchTest/MultipleSearchClient.php|http://www.uta.edu/research/collaborate/MultipleSearchTest/MultipleSearchClient.php]\\

Each test searched for a single keyword. Four keywords, “laser”, “nano”, “silicon”, and “polymer”, were chosen for the test. The results are presented in tabular form in the following pages.

Among the matching profiles, the top five records are considered, and the number of occurrences of the keyword in each of those profiles is listed in the table. For example, for the keyword “laser”, we record how many times the keyword appears in each of the top five profiles.
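The per-profile count described above can be sketched as follows. This is only an illustration: the function name {{count_keyword}} and the sample text are hypothetical, and it assumes a case-insensitive substring count (so “Lasers” counts as a hit for “laser”), which may differ from how the numbers in the tables were actually obtained.

```python
import re


def count_keyword(profile_text: str, keyword: str) -> int:
    """Count case-insensitive, non-overlapping occurrences of keyword in profile_text.

    Assumption: substring matching, so "nano" also counts inside "nanotube".
    """
    pattern = re.escape(keyword)  # treat the keyword as literal text, not a regex
    return len(re.findall(pattern, profile_text, flags=re.IGNORECASE))


if __name__ == "__main__":
    # Hypothetical profile text, not taken from the real data set.
    sample = "Laser optics and laser diodes; Lasers in medicine."
    print(count_keyword(sample, "laser"))
```

If whole-word counts are wanted instead, the pattern could be wrapped in word boundaries ({{\b}}), which would exclude “Lasers” in the sample above.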

The search was done on all types of profiles.
\\

*Result:* !result_1.PNG|border=1!
!result_2.PNG|border=1!\\

*Observation:*
* The number of matching records found using *NIST* is lower than that found using *MultipleSearch*. The new search tool returns more relevant records, so its hit list is shorter.
* *NIST* search results include a good mix of different types of profiles, whereas most of the search hits returned by *MultipleSearch* are Faculty Profiles.
\\

*Discussion:*

The search hits depend on the way the index is created. In NIST, the default text analyzer of Lucene is not used; instead, an implementation of “StandardAnalyzer”, similar to the one in the Java implementation of Lucene, is used. When documents are created and added to the index files, they are processed by this analyzer, which stems words, filters out stop words, and converts text to lowercase. When a search is run, the query keyword goes through the same analysis steps and is then matched against the documents in the index.

So a sentence such as:

_Knuth has been called the father of the analysis of algorithms, contributing to the development of, and systematizing formal mathematical techniques for, the rigorous analysis of the computational complexity of algorithms, and in the process popularizing asymptotic notation._

would end up inside the index as:

_knuth ha been call father analysi algorithm contribut develop systemat formal mathemat techniqu rigor analysi comput complex algorithm process popular asymptot notat_

In this example, words such as "the" and "and" were filtered out, and words such as "development" have been stemmed to their roots. This is especially useful for searching. Suppose a user searches for "analyzing computation". If the programmer had not used a stemming analyzer, the search would result in zero hits, because the word "analysis" is not the same as "analyzing" when using a non-stemming analyzer.
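The analysis steps discussed here (lowercasing, stop-word filtering, stemming, and applying the same steps to the query) can be sketched in a few lines. This is a simplified illustration, not the analyzer used by NIST: the stop-word list and the suffix rules are hypothetical toy stand-ins for a real stemmer, and the names {{analyze}}, {{stem}} and {{matches}} are made up for this sketch.

```python
import re

# Hypothetical, deliberately short stop-word list; real analyzers ship longer ones.
STOP_WORDS = {"the", "and", "of", "to", "for", "in", "a", "an"}

# Toy suffix rules standing in for a real stemming algorithm.
# Longer suffixes come first so that "ational" wins over "ation".
SUFFIXES = ("ational", "ation", "ing", "ment", "s")


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def analyze(text):
    """Full pipeline: lowercase, tokenize, drop stop words, stem."""
    return [stem(t) for t in tokenize(text) if t not in STOP_WORDS]


def matches(query, document, stemmed=True):
    """Return True if every query term occurs among the document terms.

    With stemmed=True both sides go through analyze(); with stemmed=False
    only lowercasing and tokenizing are applied.
    """
    process = analyze if stemmed else tokenize
    doc_terms = set(process(document))
    query_terms = process(query)
    return bool(query_terms) and all(t in doc_terms for t in query_terms)


if __name__ == "__main__":
    doc = "the computational complexity of algorithms"
    print(analyze(doc))
    print(matches("computation", doc))                 # stemmed comparison
    print(matches("computation", doc, stemmed=False))  # exact-token comparison
```

With these toy rules, the query "computation" and the indexed word "computational" both reduce to "comput", so the stemmed comparison finds a hit while the exact-token comparison does not. Which word pairs actually collapse to the same root depends on the stemming rules in use.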

Because the current implementation uses this type of analyzer, its search results differ from those of the existing *MultipleSearch* tool.
\\