Comparison between Current Test Search and existing Multiple Search Client - March 12, 2012

Introduction:

This document compares the current test search system (nist), developed during “Understanding Phase 1 design”, with the existing multiple search client tool (multipleSearch). The current system can be accessed at http://code.uta.edu/mdsalam/nist/testsearch/view (please use Firefox or Chrome to access the page).

The existing search tool can be accessed at

http://www.uta.edu/research/collaborate/MultipleSearchTest/MultipleSearchClient.php

Each test searched for a single keyword. Four keywords were chosen: “laser”, “nano”, “silicon”, and “polymer”. The results are presented in tabular form in the following pages.

Among the matching profiles, the top five records are considered, and the number of occurrences of the keyword in each of those profiles is listed in the table. For example, for the keyword “laser”, we record how many times it appears in each of the top five profiles.
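The counting step described above can be sketched as follows. This is only an illustration in Python, not the procedure actually used for the test: the function names, the (title, text) representation of a profile, and the choice of case-insensitive substring matching are all assumptions.

```python
import re


def count_keyword(text, keyword):
    """Count case-insensitive occurrences of keyword in text.

    Assumption: substring matching, so "nano" is also counted
    inside a longer word such as "nanotechnology".
    """
    return len(re.findall(re.escape(keyword), text, flags=re.IGNORECASE))


def top_five_counts(ranked_profiles, keyword):
    """Return (title, count) for the first five profiles of a ranked hit list.

    ranked_profiles is a hypothetical list of (title, text) pairs,
    already ordered by the search tool's ranking.
    """
    return [(title, count_keyword(text, keyword))
            for title, text in ranked_profiles[:5]]
```

Whether a keyword embedded in a longer word should count is a design choice. Counting whole words only would give lower numbers for a prefix-like keyword such as “nano”.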

The search was done on all types of profiles.

Result:

Observation:

  • NIST returns fewer matching records than MultipleSearch. The new search tool returns more relevant records, and hence its hit list is shorter.
  • NIST search results include a good mix of different types of profiles. Most of the hits returned by MultipleSearch are Faculty Profiles.

Discussion:

The search hits depend on how the index is created. NIST does not use the default text analyzer of Lucene; instead, it uses an implementation of “StandardAnalyzer” similar to the one in the Java implementation of Lucene. When documents are created and added to the index files, they are processed by this analyzer, which stems words, filters out stop words, and converts text to lowercase. When a search is run, the query keyword goes through the same analysis steps and is then matched against the documents in the index.

So a sentence such as:

Knuth has been called the father of the analysis of algorithms, contributing to the development of, and systematizing formal mathematical techniques for, the rigorous analysis of the computational complexity of algorithms, and in the process popularizing asymptotic notation.

Would end up inside the index as:

knuth ha been call father analysi algorithm contribut develop systemat formal mathemat techniqu rigor analysi comput complex algorithm process popular asymptot notat

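In this example, words such as "the" and "and" were filtered out. Words such as "development" have been stemmed to their roots. This is especially useful for searching. Suppose a user searches for "analyzing computation". If the programmer had not used a stemming analyzer, the search would result in zero hits, because neither query word appears literally in the sentence. The word "analysis" is not the same as "analyzing" when using a non-stemming analyzer, and "computation" is not the same as "computational". With stemming, "computation" and "computational" both reduce to the root "comput", so the sentence is matched.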
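The effect of analyzing both the documents and the query in the same way can be illustrated with a small Python sketch. It is not the analyzer used by NIST or by Lucene: the stop-word list and the suffix-stripping rules below are deliberately crude assumptions, chosen only to show the idea, and the function names are hypothetical.

```python
import re

# Assumed, deliberately small stop-word list (not Lucene's).
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "the", "to"}

# Toy suffix rules, tried in order; this is NOT a real stemming algorithm.
SUFFIXES = ("ational", "ation", "ing", "ment", "s")


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word


def analyze(text, stem=True):
    """Lowercase, tokenize, drop stop words, and optionally stem."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    tokens = [t for t in tokens if t not in STOP_WORDS]
    return [toy_stem(t) for t in tokens] if stem else tokens


def matches(query, document, stem=True):
    """True if any analyzed query term occurs among the analyzed document terms."""
    doc_terms = set(analyze(document, stem))
    return any(term in doc_terms for term in analyze(query, stem))


doc = "the rigorous analysis of the computational complexity of algorithms"
print(matches("analyzing computation", doc, stem=True))   # True, via "comput"
print(matches("analyzing computation", doc, stem=False))  # False
```

The key point is that the same `analyze` function is applied at index time and at query time. If the two sides were analyzed differently, stemmed index terms such as "comput" could never be matched by an unstemmed query word.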

Because the current implementation uses this type of analyzer, its search results differ from those of the existing MultipleSearch tool.
