Understanding the Phase 1 design - March 2nd, 2012

Introduction

The design of phase-1 was initially divided into the following major tasks:

  1. Design local database
  2. Define keywords extraction scheme from user input
  3. Match keywords with database records and extract relevant records
  4. Design ranking algorithm to rank the search results

As these tasks progressed, some adjustments became necessary and the following tasks were added:

  1. Write script to migrate data from profile system database
  2. Research on index creation approach
  3. Redesign database tables


So the revised task sequence is as follows:

  1. Design local database
  2. Define keywords extraction scheme from user input
  3. Write script to migrate data from profile system database
  4. Research on index creation approach
  5. Redesign database tables
  6. Match keywords with database records and extract relevant records
  7. Design ranking algorithm to rank the search results.

This document explains each of these tasks and provides an overall understanding of the phase-1 design. Each task is described in the corresponding section below.

Section 1: Design Local Database

Phase-1 uses UT Arlington’s profile system as the source of data. The current Profile System database has tables that were designed to facilitate the operation of the profile system website. The objective of the NIST project is to search this data; there will be no insert, update or delete operations on database records from the client side. So, a new database design was proposed to meet this need.

Database design - 1

In the profile system there are six different types of profiles: Faculty, Research Center, Technology, Facility, Equipment and Labs. In this design, each profile type had a base table that included the profileId (pid in the profile system). There were also dependent tables that held information for the different sections of a particular profile. For example, the Faculty profile had the base table “Faculty”, which contained the profileId. Other tables such as “FacultyPublication”, “FacultyPresentation” and “FacultyKeyword” held the information of the corresponding sections. Each dependent table had a foreign key facultyId that linked its records to a particular row of the Faculty table. The E-R diagram for the Faculty tables is presented in Figure 1.

Using a similar approach, tables for the other profile types were also designed. The E-R diagrams for these tables are provided in the subsequent figures.

Database design - 2

During task 4 it was realized that database design-1 was not the best approach for creating multiple indexes. Hence the design was changed in task 5. The new approach is further discussed in Section 5.

Section 2: Keyword Extraction

The user will provide information related to his or her needs using the form. The form has different input fields that accept different types of input, such as paragraphs and checkboxes. Users submit this form and get results relevant to their needs. Keywords have to be extracted from this user input so that they can be matched with database records.

Research on the keyword extraction issue provided us with two options:

  1. Using  third party API to extract keywords
  2. Using Lucene search engine which internally extracts keywords

 These two options are discussed below.

          a.) Using third party API: There are some companies that provide keyword extraction as a web service. A study of them is presented in Table 1.

Sl No | Company                        | Web Service/Package | Usage                              | Limitations                                                                                                                         | Authentication
1     | Yahoo Content Analysis Service | Web Service         | Only non-commercial                | 5,000 queries per day per IP address                                                                                                | API key
2     | Tagthe.net                     | Web Service         | Nothing mentioned on the page      | No limitations                                                                                                                      | No API key
3     | OpenCalais                     | Web Service         | Both commercial and non-commercial | 50,000 transactions/day, 4 transactions/second; it appears a larger quota can be requested if needed                               | API key
4     | Alchemy API                    | Web Service         | Both commercial and non-commercial | 30,000 API calls/day for academic users; website says "Higher limits available to educational institutions and non-profit groups." | API key
5     | Zemanta                        | Web Service         | Nothing mentioned on the page      | 1,000 calls/day, can be increased to 10,000 calls/day                                                                               | API key
6     | Semantic Hacker API            | Web Service         | Nothing mentioned on the page      | 20,000 calls/day, 5 calls/second                                                                                                    | API key
Table 1: Comparison of web service APIs provided by different companies

Among these alternatives, Alchemy API showed comparatively good performance in extracting keywords from different paragraphs. Moreover, it provides higher limits for academic users. Hence it is a good option for keyword extraction.

            b.) Using Lucene search engine: Zend Lucene will be used to develop the search engine. Lucene can take a text input and search the indexed records for important words present in the text, but it cannot perform keyword extraction with the accuracy of the dedicated APIs. As the user input from the client side will consist of paragraphs, it is better to extract the important keywords from the paragraphs first and then provide them to Lucene for searching.

We are planning to use Alchemy API for keyword extraction.
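As an illustration, the following is a minimal sketch of calling Alchemy API from PHP using curl. The endpoint path (TextGetRankedKeywords), the parameter names and the response structure are assumptions based on AlchemyAPI's public documentation and should be verified against the current API reference; the extractKeywords helper is our own name.

<?php
// Sketch: extract ranked keywords from a paragraph via Alchemy API.
// Endpoint, parameters and response fields are assumptions to be verified.
function extractKeywords($text, $apiKey)
{
    $url = 'http://access.alchemyapi.com/calls/text/TextGetRankedKeywords';

    $post = http_build_query(array(
        'apikey'     => $apiKey,
        'text'       => $text,
        'outputMode' => 'json',
    ));

    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_POST, true);
    curl_setopt($ch, CURLOPT_POSTFIELDS, $post);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $response = curl_exec($ch);
    curl_close($ch);

    $data = json_decode($response, true);

    // Each entry in the 'keywords' array is expected to carry a 'text'
    // and a 'relevance' value; only the keyword strings are kept here.
    $keywords = array();
    if (isset($data['keywords'])) {
        foreach ($data['keywords'] as $entry) {
            $keywords[] = $entry['text'];
        }
    }
    return $keywords;
}

The extracted keywords can then be joined into a single string and handed to Lucene as the search input.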

Section 3: Write Migration Script

An initial migration script was written for database design-1. When the database was redesigned, a new migration script was written. The workflow steps are as follows:

  1. Copy relevant tables from Profile System database to the new “nist-test” database.
  2. Create tables in the “nist-test” database according to the design.
  3. From the php migration script, make connection to the “nist-test” database.
  4. Write separate methods for populating the tables. For example, the “insertIntoFacultyTable” method fetches rows from the “ppl_general_info” table, creates a new row using this data and inserts the row into the “Faculty” table (a sketch is shown below).
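A minimal sketch of such a migration method is given below. The PDO connections are placeholders, and the source and target column names other than pid are hypothetical; the real mapping follows the Profile System schema and the new database design.

<?php
// Sketch of one migration method. Column names other than "pid" and the
// connection settings are hypothetical placeholders.
function insertIntoFacultyTable(PDO $source, PDO $target)
{
    $rows = $source->query('SELECT * FROM ppl_general_info')
                   ->fetchAll(PDO::FETCH_ASSOC);

    $insert = $target->prepare(
        'INSERT INTO Faculty (profileId, name, department)
         VALUES (:profileId, :name, :department)'
    );

    foreach ($rows as $row) {
        $insert->execute(array(
            ':profileId'  => $row['pid'],        // pid in the profile system
            ':name'       => $row['full_name'],  // hypothetical source column
            ':department' => $row['dept'],       // hypothetical source column
        ));
    }
}

// Example usage against the "nist-test" database:
// $source = new PDO('mysql:host=localhost;dbname=profile_system', $user, $pass);
// $target = new PDO('mysql:host=localhost;dbname=nist-test', $user, $pass);
// insertIntoFacultyTable($source, $target);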

For phase-1, we are considering only the UT Arlington Profile System as the source of data. Later, we will have to develop APIs to get data from other partner institutions and populate the local database.

Section 4: Research on Index Creation Approaches

The Lucene search engine uses indexes to find the records in the database that match the keywords. First, the index files need to be created from the database tables; Lucene stores these files on the file system, not in the database. When a user submits a search string, Lucene forms a query object from the string using the keywords. It then uses the index files to find the ids of the records that match the query, and each record is fetched from the database using its id.

The first step is to create the index files from the database records. Using database design-1, a separate index was created for each of the tables. An index file consists of documents, and each document has fields which hold the values of a record.

For example, consider the “Faculty” table, which has 8463 records. Each record has the columns defined in database design-1. We create an index file on the “Faculty” table; this index file will have 8463 documents. Each document has the column names as field names and the column values as field values. Only the id of a record is stored in the index file so that the file does not become very large.

Indexes were created on the “Faculty”, “FacultyResearch” and “FacultyPublication” tables for testing purposes. These indexes were used to search for test keywords, and the matching records were retrieved. But there was one problem when combining the scores: how can the scores of different sections of a particular profile be combined meaningfully? For example, searching for the keyword “nano” returned results from both the “FacultyResearch” and “FacultyPublication” tables, since a faculty profile can contain research related to nanotechnology as well as publications mentioning it. The results had different scores, and the issue is how to combine these scores to score the matching profile (illustrated by the sketch below). Because of this problem, the database was redesigned to keep all the information of a particular profile type in the same table.
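To make the problem concrete, the following sketch shows what happens under database design-1 (the index paths are placeholders): each table has its own index, so a single query returns separate hit lists whose scores were computed independently.

<?php
require_once 'Zend/Search/Lucene.php';

// Separate indexes per table (database design-1); paths are placeholders.
$researchIndex    = Zend_Search_Lucene::open('data/index/facultyResearchIndex');
$publicationIndex = Zend_Search_Lucene::open('data/index/facultyPublicationIndex');

$researchHits    = $researchIndex->find('nano');
$publicationHits = $publicationIndex->find('nano');

// Each hit carries its own score, but the two lists come from different
// indexes, so there is no principled way to combine them into a single
// score for the faculty profile they both belong to.
foreach ($researchHits as $hit) {
    echo 'research hit, score ' . $hit->score . "\n";
}
foreach ($publicationHits as $hit) {
    echo 'publication hit, score ' . $hit->score . "\n";
}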

With the newly designed database as discussed in Section 5, there are as many tables as profile types. For this testing, there are 6 profile types and hence 6 tables. An index is created for each of the tables: “facultyIndex” is created for the “FacultyProfile” table, and similarly “researchCenterIndex”, “technologyIndex”, “facilityIndex”, “equipmentIndex” and “labIndex” are created. All these indexes are added to a “rootIndex”, which is created using Lucene’s multisearcher.
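A minimal sketch of building this root index is shown below. The index paths are placeholders, and the multisearcher class name (Zend_Search_Lucene_Interface_MultiSearcher) is the one shipped with Zend Framework 1; it should be checked against the framework version in use.

<?php
require_once 'Zend/Search/Lucene.php';

// Add every per-profile-type index to a single root index.
$rootIndex = new Zend_Search_Lucene_Interface_MultiSearcher();
$rootIndex->addIndex(Zend_Search_Lucene::open('data/index/facultyIndex'));
$rootIndex->addIndex(Zend_Search_Lucene::open('data/index/researchCenterIndex'));
$rootIndex->addIndex(Zend_Search_Lucene::open('data/index/technologyIndex'));
$rootIndex->addIndex(Zend_Search_Lucene::open('data/index/facilityIndex'));
$rootIndex->addIndex(Zend_Search_Lucene::open('data/index/equipmentIndex'));
$rootIndex->addIndex(Zend_Search_Lucene::open('data/index/labIndex'));

// A single query against the root index searches all profile types.
$hits = $rootIndex->find('nano');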

Using this design, searching for the keyword “nano” returns results from each of the profile types. Hence, the score returned for each record is the combined score from all the sections of that profile. The “FacultyProfile” table now contains the columns ‘facultyResearch’ and ‘facultyPublication’. When “facultyIndex” is created, it contains each row of the FacultyProfile table as a document. When the keyword “nano” is searched, this index returns the documents matching the keyword. The score is calculated over all the fields of the document, in other words, over all the columns of that table. Hence we get a combined score from the ‘research’ and ‘publication’ sections of the faculty profile.

Section 5: Redesign Database Tables

After facing the problem of merging scores from different sections of a profile, the database was redesigned. In the new design, each of the profile tables is created by denormalizing the record for that profile type. The new design is presented in Figure 9.

Section 6: Match Keywords and Extract Records

When a search keyword is submitted, Lucene searches the index files to find documents that have the keyword in the field value. It uses a default scoring mechanism based on the tf-idf algorithm to score the particular document. The document id and the score are returned as a result. When a single index is used for searching, the result is sorted according to the descending score of the document, placing the document having the highest score at the top.

The way an index file is created is an important factor for search performance. A text analyzer is used to analyze the input text and create the documents that are added to the index. An analyzer in Lucene is an object that helps shape documents before they are inserted into the search index. It is also used to shape queries.

In Java's implementation of Lucene, there is an analyzer called "StandardAnalyzer". This analyzer takes a document and stems words, filters out stop words, and converts them to lowercase. A programmer looking to implement a full text search would be wise to use this analyzer, because it makes for a nicely relevant solution. During the indexing process, each document would be placed in the index in its analyzed form (assuming the field is set to be tokenized). So a sentence such as:

Knuth has been called the father of the analysis of algorithms, contributing to the development of, and systematizing formal mathematical techniques for, the rigorous analysis of the computational complexity of algorithms, and in the process popularizing asymptotic notation.

Would end up inside the index as:

knuth ha been call father analysi algorithm contribut develop systemat formal mathemat techniqu rigor analysi comput complex algorithm process popular asymptot notat

In this example, words such as "the" and "and" were filtered out. Words such as "development" have been stemmed to their roots. This is especially useful considering the task of searching. Suppose a user searches for "analyzing computation". If the programmer had not used a stemming analyzer, the search would result in zero hits: the word "analysis" is not the same as "analyzing" to a non-stemming analyzer. Performing a modification such as this is the major role of an analyzer. Each analyzer usually serves a different purpose and makes use of different filters available to it.

Zend's Lucene port comes with a set of basic analyzers and filters. There are tools for performing the lower-casing and stop word filtering needed for a good search. A PHP version of the StandardAnalyzer is not currently packaged with Zend_Search_Lucene; it has been developed by Kenneth Katzgrau as an open source module. The above example is taken from his explanation at http://devzone.zend.com/1303/implementing-a-stemming-analyzer-for-zend_search_lucene/. This analyzer is for the English language and performs the following functions:

  • Word stemming
  • Stop word filtering
  • Lower-casing

The Donald Knuth example above was generated using this analyzer.

This analyzer is being used to create the index files from the profile tables. At the time of searching, this analyzer is again used to find the matching index document containing the keyword.
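Registering the analyzer as the default could look like the sketch below. The require path and class name (StandardAnalyzer_Analyzer_Standard_English) are assumed from Katzgrau's package and should be verified against the downloaded module.

<?php
require_once 'Zend/Search/Lucene.php';
require_once 'StandardAnalyzer/Analyzer/Standard/English.php'; // assumed path

// Once set as the default, the same stemming, stop word filtering and
// lower-casing are applied both when documents are indexed and when
// query strings are parsed.
Zend_Search_Lucene_Analysis_Analyzer::setDefault(
    new StandardAnalyzer_Analyzer_Standard_English()
);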

When creating an index, only the id fields of the documents are stored as keywords; all other text fields are specified as ‘unstored’ to reduce the index file size. For example, when creating the ‘facultyIndex’, each record of the FacultyProfile table is treated as a document, but only the ‘id’, ‘profileId’ and ‘profileTypeId’ columns are stored in the index file as keyword fields. When a matching record is found, we get the id of the record using a document-specific method and can retrieve the whole record from the database table using this id.
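A sketch of this field layout is given below. The database connection details are placeholders; the identifier columns and the free-text columns (facultyResearch, facultyPublication) follow the denormalized FacultyProfile table described earlier.

<?php
require_once 'Zend/Search/Lucene.php';

// Placeholder connection to the "nist-test" database.
$pdo = new PDO('mysql:host=localhost;dbname=nist-test', 'user', 'password');

$index = Zend_Search_Lucene::create('data/index/facultyIndex');
$rows  = $pdo->query('SELECT * FROM FacultyProfile')->fetchAll(PDO::FETCH_ASSOC);

foreach ($rows as $row) {
    $doc = new Zend_Search_Lucene_Document();

    // Stored keyword fields: just enough to locate the record later.
    $doc->addField(Zend_Search_Lucene_Field::Keyword('id',            $row['id']));
    $doc->addField(Zend_Search_Lucene_Field::Keyword('profileId',     $row['profileId']));
    $doc->addField(Zend_Search_Lucene_Field::Keyword('profileTypeId', $row['profileTypeId']));

    // Unstored fields: indexed and searchable, but not kept in the index
    // file, which keeps the index small.
    $doc->addField(Zend_Search_Lucene_Field::UnStored('facultyResearch',    $row['facultyResearch']));
    $doc->addField(Zend_Search_Lucene_Field::UnStored('facultyPublication', $row['facultyPublication']));

    $index->addDocument($doc);
}

// At search time, the stored id is read back from the matching document
// and used to fetch the full record from the database.
$hits = $index->find('nano');
foreach ($hits as $hit) {
    $facultyId = $hit->getDocument()->getFieldValue('id');
    // ... SELECT * FROM FacultyProfile WHERE id = $facultyId
}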

Section 7: Ranking Algorithm

Zend Lucene uses a scoring algorithm based on tf-idf. It ranks the results in descending order of score: the document with the highest score is placed first, then the second highest, and so on. This ranking is done on a per-index basis. If there are two index files and file-1 is used before file-2, the result set will contain the matching documents from file-1 first and then those from file-2. But we want the final result sorted only by document score, regardless of which index a document belongs to.

After careful observation of the objects returned by the root index and the child indexes, it was found that a document’s number in a child index is different from its number in the root index. For example, if index-1 has 20 documents and index-2 has 10 documents, and they are added to the root index in that order (index-1 first, then index-2), the root index will contain 30 documents numbered 0 to 29. If the query “nano” has 3 hits from index-1 (say, the 1st, 3rd and 6th documents) and 2 hits from index-2 (the 2nd and 4th documents), the hit list from index-1 will contain document ids 0, 2 and 5, and the hit list from index-2 will contain document ids 1 and 3. But document ids 1 and 3 of index-2 correspond to document ids 21 and 23 of the root index.

To overcome this problem, a new proxy index is created over the root index which maps the document ids of the individual indexes to the corresponding document ids of the root index. The Zend_Search_Lucene_Proxy class is used for this purpose: it is extended as Nist_Search_Proxy and the required functionality is customized.

There is also the issue of ordering the documents regardless of index. To do this, the hit list array in the returned object is sorted in descending order of score, which rearranges all the documents. The relevant records are then fetched from the database and presented in the UI.
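Continuing from the root-index search above, the final re-ordering can be a simple descending sort of the hit list by score, as in this sketch:

<?php
// $hits is the array of hit objects returned by the root index.
usort($hits, function ($a, $b) {
    if ($a->score == $b->score) {
        return 0;
    }
    return ($a->score < $b->score) ? 1 : -1; // descending by score
});

// $hits now starts with the highest-scored document across all indexes;
// the corresponding records are then fetched from the database and
// rendered in the UI.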

The default scoring provided by Zend Lucene uses the tf-idf algorithm; ‘tf’ stands for term frequency and ‘idf’ stands for inverse document frequency. The scoring formula is the same as in Java Lucene:

score(q,d) = sum( tf(t in d)
                  * idf(t)
                  * getBoost(t.field in d)
                  * lengthNorm(t.field in d) )
             * coord(q,d) * queryNorm(q)

The functions are briefly explained below:

tf(t in d) - Zend_Search_Lucene_Search_Similarity::tf($freq) - a score factor based on the frequency of a term or phrase in a document.

idf(t) - Zend_Search_Lucene_Search_Similarity::idf($input, $reader) - a score factor for a simple term with the specified index.

getBoost(t.field in d) - the boost factor for the term field.

lengthNorm($term) - the normalization value for a field given the total number of terms contained in a field. This value is stored within the index. These values, together with field boosts, are stored in an index and multiplied into scores for hits on each field by the search code.

Matches in longer fields are less precise, so implementations of this method usually return smaller values when numTokens is large and larger values when numTokens is small.

coord(q,d) - Zend_Search_Lucene_Search_Similarity::coord($overlap, $maxOverlap) - a score factor based on the fraction of all query terms that a document contains. The presence of a large portion of the query terms indicates a better match with the query, so implementations of this method usually return larger values when the ratio between these parameters is large and smaller values when the ratio between them is small.

queryNorm(q) - the normalization value for a query given the sum of the squared weights of each of the query terms. This value is then multiplied into the weight of each query term. This does not affect ranking, but rather just attempts to make scores from different queries comparable.

The scoring algorithm can be customized by using one’s own similarity class. This is done by extending the class Zend_Search_Lucene_Search_Similarity and overriding its methods. While this was tested to observe the changes, no customized formula is being used for the phase-1 prototype; the default scoring algorithm is currently used.
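For reference, a sketch of such a customization is given below. The class name Nist_Search_Similarity and the formulas are illustrative only; they are not values used in the phase-1 prototype. The method signatures follow the Zend_Search_Lucene_Search_Similarity API.

<?php
require_once 'Zend/Search/Lucene.php';

class Nist_Search_Similarity extends Zend_Search_Lucene_Search_Similarity
{
    public function lengthNorm($fieldName, $numTerms)
    {
        // Example tweak: dampen the penalty on long fields.
        return 1.0 / sqrt(max($numTerms, 1));
    }

    public function queryNorm($sumOfSquaredWeights)
    {
        return 1.0 / sqrt($sumOfSquaredWeights);
    }

    public function tf($freq)
    {
        return sqrt($freq);
    }

    public function sloppyFreq($distance)
    {
        return 1.0 / ($distance + 1);
    }

    public function idfFreq($docFreq, $numDocs)
    {
        return log($numDocs / (float)($docFreq + 1)) + 1.0;
    }

    public function coord($overlap, $maxOverlap)
    {
        return $overlap / (float)$maxOverlap;
    }
}

// Register the custom similarity before indexing/searching:
Zend_Search_Lucene_Search_Similarity::setDefault(new Nist_Search_Similarity());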

Outstanding Issues

The objective of the phase-1 design was to clarify the searching mechanism for a subset of data. The final system will have data from different institutions, not just from UTA. Besides, Apache Solr, which uses Apache Lucene at its core, is nowadays a popular tool for implementing search engines, whereas we considered Zend Lucene for this phase. Hence, more research should be done on the following outstanding issues in upcoming stages:

  • Scalability of this design to multiple institutions, not just UTA
  • Getting data from other institutions and updating the database
  • Use of Apache Solr instead of Zend Lucene

Conclusion

In this design phase, the working principle of the Lucene search engine was clarified, and a better approach to designing the database and indexes was established. Research was done to find a better text analysis technique, and a good open source implementation was found which is being used for creating the prototype. The scoring and ranking mechanism of Zend Lucene was studied, and the methods that can be used for customization were noted.
