Understanding the Phase 1 design - March 2nd, 2012

Version 2 by dak4279
on Mar 13, 2012 13:29.

compared with
Version 3 by dak4279
on Mar 14, 2012 11:25.

The Lucene search engine uses indexes to find the database records that match the search keywords. First, the index files must be created from the database tables. Lucene stores these files on the operating system's file system rather than in the database. When a user submits a search string, Lucene builds a query object from the keywords in that string. It then uses the index files to find the ids of the records that match the query, and each matching record is fetched from the database by its id.
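The flow described above can be sketched in a few lines. This is a minimal illustration in plain Python, not Lucene code: the in-memory SQLite table, its columns, the sample rows, and the helper names (tokenize, build_index, search) are all hypothetical, and keywords are treated as alternatives (OR), which is an assumption.

```python
import re
import sqlite3
from collections import defaultdict

def tokenize(text):
    """Lower-case the text and split it into alphanumeric keywords."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(conn):
    """Map each keyword to the set of record ids whose text contains it.
    The index lives outside the database (here: an in-memory dict)."""
    index = defaultdict(set)
    for record_id, name, research in conn.execute(
            "SELECT id, name, research FROM Faculty"):
        for token in tokenize(f"{name} {research}"):
            index[token].add(record_id)
    return index

def search(index, conn, search_string):
    """Turn the search string into keywords, look up the matching ids in
    the index, then fetch each matching record from the database by id."""
    keywords = tokenize(search_string)
    if not keywords:
        return []
    # Assumption: a record matches if it contains ANY of the keywords.
    matching_ids = set.union(*(index.get(k, set()) for k in keywords))
    rows = []
    for record_id in sorted(matching_ids):
        rows.append(conn.execute(
            "SELECT id, name FROM Faculty WHERE id = ?",
            (record_id,)).fetchone())
    return rows

# Hypothetical sample data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Faculty (id INTEGER PRIMARY KEY, name TEXT, research TEXT)")
conn.executemany("INSERT INTO Faculty VALUES (?, ?, ?)", [
    (1, "A. Example", "nano materials and thin films"),
    (2, "B. Sample", "medieval history"),
    (3, "C. Placeholder", "nano scale sensors"),
])
index = build_index(conn)
results = search(index, conn, "nano sensors")
```

The index only returns ids; the record itself always comes from the database, which mirrors the split between index files and database tables described above.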

The first step is to create the index files from the database records. With database design-1, a separate index was created for each table. An index file consists of documents, and each document has fields that hold the values of one record. !Picture9.png|border=1,width=855,height=422!


For example, the “Faculty” table has 8463 records, each with the columns defined in database design-1. The index file created on the “Faculty” table therefore has 8463 documents, one per record. Each document uses the column names as field names and the column values as field values. Only the id of each record is stored in the index files; the other values are indexed for searching but not stored, which keeps the index files from becoming very large.
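To make the row-to-document mapping concrete, here is a small sketch in plain Python rather than Lucene. The Document class, the row_to_document helper, and the sample rows are hypothetical; the point is only that each row yields one document whose fields are searchable while the id alone is kept as a stored value.

```python
from dataclasses import dataclass, field

@dataclass
class Document:
    # Values kept verbatim in the index (here: only the record id).
    stored: dict = field(default_factory=dict)
    # Field name -> set of searchable terms; the original text is not kept.
    indexed: dict = field(default_factory=dict)

def row_to_document(row, id_column="id"):
    """Build one document from one table row: column names become field
    names, and only the id column is stored."""
    doc = Document()
    for column, value in row.items():
        if column == id_column:
            doc.stored[column] = value
        else:
            doc.indexed[column] = set(str(value).lower().split())
    return doc

# Hypothetical rows standing in for records of the "Faculty" table.
rows = [
    {"id": 101, "name": "A. Example", "department": "Physics"},
    {"id": 102, "name": "B. Sample", "department": "Chemistry"},
]
documents = [row_to_document(r) for r in rows]
```

One document is produced per row, so a table with 8463 records would yield 8463 documents in the same way.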

Indexes were created on the “Faculty”, “FacultyResearch” and “FacultyPublication” tables for testing purposes. These indexes were used to search for test keywords, and the matching records were retrieved. One problem arose when combining the scores: how can the scores of different sections of a particular profile be combined meaningfully? For example, a search for the keyword “nano” returned results from both the “FacultyResearch” and “FacultyPublication” tables, because a faculty profile can contain research related to nanotechnology as well as a publication listing. These results had different scores, and it was not clear how to combine them into a single score for the matching profile. Because of this problem, the database was redesigned so that all the information for a particular profile type is held in the same table.

With the newly designed database, discussed in Section 5, there are as many tables as profile types. For this testing there are 6 profile types and hence 6 tables. An index is created for each table: “facultyIndex” is created for the “FacultyProfile” table and, similarly, “researchCenterIndex”, “technologyIndex”, “facilityIndex”, “equipmentIndex” and “labIndex” are created for the other tables. All these indexes are added to a “rootIndex”, which is created using the Lucene multisearcher index creation method. !Picture10.png|border=1,width=932,height=911!
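The idea of a root index over several per-type indexes can be sketched as follows. This is plain Python and not the Lucene multisearcher API: the dict-based indexes, the sample texts, and the term-count score are hypothetical stand-ins chosen only to show one search fanning out to every index and the hits being merged into a single ranked list.

```python
def search_index(index, keyword):
    """Return (record_id, score) pairs for one index.
    Score is a simple term count, standing in for Lucene's real scoring."""
    hits = []
    for record_id, text in index.items():
        score = text.lower().split().count(keyword.lower())
        if score > 0:
            hits.append((record_id, score))
    return hits

def search_root(root, keyword):
    """Search every index registered under the root and merge the hits,
    highest score first."""
    hits = []
    for index_name, index in root.items():
        for record_id, score in search_index(index, keyword):
            hits.append((index_name, record_id, score))
    hits.sort(key=lambda hit: hit[2], reverse=True)
    return hits

# Hypothetical contents; each index maps a record id to its searchable text.
root_index = {
    "facultyIndex": {1: "nano materials and nano devices", 2: "medieval history"},
    "labIndex": {7: "nano fabrication lab"},
    "equipmentIndex": {4: "electron microscope"},
}
merged_hits = search_root(root_index, "nano")
```

Each hit carries the name of the index it came from, so a result can still be traced back to its profile type after merging.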


With this design, searching for the keyword “nano” returns results from each of the profile types, and the score returned for each record is the combined score from all the sections of that profile. The “FacultyProfile” table now contains the columns ‘facultyResearch’ and ‘facultyPublication’. When “facultyIndex” is created, each row of the “FacultyProfile” table becomes one document. A search for “nano” returns the documents in this index that match the keyword, and the score is calculated over all the fields of the document, in other words over all the columns of the table. The result is a single combined score covering the ‘research’ and ‘publication’ sections of the faculty profile.
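A short sketch shows why one document per profile yields one score. It is plain Python with a simple term count standing in for Lucene's actual scoring, which is more involved; the helper names and the sample text are hypothetical, while the column names are the ones mentioned above.

```python
def field_score(text, keyword):
    """Count how often the keyword occurs in one field (a stand-in for
    Lucene's real per-field scoring)."""
    return text.lower().split().count(keyword.lower())

def document_score(document, keyword):
    """Score a document over all of its fields, i.e. all columns of the row."""
    return sum(field_score(value, keyword) for value in document.values())

# Hypothetical row of the "FacultyProfile" table after the redesign.
faculty_profile_row = {
    "facultyResearch": "nano materials and nano devices",
    "facultyPublication": "A survey of nano sensors",
}

research_score = field_score(faculty_profile_row["facultyResearch"], "nano")
publication_score = field_score(faculty_profile_row["facultyPublication"], "nano")
combined_score = document_score(faculty_profile_row, "nano")
```

With separate tables, the research and publication scores would arrive as two unrelated hits; with the single table, they are folded into one score for the profile without any extra merging step.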

*Section 5: Redesign Database Tables*

After facing the problem of merging scores from different sections of a profile, the database was redesigned. In the new design, each of the profile tables is created by denormalizing the record for that profile type. The new design is presented in Figure 9.