Understanding the Phase 1 design - March 2nd, 2012


Introduction

The design of phase 1 was initially divided into the following major tasks:

  1. Design local database
  2. Define keyword extraction scheme from user input
  3. Match keywords with database records and extract relevant records
  4. Design ranking algorithm to rank the search results

As these tasks progressed, some adjustments became necessary and the following tasks were added:

  1. Write script to migrate data from profile system database
  2. Research on index creation approach
  3. Redesign database tables


So the revised task sequence is as follows:

  1. Design local database
  2. Define keyword extraction scheme from user input
  3. Write script to migrate data from profile system database
  4. Research on index creation approach
  5. Redesign database tables
  6. Match keywords with database records and extract relevant records
  7. Design ranking algorithm to rank the search results

This document explains each of these tasks and provides an overall understanding of the phase 1 design. The description of each task can be found in the relevant section.

Section 1: Design Local Database

Phase 1 uses UT Arlington's Profile System as the source of data. The current Profile System database has tables that were designed to support the operation of the Profile System website. The objective of the NIST project, in contrast, is to search data: there will be no insert, update, or delete operations on database records from the client side. A new database design was therefore proposed to meet this need.

Database design - 1

In the Profile System there are six different types of profiles: Faculty, Research Center, Technology, Facility, Equipment, and Labs. In this design, each profile type had a base table that included the profileId (pid in the Profile System). There were also dependent tables that held information for the different sections of a particular profile. For example, the Faculty profile had the base table "Faculty", which held the profileId, while tables such as "FacultyPublication", "FacultyPresentation", and "FacultyKeyword" held the information of the relevant sections. Each dependent table has a foreign key facultyId that links its records to a particular row of the "Faculty" table. The E-R diagram for the Faculty tables is presented in Figure 1.
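
For illustration, a minimal sketch of a design-1 base table and one dependent table as they might be created from PHP; the column names other than profileId and facultyId are hypothetical:

<?php
// Connect to the local database (credentials are placeholders).
$pdo = new PDO('mysql:host=localhost;dbname=nist-test', 'user', 'pass');

// Base table: one row per Faculty profile.
$pdo->exec('CREATE TABLE Faculty (
    profileId INT PRIMARY KEY,
    name      VARCHAR(255)
)');

// Dependent table: linked to the base table through facultyId.
$pdo->exec('CREATE TABLE FacultyPublication (
    id        INT AUTO_INCREMENT PRIMARY KEY,
    facultyId INT,
    title     TEXT,
    FOREIGN KEY (facultyId) REFERENCES Faculty (profileId)
)');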

Using a similar approach, tables for the other profile types were also designed. The E-R diagrams for these tables are provided in the subsequent figures.

Database design - 2

During task 4 it was realized that database design 1 is not the best approach for creating multiple indexes, so the design was changed in task 5. The new approach is discussed further in Section 5.

Section 2: Keyword Extraction

The user will provide information related to their needs using a form. The form has different input fields that accept different types of input, such as paragraphs and checkboxes. Users will submit this form and receive results relevant to their needs. Keywords have to be extracted from this user input so that they can be matched against database records.

Research on the keyword extraction issue provided us with two options:

  1. Using a third-party API to extract keywords
  2. Using the Lucene search engine, which internally extracts keywords

These two options are discussed below.

          a.) Using a third-party API: Some companies provide keyword extraction as a web service. A study of these services is presented in Table 1.

Sl No | Company | Web Service/Package | Usage | Limitations | Authentication
----- | ------- | ------------------- | ----- | ----------- | --------------
1 | Yahoo (Content Analysis Service) | Web Service | Non-commercial only | 5,000 queries per day per IP address | API key
2 | Tagthe.net | Web Service | Nothing mentioned on the page | No limitations | No API key
3 | OpenCalais | Web Service | Both commercial and non-commercial | 50,000 transactions/day, 4 transactions/second; it appears a larger quota can be requested if needed | API key
4 | Alchemy API | Web Service | Both commercial and non-commercial | 30,000 API calls/day for academic users; website says "Higher limits available to educational institutions and non-profit groups." | API key
5 | Zemanta | Web Service | Nothing mentioned on the page | 1,000 calls/day; can be increased to 10,000 calls/day | API key
6 | Semantic Hacker API | Web Service | Nothing mentioned on the page | 20,000 calls/day, 5 calls/second | API key

Table 1: Comparison of web service APIs provided by different companies

Among these alternatives, Alchemy API shows comparatively good performance in extracting keywords from different paragraphs. Moreover, it provides higher limits for academic users. Hence it is a good option for keyword extraction.

            b.) Using the Lucene search engine: Zend Lucene will be used to develop the search engine. Lucene can take a text input and search the indexed records for the important words present in that text, but it cannot perform keyword extraction with the accuracy of the dedicated APIs. As the user input from the client side will consist of paragraphs, it is better to extract the important keywords from the paragraphs first and then provide them to Lucene for searching.

We are planning to use Alchemy API for keyword extraction.
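
As a rough sketch, a keyword extraction call from PHP could look like the following; the endpoint URL and JSON field names are assumptions based on the public AlchemyAPI documentation of the time, not part of this design:

<?php
// Hypothetical sketch: extract ranked keywords from a paragraph
// using the AlchemyAPI web service.
$apiKey = 'YOUR_API_KEY';
$text   = 'We need nano fabrication facilities and related equipment.';

$ch = curl_init('http://access.alchemyapi.com/calls/text/TextGetRankedKeywords');
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query(array(
    'apikey'     => $apiKey,
    'text'       => $text,
    'outputMode' => 'json',
)));
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$response = json_decode(curl_exec($ch), true);
curl_close($ch);

// Each extracted keyword comes with a relevance score that can be
// passed on to the Lucene search step.
foreach ($response['keywords'] as $keyword) {
    echo $keyword['text'] . ' (' . $keyword['relevance'] . ")\n";
}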

Section 3: Write Migration Script

An initial migration script was written for database design 1. When the database was redesigned, a new migration script was written. The workflow steps are as follows:

  1. Copy the relevant tables from the Profile System database to the new "nist-test" database.
  2. Create tables in the "nist-test" database according to the design.
  3. From the PHP migration script, make a connection to the "nist-test" database.
  4. Write separate methods for populating the tables. For example, an "insertIntoFacultyTable" method fetches rows from the "ppl_general_info" table, creates a new row from this data, and inserts it into the "Faculty" table; a sketch of such a method follows this list.
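
A minimal sketch of one such method; the source column names are hypothetical, and only the table names ("ppl_general_info", "Faculty") come from the design:

<?php
// Sketch of one table-population method from the migration script.
function insertIntoFacultyTable(PDO $profileSystem, PDO $nistTest)
{
    $select = $profileSystem->query(
        'SELECT pid, first_name, last_name FROM ppl_general_info');
    $insert = $nistTest->prepare(
        'INSERT INTO Faculty (profileId, name) VALUES (?, ?)');

    // Copy each Profile System row into the new Faculty table.
    while ($row = $select->fetch(PDO::FETCH_ASSOC)) {
        $insert->execute(array(
            $row['pid'],
            $row['first_name'] . ' ' . $row['last_name'],
        ));
    }
}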

For phase 1, we are considering only the UT Arlington Profile System as the source of data. Later, we will have to develop APIs to get data from different partner institutions and populate the local database.

Section 4: Research on Index Creation Approaches

The Lucene search engine uses indexes to find the records in the database that match the keywords. First, the index files need to be created from the database tables; Lucene stores these files on the file system, not in the database. When a user submits a search string, Lucene forms a query object from the string using the keywords. It then uses the index files to find the ids of the records that match the query, and each record is fetched from the database by its id.
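
In Zend Lucene, this flow looks roughly as follows; the index path, field name, and table name are assumptions for illustration:

<?php
// Minimal sketch of the search flow: parse the keywords into a query,
// search the index, then fetch each matching record by its id.
$index = Zend_Search_Lucene::open('/data/indexes/facultyIndex');
$query = Zend_Search_Lucene_Search_QueryParser::parse('nano fabrication');

$pdo  = new PDO('mysql:host=localhost;dbname=nist-test', 'user', 'pass');
$stmt = $pdo->prepare('SELECT * FROM Faculty WHERE profileId = ?');

foreach ($index->find($query) as $hit) {
    // Only the record id lives in the index; the full record is
    // fetched from the database by that id.
    $stmt->execute(array($hit->profileId));
    $record = $stmt->fetch(PDO::FETCH_ASSOC);
}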

The first step is to create the index files from the database records. Using database design 1, a separate index was created for each of the tables. An index file consists of documents, and each document has fields that hold the values of a record.

For example, consider the "Faculty" table, which has 8,463 records. Each record has the columns defined in database design 1. We create an index file on the "Faculty" table; this index file will have 8,463 documents, each with the column names as field names and the column values as values. Only the id of a record is stored in the index file, so that the file does not become very large.
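
A minimal index-creation sketch with Zend Lucene; the index path and the 'name' column are assumptions:

<?php
// Sketch: build the index over the "Faculty" table. Text columns are
// indexed but not stored; only the record id is stored in the index.
$pdo   = new PDO('mysql:host=localhost;dbname=nist-test', 'user', 'pass');
$index = Zend_Search_Lucene::create('/data/indexes/facultyIndex');

foreach ($pdo->query('SELECT * FROM Faculty') as $row) {
    $doc = new Zend_Search_Lucene_Document();
    // Stored and indexed untokenized, so a hit can be mapped back
    // to its database row.
    $doc->addField(Zend_Search_Lucene_Field::Keyword('profileId', $row['profileId']));
    // Indexed for searching but not stored, to keep the index small.
    $doc->addField(Zend_Search_Lucene_Field::UnStored('name', $row['name']));
    $index->addDocument($doc);
}
$index->commit();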

Indexes were created using the "Faculty", "FacultyResearch", and "FacultyPublication" tables for testing purposes. These indexes were used to search for test keywords, and the matching records were retrieved. But there was one problem when combining the scores: how do we meaningfully combine the scores of different sections of a particular profile? For example, searching for the keyword "nano" returned results from both the "FacultyResearch" and "FacultyPublication" tables, since a faculty profile can contain research related to nanotechnology as well as publications on it, and the results had different scores. The issue is how to combine these scores to find a matching profile. Because of this problem, the database was redesigned so that all the information for a particular profile type is held in the same table.

With the newly designed database as discussed in Section 5, there are as many tables as profile types. For this testing, there are 6 profile types and hence 6 tables. An index is created for each of the tables: "facultyIndex" is created for the "FacultyProfile" table, and similarly "researchCenterIndex", "technologyIndex", "facilityIndex", "equipmentIndex", and "labIndex" are created. All these indexes are added to a "rootIndex", which is created using the Lucene multisearcher index creation method.

Using this design, searching for the keyword "nano" returns results from each of the profile types, and the score returned for each record is the combined score from all the sections of that profile. The "FacultyProfile" table now contains columns such as 'facultyResearch' and 'facultyPublication'. When "facultyIndex" is created, it contains each row of the "FacultyProfile" table as a document. When the keyword "nano" is searched, this index returns the documents matching the keyword, and the score is calculated over all the fields of the document, in other words, over all the columns of that table. Hence we get a combined score from the 'research' and 'publication' sections of the faculty profile.
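
A rough sketch of the rootIndex and the combined search, assuming Zend Lucene's Zend_Search_Lucene_Interface_MultiSearcher class and the index paths used above:

<?php
// Sketch: combine the per-profile-type indexes into one rootIndex
// and search all of them with a single query.
$rootIndex = new Zend_Search_Lucene_Interface_MultiSearcher();
foreach (array('facultyIndex', 'researchCenterIndex', 'technologyIndex',
               'facilityIndex', 'equipmentIndex', 'labIndex') as $name) {
    $rootIndex->addIndex(Zend_Search_Lucene::open('/data/indexes/' . $name));
}

// One query searches all profile types; each hit's score already
// combines every section of that profile, because all sections are
// fields of the same document.
foreach ($rootIndex->find('nano') as $hit) {
    echo $hit->profileId . ': ' . $hit->score . "\n";
}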

Section 5: Redesign Database Tables

After facing the problem of merging scores from different sections of a profile, the database was redesigned. In the new design, each profile table is created by denormalizing the records for that profile type: all the sections of a profile are flattened into columns of a single row. The new design is presented in Figure 9.
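
For illustration, a minimal sketch of the denormalized "FacultyProfile" table; only the 'facultyResearch' and 'facultyPublication' columns come from the text, and the remaining columns are assumptions:

<?php
// Sketch: the denormalized design keeps every section of a profile
// in one row, so one Lucene document can score all sections together.
$pdo = new PDO('mysql:host=localhost;dbname=nist-test', 'user', 'pass');
$pdo->exec('CREATE TABLE FacultyProfile (
    profileId          INT PRIMARY KEY,
    name               VARCHAR(255),
    facultyResearch    TEXT,
    facultyPublication TEXT
)');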
