Specify 6 - Web Portal

Specify Software Project Staff
2 April 2008
Version 1.0

Web Portal

The Web Portal for Specify needs to present a pleasant Web 2.0 look and feel. Although the core information for each collection resides in the same place within the Specify database, the portal needs to be flexible enough to provide auxiliary information, including images.

As I have mentioned in our weekly meetings, I have been looking into using Lucene to index a Specify database and then make it available as a 'search portal'. The front end would use HTML, JavaScript and jQuery, with Apache Solr as the back-end server.

There are two approaches to integrating Lucene into a search module:

  1. Use the Lucene index without storing any of the 'content' of the records in the index itself. The Lucene index then acts as a cross-reference from a full text search of the Specify Collection Object records to Specify database table record IDs. The search is performed, and the primary keys that were stored during indexing are returned for the hits on the various tables. The record IDs are then used to query a full-schema or partial-schema Specify database, and the resulting content is displayed on a web page.
  2. Use the Lucene index and store the content in the index documents. That stored content is then displayed to the user directly.
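
To make the difference between the two approaches concrete, here is a minimal Python sketch. It is not Lucene: a toy in-memory inverted index and an SQLite database stand in for the Lucene index and the Specify database, and the table and column names (collectionobject, catalog_number, collector, species) are hypothetical, chosen only for illustration.

```python
import sqlite3
from collections import defaultdict

# Toy stand-in for the Specify database (table and column names are hypothetical).
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE collectionobject "
    "(id INTEGER PRIMARY KEY, catalog_number TEXT, collector TEXT, species TEXT)"
)
rows = [
    (1, "KU-0001", "bentley", "viridis"),
    (2, "KU-0002", "david", "caprodes"),
    (3, "KU-0003", "bentley", "caprodes"),
]
db.executemany("INSERT INTO collectionobject VALUES (?, ?, ?, ?)", rows)

# Approach 1: the index maps each term to record IDs only.
id_index = defaultdict(set)
# Approach 2: the index additionally keeps a copy of each record's content.
stored_docs = {}
for rec_id, catalog, collector, species in rows:
    for term in (catalog.lower(), collector, species):
        id_index[term].add(rec_id)
    stored_docs[rec_id] = {
        "catalog_number": catalog,
        "collector": collector,
        "species": species,
    }


def search_without_content(term):
    """Approach 1: get IDs from the index, then fetch the content from the database."""
    ids = sorted(id_index.get(term, set()))
    if not ids:
        return []
    placeholders = ",".join("?" * len(ids))
    cur = db.execute(
        "SELECT id, catalog_number, collector, species FROM collectionobject "
        f"WHERE id IN ({placeholders}) ORDER BY id",
        ids,
    )
    names = ("id", "catalog_number", "collector", "species")
    return [dict(zip(names, row)) for row in cur]


def search_with_content(term):
    """Approach 2: the index alone answers the query, with no database round trip."""
    return [dict(id=i, **stored_docs[i]) for i in sorted(id_index.get(term, set()))]
```

Both functions return the same rows for the same term; the difference is that the first needs a live database behind it, while the second needs a larger index.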

Approach #1 - Lucene Index without Content

The following diagram shows the architecture of a 'no-content' approach:

Figure #1 - Using Lucene/Solr Without Storing Content

The Pros

The advantages of using this approach:

  1. The Lucene index files are much smaller: about half the size for a small database and about one-fifth the size for a very large one (see 'Lucene Indexing Times and Index Sizes').
  2. Because the information being displayed is not stored during the indexing process, there is more flexibility as to what is displayed, and the display can be augmented without re-indexing.

The Cons

  1. The overall architecture is considerably more complicated, and it requires an additional server-side app to be written for receiving the record IDs and sending back the content. There are two options for this approach:
    1. The front-end sends the search to the Solr server and plays 'broker': it gets the record IDs back and sends them to a 'content collector' on the back-end, which retrieves the content and sends it back to the browser. This makes for a more complicated front-end and requires two requests over the internet to fulfill each search request.
    2. The front-end makes the search requests directly to the 'content collector', which searches the Lucene index without using Solr. It then takes the record IDs from Lucene, queries the Specify DB and returns the content of the results (see Figure #2). This would be a much better approach.

Figure #2 - All the Processing On the Back-End

Note: In Figure #2, instead of accessing the 'live' Specify database, the Content Collector could be set up to access a replicated database. An alternative to a replicated Specify database would be to have the indexing process create a simple denormalized Specify DB that the Content Collector uses instead.
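
As a rough illustration of the denormalized alternative, the sketch below flattens a few joined tables into one row per collection object, so that a content collector can fetch display data with a single lookup and no joins. It uses SQLite only so that the example is self-contained; the table names, column names and sample values are all invented for the example and are not the actual Specify schema.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE taxon (id INTEGER PRIMARY KEY, full_name TEXT);
CREATE TABLE locality (id INTEGER PRIMARY KEY, name TEXT, latitude REAL, longitude REAL);
CREATE TABLE collectionobject (
    id INTEGER PRIMARY KEY, catalog_number TEXT, taxon_id INTEGER, locality_id INTEGER
);
INSERT INTO taxon VALUES (1, 'caprodes');
INSERT INTO locality VALUES (1, 'Sample Locality', 39.1, -95.2);
INSERT INTO collectionobject VALUES (10, 'KU-0010', 1, 1);
INSERT INTO collectionobject VALUES (11, 'KU-0011', 1, NULL);

-- Build the flat table once, e.g. at the end of each indexing run.
CREATE TABLE portal_flat AS
SELECT co.id             AS id,
       co.catalog_number AS catalog_number,
       t.full_name       AS taxon,
       l.name            AS locality,
       l.latitude        AS latitude,
       l.longitude       AS longitude
FROM collectionobject co
LEFT JOIN taxon t ON t.id = co.taxon_id
LEFT JOIN locality l ON l.id = co.locality_id;
""")


def fetch_flat(ids):
    """Return the flat display rows for the record IDs produced by the index search."""
    ids = list(ids)
    if not ids:
        return []
    placeholders = ",".join("?" * len(ids))
    cur = db.execute(
        "SELECT id, catalog_number, taxon, locality, latitude, longitude "
        f"FROM portal_flat WHERE id IN ({placeholders}) ORDER BY id",
        ids,
    )
    names = [col[0] for col in cur.description]
    return [dict(zip(names, row)) for row in cur]
```

The LEFT JOINs keep collection objects that have no locality or taxon; those columns simply come back empty.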

 

Approach #2 - Lucene Index with Content

This approach eliminates the need for an Apache server instance and a separate web app (see Figure #3).

Figure #3 - Solr Back-End Only

The Pros

  1. No back-end server or 'content collector' app is required. All the content results are returned by Solr as JSON and are processed in the front-end for display in a Grid widget (table).
  2. Searches are returned very fast.
  3. This solution is extremely easy to develop and again, requires no back-end coding.
  4. The indexing time does not increase significantly.
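
The following sketch shows roughly what 'processing the JSON in the front-end' involves. In the portal this logic would live in JavaScript in the browser; Python is used here only to keep the example self-contained. The parameters q, wt=json, rows and start are standard Solr select parameters, and the response/numFound/docs layout is Solr's standard JSON response shape. The server URL and the field names (catalog_number, collector, species) are assumptions.

```python
import json
from urllib.parse import urlencode

# Assumed deployment URL; the real host, port and path depend on the Solr install.
SOLR_SELECT_URL = "http://localhost:8983/solr/select"


def build_select_url(query, rows=50, start=0):
    """Build a Solr select URL that asks for one page of JSON results."""
    params = {"q": query, "wt": "json", "rows": rows, "start": start}
    return SOLR_SELECT_URL + "?" + urlencode(params)


def docs_to_grid(response_text, columns):
    """Turn a Solr JSON response into (total hit count, list of grid rows)."""
    result = json.loads(response_text)["response"]
    grid = [[doc.get(col, "") for col in columns] for doc in result["docs"]]
    return result["numFound"], grid


# A hand-written response in Solr's JSON shape (field names are hypothetical).
sample_response = json.dumps({
    "responseHeader": {"status": 0, "QTime": 1},
    "response": {
        "numFound": 2,
        "start": 0,
        "docs": [
            {"catalog_number": "KU-0001", "collector": "bentley", "species": "viridis"},
            {"catalog_number": "KU-0003", "collector": "bentley"},
        ],
    },
})
```

Calling docs_to_grid(sample_response, ["catalog_number", "collector", "species"]) yields the hit count and one list per grid row, with an empty string wherever a document lacks a field.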

The Cons

  1. As mentioned for the first approach, storing all the content in the Lucene index reduces the flexibility of what can be shown without re-indexing, although I do not see this as significant.
  2. The index files do get very large, but probably not as large as a Specify replicated database.

 

Displaying Hierarchical Information - Trees (Taxonomy, Geography, Stratigraphy)

Several front-end toolkits have 'tree' widgets for displaying hierarchical information, and most accept JSON from a server. The idea is simple: retrieve a parent node and its child nodes, along with whether each of the children is itself a parent.

Full text indexing does not lend itself well to traversing a tree structure held in a SQL database table. The best approach is a simple back-end script or app that queries the database for a parent node's direct children each time the node is visually expanded.
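
A minimal sketch of such a back-end query is shown below, assuming a hypothetical taxon table with id, parent_id and name columns (not the actual Specify schema). SQLite keeps the example self-contained. The JSON keys text and leaf are placeholders too; the names a given tree widget expects will vary.

```python
import json
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE taxon (id INTEGER PRIMARY KEY, parent_id INTEGER, name TEXT);
INSERT INTO taxon VALUES (1, NULL, 'Life');
INSERT INTO taxon VALUES (2, 1, 'Animalia');
INSERT INTO taxon VALUES (3, 2, 'Chordata');
INSERT INTO taxon VALUES (4, 1, 'Plantae');
""")


def children_json(parent_id):
    """Return the direct children of one node as JSON, flagging which are leaves."""
    cur = db.execute(
        """
        SELECT t.id, t.name,
               EXISTS (SELECT 1 FROM taxon c WHERE c.parent_id = t.id) AS is_parent
        FROM taxon t
        WHERE t.parent_id = ?
        ORDER BY t.name
        """,
        (parent_id,),
    )
    nodes = [
        {"id": node_id, "text": name, "leaf": not is_parent}
        for node_id, name, is_parent in cur
    ]
    return json.dumps(nodes)
```

Each expansion costs one query, and the EXISTS subquery tells the widget whether to draw an expand control for each child.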

When displaying tree information, the Web Portal may need to display a collection's entire tree, or only the branches and leaf nodes that have associated information. For example, it could display only the branches and leaf nodes of a Taxonomy tree where the leaf nodes are used in a determination, or, for Geography, only the counties that are used by Localities associated with Collecting Events.

The Web Portal UI could have a simple switch that enables the tree to be viewed in either mode.

There are two approaches for displaying the tree information:

  1. Use the Specify database 'as is', accessing the current tree structure. This would require additional queries to be executed for each node expansion in the Taxon tree to determine whether the nodes are on a branch that has determinations. (This will reduce performance.)
  2. Create a new tree from the existing data that stores additional information on each node: whether it has children and whether it is part of a branch that has associated information. (I would recommend this approach for performance and flexibility reasons.)
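
The second approach amounts to a one-time pass over the tree that marks every node on a branch leading to associated data. Below is a minimal sketch of that pass, assuming the tree is available as a mapping from node ID to parent ID and that the set of 'used' node IDs (for example, taxa referenced by determinations) has already been collected. The function and flag names are hypothetical.

```python
from collections import defaultdict


def annotate_tree(nodes, used_ids):
    """Precompute per-node flags for the portal tree.

    nodes:    {node_id: parent_id, or None for the root}
    used_ids: IDs of nodes that have associated records
    Returns   {node_id: {"has_children": bool, "has_data": bool}}
    """
    children = defaultdict(list)
    for node_id, parent_id in nodes.items():
        if parent_id is not None:
            children[parent_id].append(node_id)

    flags = {
        node_id: {"has_children": bool(children[node_id]), "has_data": False}
        for node_id in nodes
    }

    # Walk from each used node up to the root, stopping early when an
    # ancestor is already marked, so every node is marked at most once.
    for node_id in used_ids:
        current = node_id if node_id in nodes else None
        while current is not None and not flags[current]["has_data"]:
            flags[current]["has_data"] = True
            current = nodes[current]
    return flags
```

With these flags stored on each node, the 'entire tree' versus 'only branches with data' switch becomes a simple filter on has_data at expansion time.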

Lucene Indexing Times and Index Sizes

The table shows the elapsed indexing time and the resulting index size for a small and a large database, with and without content stored in the index:

Database   Collection Object Records   Indexing Time          Holds Content   Index Size
KU Fish    40,214                      14 sec.                No              13 MB
KU Fish    40,214                      16 sec.                Yes             25 MB
KU Ento    814,166                     268 sec. (4.47 min.)   No              91 MB
KU Ento    814,166                     268 sec.               Yes             458 MB
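
The relative index sizes can be worked out directly from the measurements in the table. The short calculation below uses only those figures.

```python
# (database, records, index MB without content, index MB with content),
# copied from the table above.
measurements = [
    ("KU Fish", 40214, 13, 25),
    ("KU Ento", 814166, 91, 458),
]


def size_summary(no_content_mb, with_content_mb):
    """Return (no-content index as % of the with-content index, % saved)."""
    ratio = no_content_mb / with_content_mb
    return round(ratio * 100), round((1 - ratio) * 100)


for name, records, without_mb, with_mb in measurements:
    percent_of, percent_saved = size_summary(without_mb, with_mb)
    print(f"{name}: no-content index is {percent_of}% of the size ({percent_saved}% smaller)")
```

For KU Fish this gives 52% of the size (48% smaller), and for KU Ento 20% of the size (80% smaller), so the savings from leaving the content out grow with the size of the database.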

 

UI Recommendation

Although the demo UI was created using jQuery for both the Grid and the browsable Tree, my recommendation is to go with ExtJS. It has a much better tree widget and the overall L&F is better.

Lucene / Solr Demo

The demo application uses Approach #2, which stores all the content in the Lucene index. It is not fully functional, but it does give you an idea of the performance and of how Solr can be used. The demo did not require any back-end code for the 'Simple' and 'Advanced' searches. The Taxonomic Tree browser required a small PHP script to retrieve node information (Approach #1 in the section 'Displaying Hierarchical Information').

Each of the last two columns in the Grid display of the results may or may not show an icon. When the 'map' icon is displayed, the Locality has a Lat/Lon and can be mapped. The Image column shows an icon when the record has images. If either of these two icons is displayed, you can click on the green plus button at the front of the row to expand the row and see additional information. In this demo each row already contains most of the information for the Col. Obj., so there isn't any additional text to display, just a map or image. This space could be used to display publication/author information.

In the queries below, do not include the single quotes in the search.

If anything doesn't work perfectly, keep in mind this is a prototype that I put together in a couple of days.

Things to try in the Simple Search:

  1. Search for: 'bentley' (then expand the row)
  2. Search for: 'bentley and viridis'
  3. Search for: 'bentley or viridis'
  4. Search for: '(bentley or david) and kuntee'
  5. Search for: '(bentley or david) and kansas'
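
The example queries use lowercase 'and' and 'or'. Lucene's classic query syntax only recognizes boolean operators written in upper case (AND, OR), so a front end that accepts the lowercase forms has to translate them before sending the query. How the demo handles this is not described here; the function below is only a sketch of one way to do it, and its name is hypothetical.

```python
import re

# A token is a quoted phrase, a single parenthesis, or a run of other characters.
_TOKEN = re.compile(r'"[^"]*"|\(|\)|[^\s()]+')
_OPERATORS = {"and", "or"}


def to_lucene_query(user_input):
    """Upper-case bare 'and'/'or' words; leave quoted phrases and terms untouched."""
    parts = []
    for token in _TOKEN.findall(user_input):
        word = token.upper() if token.lower() in _OPERATORS else token
        # Put a space between tokens, except right after '(' or right before ')'.
        if parts and token != ")" and parts[-1] != "(":
            parts.append(" ")
        parts.append(word)
    return "".join(parts)
```

For example, to_lucene_query("(bentley or david) and kansas") returns "(bentley OR david) AND kansas", while a quoted phrase such as "and" is passed through unchanged so that it can still be searched for as a word.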

Advanced Search

You can try out various combinations using 'Match Any', which is 'OR', or 'Match All', which is 'AND'.
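
As a sketch of how the 'Match Any' / 'Match All' switch could translate into a query, the function below joins one fielded clause per filled-in search field with OR or AND, using Lucene's field:"value" syntax. The function name and the field names in the usage example are hypothetical; the actual index fields depend on how the Specify data was indexed.

```python
def build_advanced_query(criteria, match_all):
    """Build a fielded Lucene query from {field: value}, skipping blank values."""
    clauses = []
    for field, value in criteria.items():
        value = value.strip()
        if not value:
            continue
        # Escape backslashes and double quotes so the value stays one quoted phrase.
        escaped = value.replace("\\", "\\\\").replace('"', '\\"')
        clauses.append(f'{field}:"{escaped}"')
    joiner = " AND " if match_all else " OR "
    return joiner.join(clauses)
```

For example, build_advanced_query({"collector": "bentley", "species": "caprodes"}, match_all=False) produces collector:"bentley" OR species:"caprodes", and passing match_all=True joins the same clauses with AND.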

Another species to try would be 'caprodes'

 

The Browse Tab

The browse tab displays the Taxon tree. If you expand the topmost path, you will end up at a species that has a Collection Object, and a number will display underneath it. The idea is to be able to click on any of the nodes and get detailed information about it. The tree widget used in the demo does not support this.

The demo can be found here.

Summary

At this time I would recommend using Lucene/Solr with the content stored in the index, and using ExtJS for the front end. This would enable us to quickly create a search portal with no required coding on the back end except for trees (which I have mostly implemented already).