Java Lucene, and all its ports to other languages, including Zend Lucene, are search libraries.
This means that in order to use Zend Lucene, we have to wrap it in other (PHP) code that will integrate the search with the rest of our application. The code generally needs to manage indexing, retrieval and usually some housekeeping of Lucene. We communicate with Zend Lucene using PHP function calls.
Solr, on the other hand, is a search server built on top of Lucene. This means that a Solr instance can run as a stand-alone web application inside a servlet container (such as Tomcat, Jetty or one of several similar programs). It is much easier to set up a Solr server than a Lucene application. We can do a lot with Solr without writing a single line of Java, just by tweaking some XML configuration files. Setting up a Solr server may take as little as a few minutes. The default way to communicate with Solr is through HTTP calls.
In short, a Zend Lucene installation requires a PHP server plus PHP code that handles indexing and retrieval through the library. A Solr installation requires running a Java servlet container and deploying a WAR file into it.
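To illustrate the HTTP-based communication with Solr described above, here is a minimal Python sketch that builds a query URL for Solr's /select request handler and reads the documents out of a JSON response. The host, port and core name ("profiles"), the field names, and the sample response are assumptions chosen for illustration; they are not taken from an actual installation.

```python
import json
from urllib.parse import urlencode

# Assumed values: a local Solr instance and a core named "profiles".
SOLR_BASE_URL = "http://localhost:8983/solr/profiles"


def build_select_url(query, rows=10, base_url=SOLR_BASE_URL):
    """Build the URL for a query against Solr's /select request handler."""
    params = {"q": query, "rows": rows, "wt": "json"}
    return base_url + "/select?" + urlencode(params)


def extract_docs(response_body):
    """Return (total hit count, matching documents) from a Solr JSON response."""
    response = json.loads(response_body)["response"]
    return response["numFound"], response["docs"]


# Hand-written sample in the shape of a Solr JSON response (hypothetical data),
# so that the sketch runs without a Solr server.
SAMPLE_RESPONSE = """
{"responseHeader": {"status": 0, "QTime": 3},
 "response": {"numFound": 1, "start": 0,
              "docs": [{"id": "42", "name": "Jane Doe"}]}}
"""

if __name__ == "__main__":
    print(build_select_url("name:doe", rows=5))
    print(extract_docs(SAMPLE_RESPONSE))
```

Against a running server, the URL returned by build_select_url could be fetched with urllib.request.urlopen and the response body passed to extract_docs. No client library is needed on the application side, which is the practical meaning of "communicating over HTTP".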
We investigated the performance of Solr versus Zend Lucene for indexing and searching. Based on user experiences shared on various technology websites, Solr appears to be the better option for text searching. Reported limitations of Zend Lucene:
- It is suited only to small websites with small amounts of data
- Creating an index from database records takes more time
- When working with large amounts of data, index creation is very slow
- When searching a large index, search performance is poor
Experiences shared by software professionals on various websites are presented below.
- In my experience Zend Lucene is good for small amounts of data, but slows down very quickly as you add more data. I had to research a new alternative to Zend Lucene because its performance just wasn't cutting it on my current project. To make a long story short, we went with Solr, which is built on Apache Lucene. Indexing of 70k+ articles went from hours to minutes.
- We went from 9 - 10 seconds waiting for search results with Zend_Lucene down to milliseconds with Solr. And that was for 70k records. -- Jeff Busby Mar 4 '11 at 15:16
- Zend Lucene and Java Lucene are built in PHP and Java respectively, and PHP is a higher-level language than Java. Just wondering how big the performance difference between the two is with regard to index building and data searching? Is it much more effective to let Java create and rebuild the index, and let PHP use the index?
This is a quote from a Zend Certified Engineer.
Against my better judgment, the company I work for migrated our previous search solution to Zend_Search_Lucene. On pretty heavy-duty hardware, indexing a million documents took several hours, and searches were relatively slow. The indexing process consumed vast amounts of memory, and the indexes frequently became corrupted (using 1.5.2). A single wild card search literally brought the web server to its knees, so we disabled that feature. Memory usage was very high for searches, and as a result requests per second necessarily declined heavily as we had to reduce the number of Apache child processes.
We have since moved to Solr (a Lucene-based Java search server) and the difference is dramatic. Indexing now takes around 10 minutes and searches are lightning fast. What a difference a language makes.
After my adventures with Zend-Lucene-Search, and discovering it isn't all it's cracked up to be when indexing large datasets, I've turned to Solr (thanks to Bill Karwin for that :) )
I've got Solr indexing the db far far quicker now, taking just over 8 minutes to index a table of just over 1.7million rows - which I'm very pleased with.
However, when I come to try and search the index with the Zend port, I run into the following error;
Fatal error: Uncaught exception 'Zend_Search_Lucene_Exception' with message 'Unsupported segments file format' in /var/www/Zend/Search/Lucene.php:407 Stack trace: #0 /var/www/Zend/Search/Lucene.php(555): Zend_Search_Lucene->readSegmentsFile() #1 /var/www/z_search.php(12): Zend_Search_Lucene->_construct('tmp/feeds_index') #2
thrown in /var/www/Zend/Search/Lucene.php on line 407
I've tried to have a search around but can't seem to find anything about this problem, everyone just seems to be able to get them to work?
Any help as always much appreciated :) - Thanks, Tom
This blog post has a single purpose: to warn against using the Zend Framework PHP implementation of Lucene.
I was forced to use it in a project for some reasons. My customer wanted a sophisticated search engine for his Typo3 based website, especially for his business objects. I don’t do Typo3, and my first thought was to build this search engine as an external service in Java Lucene, let Lucene index both website and database business objects, and let Typo3 query Lucene through some HTTP based service – SOAP, XML-RPC, HTTP + JSON, whatever.
Next came some objections by my customer: We only had some “managed root servers”. Those were preconfigured by the hosting company, and the contract wouldn’t allow any major changes in the configuration due to service and support issues and warranties. In particular it wasn’t allowed to install a Java web container such as Tomcat to run Lucene as a webapp, and a monitoring service would look for unwanted threads on each server and instantly kill them. No root access, and only PHP (+ Apache + MySQL) was allowed to run.
Therefore we decided to give the PHP implementation of Lucene inside the Zend Framework a try. But honestly, this turned out to be a nightmare. Here are some reasons:
- Memory. We also had a memory limit of 64 MB per PHP thread on that server. It was not possible to add 10.000 documents to the index within one php-cli run. We also requested 128MB from the hoster, it didn’t help. I had to shut down the indexing process, restart and re-open the index repeatedly, no matter what I tried. Even closing the index and unsetting all references didn’t help.
- Memory. In addition to this, my observation is that Java has far superior class loading, memory management and garbage collection in comparison with PHP. If you don’t pay close attention to it, this makes any serious and bigger project really difficult. To me, PHP Lucene seems to fail in this respect.
- Speed. If you want to build a big index: forget it. A Lucene index can contain some million documents – I wouldn’t know how to build such an index with PHP Lucene. My index currently contains 60.000 documents, and it takes PHP about 2 hours to build it from scratch. Also updating or optimizing an index is much faster in the Java version.
- Speed. This is where PHP Lucene really went bad: querying the index. I had no really complex queries with maybe 10 terms on different fields, including a few range query terms. It sometimes took more than 60 seconds to get a result – when I got a result at all instead of a fatal out of memory error or max execution timeout. The same query in Java Lucene, tested with Luke: some milliseconds! Unbelievable!
- Features. Java Lucene is a well known, proven project. There are many add-ons and related projects such as Solr that make your life with Lucene a joy. There is no comparable ecosystem in the PHP world. E.g. my customer came up with the idea of a location-based search. In Java I could use the Local Lucene add-on to do geographical search. I wouldn’t know how to do this in PHP without reinventing such an algorithm in PHP.
- Features. Another request was: sort results by frequency or count of a term in a field. This sounds simple, but it’s close to impossible – at least in PHP. To the best of my knowledge, you can’t add a field with a TermVector or equivalent for that.
So what happened to my project? We ended up leaving the indexing part in PHP (for now; I feel a final end coming, as the constantly growing amount of data will exceed all limitations of PHP Lucene), while some post-processing and optimizing of the index and the querying part have been ported to Java, after I finally managed to run a small-footprint Jetty “illegally” and unsupported on that server.
Conclusion: Don’t consider using PHP Lucene ever. (Unless your project and amount of data is quite small and any limitations don’t matter.)
In Phase 1, we are considering the UT Arlington Profile System as our data source for indexing and searching. In subsequent phases, we will have to include profiles from partner institutions. Based on the experiences above, we should move from Zend Lucene to Solr, so that additional data sources such as collaborative partners and non-academic institutions can be incorporated smoothly without degrading indexing and search performance.
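As a rough illustration of how profile records might be sent to Solr for indexing, the following Python sketch splits records into fixed-size batches and serializes each batch as a JSON array, which is the document format accepted by Solr's /update handler. The URL, the core name ("profiles"), the batch size and the field names are assumptions for illustration only, not part of any existing setup.

```python
import json
import urllib.request

# Assumed endpoint: local Solr, core "profiles", commit after each request.
SOLR_UPDATE_URL = "http://localhost:8983/solr/profiles/update?commit=true"


def batched(records, size):
    """Yield lists of at most `size` records, preserving input order."""
    batch = []
    for record in records:
        batch.append(record)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch  # final, possibly smaller batch


def update_payloads(records, size=1000):
    """Yield one UTF-8 encoded JSON array per batch of records."""
    for batch in batched(records, size):
        yield json.dumps(batch).encode("utf-8")


def send_payload(payload, url=SOLR_UPDATE_URL):
    """POST one payload to Solr; requires a running server at `url`."""
    request = urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.status


if __name__ == "__main__":
    # Hypothetical profile records; real ones would come from the database.
    profiles = [{"id": str(i), "name": f"Person {i}"} for i in range(5)]
    for payload in update_payloads(profiles, size=2):
        print(payload.decode("utf-8"))
```

Sending records in batches keeps each request small and bounds the memory held by the indexing script, whatever the total number of profiles. Calling send_payload on each payload would perform the actual upload.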