Full Text Search Engine. Fawaz Aziz (0622597)

Revision as of 14:22, 3 April 2009 by 0622597

Roll Your Own Search Engine with Zend_Search_Lucene

Creating the Index



<?php

require_once 'Zend/Feed.php';
require_once 'Zend/Search/Lucene.php';

function sanitize($input)
{
    return htmlentities(strip_tags($input));
}

//create the index
$index = new Zend_Search_Lucene('/tmp/feeds_index', true);

$feeds = array(
    'http://feeds.feedburner.com/ZendDeveloperZone',
    'http://www.planet-php.net/rss/',
    'http://www.sitepoint.com/blogs/category/php/feed/',
);

//grab each feed
foreach ($feeds as $feed) {
    $channel = Zend_Feed::import($feed);

    echo $channel->title()."\n";

    // index each item
    foreach ($channel->items as $item) {
        if ($item->link() && $item->title() && $item->description()) {
            $doc = new Zend_Search_Lucene_Document();
            $doc->addField(Zend_Search_Lucene_Field::Keyword('link', sanitize($item->link())));
            $doc->addField(Zend_Search_Lucene_Field::Text('title', sanitize($item->title())));
            $doc->addField(Zend_Search_Lucene_Field::Unstored('contents', sanitize($item->description())));

            echo "\tAdding: ".$item->title()."\n";
            $index->addDocument($doc);
        }
    }
}

$index->commit();

echo $index->count()." Documents indexed.\n";


The first step after we include the framework files is to actually create the Zend_Search_Lucene object and specify the location to store it. The second parameter indicates that we want to create a fresh index:



//create the index
$index = new Zend_Search_Lucene('/tmp/feeds_index', true);


Next, we specify the RSS feeds we are interested in and fetch them in a loop. For each feed, we loop through its articles and index each one as a separate Zend_Search_Lucene document. Here is the feed fetching and looping code once again, so you can tell the feed processing apart from the indexing. Note that in these code examples I've omitted most error checking for the sake of clarity.



$feeds = array(
    'http://feeds.feedburner.com/ZendDeveloperZone',
    'http://www.planet-php.net/rss/',
    'http://www.sitepoint.com/blogs/category/php/feed/',
);

//grab each feed
foreach ($feeds as $feed) {
    $channel = Zend_Feed::import($feed);

    echo $channel->title()."\n";

    // index each item
    foreach ($channel->items as $item) {
        if ($item->link() && $item->title() && $item->description()) {
            //Create and index a ZSearch Document
        }
    }
}
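
Since the loop above leaves out error checking, one unreachable feed could stop the whole indexing run. The following is a minimal sketch of one way to skip failing feeds, assuming Zend_Feed::import signals failure by throwing an exception. The importFeeds() helper and its $importer parameter are hypothetical names introduced here for illustration; they are not part of the Zend Framework.

```php
<?php

/**
 * Hypothetical helper: fetch every feed URL with $importer and skip any
 * feed that throws, instead of aborting the whole run.
 *
 * Returns array($channels, $failed), both keyed by feed URL.
 */
function importFeeds(array $urls, $importer)
{
    $channels = array();
    $failed   = array();

    foreach ($urls as $url) {
        try {
            $channels[$url] = call_user_func($importer, $url);
        } catch (Exception $e) {
            // Record the failure and carry on with the remaining feeds.
            $failed[$url] = $e->getMessage();
        }
    }

    return array($channels, $failed);
}

// In the indexing script this could be called as (assumption, not run here):
// list($channels, $failed) = importFeeds($feeds, array('Zend_Feed', 'import'));
```

Passing the importer as a callback keeps the helper independent of the framework, so it can be exercised with a stub before it is pointed at real feeds.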



To add a document to our index, we create the document object and specify content for the document's fields. Zend_Search_Lucene provides different ways to analyze and store fields depending on how we need to search them and return the results. In this example, for each RSS item, we want to index the link, title, and description.




$doc = new Zend_Search_Lucene_Document();

$doc->addField(Zend_Search_Lucene_Field::Keyword('link', sanitize($item->link())));

$doc->addField(Zend_Search_Lucene_Field::Text('title', sanitize($item->title())));

$doc->addField(Zend_Search_Lucene_Field::Unstored('contents', sanitize($item->description())));

echo "\tAdding: ".$item->title()."\n";
$index->addDocument($doc);
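
Every field value above passes through the sanitize() helper before it is indexed. As a small illustration of what that helper does to a feed item, here it is applied to a made-up description string (the sample text is an assumption, not taken from any of the feeds):

```php
<?php

// Same helper as in the indexing script.
function sanitize($input)
{
    return htmlentities(strip_tags($input));
}

// Hypothetical feed description containing markup and a bare ampersand.
$raw = '<p>Zend & <b>PHP</b> tips</p>';

// strip_tags() removes the markup, htmlentities() then encodes the ampersand.
echo sanitize($raw), "\n"; // Zend &amp; PHP tips
```

Note that text which already contains entities is encoded a second time, so `&amp;` in a feed becomes `&amp;amp;` in the stored field.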



Searching the Index

Now that we have created a Zend_Search_Lucene index, let's put it to use by performing some searches. You can implement search on an index in just a couple dozen lines of code:




<?php

require_once 'Zend/Search/Lucene.php';

//open the index
$index = new Zend_Search_Lucene('/tmp/feeds_index');

$query = 'framework';

$hits = $index->find($query);

echo "Index contains ".$index->count()." documents.\n\n";

echo "Search for '".$query."' returned " .count($hits). " hits\n\n";

foreach ($hits as $hit) {
    echo $hit->title."\n";
    echo "\tScore: ".sprintf('%.2f', $hit->score)."\n";
    echo "\t".$hit->link."\n\n";
}

?>

Could it be any easier? We include the library, open our index, search for a term, and iterate through the result set. Note that since we used the default case-insensitive text analyzer to build the index, the search query should be lowercase.

The Zend_Search_Lucene query format is powerful but simple. It's a snap to specify multiple query terms with a special syntax.

To search our RSS index for articles that must contain the word 'framework' in the 'contents' field:

$query = '+framework';

For articles with 'Zend' in the title:

$query = 'title:zend';

For articles containing the word 'framework' but without the word 'Zend' in the title:

$query = 'framework -title:zend';
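
When the search terms come from user input, the query string has to be assembled in code. The sketch below builds one from lists of required and excluded terms, using only the `+`, `-` and `field:` syntax shown above. The buildQuery() and queryTerm() helpers are hypothetical names introduced for illustration; they lowercase the search text (in line with the note about lowercase queries) but do not escape any special characters.

```php
<?php

/**
 * Hypothetical helper: format one term with a '+' or '-' prefix.
 * A term is either plain text ('framework') or 'field:text' ('title:Zend').
 */
function queryTerm($prefix, $term)
{
    $term = trim($term);
    $pos  = strpos($term, ':');

    if ($pos === false) {
        return $prefix . strtolower($term);
    }

    // Keep the field name as written; lowercase only the search text.
    return $prefix . substr($term, 0, $pos + 1) . strtolower(substr($term, $pos + 1));
}

/**
 * Hypothetical helper: join required and excluded terms into one query string.
 */
function buildQuery(array $required, array $excluded = array())
{
    $parts = array();

    foreach ($required as $term) {
        $parts[] = queryTerm('+', $term);
    }
    foreach ($excluded as $term) {
        $parts[] = queryTerm('-', $term);
    }

    return implode(' ', $parts);
}

echo buildQuery(array('Framework'), array('title:Zend')), "\n"; // +framework -title:zend
```

The resulting string can be passed to $index->find() in the same way as the hand-written queries.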




Conclusion

In these simple examples, we have seen that the Zend_Search_Lucene module provides an easy way to add customized search functionality to any PHP application without depending on external software packages. As the Zend_Search_Lucene module matures, it will no doubt prove to be a prized component of the Zend Framework. In future articles I hope to explore the advanced indexing and search capabilities of Zend_Search_Lucene, and to put the module through some real-life benchmarks with large data sets, comparing its indexing and search performance against other popular methods.

--0622597 15:22, 3 April 2009 (BST)