The Lucene text search engine library (from the Apache Jakarta project) provides fast and flexible search capabilities that can be easily integrated into many kinds of applications. Lucene provides a number of advanced capabilities “out of the box” and can be extended to accommodate special needs.
For large text collections, you will almost always want to use disk-based indexes that can be updated and reused across multiple executions of an application. For small collections, especially when running in an unsigned applet or Web Start application where disk access is not permitted, Lucene provides a mechanism for maintaining an in-memory index. The example below provides a simple illustration of this capability.
At a minimum, using Lucene typically involves the following steps:
- Build an index using IndexWriter.
  - For file-based indexes, a directory name can be passed to the IndexWriter constructor. In this example, however, we use the RAMDirectory class to maintain an in-memory index.
  - Add Document objects representing each object to be searched to the IndexWriter. A Document is a collection of Field objects. Static factory methods on Field, such as the Field.Text and Field.UnIndexed calls used in this example, create indexed or unindexed content.
  - Optimize and close the IndexWriter object.
- Update the index, by either rebuilding it from scratch or deleting (and, where appropriate, re-adding) Documents. Somewhat unintuitively, deleting Documents from an index is done with an IndexReader object, while re-adding them is done with an IndexWriter.
- Search the index using an IndexSearcher object.
  - As with IndexWriters, IndexSearchers can be constructed with a directory name for file-based indexes. In this example, we pass in the RAMDirectory object that we created when the index was built.
  - A Query object encapsulates the search query. These can be created using the QueryParser class.
  - The Query object is passed to the IndexSearcher's search(…) method, which returns a Hits object that provides access to the Document objects that match the query.
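To make the build, update, and search steps concrete without depending on any particular Lucene version, here is a toy in-memory inverted index written with only java.util. It is a conceptual sketch, not Lucene's implementation: the class ToyIndex and its add, delete, and search methods are hypothetical names invented for illustration and do not exist in the Lucene API. It does show the same split between a stored field (the title) and an indexed field (the content).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeSet;

/** Toy in-memory index; a conceptual sketch only, not Lucene's API. */
class ToyIndex {
    // Stored (retrievable) title for each live document id.
    private final Map<Integer, String> titles = new HashMap<>();
    // Inverted index: lower-cased word -> ids of documents containing it.
    private final Map<String, TreeSet<Integer>> postings = new HashMap<>();
    private int nextId = 0;

    /** Build step: store the title, index the content, return the new id. */
    public int add(String title, String content) {
        int id = nextId++;
        titles.put(id, title);
        for (String word : content.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!word.isEmpty()) {
                postings.computeIfAbsent(word, k -> new TreeSet<>()).add(id);
            }
        }
        return id;
    }

    /** Update step: remove a document; call add() afterwards to replace it. */
    public void delete(int id) {
        titles.remove(id);
        for (TreeSet<Integer> ids : postings.values()) {
            ids.remove(id);
        }
    }

    /** Search step: titles of documents whose content contains the word. */
    public List<String> search(String word) {
        List<String> result = new ArrayList<>();
        String key = word.toLowerCase(Locale.ROOT);
        for (int id : postings.getOrDefault(key, new TreeSet<>())) {
            result.add(titles.get(id));
        }
        return result;
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        int gandhi = index.add("Mohandas Gandhi",
                "Freedom is not worth having if it does not connote freedom to err.");
        index.add("Ayn Rand", "To be free, a man must be free of his brothers.");
        System.out.println(index.search("freedom")); // [Mohandas Gandhi]

        // Update by deleting the old document and adding a revised one.
        index.delete(gandhi);
        index.add("Mohandas Gandhi (revised)", "Freedom to err.");
        System.out.println(index.search("freedom")); // [Mohandas Gandhi (revised)]
    }
}
```

The content string is thrown away once its words are recorded, while the title is kept verbatim. That mirrors the distinction the Lucene example draws between its indexed content field and its stored title field.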
Practically every aspect of Lucene can be customized. The example in Figure 1 illustrates minimal usage of the library.
    /**
     * A simple example of an in-memory search using Lucene.
     */
    import java.io.IOException;
    import java.io.StringReader;

    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.search.Searcher;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.store.RAMDirectory;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.queryParser.ParseException;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;

    public class InMemoryExample {

        public static void main(String[] args) {
            // Construct a RAMDirectory to hold the in-memory representation
            // of the index.
            RAMDirectory idx = new RAMDirectory();

            try {
                // Make a writer to create the index
                IndexWriter writer =
                    new IndexWriter(idx, new StandardAnalyzer(), true);

                // Add some Document objects containing quotes
                writer.addDocument(createDocument("Theodore Roosevelt",
                    "It behooves every man to remember that the work of the " +
                    "critic, is of altogether secondary importance, and that, " +
                    "in the end, progress is accomplished by the man who does " +
                    "things."));
                writer.addDocument(createDocument("Friedrich Hayek",
                    "The case for individual freedom rests largely on the " +
                    "recognition of the inevitable and universal ignorance " +
                    "of all of us concerning a great many of the factors on " +
                    "which the achievements of our ends and welfare depend."));
                writer.addDocument(createDocument("Ayn Rand",
                    "There is nothing to take a man's freedom away from " +
                    "him, save other men. To be free, a man must be free " +
                    "of his brothers."));
                writer.addDocument(createDocument("Mohandas Gandhi",
                    "Freedom is not worth having if it does not connote " +
                    "freedom to err."));

                // Optimize and close the writer to finish building the index
                writer.optimize();
                writer.close();

                // Build an IndexSearcher using the in-memory index
                Searcher searcher = new IndexSearcher(idx);

                // Run some queries
                search(searcher, "freedom");
                search(searcher, "free");
                search(searcher, "progress or achievements");

                searcher.close();
            }
            catch (IOException ioe) {
                // In this example we aren't really doing any I/O, so this
                // exception should never actually be thrown.
                ioe.printStackTrace();
            }
            catch (ParseException pe) {
                pe.printStackTrace();
            }
        }

        /**
         * Make a Document object with an un-indexed title field and an
         * indexed content field.
         */
        private static Document createDocument(String title, String content) {
            Document doc = new Document();

            // Add the title as an unindexed field...
            doc.add(Field.UnIndexed("title", title));

            // ...and the content as an indexed field. Note that indexed
            // Text fields are constructed using a Reader. Lucene can read
            // and index very large chunks of text, without storing the
            // entire content verbatim in the index. In this example we
            // can just wrap the content string in a StringReader.
            doc.add(Field.Text("content", new StringReader(content)));

            return doc;
        }

        /**
         * Searches for the given string in the "content" field
         */
        private static void search(Searcher searcher, String queryString)
                throws ParseException, IOException {

            // Build a Query object
            Query query = QueryParser.parse(
                queryString, "content", new StandardAnalyzer());

            // Search for the query
            Hits hits = searcher.search(query);

            // Examine the Hits object to see if there were any matches
            int hitCount = hits.length();
            if (hitCount == 0) {
                System.out.println(
                    "No matches were found for \"" + queryString + "\"");
            }
            else {
                System.out.println("Hits for \"" + queryString +
                    "\" were found in quotes by:");

                // Iterate over the Documents in the Hits object
                for (int i = 0; i < hitCount; i++) {
                    Document doc = hits.doc(i);

                    // Print the value that we stored in the "title" field. Note
                    // that this Field was not indexed, but (unlike the
                    // "content" field) was stored verbatim and can be
                    // retrieved.
                    System.out.println(" " + (i + 1) + ". " + doc.get("title"));
                }
            }
            System.out.println();
        }
    }
To compile and run this class, you will need to include the Lucene JAR file (downloaded from http://jakarta.apache.org/lucene/) in your classpath. Figure 2 shows the output from the class.
    Hits for "freedom" were found in quotes by:
     1. Mohandas Gandhi
     2. Ayn Rand
     3. Friedrich Hayek

    Hits for "free" were found in quotes by:
     1. Ayn Rand

    Hits for "progress or achievements" were found in quotes by:
     1. Theodore Roosevelt
     2. Friedrich Hayek
Figure 2. Output from the InMemoryExample class.
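The output suggests two behaviors worth spelling out. The query "free" finds only the Ayn Rand quote even though three quotes contain "freedom", which is consistent with terms being compared as whole words rather than as substrings. The query "progress or achievements" finds quotes containing either word, which is consistent with multiple terms being treated as alternatives and the lowercase "or" being discarded as a common word. The sketch below reproduces that behavior with plain Java. It is an illustration under those assumptions, not Lucene code: the class TokenMatchSketch, its methods, and its tiny stop-word list are all hypothetical, and the quotes are shortened excerpts.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Conceptual sketch of whole-word, any-term matching; not Lucene code. */
class TokenMatchSketch {
    // Assumption: a tiny illustrative stop list, far shorter than a real analyzer's.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "is", "of", "or", "the", "to"));

    /** Lower-cases the text, splits it into words, and drops stop words. */
    static Set<String> tokens(String text) {
        Set<String> result = new HashSet<>();
        for (String word : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!word.isEmpty() && !STOP_WORDS.contains(word)) {
                result.add(word);
            }
        }
        return result;
    }

    /** True if at least one query word occurs in the text as a whole word. */
    static boolean matches(String query, String text) {
        Set<String> textTokens = tokens(text);
        for (String word : tokens(query)) {
            if (textTokens.contains(word)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String gandhi = "Freedom is not worth having if it does not connote freedom to err.";
        String rand = "To be free, a man must be free of his brothers.";
        String roosevelt = "In the end, progress is accomplished by the man who does things.";

        System.out.println(matches("free", gandhi));    // false: "freedom" is a different word
        System.out.println(matches("free", rand));      // true
        System.out.println(matches("progress or achievements", roosevelt)); // true: "progress"
        System.out.println(matches("or", roosevelt));   // false: only a stop word remains
    }
}
```

If prefix or stem matching is wanted instead, that is a choice made at analysis or query time. The sketch only mirrors the whole-word behavior visible in Figure 2.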