

1. Table of Contents
1) Understanding Lucene
2) Lucene Indexing
3) Types of Fields in Lucene Index
4) An example of Lucene Index fields
5) Core Searching classes
6) Types of Queries
7) Incremental Indexing
8) Score Boosting and relevance ranking
9) Scoring Algorithm
10) Sorting search results
11) Handling multiple pages of search results
12) Examples of queries possible with Lucene
13) Abstract storage in Index
14) Security
15) Composition of Segments in Lucene Index
16) Debugging the Lucene indexing process
17) Lucene in Alfresco
18) Alfresco repository architecture
19) Why do we sometimes have redundant data in Index and Database
20) Caching
21) Experience of a Lucene implementation
22) Good articles on Lucene

2. Understanding Lucene
There are two things we need to know to create a Lucene search database:
• How you are going to gather (spider) the information to be indexed, and
• What processing you are going to do on the text to increase the likelihood of a successful query.

3. Understanding Lucene
Lucene doesn't care about the source of the data, its format, or even its language, as long as you can convert it to text. This means you can use Lucene to index and search data stored in files: web pages on remote web servers, documents stored in local file systems or in databases, simple text files, Microsoft Word documents, HTML or PDF files, or any other format from which you can extract textual information.
The quality of a search is typically described using precision and recall metrics. Recall measures how well the search system finds relevant documents, whereas precision measures how well the system filters out the irrelevant documents.

4. Lucene Indexing
As you saw in our Indexer class, you need the following classes to perform the simplest indexing procedure:
■ IndexWriter - creates a new index and adds documents to an existing index.
■ Directory - represents the location of a Lucene index. Subclasses: FSDirectory and RAMDirectory.
■ Analyzer - extracts tokens out of the text to be indexed and eliminates the rest.
■ Document - a collection of fields.
■ Field - each field corresponds to a piece of data that is either queried against or retrieved from the index during search.
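A minimal sketch of how these classes fit together, using the Lucene 1.4-era API that the rest of this presentation assumes (the index path and field names are illustrative, not prescribed):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class Indexer {
    public static void main(String[] args) throws Exception {
        // true = create a new index (overwriting any existing one) at this path
        IndexWriter writer = new IndexWriter("/tmp/index", new StandardAnalyzer(), true);
        Document doc = new Document();
        doc.add(Field.Keyword("id", "42"));                            // indexed verbatim and stored
        doc.add(Field.Text("title", "Getting started with Lucene"));   // analyzed, indexed, stored
        writer.addDocument(doc);
        writer.optimize();  // merge segments for faster searching
        writer.close();
    }
}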

5. Types of Fields in Lucene Index
■ Keyword - Isn't analyzed, but is indexed and stored in the index verbatim. This type is suitable for fields whose original value should be preserved in its entirety, such as URLs, file system paths, dates, personal names, Social Security numbers, telephone numbers, and so on.
■ UnIndexed - Is neither analyzed nor indexed, but its value is stored in the index as is. This type is suitable for fields that you need to display with search results (such as a URL or database primary key), but whose values you'll never search directly. Since the original value of a field of this type is stored in the index, this type isn't suitable for storing fields with very large values, if index size is an issue.
■ UnStored - The opposite of UnIndexed. This field type is analyzed and indexed but isn't stored in the index. It's suitable for indexing a large amount of text that doesn't need to be retrieved in its original form, such as the bodies of web pages, or any other type of text document.
■ Text - Is analyzed and indexed. This implies that fields of this type can be searched against, but be cautious about the field size. If the data indexed is a String, it's also stored; but if the data is from a Reader, it isn't stored. This is often a source of confusion, so take note of this difference when using Field.Text.
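The four types side by side, as a hedged sketch (field names and values are invented for illustration):

Document doc = new Document();
String bodyText = "full text of the page";
doc.add(Field.Keyword("url", "http://example.com/a.html"));  // indexed verbatim, stored
doc.add(Field.UnIndexed("dbKey", "12345"));                  // stored only, never searchable
doc.add(Field.UnStored("body", bodyText));                   // analyzed and indexed, not stored
doc.add(Field.Text("title", "A page title"));                // analyzed, indexed, and (String variant) stored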

6. An example of Lucene Index fields

7. Core Searching classes
■ IndexSearcher
■ Term - the basic unit for searching; consists of the name of a field and the value of that field.
■ Query - subclasses: TermQuery, BooleanQuery, PhraseQuery, PrefixQuery, PhrasePrefixQuery, RangeQuery, FilteredQuery, and SpanQuery.
■ TermQuery - the most primitive query type.
■ Hits - a simple container of pointers to ranked search results.
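A minimal search sketch with these classes (again the 1.4-era API; the index path and field name are assumptions carried over from the indexing example):

import org.apache.lucene.document.Document;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class SearchDemo {
    public static void main(String[] args) throws Exception {
        IndexSearcher searcher = new IndexSearcher("/tmp/index");
        Query query = new TermQuery(new Term("title", "lucene"));
        Hits hits = searcher.search(query);  // ranked results
        for (int i = 0; i < hits.length(); i++) {
            Document doc = hits.doc(i);
            System.out.println(hits.score(i) + " : " + doc.get("title"));
        }
        searcher.close();
    }
}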

8. Types of Queries
TermQuery - especially useful for retrieving documents by a key. A TermQuery is returned from QueryParser if the expression consists of a single word.
PrefixQuery - matches documents containing terms beginning with a specified string. QueryParser creates a PrefixQuery for a term when it ends with an asterisk (*) in query expressions.
RangeQuery - facilitates searches from a starting term through an ending term:
RangeQuery query = new RangeQuery(begin, end, true);
BooleanQuery - the various query types discussed here can be combined in complex ways using BooleanQuery, which is itself a container of Boolean clauses. A clause is a subquery that can be optional, required, or prohibited. These attributes allow for logical AND, OR, and NOT combinations. You add a clause to a BooleanQuery using this API method:
public void add(Query query, boolean required, boolean prohibited)
PhraseQuery - an index contains positional information of terms. PhraseQuery uses this information to locate documents where terms are within a certain distance of one another.
FuzzyQuery - matches terms similar to a specified term.
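For instance, a hedged sketch combining the types above with the add(Query, required, prohibited) method (field names are invented; searcher comes from the earlier search example):

BooleanQuery bq = new BooleanQuery();
bq.add(new TermQuery(new Term("contents", "lucene")), true, false);  // required   (AND)
bq.add(new PrefixQuery(new Term("title", "index")), false, false);   // optional   (OR)
bq.add(new TermQuery(new Term("contents", "slow")), false, true);    // prohibited (NOT)
Hits hits = searcher.search(bq);  // matches docs with "lucene" but without "slow"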

9. Incremental Indexing
• We need to create a Process Table in the DB which stores the last indexed time and has a status column with possible values such as InProgress and Complete, denoting whether indexing is currently running.
• While fetching data for indexing from the database, we then filter to get only those records which have been inserted or modified after the last indexing time (see the sketch below).
• A modified record in the database means that we delete the earlier entry from the index and then insert the modified version into the index.
• It is also possible to have a different indexing cycle for each module. For example, messages might require immediate indexing while the library might not. So in the Process Table mentioned above we can also add a column holding the indexing schedule for each module.
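A hedged sketch of the fetch step, assuming a hypothetical schema (documents, process_table, last_indexed, and module are all invented names, not a fixed design); the IndexModifier used for the delete-then-add step is introduced on the next slides:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class IncrementalFetch {
    // Returns only the rows inserted or modified since the last indexing run.
    public static ResultSet fetchChangedRows(Connection conn, String module) throws Exception {
        PreparedStatement ps = conn.prepareStatement(
            "SELECT id, title, body FROM documents " +
            "WHERE modified_date > (SELECT last_indexed FROM process_table WHERE module = ?)");
        ps.setString(1, module);
        return ps.executeQuery();  // each row is deleted from the index and re-added
    }
}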

10. Incremental Indexing (IndexModifier)
• The IndexModifier (org.apache.lucene.index.IndexModifier) is a wrapper object for IndexWriter and IndexReader.
• It makes sure there is only one instance modifying the index (imagine what would happen if two processes wrote to the index at the same time).
• IndexModifier im = new IndexModifier("file:///c:/temp/index", an, true);
• The first argument is the directory name of the index. We use a file-based index, but it's also possible to create an in-memory index. The second argument is the analyzer we just created, and the third parameter says whether we want to create a new index (true creates a new index, false does an incremental index).
• Incremental indexing can be very useful, but we need to be careful. Since Lucene has no notion of primary keys or unique keys like we have in a database, you have to remove the old document if you want to update it. If you add a document twice, it will be found twice.
• Note that you cannot create more than one IndexModifier object on the same directory at the same time.
• Source of this information: http://technology.amis.nl/blog/?p=1288

11. Incremental Indexing (IndexModifier)
• Example usage:
Analyzer analyzer = new StandardAnalyzer();
// create an index in /tmp/index, overwriting an existing one:
IndexModifier indexModifier = new IndexModifier("/tmp/index", analyzer, true);
Document doc = new Document();
doc.add(new Field("id", "1", Field.Store.YES, Field.Index.UN_TOKENIZED));
doc.add(new Field("body", "a simple test", Field.Store.YES, Field.Index.TOKENIZED));
indexModifier.addDocument(doc);
int deleted = indexModifier.delete(new Term("id", "1"));
System.out.println("Deleted " + deleted + " document");
indexModifier.flush();
System.out.println(indexModifier.docCount() + " docs in index");
indexModifier.close();
• Not all methods of IndexReader and IndexWriter are offered by this class. If you need access to additional methods, either use those classes directly or implement your own class that extends IndexModifier.
• Although an instance of this class can be used from more than one thread, it will not give the best performance. For that we might want to use IndexReader and IndexWriter directly (but then we will need to take care of synchronization ourselves).
• While we can freely mix calls to add() and delete() using this class, we should batch our calls for best performance. For example, if we want to update 20 documents, we should first delete all those documents, then add all the new documents.

12. Score Boosting
• Lucene allows influencing search results by "boosting" at more than one level:
• Document level boosting - while indexing - by calling document.setBoost() before a document is added to the index.
• Document's Field level boosting - while indexing - by calling field.setBoost() before adding a field to the document (and before adding the document to the index).
• Query level boosting - during search, by setting a boost on a query clause, calling Query.setBoost().
• The ranking of documents in Lucene is carried out by the Similarity abstract class. I think that we can use the DefaultSimilarity subclass by playing with the boost factor.
• Indexing-time boosts are preprocessed for storage efficiency and written to the directory (when writing the document) in a single byte, as follows: for each field of a document, all boosts of that field (i.e. all boosts under the same field name in that doc) are multiplied. The result is multiplied by the boost of the document, and also multiplied by a "field length norm" value that represents the length of that field in that doc (so shorter fields are automatically boosted up). The result is encoded as a single byte (with some precision loss, of course) and stored in the directory. The similarity object in effect at indexing time computes the length norm of the field.
Source: http://lucene.apache.org/java/docs/scoring.html

13. Boosting Documents and Fields
By default, all Documents have no boost - or, rather, they all have the same boost factor of 1.0. By changing a Document's boost factor, you can instruct Lucene to consider it more or less important with respect to other Documents in the index. The API for doing this consists of a single method, setBoost(float), which can be used as follows:
doc.setBoost(1.5F);
writer.addDocument(doc);
When you boost a Document, Lucene internally uses the same boost factor to boost each of its Fields. To boost a field: subjectField.setBoost(1.2F);
The boost factor values you should use depend on what you're trying to achieve; you may need to do a bit of experimentation and tuning to achieve the desired effect. It's worth noting that shorter Fields have an implicit boost associated with them, due to the way Lucene's scoring algorithm works. Boosting is, in general, an advanced feature that many applications can work very well without.
Document and Field boosting comes into play at search time. Lucene's search results are ranked according to how closely each Document matches the query, and each matching Document is assigned a score. Lucene's scoring formula consists of a number of factors, and the boost factor is one of them.
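The three boost levels in one hedged sketch (boost values and field names are arbitrary examples, not recommendations; writer comes from the earlier indexing example):

// Index-time boosts: set before the field/document is added.
Document doc = new Document();
Field subjectField = Field.Text("subject", "Lucene score boosting");
subjectField.setBoost(1.2F);   // field-level boost
doc.add(subjectField);
doc.setBoost(1.5F);            // document-level boost
writer.addDocument(doc);

// Search-time boost: set on an individual query clause.
TermQuery subjectQuery = new TermQuery(new Term("subject", "lucene"));
subjectQuery.setBoost(4.0F);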

14. Relevancy scoring mechanism
The formula used by Lucene to calculate the rank of a document is reproduced below.
Source: http://infotrieve.com/products_services/databases/LSRC_CST.pdf
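The original slide showed the formula as an image; as a reconstruction, the scoring formula documented for Lucene's Similarity (and hence DefaultSimilarity) is:

score(q,d) = \mathrm{coord}(q,d) \cdot \mathrm{queryNorm}(q) \cdot \sum_{t \in q} \Big( \mathrm{tf}(t \in d) \cdot \mathrm{idf}(t)^2 \cdot \mathrm{boost}(t) \cdot \mathrm{norm}(t,d) \Big)

where tf is the term's frequency in the document, idf its inverse document frequency, boost(t) the search-time boost of term t, coord a factor rewarding documents that match more of the query terms, queryNorm a normalization that makes scores comparable across queries, and norm(t,d) the encoded index-time boosts and field length norm described on slide 12.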

15. Query level boosting
• Boosting a Term: Lucene provides the relevance level of matching documents based on the terms found. To boost a term, use the caret symbol (^) with a boost factor (a number) at the end of the term you are searching. The higher the boost factor, the more relevant the term will be.
• Boosting allows you to control the relevance of a document by boosting its term. For example, if you are searching for jakarta apache and you want the term "jakarta" to be more relevant, boost it using the ^ symbol along with the boost factor next to the term. You would type: jakarta^4 apache
• This will make documents with the term jakarta appear more relevant. You can also boost phrase terms, as in the example: "jakarta apache"^4 "Apache Lucene"
• By default, the boost factor is 1. Although the boost factor must be positive, it can be less than 1 (e.g. 0.2).
Source: http://lucene.apache.org/java/docs/queryparsersyntax.html

16. How ServerSide.com used boost to solve its problem
The list of fields to which boost was added, with an explanation as to why. Quoted directly from ServerSide.com: "The date boost has been really important for us. We have data that goes back for a long time, and seemed to be returning 'old reports' too often. The date-based booster trick has gotten around this, allowing for the newest content to bubble up. The end result is that we now have a nice simple design which allows us to add new sources to our index with minimal development time!"
Source: http://www.theserverside.com/tt/articles/article.tss?l=ILoveLucene

17. Scoring Algorithm

18. Scoring Algorithm
• In the typical search application, a Query is passed to the Searcher, beginning the scoring process.
• Once inside the Searcher, a Hits object is constructed, which handles the scoring and caching of the search results.
• The Hits constructor stores references to three or four important objects:
• The Weight object of the Query. The Weight object is an internal representation of the Query that allows the Query to be reused by the Searcher.
• The Searcher that initiated the call.
• A Filter for limiting the result set. Note that the Filter may be null.
• A Sort object for specifying how to sort the results if the standard score-based sort method is not desired.
(Continued on the next slide.)

19. Scoring Algorithm
Now that the Hits object has been initialized, it begins the process of identifying documents that match the query by calling its getMoreDocs method. Assuming we are not sorting (since sorting doesn't affect the raw Lucene score), we call the "expert" search method of the Searcher, passing in our Weight object, the Filter, and the number of results we want. This method returns a TopDocs object, which is an internal collection of search results.
The Searcher creates a TopDocCollector and passes it along with the Weight and Filter to another expert search method (for more on the HitCollector mechanism, see Searcher). The TopDocCollector uses a PriorityQueue to collect the top results for the search. If a Filter is being used, some initial setup is done to determine which docs to include. Otherwise, we ask the Weight for a Scorer for the IndexReader of the current searcher and we proceed by calling the score method on the Scorer.
At last, we are actually going to score some documents. The score method takes in the HitCollector (most likely the TopDocCollector) and does its business. Of course, here is where things get involved. The Scorer that is returned by the Weight object depends on what type of Query was submitted. In most real-world applications with multiple query terms, the Scorer is going to be a BooleanScorer2.
Assuming a BooleanScorer2 scorer, we first initialize the Coordinator, which is used to apply the coord() factor. We then get an internal Scorer based on the required, optional, and prohibited parts of the query. Using this internal Scorer, the BooleanScorer2 then proceeds into a while loop based on the Scorer#next() method. The next() method advances to the next document matching the query. This is an abstract method in the Scorer class and is thus overridden by all derived implementations. If you have a simple OR query, your internal Scorer is most likely a DisjunctionSumScorer, which essentially combines the scorers from the sub-scorers of the OR'd terms.

  20. Sorting search results Sorting comes at the expense of resources. More memory is needed to keep the fields used for sorting available. For numeric types, each field being sorted for each document in the index requires that four bytes be cached. For String types, each unique term is also cached for each document. Only the actual fields used for sorting are cached in this manner. We need to plan our system resources accordingly if we want to use the sorting capabilities, knowing that sorting by a String is the most expensive type in terms of resources.
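A hedged sketch of score-independent sorting (the "date" field name is an assumption; fields used for sorting must be indexed but not tokenized):

// Sort by the "date" field, newest first, instead of by relevance score.
Sort sort = new Sort(new SortField("date", SortField.STRING, true));  // true = descending
Hits hits = searcher.search(query, sort);  // same search API as before, plus the Sort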

21. Handling multiple pages of search results
• Handling multiple pages of search results is an interesting problem. We do not want to have to save and manage search results in a session variable so that page 2 can be displayed when a user clicks the next-page link.
• Fortunately, this problem is easily solved: it turns out that Lucene is fast enough to simply re-run the query on every results page and then ask for the nth page of results.
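A minimal sketch of that approach (pageSize and page are illustrative parameters):

int pageSize = 10;
int page = 2;                        // zero-based page requested by the user
Hits hits = searcher.search(query);  // re-run the query on every page request
int start = page * pageSize;
int end = Math.min(start + pageSize, hits.length());
for (int i = start; i < end; i++) {
    System.out.println(hits.doc(i).get("title"));  // render only this page's slice
}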

22. Examples of queries possible with Lucene
Once we integrate Lucene, users of our application can make searches such as:
• +George +Rice -eat -pudding
• Apple -pie +Tiger
• animal:monkey AND food:banana
• 'lucene AND doug', 'lucene AND NOT slow', or '+lucene +book'

23. Handling of various types of queries by the QueryParser

24. Abstract storage in Index
• There are two approaches here:
• We store a summary in the index, or
• We fetch the location where the searched keyword is present and display it the way Google does.
• For our requirement, the first option should work well.
• We can store the summary directly in the index, so that we do not need to hit the database in order to fetch it.
• The abstract will then be a field in the index document, just like author and title (use Field.Store.YES when creating that field), as sketched below.
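A hedged sketch of that field (names are illustrative; choose Field.Index.NO instead of TOKENIZED if the abstract should be display-only rather than searchable):

Document doc = new Document();
doc.add(new Field("title", title, Field.Store.YES, Field.Index.TOKENIZED));
doc.add(new Field("author", author, Field.Store.YES, Field.Index.UN_TOKENIZED));
// Stored in the index, so results pages never need a database round trip:
doc.add(new Field("abstract", summaryText, Field.Store.YES, Field.Index.TOKENIZED));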

25. Security
A security filter is a powerful example, allowing users to see only search results for documents they own, even if their query technically matches other documents that are off limits. An example of document filtering constrains documents with security in mind. Our example assumes documents are associated with an owner, which is known at indexing time. We index two documents; both have the term info in their keywords field, but each document has a different owner:
public class SecurityFilterTest extends TestCase {
    private RAMDirectory directory;

    protected void setUp() throws Exception {
        directory = new RAMDirectory();  // initialize the in-memory index
        IndexWriter writer = new IndexWriter(directory, new WhitespaceAnalyzer(), true);
        // Elwood
        Document document = new Document();
        document.add(Field.Keyword("owner", "elwood"));
        document.add(Field.Text("keywords", "elwoods sensitive info"));
        writer.addDocument(document);
        // Jake
        document = new Document();
        document.add(Field.Keyword("owner", "jake"));
        document.add(Field.Text("keywords", "jakes sensitive info"));
        writer.addDocument(document);
        writer.close();
    }
}
Source: p. 211 of Lucene in Action

26. Security
Suppose, though, that Jake is using the search feature in our application, and only documents he owns should be searchable by him. Quite elegantly, we can use a QueryFilter to constrain the search space to only the documents he owns, as shown in listing 5.7 (JUnit calls setUp for us, so the directory is already populated):
public void testSecurityFilter() throws Exception {
    TermQuery query = new TermQuery(new Term("keywords", "info"));
    IndexSearcher searcher = new IndexSearcher(directory);
    Hits hits = searcher.search(query);
    assertEquals("Both documents match", 2, hits.length());
    QueryFilter jakeFilter = new QueryFilter(
        new TermQuery(new Term("owner", "jake")));
    hits = searcher.search(query, jakeFilter);
    assertEquals(1, hits.length());
    assertEquals("elwood is safe",
        "jakes sensitive info", hits.doc(0).get("keywords"));
}
To use this approach we will have a field in the index called owner.

27. Security
You can constrain a query to a subset of documents another way: by combining the constraining query with the original query as a required clause of a BooleanQuery. There are a couple of important differences, despite the fact that the same documents are returned by both. QueryFilter caches the set of allowed documents, probably speeding up successive searches using the same instance. In addition, the normalized Hits scores are unlikely to be the same. The score difference makes sense when you're looking at the scoring formula (see section 3.3, page 78 of Lucene in Action): the IDF factor may be dramatically different. When you're using BooleanQuery aggregation, all documents containing the terms are factored into the equation, whereas a filter reduces the documents under consideration and impacts the inverse document frequency factor.
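The BooleanQuery variant of the same constraint, as a hedged sketch using the add(Query, required, prohibited) method from slide 8:

BooleanQuery constrained = new BooleanQuery();
constrained.add(new TermQuery(new Term("keywords", "info")), true, false);  // original query, required
constrained.add(new TermQuery(new Term("owner", "jake")), true, false);     // security constraint, required
Hits hits = searcher.search(constrained);  // same documents, but different score normalization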

28. Composition of Segments in Lucene Index
Each segment index maintains the following:
• Field names. The set of field names used in the index.
• Stored field values. For each document, a list of attribute-value pairs, where the attributes are field names. These are used to store auxiliary information about the document, such as its title, URL, or an identifier used to access a database. The set of stored fields is what is returned for each hit when searching. This is keyed by document number.
• Term dictionary. A dictionary containing all of the terms used in all of the indexed fields of all of the documents. The dictionary also contains the number of documents which contain each term, and pointers to the term's frequency and proximity data.
• Term frequency data. For each term in the dictionary, the numbers of all the documents that contain that term, and the frequency of the term in each such document.
• Term proximity data. For each term in the dictionary, the positions at which the term occurs in each document.
• Normalization factors. For each field in each document, a value that is multiplied into the score for hits on that field.
• Term vectors. For each field in each document, the term vector (sometimes called document vector) may be stored. A term vector consists of term text and term frequency. To add term vectors to your index, see the Field constructors (example below).
• Deleted documents. An optional file indicating which documents are deleted.
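As a hedged illustration of the last point about term vectors, the Lucene 1.9/2.x-style Field constructor takes a TermVector argument (field name and text are invented):

// Analyzed and indexed, not stored, with term vectors recorded per document:
doc.add(new Field("contents", bodyText,
        Field.Store.NO, Field.Index.TOKENIZED, Field.TermVector.YES));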

29. Debugging the Lucene indexing process
We can get Lucene to output information about its indexing operations by setting IndexWriter's public instance variable infoStream to one of the OutputStreams, such as System.out:
IndexWriter writer = new IndexWriter(dir, new SimpleAnalyzer(), true);
writer.infoStream = System.out;

30. Lucene in Alfresco
There are three possible approaches we can follow:
1) Let Alfresco do the indexing, use its implementation of the search, take the search results it returns, and load them into our page.
2) Let Alfresco do the indexing and directly access its indexes to get query results.
3) Let Alfresco do only the content management, while we take care of both the indexing and the searching.

31. Advantages of using Alfresco-created Lucene indexes
• Alfresco gives the option of customizing the Analyzer to our needs.
• It gives control, through an XML configuration, over what content should be indexed and what should not.
• It retains the document structure as it is present in Alfresco, comprising nodes, and stores it in the index.
• Even if we later wish to change the CMS, we can create the indexes based on the template created for the modules other than the library.
• Since Alfresco already indexes documents internally once, it would be an overhead to repeat the process.
• It also saves us the overhead of parsing the various document types again.
• Another possible problem if we don't use Alfresco's indexes is that Alfresco changes the extension of documents before it stores them. This might create a problem.

32. Data dictionary options in Alfresco
• The indexing behavior of each property can be set in the data dictionary.
• The default is that properties are indexed, indexed atomically, that the property value is not stored in the index, and that the property is tokenised when it is indexed.
• Enabled - if this is false, there will be no entry for this property in the index.
• Atomic - if this is true, the property is indexed in the transaction; if not, the property is indexed in the background.
• Stored - if true, the property value is stored in the index and may be obtained via the Lucene low-level query API.
• Tokenised - if true, the string value of the property is tokenised before indexing; if false, it is indexed "as is" as a single string.

  33. Alfresco Repository Architecture

34. Lucene Index Structure in Alfresco
• Alfresco, it seems, has a different folder for storing the index of each node (or space).
• Files in an index folder: .cfs file, segments file, deletable file, IndexInfoDeletions, .del file.
• There is one .cfs file (compound index file) per segment.
• The segments file stores the names of all existing index segments. Before accessing any files in the index directory, Lucene consults this file to figure out which index files to open and read.
• Alfresco also has a folder called "backup-lucene-indexes" where it periodically stores a copy of the Lucene index to handle crashes. We could have a similar system for modules other than the library.
• Deletable file: prior to Lucene 2.1 there was a file "deletable" that contained details about files that need to be deleted. As of 2.1, a writer dynamically computes the files that are deletable instead, so no such file is written.
• Lock file: a write lock is used to indicate that another process is writing to the index.
• IndexInfoBackup and IndexInfo: these two files are placed outside the node folders.
Source: http://lucene.apache.org/java/docs/fileformats.html

35. Why do we sometimes have redundant data in Index and Database
The database can redundantly keep some of the information that can be found in the Lucene index for two specific reasons:
■ Failure recovery - if the index somehow becomes corrupted (for example, through disk failure), it can easily and quickly be rebuilt from the data stored in the database without any information loss. This is further leveraged by the fact that the database can reside on a different machine.
■ Access speed - each document is marked with a unique identifier (the identifier is the primary key of a document in the database). So, in the case that the application needs to access a certain document by a given identifier, the database can return it more efficiently than Lucene could. If we employed Lucene here, it would have to search its whole index for the document with the identifier stored in one of the document's fields.

36. A possible approach to improve hit relevancy in Alfresco
If we are unable to get access to Alfresco's indexing and scoring process, then we could possibly add boosts to the query itself. It is not yet confirmed whether this will work at all and, if it works, whether it will work fast enough.
"Title:Lucene"^4 OR "Keywords:Lucene"^3 OR "Contents:Lucene"^1

37. Caching mechanism
Lucene has an internal caching mechanism for filters. Lucene does come with a simple cache mechanism if you use Lucene filters; the classes to look at are CachingWrapperFilter and QueryFilter. For example, let's say we wanted to let users search JUST the last 30 days' worth of content. We could run the filter ONCE and then cache it with the term clause used to run the query. Then we could just use the same filter again for every user until we have to optimize() the index again. As long as the document numbers stay the same, we don't have much more to do. But this will probably not be of much use to us, since we will need to optimize the index often.
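A hedged sketch of that idea (RangeFilter and a "date" field in a lexicographically sortable yyyyMMdd format are assumptions; the date values are placeholders):

// Build the filter once and wrap it so its bit set is cached; the cache
// stays valid until the index is reopened or optimized.
Filter last30Days = new CachingWrapperFilter(
    new RangeFilter("date", "20070101", "20070131", true, true));
// Reuse the same Filter instance for every user's search:
Hits hits = searcher.search(query, last30Days);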

38. Caching mechanism
(The original slide is a flow diagram; its flow is roughly as follows.)
• The top searched keywords are obtained from the logs.
• Those top keywords are searched for in the Lucene index beforehand, and their results are cached.
• When a query arrives, the searcher checks whether the query term matches one of the cached top keywords.
• If it matches, the results are fetched from the cache; if not, the search goes to the Lucene index.
• A cache expiring and refreshing mechanism (including regular updating of the top-keywords list) keeps the cached results current before they reach the UI.

39. Experience of a Lucene implementation @ webshots.com
Question: "I gave ParallelMultiSearcher a try and it was significantly slower than simply iterating through the indexes one at a time. Our new plan is to somehow have only one index per search machine and a larger main index stored on the master. What I'm interested to know is whether having one extremely large index for the master and then splitting the index into several smaller indexes (if this is possible) would be better than having several smaller indexes and merging them on the search machines into one index. I would also be interested to know how others have divided up search work across a cluster."
Answer: "I'm responsible for the webshots.com search index and we've had very good results with Lucene. It currently indexes over 100 million documents and performs 4 million searches a day. We initially tested running multiple small copies and using a MultiSearcher and then merging results, as compared to running a very large single index. We actually found that the single large instance performed better. To improve load handling we clustered multiple identical copies together, then session-bind a user to a particular server and cache the results, but each server is running a single index. Our index is currently about 40 GB. The advantage of binding a user is that once a search is performed, caching within Lucene and in the application is very effective if subsequent searches go back to the same box. Our initial searches are usually in the sub-100 ms range, while subsequent requests for deeper pages in the search are returned instantly."

40. Example of mail message indexing
• Each message is automatically indexed as it enters the system. Each mailbox has an index file associated with it. The tokenizing and indexing process is not configurable by administrators or users.
• The process is as follows:
• The Zimbra MTA routes the incoming email to the Zimbra mailbox server that contains the account's mailbox.
• The mailbox server parses the message, including the header, the body, and all readable file attachments such as PDF files or Microsoft Word documents, in order to tokenize the words.
• The mailbox server passes the tokenized information to Lucene to create the index files.
• Note on tokenization: the method for indexing is by each word. Certain common patterns, such as phone numbers, email addresses, and domain names, are tokenized specially.
Source: http://wiki.zimbra.com/index.php?title=Zimbra_Server

41. Good Articles on Lucene
http://www.theserverside.com/tt/articles/article.tss?l=ILoveLucene
http://www.javaworld.com/javaworld/jw-09-2000/jw-0915-lucene.html?page=1
http://technology.amis.nl/blog/?p=1288
http://powerbuilder.sys-con.com/read/42488.htm
http://www-128.ibm.com/developerworks/library/wa-lucene2/
Spell checking: http://today.java.net/pub/a/today/2005/08/09/didyoumean.html
Lucene integration with Hibernate: http://www.hibernate.org/hib_docs/search/reference/en/html_single/
Lucene with Spring (it talks about Spring Modules): http://technology.amis.nl/blog/?p=1248
