Wednesday, October 20, 2010

Lucene Index Update


Scenario:
I am using Lucene as the IR library and am facing an issue with one of its limitations.

Limitation of Lucene:
Lucene doesn't allow you to modify an index in place once you have created it. If you want to modify it, you have to delete and reindex, which is not feasible if you have a huge index.

Issue:
An index has n documents. I am unable to update a document in place, because documents can get new document IDs even if I delete a single document from the index. So I want to update the document, and in turn the index, BUT without changing the document IDs of the documents in the first index.

For beginners: Imagine you have a table of rows and columns and are unable to change the value in one of the cells after you have created the table. So the value you give each cell at the beginning is final. I am currently looking for a workaround to update that particular cell.

Solution:
Hopefully, ParallelReader

Goal:
To provide a workaround for updating an index using ParallelReader.

Steps:
  1. Your original index, index1, has n documents. This will be the static index.
  2. Create a new document for each document in index1 that you want to update. You will add these documents to a new index, index2, which will be the dynamic index.
  3. Create index2 with n documents, the same number as index1 (ParallelReader requires every index it reads to have the same number of documents). All the documents are empty except the ones you want to update. The documents are added in the same order as in index1, so that the document IDs line up.
  4. You make changes to the dynamic index all the time, as reindexing is cheap there because it doesn't have many entries (most of them are empty). There will be a threshold after which reindexing it is no longer efficient; we will discuss that later.
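
To make the layout in steps 2 and 3 concrete, here is a minimal sketch in plain Java. It is only a toy model: documents are modelled as field-to-value maps rather than Lucene Document objects, and the class name DynamicIndexLayout and the method build are made up for this illustration. It shows how the dynamic index keeps one slot per document of index1, in the same order, with empty placeholders for documents that are not being updated.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

/** Toy model of the dynamic index layout; plain collections, not Lucene classes. */
final class DynamicIndexLayout {

    /**
     * Returns n documents in docId order. A document that has an entry in
     * 'updates' carries only its updated fields; every other slot is an
     * empty placeholder, so positions match the static index.
     */
    static List<Map<String, String>> build(int n, Map<Integer, Map<String, String>> updates) {
        List<Map<String, String>> docs = new ArrayList<>(n);
        for (int docId = 0; docId < n; docId++) {
            Map<String, String> updated = updates.get(docId);
            docs.add(updated != null ? updated : Collections.<String, String>emptyMap());
        }
        return docs;
    }

    public static void main(String[] args) {
        // Update Field4 of the document with docId 1; docId 0 stays untouched.
        Map<Integer, Map<String, String>> updates =
                Collections.singletonMap(1, Collections.singletonMap("Field4", "Modified Value4"));
        System.out.println(build(2, updates)); // prints [{}, {Field4=Modified Value4}]
    }
}
```

In a real setup each map would become a Lucene document written to index2, with the empty maps written as documents that have no fields.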




INDEX 1

/* Creating a document with fields: Field1, Field2 and Values: "Value1", and "Value2" */

Document doc1 = new Document();
addField(doc1, "Field1", "Value1");
addField(doc1, "Field2", "Value2");

/* Creating a document with fields: Field3, Field4 and Values: "Value3", and "Original Value4". */

Document doc2 = new Document();
addField(doc2, "Field3", "Value3");
addField(doc2, "Field4", "Original Value4");

INDEX 2

/* Creating a document with no fields as I don't want to update this document */
Document doc1 = new Document();

/* Creating a document with fields: Field4 and Values: "Modified Value4" as I want to update Field4. */

Document doc2 = new Document();
addField(doc2, "Field4", "Modified Value4");


PARALLEL READER

ParallelReader pr = new ParallelReader();
/* Adding both indexes to the parallel reader, in order of latest to oldest. */
pr.add(IndexReader.open(dir2));
pr.add(IndexReader.open(dir1));

Here dir2 and dir1 are the directories for index2 and index1 respectively.
index2 is processed first by the parallel reader because I added dir2 before dir1. Below are the queries I tried and the results I received.

What is a query here?
A query contains a term or a set of terms. A term is a combination of a field and its value. For example, Field4:Value4 is a term.

What is a result here?
The result is the set of documents containing that particular term. For example, Field1:Value1 produces doc ID 0 as the result, since the term exists in Document 1. This reminds me to tell you that document IDs start from 0, like the first position in an array.

Query: (Field4:Modified value4) AND (Field4:Original Value4)
Result: No matches were found for "(Field4:Modified value4) AND (Field4:Original Value4)"


Query: (Field4:Modified value4) OR (Field4:Original Value4)
Result: Hits for "(Field4:Modified value4) OR (Field4:Original Value4)" were found in quotes by:

-----------------------------------------------------------
docId: 1 docScore: 0.13013145
Field4:stored/uncompressed,indexed,tokenized
-----------------------------------------------------------


Query: (Field4:Modified value4) AND NOT (Field1:Value1)
Result: Hits for "(Field4:Modified value4) AND NOT (Field1:Value1)" were found in quotes by:

-----------------------------------------------------------
docId: 1 docScore: 0.38393006
Field4:stored/uncompressed,indexed,tokenized
-----------------------------------------------------------


Query: Field4:Original value4
Result: No matches were found for "(Field4:Original value4)"


Conclusion:
The results suggest that the parallel reader looks up a field in the latest index first (the one added first) and uses whatever it finds there. Thus, if you update a field (e.g. Field4) in index2, the value in index2 is taken as the new value of Field4.
Now even if you search for (Field4:Original Value4), the reader first consults index2, where it finds a field named Field4, and compares "Original Value4" against the value of Field4 in index2. That comparison is a miss, so the search returns without consulting index1.
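
The behaviour described above can be sketched as a small plain-Java model. This is not Lucene code: the class ParallelFieldModel and its methods are hypothetical, documents are simple field-to-value maps, and values are matched as exact strings instead of being tokenized. The rule it assumes, that a field is answered only by the first reader added that contains it, is my reading of the results above and of the ParallelReader documentation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model of "the first reader that has the field answers for it". */
final class ParallelFieldModel {

    /** Returns the docIds whose value for 'field' equals 'value' (exact match). */
    static List<Integer> search(List<List<Map<String, String>>> readers, String field, String value) {
        for (List<Map<String, String>> reader : readers) {
            boolean hasField = false;
            for (Map<String, String> doc : reader) {
                if (doc.containsKey(field)) {
                    hasField = true;
                    break;
                }
            }
            if (!hasField) {
                continue; // this reader knows nothing about the field, try the next one
            }
            // The first reader containing the field owns it; later readers are ignored.
            List<Integer> hits = new ArrayList<>();
            for (int docId = 0; docId < reader.size(); docId++) {
                if (value.equals(reader.get(docId).get(field))) {
                    hits.add(docId);
                }
            }
            return hits;
        }
        return Collections.emptyList();
    }

    /** The two indexes from this post, listed latest first (index2, then index1). */
    static List<List<Map<String, String>>> example() {
        Map<String, String> doc1 = new HashMap<>();
        doc1.put("Field1", "Value1");
        doc1.put("Field2", "Value2");
        Map<String, String> doc2 = new HashMap<>();
        doc2.put("Field3", "Value3");
        doc2.put("Field4", "Original Value4");
        List<Map<String, String>> index1 = Arrays.asList(doc1, doc2);
        List<Map<String, String>> index2 = Arrays.asList(
                Collections.<String, String>emptyMap(),
                Collections.singletonMap("Field4", "Modified Value4"));
        return Arrays.asList(index2, index1);
    }

    public static void main(String[] args) {
        List<List<Map<String, String>>> readers = example();
        System.out.println(search(readers, "Field4", "Modified Value4")); // prints [1]
        System.out.println(search(readers, "Field4", "Original Value4")); // prints []
        System.out.println(search(readers, "Field1", "Value1"));          // prints [0]
    }
}
```

Field1 only exists in index1, so that query falls through to the older index, while Field4 is answered entirely by index2, which is why the original value can no longer be found.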
