Let's Extend that Dang ElasticSearch Plugin
ElasticSearch is still awesome, but I had to write some more Java.
A little over half a year ago, David wrote an article about a dang ElasticSearch plugin that he wrote for a project we were working on. As David stated, the project involved "a complex interactive query builder to search a large collection of news items." The plugin he wrote for it allowed users to "filter for articles that contain a term a minimum number of times." Well, the time has come to add to that functionality. Now we want to allow users to filter for articles that contain a multi-term phrase a minimum number of times.
While it took me a while to come up with a solution to this problem, I managed to get something working, and I'll walk through the code in case anyone else ever needs to read or implement this feature down the line. We'll start by looking at the plugin before I changed anything.
package com.projectname.containsmultiple;

import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.index.PostingsEnum;
import org.apache.lucene.index.Term;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.plugins.Plugin;
import org.elasticsearch.plugins.ScriptPlugin;
import org.elasticsearch.script.FilterScript;
import org.elasticsearch.script.FilterScript.LeafFactory;
import org.elasticsearch.script.ScriptContext;
import org.elasticsearch.script.ScriptEngine;
import org.elasticsearch.script.ScriptFactory;
import org.elasticsearch.search.lookup.SearchLookup;

import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Collection;
import java.util.Map;
import java.util.Set;

/**
 * A script for finding documents that match a term a certain number of times
 */
public class ContainsMultiplePlugin extends Plugin implements ScriptPlugin {

    @Override
    public ScriptEngine getScriptEngine(
        Settings settings,
        Collection<ScriptContext<?>> contexts
    ) {
        return new ContainsMultipleEngine();
    }

    // tag::contains_multiple
    private static class ContainsMultipleEngine implements ScriptEngine {
        @Override
        public String getType() {
            return "expert_scripts";
        }

        @Override
        public <T> T compile(
            String scriptName,
            String scriptSource,
            ScriptContext<T> context,
            Map<String, String> params
        ) {
            if (context.equals(FilterScript.CONTEXT) == false) {
                throw new IllegalArgumentException(getType()
                        + " scripts cannot be used for context ["
                        + context.name + "]");
            }
            // we use the script "source" as the script identifier
            if ("contains_multiple".equals(scriptSource)) {
                FilterScript.Factory factory = new ContainsMultipleFactory();
                return context.factoryClazz.cast(factory);
            }
            throw new IllegalArgumentException("Unknown script name "
                    + scriptSource);
        }

        @Override
        public void close() {
            // optionally close resources
        }

        @Override
        public Set<ScriptContext<?>> getSupportedContexts() {
            return Set.of(FilterScript.CONTEXT);
        }

        private static class ContainsMultipleFactory implements FilterScript.Factory,
                ScriptFactory {
            @Override
            public boolean isResultDeterministic() {
                return true;
            }

            @Override
            public LeafFactory newFactory(
                Map<String, Object> params,
                SearchLookup lookup
            ) {
                return new ContainsMultipleLeafFactory(params, lookup);
            }
        }

        private static class ContainsMultipleLeafFactory implements LeafFactory {
            private final Map<String, Object> params;
            private final SearchLookup lookup;
            private final String field;
            private final String term;
            private final int count;

            private ContainsMultipleLeafFactory(
                    Map<String, Object> params, SearchLookup lookup) {
                if (params.containsKey("field") == false) {
                    throw new IllegalArgumentException(
                            "Missing parameter [field]");
                }
                if (params.containsKey("term") == false) {
                    throw new IllegalArgumentException(
                            "Missing parameter [term]");
                }
                if (params.containsKey("count") == false) {
                    throw new IllegalArgumentException(
                            "Missing parameter [count]");
                }
                this.params = params;
                this.lookup = lookup;
                field = params.get("field").toString();
                term = params.get("term").toString();
                count = Integer.parseInt(params.get("count").toString());
            }

            @Override
            public FilterScript newInstance(LeafReaderContext context)
                    throws IOException {
                PostingsEnum postings = context.reader().postings(
                        new Term(field, term));
                if (postings == null) {
                    /*
                     * the field and/or term do not exist in this segment,
                     * so always return 0
                     */
                    return new FilterScript(params, lookup, context) {
                        @Override
                        public boolean execute() {
                            return false;
                        }
                    };
                }
                return new FilterScript(params, lookup, context) {
                    int currentDocid = -1;

                    @Override
                    public void setDocument(int docid) {
                        /*
                         * advance has undefined behavior when called with
                         * a docid <= its current docid
                         */
                        if (postings.docID() < docid) {
                            try {
                                postings.advance(docid);
                            } catch (IOException e) {
                                throw new UncheckedIOException(e);
                            }
                        }
                        currentDocid = docid;
                    }

                    @Override
                    public boolean execute() {
                        if (postings.docID() != currentDocid) {
                            /*
                             * advance moved past the current doc, so this
                             * doc has no occurrences of the term
                             */
                            return false;
                        }
                        try {
                            return postings.freq() >= count;
                        } catch (IOException e) {
                            throw new UncheckedIOException(e);
                        }
                    }
                };
            }
        }
    }
    // end::contains_multiple
}
All of my changes will come in the newInstance method that we see towards the bottom of the above code block.

I honestly just sort of stumbled upon the solution you are about to see while reading through the Apache Lucene docs, trying to figure out any way I could do this. I noticed that you could pass the PostingsEnum.POSITIONS flag to the postings method to get the positions of the term within each document along with the frequency, and I realized that I could accomplish the task with that. It may not be the cleanest solution, but it was a working solution, which was the goal at the time. If you have a better solution, I would be very curious to hear it. Either way, here is the solution I came up with in the newInstance method.
@Override
public FilterScript newInstance(LeafReaderContext context)
        throws IOException {
    /*
     * create an array of PostingsEnums,
     * one for each word in the phrase
     */
    String[] terms = term.split("\\s+");
    if (terms.length == 0) {
        /*
         * the term did not have any words to search for,
         * so always return 0
         */
        return new FilterScript(params, lookup, context) {
            @Override
            public boolean execute() {
                return false;
            }
        };
    }
    PostingsEnum[] postingsArr = new PostingsEnum[terms.length];
    for (int i = 0; i < terms.length; i++) {
        String currTerm = terms[i];
        /*
         * use PostingsEnum.POSITIONS to get the positions
         * of each occurrence of the word in the item
         */
        PostingsEnum postings = context.reader().postings(
                new Term(field, currTerm), PostingsEnum.POSITIONS);
        if (postings == null) {
            /*
             * the field and/or term do not exist in this segment,
             * so always return 0
             */
            return new FilterScript(params, lookup, context) {
                @Override
                public boolean execute() {
                    return false;
                }
            };
        }
        postingsArr[i] = postings;
    }
    return new FilterScript(params, lookup, context) {
        int currentDocid = -1;

        @Override
        public void setDocument(int docid) {
            /*
             * advance has undefined behavior when called with
             * a docid <= its current docid
             */
            for (PostingsEnum postings : postingsArr) {
                /*
                 * advance each of the PostingsEnums in the array to
                 * the current docid
                 */
                if (postings.docID() < docid) {
                    try {
                        postings.advance(docid);
                    } catch (IOException e) {
                        throw new UncheckedIOException(e);
                    }
                }
            }
            currentDocid = docid;
        }

        @Override
        public boolean execute() {
            if (postingsArr.length == 1) {
                /*
                 * if there is only 1 PostingsEnum, just check freq
                 */
                PostingsEnum postings = postingsArr[0];
                if (postings.docID() != currentDocid) {
                    /*
                     * advance moved past the current doc, so this
                     * doc has no occurrences of the term
                     */
                    return false;
                }
                try {
                    return postings.freq() >= count;
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            }
            /*
             * Create a map whose keys are the positions in the
             * item that hold the first word of the phrase, and whose
             * values are the index of the PostingsEnum that the phrase
             * is still correct through. This lets us see every point
             * where the phrase could potentially begin, and how far
             * the phrase stays correct after that initial word.
             */
            HashMap<Integer, Integer> phraseTesterMap = new HashMap<>();
            for (int i = 0; i < postingsArr.length; i++) {
                PostingsEnum postings = postingsArr[i];
                if (postings.docID() != currentDocid) {
                    /*
                     * advance moved past the current doc, so this
                     * doc has no occurrences of the term
                     */
                    return false;
                }
                int freq = 0;
                try {
                    freq = postings.freq();
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
                if (freq < count) {
                    /*
                     * if this word does not appear count times, then
                     * it is impossible for the phrase to appear count
                     * times, so return false
                     */
                    return false;
                }
                // the count of phrases that are still valid at this
                // iteration of the loop
                int stillValidPhrases = 0;
                for (int j = 0; j < freq; j++) {
                    /*
                     * get the position of the next occurrence of the word
                     */
                    int pos = 0;
                    try {
                        pos = postings.nextPosition();
                    } catch (IOException e) {
                        throw new UncheckedIOException(e);
                    }
                    if (i == 0) {
                        /*
                         * for the first word, we just need to store
                         * the position in our map
                         */
                        phraseTesterMap.put(pos, i);
                        stillValidPhrases++;
                    } else if (phraseTesterMap.containsKey(pos - i)
                            && phraseTesterMap.get(pos - i).intValue() == i - 1) {
                        /*
                         * for the following words, we check that the
                         * phrase started exactly i words in front of this
                         * word by checking that the map contains the key
                         * of our current position - i. We also need to
                         * check that the phrase is not already broken, by
                         * confirming the value in the map is i - 1, meaning
                         * the phrase was still correct as of the last
                         * PostingsEnum.
                         */
                        phraseTesterMap.put(pos - i, i);
                        stillValidPhrases++;
                    }
                }
                if (stillValidPhrases < count) {
                    return false;
                }
            }
            return true;
        }
    };
}
Even with the inline comments, this code is tough to parse, so I'll walk through it one segment at a time. (One note before we start: the new code uses a HashMap, so java.util.HashMap needs to be added to the plugin's imports.)
1. Splitting the Phrase
/*
 * create an array of PostingsEnums,
 * one for each word in the phrase
 */
String[] terms = term.split("\\s+");
if (terms.length == 0) {
    /*
     * the term did not have any words to search for,
     * so always return 0
     */
    return new FilterScript(params, lookup, context) {
        @Override
        public boolean execute() {
            return false;
        }
    };
}
First, we need to take the term that is already being passed into newInstance and split it into the terms we are going to search by, since we are now passing in a phrase instead of a single term. Then, if there are no valid terms to search by, we just return a FilterScript with an execute method that always returns false: if the user does not provide a valid search, we want to return no articles.
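As a quick aside, Java's split has a couple of edge cases worth knowing about here. This small standalone snippet (my own, not part of the plugin) demonstrates how the length == 0 guard behaves for different inputs:

```java
public class SplitBehavior {
    public static void main(String[] args) {
        // a normal phrase splits into its individual words
        if ("Veni Vidi Vici".split("\\s+").length != 3) throw new AssertionError();

        // a whitespace-only phrase produces an empty array,
        // which is exactly what the length == 0 guard catches
        if ("   ".split("\\s+").length != 0) throw new AssertionError();

        // but an empty string yields a single empty term, and a phrase
        // with leading whitespace yields an empty first term; both slip
        // past the guard (they still match nothing, since no postings
        // exist for an empty term)
        if (!"".split("\\s+")[0].isEmpty()) throw new AssertionError();
        if (!" Veni Vidi".split("\\s+")[0].isEmpty()) throw new AssertionError();

        System.out.println("all split cases behave as described");
    }
}
```

In practice it may be worth trimming the incoming phrase before splitting, so that leading whitespace cannot produce an empty term.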
2. Getting the Positions & Frequency
PostingsEnum[] postingsArr = new PostingsEnum[terms.length];
for (int i = 0; i < terms.length; i++) {
    String currTerm = terms[i];
    /*
     * use PostingsEnum.POSITIONS to get the positions
     * of each occurrence of the word in the item
     */
    PostingsEnum postings = context.reader().postings(
            new Term(field, currTerm), PostingsEnum.POSITIONS);
    if (postings == null) {
        /*
         * the field and/or term don't exist in this segment,
         * so always return 0
         */
        return new FilterScript(params, lookup, context) {
            @Override
            public boolean execute() {
                return false;
            }
        };
    }
    postingsArr[i] = postings;
}
Now, for each term in the phrase, we need to get the frequency with which it appears in each document, as well as each position in the document where it appears. We do so by looping through the terms array we created in the previous step and calling postings for each of them. This time, though, we pass the PostingsEnum.POSITIONS flag, which allows us to call the nextPosition method on the returned PostingsEnum to loop through all the positions of the term in the document; we will need those positions down the line. Just like in the original plugin, we still always return false if any of the postings are null. As we go, we add each of these postings into an array for later analysis.
3. Setting the Document
int currentDocid = -1;

@Override
public void setDocument(int docid) {
    /*
     * advance has undefined behavior when called with
     * a docid <= its current docid
     */
    for (PostingsEnum postings : postingsArr) {
        /*
         * advance each of the PostingsEnums in the array to
         * the current docid
         */
        if (postings.docID() < docid) {
            try {
                postings.advance(docid);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
    }
    currentDocid = docid;
}
Note that the above code is in the FilterScript that is returned if you make it past the conditions described in the first two steps.

This step is pretty close to the original plugin. The difference is that instead of advancing to the docid for one postings object, we are now updating the docid for each postings object in postingsArr.
4. Handling the Original Single-Term Case
if (postingsArr.length == 1) {
    /*
     * if there is only 1 PostingsEnum, just check freq
     */
    PostingsEnum postings = postingsArr[0];
    if (postings.docID() != currentDocid) {
        /*
         * advance moved past the current doc, so this
         * doc has no occurrences of the term
         */
        return false;
    }
    try {
        return postings.freq() >= count;
    } catch (IOException e) {
        throw new UncheckedIOException(e);
    }
}
Another note: from here on, the code you see lives inside the execute method of the returned FilterScript.

We first handle the case where there was only one term, and therefore only one postings object. We do so by getting that one postings object and doing exactly what the original plugin did: we take the frequency of this single term in the article, retrieved with the freq method, and compare it to count, the total number of times we need the phrase to appear in the document. We also still check that postings.docID() equals currentDocid as confirmation that we are checking the correct document.
5. Handling a Multi-Word Phrase
/*
 * Create a map whose keys are the positions in the
 * item that hold the first word of the phrase, and whose
 * values are the index of the PostingsEnum that the phrase
 * is still correct through. This lets us see every point
 * where the phrase could potentially begin, and how far
 * the phrase stays correct after that initial word.
 */
HashMap<Integer, Integer> phraseTesterMap = new HashMap<>();
for (int i = 0; i < postingsArr.length; i++) {
    PostingsEnum postings = postingsArr[i];
    if (postings.docID() != currentDocid) {
        /*
         * advance moved past the current doc, so this
         * doc has no occurrences of the term
         */
        return false;
    }
    int freq = 0;
    try {
        freq = postings.freq();
    } catch (IOException e) {
        throw new UncheckedIOException(e);
    }
    if (freq < count) {
        /*
         * if this word does not appear count times, then
         * it is impossible for the phrase to appear count
         * times, so return false
         */
        return false;
    }
    // the count of phrases that are still valid at this
    // iteration of the loop
    int stillValidPhrases = 0;
    for (int j = 0; j < freq; j++) {
        /*
         * get the position of the next occurrence of the word
         */
        int pos = 0;
        try {
            pos = postings.nextPosition();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        if (i == 0) {
            /*
             * for the first word, we just need to store
             * the position in our map
             */
            phraseTesterMap.put(pos, i);
            stillValidPhrases++;
        } else if (phraseTesterMap.containsKey(pos - i)
                && phraseTesterMap.get(pos - i).intValue() == i - 1) {
            /*
             * for the following words, we check that the
             * phrase started exactly i words in front of this
             * word by checking that the map contains the key
             * of our current position - i. We also need to
             * check that the phrase is not already broken, by
             * confirming the value in the map is i - 1, meaning
             * the phrase was still correct as of the last
             * PostingsEnum.
             */
            phraseTesterMap.put(pos - i, i);
            stillValidPhrases++;
        }
    }
    if (stillValidPhrases < count) {
        return false;
    }
}
return true;
If you have followed along fine up to this point, good job, but this is where the code gets really confusing. We start by creating the phraseTesterMap, which maps one Integer to another Integer. The map keys represent each position of the first term of the phrase in the document. Let's look at an example to make that clearer.
Julius Caesar was quoted as saying "Veni Vidi Vici". He did not say "Vici Vidi Veni". He also did not say "Veni Vidi". "Veni Vidi Vici" means "I came, I saw, I conquered."
If we were looking for the phrase "Veni Vidi Vici" in the above sentences, which will represent our document, the keys to the map would be 6, 15, 21, and 23. Term positions are zero-based, so this means "Veni" appears at positions 6, 15, 21, and 23 of the document; in other words, "Veni" is the 7th, 16th, 22nd, and 24th word in the sentences.
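If you want to see where those numbers come from, here is a small throwaway snippet (my own, not part of the plugin) that tokenizes the example document the way a simple whitespace-and-punctuation analyzer might. The real positions depend on the analyzer configured for the field, so treat this as an approximation:

```java
import java.util.ArrayList;
import java.util.List;

public class PositionDemo {
    // collects the zero-based word positions of a word in a document,
    // stripping punctuation the way a simple analyzer might
    static List<Integer> positionsOf(String doc, String word) {
        String[] tokens = doc.replaceAll("[^A-Za-z ]", " ").trim().split("\\s+");
        List<Integer> out = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++) {
            if (tokens[i].equals(word)) {
                out.add(i);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String doc = "Julius Caesar was quoted as saying \"Veni Vidi Vici\". "
                + "He did not say \"Vici Vidi Veni\". He also did not say \"Veni Vidi\". "
                + "\"Veni Vidi Vici\" means \"I came, I saw, I conquered.\"";
        System.out.println(positionsOf(doc, "Veni")); // [6, 15, 21, 23]
    }
}
```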
The values of the phraseTesterMap are even more confusing. They represent the index of the term in the phrase up to which the phrase does not fail. After looping through all of the postings, our map will look like this:
Key | Value |
---|---|
6 | 2 |
15 | 0 |
21 | 1 |
23 | 2 |
To better understand this, we will go back to the terms array created in step 1. For this example, that array would be ["Veni", "Vidi", "Vici"]. Going through the map, we see that the phrase started at position 6 matches the passed-in phrase all the way through the word "Vici", which is at index 2 in the array, so the value in the map is 2. Similarly, the phrase started at position 15 only matches through the first word, "Veni", which is at index 0 in the array, and the phrase started at position 21 matches through the second word, "Vidi", or index 1.

With this map, we know how many times the phrase matches by checking how many of the values in phraseTesterMap are equal to the length of the terms array minus 1, i.e. the index of the last term in the array. This works because a value of 2 in this example indicates that the phrase was matching all the way through "Vici", so the text contains the whole phrase "Veni Vidi Vici".
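That counting rule is small enough to show on its own. This snippet (names are mine, not from the plugin) applies it to the final map from the example:

```java
import java.util.HashMap;
import java.util.Map;

public class CompletedPhraseCount {
    // counts map entries whose value equals the index of the last term,
    // i.e. candidate phrases that matched all the way to the end
    static long completedPhrases(Map<Integer, Integer> phraseTesterMap, int termCount) {
        int lastIndex = termCount - 1;
        return phraseTesterMap.values().stream()
                .filter(v -> v == lastIndex)
                .count();
    }

    public static void main(String[] args) {
        // the final phraseTesterMap from the "Veni Vidi Vici" example
        Map<Integer, Integer> phraseTesterMap = new HashMap<>();
        phraseTesterMap.put(6, 2);
        phraseTesterMap.put(15, 0);
        phraseTesterMap.put(21, 1);
        phraseTesterMap.put(23, 2);

        // 3 terms in ["Veni", "Vidi", "Vici"], so we count values equal to 2
        System.out.println(completedPhrases(phraseTesterMap, 3)); // 2
    }
}
```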
But, you might ask, how exactly is this map constructed? We start by looping through postingsArr. Remember that each postings object represents one term from the terms array, in order.
First, we confirm that the term represented by the postings object is actually in the current document; if it is not present at all, we return false, because the phrase cannot be present if one of its terms is missing. Then, we check that it appears at least count times by comparing the term's frequency in the article, retrieved with the freq method, to count. If it does not, we again return false, because if a term is not present count times, the phrase cannot be present count times.
For now, I am going to skip over stillValidPhrases, but rest assured that we will return to it.
If we pass the two checks just described, we then loop through each of the positions using the nextPosition method, being careful not to call it more times than the frequency of the current term, which we stored in freq, because doing so would throw an exception. On the first pass through the loop, we just add each of the positions to the map with a value of 0. At the end of the first pass for our example, the map will look like this:
Key | Value |
---|---|
6 | 0 |
15 | 0 |
21 | 0 |
23 | 0 |
For the remaining passes, we only update the map if the position we find has a matching first-term position and the phrase was still matching up to that point. So in our example, if we are looking at the term "Vidi", we are making sure the word "Veni" is directly before it. If we are looking at the term "Vici", we are checking that the word "Veni" is two words before it and that the phrase was still matching all the way through "Vidi". We do this by looking at the indices. Since the keys in phraseTesterMap are the positions of the first term, all we need to do is make sure the current position minus the current index is in the map. This works because the current index is always exactly the number of terms after the first term that the current term should sit at. In the example, "Vidi" is at index 1 and is 1 term after "Veni"; similarly, "Vici" is at index 2 and 2 terms after "Veni". Since "Vidi" is at positions 7, 14, 22, and 24 in our sentences, we can update phraseTesterMap for keys 6, 21, and 23 to value 1, the current index, because each of those keys is 1 position before an occurrence of "Vidi", and the phrase is still matching at this point because we have only completed the first term. (Position 14 updates nothing, since 14 - 1 = 13 is not a key in the map.) Now, our map looks like this:
Key | Value |
---|---|
6 | 1 |
15 | 0 |
21 | 1 |
23 | 1 |
Our third time through the loop is where the second condition for updating the map really comes into play. We are still checking for the first term in its correct position using the phraseTesterMap keys, but now we also need to look at the values of the map. Specifically, before updating a map entry, we must confirm that its current value is equal to the current index minus 1, which confirms that the phrase was still matching through the previous term. This check never fails in our example, but if our sentences had contained the phrase "Veni Very Vici", the first term would be in the correct position while the value would still be 0 instead of 1, since "Veni" was not immediately followed by "Vidi", so we would not update that entry in the map. After this pass, our example map looks like:
Key | Value |
---|---|
6 | 2 |
15 | 0 |
21 | 1 |
23 | 2 |
Now that we have completed the loops, you can see we have 2 matching phrases, since two entries in phraseTesterMap, those with keys 6 and 23, have values equal to the length of terms minus 1, as covered above.
Let's now take a look at stillValidPhrases. We initialize it to 0 before looping through the positions, and every time we update phraseTesterMap we increment it by 1, since updating the map shows that the entry is still valid and should be included in the count of still-valid phrases. Then, after looping through all the positions for the current term, if stillValidPhrases is less than count, it is impossible for that number to grow later, and therefore impossible for this document to meet the requirements, so we return false. If that final check never triggers, then we still had enough valid phrases after all the terms were checked, so we return true. In our example, if count were 4, we would return false after the second term, because we would only have 3 still-valid phrases. If count were 3, we would return false after the third term, because we would only have 2 still-valid phrases. Finally, if count were 1 or 2, we would return true at the end, because stillValidPhrases would always be greater than or equal to count when we reach that check.
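To tie all of this together, here is a standalone sketch (class and method names are mine) of the same execute logic, operating on hard-coded, zero-based position lists instead of real PostingsEnums, using the positions from the Julius Caesar example:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PhraseCountSketch {

    /**
     * Mirrors the execute() logic: termPositions.get(i) stands in for the
     * positions a PostingsEnum would return for the i-th word of the phrase.
     */
    static boolean containsPhraseAtLeast(List<List<Integer>> termPositions, int count) {
        Map<Integer, Integer> phraseTesterMap = new HashMap<>();
        for (int i = 0; i < termPositions.size(); i++) {
            List<Integer> positions = termPositions.get(i);
            // a word that appears fewer than count times makes
            // count phrase occurrences impossible
            if (positions.size() < count) {
                return false;
            }
            int stillValidPhrases = 0;
            for (int pos : positions) {
                if (i == 0) {
                    // every occurrence of the first word starts a candidate phrase
                    phraseTesterMap.put(pos, 0);
                    stillValidPhrases++;
                } else if (phraseTesterMap.containsKey(pos - i)
                        && phraseTesterMap.get(pos - i) == i - 1) {
                    // the candidate that started i words back is still intact
                    phraseTesterMap.put(pos - i, i);
                    stillValidPhrases++;
                }
            }
            if (stillValidPhrases < count) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // zero-based positions from the example document:
        // "Veni" at 6, 15, 21, 23; "Vidi" at 7, 14, 22, 24; "Vici" at 8, 13, 25
        List<List<Integer>> positions = Arrays.asList(
                Arrays.asList(6, 15, 21, 23),   // Veni
                Arrays.asList(7, 14, 22, 24),   // Vidi
                Arrays.asList(8, 13, 25));      // Vici

        System.out.println(containsPhraseAtLeast(positions, 2)); // true  (2 full phrases)
        System.out.println(containsPhraseAtLeast(positions, 3)); // false (fails after "Vici")
        System.out.println(containsPhraseAtLeast(positions, 4)); // false (fails after "Vidi")
    }
}
```

Running this reproduces the three count scenarios described above, which makes it a handy harness for experimenting with the map-building logic outside of Elasticsearch.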
Conclusion
Hopefully you now have a better understanding of how this enhancement retrieves only documents that contain a multi-word phrase a certain number of times. If not, feel free to let me know. If you can think of a way to make this better or cleaner, let me know that as well. Once again, I recommend checking out David's article about how to add this plugin to your app; then you can make the enhancements I described here.