Searching Discontinuous Dates in Elasticsearch using Lucene Expression Script

Elasticsearch is powerful, but very different to SQL and its scripting capabilities are limited in cloud hosting environments. This makes it appear impossible to use for certain sum of script and threshold queries—for example on discontinuous date ranges—but with some clever query structuring this can be done.

Situation

Our work app JobLab is an automated recruitment platform. One key feature is letting candidates (mostly students) set availability periods (e.g. summer holiday, Christmas, post graduation), then employers search for candidates who available to work the duration of their job (often fixed length internships).

A search query to power this would be easy for candidates available "for the foreseeable future", or for the entirety of the job length window. But it becomes very tricky when reality kicks in. Many students will have a re-sit exam stuck in the middle of a holiday period, or graduates may have booked a celebratory holiday in the summer after they finish. Most employers are happy to consider candidates available for the majority of the work range they set, so we need search to include these too.

The below candidate has 2 availability periods, a portion of each falling within the job window. They are available 4 out of 5 "time units" within the window, so we'd expect them to be shown in search.

Candidate Availability Periods

We have many other advanced filters that carefully rank and boost candidates. For the other filters we found Elasticsearch was the only viable system, so we have to find a way of performing this availability query in ES.

Problem

The query we really want to ask Elasticsearch is "please filter out candidates who are not available for more than 75% of the job length window, then rank the remaining candidates by how available they are; from 100% down to 75%." Alas there are no time queries in ES and the range query would only tell us "who has a single availability period long enough"; but we want to allow multiple short gaps in the middle!

Below is a graphical example using simple numbers. At the top is the 8 "time unit" job window, followed by 6 candidates available different proportions of the window. 4 meet the 75% threshold, 2 do not. It shows several edge cases; you can work out that approaches such as centre point, total availability or weighting would not work.

Candidate Availability Grid

Elasticsearch supports writing custom queries using scripts. Multiple languages are supported, but only the Lucene Expression language is available in cloud hosting environments for security reasons. Expressions are loosely based on JavaScript math and boolean logic, so only support simple numeric fields, basic mathematical functions & operators, boolean operators, and the ternary operator.

For our situation we want to for each candidate: loop over every availability period; find how much of it falls within the job window; sum all of those; then limit the results to those who's contained availability sum is over a 75% threshold. How do you perform loops in Lucene Expression scripts? How do you sum results from nested documents in Lucene Expression scripts?

The short answer is you don't. Lucene Expressions don't have loops or arrays and you can't use the result of one script in another. You can only do mathematical "expressions". This sounds too limiting, but there is a way...

How To

We're going to use two function_score queries combined with a nested query to achieve summing of script results on nested fields and ensuring the total is over a 75% threshold. Here is an overview of how it's put together in pseudo query code:

function_score:
  min_score: 0.75 # 75%
  nested:
    score_mode: sum
    function_score:
      script: "find length of availability within job window"

There are two key parts. First is score_mode:sum on the nested query, which means the scores of running the inner query on each nested document will be added together. Next is min_score:0.75 on the outer function_score query which will only include documents whose score meets that threshold. So, only documents who's sum of script results is over 0.75 will be included.

Now here is the actual script run on each availability period to find the proportion of that period falling within the job window; from 0 to 1.0. Split onto multiple lines and commented for readability (but it must be on 1 line for JSON / Elasticsearch).

// `start` and `end` are the job window, passed in from params object.

// Negative results mean availability outside job window,
// but this confuses ES so we cap at 0 using `max(n, 0)`.
max(
  // availability window is the earliest end date minus latest start date
  min(end, doc["availabilities.to"].value) -
    max(start, doc["availabilities.from"].value),
  0
) /
  // scale answer to availability as proportion of job window from 0 to 1.0
  (end - start)

Finally here are CURL commands to create and index sample documents, and the full search query.

The documents I have put in are the same as in the grid above so you can see which results are supposed to be returned and verify that the query works perfectly.

Also note that the Expression language compiles to very fast Java bytecode and by passing start and end as params variables (rather than interpolating the values) the script can be automatically cached for even faster performance!

#!/bin/bash
export ELASTICSEARCH_ENDPOINT="http://localhost:9200"

# Delete indexes
curl -XDELETE "$ELASTICSEARCH_ENDPOINT/play"

# Create indexes
curl -XPUT "$ELASTICSEARCH_ENDPOINT/play?pretty" -d '{
    "mappings": {
        "type": {
            "properties": {
                "availabilities": {
                    "type": "nested",
                    "properties": {
                        "from": { "type": "integer" },
                        "to": { "type": "integer" }
                    }
                }
            }
        }
    }
}
'

# Index documents
curl -XPUT "$ELASTICSEARCH_ENDPOINT/play/type/1?pretty" -d '{
  "availabilities": [
    { "from": 0, "to": 2 },
    { "from": 12, "to": 17 }
  ]
}
'
curl -XPUT "$ELASTICSEARCH_ENDPOINT/play/type/2?pretty" -d '{
  "availabilities": [
    { "from": 3, "to": 10 }
  ]
}
'
curl -XPUT "$ELASTICSEARCH_ENDPOINT/play/type/3?pretty" -d '{
  "availabilities": [
    { "from": 0, "to": 5 },
    { "from": 9, "to": 13 }
  ]
}
'
curl -XPUT "$ELASTICSEARCH_ENDPOINT/play/type/4?pretty" -d '{
  "availabilities": [
    { "from": 5, "to": 18 }
  ]
}
'
curl -XPUT "$ELASTICSEARCH_ENDPOINT/play/type/5?pretty" -d '{
  "availabilities": [
    { "from": 3, "to": 6 },
    { "from": 7, "to": 10 }
  ]
}
'
curl -XPUT "$ELASTICSEARCH_ENDPOINT/play/type/6?pretty" -d '{
  "availabilities": [
    { "from": 3, "to": 5 },
    { "from": 6, "to": 7 },
    { "from": 8, "to": 11 }
  ]
}
'

# Ensure docs are ready
curl -XPOST "$ELASTICSEARCH_ENDPOINT/play/_refresh"

# Perform search
echo "\n\n\nRESULTS:"
curl -XPOST "$ELASTICSEARCH_ENDPOINT/play/type/_search?pretty" -d '
{
  "query": {
    "function_score": {
      "min_score": 0.75,
      "query": {
        "nested": {
          "path": "availabilities",
          "score_mode": "sum",
          "query": {
            "function_score": {
              "script_score": {
                "script": {
                  "lang": "expression",
                  "inline":
                    "max((min(end, doc[\"availabilities.to\"].value) - max(start, doc[\"availabilities.from\"].value)), 0) / (end - start)",
                  "params": {
                    "start": 3.0,
                    "end": 11.0
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}
'

Notes

Above I'm just using simple numbers, in real code you'd use days or seconds (UNIX timestamp) indexed as an integer.

As I have often found with Elasticsearch, there is an obscure intermittent bug. With real-world data on ES 2.2.2 this technique intermittently returned incorrect results ignoring the min_score clause; even returning scores of zero. The bug is no longer present in ES 2.3.5 so only use 2.3.5 or greater for anything involving function_score min_score. Elastic.io cloud hosting offers the latest ES version.

Conclusion

This discovery wasn't intuitive; it took a lot of experimenting. I couldn't find examples of similar things being done. No ES query names obviously imply that this kind of thing can be done, and Expressions appear too simplistic.

However, combining Expressions with multiple standard Elasticsearch queries you can mimic loops summing script results and matching only documents where that sum is over a threshold. You don't have to turn to the unsafe, slow(er), cloud-unfriendly sledgehammer of Groovy or Java. We now have fine-grained availability search that handles the complexity of real life.