#WTF Elasticsearch: Phrase "windows 7" not found

What's going on? A simple search for "windows 7" doesn't return an expected document, even though the document's fields definitely contain the phrase naming the popular operating system. What's the issue here?

Let's have a look. We are doing a simple match_phrase query like this:

match_phrase 'windows 7' query
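The exact request is not reproduced here. As a minimal sketch, assuming the phrase is searched in the mainContent field of the jobpushy index (both names appear in the examples of this post, but which field was actually queried is an assumption), the request body could be built like this:

```python
import json

# Assumption: the phrase is searched in the "mainContent" field.
# The real query may have targeted a different field or several fields.
body = {
    "query": {
        "match_phrase": {
            "mainContent": "windows 7"
        }
    }
}

# This body would be sent to the search endpoint, e.g. GET jobpushy/_search
print(json.dumps(body, indent=2))
```

A match_phrase query only hits when the analyzed query tokens appear in the field as adjacent tokens in the same order, which is why the token boundaries matter so much in what follows.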

No results. However, the JSON content of the indexed document looks fine:

    {
      .....
      "_source": {
        "_docType": "jobs",
        "header": "IT-Support (m/w)",
        "url": "https://www......",
        "mainContent": """
          Für unseren Standort in München suchen wir ab sofort Mitarbeiter/-innen für den **IT-Support (m/w)**
          .......
          - Erfahrung mit den Betriebssystemen (Windows 7, Linux)
          .......
        """
      }
      .....
    }

First guess: do we use the "Length Token Filter" in our analysis chain? I remembered that we had configured it for some fields to eliminate stray one-letter tokens, and such a filter would drop the "7". A check with the _analyze API shows that the tokenizing works as expected:

    GET jobpushy/_analyze
    {
      "analyzer": "default",
      "text": "windows 7"
    }

Response:

    {
      "tokens": [
        { "token": "windows", "start_offset": 0, "end_offset": 7, "type": "word", "position": 0 },
        { "token": "7", "start_offset": 8, "end_offset": 9, "type": "word", "position": 1 }
      ]
    }
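To illustrate why the length filter was the first suspect, here is a rough Python stand-in for such a filter. The function name and the minimum length of 2 are hypothetical and not taken from the real index settings:

```python
def length_filter(tokens, min_len=2, max_len=255):
    """Keep only tokens whose length lies within [min_len, max_len].

    Rough stand-in for a length token filter; the limits are assumed values.
    """
    return [t for t in tokens if min_len <= len(t) <= max_len]


tokens = ["windows", "7"]
filtered = length_filter(tokens)
print(filtered)  # the one-character token "7" is gone
```

With a filter like this in the chain, the "7" would never reach the index and the phrase could not match. The _analyze output shows both tokens, so the filter is not the culprit for this field.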

The problem

Could the bracket be the problem? Trying

    GET jobpushy/_analyze
    {
      "analyzer": "default",
      "text": "(windows 7"
    }

leads to:

    {
      "tokens": [
        { "token": "(windows", .... },
        { "token": "7", ..... }
      ]
    }

Ok, that's unexpected. The "(" is part of the token. That's the reason why "windows 7" is not matching. Instead "(windows 7" would match. What did we do wrong?
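The effect can be reproduced outside Elasticsearch with a small Python sketch. It assumes the old tokenizer pattern contained the usual separators but no brackets, and that the analyzer lowercases tokens. Both are assumptions, and Python's re module only stands in for the Java regex engine that Elasticsearch uses:

```python
import re

# Assumed pre-fix pattern: common separators, but no brackets.
OLD_PATTERN = r",|;|:|\s|-|/"


def tokenize(text, pattern):
    """Split on every pattern match, drop empty tokens, lowercase the rest."""
    return [t.lower() for t in re.split(pattern, text) if t]


doc_tokens = tokenize("Betriebssystemen (Windows 7, Linux)", OLD_PATTERN)
query_tokens = tokenize("windows 7", OLD_PATTERN)

print(doc_tokens)    # the brackets stay glued to their neighbours
print(query_tokens)
print("windows" in doc_tokens)
```

The document side yields "(windows" and "linux)", while the query side yields "windows". The first token of the phrase never lines up, so the phrase query finds nothing.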

The solution

It turns out that the tokenizer definition of our index is buggy. We don't use the standard tokenizer because it doesn't make sense for some IT terminology (e.g. node.js, .NET). Instead, we use a self-defined pattern for tokenizing. The solution is to add the brackets to the tokenizer pattern. Simplified, it looks something like this:

    {
      "tokenizer": {
        "jobpushy_tokenizer": {
          "pattern": """,|;|:|\s|-|/|\(|\)""",
          "type": "pattern"
        }
      }
    }
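A quick way to sanity-check the extended pattern is to replay it in Python. This is only a sketch: re.split stands in for the pattern tokenizer, lowercasing is an assumed part of the analyzer, and contains_phrase is a hypothetical helper that mimics the adjacency check of a phrase query:

```python
import re

# Pattern from the tokenizer definition, now including both brackets.
FIXED_PATTERN = r",|;|:|\s|-|/|\(|\)"


def tokenize(text, pattern=FIXED_PATTERN):
    """Split on every pattern match, drop empty tokens, lowercase the rest."""
    return [t.lower() for t in re.split(pattern, text) if t]


def contains_phrase(doc_tokens, phrase_tokens):
    """True if phrase_tokens occur in doc_tokens as one contiguous run."""
    n = len(phrase_tokens)
    return any(
        doc_tokens[i:i + n] == phrase_tokens
        for i in range(len(doc_tokens) - n + 1)
    )


doc_tokens = tokenize("Erfahrung mit den Betriebssystemen (Windows 7, Linux)")
print(doc_tokens)
print(contains_phrase(doc_tokens, tokenize("windows 7")))

# Terms such as node.js and .NET still survive as single tokens,
# because "." is not part of the pattern.
print(tokenize("node.js .NET"))
```

Documents that are already indexed keep their old tokens, so they have to be reindexed before the phrase query starts to match them.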

About the #WTF posts on this blog:
These are short articles that show simple solutions to problems that caused a headache in the first place. This serves two goals: 1.) Remembering the solution (and my own stupidity) for the next time 2.) Helping other people with similar problems.

