Handling Spelling Mistakes with Postgres Full Text Search
Handling misspelled search terms with Postgres Full Text Search and PgSearch can be tricky, depending on how your site is set up. Keep reading to find out a way to work around misspellings while using PgSearch's multisearch in a performant way!
Background #
Postgres Full Text Search (FTS) is a great way to implement site search on a website running Postgres already, without requiring additional infrastructure.
On a recent engagement with a client, we were deciding between Postgres FTS and ElasticSearch. Ultimately we chose FTS because we could spin it up without having to add extra infrastructure, as we would with ElasticSearch.
Since the project was written in Ruby on Rails, we were able to use the excellent PgSearch gem to implement FTS in ActiveRecord.
Multisearch #
As we wanted a general site search, we needed to use multisearch. Multisearch combines multiple ActiveRecord models into one search 'document' table that you can search against. For example, if a user searches for a term and the search is configured for multisearch, then every model that we mark as multisearchable will be searched for that term at the same time. See here for more detail.
Search Features #
PgSearch allows for different search features: tsearch, trigram, and dmetaphone. The default is tsearch, which uses the built-in Postgres Full Text Search.
This was great for our use case, since it also comes with highlighting, a feature that was required. The highlighting is from a field returned by Postgres FTS, where it returns the text around the search term for context and bolds the search terms.
Spelling Mistakes #
Unfortunately, tsearch does not handle misspelled words. However, as I mentioned before, PgSearch allows for other search features! And trigram is a feature, installed via a Postgres extension (pg_trgm), that does just that.
Trigram #
- The idea behind trigram search is to split pieces of text into sets of three-letter segments and compare the sets to one another.
- If two trigram sets are similar enough, we assume there was a spelling mistake and return the document with the correctly-spelled term.
- As a quick example (ignoring whitespace), consider the word Viget. Viget would make the trigrams [vig, ige, get].
- Now, consider our evil twin agency, Qiget. They would make the trigrams [qig, ige, get].
- The two trigram sets match very closely, with only one of the trigrams not being the same. Thus, if we were to compare these with pg_trgm, we could reasonably tell that anyone typing 'Qiget' must have actually been looking for 'Viget' and just misspelled it.
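To make the comparison concrete, here is a minimal Ruby sketch of the idea. It is an illustration only, not how pg_trgm is implemented, and the helper names `trigrams` and `similarity` are hypothetical. By default it ignores whitespace, as the example above does. Passing `pad: true` approximates pg_trgm's documented behavior of treating each word as prefixed with two spaces and suffixed with one. The score is the share of trigrams the two sets have in common.

```ruby
require "set"

# Split a word into its set of three-character segments.
# pad: false mirrors the simplified example (whitespace ignored).
# pad: true is an approximation of pg_trgm's padding (assumption:
# two leading spaces and one trailing space, as its docs describe).
def trigrams(word, pad: false)
  text = word.downcase
  text = "  #{text} " if pad
  text.chars.each_cons(3).map(&:join).to_set
end

# Shared trigrams divided by all distinct trigrams across both words.
def similarity(a, b, pad: false)
  ta = trigrams(a, pad: pad)
  tb = trigrams(b, pad: pad)
  union = ta | tb
  return 0.0 if union.empty?
  (ta & tb).size.to_f / union.size
end

puts trigrams("Viget").to_a.inspect   # ["vig", "ige", "get"]
puts trigrams("Qiget").to_a.inspect   # ["qig", "ige", "get"]
puts similarity("Viget", "Qiget")     # 0.5 (2 shared out of 4 distinct)
puts similarity("Viget", "Qiget", pad: true).round(2)
```

With padding, the two words share fewer of their trigrams overall, so the padded score comes out lower than the unpadded one. That is worth keeping in mind when reasoning about similarity thresholds.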
Working Trigram into our existing solution #
PgSearch allows us to use multiple search features at once, so we can use tsearch and trigram side by side. Note that we cannot just replace tsearch with trigram, because we need some features that are exclusive to tsearch. Here is what an example configuration might look like.
PgSearch.multisearch_options = {
  using: {
    tsearch: {
      prefix: true,
      highlight: {
        MaxFragments: 1
      }
    },
    trigram: {
      only: [:content]
    }
  }
}
Trigram (and timelines) causing issues #
While it was easy to slot Trigram into our multisearch, it caused a pretty serious performance hit. We were seeing 50x-75x slower searches with both features combined than with just tsearch. We needed to find a way to balance performance with handling misspellings.
At the point that handling misspellings became prioritized, the entire search feature was almost fully QA'd and about ready to go out. There wasn't much time left in the budget to find a good solution for the issue.
This thread from the PgSearch repo sums it up pretty well: multiple other users were (and still are) running into issues similar to ours. The top-rated comment in this thread is someone mentioning that the solution was to just use ElasticSearch. ('Top-rated' is doing a lot of heavy lifting. It did have the most likes... at two.) We needed to find some sort of middle-ground solution that we could act on quickly.
Postgres Documentation saves the day #
In the docs for the Trigram Postgres extension, the writers give an idea for using Trigram in conjunction with Full Text Search. The general idea is to create a separate words table that has a Trigram index on it.
Something like this worked for us. Note that we added an additional step with a temporary table. This was to allow us to filter out words that included non-alphabet characters.
execute <<-SQL
  -- Need to make a temp table so we can remove non-alphabet characters like websites
  CREATE TEMP TABLE temp_words AS
    SELECT word FROM ts_stat('SELECT to_tsvector(''simple'', content) FROM pg_search_documents');

  CREATE TABLE pg_search_words (
    id SERIAL PRIMARY KEY,
    word text
  );

  INSERT INTO pg_search_words (word)
    SELECT word
    FROM temp_words
    WHERE word ~ '^[a-zA-Z]+$';

  CREATE INDEX pg_words_idx ON pg_search_words USING GIN (word gin_trgm_ops);

  DROP TABLE temp_words;
SQL
This words table is therefore populated with every unique word that exists in your search content table. For us, this table was pretty large.
result = ActiveRecord::Base.connection.execute("SELECT COUNT(*) FROM pg_search_words")
puts result.first['count']
# => 1118644
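The alphabet-only filter that decides which words land in this table can be tried out in isolation. The sketch below is a plain-Ruby stand-in for the Postgres pattern '^[a-zA-Z]+$' used above. The helper name `alpha_only?` and the sample tokens are hypothetical, not taken from the real table. One consequence it makes visible is that the pattern only accepts ASCII letters, so words with accented characters are dropped along with URLs and numbers.

```ruby
# Ruby equivalent of the Postgres pattern '^[a-zA-Z]+$'.
# \A and \z anchor to the whole string (Ruby's ^ and $ anchor to lines).
ALPHA_ONLY = /\A[a-zA-Z]+\z/

# Hypothetical helper: true when a token would pass the filter.
def alpha_only?(token)
  token.match?(ALPHA_ONLY)
end

# Hypothetical sample tokens for illustration.
tokens = ["viget", "postgres", "viget.com", "2024", "full-text", "café"]

kept, dropped = tokens.partition { |token| alpha_only?(token) }
puts "kept:    #{kept.inspect}"     # ["viget", "postgres"]
puts "dropped: #{dropped.inspect}"  # ["viget.com", "2024", "full-text", "café"]
```

If your content includes non-English words, a broader character class may be worth considering. That is a judgment call that depends on your data.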
Keeping the words table up-to-date #
As mentioned in the docs, this table is separate from your search table. Therefore, it needs to be either periodically regenerated or at least have any new words added to search content also added to this table.
One way to achieve this is with a trigger, which adds all new words (still filtering out non-alphabet characters) that are inserted into the documents table to the words table:
create_trigger("pg_search_documents_after_insert_update_row_tr", generated: true, compatibility: 1)
  .on("pg_search_documents")
  .after(:insert, :update) do
    <<-SQL_ACTIONS
      CREATE TEMP TABLE temp_words AS
        SELECT word FROM ts_stat('SELECT to_tsvector(''simple'', ' || quote_literal(NEW.content) || ')');

      INSERT INTO pg_search_words (word)
        SELECT word
        FROM temp_words
        WHERE word ~ '^[a-zA-Z]+$';

      DROP TABLE temp_words;
    SQL_ACTIONS
  end
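Note that this does not handle records being deleted from the table; that would need to be handled separately. Also, as written, the trigger inserts every word in the new content, including words that are already in the words table, so duplicates can accumulate over time.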
How we used the words table #
Assuming for simplicity the user's search term is a single word, if the search returns no results, we compare the search term's trigram set to the trigram index on the words table, and return the closest match.
Then, we'd show the closest match in a "Did you mean {correctly-spelled word}?" prompt that hyperlinks to a search of the correctly-spelled word.
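The fallback lookup can be sketched in plain Ruby. This is an in-memory analogue under stated assumptions, not the query we ran: in Postgres the comparison happens against the trigram index on the words table, along the lines of the example in the pg_trgm docs that filters with the % operator and orders by similarity(). The method names `trigram_set`, `trigram_similarity`, and `did_you_mean` are hypothetical, and the 0.3 cutoff is chosen to mirror pg_trgm's documented default similarity threshold.

```ruby
require "set"

# Approximation of pg_trgm's trigram extraction (assumption: words are
# lowercased and padded with two leading spaces and one trailing space).
def trigram_set(word)
  "  #{word.downcase} ".chars.each_cons(3).map(&:join).to_set
end

# Shared trigrams divided by all distinct trigrams across both words.
def trigram_similarity(a, b)
  ta = trigram_set(a)
  tb = trigram_set(b)
  (ta & tb).size.to_f / (ta | tb).size
end

# Return the closest known word, or nil if nothing is similar enough.
# `words` stands in for the contents of the words table.
def did_you_mean(term, words, threshold: 0.3)
  best = words.max_by { |word| trigram_similarity(term, word) }
  return nil if best.nil?
  trigram_similarity(term, best) >= threshold ? best : nil
end

# Hypothetical word list for illustration.
known_words = %w[viget widget postgres search]

suggestion = did_you_mean("qiget", known_words)
puts "Did you mean #{suggestion}?" if suggestion
```

Returning nil when nothing clears the threshold keeps the prompt from suggesting an unrelated word for a term that simply has no close match.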
Given more time, I would have liked to explore options to speed up the combined FTS and Trigram search. I'm certain we could have improved on the performance issues, but I can't say for sure that we could have gotten the search time down to a reasonable amount.
A future enhancement that would be pretty simple is to automatically search for that correctly-spelled word, removing the prompt to click the link. We could also change the text to something like "Showing results for {correctly-spelled word}".
Ultimately, I think with the situation at hand, we made the right call implementing Trigram this way. The search is just as fast as before, and now in the case of misspellings, a user just has to follow the link to the correctly-spelled word and they will see the results they wanted very quickly.