Regular Expression Matching with a Trigram Index (2012) (swtch.com)
122 points by colinprince on Feb 7, 2023 | hide | past | favorite | 16 comments


Debian Code Search is based on this work! :)

https://codesearch.debian.net/

(If you want a lot more details, check out my thesis on it: https://codesearch.debian.net/research/bsc-thesis.pdf)


plocate is another nice simple use case of a trigram index. It's a faster version of the classic locate/slocate/mlocate tools.

https://plocate.sesse.net/
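The core idea behind these tools can be sketched in a few lines: index every 3-character substring, answer a query by intersecting posting lists, then verify candidates with a real scan (the index can only rule documents out, never in). This is a minimal toy sketch, not how plocate or Code Search actually store things:

```python
from collections import defaultdict

def trigrams(s):
    """All overlapping 3-character substrings of s."""
    return {s[i:i + 3] for i in range(len(s) - 2)}

class TrigramIndex:
    """Toy inverted index: trigram -> set of document ids."""

    def __init__(self):
        self.postings = defaultdict(set)
        self.docs = {}

    def add(self, doc_id, text):
        self.docs[doc_id] = text
        for t in trigrams(text):
            self.postings[t].add(doc_id)

    def search(self, query):
        qs = trigrams(query)
        if not qs:
            # Query shorter than 3 chars: fall back to scanning everything.
            candidates = set(self.docs)
        else:
            sets = [self.postings.get(t, set()) for t in qs]
            candidates = set.intersection(*sets)
        # Verification pass removes false positives.
        return sorted(d for d in candidates if query in self.docs[d])

idx = TrigramIndex()
idx.add(1, "regular expression matching")
idx.add(2, "trigram index")
idx.add(3, "postgres pg_trgm extension")
print(idx.search("gram"))  # -> [2]
```

Regex queries work the same way: the regex is compiled into a boolean expression over trigrams (AND/OR of posting-list intersections), which is the subject of the linked article.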


searching for dirty words is very fun


I have seen the term "trigram" twice today on HN. I was familiar with the term only from something like the trigrams on the South Korean flag [1]. Interesting to see it used for a special case of n-grams in language processing.

[1]: https://en.wikipedia.org/wiki/Bagua


It's not really special. It used to be quite common, as were bigrams (n=2). In the Roman alphabet, 3 characters give a fair trade-off between precision and recall, making them suitable for things like spelling-error correction.

Everything old will be new again.
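The spelling-correction use mentioned above can be sketched with trigram-set overlap (Jaccard similarity). The padding scheme and vocabulary here are illustrative assumptions, not any particular tool's behavior:

```python
def trigrams(word):
    # Pad the word so its edges contribute trigrams too
    # (assumption: a simple "$$" padding scheme).
    w = f"$${word}$$"
    return {w[i:i + 3] for i in range(len(w) - 2)}

def similarity(a, b):
    """Jaccard similarity of the two words' trigram sets."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)

def correct(word, vocabulary):
    """Return the vocabulary entry whose trigrams overlap most."""
    return max(vocabulary, key=lambda v: similarity(word, v))

vocab = ["receive", "believe", "achieve"]
print(correct("recieve", vocab))  # -> "receive"
```

Because misspellings usually mangle only a couple of characters, most trigrams survive intact, which is why the overlap score still points at the right word.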


They also give you good performance. I'm a maintainer of a specialised open-source 3gram database, and there are a lot of performance tricks you can do with 3grams. There are around 2^24 different 3grams (fewer if you only care about printable characters). This is small enough to pack into a single array on a modern computer. For example, the file format I use has a header that reserves 8 bytes of data [1] for every possible 3gram - that's just 128MB of overhead per index, not a lot of disk storage. For 4grams, that's 32GB, much less acceptable even now. And 2grams are just not very useful for large datasets.

[1] for the curious - the header is basically uint64_t file_offsets_to_3gram_data[2**24].
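The arithmetic behind that header layout can be sketched as follows (function names are hypothetical, not from the actual file format):

```python
def trigram_key(t: bytes) -> int:
    """Pack a 3-byte trigram into a 24-bit array index,
    so it can address a flat table directly."""
    assert len(t) == 3
    return (t[0] << 16) | (t[1] << 8) | t[2]

# One uint64 file offset per possible byte trigram:
# 2**24 entries * 8 bytes each = 128 MiB of header.
HEADER_BYTES = 2**24 * 8
assert HEADER_BYTES == 128 * 1024 * 1024

# 'a' = 0x61, 'b' = 0x62, 'c' = 0x63
print(hex(trigram_key(b"abc")))  # -> 0x616263
```

The appeal is that a lookup is a single array access at `file_offsets_to_3gram_data[trigram_key(t)]`, with no hashing or tree traversal; for 4-grams the same table would need 2**32 entries, hence the 32GB figure.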


Afaik, GPT works with trigrams, and calls them tokens.


If you want a challenge, try solving the Eye Messages puzzle from Noita, which is a bunch of trigrams!

https://noita.wiki.gg/wiki/Eye_Messages


Much of Google search ranking is (well, was ten years ago) based on bigram and trigram statistics. Short of a real n-gram language model this is remarkably effective, but it has some obvious failure modes.


GitHub's code search blog post probably resurfaced this one.


Trigram indexing is also used by the Elasticsearch "wildcard" field type for fast arbitrary regex/infix matching: https://www.elastic.co/blog/find-strings-within-strings-fast...


Postgres implements this[0] as well, and it's really wonderful. It doesn't give a human the search experience they are used to, but for the superhuman who can write regex, this becomes a very cheap way to search data at scale.

[0] https://www.postgresql.org/docs/current/pgtrgm.html


I use trigram indices on a project I run[0] where I want to do cheap filtering of DB results and the performance is just outstanding; I didn't think free text search could be so fast!

CREATE EXTENSION IF NOT EXISTS pg_trgm;

CREATE INDEX IF NOT EXISTS lowercase_title ON streams (lower(title));

CREATE INDEX IF NOT EXISTS title_trgm ON streams USING gin (lower(title) gin_trgm_ops);

And boom, super performant search via `LIKE '%{}%'`.

I also love taking advantage of `TABLESAMPLE system_rows()` which lets me do hyperfast random selection without needing to randomly sort the entire table. PG has so many hidden gems.

[0] https://nobody.live


> but for the superhuman who can write regex

Usually when I write regex someone describes me as a monster. I guess both could be accurate!



Hound is an excellent implementation of this for code search:

https://github.com/hound-search/hound



