Regular Expression Matching with a Trigram Index (2012) (swtch.com)
122 points by colinprince on Feb 7, 2023 | hide | past | favorite | 16 comments


Debian Code Search is based on this work! :)

https://codesearch.debian.net/

(If you want a lot more details, check out my thesis on it: https://codesearch.debian.net/research/bsc-thesis.pdf)


plocate is another nice simple use case of a trigram index. It's a faster version of the classic locate/slocate/mlocate tools.

https://plocate.sesse.net/
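The core idea behind these tools can be sketched in a few lines: index every 3-character substring, answer a query by intersecting posting lists, then verify candidates with a real scan (the index can only rule documents out, never in). This is a minimal toy sketch, not how plocate or Code Search actually store things:

```python
from collections import defaultdict

def trigrams(s):
    """All overlapping 3-character substrings of s."""
    return {s[i:i + 3] for i in range(len(s) - 2)}

class TrigramIndex:
    """Toy inverted index: trigram -> set of document ids."""

    def __init__(self):
        self.postings = defaultdict(set)
        self.docs = {}

    def add(self, doc_id, text):
        self.docs[doc_id] = text
        for t in trigrams(text):
            self.postings[t].add(doc_id)

    def search(self, query):
        qs = trigrams(query)
        if not qs:
            # Query shorter than 3 chars: fall back to scanning everything.
            candidates = set(self.docs)
        else:
            sets = [self.postings.get(t, set()) for t in qs]
            candidates = set.intersection(*sets)
        # Verification pass removes false positives.
        return sorted(d for d in candidates if query in self.docs[d])

idx = TrigramIndex()
idx.add(1, "regular expression matching")
idx.add(2, "trigram index")
idx.add(3, "postgres pg_trgm extension")
print(idx.search("gram"))  # -> [2]
```

Regex queries work the same way: the regex is compiled into a boolean expression over trigrams (AND/OR of posting-list intersections), which is the subject of the linked article.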


searching for dirty words is very fun


I have seen the term "trigram" twice today on HN. I was familiar with the term only from something like the trigrams on the South Korean flag [1]. Interesting to see it used for a special case of n-grams in language processing.

[1]: https://en.wikipedia.org/wiki/Bagua


It's not really special. It used to be quite common, as were bigrams (n=2). In the Roman alphabet, 3 characters give a fair trade-off between precision and recall, making them suitable for things like spelling-error correction.

Everything old will be new again.
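The spelling-correction use mentioned above can be sketched with trigram-set overlap (Jaccard similarity). The padding scheme and vocabulary here are illustrative assumptions, not any particular tool's behavior:

```python
def trigrams(word):
    # Pad the word so its edges contribute trigrams too
    # (assumption: a simple "$$" padding scheme).
    w = f"$${word}$$"
    return {w[i:i + 3] for i in range(len(w) - 2)}

def similarity(a, b):
    """Jaccard similarity of the two words' trigram sets."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)

def correct(word, vocabulary):
    """Return the vocabulary entry whose trigrams overlap most."""
    return max(vocabulary, key=lambda v: similarity(word, v))

vocab = ["receive", "believe", "achieve"]
print(correct("recieve", vocab))  # -> "receive"
```

Because misspellings usually mangle only a couple of characters, most trigrams survive intact, which is why the overlap score still points at the right word.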


They also give you good performance. I'm a maintainer of a specialised open-source 3gram database, and there are a lot of performance tricks you can do with 3grams. There are around 2^24 different 3grams (fewer if you only care about printable characters). This is small enough to pack into a single array on a modern computer. For example, the file format I use has a header that reserves 8 bytes of data [1] for every possible 3gram - that's just 128MB of overhead per index, not a lot of disk storage. For 4grams, that's 32GB, much less acceptable even now. And 2grams are just not very useful for large datasets.

[1] for the curious - the header is basically uint64_t file_offsets_to_3gram_data[2**24].
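The arithmetic behind that header layout can be sketched as follows (function names are hypothetical, not from the actual file format):

```python
def trigram_key(t: bytes) -> int:
    """Pack a 3-byte trigram into a 24-bit array index,
    so it can address a flat table directly."""
    assert len(t) == 3
    return (t[0] << 16) | (t[1] << 8) | t[2]

# One uint64 file offset per possible byte trigram:
# 2**24 entries * 8 bytes each = 128 MiB of header.
HEADER_BYTES = 2**24 * 8
assert HEADER_BYTES == 128 * 1024 * 1024

# 'a' = 0x61, 'b' = 0x62, 'c' = 0x63
print(hex(trigram_key(b"abc")))  # -> 0x616263
```

The appeal is that a lookup is a single array access at `file_offsets_to_3gram_data[trigram_key(t)]`, with no hashing or tree traversal; for 4-grams the same table would need 2**32 entries, hence the 32GB figure.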


Afaik, GPT works with trigrams, and calls them tokens.


If you want a challenge, try solving the Eye Messages puzzle from Noita, which is a bunch of trigrams!

https://noita.wiki.gg/wiki/Eye_Messages


Much of Google search ranking is (well, was ten years ago) based on bigram and trigram statistics. Short of a real n-gram language model this is remarkably effective, but it has some obvious failure modes.


GitHub's code search blog post probably resurfaced this one.


Trigram indexing is also used by the Elasticsearch "wildcard" field type for fast arbitrary regex/infix matching: https://www.elastic.co/blog/find-strings-within-strings-fast...


Postgres implements this[0] as well, and it's really wonderful. It doesn't give a human the search experience they are used to, but for the superhuman who can write regex, this becomes a very cheap way to search data at scale.

[0] https://www.postgresql.org/docs/current/pgtrgm.html


I use trigram indices on a project I run[0] where I want to do cheap filtering of DB results and the performance is just outstanding; I didn't think free text search could be so fast!

CREATE EXTENSION IF NOT EXISTS pg_trgm;

CREATE INDEX IF NOT EXISTS lowercase_title ON streams (lower(title));

CREATE INDEX IF NOT EXISTS title_trgm ON streams USING gin (lower(title) gin_trgm_ops);

And boom, super performant search via `LIKE '%{}%'`.

I also love taking advantage of `TABLESAMPLE system_rows()` which lets me do hyperfast random selection without needing to randomly sort the entire table. PG has so many hidden gems.

[0] https://nobody.live


> but for the superhuman who can write regex

Usually when I write regex someone describes me as a monster. I guess both could be accurate!



Hound is an excellent implementation of this for code search:

https://github.com/hound-search/hound



