I've tried something somewhat similar in the past. I was looking at implementing an extremely fast decompressor, with ratio similar to LZ4. I was able to get 2x the decompression speed of LZ4, but struggled with compression ratio. The idea was to have 16 byte matches, and allow the matches to apply a 16-bit mask, telling whether each byte is part of the match or a literal. Then I restricted the compressor to only be able to use 16 distinct masks.
This was extremely fast to decompress, because each 16-byte match is: load the 16-byte match into an AVX2 register, load 16 bytes of literals, load the mask you're using, shuffle the literals, then blend the literals and the match. And because the matches are fixed size, you can start the fetch for multiple matches in parallel.
However, the problem I ran into, and would love to solve, is that I also wanted fast-ish compression speed. And it is very hard to search for good matches quickly. Since you have holes in the match.
I guess the author is looking at GPU compression, so they are taking a somewhat brute-force approach. But I'd be interested to see how they're doing the match finding, and what kind of speed they're getting.
The author is focused on GPU texture compression for games[1] so they're not too concerned with fast compression speed. Given a game will be played on (at a minimum) tens of thousands of machines, some of which may be quite limited (e.g. handheld consoles or mobile devices), the game developers are only more than happy to trade off once-off compression times for decompression size / speed. They'd likely only perform this on a later "release build" of their game too.
The author mentions in a tweet going from minutes to seconds for compression when switching CPU for GPU[2]. From memory he has made other references to a few seconds for compression being entirely reasonable for such tasks but I can't find a direct reference.
I've tried something somewhat similar in the past. I was looking at implementing an extremely fast decompressor, with ratio similar to LZ4. I was able to get 2x the decompression speed of LZ4, but struggled with compression ratio. The idea was to have 16 byte matches, and allow the matches to apply a 16-bit mask, telling whether each byte is part of the match or a literal. Then I restricted the compressor to only be able to use 16 distinct masks.
This was extremely fast to decompress, because each 16-byte match is: load the 16-byte match into an AVX2 register, load 16 bytes of literals, load the mask you're using, shuffle the literals, then blend the literals and the match. And because the matches are fixed size, you can start the fetch for multiple matches in parallel.
However, the problem I ran into, and would love to solve, is that I also wanted fast-ish compression speed. And it is very hard to search for good matches quickly. Since you have holes in the match.
I guess the author is looking at GPU compression, so they are taking a somewhat brute-force approach. But I'd be interested to see how they're doing the match finding, and what kind of speed they're getting.