
> but I don't think any language is going to offer that out of the box.

That's what compilers are for. I tried to improve the C version to make it friendlier to the compiler. Clang does a decent job:

https://godbolt.org/z/o35edavPn

I'm getting 1.325s (321MB/s) instead of 1.506s (282MB/s) on 100 concatenated Bibles. That's still not a 10x improvement though; the problem is cache locality in the hash map.
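To make the cache-locality point concrete, here is a minimal sketch (not the benchmarked code from the godbolt link) of word counting with a flat, open-addressing hash table: one fixed-size array of small slots, no per-node allocation, so the hot path can stay in L2. The table size, hash function, and max word length are assumptions for illustration.

    /* wfsketch.c - sketch of a flat open-addressing word histogram.
     * Assumptions: ~64K slots is enough (no resize), words > 23 chars truncated. */
    #include <ctype.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    #define NSLOTS (1u << 16)      /* power of two so we can mask instead of mod */
    #define MAXWORD 23

    typedef struct { char key[MAXWORD + 1]; uint32_t count; } Slot;
    static Slot table[NSLOTS];     /* ~1.8MB flat array; smaller vocab => hotter cache */

    /* FNV-1a over the word bytes */
    static uint32_t hash(const char *s, size_t n) {
        uint32_t h = 2166136261u;
        for (size_t i = 0; i < n; i++) { h ^= (uint8_t)s[i]; h *= 16777619u; }
        return h;
    }

    /* linear probing: empty slot => insert, matching key => bump count */
    static void bump(const char *w, size_t n) {
        if (n > MAXWORD) n = MAXWORD;
        uint32_t i = hash(w, n) & (NSLOTS - 1);
        for (;;) {
            Slot *s = &table[i];
            if (s->count == 0) {             /* empty: claim it */
                memcpy(s->key, w, n); s->key[n] = 0; s->count = 1; return;
            }
            if (strncmp(s->key, w, n) == 0 && s->key[n] == 0) { s->count++; return; }
            i = (i + 1) & (NSLOTS - 1);      /* probe the next slot */
        }
    }

    int main(void) {
        char buf[1 << 16], word[MAXWORD + 1];
        size_t wl = 0, r;
        while ((r = fread(buf, 1, sizeof buf, stdin)) > 0) {
            for (size_t i = 0; i < r; i++) {
                if (isalpha((unsigned char)buf[i])) {
                    if (wl < MAXWORD) word[wl++] = (char)tolower((unsigned char)buf[i]);
                } else if (wl) { bump(word, wl); wl = 0; }
            }
        }
        if (wl) bump(word, wl);
        for (uint32_t i = 0; i < NSLOTS; i++)
            if (table[i].count) printf("%u %s\n", table[i].count, table[i].key);
        return 0;
    }

The point is that whether a layout like this helps depends almost entirely on whether the live slots fit in L2, which is exactly what the distinct-word count controls.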



Note: Just concatenating the Bibles keeps your hash map artificially small (EDIT: relative to more organic natural-language vocabulary statistics). That matters because, as you correctly note, the big win is fitting the histogram in the L2 cache, as noted elsethread. It matters even more if you go parallel, where N CPUs with N private L2 caches can speed things up a lot, until your histograms blow out the CPU-private L2 cache sizes. https://github.com/c-blake/adix/blob/master/tests/wf.nim (or a port to your favorite lang instead of Nim) might make it easy to play with these ideas (and see at least one way to avoid almost all "allocation", under some interpretations).
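A stripped-down sketch of that per-thread-histogram pattern, not the wf.nim code: each thread fills its own small table (which can stay resident in that core's private L2) and the tables are merged only once at the end. To keep it short it histograms word lengths rather than words; the thread count and chunking scheme are assumptions.

    /* histpar.c - thread-private histograms merged at the end (build with -pthread).
     * Counts word *lengths*, not words, purely to keep the sketch small. */
    #include <ctype.h>
    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define NTHREADS 4
    #define MAXLEN 64                       /* lengths >= 63 clipped into the last bucket */

    typedef struct {
        const char *buf; size_t len;        /* this thread's chunk of the input */
        size_t hist[MAXLEN];                /* thread-private histogram: no shared writes */
    } Job;

    static void *count(void *arg) {
        Job *j = arg;
        size_t wl = 0;
        for (size_t i = 0; i < j->len; i++) {
            if (isalpha((unsigned char)j->buf[i])) wl++;
            else if (wl) { j->hist[wl < MAXLEN ? wl : MAXLEN - 1]++; wl = 0; }
        }
        if (wl) j->hist[wl < MAXLEN ? wl : MAXLEN - 1]++;
        return NULL;
    }

    int main(void) {
        /* slurp stdin into one buffer (fine for a demo, not for huge inputs) */
        size_t cap = 1 << 20, len = 0, r;
        char *buf = malloc(cap);
        while ((r = fread(buf + len, 1, cap - len, stdin)) > 0) {
            len += r;
            if (len == cap) buf = realloc(buf, cap *= 2);
        }

        pthread_t tid[NTHREADS];
        Job jobs[NTHREADS] = {0};
        size_t start = 0;
        for (int t = 0; t < NTHREADS; t++) {
            size_t end = (t == NTHREADS - 1) ? len : len * (t + 1) / NTHREADS;
            while (end < len && isalpha((unsigned char)buf[end])) end++;  /* don't split a word */
            jobs[t].buf = buf + start; jobs[t].len = end - start; start = end;
            pthread_create(&tid[t], NULL, count, &jobs[t]);
        }

        size_t total[MAXLEN] = {0};
        for (int t = 0; t < NTHREADS; t++) {          /* join, then merge: the only shared step */
            pthread_join(tid[t], NULL);
            for (int k = 0; k < MAXLEN; k++) total[k] += jobs[t].hist[k];
        }
        for (int k = 1; k < MAXLEN; k++)
            if (total[k]) printf("len %2d: %zu\n", k, total[k]);
        free(buf);
        return 0;
    }

Swap the per-thread array for a per-thread word hash table and the same structure applies; the scaling holds only while each of those private tables still fits in its core's L2.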

A better way to "scale up" is to concatenate various other things from Project Gutenberg: https://www.gutenberg.org/ At least then you get "organic" vocabulary statistics for the hash map.



