
> but I don't think any language is going to offer that out of the box.

That's what compilers are for. I tried to improve the C version to make it friendlier to the compiler. Clang does a decent job:

https://godbolt.org/z/o35edavPn

I'm getting 1.325s (321MB/s) instead of 1.506s (282MB/s) on 100 concatenated Bibles. That's still not a 10x improvement though; the problem is cache locality in the hash map.
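To make the cache-locality point concrete, here is a minimal sketch (not the benchmarked code from the godbolt link) of word counting with a flat, open-addressing hash table: one fixed-size array of small slots, no per-node allocation, so the hot path can stay in L2. The table size, hash function, and max word length are assumptions for illustration.

    /* wfsketch.c - sketch of a flat open-addressing word histogram.
     * Assumptions: ~64K slots is enough (no resize), words > 23 chars truncated. */
    #include <ctype.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    #define NSLOTS (1u << 16)      /* power of two so we can mask instead of mod */
    #define MAXWORD 23

    typedef struct { char key[MAXWORD + 1]; uint32_t count; } Slot;
    static Slot table[NSLOTS];     /* ~1.8MB flat array; smaller vocab => hotter cache */

    /* FNV-1a over the word bytes */
    static uint32_t hash(const char *s, size_t n) {
        uint32_t h = 2166136261u;
        for (size_t i = 0; i < n; i++) { h ^= (uint8_t)s[i]; h *= 16777619u; }
        return h;
    }

    /* linear probing: empty slot => insert, matching key => bump count */
    static void bump(const char *w, size_t n) {
        if (n > MAXWORD) n = MAXWORD;
        uint32_t i = hash(w, n) & (NSLOTS - 1);
        for (;;) {
            Slot *s = &table[i];
            if (s->count == 0) {             /* empty: claim it */
                memcpy(s->key, w, n); s->key[n] = 0; s->count = 1; return;
            }
            if (strncmp(s->key, w, n) == 0 && s->key[n] == 0) { s->count++; return; }
            i = (i + 1) & (NSLOTS - 1);      /* probe the next slot */
        }
    }

    int main(void) {
        char buf[1 << 16], word[MAXWORD + 1];
        size_t wl = 0, r;
        while ((r = fread(buf, 1, sizeof buf, stdin)) > 0) {
            for (size_t i = 0; i < r; i++) {
                if (isalpha((unsigned char)buf[i])) {
                    if (wl < MAXWORD) word[wl++] = (char)tolower((unsigned char)buf[i]);
                } else if (wl) { bump(word, wl); wl = 0; }
            }
        }
        if (wl) bump(word, wl);
        for (uint32_t i = 0; i < NSLOTS; i++)
            if (table[i].count) printf("%u %s\n", table[i].count, table[i].key);
        return 0;
    }

The point is that whether a layout like this helps depends almost entirely on whether the live slots fit in L2, which is exactly what the distinct-word count controls.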



Note: Just concatenating the Bibles keeps your hash map artificially small (EDIT: relative to more organic natural-language vocabulary statistics). That matters because, as you correctly note, the big win is fitting the histogram in the L2 cache, as noted elsethread. It matters even more if you go parallel, where N CPUs with N private L2 caches can speed things up a lot, until your histograms blow out the CPU-private L2 cache sizes. https://github.com/c-blake/adix/blob/master/tests/wf.nim (or a port to your favorite lang instead of Nim) might make it easy to play with these ideas (and see at least one way to avoid almost all "allocation", under some interpretations).
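A stripped-down sketch of that per-thread-histogram pattern, not the wf.nim code: each thread fills its own small table (which can stay resident in that core's private L2) and the tables are merged only once at the end. To keep it short it histograms word lengths rather than words; the thread count and chunking scheme are assumptions.

    /* histpar.c - thread-private histograms merged at the end (build with -pthread).
     * Counts word *lengths*, not words, purely to keep the sketch small. */
    #include <ctype.h>
    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define NTHREADS 4
    #define MAXLEN 64                       /* lengths >= 63 clipped into the last bucket */

    typedef struct {
        const char *buf; size_t len;        /* this thread's chunk of the input */
        size_t hist[MAXLEN];                /* thread-private histogram: no shared writes */
    } Job;

    static void *count(void *arg) {
        Job *j = arg;
        size_t wl = 0;
        for (size_t i = 0; i < j->len; i++) {
            if (isalpha((unsigned char)j->buf[i])) wl++;
            else if (wl) { j->hist[wl < MAXLEN ? wl : MAXLEN - 1]++; wl = 0; }
        }
        if (wl) j->hist[wl < MAXLEN ? wl : MAXLEN - 1]++;
        return NULL;
    }

    int main(void) {
        /* slurp stdin into one buffer (fine for a demo, not for huge inputs) */
        size_t cap = 1 << 20, len = 0, r;
        char *buf = malloc(cap);
        while ((r = fread(buf + len, 1, cap - len, stdin)) > 0) {
            len += r;
            if (len == cap) buf = realloc(buf, cap *= 2);
        }

        pthread_t tid[NTHREADS];
        Job jobs[NTHREADS] = {0};
        size_t start = 0;
        for (int t = 0; t < NTHREADS; t++) {
            size_t end = (t == NTHREADS - 1) ? len : len * (t + 1) / NTHREADS;
            while (end < len && isalpha((unsigned char)buf[end])) end++;  /* don't split a word */
            jobs[t].buf = buf + start; jobs[t].len = end - start; start = end;
            pthread_create(&tid[t], NULL, count, &jobs[t]);
        }

        size_t total[MAXLEN] = {0};
        for (int t = 0; t < NTHREADS; t++) {          /* join, then merge: the only shared step */
            pthread_join(tid[t], NULL);
            for (int k = 0; k < MAXLEN; k++) total[k] += jobs[t].hist[k];
        }
        for (int k = 1; k < MAXLEN; k++)
            if (total[k]) printf("len %2d: %zu\n", k, total[k]);
        free(buf);
        return 0;
    }

Swap the per-thread array for a per-thread word hash table and the same structure applies; the scaling holds only while each of those private tables still fits in its core's L2.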

A better way to "scale up" is to concatenate various other things from Project Gutenberg: https://www.gutenberg.org/ At least then you get "organic" vocabulary statistics for the hash map.



