May reduce wall-clock time but increase total compute time (and so also power). It's less an optimization than a tradeoff.
> please stop malloc'ing for each token
It doesn't, only when it gets put in the map. (And while the particular allocation could be smaller, something guaranteed to represent a specific arbitrary-length string has to be put in the map, which is going to malloc.)
> prealloc map for found tokens (better to just allocate room for 200k words).
Has no meaningful effect on performance.
> SIMD would optimise your inner-loop quite a lot.
No, as pointed out elsethread, it's a measurable boost but nowhere near the 10x you need to make the main claim (I/O not the bottleneck) be wrong. Not even 2x.
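To make the "only when it gets put in the map" point concrete, here is a minimal sketch of the pattern, not the post's actual code. The name `countWords` and the whitespace handling are assumptions for illustration. It relies on the gc compiler avoiding a copy for a map lookup keyed by `string(byteSlice)`, so the copy happens only when a new word is inserted.

```go
package main

import "fmt"

// countWords counts words separated by spaces or newlines.
// Hypothetical helper for illustration; not taken from the original post.
func countWords(data []byte) map[string]*int {
	counts := make(map[string]*int)
	start := -1 // index where the current word began, or -1 between words

	flush := func(end int) {
		if start < 0 {
			return
		}
		word := data[start:end]
		// Lookup: the gc compiler avoids copying word for m[string(b)] reads.
		if n, ok := counts[string(word)]; ok {
			*n++
		} else {
			// First sighting: this is where the word bytes get copied.
			one := 1
			counts[string(word)] = &one
		}
		start = -1
	}

	for i, c := range data {
		if c == ' ' || c == '\n' {
			flush(i)
		} else if start < 0 {
			start = i
		}
	}
	flush(len(data))
	return counts
}

func main() {
	counts := countWords([]byte("the quick the lazy the"))
	fmt.Println(*counts["the"], *counts["quick"], len(counts)) // prints "3 1 3"
}
```

Repeated words only pay for a hash lookup; the number of string allocations tracks the number of distinct words, not the number of tokens.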
> It doesn't, only when it gets put in the map. (And while the particular allocation could be smaller, something guaranteed to represent a specific arbitrary-length string has to be put in the map, which is going to malloc.)
You could use one big buffer for all your words. Arguably that's bump allocation but it's much simpler than malloc.
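A rough sketch of what the one-big-buffer idea could look like in Go, assuming a chunked `strings.Builder` as the backing store. The `arena` type, `intern` method, and `chunkSize` value are made up for illustration. `Builder.String()` returns a string that shares the builder's buffer, and slicing a string doesn't allocate, so each new word costs a copy into the chunk rather than its own allocation.

```go
package main

import (
	"bytes"
	"fmt"
	"strings"
)

const chunkSize = 64 << 10 // assumed chunk size; tune for the workload

// arena hands out strings that share large chunks instead of
// allocating once per word. Hypothetical type for illustration.
type arena struct {
	b strings.Builder
}

// intern copies word into the current chunk and returns a string
// that aliases that copy.
func (a *arena) intern(word []byte) string {
	if a.b.Cap()-a.b.Len() < len(word) {
		// Start a fresh chunk. Reset drops the builder's reference to the
		// old buffer; strings already handed out keep it alive.
		a.b.Reset()
		n := chunkSize
		if len(word) > n {
			n = len(word)
		}
		a.b.Grow(n)
	}
	start := a.b.Len()
	a.b.Write(word) // capacity was reserved above, so this does not reallocate
	return a.b.String()[start:]
}

func main() {
	var a arena
	counts := make(map[string]*int)
	for _, w := range bytes.Fields([]byte("the quick the lazy the")) {
		if n, ok := counts[string(w)]; ok {
			*n++
			continue
		}
		one := 1
		counts[a.intern(w)] = &one
	}
	fmt.Println(*counts["the"], len(counts)) // prints "3 3"
}
```

One tradeoff of this design: a chunk can't be freed until every string pointing into it is gone, which is fine for a word-count map that lives for the whole run.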
I also tried this and didn't see much improvement. Go already uses a fast allocator for small sizes, so the words are likely to all end up in similar memory regions regardless. A bump allocator reduces GC pressure a tiny bit compared to that, but not significantly.
The real allocation win would likely be some kind of small-string optimization, which Go (specifically, the requirements of its precise GC) makes difficult. This is probably my biggest performance frustration with Go: 16 bytes for a string and especially 24 for a slice is a lot of waste when often 99% of your data is smaller than that.
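For anyone who wants to check the 16 and 24 byte figures, a quick sketch using `unsafe.Sizeof`. The helper name `headerSizes` is made up; the numbers quoted above assume a 64-bit platform, where a string header is a pointer plus a length and a slice header adds a capacity.

```go
package main

import (
	"fmt"
	"unsafe"
)

// headerSizes reports the size in bytes of a string header, a slice
// header, and one machine word. Hypothetical helper for illustration.
func headerSizes() (str, slice, word uintptr) {
	var s string
	var b []byte
	return unsafe.Sizeof(s), unsafe.Sizeof(b), unsafe.Sizeof(uintptr(0))
}

func main() {
	str, slice, word := headerSizes()
	// On a 64-bit platform this prints "16 24 8".
	fmt.Println(str, slice, word)
}
```

That overhead is per value and excludes the bytes pointed to, so a 3-letter word stored as a string costs 16 bytes of header on top of its separately allocated data.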
> `word = append(word, c)` <= this is very slow
Has no meaningful effect on performance.
Perhaps you should read the whole post.