Wow, okay. I would imagine this makes mathematicians quite angry? I guess you're responsible for all the operations you use in your proof being well-behaved.
It sounds like subtraction over Nats needs to be split into `sub?`, and `sub!`, the former returning an option, and the latter crashing, on underflow, as is the Lean convention?
To use the default `sub`, you should need to provide a witness that the minuend is >= the subtrahend...
The version with silent underflow is still useful, it should just be called `saturatingSub`, or something, so that mathematicians using it know what they're getting themselves into...
It's not a useful architecture, but it teaches the thought process really well, and you end up discovering a lot of optimization naturally.
For this article, I'm measuring every step to see what the performance implications of the changes are, which, along with some educated guesses and some googling/reading other articles, was enough for me to figure out what was going on.
In part two (https://owen.cafe/posts/the-same-speed-as-c/) especially, I didn't know what was going on with the benchmarks for a long time. Eventually I got lucky and made a change, which led to a hypothesis, which lead to more tests, which led to a conclusion.
So I actually did try that, but and IIRC it didn't produce a CMOV with either gcc or clang. I didn't put it in the repo because it wasn't an improvement (on my machine) and I decided not to write about it.
Very interesting approach. I should probably have specified that the somewhat naive assembly in `02-the-same-speed-as-c/loop-5.x64.s` is the fastest version I have.
On my machine I'm getting 0.244s for `loop-5.x64.s` and 0.422s for your implementation above.
I'm not sure why exactly we're seeing this discrepancy, and for what it's worth your implementation looks faster to me. I guess this is why you need to always benchmark on the hardware you're going to be running the code on...
I rerun the benchmark vs loop-5 and loop-7 from the second post. Runtime is basically the same on my machine.
I would have expected yours to be faster given that it needs to execute fewer instructions per loop iteration.
Though maybe the CPU can run `adc` on more ports compared to a load from memory?
Summary
'01-six-times-faster-than-c/bench-x64-8 1000 1' ran
1.00 ± 0.00 times faster than '02-the-same-speed-as-c/bench-x64-7 1000 1'
1.66 ± 0.00 times faster than '01-six-times-faster-than-c/bench-x64-7 1000 1'
Summary
'01-six-times-faster-than-c/bench-x64-8 1000 1' ran
1.01 ± 0.00 times faster than '02-the-same-speed-as-c/bench-x64-5 1000 1'
1.66 ± 0.00 times faster than '01-six-times-faster-than-c/bench-x64-7 1000 1'
It sounds like subtraction over Nats needs to be split into `sub?`, and `sub!`, the former returning an option, and the latter crashing, on underflow, as is the Lean convention?
To use the default `sub`, you should need to provide a witness that the minuend is >= the subtrahend...
The version with silent underflow is still useful, it should just be called `saturatingSub`, or something, so that mathematicians using it know what they're getting themselves into...