More

414owen · 2025-12-27T11:49:05 1766836145

Wow, okay. I would imagine this makes mathematicians quite angry? I guess you're responsible for all the operations you use in your proof being well-behaved.

It sounds like subtraction over Nats needs to be split into `sub?`, and `sub!`, the former returning an option, and the latter crashing, on underflow, as is the Lean convention?

To use the default `sub`, you should need to provide a witness that the minuend is >= the subtrahend...

The version with silent underflow is still useful, it should just be called `saturatingSub`, or something, so that mathematicians using it know what they're getting themselves into...

414owen · on June 4, 2024

Word.

ge96 · on June 4, 2024

Libre Office

414owen · on Feb 22, 2024

re2c works like a charm. I'm generally confident that I couldn't write faster scanners by hand if I tried.

414owen · on Aug 7, 2023

This code was written a while ago, but I decided it merited a write-up recently.

414owen · on July 24, 2023

I have DietPi running on a Raspberry pi Zero W (armv6l), which is a system that's supported by barely any distros nowadays.

It was easy to preseed, has the absolute bare minimum amount of processes running, and is altogether a fantastic distro. Highly recommended.

kwk1 · on July 25, 2023

For what it's worth, plain Debian runs fine on those as well, most easily via https://raspi.debian.net/tested-images/

414owen · on July 9, 2023

This is pretty much `assembly language the game`: https://tomorrowcorporation.com/humanresourcemachine

It's not a useful architecture, but it teaches the thought process really well, and you end up discovering a lot of optimization naturally.

For this article, I'm measuring every step to see what the performance implications of the changes are, which, along with some educated guesses and some googling/reading other articles, was enough for me to figure out what was going on.

In part two (https://owen.cafe/posts/the-same-speed-as-c/) especially, I didn't know what was going on with the benchmarks for a long time. Eventually I got lucky and made a change, which led to a hypothesis, which lead to more tests, which led to a conclusion.

414owen · on July 9, 2023

The array lookup approach taken in part two:

https://owen.cafe/posts/the-same-speed-as-c/

But taking the length of the string as a parameter is not, because that changes the problem statement (making the solution vectorizable)

Also note that you'll try to read element -1 of the input. You probably want to change the `>=` to a `>`

414owen · on July 7, 2023

Kind words, much appreciated!

414owen · on July 6, 2023

So I actually did try that, but and IIRC it didn't produce a CMOV with either gcc or clang. I didn't put it in the repo because it wasn't an improvement (on my machine) and I decided not to write about it.

Maybe you get different results though?

414owen · on July 6, 2023

Very interesting approach. I should probably have specified that the somewhat naive assembly in `02-the-same-speed-as-c/loop-5.x64.s` is the fastest version I have.

On my machine I'm getting 0.244s for `loop-5.x64.s` and 0.422s for your implementation above.

I'm not sure why exactly we're seeing this discrepancy, and for what it's worth your implementation looks faster to me. I guess this is why you need to always benchmark on the hardware you're going to be running the code on...

xoranth · on July 6, 2023

I rerun the benchmark vs loop-5 and loop-7 from the second post. Runtime is basically the same on my machine.

I would have expected yours to be faster given that it needs to execute fewer instructions per loop iteration. Though maybe the CPU can run `adc` on more ports compared to a load from memory?

    Summary
      '01-six-times-faster-than-c/bench-x64-8 1000 1' ran
        1.00 ± 0.00 times faster than '02-the-same-speed-as-c/bench-x64-7 1000 1'
        1.66 ± 0.00 times faster than '01-six-times-faster-than-c/bench-x64-7 1000 1'

    Summary
      '01-six-times-faster-than-c/bench-x64-8 1000 1' ran
        1.01 ± 0.00 times faster than '02-the-same-speed-as-c/bench-x64-5 1000 1'
        1.66 ± 0.00 times faster than '01-six-times-faster-than-c/bench-x64-7 1000 1'