Hacker News

You try your best, and if you provide enough examples, it will undoubtedly get figured out.


I think you're misunderstanding OP's objection. It's not simply a matter of going back and forth with the LLM until eventually (infinite monkeys on typewriters style) it gets the same binary as before: even if you got the exact same source code as the original, there's still no automated way to tell that you're done, because the bits you get back out of the recompile step will almost certainly not be the same, even if your decompiled source were identical in every way. They might even vary quite substantially depending on a lot of different environmental factors.

Reproducible builds are hard to pull off even cooperatively, when you control the pipeline that built the original binary and can work to eliminate all sources of variation. It's simply not going to happen in a decompiler like this.


Well, no, but yes.

The critical piece is that this can be done in training. If I collect a large number of C programs from GitHub and compile them in a deterministic fashion, I can use that as a training, test, and validation set. The output of the ML ought to compile to the same binary given the same environment.

Indeed, I can train over multiple deterministic build environments (e.g. different compilers, different compiler flags) to be even more robust.
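A minimal sketch of what building such a dataset could look like (Python; the function names and flag choices are my assumptions, though `SOURCE_DATE_EPOCH` and `-ffile-prefix-map` are real reproducible-build mechanisms supported by GCC):

```python
import hashlib
import subprocess
from pathlib import Path

def deterministic_compile_cmd(src: Path, out: Path, flags=("-O2",)):
    """Build a gcc invocation aimed at reproducibility: fixed flags and
    normalized embedded paths."""
    return [
        "gcc", *flags,
        f"-ffile-prefix-map={src.parent}=.",  # strip build-machine paths
        "-o", str(out), str(src),
    ]

def build_pair(src: Path, out: Path, env_id: str):
    """Compile one source file under a pinned environment and return a
    (binary hash, source text) training pair tagged with the environment id."""
    subprocess.run(
        deterministic_compile_cmd(src, out),
        check=True,
        env={"SOURCE_DATE_EPOCH": "0", "PATH": "/usr/bin:/bin"},  # pin timestamps
    )
    return {
        "env": env_id,
        "binary_sha256": hashlib.sha256(out.read_bytes()).hexdigest(),
        "source": src.read_text(),
    }
```

Training over multiple environments then just means repeating `build_pair` with different `flags` and compiler versions per `env_id`.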

The second critical piece is that for something like a GAN, it doesn't need to be identical. You have two ML algorithms competing:

- One is trying to identify generated versus ground-truth source code

- One is trying to generate source code

Many generative ML tasks are trained this way, and exact reproduction doesn't matter. I have images and descriptions, and all the ML needs to do is generate an indistinguishable description.
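For concreteness, the two competing objectives above are usually written as the standard GAN losses; here is a toy numpy version (this particular formulation is my choice, not something the comment specifies):

```python
import numpy as np

def discriminator_loss(d_real, d_fake):
    # The discriminator wants real samples scored near 1 and fakes near 0.
    return float(-np.mean(np.log(d_real) + np.log(1.0 - d_fake)))

def generator_loss(d_fake):
    # Non-saturating generator loss: the generator wants its fakes scored near 1.
    return float(-np.mean(np.log(d_fake)))
```

Training alternates between the two: update the discriminator to lower `discriminator_loss`, then the generator to lower `generator_loss`, until the discriminator can no longer tell generated source from ground truth.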

So if I give the poster a lot more benefit of the doubt on what they wanted to say, it can make sense.


Oh, I was assuming that Eager was responding to klik99's question about how we could identify hallucinations in the output—round tripping doesn't help with that.

If what they're actually saying is that it's possible to train a model to low loss and then you just have to trust the results, yes, what you say makes sense.


I haven't found many places where I trust the results of an ML algorithm. I've found many places where they work astonishingly well 30-95% of the time, which is to say, save me or others a bunch of time.

It's been years, but I'm thinking back through things I've reverse-engineered before, and having something which kinda works most of the time would be super-useful still as a starting point.


Have you ever trained a GAN?


Technically, yes!

A more reasonable answer, though, is "no."

I've technically gone through random tutorials and trained various toy networks, including a GAN at some point, but I don't think that should really count. I also have a ton of experience with neural networks that's decades out-of-date (HUNDREDS of nodes, doing things like OCR). And I've read a bunch of modern papers and used a bunch of Hugging Face models.

Which is to say, I'm not completely ignorant, but I do not have credible experience training GANs.


That's true, but it's a solvable problem. I once tried to reproduce the build of an uncooperative party, and it was mainly tedious and boring.

The space of possible compiler arguments is huge, but what is actually used in practice mostly falls within a small subset.

Apart from that, I wrote a small tool to normalize the version strings, timestamps, and file paths in the binaries before I compared them. I know there are other sources of non-determinism, but these three were enough in my case.

The hardest part was the numerous file paths from the build machine. I had not expected that. In hindsight, stripping both binaries before comparison might have helped, but I don't remember why I didn't do that.
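A normalization pass like the one described might look like this (Python; the regex patterns are illustrative guesses and would need to be matched to what the actual toolchain embeds):

```python
import re

# Illustrative patterns for the three noise sources mentioned:
# timestamps, version strings, and build-machine file paths.
PATTERNS = [
    (re.compile(rb"\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}"), b"<TIMESTAMP>"),
    (re.compile(rb"(?:GCC|clang)[^\x00]{0,64}"), b"<VERSION>"),
    (re.compile(rb"/[\w./-]*\.(?:c|h|cc|cpp)"), b"<PATH>"),
]

def normalize(blob: bytes) -> bytes:
    """Replace known sources of build noise with fixed placeholders."""
    for pattern, placeholder in PATTERNS:
        blob = pattern.sub(placeholder, blob)
    return blob

def same_modulo_noise(a: bytes, b: bytes) -> bool:
    """Compare two binaries after normalizing away version strings,
    timestamps, and file paths."""
    return normalize(a) == normalize(b)
```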


Err, no, sorry, it won't. Compilers don't work that way. There are a lot of ways to compile source down to machine code, and the output changes from compiler version to compiler version. The LLM would have to know exactly how the compiler worked at each version to do this. So the idea is technically possible but not practically feasible.


What exactly are you suggesting will get figured out?


The mapping from binary to source code.


Even ignoring all sources of irreproducibility, there does not exist a bijection between source and binary artifact irrespective of tool chain. Two different toolchains could compile the same source to different binaries or different sources to the same binary. And you absolutely shouldn't be ignoring sources of irreproducibility in this context, since they'll cause even the same toolchain to keep producing different binaries given the same source.


Exactly, but neither the source nor the binary is what's truly important here. The real question is: can the LLM generate the functionally valid source equivalent of the binary at hand? If I disassemble Microsoft Paint, can I get code that will result in a mostly functional version of Microsoft Paint, or will I just get 515 compile errors instead?


This is what I thought the question was really about.

I assume that an LLM will simply see patterns that look similar to other patterns and make associations and assume equivalences on that level, while real code is full of cases where the programmer, especially an assembly programmer, modifies something by a single instruction or offset value to get a very specific and functionally important result.

Often the result is code that not only isn't obvious, it's nominally flat-out wrong, violating standards, specs, intended function, datasheet docs, etc. If all you knew were the rules written in the docs, the code would be broken and invalid.

Is the LLM really going to see or understand the intent of that?

They find matching patterns in other existing work, and to the user who cannot see the vast body of material the LLM pulled from, it looks like the LLM understood the intent of a question, but I say it just found the prior work of some human who understood a similar intent somewhere else.

Maybe an LLM or some other flavor of AI can operate some other way, like actually playing out the binary as if executing it in a debugger and mapping out the results, rather than just treating the code as fuzzy patterns to match. Can that take the place of understanding the intent the way a human would when reading the decompiled assembly?

Guess we'll be finding out sooner or later, since of course it will all be tried.


The question was about the reverse mapping.


Except LLMs cannot reason.


LLMs can mimic past examples of reasoning from the dataset. So, it can re-use reasoning that it has already been trained on. If the network manages to generalize well enough across its training data, then it can get close to reproducing general reasoning. But it can't yet fully get there, of course.


Do you have evidence LLMs can indeed generalize outside their training data distribution?

https://twitter.com/abacaj/status/1721223737729581437/photo/...


No. I know only that they can generalize within it, and only to a limited degree, but don't have solid evidence of even that.


So what you're saying is there's tenuous-at-best, non-"solid" evidence that LLMs can reason even within their training data.

And yet I'm currently sitting at -1 for stating the blisteringly obvious. Lmao


Yes, that's basically what I'm saying. Just less bluntly. It's slightly more nuanced than "LLMs cannot reason" because lines of reasoning are often in their dataset and can sometimes be used by the model. It's just that the model can't be relied on to know the correct reasoning to use in a given situation.



