kukas's comments

Hey, I am working on my own LLM-based decompiler for Python bytecode (https://github.com/kukas/deepcompyle). I feel that not many people are working in this research direction, but I think it could be quite interesting, especially now that longer attention contexts are becoming feasible. If anyone knows a team working on this, I would be quite interested in collaborating.


Is there a benefit to using an LLM for Python bytecode? In my experience, Python bytecode is high-level enough that it can be translated directly back to source code.
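To illustrate what "high-level" means here, a minimal sketch using the standard-library `dis` module. The function `add_tax` is a made-up example, and the exact opcode names printed depend on the CPython version, so no output is shown:

```python
import dis


def add_tax(price, rate=0.2):
    # Hypothetical example function, used only to have something to disassemble.
    total = price * (1 + rate)
    return round(total, 2)


code = add_tax.__code__

# The code object keeps local variable names and referenced global names,
# which is a large part of why CPython bytecode maps back to source so directly.
local_names = code.co_varnames
global_names = code.co_names

# Each instruction carries a symbolic opcode name and a readable argument.
instructions = list(dis.get_instructions(add_tax))
for ins in instructions:
    print(f"{ins.offset:4d} {ins.opname:<24} {ins.argrepr}")
```

The names `price`, `rate`, `total`, and `round` all survive compilation, so a decompiler mostly has to recover expression and control-flow structure rather than guess identifiers.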


My motivation is that the existing decompilers only work for Python versions up to ~3.8. A model that can be fine-tuned with every new Python release might remove the need for a highly specialized programmer who is able to update the decompiler for each new version.

It is also a toy example that lets me set up a working pipeline before trying to decompile more interesting targets.


Why Python? First, Python is a language with a large open-source ecosystem. Second, I do not think it is used for software that is distributed as binaries, is it?


Closed-source Python exists, and it is frequently distributed as compiled binaries (especially in mediocre malware).

As a (supposedly) non-malicious example, the "Nightshade" watermarking tool is distributed as closed-source, pre-compiled Python: https://nightshade.cs.uchicago.edu/downloads.html


There is PyLingual (https://pylingual.io/), but unfortunately it is not open source. I am not sure whether it is also LLM-based.


I found that a lot of decompilation work is done on C. It seems that not many Python projects are compiled into binaries.

