Interesting benchmark, but worth noting the methodology: skills are generated before the task, with no feedback loop. In practice, useful skills tend to emerge from doing — you attempt, observe what failed, then codify what worked. Generate → execute → observe → refine. The paper tests cold generation, which is a different (and less realistic) setup.
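To make the distinction concrete, here is a rough sketch of the generate → execute → observe → refine loop. It is not the paper's method or any real agent framework: `Skill`, `Outcome`, `refine_loop`, and the toy task are all hypothetical names made up for illustration, and the "task" is just a checklist the skill has to cover.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Outcome:
    success: bool
    feedback: str  # what was observed during the attempt (hypothetical shape)


@dataclass
class Skill:
    notes: list[str] = field(default_factory=list)  # codified lessons


def refine_loop(
    generate: Callable[[], Skill],
    execute: Callable[[Skill], Outcome],
    refine: Callable[[Skill, Outcome], Skill],
    max_rounds: int = 5,
) -> tuple[Skill, list[Outcome]]:
    """Generate a skill, then repeatedly execute, observe, and refine it."""
    skill = generate()
    history: list[Outcome] = []
    for _ in range(max_rounds):
        outcome = execute(skill)      # attempt the task with the current skill
        history.append(outcome)       # observe what happened
        if outcome.success:
            break
        skill = refine(skill, outcome)  # codify what the failure taught us
    return skill, history


# Toy stand-ins (assumptions): the task succeeds only once the skill
# contains every required lesson, and each failure reveals one missing lesson.
REQUIRED = ["pin dependency versions", "run tests before committing"]


def toy_generate() -> Skill:
    return Skill(notes=[])  # cold generation: nothing learned yet


def toy_execute(skill: Skill) -> Outcome:
    for lesson in REQUIRED:
        if lesson not in skill.notes:
            return Outcome(success=False, feedback=lesson)
    return Outcome(success=True, feedback="")


def toy_refine(skill: Skill, outcome: Outcome) -> Skill:
    return Skill(notes=skill.notes + [outcome.feedback])


if __name__ == "__main__":
    cold = toy_execute(toy_generate())
    print("cold generation succeeds:", cold.success)
    skill, history = refine_loop(toy_generate, toy_execute, toy_refine)
    print("rounds needed with feedback:", len(history))
    print("final skill notes:", skill.notes)
```

In this toy setup, cold generation corresponds to calling `toy_execute(toy_generate())` once and stopping, which is the condition the benchmark measures. The loop only succeeds because each failed attempt feeds something back into the skill.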