That is true. One can also document the regexes and rules well, with examples, to help visualize what each one matches.
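For instance, one way to document a regex inline is Python's `re.VERBOSE` flag, which lets each piece of the pattern carry its own comment and example. The pattern and page shape below are purely illustrative, not taken from any particular site:

```python
import re

# Hypothetical pattern for pulling a price out of a product page,
# written with re.VERBOSE so each fragment is commented with an
# example of what it matches.
PRICE_RE = re.compile(r"""
    <span[^>]*class="[^"]*price[^"]*"[^>]*>    # e.g. <span class="product-price">
    \s*
    (?P<currency>[$€£])                        # e.g. "$"
    (?P<amount>\d{1,3}(?:,\d{3})*\.\d{2})      # e.g. "1,299.00"
    \s*
    </span>
""", re.VERBOSE)

html = '<span class="product-price"> $1,299.00 </span>'
m = PRICE_RE.search(html)
print(m.group("currency"), m.group("amount"))
```

When a site breaks, the comments tell the next maintainer which fragment was supposed to match what, which makes the "relearning" step much cheaper.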
I think development time will be the real win for LLMs, since building the right set of regexes takes a long time.
I’m not sure which is faster to iterate on when sites change. With regexes, a human has to relearn the pattern or patterns that broke, and then work out how the fix interacts with the rules for other sites. The LLM might need to be retrained, might just need a few new examples, or might generalize from its previous training. Experiments on this would be interesting.
Well, even building and commenting the regex is something that LLMs can do pretty well these days. I actually did exactly that in a different domain: I wrote a prompt template that included the current (Python-wrapped) regex script, some autogenerated test-case results, and a request for a new version of the script, then passed that to Sonnet 3.5 in an unattended loop until all the tests passed. It actually worked.
The secret sauce was knowing what sort of program architecture suits that process, and what else should go in the code to help the LLM get it right.
Which is all to say: use the LLM directly to parse the HTML, or use an LLM to write the regex that parses the HTML. Both work, but the latter is more efficient.
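The unattended loop is roughly this shape. A minimal sketch, with the model call stubbed out: `ask_llm`, `extract`, and the test cases are all hypothetical names for illustration, and in the real setup `ask_llm` would send the prompt to the model's API and return a full replacement script:

```python
def run_tests(script_src: str, cases: list[tuple[str, str]]) -> list[str]:
    """Exec the candidate script and report which cases fail.

    The script is expected to define extract(html) -> str.
    """
    namespace: dict = {}
    exec(script_src, namespace)
    failures = []
    for html, expected in cases:
        try:
            got = namespace["extract"](html)
        except Exception as exc:
            got = f"<error: {exc}>"
        if got != expected:
            failures.append(f"{html!r}: expected {expected!r}, got {got!r}")
    return failures

def ask_llm(prompt: str) -> str:
    """Stub standing in for the model call; returns a corrected script."""
    return ("import re\n"
            "def extract(html):\n"
            "    m = re.search(r'<b>(.*?)</b>', html)\n"
            "    return m.group(1) if m else ''\n")

def repair_loop(script_src: str, cases, max_rounds: int = 5) -> str:
    """Feed the current script plus failing cases back to the model
    until every test passes or we give up."""
    for _ in range(max_rounds):
        failures = run_tests(script_src, cases)
        if not failures:
            return script_src
        prompt = ("Here is the current script:\n" + script_src +
                  "\nThese test cases fail:\n" + "\n".join(failures) +
                  "\nReply with a full corrected script.")
        script_src = ask_llm(prompt)
    raise RuntimeError("tests still failing after max_rounds")

cases = [("<b>hello</b>", "hello"), ("no match here", "")]
broken = "def extract(html):\n    return html\n"
fixed = repair_loop(broken, cases)
print(run_tests(fixed, cases))  # prints []
```

The architectural choices that matter here are the ones the comment above alludes to: the script is a single self-contained file the model can rewrite wholesale, and the test harness turns failures into concrete expected-vs-got lines the model can reason about.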