That is true. One can also document the regexes and rules well, with examples, to help visualize what each one matches.
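For instance, one way to document a regex inline is Python's `re.VERBOSE` flag, which lets each piece of the pattern carry its own comment and example. The pattern and page shape below are purely illustrative, not taken from any particular site:

```python
import re

# Hypothetical pattern for pulling a price out of a product page,
# written with re.VERBOSE so each fragment is commented with an
# example of what it matches.
PRICE_RE = re.compile(r"""
    <span[^>]*class="[^"]*price[^"]*"[^>]*>    # e.g. <span class="product-price">
    \s*
    (?P<currency>[$€£])                        # e.g. "$"
    (?P<amount>\d{1,3}(?:,\d{3})*\.\d{2})      # e.g. "1,299.00"
    \s*
    </span>
""", re.VERBOSE)

html = '<span class="product-price"> $1,299.00 </span>'
m = PRICE_RE.search(html)
print(m.group("currency"), m.group("amount"))
```

When a site breaks, the comments tell the next maintainer which fragment was supposed to match what, which makes the "relearning" step much cheaper.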
I think development time will be the real win for LLMs, since building the right set of regexes takes a long time.
I’m not sure which is faster to iterate on when sites change. With regexes, a human has to relearn the pattern or patterns that broke, and then work out how the fix interacts with the rules for other sites. The LLM might need to be retrained, might just need a few new examples, or might generalize from its previous training. Experiments on this would be interesting.
Well, even building and commenting the regex is something that LLMs can do pretty well these days. I actually did exactly that in a different domain: I wrote a prompt template that included the current (Python-wrapped) regex script, some autogenerated test-case results, and a request for a new version of the script, then passed that to Sonnet 3.5 in an unattended loop until all the tests passed. It actually worked.
The secret sauce was knowing what sort of program architecture suits that process, and what else should go in the code to help the LLM get it right.
Which is all to say: use the LLM directly to parse the HTML, or use an LLM to write the regex that parses the HTML. Both work, but the latter is more efficient.
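The unattended loop is roughly this shape. A minimal sketch, with the model call stubbed out: `ask_llm`, `extract`, and the test cases are all hypothetical names for illustration, and in the real setup `ask_llm` would send the prompt to the model's API and return a full replacement script:

```python
def run_tests(script_src: str, cases: list[tuple[str, str]]) -> list[str]:
    """Exec the candidate script and report which cases fail.

    The script is expected to define extract(html) -> str.
    """
    namespace: dict = {}
    exec(script_src, namespace)
    failures = []
    for html, expected in cases:
        try:
            got = namespace["extract"](html)
        except Exception as exc:
            got = f"<error: {exc}>"
        if got != expected:
            failures.append(f"{html!r}: expected {expected!r}, got {got!r}")
    return failures

def ask_llm(prompt: str) -> str:
    """Stub standing in for the model call; returns a corrected script."""
    return ("import re\n"
            "def extract(html):\n"
            "    m = re.search(r'<b>(.*?)</b>', html)\n"
            "    return m.group(1) if m else ''\n")

def repair_loop(script_src: str, cases, max_rounds: int = 5) -> str:
    """Feed the current script plus failing cases back to the model
    until every test passes or we give up."""
    for _ in range(max_rounds):
        failures = run_tests(script_src, cases)
        if not failures:
            return script_src
        prompt = ("Here is the current script:\n" + script_src +
                  "\nThese test cases fail:\n" + "\n".join(failures) +
                  "\nReply with a full corrected script.")
        script_src = ask_llm(prompt)
    raise RuntimeError("tests still failing after max_rounds")

cases = [("<b>hello</b>", "hello"), ("no match here", "")]
broken = "def extract(html):\n    return html\n"
fixed = repair_loop(broken, cases)
print(run_tests(fixed, cases))  # prints []
```

The architectural choices that matter here are the ones the comment above alludes to: the script is a single self-contained file the model can rewrite wholesale, and the test harness turns failures into concrete expected-vs-got lines the model can reason about.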