Tag soup hasn’t been a problem for years. The HTML 5 specification goes into a lot more detail than previous specifications when it comes to parsing malformed markup and browsers follow it. So no matter the quality of the markup, if you throw it at any HTML 5 implementation, you will get the same consistent, unambiguous DOM structure.
yeah, you could just pull the parser out of any open source browser and voila a parser not only battle-tested, but probably the one the page was developed against
That's why the best strategy is to feed the whole page into LLM. (After removing html tags) and just ask LLM to give you the date you need in the format you need.
If there is lots of javascript dom manipulation happening after pageload. Then just render in webdriver and screenshot, ocr and feed the result into LLM and ask it the right questions.
A nicely formatted subset of html is very different from a dom tag soup that is more or less the default nowadays.