There’s html and then there’s… html. A nicely formatted subset of html is very d...

JimDabell · on Sept 12, 2024

Tag soup hasn’t been a problem for years. The HTML 5 specification goes into a lot more detail than previous specifications when it comes to parsing malformed markup and browsers follow it. So no matter the quality of the markup, if you throw it at any HTML 5 implementation, you will get the same consistent, unambiguous DOM structure.

mithametacs · on Sept 12, 2024

Yeah, you could just pull the parser out of any open-source browser and voilà: a parser that's not only battle-tested, but probably the one the page was developed against.

faangguyindia · on Sept 12, 2024

That's why the best strategy is to feed the whole page into an LLM (after removing the HTML tags) and just ask the LLM to give you the data you need in the format you need.

If there is a lot of JavaScript DOM manipulation happening after page load, just render the page in WebDriver, take a screenshot, OCR it, feed the result into the LLM, and ask it the right questions.

mithametacs · on Sept 12, 2024

My intuition is that you’d get better results emptying the tags or replacing them with some other delimiter.

Keep the structural hint, remove the noise.
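
A rough sketch of what that could look like, using only Python's standard library. The names `TagFlattener` and `flatten_html` and the " | " delimiter are made up for illustration, and `html.parser` is a lenient parser rather than a full HTML5 one, so a browser-grade parser may build a different tree on badly broken markup.

```python
from html.parser import HTMLParser


class TagFlattener(HTMLParser):
    """Collect text segments, starting a new segment wherever a tag appears."""

    SKIPPED = {"script", "style"}  # assumption: this content is noise for an LLM

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.segments = []    # finished text segments
        self._current = []    # text seen since the last tag
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def flush(self):
        # Close the running segment; runs of adjacent tags add nothing.
        if self._current:
            self.segments.append(" ".join(self._current))
            self._current = []

    def handle_starttag(self, tag, attrs):
        self.flush()
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        self.flush()
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = " ".join(data.split())  # normalize whitespace
        if text and not self._skip_depth:
            self._current.append(text)


def flatten_html(markup, delimiter=" | "):
    """Return the page text with every run of tags replaced by one delimiter."""
    parser = TagFlattener()
    parser.feed(markup)
    parser.close()  # forces any buffered trailing text through handle_data
    parser.flush()
    return delimiter.join(parser.segments)


page = "<div><h1>Title</h1><p>Posted on Sept 12, 2024</p><script>var x = 1;</script></div>"
print(flatten_html(page))  # Title | Posted on Sept 12, 2024
```

Inline tags such as `<b>` also produce a delimiter in this version, so a sentence with bold text inside it gets split; treating inline elements differently from block elements would be a reasonable refinement.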