
There’s HTML and then there’s… HTML.

A nicely formatted subset of HTML is very different from the DOM tag soup that is more or less the default nowadays.



Tag soup hasn’t been a problem for years. The HTML 5 specification goes into a lot more detail than previous specifications when it comes to parsing malformed markup and browsers follow it. So no matter the quality of the markup, if you throw it at any HTML 5 implementation, you will get the same consistent, unambiguous DOM structure.


Yeah, you could just pull the parser out of any open source browser and voilà: a parser that is not only battle-tested, but probably the one the page was developed against.


That's why the best strategy is to feed the whole page into an LLM (after removing the HTML tags) and just ask it to give you the data you need in the format you need.

If there is lots of JavaScript DOM manipulation happening after page load, then just render the page in WebDriver, take a screenshot, OCR it, feed the result into the LLM, and ask it the right questions.
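
A minimal sketch of the "remove the tags first" step, using only Python's standard-library `html.parser`. The names `TextExtractor`, `strip_tags`, and `build_prompt` are hypothetical, and the prompt layout is just an assumption; the actual LLM call is left out because it depends on whichever client you use.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text; drops tags plus <script>/<style> bodies."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is not inside script/style.
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def strip_tags(html: str) -> str:
    """Return the page text with all markup removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def build_prompt(html: str, question: str) -> str:
    """Hypothetical prompt layout: the question, then the stripped page text."""
    return f"{question}\n\nPage text:\n{strip_tags(html)}"
```

Note that joining on spaces also separates text that was split by inline tags, which is usually harmless for this purpose but does lose the original spacing.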


My intuition is that you’d get better results emptying the tags or replacing them with some other delimiter.

Keep the structural hint, remove the noise.
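
Something like this rough sketch, which swaps block-level tags for a delimiter instead of deleting them outright. It uses only the standard-library `html.parser`; the `BLOCK` tag set, the `" | "` delimiter, and the names `DelimiterExtractor` and `tags_to_delimiters` are all assumptions picked for illustration, and script/style content is not filtered here.

```python
from html.parser import HTMLParser


class DelimiterExtractor(HTMLParser):
    """Records text plus a boundary marker (None) at block-level tags."""

    # Assumed set of tags that count as structural boundaries.
    BLOCK = {"p", "div", "li", "tr", "td", "th", "br",
             "h1", "h2", "h3", "h4", "h5", "h6", "section", "article"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.tokens = []

    def _mark(self):
        # Avoid leading or repeated boundaries.
        if self.tokens and self.tokens[-1] is not None:
            self.tokens.append(None)

    def handle_starttag(self, tag, attrs):
        if tag in self.BLOCK:
            self._mark()

    def handle_endtag(self, tag):
        if tag in self.BLOCK:
            self._mark()

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse internal whitespace
        if text:
            self.tokens.append(text)


def tags_to_delimiters(html: str, delimiter: str = " | ") -> str:
    """Replace block-level structure with a delimiter; inline tags vanish."""
    parser = DelimiterExtractor()
    parser.feed(html)
    parser.close()
    segments, current = [], []
    for token in parser.tokens:
        if token is None:
            if current:
                segments.append(" ".join(current))
                current = []
        else:
            current.append(token)
    if current:
        segments.append(" ".join(current))
    return delimiter.join(segments)
```

For a small table this keeps cell boundaries visible to the model while dropping attributes, classes, and nesting noise.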



