Tag soup hasn’t been a problem for years. The HTML5 specification goes into far more detail than previous specifications on how to parse malformed markup, and browsers follow it. So no matter the quality of the markup, if you throw it at any conforming HTML5 implementation, you will get the same unambiguous DOM structure.
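To make the "same DOM from sloppy markup" point concrete, here is a rough sketch in Python. It is a toy, not the HTML5 parsing algorithm: it builds on the standard library's `html.parser` (which is not spec-compliant) and hard-codes only two simplified recovery rules, namely that a new `<p>` or `<li>` closes an open one directly above it, and that a stray end tag is ignored. The names `ToyTreeBuilder`, `serialize` and `parse` are made up for illustration.

```python
from html.parser import HTMLParser

# Assumption: a tiny, simplified subset of the spec's recovery rules.
# A start tag listed here closes an open element of the same name
# if that element is the current (innermost) open node.
CLOSES_SAME_TAG = {"p", "li"}
VOID = {"br", "hr", "img", "input", "link", "meta"}


class ToyTreeBuilder(HTMLParser):
    """Builds a nested (tag, children) tree while tolerating tag soup."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = ("#root", [])
        self.stack = [self.root]  # stack of currently open elements

    def handle_starttag(self, tag, attrs):
        # Rule 1: <p>one<p>two  ->  the second <p> closes the first.
        if tag in CLOSES_SAME_TAG and self.stack[-1][0] == tag:
            self.stack.pop()
        node = (tag, [])
        self.stack[-1][1].append(node)
        if tag not in VOID:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Rule 2: an end tag with no matching open element is ignored.
        if tag not in [name for name, _ in self.stack]:
            return
        # Otherwise close everything up to and including the match.
        while self.stack.pop()[0] != tag:
            pass

    def handle_data(self, data):
        if data.strip():
            self.stack[-1][1].append(data)


def serialize(node):
    """Write the tree back out as well-formed markup."""
    if isinstance(node, str):
        return node
    tag, children = node
    inner = "".join(serialize(child) for child in children)
    if tag == "#root":
        return inner
    if tag in VOID:
        return f"<{tag}>"
    return f"<{tag}>{inner}</{tag}>"


def parse(markup):
    builder = ToyTreeBuilder()
    builder.feed(markup)
    builder.close()
    return serialize(builder.root)


# Sloppy and tidy input end up as the same tree.
print(parse("<p>one<p>two</b>"))
print(parse("<p>one</p><p>two</p>"))
```

The real algorithm covers far more cases (misnested formatting elements, tables, implied `<html>`/`<body>`, and so on), so for actual work you'd want a parser that implements the spec rather than something like this.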
Yeah, you could just pull the parser out of any open source browser and voilà: a parser that is not only battle-tested, but probably the one the page was developed against.