
Caveats: I know nothing of Chomsky grammars, and I have only a passing familiarity with Cthulhu, but IMO the real crux of the issue with parsing HTML with regex (beyond all the “it’s hard”, “the spec is more complicated than you think”, “regex is impossible to read”, etc.) is that HTML is a recursive data structure, e.g. you can have a div inside a div inside a div, ad infinitum. Regex, AFAIK, doesn’t allow you to describe recursion, so you’re left with regex plus supporting code. You’ll then have an impedance mismatch between the two.
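To make the nesting problem concrete, here is a minimal sketch using only Python's standard library. The sample string and the names `NESTED`, `DepthTracker`, and `max_div_depth` are made up for illustration. It shows a non-greedy pattern pairing the outermost opening tag with the innermost closing tag, and the "supporting code" (a depth counter on top of `html.parser`) that is needed to track nesting. Note that Python's `re` has no recursion construct; some other engines, PCRE for example, do offer recursive patterns, so the "AFAIK" above applies to classic regex rather than to every engine.

```python
import re
from html.parser import HTMLParser

# Hypothetical sample: three levels of nested divs.
NESTED = "<div>a<div>b<div>c</div></div></div>"

# The non-greedy match stops at the first </div> it finds, so the outermost
# <div> gets paired with the innermost </div>.
naive = re.search(r"<div>(.*?)</div>", NESTED).group(1)


class DepthTracker(HTMLParser):
    """Supporting code: counts how deeply <div> elements are nested."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.max_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self.depth += 1
            self.max_depth = max(self.max_depth, self.depth)

    def handle_endtag(self, tag):
        if tag == "div":
            self.depth -= 1


def max_div_depth(html):
    tracker = DepthTracker()
    tracker.feed(html)
    tracker.close()
    return tracker.max_depth
```

Here `naive` comes out as `a<div>b<div>c`, which is not the content of any one div, while `max_div_depth(NESTED)` reports 3 because the parser keeps state between tags.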

URLs are not recursive structures, so I’d say the single hardest feature of HTML is not present.
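As a rough illustration of that flatness: RFC 3986 (Appendix B) gives a single regular expression that splits a URI reference into its components in one pass, with no nesting to track. The sketch below wraps that pattern in Python; the names `URI_RE` and `split_uri` are made up here, and the pattern only splits a URI, it does not validate one.

```python
import re

# Pattern from RFC 3986, Appendix B. Groups 2, 4, 5, 7 and 9 hold the
# scheme, authority, path, query and fragment respectively.
URI_RE = re.compile(r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?")


def split_uri(uri):
    # Every part of the pattern is optional, so it matches any string.
    m = URI_RE.match(uri)
    return {
        "scheme": m.group(2),
        "authority": m.group(4),
        "path": m.group(5),
        "query": m.group(7),
        "fragment": m.group(9),
    }
```

For `https://example.com/a/b?x=1#top` this yields the scheme `https`, authority `example.com`, path `/a/b`, query `x=1`, and fragment `top`. Components that are absent come back as `None`.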



The times I had to use it on HTML, I think I combined XPath with regex to close the mismatch.
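A minimal sketch of that kind of combination, not the exact setup used at the time: XPath picks the elements out of the tree, and regex only handles the flat text inside each one. It uses Python's standard `xml.etree.ElementTree`, which supports only a subset of XPath and requires well-formed markup, so real-world HTML would need an actual HTML parser instead. The sample document, the `price` class, and the names `DOC`, `PRICE`, and `extract_prices` are all made up for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample markup.
DOC = """<html><body>
<div class="outer">
  <div class="inner"><span class="price">Total: $42.50</span></div>
  <div class="inner"><span class="price">Total: $7.00</span></div>
</div>
</body></html>"""

# Regex only has to deal with flat text, where it is a good fit.
PRICE = re.compile(r"\$(\d+\.\d{2})")


def extract_prices(markup):
    root = ET.fromstring(markup)
    prices = []
    # The XPath expression handles the nesting: any span with class
    # "price", at any depth below the root.
    for span in root.findall(".//span[@class='price']"):
        match = PRICE.search(span.text or "")
        if match:
            prices.append(match.group(1))
    return prices
```

With the sample above, `extract_prices(DOC)` returns `['42.50', '7.00']`. The tree query deals with the recursive part and the regex never sees a tag.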



