Regex Flashcard on parsing HTML opening tags

(Sergey Lukin) #1

Hi,

I wanted to share my thoughts on this flashcard about using RegEx to parse HTML and see what you think. I mostly agree with the accepted answer to the referenced StackOverflow question: hardcore HTML parsing is probably not a proper task for pure RegEx. Still, I believe you could retrieve just the names of opening HTML tags with something as simple as <([a-z]+).*(?<!\/)> (which uses a negative lookbehind), see demo: https://regex101.com/r/kX5mJ9/1
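
To make this easy to check outside regex101, here is a minimal Python sketch of the same pattern. The sample strings and the helper name `opening_tag_name` are illustrative assumptions, not taken from the flashcard, and the sketch assumes one tag per input string.

```python
import re

# Same pattern as above: capture the tag name, then reject any match whose
# closing ">" is immediately preceded by "/" (i.e. a self-closing tag).
OPENING_TAG = re.compile(r"<([a-z]+).*(?<!\/)>")


def opening_tag_name(text):
    """Return the opening tag's name found in text, or None (hypothetical helper)."""
    match = OPENING_TAG.search(text)
    return match.group(1) if match else None


# Illustrative inputs, one tag per string.
samples = ["<div>", '<a href="index.html">', "<br />", "</div>"]
for sample in samples:
    print(sample, "->", opening_tag_name(sample))
```

With these inputs the first two strings yield a tag name, while the self-closing and closing tags yield None.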

I may be missing something, so please let me know if that is the case.

Thanks,
Sergey

P.S. Thank you for the amazing flashcards! I love them. I have already learned at least a few tricks from every deck, including the RegEx deck (I didn’t know about non-greedy matchers and word boundaries).

(Ben Orenstein) #2

Glad you liked the flashcards!

As for your question: sure, you can fix the specific problem from the flashcard with some changes to the regex, but the real question is whether you should.

The point we wanted to make is that parsing HTML with regex can get nasty. There’s a point where you should switch to a real parser.

Exactly where that point lies is debatable.

From another SO answer: “it’s sometimes appropriate to parse a limited, known set of HTML.” I agree with this. But when you start running into lots of edge cases, it’s time to switch to a better tool.
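
As a rough illustration of one such edge case, here is a sketch in Python that runs the lookbehind regex and a real parser over the same nested markup. The parser is the standard library's `html.parser`; the class name `OpeningTagCollector`, the helper `opening_tag_names`, and the sample markup are made up for this example, so treat it as one possible approach rather than the recommended one.

```python
import re
from html.parser import HTMLParser

# The lookbehind pattern under discussion.
OPENING_TAG = re.compile(r"<([a-z]+).*(?<!\/)>")


class OpeningTagCollector(HTMLParser):
    """Collects names of opening tags, ignoring self-closing ones (hypothetical class)."""

    def __init__(self):
        super().__init__()
        self.names = []

    def handle_starttag(self, tag, attrs):
        self.names.append(tag)

    def handle_startendtag(self, tag, attrs):
        # Tags written as <br /> are routed here; skip them so that only
        # true opening tags are collected.
        pass


def opening_tag_names(html):
    collector = OpeningTagCollector()
    collector.feed(html)
    collector.close()
    return collector.names


# Illustrative markup with several tags on one line.
html = '<div class="a"><p>Hi <b>there</b></p><br /></div>'
print(OPENING_TAG.findall(html))  # the greedy .* swallows the nested tags
print(opening_tag_names(html))
```

On this input the regex reports only the outermost tag, because the greedy `.*` runs to the last `>` on the line, while the parser reports each opening tag in turn. You could patch the regex again for this case, which is exactly the cycle that suggests moving to a parser.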