Regex Flashcard on parsing HTML opening tags

sergeylukin · August 14, 2015, 8:14pm

Hi,

I wanted to share my thoughts on this flashcard about using RegEx to parse HTML and see what you think. While I can kind of agree with the accepted answer in the referenced StackOverflow question, in that hardcore HTML parsing is probably not a proper task for pure RegEx, I still believe you could retrieve just the names of opening HTML tags with something as simple as <([a-z]+).*(?<!\/)> (which uses a negative lookbehind); see the demo on regex101.
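
A minimal sketch of that pattern in Python, checked against a few single-tag inputs. The choice of Python and the sample strings are assumptions for illustration; the original demo lives on regex101.

```python
import re

# The pattern from the post. The negative lookbehind (?<!/) rejects a
# closing ">" that is immediately preceded by "/", i.e. self-closing tags.
# In Python the slash needs no escaping inside the lookbehind.
OPENING_TAG = re.compile(r"<([a-z]+).*(?<!/)>")

def opening_tag_names(text):
    """Return the tag names captured by the pattern in `text`."""
    return OPENING_TAG.findall(text)

print(opening_tag_names('<div class="box">'))  # ['div']
print(opening_tag_names('<br/>'))              # []  (self-closing, rejected)
print(opening_tag_names('</div>'))             # []  (closing tag, no letter after "<")
```

On inputs that contain one tag each, the pattern behaves as described: it captures the name of an opening tag and skips closing and self-closing ones.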

I may be missing something so please kindly let me know if this is the case.

Thanks,
Sergey

P.S. Thank you for the amazing flashcards! I love them. I have already learned at least a few tricks from every deck, including the RegEx deck (I didn’t know about non-greedy matchers and word boundaries).

r00k · August 17, 2015, 4:21pm

Glad you liked the flashcards!

As for your question: sure, you can fix the specific problem from the flashcard with some changes to the regex, but the real question is whether you should.

The point we wanted to make is that parsing HTML with regex can get nasty. There’s a point where you should switch to a real parser.

Exactly where that point lies is debatable.

From another SO answer: “it’s sometimes appropriate to parse a limited, known set of HTML.” I agree with this. But when you start running into lots of edge cases, it’s time to switch to a better tool.
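
To make the trade-off concrete, here is a hedged sketch comparing the regex from the question with Python's standard-library `html.parser`. The sample markup and the names `OpeningTagCollector` and `opening_tags` are made up for illustration; this is one possible way to collect opening tags, not the approach the flashcard prescribes.

```python
import re
from html.parser import HTMLParser

# The regex from the question, unchanged apart from Python syntax.
OPENING_TAG = re.compile(r"<([a-z]+).*(?<!/)>")

class OpeningTagCollector(HTMLParser):
    """Hypothetical helper: records the names of opening tags only."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> arrive here; ignore them.
        pass

def opening_tags(html):
    collector = OpeningTagCollector()
    collector.feed(html)
    collector.close()
    return collector.tags

sample = "<p>Hello <b>world</b><br/></p>"

# The greedy .* swallows everything up to the last ">", so only the
# first tag on the line is reported.
print(OPENING_TAG.findall(sample))  # ['p']

# The parser walks the markup tag by tag.
print(opening_tags(sample))         # ['p', 'b']
```

Several tags on one line is just one edge case. The regex could be patched for it (for example with a non-greedy quantifier), but each patch tends to expose another case, which is the point at which a parser becomes the simpler tool.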
