Now, if you are someone who has heard of regular expressions and HTML/XML, you have also probably heard that regular expressions cannot parse HTML as HTML is not a regular language. If you've dug a little deeper, or read any of the comments surrounding this, you may also have heard that the term "regular expression" has come to mean something more than the official computer science term. So, can these super-regex match HTML?
When I first asked this question -- PCRE supports recursive regular expressions. Does that mean it _can_ parse HTML? -- I got several "answers" about HTML not being regular, and that I shouldn't even be thinking about parsing HTML with regex as I should be using a real parser. I was linked to this website , which gives a proof (over my head, at the moment) that XML is not regular, and therefore cannot be matched by a regular expression. None of those "answers" deterred me in my pursuit of knowledge, however.
The article mentions that regular expressions cannot guarantee palindrome-edness, which I knew for a fact PCRE could (at least to a limited extent). Take this for example:
- Concatenation, so { "a" }{ "b" } becomes { "ab" }
- Alternation, so { "a" } | { "b" } becomes { "a", "b" }
- Kleene Star , which is the same star used in nearly every other regex. It means the preceding thing 0 or more times.
Parenthesis are used for grouping (to give different precedence to the operators). The ? and + included in many flavours of regex are simplified cases of the three operations listed above. The same is also true with the squiggly braces { and }.
At this point it is really late, so I'm now just going to give you the progression my regex has taken, and hopefully come back later to fill in the gaps. Please, if you see this still here in the future, send me an email. The contact link is at the bottom/top.
The first thing I got was something that could recursively match tags.
Then I added tags that didn't have a separate closing one, like <img />
Then we got to comments. I was inexperienced with (read never before used) negative/positive lookahead/behind, so it took a bit of jiggering to get to this:
The very next thing was the annoying special cases for <!DOCTYPE> and <?xml?> at the very start. That yielded:
The next thing I did (after this was originally posted, but not long after ;)) was to fix the script issue ajanata came up with by requiring an exact end to the start tag, and to nail down the allowed tag names.
Everything up until this point had just seemed like special cases. Then ajanata pointed me to the spec, and I saw the link for tables. I still haven't confirmed it, but something that would definitely discount PCRE for HTML matching would be if the td count inside of each tr of the same table need to match (allowing for the possibility of colspan). At that point I switched pages to the XML spec ;)
Addendum: I still haven't come back to write the rest of this post. It's now the 11th, and I haven't done a whole lot more with the XML matching regex. I did, however, improve the palindrome matching one so that it can match arbitrarily long palindromes (up to recursion depth * 2 + 1 long):
Another addendum: I was just asked how the palindrome matching regex works, so I'll put a copy of my reply here.
Let's handle the first group first. Here's a simpler version that only matches 2 and 3 character palindromes:
You can match "squares" -- words that are the same letters twice -- using this, by doing something like:
So, the the we have so far:
Now here is where it get's fun :) Just as you can reference what you have already captured, you can also reference previous patterns. So
There are a few modifiers so that all strings match, and so the palindrome
must be the whole line. In practice, I modify the string beforehand to
strip out punctuation, space and case (though the ignore case modifier may
have been better). This is actually working on a bot of mine living in
#uakroncs on slashnet.
Basically, either the first group matches (which is the definition of a
palindrome), or the final group matches (the "(.*)" bit). In the
replacement string I use conditional replacement. It's syntax is