Note: This is a post transferred from Laurii for historical and consolidation purposes.
A problem I run into quite often is removing all HTML tags from a document. This is easy for XML (well-formed, etc.), where you could do it by hand with a regexp, but HTML is messier. There are several ways around this…
This is a bit of overkill, but you could use a browser’s renderer to “display” the string content and then read the text back, as you would from a text editor/widget. This is by far the most reliable variant for badly formatted HTML, simply because HTML is designed (mostly) for display. Unfortunately, this option is usually infeasible for batch processing because of resource constraints…
As I’ve mentioned, you can do it by hand, looking for the < and > characters that delimit tags in the document, and it works for most cases. This solution is what I call blunt force. An example that strips img and br tags:
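    import re
    # force_unicode comes from Django, like the strip_tags() this is modelled on
    from django.utils.encoding import force_unicode

    def strip_img_tags(value):
        """Returns the given HTML with IMG, BR tags stripped."""
        return re.sub(r'<(img|br)[^>]*?>', '', force_unicode(value))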
The above code looks for <img ... > (or <br ... >) in a non-greedy manner; since [^>] can never match a >, each match ends at the first >, so we don’t gobble up everything until the last > :)

If you’re using django, things get slightly simpler; in django.utils.html, you find strip_tags(value) with the following implementation (at the time of writing):
    def strip_tags(value):
        """Returns the given HTML with all tags stripped."""
        return re.sub(r'<[^>]*?>', '', force_unicode(value))
    strip_tags = allow_lazy(strip_tags)
As you can see, this was the inspiration for the above strip_img_tags().

Another option is to use a third-party tool. After looking at various options (from lxml to minidom), I stumbled upon an interesting package named BeautifulSoup. From their website:
Beautiful Soup is an HTML/XML parser for Python that can turn even invalid markup into a parse tree. It provides simple, idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.
Although it’s more resource intensive than the regexp variant above, the nice heuristics make it more robust. Removing all tags and extracting the text from the HTML document is as simple as:
    from BeautifulSoup import BeautifulSoup, NavigableString

    def strip_html(src):
        p = BeautifulSoup(src)
        text = p.findAll(text=lambda text: isinstance(text, NavigableString))
        return u" ".join(text)
In other words, we let BeautifulSoup parse the source src, we look for all NavigableString (aka text) nodes, and join them. Easy.
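To make the robustness difference a bit more concrete, here is a minimal sketch that puts the regexp approach next to a real parser. It assumes Python 3 and, to stay dependency-free, swaps BeautifulSoup for the standard library's html.parser module, so it is not the code from above. The names strip_tags_regex, TextExtractor and strip_html_stdlib are made up for this illustration.

```python
import re
from html.parser import HTMLParser

# Same idea as the regexp variants above: anything between < and the next >.
TAG_RE = re.compile(r"<[^>]*>")


def strip_tags_regex(value):
    """Regexp variant: delete everything that looks like <...>."""
    return TAG_RE.sub("", value)


class TextExtractor(HTMLParser):
    """Hypothetical helper: collects text nodes and ignores the tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Called for every run of text between tags; skip pure whitespace.
        if data.strip():
            self.chunks.append(data.strip())


def strip_html_stdlib(src):
    """Parser variant: feed the document and join the text nodes."""
    parser = TextExtractor()
    parser.feed(src)
    parser.close()
    return " ".join(parser.chunks)


# A quoted attribute value containing '>' is enough to confuse the regexp.
tricky = '<a title="a > b">link</a>'
print(repr(strip_tags_regex(tricky)))
print(repr(strip_html_stdlib(tricky)))
```

On the tricky input the regexp variant leaves ' b">link' behind, because it ends the tag at the first >, while the parser variant returns just 'link'. One caveat of this sketch: the contents of script and style elements still come through as text, so you may want to filter those separately.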
Of course, if you need to isolate paragraphs or DIVs or something else, the parsing gets more complex, but the amount of code you need to write is still less than with handmade Python, and the result is more robust (can you imagine all the tricks you’d need for mismatched tags?).
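As a rough idea of what isolating paragraphs can look like, here is a small sketch that collects the text of each P element separately. Again it assumes Python 3 and uses the standard library's html.parser rather than BeautifulSoup; ParagraphExtractor and extract_paragraphs are hypothetical names, and the handling of unclosed tags is deliberately simplistic.

```python
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collects the text of each <p> element as a separate string."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.paragraphs = []
        self._current = None  # text chunks of the <p> we are inside, if any

    def _flush(self):
        # Store the paragraph collected so far, with whitespace normalised.
        if self._current is not None:
            text = " ".join("".join(self._current).split())
            if text:
                self.paragraphs.append(text)
            self._current = None

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._flush()        # a new <p> implicitly closes an unclosed one
            self._current = []

    def handle_endtag(self, tag):
        if tag == "p":
            self._flush()

    def handle_data(self, data):
        # Text outside any <p> is ignored.
        if self._current is not None:
            self._current.append(data)


def extract_paragraphs(src):
    parser = ParagraphExtractor()
    parser.feed(src)
    parser.close()
    parser._flush()              # keep a trailing paragraph that was never closed
    return parser.paragraphs


print(extract_paragraphs("<div><p>First <b>one</b><p>Second</p>tail</div>"))
```

For the sample input, where the first paragraph is never closed, this gives ['First one', 'Second'] and drops the stray text outside the paragraphs. Even this small case needs explicit state handling, which is exactly the kind of work a library like BeautifulSoup takes off your hands.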