I was having a hard time finding an HTML parser for my latest C++ project, so I decided to write up a quick summary of what I ended up using.
My #1 requirement for a parser was that it had to provide some mechanism of searching for elements. There are a couple of parsers available that only provide SAX-style parsing, which is very inconvenient for all but the simplest of parsing tasks. An ideal API would provide searching using XPath expressions, or something similar.
The only decent sources of information I found were these three questions from Stack Overflow: “Library Recommendation: C++ HTML Parser”, “Parse html using C”, and “XML Parser for C”. Below is a summary of what I considered, along with my take on each:
- QWebElement – Part of the Qt framework. It provides a rich API, but I couldn’t figure out how to compile any Qt code outside of Qt Creator (I’m using Code::Blocks).
- htmlcxx – Standalone, tiny library. I got some code up and running with this library very fast. However, I quickly realized how limited it is (e.g. poor attribute accessors and no way to search for elements). The documentation is limited as well.
- Tidy – The classic HTML cleaner/repairer also exposes the parsed document through a basic node-by-node traversal API. Simple to use, but like htmlcxx, limited in what it can do.
- Tidy + libxml++ – Tidy can transform HTML into XML, so all that’s needed is a good XML parser. This was the solution I ended up using.
My final solution was to use Tidy to clean up the markup and convert it into XML, then use libxml++ (a C++ wrapper for libxml2) to traverse the DOM. libxml++ supports searching for elements with XPath, so I was happy.
Here’s some sample code demonstrating Tidy and libxml++.
Step 1: Using Tidy to clean HTML and convert it to XML:
#include <stdexcept>
#include <string>
#include <tidy/tidy.h>
#include <tidy/buffio.h> // newer Tidy releases may name this header tidybuffio.h
std::string CleanHTML(const std::string &html){
    // Initialize a Tidy document and an empty output buffer
    TidyDoc tidyDoc = tidyCreate();
    TidyBuffer tidyOutputBuffer = {0};
    // Configure Tidy
    // The flags tell Tidy to output XML, use numeric entities (XML parsers
    // do not know HTML's named entities), and disable showing warnings
    bool configSuccess = tidyOptSetBool(tidyDoc, TidyXmlOut, yes)
        && tidyOptSetBool(tidyDoc, TidyQuiet, yes)
        && tidyOptSetBool(tidyDoc, TidyNumEntities, yes)
        && tidyOptSetBool(tidyDoc, TidyShowWarnings, no);
    int tidyResponseCode = -1;
    // Parse input
    if (configSuccess)
        tidyResponseCode = tidyParseString(tidyDoc, html.c_str());
    // Process HTML
    if (tidyResponseCode >= 0)
        tidyResponseCode = tidyCleanAndRepair(tidyDoc);
    // Output the resulting XML to our buffer
    if (tidyResponseCode >= 0)
        tidyResponseCode = tidySaveBuffer(tidyDoc, &tidyOutputBuffer);
    // Grab the result from the buffer (it stays empty if nothing was written)
    std::string tidyResult;
    if (tidyResponseCode >= 0 && tidyOutputBuffer.bp)
        tidyResult.assign(reinterpret_cast<const char*>(tidyOutputBuffer.bp),
                          tidyOutputBuffer.size);
    // Free Tidy's memory before returning or throwing, so nothing leaks
    if (tidyOutputBuffer.bp)
        tidyBufFree(&tidyOutputBuffer);
    tidyRelease(tidyDoc);
    // Any errors from Tidy? Note that adding an int directly to a string
    // literal would offset the pointer, so the code is converted explicitly.
    if (tidyResponseCode < 0)
        throw std::runtime_error("Tidy encountered an error while parsing an HTML response. Tidy response code: "
                                 + std::to_string(tidyResponseCode));
    return tidyResult;
}
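A note on building the error message: in C++, adding an int to a string literal moves the pointer instead of appending the number, so the message is safer to build with std::to_string (C++11) and throw as a std::runtime_error. Here is a minimal standalone sketch of that idea. The helper names TidyErrorMessage and ThrowOnTidyError are my own, not part of Tidy, and I assume, as CleanHTML does, that a negative return code means failure.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical helper (not part of Tidy): formats the error text.
// std::to_string converts the code to text; `"literal" + code` would not.
std::string TidyErrorMessage(int tidyResponseCode) {
    return "Tidy encountered an error while parsing an HTML response. "
           "Tidy response code: " + std::to_string(tidyResponseCode);
}

// Hypothetical helper: throws only when the code is negative, which is
// the same failure condition CleanHTML checks for.
void ThrowOnTidyError(int tidyResponseCode) {
    if (tidyResponseCode < 0)
        throw std::runtime_error(TidyErrorMessage(tidyResponseCode));
}
```

Callers can then wrap CleanHTML in a try block, catch const std::runtime_error&, and print e.what() to see the response code.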
Step 2: Parse the XML with libxml++:
The following code parses the HTML contained in ‘response’ (passing it through CleanHTML first). Then it searches for the element with id ‘some_id’. After printing how many elements match that criterion (should be 1), it prints the line in the XML at which the first match occurs. For the sake of saving space I keep error checking to a minimum.
#include <iostream>
#include <libxml++/libxml++.h>
xmlpp::DomParser parser;
// 'response' contains your HTML
parser.parse_memory(CleanHTML(response));
xmlpp::Document* document = parser.get_document();
xmlpp::Element* root = document->get_root_node();
// Find every element, at any depth, whose id attribute is 'some_id'
xmlpp::NodeSet elems = root->find("descendant-or-self::*[@id = 'some_id']");
std::cout << elems.size() << std::endl;
// Indexing into an empty result is undefined, so check first
if (!elems.empty())
    std::cout << elems[0]->get_line() << std::endl;
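If the id is not hard-coded, the XPath expression has to be assembled at runtime. Here is a small sketch of how that might look. XPathForId is a hypothetical helper of my own, not part of libxml++. It relies on the fact that XPath 1.0 string literals have no escape sequences, so it picks whichever quote character the id does not contain and gives up if the id contains both.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical helper (not part of libxml++): builds the same
// descendant-or-self expression used above for an arbitrary id.
std::string XPathForId(const std::string &id) {
    const bool hasSingle = id.find('\'') != std::string::npos;
    const bool hasDouble = id.find('"') != std::string::npos;
    // XPath 1.0 cannot escape quotes inside a literal; handling an id
    // with both kinds would need concat(), which this sketch skips.
    if (hasSingle && hasDouble)
        throw std::invalid_argument("id contains both quote characters");
    const char quote = hasSingle ? '"' : '\'';
    return "descendant-or-self::*[@id = " + (quote + id + quote) + "]";
}
```

The returned string can be passed to find() in place of the literal expression, e.g. root->find(XPathForId("some_id")).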
More info
To compile the example code, I pass these flags to g++: `pkg-config --cflags glibmm-2.4 libxml++-2.6 --libs` -ltidy. As the flags suggest, you’ll need the glibmm library in addition to Tidy and libxml++ (and their dependencies).
See the libxml++ class references: http://developer.gnome.org/libxml++/stable/annotated.html