Nov 29, 2014
 

About a year ago I published an article entitled Parsing HTML with C++. It is by far my most popular article (the second most popular being this one), and it is a top result on Google for queries such as “html c++ parsing”. Nevertheless, there is always room for improvement. Today, I revisit the topic with a simpler way to parse, as well as a self-contained, ready-to-go example (which many people have been asking me for).
 

Old solution

Before today, my prescription for HTML parsing in C++ was a combination of cURL, Tidy, and libxml2, along with their associated wrappers.

cURL, of course, is needed to perform HTTP requests so that we have something to parse. Tidy was used to transform the HTML into XML that was then consumed by libxml2. libxml2 provided a nice DOM tree that is traversable with XPath expressions.
 

Shortcomings

This kludge presents a number of problems, the primary one being the lack of HTML5 support. Tidy doesn’t support HTML5 tags, so it chokes when it encounters one. There is a version of Tidy in development that is supposed to support HTML5, but it is still experimental.

But the real sore point is the requirement to convert the HTML into XML before feeding it to libxml2. If only there were a way for libxml2 to consume HTML directly… Oh, wait.

At the time, I hadn’t realized that libxml2 actually had a built-in HTML parser. I even found a message on the mailing list from 2004 giving a sample class that encapsulates the HTML parser. Seeing as the last message posted was also in 2004, I suppose that there isn’t much interest.
 

New solution

With knowledge of the native HTML parser in hand, we can modify the old solution to completely remove libtidy from the mix. libxml2 by default isn’t happy with HTML5 tags either, but we can fix this by silencing errors (HTML_PARSE_NOERROR) and relaxing the parser (HTML_PARSE_RECOVER).

The new solution, then, requires solely cURL, libxml2, and their associated wrappers.

Below is a self-contained example that visits iplocation.net to acquire the external IP address of the current computer:

#include <libxml/tree.h>
#include <libxml/HTMLparser.h>
#include <libxml++/libxml++.h>

#include <curlpp/cURLpp.hpp>
#include <curlpp/Easy.hpp>
#include <curlpp/Options.hpp>

#include <iostream>
#include <list>
#include <sstream>
#include <string>

#define HEADER_ACCEPT "Accept:text/html,application/xhtml+xml,application/xml"
#define HEADER_USER_AGENT "User-Agent:Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.70 Safari/537.17"

int main() {
    std::string url = "http://www.iplocation.net/";
    curlpp::Easy request;

    // Specify the URL
    request.setOpt(curlpp::options::Url(url));

    // Specify some headers and follow redirects
    std::list<std::string> headers;
    headers.push_back(HEADER_ACCEPT);
    headers.push_back(HEADER_USER_AGENT);
    request.setOpt(new curlpp::options::HttpHeader(headers));
    request.setOpt(new curlpp::options::FollowLocation(true));

    // Configure curlpp to write the response body into a stream
    std::ostringstream responseStream;
    curlpp::options::WriteStream streamWriter(&responseStream);
    request.setOpt(streamWriter);

    // Collect response
    request.perform();
    std::string re = responseStream.str();

    // Parse HTML and create a DOM tree
    xmlDoc* doc = htmlReadDoc(reinterpret_cast<const xmlChar*>(re.c_str()), NULL, NULL, HTML_PARSE_RECOVER | HTML_PARSE_NOERROR | HTML_PARSE_NOWARNING);
    if (doc == NULL) {
        std::cerr << "Could not parse the response as HTML" << std::endl;
        return 1;
    }

    // Encapsulate raw libxml document in a libxml++ wrapper
    xmlNode* r = xmlDocGetRootElement(doc);
    if (r == NULL) {
        std::cerr << "The parsed document has no root element" << std::endl;
        xmlFreeDoc(doc);
        return 1;
    }
    xmlpp::Element* root = new xmlpp::Element(r);

    // Grab the IP address (the XPath may match nothing if the page layout changes)
    std::string xpath = "//*[@id=\"locator\"]/p[1]/b/font/text()";
    auto elements = root->find(xpath);
    xmlpp::ContentNode* ip = elements.empty() ? NULL : dynamic_cast<xmlpp::ContentNode*>(elements[0]);

    int status = 0;
    if (ip != NULL) {
        std::cout << "Your IP address is:" << std::endl;
        std::cout << ip->get_content() << std::endl;
    } else {
        std::cerr << "Could not find the IP address in the page" << std::endl;
        status = 1;
    }

    delete root;
    xmlFreeDoc(doc);

    return status;
}

Install prerequisites and compile like this (Linux):

sudo apt-get install libcurlpp-dev libxml++2.6-dev
g++ main.cpp -lcurlpp -lcurl -g -pg `xml2-config --cflags --libs` `pkg-config libxml++-2.6 --cflags --libs` --std=c++0x
./a.out

 

Future work

In the near future, I will be releasing my own little wrapper class for cURL which simplifies a couple of workflows involving cookies and headers. It will make it easy to perform some types of requests with very few lines of code.

Something I need to investigate a little further is a small memory leak that occurs when I grab the content: dynamic_cast<xmlpp::ContentNode*>(elements[0])->get_content(). On my computer, it seems to range between 16 and 64 bytes lost. It may be a problem with libxml++ or just a false alarm by Valgrind.

Finally, I may consider following up on that mailing list post to see if we can get the HTML parser included in libxml++.

Nov 09, 2014
 

With the holidays fast approaching, I thought it might be fun to share my list of on-the-go tech essentials for any technology enthusiast.

Leatherman Sidekick Multi Tool

What’s a “tech essentials” list without at least one multitool? I like this Leatherman Sidekick because of its great tool selection and reasonable price. The pliers are solidly built, and the locking blades are a welcome safety feature.

Fenix PD35 850 Lumen Flashlight

I discovered Fenix flashlights a few years ago, and have been hooked ever since. The Fenix PD32 was my first foray into the PD line, and is still the light I carry most often. Its successor, the Fenix PD35, is pictured below. With a removable clip, six output modes (including strobe), and full one-handed operation, this is a great choice for EDC (everyday carry). If it’s built half as well as my PD32, it will last you for many, many years.

Fenix PD12 360 Lumen Flashlight

Why another flashlight? This Fenix PD12 is small enough to fit on a keychain so there is little chance of you forgetting to take it with you. It runs on a single CR123A battery, which is the same type that the PD35 uses.



Swiss+Tech Utili-Key 6-in-1

This is one of those subtle little tools that you don’t even notice until you need it. Great for when you accidentally leave your full-sized multitool at home.



Pluggable USB 3.0 Memory Card Reader

Unlike most other memory card readers, this one has a built-in cable, which makes it great for travel. And, in addition to the usual SD, microSD, and MMC families, it supports the Sony Memory Stick (MS, MS Pro Duo, etc.) card types.


 

Micro USB to USB On-the-go adapter cable

This little USB OTG adapter cable is great for letting you access USB storage from your phone. When paired with the memory card reader above, you’ll be able to upload photos from your camera without a computer!

Anker 3.6A Dual USB Wall Charger

This one kind of speaks for itself – when you’re on the go, a quality USB charger is a must. The Anker 3.6A Dual USB Wall Charger has two ports – one designed for Android and the other for Apple devices. I like this one because of its slim design and lack of annoying LEDs.

Panasonic In-Ear Headphones

These Panasonic In-Ear Headphones have no business being as good as they are, especially considering that they’re under $10. They sit comfortably in the ear and produce surprisingly good sound.

Anker 10000mAh Portable USB Charger

For long car rides, a quality portable USB charger is a necessity. This Anker 10000mAh Portable USB Charger has two ports, and holds enough juice to charge an iPhone 4+ times or a Galaxy S4 2+ times.

Targus XL Backpack

With all of these essentials in hand, you’ll need a way to store and transport them. For the last 2.5 years, I have been carrying the Targus XL backpack.

Let’s get something out of the way: this backpack is huge. It’s designed to hold a 17″ laptop and it does so with ease. Even with a laptop, you’ll have enough room to hold multiple textbooks and most of the items mentioned on this page. It has an incredible number of pockets and zippered compartments for storing anything you could imagine. It is also easily the best-constructed backpack I have used.

Product images owned by Amazon

Jul 13, 2014
 

Since Linux 3.13, Radeon power management is enabled by default. This is great if you have a supported card, but if you don’t, you may encounter issues such as overheating and overeager cooling fans. If you fall into the latter category, you can use these instructions to disable the new power management features.

Disable Radeon power management

  1. Add the parameter radeon.runpm=0 to the Linux kernel boot parameters by editing /etc/default/grub. This is accomplished by appending the parameter to the value of GRUB_CMDLINE_LINUX_DEFAULT. After doing so, that line in my grub file looks like this:
    GRUB_CMDLINE_LINUX_DEFAULT="quiet splash radeon.runpm=0"
  2. Run sudo update-grub
  3. Reboot your computer.
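
After rebooting, it can be worth confirming that the parameter actually reached the kernel. The sketch below is one way to do that in Python. It assumes a Linux system where /proc/cmdline exposes the boot command line; the helper names has_kernel_param and running_kernel_has are made up for this illustration.

```python
def has_kernel_param(cmdline, param):
    """Return True if `param` appears as a whole token on a kernel command line."""
    # Parameters are separated by whitespace, so compare whole tokens
    # instead of doing a substring search.
    return param in cmdline.split()


def running_kernel_has(param, path="/proc/cmdline"):
    """Check the command line that the currently running kernel was booted with."""
    with open(path) as f:
        return has_kernel_param(f.read(), param)
```

Calling running_kernel_has("radeon.runpm=0") after the reboot should return True; if it doesn't, re-check the GRUB_CMDLINE_LINUX_DEFAULT line and run update-grub again.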

 

Manually control the GPU

You can now manage the power of your graphics card manually, using vgaswitcheroo or a similar tool.

Personally, I completely disable my discrete graphics card under Linux by adding the following line to /etc/rc.local:

echo OFF > /sys/kernel/debug/vgaswitcheroo/switch

You can verify that the card is off by running the sensors command. When my discrete GPU is switched off, sensors reports the temperature as 511°C:

radeon-pci-0100
Adapter: PCI adapter
temp1:       +511.0°C  (crit = +120.0°C, hyst = +90.0°C)
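
If you'd rather script this check than eyeball it, something like the following Python sketch could work. It assumes the sensors output contains a single adapter block shaped like the one above, and the 500°C cutoff is an arbitrary threshold chosen for this illustration, not something documented by lm-sensors or the radeon driver.

```python
import re

# Hypothetical cutoff: no working GPU sensor should report anything near this.
IMPLAUSIBLE_TEMP_C = 500.0


def read_temp1(sensors_output):
    """Return the first temp1 reading in degrees Celsius, or None if there is none."""
    match = re.search(r"^temp1:\s*([+-]?\d+(?:\.\d+)?)", sensors_output, re.MULTILINE)
    return float(match.group(1)) if match else None


def gpu_looks_switched_off(sensors_output):
    """Heuristic: treat an absurdly high temp1 reading as 'the card is powered down'."""
    temp = read_temp1(sensors_output)
    return temp is not None and temp >= IMPLAUSIBLE_TEMP_C
```

You could feed it the text captured from the sensors command, for example via subprocess.check_output(["sensors"]).decode().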

 

More resources

These pages were useful to me while researching how to disable Radeon power management:

Feb 21, 2014
 

I was planning on writing a beginner’s tutorial for using PWM on raw AVR chips, but I found that Arduino already has a nice guide here: http://arduino.cc/en/Tutorial/SecretsOfArduinoPWM

The only change you need to make to their code to use it without the Arduino software is to replace the calls to “pinMode”. Use the appropriate DDR register twiddling instead, or macros such as these: http://www.avrfreaks.net/index.php?name=PNphpBB2&file=printview&t=66939&start=0

Jan 20, 2014
 

I have just finished migrating all of my BitBucket repositories from Mercurial to Git. All of the projects are still accessible at the same URLs they were before, but they now use Git. The reason for the change is that I have been using Git for a number of projects since the summer, and have grown to like it more than Mercurial. Sorry for the inconvenience.

Nov 03, 2013
 

Almost three years ago I released css2xpath#, a port of Andrea Giammarchi’s project by the same name. Today, I’m excited to announce a new project, css2xpath Reloaded, which will supplant my previous project. css2xpath Reloaded does the same thing that css2xpath# did (convert CSS selectors to XPath selectors), but is instead based on Ian Bicking’s excellent Python app (also somewhat confusingly named css2xpath).

Whereas css2xpath# (and Andrea’s original project) relied on regular expressions to perform the conversion, css2xpath Reloaded recreates Ian’s CSS selector parser.

Although more complicated, a benefit of using a parser is that your CSS selectors can be validated during conversion. It’s also a little bit easier to extend the code. And most importantly (for me at least) it was a lot of fun to write.
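
To give a feel for the validation point, here is a toy sketch in Python. It is not code from css2xpath Reloaded or from Ian's project; the function name compound_to_xpath is hypothetical, and it only understands a single compound selector made of an optional tag name followed by #id and .class parts. The idea is simply that a parser walking the input token by token can reject anything it doesn't recognize, instead of silently producing a broken XPath.

```python
import re

# One token: an optional '#' or '.' prefix followed by an identifier.
_TOKEN = re.compile(r"(?P<kind>[#.]?)(?P<name>[A-Za-z_][\w-]*)")


def compound_to_xpath(selector):
    """Convert e.g. 'div#main.note' to XPath, raising ValueError on bad input."""
    if not selector:
        raise ValueError("empty selector")
    tag = "*"
    conditions = []
    pos = 0
    while pos < len(selector):
        match = _TOKEN.match(selector, pos)
        if match is None:
            # Validation falls out of parsing: unknown input is reported, not ignored.
            raise ValueError("unexpected %r at position %d" % (selector[pos], pos))
        kind, name = match.group("kind"), match.group("name")
        if kind == "#":
            conditions.append("@id='%s'" % name)
        elif kind == ".":
            conditions.append(
                "contains(concat(' ', normalize-space(@class), ' '), ' %s ')" % name
            )
        else:
            tag = name
        pos = match.end()
    return "//" + tag + "".join("[%s]" % c for c in conditions)
```

With this sketch, compound_to_xpath("div#main") yields //div[@id='main'], while an input it doesn't understand, such as "div>p", raises a ValueError that points at the offending character.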

I’d consider the code to be beta quality at the moment; it’s lacking in documentation and rigorous testing. However, if you’re interested in testing it out, proceed to BitBucket.

css2xpath Reloaded on BitBucket

Oct 19, 2013
 

I just released version 1.2.1 of VirtualScroller. This minor update adds two enhancements:

  • Any number of items (at least 1) is supported. If fewer than 6 items are specified, the VirtualScroller falls back on a standard ScrollableView with the fancy scroll logic disabled. This is transparent to developers and users. Resolves issue 3.
  • VirtualScroller automatically detects whether you want finite or infinite scrolling. If the itemCount property is omitted, infinite scrolling is assumed. Otherwise finite scrolling is used. As such, the infinite property is no longer used. Resolves issue 4.

Download the latest version
Visit the Wiki

Enjoy!

Aug 24, 2013
 

I came across a nice example of a Twisted “man-in-the-middle” style proxy on Stack Overflow. This style of proxy is great for logging traffic between two endpoints, as well as modifying the requests and responses that travel between them.

The original was posted here, and I reproduce the vast majority of the code below with some modifications. My real motivation for posting this is to “get the code out there”, because I had a hard time finding it originally. A big thanks to the original author for posting his code on Stack Overflow.

All you need to do is change the three constants at the top, and add whatever validation/modification logic you want in the dataReceived and write methods. Those four methods are labeled so you know which “hop” the data is taking. A request is going to take the following path: client => proxy => server => proxy => client. For example, the first dataReceived method handles data travelling from the client to your proxy.
 

#!/usr/bin/env python

LISTEN_PORT = 8000
SERVER_PORT = 1234
SERVER_ADDR = "server address"

from twisted.internet import protocol, reactor


# Adapted from http://stackoverflow.com/a/15645169/221061
class ServerProtocol(protocol.Protocol):
    def __init__(self):
        # Data received from the client before the upstream connection is ready
        self.buffer = b""
        self.client = None

    def connectionMade(self):
        # A client connected to the proxy; open the matching connection to the server
        factory = protocol.ClientFactory()
        factory.protocol = ClientProtocol
        factory.server = self

        reactor.connectTCP(SERVER_ADDR, SERVER_PORT, factory)

    # Client => Proxy
    def dataReceived(self, data):
        if self.client:
            self.client.write(data)
        else:
            # Append rather than overwrite, so earlier chunks aren't lost
            self.buffer += data

    # Proxy => Client
    def write(self, data):
        self.transport.write(data)


class ClientProtocol(protocol.Protocol):
    def connectionMade(self):
        # Upstream connection is ready; flush anything the client already sent
        self.factory.server.client = self
        self.write(self.factory.server.buffer)
        self.factory.server.buffer = b""

    # Server => Proxy
    def dataReceived(self, data):
        self.factory.server.write(data)

    # Proxy => Server
    def write(self, data):
        if data:
            self.transport.write(data)


def main():
    factory = protocol.ServerFactory()
    factory.protocol = ServerProtocol

    reactor.listenTCP(LISTEN_PORT, factory)
    reactor.run()


if __name__ == '__main__':
    main()
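
As one example of the validation/modification logic mentioned above, here is a small helper that logs each chunk and optionally swaps one byte sequence for another. The name inspect_and_rewrite and the hop labels are hypothetical, chosen for this sketch; they are not part of the original Stack Overflow answer or of Twisted.

```python
import logging

log = logging.getLogger("proxy")


def inspect_and_rewrite(hop, data, old=b"", new=b""):
    """Log one chunk for the given hop and optionally replace `old` with `new`."""
    log.info("%s: %d bytes", hop, len(data))
    if old:
        data = data.replace(old, new)
    return data
```

Inside the first dataReceived method, for instance, you might write data = inspect_and_rewrite("client=>proxy", data, b"/old", b"/new") before passing the data along. Keep in mind that TCP delivers a stream, so a pattern can be split across two dataReceived calls; a simple replace like this one will miss those cases.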

Aug 17, 2013
 

It’s been almost a year (exactly a year in seven days), but I finally have a new version of VirtualScroller! Version 1.2 is versioned as a minor update (in the 1.x family), but it contains some significant bug fixes and stability improvements. With the exception of a change in default options (see below), this is a drop-in replacement for versions 1.1.1 and 1.1.

Glitch free, smooth scrolling

Previously, scrolling through a VirtualScroller too fast could leave the control in a transient state. Before version 1.2, I had worked around this issue by essentially limiting the scrolling speed. This resulted in an annoying user experience, because it was impossible to quickly swipe through pages. In fact, it was only possible to scroll through one page at a time, with a short pause in between pages.

This is no longer the case with 1.2. Users should not notice any jittering as they swipe along, thanks to vastly improved scrolling logic and event handling.

Two important internal changes made this possible:

  1. Increased view cache: Previously, only three views were maintained in memory which meant that the active view was padded on both sides by only a single view. Because of this, it was possible to scroll to the end of the in-memory views before the VirtualScroller had loaded the next views. Version 1.2 works around this problem by using a cache size of five. This makes it harder to outpace the VirtualScroller’s caching. Note: This means that your VirtualScroller itemCount MUST be at least five (or infinite), or else the VirtualScroller constructor will return null.
  2. Simplified scrolling logic: The code, in general, has been heavily refactored and simplified. For example, VirtualScroller now has a much easier check to determine if a scrollEnd event actually resulted in a page advance. As a consequence, there is much less that can go wrong.


Touch support is assumed

A minor change: previously, the touch option defaulted to false. Since all Google Play apps require touch support, I realized that it is more convenient to default it to true, so that is what version 1.2 does.

Updated documentation

The (previously neglected) Wiki has been updated to reflect the new changes. Also, the code example should actually work now.

Download and documentation

Download the latest release here, and get the documentation here. As usual, if you encounter any issues or have any suggestions, please report them here.

Future development

I’m in the process of adding methods for advancing the control forwards and backwards, but I don’t have estimates for when that will be done.


Enjoy!