Formatting Code Listings in Kindle Books using the Html Agility Pack

Well, that could have gone better. I got an email from Amazon suggesting that I buy a copy of the C# Yellow Book. I get these from time to time, and this time I thought I'd tweet about it. It turned into a very popular tweet (for one of mine).

Anyhoo, with my ego nicely built up I thought I'd take a look at the Amazon page for the book. And I found a 1 star review in which the reviewer noted that they would have liked the book a lot more if all the code samples were properly formatted...

Oh dear. Turns out that if you view the book on your iPad or iPhone the code samples all get printed on one line. I thought I'd checked this, but apparently I hadn't. So I did some digging. 

Kindle books are basically HTML documents, a bit like web pages. Like web pages, if you want to tell the renderer that you have already formatted part of the document you can use the <pre> </pre> element to mark text that already has a layout. That way you can put code samples into the text without their layout being damaged.

But there is another way to do this. You can add "white-space: pre-wrap;" to your styles for the pre-formatted text. This works fine on Kindle devices and Android devices, but not on iOS devices. Guess which technique I'd gone for. My reasoning was sensible enough: I wanted to add other stylistic touches to the code samples (a grey background, for example) and it made sense to do it all in one place. But it didn't work. Stupid me.

I had around 200 pages of text with lots of code samples, all of which were wrong. And a broken text up on Kindle that I really, really, needed to fix quickly. So I did some digging and came across the Html Agility Pack on CodePlex. This is completely wonderful. 

It provides a way of reading in a large HTML text and then traversing the nodes in the document and fiddling with them. Turns out all I needed to do was load each of the chapters and then do this:

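// Assumes "using HtmlAgilityPack;" and that debugString is a
// StringBuilder field declared elsewhere in the class.
void processNode(HtmlNode node)
{
    foreach (HtmlAttribute attribute in node.Attributes)
    {
        // Only elements marked class="CodeExplained" are touched
        if(attribute.Name == "class" && attribute.Value == "CodeExplained")
        {
            node.Name = "pre";
            attribute.Value = "CodeExplainedPre";
            // Keep a record of every code sample that was converted
            debugString.AppendLine(node.InnerHtml);
        }
    }
    if (node.ChildNodes != null)
    {
        // Recurse so that nested elements get checked as well
        foreach (HtmlNode childNode in node.ChildNodes)
        {
            processNode(childNode);
        }
    }
}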

 

This starts at the base node and then looks for anything with the class CodeExplained. It then changes the name of the node to pre (for pre-formatted) and changes the value of the class attribute to CodeExplainedPre.

It is not very elegant, but it does use recursion. If a node contains any child nodes it calls itself to sort those out too. I was going to figure out the structure of the document and only target the parts that needed fixing, but I was in a hurry and this code meant I could reformat the document and make it to the coffee break in time.
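The "load each of the chapters" part isn't shown above, so here is a minimal sketch of how it might look. The class and method names (ChapterFixer, FixHtml, FixChapterFile) are made up for illustration and are not from the original program; the Html Agility Pack calls (LoadHtml, Load, Save, DocumentNode) are the library's own. The node walker is the same idea as processNode, minus the debug logging.

```csharp
using System.IO;
using HtmlAgilityPack;

// Hypothetical helper class; the names are illustrative only.
static class ChapterFixer
{
    // Same recursive walk as processNode, without the debug output.
    static void ProcessNode(HtmlNode node)
    {
        foreach (HtmlAttribute attribute in node.Attributes)
        {
            if (attribute.Name == "class" && attribute.Value == "CodeExplained")
            {
                node.Name = "pre";
                attribute.Value = "CodeExplainedPre";
            }
        }
        if (node.ChildNodes != null)
        {
            foreach (HtmlNode childNode in node.ChildNodes)
            {
                ProcessNode(childNode);
            }
        }
    }

    // Parses an HTML string, converts the code samples and returns the result.
    public static string FixHtml(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        ProcessNode(doc.DocumentNode);

        var writer = new StringWriter();
        doc.Save(writer);
        return writer.ToString();
    }

    // Does the same for a chapter file on disk, overwriting it in place.
    public static void FixChapterFile(string path)
    {
        var doc = new HtmlDocument();
        doc.Load(path);
        ProcessNode(doc.DocumentNode);
        doc.Save(path);
    }
}
```

Calling ChapterFixer.FixChapterFile on each chapter file in turn would then be the whole job. Since it overwrites the file, it is probably wise to work on a copy of the book.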

Note: There are probably lots of much cleverer ways of doing this using the Document Object Model or regular expressions or something. But at least this worked, and I was able to get the fixed version up on Kindle within the hour.
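For what it's worth, one of those cleverer ways might be an XPath query, which the Html Agility Pack supports through SelectNodes. This is a sketch of a different technique from the recursive one above, not what I actually ran, and the class and method names (XPathCodeFixer, ConvertCodeBlocks) are invented for the example. Like the original, it only matches elements whose class attribute is exactly CodeExplained.

```csharp
using System.IO;
using HtmlAgilityPack;

// Hypothetical alternative to the recursive walk; names are illustrative only.
static class XPathCodeFixer
{
    public static string ConvertCodeBlocks(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Find every element whose class attribute is exactly "CodeExplained".
        // SelectNodes returns null, not an empty collection, when nothing matches.
        HtmlNodeCollection matches =
            doc.DocumentNode.SelectNodes("//*[@class='CodeExplained']");

        if (matches != null)
        {
            foreach (HtmlNode node in matches)
            {
                node.Name = "pre";
                node.SetAttributeValue("class", "CodeExplainedPre");
            }
        }

        var writer = new StringWriter();
        doc.Save(writer);
        return writer.ToString();
    }
}
```

The query does the searching, so there is no need to write the recursion by hand. Whether that counts as more elegant is a matter of taste.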