A brief XML rant

Is XML really so hard?

You had one job, Posterous, before shutting down. You promised us an archive before disappearing, and you delivered.

Sort of.

A general note for everyone: unless you really, really know what you’re doing, and the boilerplate far exceeds the amount of data being inserted, do not use template languages to generate XML.
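
For contrast, here is a minimal sketch of building the same kind of item with an XML library instead of a template. It assumes Python's standard `xml.etree.ElementTree`; the `build_item` helper and the sample title are hypothetical, and the Dublin Core namespace URI is the standard one rather than anything taken from the Posterous export.

```python
import xml.etree.ElementTree as ET

# Standard Dublin Core namespace URI (an assumption: the export never declares one).
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)


def build_item(title, creator):
    """Hypothetical helper: serialize one <item> element to a string."""
    item = ET.Element("item")
    ET.SubElement(item, "title").text = title
    # Qualified names use the {namespace-uri}localname form.
    ET.SubElement(item, "{%s}creator" % DC_NS).text = creator
    return ET.tostring(item, encoding="unicode")


xml_text = build_item("links & <things>", "Adam Lindsay")
print(xml_text)
```

The serializer escapes the ampersand and angle brackets in the title and emits the `xmlns:dc` declaration itself, which are exactly the details a hand-rolled template tends to forget.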

Let’s take a look:

    <title>unexpected links of the day</title>
    <pubDate>Wed Jul 02 00:42:07 -0700 2008</pubDate>
    <dc:creator><![CDATA[Adam Lindsay]]></dc:creator>
    <category domain="category" nicename="uncategorized"><![CDATA[Uncategorized]]></category>
    <guid isPermaLink="false">http://misc.atl.me/unexpected-links-of-the-day</guid>
    <content:encoded><![CDATA[<a href="http://twitter.com/andy_murray">http://twitter.com/andy_murray</a> <br /> and <br /> <a href="http://twitter.com/jamie_murray">http://twitter.com/jamie_murray</a> <p />   Courtesy of the Guardian.]]></content:encoded>
    <excerpt:encoded><![CDATA[http://twitter.com/andy_murray and http://twitter.com/jamie_murray Courtesy of the Guardian.]]></excerpt:encoded> 
    <wp:post_date>Wed Jul 02 00:42:07 -0700 2008</wp:post_date>
    <wp:post_date_gmt>%= display_date %></wp:post_date_gmt>

This is representative of my Posterous blog export (that I’m trying to import here into this blog).

There are no namespace declarations, so every prefixed element (dc:, content:, excerpt:, wp:) uses an unbound prefix. No self-respecting namespace-aware XML parser will have anything to do with this XML data.
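
To see the problem first-hand, here is a small sketch using Python's standard `xml.etree.ElementTree`. The `parses` helper is hypothetical, and the namespace URI in the second fragment is the standard Dublin Core one, which I am assuming is what the export meant.

```python
import xml.etree.ElementTree as ET

# Same shape as the export: a prefixed element with no xmlns:dc declaration.
UNDECLARED = "<item><dc:creator>Adam Lindsay</dc:creator></item>"

# The same fragment with the prefix bound (assumed Dublin Core URI).
DECLARED = (
    '<item xmlns:dc="http://purl.org/dc/elements/1.1/">'
    "<dc:creator>Adam Lindsay</dc:creator></item>"
)


def parses(xml_text):
    """Hypothetical helper: report whether the parser accepts the text."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError as err:
        print("rejected:", err)
        return False


undeclared_ok = parses(UNDECLARED)
declared_ok = parses(DECLARED)
```

The first fragment is rejected for its unbound prefix; adding a single attribute on the root element is all it takes to make the second one acceptable.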

Schema-design-wise, the content:encoded and excerpt:encoded element names are deeply suspect, as if someone looked at RSS 2.0, squinted, shrugged, and invented their own ad hoc analogous namespace prefix, rather than understanding the role of elements in XML.

I haven’t been able to determine the intended encoding of the files. One file looks sort of like Latin-1, another looks like UTF-8 gone wrong. If only XML supported some way of asserting the encoding of the file. Oh wait.
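
For the record, the mechanism is the XML declaration. This is a minimal sketch with Python's standard `xml.etree.ElementTree`; the sample title is made up, and I am assuming Latin-1 only because that is what one of my files resembled.

```python
import xml.etree.ElementTree as ET

BODY = "<title>caf\u00e9</title>"  # hypothetical content with one non-ASCII character

# With a declaration, the parser knows how to decode the bytes.
declared_bytes = ('<?xml version="1.0" encoding="ISO-8859-1"?>' + BODY).encode("latin-1")
declared_text = ET.fromstring(declared_bytes).text

# Without one, the parser assumes UTF-8 and chokes on the Latin-1 byte.
undeclared_bytes = BODY.encode("latin-1")
try:
    ET.fromstring(undeclared_bytes)
    undeclared_failed = False
except ET.ParseError:
    undeclared_failed = True
```

One line at the top of the file is the difference between a parser that decodes the text correctly and a reader left guessing.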

The most amazing line is this one, though:

    <wp:post_date_gmt>%= display_date %></wp:post_date_gmt>

Excuse me, your template language is showing.

Never mind that a GMT-normalized date would be one of the most useful output fields; it’s just embarrassing that this line is replicated in every file of my blog archive.
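
A check like the following would have caught it before release. This is only a rough smoke test, assuming ERB-style delimiters; the `find_template_residue` helper and its pattern are hypothetical, and the pattern would also flag legitimate content that happens to contain those character pairs.

```python
import re

# ERB-style markers, with or without the leading "<" (the export line lost it).
TEMPLATE_RESIDUE = re.compile(r"<%|%=|%>")


def find_template_residue(xml_text):
    """Hypothetical helper: return the lines that still contain template markers."""
    return [
        line.strip()
        for line in xml_text.splitlines()
        if TEMPLATE_RESIDUE.search(line)
    ]


sample = """\
<wp:post_date>Wed Jul 02 00:42:07 -0700 2008</wp:post_date>
<wp:post_date_gmt>%= display_date %></wp:post_date_gmt>
"""
suspects = find_template_residue(sample)
print(suspects)
```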

I’ll give Posterous a pass on using a less commonly used (and more difficult to parse) email-style date format, considering their origins as a post-by-email service. Strictly speaking it isn’t even RFC 2822, which puts the day before the month and the year before the time.
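
In fairness, the date format in the sample is recoverable, and so is the GMT value the broken field should have held. Here is a minimal sketch with Python's `datetime`; the format string is my own reading of the sample, the `to_gmt` helper is hypothetical, and `%a`/`%b` assume an English locale.

```python
from datetime import datetime, timezone

# Format inferred from the export sample: weekday, month, day, time, offset, year.
EXPORT_DATE_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def to_gmt(raw):
    """Hypothetical helper: parse an export date and normalize it to UTC."""
    local = datetime.strptime(raw, EXPORT_DATE_FORMAT)
    return local.astimezone(timezone.utc)


gmt = to_gmt("Wed Jul 02 00:42:07 -0700 2008")
print(gmt.isoformat())  # 2008-07-02T07:42:07+00:00
```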

Okay, obviously top engineering talent isn’t going to be assigned to the sunset period of a service, but I am shocked that someone with a bit of expertise didn’t look at this before some intern shoved it out the door. I try not to be bitchy, but this isn’t that hard, people.

I enjoyed Posterous. I recommended Posterous to others. I thought it was a good, well-designed, well-engineered service. Twitter seemed smart to acquire the team. I wasn’t even all that upset by the closing of the service: there was a generous amount of time given before this last shutdown phase. But the quality of the data syntax is just aggressively poor. This feels like a kick in the backside as we’re being hustled out the door. So this rant is fueled by disillusionment with a respected engineering organization.

Get off my lawn, you kids.

[Update: O HAI Hacker News. There’s a livelier-than-expected HN discussion here.]

[Update 5 March 2013: I managed to get back into Posterous’s site and regenerated my export. At least the ERB-like template artifacts had disappeared, but non-7-bit characters were still a bit of a muddle.

To clarify some misconceptions that came up on Hacker News: the rant wasn’t all that much about Posterous, XML, or the particular well-formedness defects. It was about process, care, and understanding of the craft. It was minor outrage that the quality cost of assigning a junior developer at Posterous was passed on to countless other developers, who had to spend their time coding around errors they should never have had to encounter.]