The promise of HTML5 and low-hanging fruit

I don’t have enough time in a month to get one third of what I would like accomplished. I’ve been brainstorming in various contexts, and I’ve decided to start giving ideas away wholesale. I have had no problem thinking up ideas in the past, and don’t foresee any problems in the near future. I may as well start putting them out there for anyone to pick up and use.

I spent too long this afternoon browsing through the HTML5 working draft, and found some nice things in there, such as the pre-defined BibTeX vocabulary. At first, it’s unsettling that HTML5 can exist as well-formed XML, in a decidedly non-XML form, and at every step in between, but given the right tools, that might make for interesting opportunities.

Many forms, one document

For example, from what I understand so far, the following two documents will be identical in DOM5 HTML, thanks to optional tags:

<!DOCTYPE html>
<title>Sample Document</title>
<meta name=keyword content=example>
<h1>Sample Document</h1>
<p>Hello<br>World!
<table><tr><td>a<td>apple<tr><td>b<td>banana</table>
<p id=done>I'm done now.

and

<!DOCTYPE html>
<html>
  <head>
    <title>Sample Document</title>
    <meta name='keyword' content='example'/>
  </head>
  <body>
    <h1>Sample Document</h1>
    <p>Hello<br/>World!</p>
    <table>
      <tbody>
        <tr><td>a</td>
          <td>apple</td></tr>
        <tr><td>b</td>
          <td>banana</td></tr>
      </tbody>
    </table>
    <p id='done'>I'm done now.</p>
  </body>
</html>

On one hand, it’s faintly alarming. On another, it starts to look kind of cool. (On a third hand, is it old news? I don’t think HTML5 approaches this in quite the same way that HTML4 could sort of be in XML-ish form.) I could be as terse as is legal when authoring a document, then serialize it to a canonical well-formed XML document, and then use the end product in my XML toolchain of choice, whether for storage or transformation or editing.

Dear someone looking for something to do — make this tool for the world: a full HTML5 parser that serializes to well-formed XML. Replace all entities (except the necessary five — or even two) with their Unicode equivalents.
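To make the request concrete, here is a minimal sketch of the idea in Python, using only the standard library. It is emphatically not a full HTML5 parser: the class name XmlSerializer, the helper to_xml, and the small IMPLIED_CLOSE table are my own inventions for illustration, and the table covers only a handful of the optional-tag rules (p, li, and table cells and rows). It does not add head, body, or tbody, and it only adds an html root when the source lacks one.

```python
from html.parser import HTMLParser
from xml.sax.saxutils import escape, quoteattr

# Void elements never take an end tag in HTML5.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "param", "source", "track", "wbr"}

# Assumption: a tiny, hand-picked subset of the optional-end-tag rules.
# A start tag on the left implicitly closes open elements on the right.
_ENDS_P = {"p", "h1", "h2", "h3", "h4", "h5", "h6",
           "table", "ul", "ol", "div", "section"}
IMPLIED_CLOSE = {tag: {"p"} for tag in _ENDS_P}
IMPLIED_CLOSE.update({
    "li": {"li", "p"},
    "td": {"td", "th", "p"},
    "th": {"td", "th", "p"},
    "tr": {"tr", "td", "th", "p"},
})


class XmlSerializer(HTMLParser):
    def __init__(self):
        # convert_charrefs=True turns entities into Unicode characters.
        super().__init__(convert_charrefs=True)
        self.out = []
        self.stack = []

    def _ensure_root(self, tag=None):
        # XML needs a single root, so supply <html> if the source omits it.
        if not self.stack and tag != "html":
            self.stack.append("html")
            self.out.append("<html>")

    def _pop(self):
        self.out.append(f"</{self.stack.pop()}>")

    def handle_starttag(self, tag, attrs):
        self._ensure_root(tag)
        closes = IMPLIED_CLOSE.get(tag, set())
        while self.stack and self.stack[-1] in closes:
            self._pop()
        # Valueless attributes (e.g. "disabled") repeat their own name.
        rendered = "".join(
            f" {name}={quoteattr(name if value is None else value)}"
            for name, value in attrs)
        if tag in VOID:
            self.out.append(f"<{tag}{rendered}/>")
        else:
            self.out.append(f"<{tag}{rendered}>")
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Close everything down to the matching element; ignore strays.
        if tag in self.stack:
            while self.stack[-1] != tag:
                self._pop()
            self._pop()

    def handle_data(self, data):
        if not self.stack and not data.strip():
            return  # whitespace outside the root element
        self._ensure_root()
        # escape() leaves only &amp;, &lt; and &gt; as entities.
        self.out.append(escape(data))


def to_xml(html_text):
    parser = XmlSerializer()
    parser.feed(html_text)
    parser.close()            # flush any buffered text
    while parser.stack:       # close whatever is still open
        parser._pop()
    return "".join(parser.out)
```

Feeding the terse sample document from above through to_xml should yield something an ordinary XML parser accepts, with the table rows and cells properly closed. A real tool would need the complete tree-construction rules from the spec rather than this lookup table.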

With that accomplished, you can probably make a big splash by also letting it output HTML that current browsers won’t choke on, and/or convert to HTML4 in a way that retains semantics by mapping new elements to div and span.

I’d really love this tool if implicit sectioning elements in an outline were converted into explicit section elements. Having easily manipulable outlining sections would enable a lot more tools — or allow you to consolidate many writing tools into one.
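As a rough illustration of what I mean by making sections explicit, here is a sketch that wraps heading-delimited runs of a flat, already well-formed body in nested section elements. It is a simplification and not the spec's outline algorithm: it looks only at the direct children of one container, treats h1 through h6 as the sole section starters, and the function name add_explicit_sections is made up for this example.

```python
import re
import xml.etree.ElementTree as ET

HEADING = re.compile(r"h([1-6])$")


def add_explicit_sections(body):
    """Return a new container whose heading-led runs are wrapped in <section>."""
    result = ET.Element(body.tag, dict(body.attrib))
    result.text = body.text
    # Stack of (heading rank, element); rank 0 is the container itself.
    stack = [(0, result)]
    for child in body:
        match = HEADING.match(child.tag)
        if match:
            rank = int(match.group(1))
            # A heading of equal or higher rank ends the open sections.
            while stack[-1][0] >= rank:
                stack.pop()
            section = ET.SubElement(stack[-1][1], "section")
            stack.append((rank, section))
        # Headings and ordinary content both land in the innermost section.
        stack[-1][1].append(child)
    return result
```

Given a body holding an h1, a paragraph, and two h2 elements, this produces one outer section for the h1 with two nested sections inside it. The child elements are shared with the original tree, not copied, which is fine for a one-shot conversion but worth knowing if you keep both trees around.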

An archive format?

Why do I care about well-formed XML? Well, did you notice the difference in sizes between the HTML parsing and XHTML parsing sections in the HTML5 draft spec?

Working for years in MPEG made me appreciate why we should strive for data longevity. It might be merely an abstract ideal, but it’s one of our primary tasks today to be kind to our future selves. If I come across a document in fifteen years, I don’t want to have to look up which elements are void elements in order to parse it. But we owe it to ourselves to archive in a format with more structure than plain text, or even an enhanced text like one of countless wiki formats, Markdown, or Textile.

As I make and break websites and leave them online as a form of digital detritus, I’ve also been thinking a lot about the maintainability and migrability of data. I’m finding it’s easier to set up a new CMS than it is to migrate an existing CMS and its data to another machine. I’ve even considered migrating out of various CMSes by crawling my own websites. Uck.

Influenced quite a bit by Mark Pilgrim’s thoughts on The Format, I’m now considering a well-formed flavor of HTML5 as the format for now. It’s not as complete as DocBook, but the structural elements are sufficiently complete for 95% of use cases for extended text. And it’s more compact. And it’ll be trivially viewable.

So, the other itch I’d love someone to scratch is to create an author’s profile for HTML5. The HTML5 spec describes what is essentially a delivery format. It has worked hard to separate presentation from semantics, and goes as far as it can in doing so. However, there’s a lot in the spec that has very little to do with an article or group of articles connected to form a multi-page resource. I would like a canonical version of an article to carry just the data (and metadata) necessary for the article, and nothing more. It should be self-contained and portable.

From an author’s point of view, I would like to concentrate on words and structuring those words. No navigation, no scripts, no unnecessary headers, footers, banners, or columns.

That is to say, the canonical format for the ages doesn’t have to be the same as what the user accesses, but they could share the same syntax and semantics. If I’m going to have my work mediated by a CMS, then I want it also maintaining a canonical format for each resource, free from the navigation and eye candy that seem to be necessary to get noticed, and free from existing only in a database that creates a lot of inertia for my data. Dokuwiki, as with so many other things, seems to get this right, with its do=export_* modes.

All roads lead to…

That’s starting to suggest my next itch to scratch: a CMS that embodies these principles. It can ingest data in various formats (so long as it can be rendered to HTML and potentially carry a little metadata so that an envelope format isn’t necessary, it sounds usable to me) and via different channels (REST and some flavor of DVCS feel like my current favorites). It could render these sources lazily, only when the web server misses from a pre-rendered cache, so that most of the rendering machinery stays out of the way most of the time.
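The lazy-rendering part can be sketched in a few lines. This is only an illustration under my own assumptions: sources live as .txt files named by slug, rendered pages are cached as .html files, the function name fetch is invented, and render stands in for whatever converter the source format needs. In a real deployment the web server would serve the cache directory directly and only fall through to code like this on a miss. Cache invalidation is left out entirely.

```python
from pathlib import Path


def fetch(slug, source_dir, cache_dir, render):
    """Return rendered HTML for slug, rendering only on a cache miss."""
    cached = Path(cache_dir) / f"{slug}.html"
    if cached.exists():
        # Cache hit: the rendering machinery stays out of the way.
        return cached.read_text(encoding="utf-8")
    # Cache miss: render from the canonical source and remember the result.
    source = Path(source_dir) / f"{slug}.txt"
    html = render(source.read_text(encoding="utf-8"))
    cached.parent.mkdir(parents=True, exist_ok=True)
    cached.write_text(html, encoding="utf-8")
    return html
```

Because render is just a callable, the same core could sit in front of Markdown, wiki text, or well-formed HTML5 that needs no conversion at all.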

This architecture isn’t miles away from how Blosxom does things, mashed up with Django’s FlatpageFallbackMiddleware. The core could be small, fast, and flexible, working with varied storage solutions and template languages. Really, it seems like the core code for this involves routing, knowing when to render, and how to ingest from diverse sources, and very little else.

This final idea obviously doesn’t rely upon HTML5, but thinking about data longevity and what a webpage is naturally leads me back down this road. Ooh. How original. Yet another CMS. Well, yes. But right now, I have to shelve the idea for later anyway — or hope somebody comes along and makes this exactly how I imagine it to be.