The web we lost


This is a long-form document, or perpetual draft. For now you can just regard it as my personal scrapbook. You can trace its history through the Atom feed, which represents the diffs of the document as created by git.

The Internet is self destructing paper. A place where anything written is soon destroyed by rapacious competition and the only preservation is to forever copy writing from sheet to sheet faster than they can burn. If it’s worth writing, it’s worth keeping. If it can be kept, it might be worth writing… If you store your writing on a third party site like Blogger, Livejournal or even on your own site, but in the complex format used by blog/wiki software de jour you will lose it forever as soon as hypersonic wings of Internet labor flows direct people’s energies elsewhere. For most information published on the Internet, perhaps that is not a moment too soon, but how can the muse of originality soar when immolating transience brushes every feather? — Julian Assange on self destructing paper

You can browse the raw git on The site is generated from org-mode files and served with Jekyll. For convenience's sake you can use the pattern [commit-sha]/path-to-file to see a particular revision, for example the first version of this page.

Representation and semantics

The internet has failed. “What?!” I can hear you, and myself, think. Well, it’s simple: it does not do its intended job, at least not very well. It used to be better, even. We are kinda killing it, really. By “we” I mean the web developers, with our Backbone, spineless, Angular.js hipster goodness.

“Why the rage?” you might ask. It’s simple, really. Today, and for the past weeks (months?), I have spent countless hours scrolling through PDF documents. It’s what they call research. And every so often I find a piece of text referring to another document. So I click it and go to that document, as hypertext envisioned. No, that is not what happened. What happened is: I scrolled to the bottom of the PDF, copy-pasted the document name (if I was very lucky; usually it meant retyping) and prayed that a multi gazillion dollar search engine would find me the missing document. And for the data or underlying code, well, that is a completely different story.

Let me quote you, from the inventor himself:

HyperText is a way to link and access information of various kinds as a web of nodes in which the user can browse at will. Potentially, HyperText provides a single user-interface to many large classes of stored information such as reports, notes, data-bases, computer documentation and on-line systems help. We propose the implementation of a simple scheme to incorporate several different servers of machine-stored information already available at CERN, including an analysis of the requirements for information access needs by experiments… A program which provides access to the hypertext world we call a browser — T. Berners-Lee, R. Cailliau, 12 November 1990, CERN

“But that’s just the sorry state of academic publishing”, you might think; and rightfully so. Academic publishing really is a mess, more so for political reasons than technological ones. (Where is Kuhn’s paradigm shift when you need it?) I mean, the idea of hypertext is a lot older than 1990: you can trace it back to As We May Think by Vannevar Bush in 1945, and even further.

So why the name-calling rage? Well, you see, we’re building these internet applications that consist of a single document, which dynamically updates stuff from the source. “Web apps”, if you will. So what you get is a “refresh-less” experience. What you lose is:

  • Addresses that uniquely identify resources (documents, data, pages)
  • The ability to dereference these addresses and parse the data
  • The ability to associate addresses and create a graph of linked documents
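
To make those three properties concrete, here is a minimal Python sketch, not a real crawler. It assumes a handful of documents that each live at their own address, pulls the links out of each one, and assembles the resulting graph. The `PAGES` dictionary, its example.org addresses and the names `LinkCollector` and `build_link_graph` are all made up for illustration; a real version would fetch each address over HTTP instead of reading from a dictionary.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the href of every <a> element in one document."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative references against the document's own address.
                self.links.append(urljoin(self.base_url, value))


def build_link_graph(pages):
    """Map each address to the sorted list of addresses it links to."""
    graph = {}
    for url, html in pages.items():
        collector = LinkCollector(url)
        collector.feed(html)
        graph[url] = sorted(set(collector.links))
    return graph


# Hypothetical pages standing in for documents fetched from their addresses.
PAGES = {
    "http://example.org/paper.html":
        '<p>See <a href="data.html">the data</a> and '
        '<a href="http://example.org/code.html">the code</a>.</p>',
    "http://example.org/data.html":
        '<p>Derived from <a href="/paper.html">the paper</a>.</p>',
    "http://example.org/code.html": "<p>No outgoing links.</p>",
}

graph = build_link_graph(PAGES)
for source, targets in graph.items():
    print(source, "->", targets)
```

None of this works for a single-page app whose states have no address of their own: there is nothing to put in the dictionary key, and nothing for an `href` to point at.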

We’re really doing a very backwards thing with the web apps. We completely destroyed the very sane idea of a markup language and decided “wait, we need an API”. So we started using JSON (of all things) instead of the actual document (HTML or XML) to represent the endpoints. And then we patted ourselves on the back claiming “accessibility”.

Back in 2004 I was asked to do an inventory system for a medium-sized department in the government. Basically people wanted to know what computers and network equipment were in which room. I defined a simple, yet human-readable, XML specification. It looked something like

<room number="12">
        <computer type="Dell something" ip="" />
        <computer type="Jet direct printer" ip="" />
</room>

You get the idea. Then the magic happened: I defined an XSLT that transformed the XML into HTML and displayed it in the browser, creating a pictographic representation of the room and clickable links to the specs, address book and whatnot. This allowed for a canonical, and optionally standardized, representation of the data. Seamless transformation to a different, browser-readable format. Complete separation of concerns. Better yet, the actual URI contained a fully machine-readable resource, as it is simply a tree, namely the DOM. W3Schools still offers an example: view the source (the actual view source button) and then view the DOM with your inspector of choice.
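
The original transformation was an XSLT stylesheet, which is not reproduced here. As a rough stand-in, the sketch below does the same kind of XML to HTML step in Python with the standard library's `xml.etree.ElementTree`, so it is a different technique with the same shape: one canonical data document in, one browser-readable view out. The function name `room_to_html`, the HTML layout and the IP values are assumptions made up for this example.

```python
import xml.etree.ElementTree as ET

# Inventory in the shape the text describes. The IP values are made-up
# placeholders from the documentation range, not data from the original system.
INVENTORY_XML = """
<room number="12">
    <computer type="Dell something" ip="192.0.2.10" />
    <computer type="Jet direct printer" ip="192.0.2.11" />
</room>
"""


def room_to_html(xml_text):
    """Render one <room> element as an HTML fragment: a heading plus a list."""
    room = ET.fromstring(xml_text)
    section = ET.Element("section", {"id": "room-" + room.get("number")})
    heading = ET.SubElement(section, "h2")
    heading.text = "Room " + room.get("number")
    items = ET.SubElement(section, "ul")
    for computer in room.findall("computer"):
        item = ET.SubElement(items, "li")
        item.text = computer.get("type")
        ip = computer.get("ip")
        if ip:
            # Only show the address when the inventory actually records one.
            item.text += " (" + ip + ")"
    return ET.tostring(section, encoding="unicode")


html = room_to_html(INVENTORY_XML)
print(html)
```

The point of the split is that the XML stays the single source of truth. The view can be thrown away and regenerated, while any other program can still read the same tree from the same address.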

With this representation it is trivial to create a semantic web.

I have a dream for the Web [in which computers] become capable of analyzing all the data on the Web – the content, links, and transactions between people and computers. A “Semantic Web”, which makes this possible, has yet to emerge, but when it does, the day-to-day mechanisms of trade, bureaucracy and our daily lives will be handled by machines talking to machines. The “intelligent agents” people have touted for ages will finally materialize.

It’s a lot harder if you keep mutating the DOM. It’s harder still if you don’t give those mutations an address. It’s next to impossible if the Document and the Data are separate.