Bringing documents back to life

Producing documents

The general process of epidemiological research is as follows. First the academic asks a question and hopes to answer it by doing some calculations. These calculations are usually done with a mix of software tools (R, SPSS, SAS, Python, Excel), intermingled with data and spread out over several places. It is messy, but the iteration in search of the desired results keeps the process alive.


Once the academic is satisfied with the results, they are written down and published in a journal. With that, the academic has just killed their research. What was a lively, reproducible process of experimentation is now nothing more than some numbers on a page, with pretty graphics if you are lucky.

Since nobody is particularly keen on sharing data or methods, and people were even less so in the past, anyone looking for information usually finds only these dead numbers, which, for all intents and purposes, could have been the result of deep magic.

This is a problem. How do you now test whether the results still hold with, for example, new data? And what if you want to try a slightly different method, or just vary some parameters to check that the original author did not pick particularly well-suited ones without justification?

The best you can do is redo the entire research. This is very time-consuming and most of the time impossible, because the data needed for the conclusions are nowhere to be found. So if you want to avoid basing your own conclusions on bad or even fraudulent data, you are best off ignoring the whole publication.

I paint a pretty bleak picture here, but when doing research you have to listen to Murphy’s law: anything that can go wrong will go wrong. By extension, if it is possible, it has happened.


So how can this burden be eased? What I propose is a toolkit for performing some necromancy on dead documents: a set of tools to resurrect them. This won’t solve the problem of missing data, fraudulent shared data, or undocumented processes. But it might help, because offering interactivity allows you to understand the dynamics of the data. (A more elaborate review of this is given by Bret Victor in Explorable Explanations.)

Concretely the desired flow is as follows:


Take a dead document. This could be HTML, or a PDF document rendered through PDF.js.


Then annotate the key parts by selecting those pieces of text. A pop-up should appear offering options such as naming the variable, marking it as input or output, and setting things like the numerical range (if applicable).

This annotation could take the form of an RDFa element, but more importantly it should allow the user to vary the value inline. Play with it, so to speak. For numerical values this could be realized with a slider, for example.


When finished, this gives a document annotated with inline variables. These variables should be visually recognizable and have an anchor point.
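To make the annotation step a bit more concrete, here is a minimal sketch of what one annotated variable might hold. Everything in it is an assumption for illustration: the `Annotation` class, its field names, the `ex:` vocabulary prefix and the `data-role` attribute are hypothetical, and only `property` and `content` are actual RDFa attributes. A real tool would live in the browser; Python is used here only to show the shape of the data.

```python
from dataclasses import dataclass
from html import escape
from typing import Optional


@dataclass
class Annotation:
    """One annotated value in a document (all field names are hypothetical)."""
    name: str                         # variable name chosen in the pop-up
    role: str                         # "input" or "output"
    value: float                      # the number as printed in the document
    minimum: Optional[float] = None   # lower bound for the slider, if any
    maximum: Optional[float] = None   # upper bound for the slider, if any

    def __post_init__(self) -> None:
        if self.role not in ("input", "output"):
            raise ValueError(f"role must be 'input' or 'output', got {self.role!r}")

    def set_value(self, new_value: float) -> float:
        """Move the slider: clamp to the declared range and store the result."""
        if self.minimum is not None:
            new_value = max(self.minimum, new_value)
        if self.maximum is not None:
            new_value = min(self.maximum, new_value)
        self.value = new_value
        return self.value

    def to_rdfa(self) -> str:
        """Render as a span with RDFa's `property` and `content` attributes.

        The `ex:` prefix and `data-role` are placeholders, not a real vocabulary.
        """
        name = escape(self.name)
        return (
            f'<span id="{name}" property="ex:{name}" '
            f'content="{self.value}" data-role="{self.role}">{self.value}</span>'
        )


# A hypothetical input variable with a slider range of 0 to 10.
dose = Annotation(name="dose", role="input", value=2.5, minimum=0.0, maximum=10.0)
dose.set_value(12.0)      # outside the range, so it is clamped to the maximum
print(dose.to_rdfa())
```

The `id` on the span is one possible way to provide the anchor point: the visual programming tool could refer to the variable by that identifier.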


The purpose of the anchor point is that the user can specify the underlying model used in the document in a visual programming tool, and use the anchors as inputs or outputs. This is an example of Flow-Based Programming and Functional Reactive Programming. For the uninitiated, it is like Excel but more flexible. As an example, the Mac OS X Quartz Composer is based on (visual) Flow-Based Programming.
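The idea of wiring anchors into a model can be sketched in a few lines. This is a rough illustration under stated assumptions, not a design: the `Graph` class and its methods are hypothetical names, the epidemiological quantities are made up for the example, and the propagation is a naive single pass that assumes nodes are added in dependency order. It shows only the spreadsheet-like behaviour (change an input, outputs follow), not a full Flow-Based Programming runtime.

```python
from typing import Callable, Dict, List, Tuple


class Graph:
    """A tiny dataflow graph: input anchors feed nodes, nodes feed output anchors."""

    def __init__(self) -> None:
        self.values: Dict[str, float] = {}
        # Each node is (output name, function, input names), in definition order.
        self.nodes: List[Tuple[str, Callable[..., float], Tuple[str, ...]]] = []

    def set_input(self, name: str, value: float) -> None:
        """Change an input anchor and recompute everything downstream."""
        self.values[name] = value
        self._propagate()

    def add_node(self, output: str, func: Callable[..., float], *inputs: str) -> None:
        """Connect named inputs to a named output through a function."""
        self.nodes.append((output, func, inputs))
        self._propagate()

    def _propagate(self) -> None:
        # Assumes nodes were added in dependency order, so one pass is enough.
        for output, func, inputs in self.nodes:
            if all(name in self.values for name in inputs):
                self.values[output] = func(*(self.values[name] for name in inputs))


# Two input anchors taken from a (hypothetical) paper.
g = Graph()
g.set_input("risk_exposed", 0.2)
g.set_input("risk_unexposed", 0.1)

# The model behind the paper's output numbers, wired to those anchors.
g.add_node("relative_risk", lambda a, b: a / b, "risk_exposed", "risk_unexposed")
g.add_node("attributable_fraction", lambda rr: (rr - 1) / rr, "relative_risk")

# Moving the slider on one input updates both outputs.
g.set_input("risk_exposed", 0.4)
print(g.values["relative_risk"], g.values["attributable_fraction"])
```

In the envisioned tool the two `add_node` calls would be boxes and wires drawn by the user, and the `set_input` call would be triggered by a slider in the document.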


Since the focus is on academic research, back-ends for R, Python and JavaScript would be most desirable. But to keep it simple, a couple of highly specific domains could be implemented first as prototypes.


Developing the Visual Programming Toolkit is a primary research goal. Some modern ones exist already, like NoFlo and VVVV.js, but none are adequate. They are either too complicated, tied to another domain (or at least not specific enough for the scientific domain), or integrate poorly with the web browser.

Eventually I hope to give researchers the ability to play with the document, so they might save some time in doing their research. It also offers a nice way of implementing some methods without continuously going back and forth between the paper and the implementation.

Of course this is just a tiny step in the direction of reproducibility. The goal eventually is that everyone shares their data and methods, possibly through Literate Programming, Private Information Retrieval (if there is no other way), and Semantic Web technologies. However, when dealing with legacy documents you have to be a tad more post-modernist, and first take things for what they are.