
Most academic publishing isn't academic (and how to deal with it)

Introduction

It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity, it was the season of Light, it was the season of Darkness, it was the spring of hope, it was the winter of despair, we had everything before us, we had nothing before us, we were all going direct to Heaven, we were all going direct the other way – Charles Dickens (A Tale of Two Cities)

Amazed that my cell phone has the computing power of last decade's supercomputers and that I can collaborate at the speed of light, I am convinced that we are now more empowered by knowledge than ever before.

Yet it would be foolish to say that all is well in the world of academia. You only need to skim recent pieces in The Economist or the LA Times to realize that the troubles are so widespread that they are no longer contained within the pristine walls of the ivory tower, but have instead become part of everyday discussion.

Perhaps we are caught in a moral maze. Academia has become a place where I wonder whether I am the only one with an awful, nagging feeling when one person speaks of finding correlations in enormous amounts of data while another raises their voice to say we should lock this knowledge up behind walls. Far removed from reproducibility, from falsification, or dare I say it: from science.

I would like to outline two avenues of research to deal with what might be called the reproducibility crisis: how to deal with the past, and how to move forward.

How to deal with the past

The problem with scientific publishing

If it’s not open and verifiable by others, it’s not science, or engineering, or whatever it is you call what we do. – Victoria Stodden (The scientific method in practice)

The opinions in society on what is wrong with science, or at least the academic kind, are myriad. Diederik Stapel-esque fraud, questionable statistics (see “Why Most Published Research Findings Are False”), “The Truth Wears Off” (also “the reproducibility crisis”), or biased authors: they are all frequent topics of conversation.

Regardless, a natural consequence of our growing body of knowledge is that most publications will turn out to be wrong at some point. If we believe that science is, or at least ought to be, an empirical, cumulative knowledge-gathering device, then there is nothing inherently bad about this.

The growing problem, however, is that we have no way of telling which publications are wrong. Publications often lack the means to tell whether they are correct (they cannot be falsified). One would hope that peer review prevents the publication of manuscripts that cannot be judged correct or incorrect, but believing this would be naive.

Take, for example, a publication in the area of systematic reviews and realize that checking its conclusion would entail retracing the trade-offs behind why some publications were included and others were not. This data is lacking to a staggering degree. Even worse: sometimes the data is locked up behind a paywall, or behind a committee that has to deem you worthy enough to provide criticism, inevitably inviting bias.

If academic publications are scientific publications, and science requires experimental results to be reproducible, then one must conclude that most academic publications are, in fact, not scientific.

We are left with an ever-increasing number of publications, and no way to tell if the conclusions they contain are, or ever were, correct. This is the problem that will be addressed in the following sections.

There are always three things someone can do when faced with a problem:

  1. Do nothing
  2. Try to fix it with technology
  3. Try to fix it without technology

Here the focus will be on fixing it by introducing technology. Society does not advance without advances in technology, but this does not mean the other two options are inherently bad. In fact, doing nothing is often a valid strategy. And while I grew up with life-defining moments behind a computer screen, technology is not always the answer.

A post-modernist view

Post-modernism, as opposed to modernism, is (while definitions vary) the idea that there is no grand narrative. There is no particular way things ought to happen; they are the way they are. When applying post-modernism to information technology, one contrasts the ideas of standardization and formalization against things simply being without structure.

This is, I believe, the stance that needs to be taken when looking at legacy publications: they are the way they are. It would be an exercise in futility to define a narrative (such as an object model, XML schema, or relational database) into which they must fit, and then somehow transform them; futile because it would be a never-ending story. So rather than transforming them, one should aim to take them as they are and then supplement them.

Four ideas will be outlined:

  • The scale of the problem needs to be assessed: do publications contain the data needed to validate their conclusions, and can that data be extracted? In short: is a publication reproducible?
  • If the data is available, are there methods to interactively and (semi-)automatically validate the conclusions and internal consistency?
  • What to do when data is absent: are there (statistical) ways to generate placeholder data or guess at the missing variables?
  • How can duplicate work be prevented? Are there methods to use the “collective memory” of the scientific community to improve the overall quality of research?

All of these ideas will take the post-modernist stance as outlined above.

Digging up the past

To gain insight into the problem, a survey is needed of how reproducible publications actually are. One possible way to automate such a survey would be to look at a small sub-domain of publications for which the requirements are known. Publications on Randomized Clinical Trials (RCTs) or Genome-Wide Association Studies (GWAS), for example, have, as a rule, a fairly mechanical nature. Research into Natural Language Processing could provide a method for finding out whether the set of variables required to (reasonably) reproduce a publication's conclusions is present in it.
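As a minimal sketch of how such an automated survey could start (the checklist items and patterns below are illustrative assumptions, not an established standard, and a real system would use proper NLP models rather than regular expressions):

    import re

    # Illustrative reporting items one might require before an RCT publication can be
    # reproduced; the patterns are crude placeholders for a real NLP pipeline.
    REQUIRED_ITEMS = {
        "sample_size": re.compile(r"\bn\s*=\s*\d+\b|\b\d+\s+(?:patients|participants)\b", re.I),
        "randomization": re.compile(r"\brandomi[sz](?:ed|ation)\b", re.I),
        "primary_outcome": re.compile(r"\bprimary (?:outcome|endpoint)\b", re.I),
        "effect_size": re.compile(r"\b(?:odds|risk|hazard) ratio\b|\bmean difference\b", re.I),
        "confidence_interval": re.compile(r"\b95\s*%\s*CI\b|\bconfidence interval\b", re.I),
    }

    def survey(full_text):
        """Report, per item, whether the publication mentions it at all."""
        return {item: bool(pattern.search(full_text)) for item, pattern in REQUIRED_ITEMS.items()}

    # "publication.txt" is a placeholder for the plain-text dump of one publication.
    report = survey(open("publication.txt").read())
    print(report)
    print("all required items present:", all(report.values()))

Running this over a corpus of RCT reports would give a first, rough estimate of how many publications even mention the ingredients needed to reproduce them.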

Resurrecting legacy publications

If a document does contain the information needed to reproduce its conclusions, then it should be possible to bring the document back from the dead and interactively validate, or cross-validate, its content. Borrowing from the ideas of Bret Victor, such as the Ladder of Abstraction and Explorable Explanations, it should be possible to take an arbitrary document, extract data using machine learning, and then link that data with the original tables and figures, interactively.

While people like Bret Victor have pointed out the value of building interactive documents (“Stop Drawing Dead Fish”), that work seems to come from the perspective of creating novel things. The idea here, however, is to take existing documents and somehow raise them from the dead.

What is needed for this to work is not only a simple yet empowering method for extracting the relevant input data, but also ways of reproducing the original methods. The toolkit for doing so could borrow from the ideas of visual programming, or from research efforts into experimental programming languages such as Subtext.

Using machine learning to help with the (semi-)automatic identification and annotation of key elements in publications, and then interactively re-validating the hypotheses, would open up a new perspective on old publications. This perspective would allow researchers to quickly check whether a publication is still valid within a new paradigm, or whether its conclusions also hold under new data.
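To make re-validation concrete, a minimal sketch (with invented numbers and a hypothetical extraction result) could recompute an effect estimate from a 2×2 table pulled out of a legacy publication and compare it against the reported value:

    from math import sqrt, exp

    # Hypothetical output of the extraction step: a 2x2 events table plus the effect
    # estimate as reported in the original publication (all numbers are invented).
    extracted = {
        "events_treatment": 12, "total_treatment": 100,
        "events_control": 24, "total_control": 100,
        "reported_odds_ratio": 0.43,
    }

    def odds_ratio(a, n1, c, n2):
        """Odds ratio with a 95% confidence interval (Woolf / normal approximation)."""
        b, d = n1 - a, n2 - c
        estimate = (a * d) / (b * c)
        se = sqrt(1 / a + 1 / b + 1 / c + 1 / d)
        return estimate, (estimate * exp(-1.96 * se), estimate * exp(1.96 * se))

    estimate, ci = odds_ratio(extracted["events_treatment"], extracted["total_treatment"],
                              extracted["events_control"], extracted["total_control"])
    print(f"recomputed OR: {estimate:.2f} (95% CI {ci[0]:.2f} to {ci[1]:.2f})")
    print("consistent with reported value:", abs(estimate - extracted["reported_odds_ratio"]) < 0.05)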

Patching it up

Nothing comes together as easily as it comes apart.

When data is missing, this does not need to be the end: sometimes the missing variables can be guessed at, or filled in statistically. Furthermore, sometimes the underlying data simply cannot be shared. In that case it should be possible to simulate (pseudo-)data which can be used as a placeholder.

A toolkit for versatile data generation and model-fitting could give more flexibility in interactively validating conclusions and methods.
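A toy version of such a toolkit, assuming only reported summary statistics are available (the numbers and distributional choices below are purely illustrative), might generate placeholder data like this:

    import numpy as np

    rng = np.random.default_rng(42)  # fixed seed so the placeholder data is itself reproducible

    def placeholder_continuous(mean, sd, n):
        """Simulate a continuous variable from its reported mean and standard deviation."""
        return rng.normal(mean, sd, size=n)

    def placeholder_binary(proportion, n):
        """Simulate a binary variable from its reported event proportion."""
        return rng.binomial(1, proportion, size=n)

    # Invented summary statistics standing in for what a publication might report.
    age = placeholder_continuous(mean=62.4, sd=9.1, n=250)
    smoker = placeholder_binary(proportion=0.31, n=250)
    print("simulated mean age:", round(float(age.mean()), 1))
    print("simulated smoking rate:", round(float(smoker.mean()), 2))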

Conversing with the dead

To prevent duplicate work and use the collaborative nature of the Internet one could look at methods for leaving comments for other researchers.

Monks studying the Bible left comments in the margins of the page, resulting in marginalia. Since the printing press had not yet been invented, books were often handed down to other monks, who would then read and comment on both the original text and the earlier marginalia, resulting in a theological debate in the margins. Recently, efforts have been made to bring this idea to the web by projects such as Factlink and hypothes.is. These efforts could provide valuable tools for academic publishing as well, allowing readers to identify issues and scribble in the margins in a collaborative way.

This contrasts with efforts by, for example, PubMed and Mendeley (Elsevier) to provide Facebook-like comment systems on publications. Those efforts require changes by the publishers themselves, while the idea behind marginalia is more post-modern: a layer on top of the documents, without requiring any direct changes.

When these marginalia are persisted in a durable way, they can provide a platform for researchers to leave tips and warnings for other researchers, maybe even decades into the future. While perhaps slightly Utopian, the hope is that these marginalia will prevent duplicate effort by allowing researchers to collaboratively note why certain conclusions are correct or incorrect.
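As a sketch of what a durable marginal note could look like as data, loosely inspired by the selector-based approach taken by hypothes.is and the W3C Web Annotation model (the identifiers and fields below are assumptions for illustration only):

    import json
    from datetime import datetime, timezone

    # A marginal note stored as a layer on top of the document: it points at the
    # publication by identifier and anchors the comment to an exact text quote,
    # so the publisher's copy never needs to change.
    annotation = {
        "target": {
            "source": "https://doi.org/10.1000/example",  # hypothetical DOI
            "selector": {"type": "TextQuoteSelector",
                         "exact": "p < 0.05 for the primary outcome"},
        },
        "body": "The reported p-value does not match Table 2; see the re-analysis linked here.",
        "creator": "orcid:0000-0000-0000-0000",  # placeholder ORCID
        "created": datetime.now(timezone.utc).isoformat(),
    }
    print(json.dumps(annotation, indent=2))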

How to move forward

Just as the options for dealing with the past are plentiful, we can also be hopeful about the future. Living and interactive documents need no longer be an afterthought, but can become a method in themselves.

Entanglement of data, code and publications

An article about computational science in a scientific publication is not the scholarship itself, it is merely advertising of the scholarship. The actual scholarship is the complete software development environment and the complete set of instructions which generated the figures. — Jon Claerbout

Literate programming
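A minimal, hedged sketch of the idea in notebook-style Python, where the prose, the analysis, and the code that generates a figure live in a single document (the measurements and file names are invented):

    # %% [markdown]
    # ## Effect of treatment on blood pressure
    # The figure below is generated directly from the raw measurements, so the
    # narrative, the analysis, and the plot cannot silently drift apart.

    # %%
    import matplotlib.pyplot as plt
    import numpy as np

    # Invented measurements standing in for the study's raw data.
    control = np.array([142, 138, 150, 145, 139])
    treatment = np.array([128, 131, 125, 134, 129])

    # %% [markdown]
    # The mean difference quoted in the text is computed here, never copied by hand.

    # %%
    print(f"mean difference: {treatment.mean() - control.mean():.1f} mmHg")
    plt.boxplot([control, treatment])
    plt.xticks([1, 2], ["control", "treatment"])
    plt.ylabel("systolic blood pressure (mmHg)")
    plt.savefig("figure_1.png")  # the figure in the publication is exactly this artefact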

Tracing data

When interactive documents are combined with methods for keeping track of data, such as semantic web technology or functional databases, it becomes possible to create a timeless narrative of research.

For this to work it is imperative that the underlying data has a unique identifier such as a hash or time stamp. Otherwise, since most (bio-medical) data comes from shielded repositories, one could trivially change the data without anyone noticing, fudging the results as one sees fit.

You could argue that adding the underlying data as a supplement to the publication would be enough. Unfortunately, such links tend to suffer from link rot:

A number of studies have examined the prevalence of link rot on the web, in academic literature, and in digital libraries. In a 2003 experiment, Fetterly et al. discovered that about one link out of every 200 disappeared each week from the internet. McCown et al. (2005) discovered that half of the URLs cited in D-Lib Magazine articles were no longer accessible 10 years after publication, and other studies have shown link rot in academic literature to be even worse (Spinellis, 2003; Lawrence et al., 2001). Nelson and Allen (2002) examined link rot in digital libraries and found that about 3% of the objects were no longer accessible after one year.

So the underlying data for living documents should come from a repository that provides addresses and content that are guaranteed never to change (embrace immutability). Hashing and checksum mechanisms, or version control systems such as git or darcs, could provide guarantees to this end.
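A minimal sketch of what content-addressing a dataset could look like, using a plain SHA-256 digest (the file name is a placeholder):

    import hashlib
    from pathlib import Path

    def content_id(path):
        """Derive an identifier from the bytes of a data file: if the data changes,
        the identifier changes, so a publication can pin the exact version it used."""
        digest = hashlib.sha256(Path(path).read_bytes()).hexdigest()
        return "sha256:" + digest

    # Placeholder file name; in practice this would be the dataset behind a table or figure.
    print(content_id("trial_data.csv"))

The resulting digest could then be cited in the publication and, for example, signed to attest who published which version of the data.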

While standards are problematic, it should be possible to agree upon a standardized format for tracking and signing data. Most of the utilities for tracking and signing could come from standardized cryptographic systems such as GPG.

Perspective

These ideas are not novel. ScienceDirect has experimented with literate programming in their publications, IPython has already been discussed as a vessel for reproducible science, and the large-scale Reproducibility Initiative provides tools and support for making research reproducible.

Although academia rewards new and exciting work more than it rewards consolidating what is already known, this seems to be a matter of “the more the merrier”.

Scope & practicalities

Realizing the outlined vision will require the development of systems for large scale classification and data extraction, as well as methods for annotation and interactive (re)validation.

The relevant areas of research herein are:

  • Bayesian Hierarchical models
  • Machine Learning
  • Semantic Web Technology (ontologies, RDF, and OWL)
  • Data visualization
  • Natural Language Processing
  • Information Retrieval / Data mining

In addition to research publications in the relevant areas, usable prototypes should be delivered.

The complete research project will be a joint venture between BioSHaRE (genetics) and ADDIS (epidemiology). Therefore the development of prototypes will be a collaboration with both the ongoing BioSHaRE (MOLGENIS and Opal) and ADDIS 2.0 projects.

It would be desirable to develop the prototypes as independent modules that can be used with both the BioSHaRE and the ADDIS tool suites, maximizing their utility as well as the base of supporting software developers.

The functionality should therefore be exposed in a flexible way, through REST APIs (with JSON or XML data), for use in MOLGENIS, Opal, or Mica, as well as in ADDIS 2.0 (a clinical-trial-oriented tool suite).
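As a rough sketch of what exposing one such module over REST could look like (using Flask purely as an example; the endpoint name and payload are invented and not tied to the actual MOLGENIS, Opal, Mica, or ADDIS APIs):

    from flask import Flask, jsonify, request

    app = Flask(__name__)

    @app.route("/reproducibility-check", methods=["POST"])
    def reproducibility_check():
        """Accept a publication's full text as JSON and report which items were found."""
        text = request.get_json()["text"]
        # A real module would call the extraction/classification pipeline here;
        # this stub only illustrates the JSON shape of the response.
        return jsonify({
            "sample_size": "n = " in text,
            "randomization": "randomiz" in text.lower() or "randomis" in text.lower(),
        })

    if __name__ == "__main__":
        app.run(port=5000)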

Where needed, single sign-on authentication can be provided to both parties in the form of OAuth and OpenID, possibly facilitated by the ORCID initiative.

The goal is not only to advance knowledge of the research topics outlined above, but also to attempt to improve the scientific method itself, hopefully resulting in a streamlined application suite for online, interactive data integration and validation.