The silliest of problems, fuzzy text alignment

Lately I have been working on Spá, a system which aims to give researchers a flexible way of doing machine-assisted data extraction from full-text PDFs. Its raison d’etre is that scientific knowledge only gets published as unstructured journal articles; but, researchers are usually only interested in specific bits of information.

Especially in the biomedical sciences publishing findings only as journal articles poses a huge problem. Because now to figure out if a certain gene mutation causes cancer, or whether a certain drug is really effective, you need to do a systematic review. Which, and I’m simplifying, means: Google (or PubMed) for keywords, download a bunch of PDFs, read and screen the documents for relevance and extract the elements of interest in a structured format for further (statistical) analysis. Thus literally millions of person-hours each year are spend (wasted, really) getting structured information from unstructured publications.

So, I like to think that Spá helps to solve a problem which should not exist in the first place: structured knowledge dissemination. The way Spá does this is by providing a front-end for various machine learning techniques. A user can upload documents of interest in projects, manually extract the relevant data and export it in a structured format such as RDF with W3C Open Annotation and PROV (optionally directly embedded in the original document).

More importantly, the documents also get send to services which can run all sorts of machinery for extracting relevant information. It takes the output from those services and highlights the relevant phrases, or provides comments in the sidelines (marginalia). The hope is that by providing a easy-to-use system for doing both machine-learning and manual extraction it can speed up the tedious task of extracting information from documents.


To make this philosophy work Spá uses Mozilla pdf.js, a library for rendering PDF documents directly in the browser. Out of the box pdf.js does two things: it renders the PDF to <canvas>, and it extracts the text as a series of <div> elements to provide text selection (called the textLayer). For Spá the textLayer was rewritten entirely in Facebook React to give a smooth experience for creating and displaying annotations.

But nothing is ever easy. Because the different services only receive the binary PDF they must do their own text extraction. And their text extraction may not be congruent with the one pdf.js offers (it’s more an art than a science to provide reliable text extraction). Currently it ignores this problem and just does exact string matching, however I hope to outline some of the possible solutions and technical nuisances.

Let’s introduce some formalisms: we’re interested in finding a string \(\mathcal{S_1}\) in a document \(\mathcal{D}\). \(\mathcal{D}\) consists of a series of variable length substrings \(\mathcal{D} = \{s_1,\dots,s_n\}\). To highlight string \(\mathcal{S_1}\) we need two things: the list of matching substrings in \(\{s_\alpha,\dots,s_\beta\} \subseteq \mathcal{D}\) and the offset in the first and last element.


In the example above we would need to find \(s_1,s_2,s_3\) with the start and end offsets for \(s_1\) and \(s_3\) to highlight “brown fox” in the document. Solving this problem is trivial: just concatenate all the strings in \(\mathcal{D}\) into \(S_2\) while simultaneously keeping a list of offsets. When matching the string find the offset using an exact string matching algorithm \(\texttt{match}(S_1, S_2)\), find all overlapping substrings, and add the offsets for the first and last one.

Unfortunately because \(S_1\) may be produced by a different serialization mechanism, the problem looks like this:


And there is no way of telling how the strings differ (although likely in whitespace), or if there even is an equivalent match. To solve this problem we must rely on approximate string matching, such as the bitap algorithm. However bitap (and derivatives) require a “best guess” of where in the document the string is likely to be found. To find this best guess we introduce a sliding window view to \(S_1\), essentially we try to find a substring that does exactly match. So when the exact match on brown fox fails, we attempt in order: brown fo, rown fox, brown f, rown fo, own fox … Until a lower threshold of string length (then we give up), or a match is found. The offset of that match then gets fed into the bitap algorithm, and we can run the resulting string against the original exact match algorithm.

This hopefully solves the problem of fuzzy string matching.

There is however another, similar, problem to be addressed. Occasionally there are multiple occurrences of “brown fox” in the document. Spá now highlights them all, but ideally we would like a unique reference. A reasonable strategy to do this is by “flanking” the string to be matched with some context. This increases the likelihood that the string is unique in the document. Unfortunately this will only make the problem of exact string matching harder, since the string to match will now be longer.

Another strategy is to partition the document into blocks, and provide in addition to the string a list of indexes to the blocks in which it was present. To create these blocks one could partition the text into equal length parts, and when the serialization on both server and client are (roughly) the same one only needs to run the matching on those blocks. Unfortunately there is no guarantee on the order of the blocks, or indeed the serialization algorithm.

An interesting solution to this is using Locality Sensitive Hashing (LSH) to provide access to the blocks, the technique employed by Sidenotes for a similar problem. LSH provides a hash that is non-sensitive to small changes. Where a cryptographic hash function would result in a completely different hash for each input, the idea of LSH is to cluster similar inputs in the same hash bucket.

Locality-sensitive hashing (LSH) is a method of performing probabilistic dimension reduction of high-dimensional data. The basic idea is to hash the input items so that similar items are mapped to the same buckets with high probability (the number of buckets being much smaller than the universe of possible input items).

Then, when a direct match can not be found a nearest-neighbor search can be used to find “the closest input”. Although tricky to implement client-side, I’ll likely be exploring this option for Spá.