joelkuiper.eu

Whatever happened to Semantic Web?

A very short introduction

Semantic Web?

Let’s talk about the semantic web of the early ’00s (the noughties?): the Resource Description Framework (RDF). It promised to solve all the problems of knowledge representation and web scalability. However, while the techniques still have rich applications in many of the biomedical sciences, they rather quickly died out in the mainstream web development community.

It took me far longer than I care to admit to finally “get it”. The point of all these technologies seems a bit strange at first. But, after the initial weirdness wears off you start to forgive its warts, and start to wonder: why aren’t more people talking about this? It’s especially strange since things like Graph Stores are regular headlines on the mainstream tech sites, yet nobody mentions the fact that we’ve already solved this once. So, I would like to say it’s at least fun and interesting to consider the history and technology of the semantic web a bit.

Graphs & Triples

RDF is a standard model for data interchange on the Web. RDF has features that facilitate data merging even if the underlying schemas differ, and it specifically supports the evolution of schemas over time without requiring all the data consumers to be changed.

RDF extends the linking structure of the Web to use URIs to name the relationship between things as well as the two ends of the link (this is usually referred to as a “triple”). Using this simple model, it allows structured and semi-structured data to be mixed, exposed, and shared across different applications.

This linking structure forms a directed, labeled graph, where the edges represent the named link between two resources, represented by the graph nodes [1]. This graph view is the easiest possible mental model for RDF and is often used in easy-to-understand visual explanations. – http://www.w3.org/RDF/

Depending on your background that may, or may not, have been very helpful. Let’s dissect it a bit. RDF is about:

  • data from different sources
  • dealing with data that changes over time
  • representing data as URIs (Uniform Resource Identifiers, addresses)
  • structuring links between elements of data
  • these links are described by triples
  • and, the triples form a directed labeled graph
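
To make the first few points concrete, here is a minimal sketch in plain Python (no RDF library). It treats a triple as a tuple and a graph as a set of tuples, so that merging two sources is simply a set union. All the example.org URIs and the names merge and describe are made up for illustration.

```python
# A triple is (subject, predicate, object); a graph is a set of triples.
# Every URI below is a hypothetical placeholder under example.org.
ALICE = "http://example.org/people/alice"

SOURCE_A = {
    (ALICE, "http://example.org/vocab/name", "Alice"),
    (ALICE, "http://example.org/vocab/knows", "http://example.org/people/bob"),
}

# A second source that uses a different vocabulary for part of its data.
SOURCE_B = {
    (ALICE, "http://example.org/vocab/name", "Alice"),
    (ALICE, "http://example.org/other-vocab/employer", "http://example.org/orgs/acme"),
}


def merge(*graphs):
    """Merge graphs by set union; identical triples collapse into one."""
    merged = set()
    for graph in graphs:
        merged |= graph
    return merged


def describe(graph, subject):
    """Return every (predicate, object) pair known about a subject, sorted."""
    return sorted((p, o) for s, p, o in graph if s == subject)


MERGED = merge(SOURCE_A, SOURCE_B)

for predicate, obj in describe(MERGED, ALICE):
    print(predicate, "->", obj)
```

Because both sources use the same URI for the same thing, nothing needs to be renamed or migrated: the facts simply end up side by side.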

So far, so good. Let’s try to follow their advice and create a visual representation of RDF, to get a better mental picture. Say I wish to represent the fact that this is my homepage. We would need three things: the address of this homepage, the address that encodes “me” (e.g. an email address, OpenID, full name, etc.), and a third address that represents the homepage relationship (together these form the triple). The third requirement seems the most complicated. What’s the address for encoding the homepage relationship? Turns out there are arbitrarily many; you can even make up your own (that’s part of the power). Many standards for encoding this relationship have been created. We’ll pick FOAF (short for Friend of a Friend), since it’s the most commonly used. The idea behind FOAF is not that different from Facebook Open Graph, except that it’s a standard and was built around 1999. In FOAF the address for encoding the relationship is http://xmlns.com/foaf/0.1/homepage. Visiting the link brings you to the specification page [2]. The final graph looks like:

simple.png

Now this isn’t terribly interesting, but it shows something important: we can encode the relationship between three addressable entities as a graph by defining a triple.

Drawing pictures quickly grows tiresome, even with GraphViz, so we might want to serialize this relationship in a more compact way. At this point haunting flashbulb memories of endless XML might occur, but fear not! There are many semantically equivalent ways to serialize RDF; at its core it is just a set of addresses with some properties. Yes, admittedly, XML is used a lot. But formats like Turtle (TTL), JSON-LD, and N3 have gotten more widespread recognition as well. We’ll focus on TTL here, since I find it the most accessible. But you can pick any serialization format, and it won’t change the semantics, just the syntax. RDF is not XML.

In TTL the graph above looks like this:

<mailto:me@joelkuiper.eu> <http://xmlns.com/foaf/0.1/homepage> <https://joelkuiper.eu> .

Now if I want to encode other information about myself I could expand it in the following way:

<mailto:me@joelkuiper.eu>
  <http://xmlns.com/foaf/0.1/homepage> <https://joelkuiper.eu> ;
  <http://xmlns.com/foaf/0.1/givenName> "Joel" ;
  <http://xmlns.com/foaf/0.1/familyName> "Kuiper" .

Notice that you can also use simple strings instead of addresses (called “literals”). The spec gives much more information about, for example, specifying the human language, (XML) types, or how to group properties together. One thing that is a bit annoying is the repetition of the addresses. Luckily, that can be fixed by introducing a prefix.

@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

<mailto:me@joelkuiper.eu>
  a foaf:Person ;
  foaf:homepage <https://joelkuiper.eu> ;
  foaf:givenName "Joel" ;
  foaf:familyName "Kuiper" .

In the example above I’ve also taken the liberty of including two common prefixes, and defining myself to be a foaf:Person (or rather, my email address URI). Visualizing the triples above would look something like:

simple2.png

Now at this point you might be tempted to go “yes, all very interesting, but eehm, I’m going somewhere else now”. Well, I think it is kinda interesting that unordered sets of triples form an isomorphism [3] with directed labeled multi-graphs; and that URIs can be used to represent semantic information. But stopping there would be a mistake. The best bits are still to come.
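
The correspondence between a set of triples and a directed labeled multigraph can be checked with a small round trip. This is only a sketch in plain Python, not how a real RDF library stores things; the function names to_multigraph and to_triples are made up. The data is the FOAF example from above, with a written out as rdf:type.

```python
from collections import defaultdict

FOAF = "http://xmlns.com/foaf/0.1/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"  # what "a" abbreviates
ME = "mailto:me@joelkuiper.eu"

TRIPLES = {
    (ME, RDF_TYPE, FOAF + "Person"),
    (ME, FOAF + "homepage", "https://joelkuiper.eu"),
    (ME, FOAF + "givenName", "Joel"),
    (ME, FOAF + "familyName", "Kuiper"),
}


def to_multigraph(triples):
    """Map each subject node to its outgoing (edge label, target node) pairs.

    Nodes without outgoing edges only ever show up as targets.
    """
    graph = defaultdict(set)
    for subject, predicate, obj in triples:
        graph[subject].add((predicate, obj))
    return dict(graph)


def to_triples(graph):
    """Flatten the adjacency structure back into a set of triples."""
    return {(s, p, o) for s, edges in graph.items() for p, o in edges}


GRAPH = to_multigraph(TRIPLES)

# Going to the graph view and back loses nothing.
print(to_triples(GRAPH) == TRIPLES)
```

Several edges with different labels may connect the same two nodes, which is why the graph is a multigraph rather than a plain graph.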

Ontologies

Questions about the true nature of knowledge are still a lively topic within philosophy, but most computer scientists are oblivious to them. This is fine, and it sometimes gives interesting abstractions of things. While philosophers might still be pondering the nature of the thing in itself and whether we have, in any Kantian sense, access to its properties, the bioinformaticians have coded up a vast taxonomy of all the genes and their relations to diseases (GO, OGG). Sites like BioPortal provide a window onto the wealth of information encoded in RDF ontologies.

Similarly, while epistemological musings about Wittgenstein and the nature of language games are fascinating, the Natural Language Processing community pretty much went ahead and encoded enormous amounts of dictionary knowledge into WordNet.

The information retrieval(-istic?) notion of an ontology might not be as interesting (in some sense) as the philosophical notion, but it does give access to truly mind-boggling amounts of human-curated data, often available for free (sometimes as in beer, usually as in speech). And most of it is retrievable as RDF.

So what do you do with all this RDF? Stick it on a MiniDisc and send it back to 2000 (yes, you can store data on those)? Well again, that would be a mistake. But to show the power of these ontologies we have to leave the comfort zone of words, and actually get our hands dirty.

In practice

Triple Stores

Since we’re dealing with triples, we need a way to store them. Conveniently they invented Triple Stores alongside triples. Triple Stores can store the triples in a highly efficient and indexed manner [4].

There are a couple of good triple stores out there, but this is also where the warts begin. In the semantic web world the specification often doubles as the documentation, but not all implementations have implemented the specs the same way (sound familiar?). Apache Jena/Fuseki is considered to be the reference implementation, so that’s what we’ll be using; I suggest you do too.

Installation & set-up

Download Apache Jena Fuseki 2.0 from one of the mirrors. It’s a Java app, so you’ll need to have a recent version of the JDK. Once you’ve downloaded the archive and extracted it somewhere you can run the server with cd <your folder> && ./fuseki-server. Apache Jena and Fuseki are two different things. Jena is the actual store and engine. You can include it as a Java (or Clojure) package to create Triple Stores and run queries. Fuseki provides the HTTP access by generating SPARQL endpoints. We’ll be using the Fuseki graphical user interface. Once it is started, it will listen on port 3030 by default. You should be greeted with this page:

screenshot-first.png

You just created your first SPARQL endpoint. It rather helpfully tells you that you haven’t loaded any data. We’ll be using an example from the biomedical domain, since I’m most familiar with that area. But, these ideas are widely applicable.

Let’s download the Online Mendelian Inheritance in Man (OMIM) ontology from BioPortal as Turtle (TTL). This ontology encodes inheritance data for various genes and diseases, and what symptoms (manifestations) they might have.

First create a new in-memory dataset called example (or whatever you like), then upload the TTL file you just downloaded to it. Leave the graph field blank; we’ll get back to that. Fuseki should tell you that it loaded 949628 triples (or something around that number, depending on your version).

screenshot-second.png

Now we’re ready to enter the wonderful world of SPARQL Protocol and RDF Query Language (SPARQL, yes it’s recursive).

SPARQL

SPARQL allows for a query to consist of triple patterns, conjunctions, disjunctions, and optional patterns. – https://en.wikipedia.org/wiki/SPARQL

We already considered that graphs and triples are related. But triples and graphs are also related to first-order predicate logic. Each triple encodes a subject-predicate-object relation: \(P(S, O)\). For example, the homepage relation: \(\text{foaf:homepage}(uri, uri)\). This view provides an interesting synergy with logic programming (like Prolog and miniKanren). By providing patterns of triples in predicate logic, we can apply unification to find matching sets of triples. Essentially, we declaratively describe what the solution should look like, then search for any set of triples that matches that description. So what does that look like in practice? Well, let’s open up the query tab!

The default query looks like this:

SELECT ?subject ?predicate ?object
WHERE {
  ?subject ?predicate ?object
}
LIMIT 25

This basically matches any 25 triples in the store. When executed, it will show you a rather unhelpful list of triples.

screenshot-anything.png

But we can constrain our query to be more informative. What if we’re only interested in genes (or “Gene with known sequence” in the ontology)? Well, in that case we would have to know the predicate equivalent of “is of type” and the object “is a gene”. Note that these two will be URIs, since that’s what the semantic web is built upon.

Let’s do this in steps:

PREFIX omim: <http://purl.bioontology.org/ontology/OMIM/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

SELECT ?subject
WHERE {
  ?subject omim:MIMTYPE "1"^^xsd:string
}
LIMIT 5

First I defined two prefixes, which allow me to use shorthands instead of writing out the full URIs each time. Next I constrained the query to select all the subjects that have omim:MIMTYPE 1. This type is encoded as Gene with known sequence. Unfortunately, the authors of OMIM decided to include XML type information with their literals, hence the stupid ^^xsd:string, which indicates that 1 is of the XSD type string. Usually this is not much of an issue, but now you’re aware of it.

This gives a list of URIs like:

subject
http://purl.bioontology.org/ontology/OMIM/615305
http://purl.bioontology.org/ontology/OMIM/615304
http://purl.bioontology.org/ontology/OMIM/615307
http://purl.bioontology.org/ontology/OMIM/615306
http://purl.bioontology.org/ontology/OMIM/615301

You can click them and it will take you to the relevant gene information. But we can do better! What if we want to know the gene symbols, instead of just the URIs?

Well, we’d need to constrain our SPARQL even further.

PREFIX omim: <http://purl.bioontology.org/ontology/OMIM/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

SELECT ?subject ?gene
WHERE {
  ?subject omim:MIMTYPE "1"^^xsd:string ;
    omim:GENESYMBOL ?gene .
}
LIMIT 5

subject                                           gene
http://purl.bioontology.org/ontology/OMIM/615305  TRNAR2
http://purl.bioontology.org/ontology/OMIM/615304  TRNAV32
http://purl.bioontology.org/ontology/OMIM/615307  TRNAV12
http://purl.bioontology.org/ontology/OMIM/615306  TRNAV21
http://purl.bioontology.org/ontology/OMIM/615303  TRNAG3

Nice. Notice the punctuation. By using the ; I didn’t have to re-specify the subject.
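
The pattern-matching view from the start of this section can be made concrete with a toy matcher. This is a rough sketch of the idea in plain Python, not how Jena evaluates queries: terms starting with ? are variables, and a query is a list of triple patterns that must all match with consistent bindings. The ex: identifiers and the data are invented, and the names is_var, unify and match are my own.

```python
def is_var(term):
    """Variables are written like SPARQL variables: "?name"."""
    return isinstance(term, str) and term.startswith("?")


def unify(pattern, triple, bindings):
    """Extend bindings so that pattern equals triple, or return None."""
    result = dict(bindings)
    for p_term, t_term in zip(pattern, triple):
        if is_var(p_term):
            # Bind the variable, or check it against its earlier binding.
            if result.setdefault(p_term, t_term) != t_term:
                return None
        elif p_term != t_term:
            return None
    return result


def match(patterns, triples, bindings=None):
    """Yield every set of bindings that satisfies all patterns at once."""
    bindings = {} if bindings is None else bindings
    if not patterns:
        yield bindings
        return
    first, rest = patterns[0], patterns[1:]
    for triple in triples:
        extended = unify(first, triple, bindings)
        if extended is not None:
            yield from match(rest, triples, extended)


# Hypothetical data, loosely shaped like the gene example.
TRIPLES = [
    ("ex:g1", "ex:type", "gene"),
    ("ex:g1", "ex:symbol", "ABC1"),
    ("ex:g2", "ex:type", "gene"),
    ("ex:g2", "ex:symbol", "XYZ2"),
    ("ex:d1", "ex:type", "disease"),
    ("ex:d1", "ex:symbol", "DIS3"),
]

# Two patterns sharing ?subject, which is what the ";" shorthand expresses.
QUERY = [
    ("?subject", "ex:type", "gene"),
    ("?subject", "ex:symbol", "?symbol"),
]

RESULTS = sorted((b["?subject"], b["?symbol"]) for b in match(QUERY, TRIPLES))
print(RESULTS)
```

The shared ?subject variable is what joins the two patterns: a binding made by the first pattern has to survive unification with the second.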

Now for something a little more interesting. What if I want to know the diseases associated with these genes?

PREFIX omim: <http://purl.bioontology.org/ontology/OMIM/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

SELECT ?disease ?gene
WHERE {
  ?subject omim:MIMTYPE "3"^^xsd:string ;
    omim:GENESYMBOL ?gene ;
    skos:prefLabel ?disease .
}
LIMIT 5

disease                                           gene
CRANIOSYNOSTOSIS 3                                TCF12
CRANIOSYNOSTOSIS 3                                CRS3
CRANIOSYNOSTOSIS 3                                HTF4
BONE MINERAL DENSITY QUANTITATIVE TRAIT LOCUS 17  GPR48
BONE MINERAL DENSITY QUANTITATIVE TRAIT LOCUS 17  BNMD17

Well, we got lucky here, since that data is already encoded in the ontology. I would love to tell you about things like property paths (which allow you to match arbitrary-length sequences of edges in the graph) and reasoning engines, but then this would no longer be a very short introduction. Take a look at this excellent intro for more SPARQL, or at these slides and examples.

Quads: graphs redux

Triples can be extended with a fourth element, called the “graph” (sometimes written in SPARQL as ?g). This allows triples to be grouped into disjoint graphs (so everything doesn’t get dumped in one big uber-graph). If you’re using this option then it’s natural to refer to the constituents as quads, instead of triples. Essentially, graphs work a bit like “namespaces”.
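
As a rough illustration of the namespace analogy, here is a sketch in plain Python where a quad is a 4-tuple and the last element names the graph. The ex: identifiers, the graph names, and the helper functions are all hypothetical.

```python
from collections import defaultdict

# Hypothetical quads: (subject, predicate, object, graph).
QUADS = [
    ("ex:alice", "ex:knows", "ex:bob", "ex:graph/social"),
    ("ex:alice", "ex:worksAt", "ex:acme", "ex:graph/work"),
    ("ex:bob", "ex:worksAt", "ex:acme", "ex:graph/work"),
]


def group_by_graph(quads):
    """Split quads into separate sets of triples, one per graph name."""
    graphs = defaultdict(set)
    for subject, predicate, obj, graph in quads:
        graphs[graph].add((subject, predicate, obj))
    return dict(graphs)


def triples_in(quads, graph_name):
    """Return only the triples that live in the named graph."""
    return {(s, p, o) for s, p, o, g in quads if g == graph_name}


for name, triples in sorted(group_by_graph(QUADS).items()):
    print(name, len(triples))
```

Restricting a query to one graph then amounts to filtering on that fourth element before any triple pattern is matched.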

A word on text search

A common use case is to search for a specific subject or object by name. You can do this with a regular expression in a FILTER clause, but this is awfully slow. Fortunately Jena/Fuseki offers out-of-the-box integration with Lucene, which also powers Elasticsearch and Solr. It requires some determination to set up, but the tutorial on the Apache site should get you started!

A word on scalability

While we simply used the GUI for Fuseki here, it is in fact a fully fledged database. It allows for data export in different formats, and queries over HTTP (SPARQL works over HTTP). You can even define your entire database as RDF. Many have tried, and many have failed. RDF is not suited for binary data (you can store base64-encoded strings, but that’s silly). And anything that hinges on ordering is likely to be problematic, since RDF does not offer any convenient way of storing lists.
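
To see why lists are awkward, consider how an ordered list ends up as triples. RDF’s standard vocabulary has rdf:first, rdf:rest and rdf:nil for this, which amounts to a hand-rolled linked list. The sketch below is plain Python; the _:node labels stand in for blank nodes, and encode_list and decode_list are names I made up.

```python
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def encode_list(items, prefix="_:node"):
    """Encode a Python list as linked-list cells; return (head, triples)."""
    if not items:
        return RDF + "nil", []
    nodes = [f"{prefix}{i}" for i in range(len(items))]
    triples = []
    for i, item in enumerate(items):
        # The last cell points at rdf:nil instead of another cell.
        rest = nodes[i + 1] if i + 1 < len(items) else RDF + "nil"
        triples.append((nodes[i], RDF + "first", item))
        triples.append((nodes[i], RDF + "rest", rest))
    return nodes[0], triples


def decode_list(head, triples):
    """Follow rdf:rest links from head, collecting rdf:first values."""
    first = {s: o for s, p, o in triples if p == RDF + "first"}
    rest = {s: o for s, p, o in triples if p == RDF + "rest"}
    items = []
    while head != RDF + "nil":
        items.append(first[head])
        head = rest[head]
    return items


HEAD, LIST_TRIPLES = encode_list(["a", "b", "c"])
print(len(LIST_TRIPLES), decode_list(HEAD, LIST_TRIPLES))
```

Three items cost six triples, and reading the third item means walking the chain from the head, which is why order-dependent data tends to be painful in a triple store.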

If you do want to enjoy the flexibility of logic programming over graphs, and make use of the vast amounts of knowledge present in existing ontologies, a dual-database setup is a common choice. You define your data in a classic SQL store like PostgreSQL, and use URIs as indexed keys. Those URIs can then be inserted in a triple store and reasoned about as semantic web data.
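
A minimal sketch of that dual set-up, under heavy assumptions: Python’s built-in sqlite3 stands in for PostgreSQL, a plain set of tuples stands in for the triple store, and the table, the example.org URIs and the function patients_with_disease are all invented for the example.

```python
import sqlite3

# Relational side: ordinary rows, keyed by URI (sqlite3 stands in for PostgreSQL).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE patient (uri TEXT PRIMARY KEY, name TEXT, gene_uri TEXT)"
)
conn.executemany(
    "INSERT INTO patient VALUES (?, ?, ?)",
    [
        ("http://example.org/patient/1", "Patient One", "http://example.org/gene/A"),
        ("http://example.org/patient/2", "Patient Two", "http://example.org/gene/B"),
    ],
)

# Semantic side: triples about the same URIs (a set stands in for the triple store).
ASSOCIATED = "http://example.org/vocab/associatedWith"
TRIPLES = {
    ("http://example.org/gene/A", ASSOCIATED, "http://example.org/disease/X"),
    ("http://example.org/gene/B", ASSOCIATED, "http://example.org/disease/Y"),
}


def patients_with_disease(disease_uri):
    """Ask the graph for matching genes, then look up the rows by URI."""
    genes = [s for s, p, o in TRIPLES if p == ASSOCIATED and o == disease_uri]
    if not genes:
        return []
    placeholders = ",".join("?" * len(genes))
    rows = conn.execute(
        f"SELECT name FROM patient WHERE gene_uri IN ({placeholders}) ORDER BY name",
        genes,
    )
    return [name for (name,) in rows]


print(patients_with_disease("http://example.org/disease/X"))
```

The URI is the only thing the two stores need to agree on, so each side can keep doing what it is good at.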

The promise, the dream

According to the W3C, “The Semantic Web provides a common framework that allows data to be shared and reused across application, enterprise, and community boundaries”. The term was coined by Tim Berners-Lee for a web of data that can be processed by machines. While its critics have questioned its feasibility, proponents argue that applications in industry, biology and human sciences research have already proven the validity of the original concept.

The 2001 Scientific American article by Berners-Lee, Hendler, and Lassila described an expected evolution of the existing Web to a Semantic Web. In 2006, Berners-Lee and colleagues stated that: “This simple idea … remains largely unrealized”. — https://en.wikipedia.org/wiki/Semantic_Web

The Semantic Web truly defines an interesting set of ideas, and has given birth to many well thought-out specifications. For example, Open Annotation allows annotations of any resource, anywhere. W3C PROV allows for provenance, which deals with the increasing issue of “where the heck did this come from?”. And the many large ontologies provide a wealth of information. It would be sad if these techniques were truly forgotten.

Footnotes:

[1] a labeled, directed multigraph

[2] which means the address is dereferenceable, which is not always the case

[3] am I using this right? My vague memory from Gödel, Escher, Bach says yes

[4] not all of them do; there is a lot of “academic” code out there