The Semantic Web: What and Why


Posted in Semantic Web, Software Development, Technology on November 8, 2011

Sir Tim Berners-Lee coined the term “semantic web” and defined it as “a web of data that can be processed directly and indirectly by machines.” However, this statement is somehow both precise and vague, which is probably why its definition on Wikipedia is scattered and rambling. Since I started studying the subject, I’ve found myself trying to explain it to various people, usually with limited success.

A good way to practically define it, if not exactly explain it, is by listing specific technologies that fall under the semantic web moniker: RDF, RDFS, OWL, SPARQL, XML.  These are among the core technologies that facilitate the semantic web: frameworks and specifications that enable machines to process a web of data.  Understanding these technologies gets you close to understanding what the semantic web is, but it still doesn’t explain it, especially to a non-technical person.  If these technologies are to reach their full potential, non-technical people need to understand what the heck this semantic web is all about. Investors, business owners, managers, and even hobbyists could get great value out of this technology, but they won’t invest time or resources in a concept they don’t understand.  Having defined it twice already, I think the question still remains: what is the semantic web?

Buried in the Wikipedia entry is a good description: “The semantic web is a vision of information that can be readily interpreted by machines, so machines can perform more of the tedious work involved in finding, combining, and acting upon information on the web.” Terabytes of data are published on the web daily, but the vast majority of it is formatted for human consumption.  Machines can read the content, but they can’t relate it to other content or put it in context, at least not without a human writing a program to do those things.

If you are a software developer, you are likely familiar with relational databases; most modern business applications store their data in one. The reason is that they provide awesome data fidelity: the program using the data can “understand” how items of data relate. If I’m building a music distribution application, I likely have individual items of data to represent a song, an artist, and an album. With a relational data store, my application can know that an album is a collection of songs and that an album is written by an artist.
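
To make that concrete, here is a minimal sketch of what such a schema might look like. The table and column names are my own invention for illustration; the point is that the relationships live in the schema, so the application can traverse them.

    import sqlite3

    # A minimal sketch of the music domain as a relational schema.
    # Table and column names are invented for illustration.
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    CREATE TABLE artist (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE album  (id INTEGER PRIMARY KEY, title TEXT, year INTEGER,
                         artist_id INTEGER REFERENCES artist(id));
    CREATE TABLE song   (id INTEGER PRIMARY KEY, title TEXT, track INTEGER,
                         album_id INTEGER REFERENCES album(id));
    """)

    conn.execute("INSERT INTO artist (id, name) VALUES (1, 'Fleetwood Mac')")
    conn.execute("INSERT INTO album (id, title, year, artist_id) VALUES (1, 'English Rose', 1969, 1)")
    conn.execute("INSERT INTO song (id, title, track, album_id) VALUES (1, 'Black Magic Woman', 7, 1)")

    # The application "understands" the relationships because the schema encodes them.
    row = conn.execute("""
        SELECT artist.name, album.title, song.title
        FROM song
        JOIN album  ON song.album_id = album.id
        JOIN artist ON album.artist_id = artist.id
    """).fetchone()
    print(row)  # ('Fleetwood Mac', 'English Rose', 'Black Magic Woman')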

The prevalence of relational databases in application development highlights the fact that the data itself is not the only thing that matters; its relationships to other data are just as important.  They work extremely well, but they are limited: they model data for a particular domain, namely the domain of the application being written (in the example above, music albums).  This means that for my application to use any data, it must first be imported into the relational database, which has a schema unique to the application, so someone must write an explicit mapping from each external source into the custom database.

Web services that return data are another popular data source, typically accessed through an application programming interface (API). At first it might seem that web-based APIs fulfill Berners-Lee’s description of the semantic web, and some of them indeed might, but it is generally not the case. Why? Because just as with the relational database, a developer has to write an explicit mapping to interface with each API their application uses.  APIs are also proprietary interfaces that return data in a format defined by whoever wrote them.  For machines to truly act “intelligently,” data needs to be presented in a uniform manner, regardless of technology or application domain.
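
As a sketch of what that per-API mapping looks like in practice (the provider, field names, and response shape below are entirely invented), this is the kind of translation code a developer ends up hand-writing for every data source:

    import json

    # Hypothetical example: a music API that returns album data in its own
    # proprietary JSON shape. The field names below are invented.
    provider_response = json.loads("""
    {
      "results": [
        {"albumTitle": "English Rose", "releaseYear": "1969", "artistName": "Fleetwood Mac"}
      ]
    }
    """)

    # The developer must hand-write an explicit, per-API mapping from this
    # provider's format into the application's own schema. A different API
    # means a different format and another hand-written mapping.
    albums = [
        {
            "title": item["albumTitle"],
            "year": int(item["releaseYear"]),
            "artist": item["artistName"],
        }
        for item in provider_response["results"]
    ]
    print(albums)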

Another key aspect of the semantic web, possibly the most important, is the web part.  Both a relational database and a web-based API are endpoints to closed, non-distributed systems. What we are talking about with the semantic web are applications that have the entire Internet as a data source and can make intelligent connections between that data, even though the application was never explicitly programmed with those relationships. In short, we’re talking about building applications that can use the web the way a human can.

Let’s say I wanted to create a web-based application that is the ultimate music research system.  I want it to know every album published, the credits for every track, all the artwork, and even song lyrics.  One approach I could take is to create a relational database and attempt to fill it with all this data.  But new music is published all the time, from a variety of sources; how can my application keep up? Furthermore, how does the data get from the Internet into my relational database? Traditionally, some sort of connector or importer tool is written for this purpose.  But new sites and data sources are created all the time, and writing a custom import mapping for each is a whole lot of work, and it limits the data sources my application can use.

A semantic web application, by contrast, could simply be told to use allmusic.com, amazon.com, wikipedia.org, iTunes, songlyrics.com, and any other systems known to have some authority on the subject of music and song publishing.  I could ask it, “What albums did Fleetwood Mac release in 1969?” and it would semantically connect the question to the data provided on those sites, aggregate it, and give me an answer.  When I see “English Rose” in the result set, I could then ask, “What is track 7, and who wrote it?”  It would go back to those data sources and tell me: “Black Magic Woman” and “Peter Green”. You could keep going: “What are the lyrics to that?”, “Where was Peter Green born?”, and so on. Unlike my relational database, which would certainly have boundaries (I doubt it would contain the birthplace of every band member), the semantic web application can keep going, relating question after question, as long as the data sources it uses contain the data and relationships.  And many prominent sites, like Wikipedia, publish their content in semantic formats and do contain all of this data.
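
To give a rough idea of what “asking” looks like under the hood, here is a sketch of that first question as a SPARQL query against DBpedia, the project that republishes Wikipedia’s data in semantic form. The specific property names (dbo:artist, dbo:releaseDate) are assumptions about how that dataset models albums, not a definitive recipe:

    from SPARQLWrapper import SPARQLWrapper, JSON

    # A rough sketch: asking "What albums did Fleetwood Mac release in 1969?"
    # against DBpedia's public SPARQL endpoint. The property names used here
    # (dbo:artist, dbo:releaseDate) are assumptions; datasets vary.
    sparql = SPARQLWrapper("https://dbpedia.org/sparql")
    sparql.setQuery("""
        PREFIX dbo: <http://dbpedia.org/ontology/>
        PREFIX dbr: <http://dbpedia.org/resource/>

        SELECT ?album ?date WHERE {
            ?album a dbo:Album ;
                   dbo:artist dbr:Fleetwood_Mac ;
                   dbo:releaseDate ?date .
            FILTER (YEAR(?date) = 1969)
        }
    """)
    sparql.setReturnFormat(JSON)
    results = sparql.query().convert()

    for binding in results["results"]["bindings"]:
        print(binding["album"]["value"], binding["date"]["value"])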

The notion of an application understanding relationships between things without a developer explicitly defining them is an odd one to those of us used to building traditional software systems, and exactly how that happens is the secret sauce provided by the technologies I mentioned earlier: RDF, SPARQL, OWL.  It can, however, be summarized in a word: metadata.
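
Here is a rough sketch of what that metadata looks like in practice, using the rdflib library: every fact is a (subject, predicate, object) triple that any consumer can read, with no application-specific schema required. The URIs and vocabulary below are invented for the example; real publishers would reuse shared ontologies.

    from rdflib import Graph, Literal, Namespace
    from rdflib.namespace import RDF

    # Invented vocabulary for the sketch; real data would use shared ontologies.
    EX = Namespace("http://example.org/music/")

    g = Graph()
    g.add((EX.EnglishRose, RDF.type, EX.Album))
    g.add((EX.EnglishRose, EX.byArtist, EX.FleetwoodMac))
    g.add((EX.EnglishRose, EX.track, EX.BlackMagicWoman))
    g.add((EX.BlackMagicWoman, EX.title, Literal("Black Magic Woman")))
    g.add((EX.BlackMagicWoman, EX.writer, EX.PeterGreen))

    # Any other application can now ask about these relationships with SPARQL,
    # without knowing anything about the publisher's internal schema.
    for subject, writer in g.query(
        "SELECT ?s ?writer WHERE { ?s <http://example.org/music/writer> ?writer }"
    ):
        print(subject, writer)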

While a lot of progress has been made extracting structures that semantic applications understand from plain text (text-to-ontology), the ultimate promise of a semantic web does require some publishing guidelines. Publishers will ideally publish RDF/XML in addition to their human-readable content. Detractors of the technology view this as a difficult barrier to widespread adoption, but I would point to formats like RSS and Atom.  Ultimately, publishing RDF/XML should be just as easy and ubiquitous as those syndication feeds. Any developer wanting to do so already has a wealth of tools at their disposal.
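
For instance, with a toolkit like rdflib, serializing data as RDF/XML (or Turtle) is a one-liner once the graph exists; the vocabulary here is again invented for the sake of the sketch:

    from rdflib import Graph, Literal, Namespace

    # A sketch of how little extra work publishing RDF/XML takes once the
    # data exists: build a graph, then serialize it. Vocabulary is invented.
    EX = Namespace("http://example.org/music/")

    g = Graph()
    g.bind("ex", EX)
    g.add((EX.EnglishRose, EX.byArtist, EX.FleetwoodMac))
    g.add((EX.EnglishRose, EX.releaseYear, Literal(1969)))

    # The same data, serialized for machines (RDF/XML) or humans (Turtle);
    # a publisher could serve either alongside the ordinary HTML page.
    print(g.serialize(format="xml"))
    print(g.serialize(format="turtle"))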

Although there are major content providers (like Wikipedia) that do publish data in a semantic web format like RDF/XML, it remains to be seen whether these technologies will be universally adopted. One thing is certain, though: our current World Wide Web is dumb.  It’s full of text, images, and videos that have no context or relationship to each other. Humans can sift through this data and make sense of it; machines can’t.  Google (and Bing) are about the closest things there are, but even those amazing systems don’t make the Internet semantic.  As good as their algorithms are, the relationships they create between content on the web are imprecise.  They must be, because it’s the search indexing algorithm that is defining the relationships.  In the semantic web, the content publisher defines them, in a very precise way.  Google at its best can only make assumptions based on the content of a page, the links in the page, and the links to the page.  It would be much better if the publisher explicitly defined how their published data is related.  Think of how accurate those search engines could be then.

When people think about the future of computers, I think Star Trek comes to mind. In that universe, people can ask the computer any arbitrary question and get back an answer (or at least a request for disambiguation or clarification), not a giant list of possible places where an answer might be, which is the state of our current web.  Furthermore, I could give that computer an arbitrary instruction like “make dinner reservations for me at the best French restaurant in town on Thursday, sometime after 7 PM” and it would do exactly that.

That future is not too far off.  Transcribing speech is something even your smartphone can do well now.  Programs can parse that transcription into an ontology: a specific vocabulary in a context.  Combining these capabilities with semantic web technologies to achieve those futuristic results is ultimately what we are trying to do.
