Anatomy of a URI


Posted in Semantic Web, Software Development on November 1, 2011

The Uniform Resource Identifier, or URI, is on one hand very familiar to most of us in the form of the URLs we use to program and navigate the web. URIs are, however, a much more general construct, and one that is frequently misunderstood even by web professionals. Yet they are at the heart of many key technologies, such as XML namespaces and semantic web technologies like RDF. In my study of semantic web technologies, I thought it would be helpful to take a moment and truly understand the URI.

URI vs. URL

A good place to start is with a special type of URI: the Uniform Resource Locator, or URL. How is a URL different from a URI? The answer is that a URL is a URI that you can provide to a system or program (like a web browser) and get a resource in return (like a web page, image, or video). It's all in the name, really: a URI is just an identifier, while a URL is a URI that also says where a resource can be found. In other words, it locates a resource. URIs in and of themselves do not necessarily locate anything.

What good is a URI if it doesn’t locate a resource, you ask? A URI that isn’t a URL is typically a URN, a Uniform Resource Name. That is, it doesn’t locate a resource, but simply names one. This allows even non-digital resources (like physical books) to be identified using a URI. A URI can be constructed from the ISBN of a book. That URI is a URN, since it can name the book but can’t locate it (even if the system knows the location, the program can’t physically walk down an aisle and get the book for you, which is the intended meaning of “locate” here).
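To make the distinction concrete, here is a minimal sketch using Python's standard urllib.parse module. Both URIs are made up for illustration (the ISBN is just a sample value), and the describe helper is my own name, not part of any library. It only shows that a URL carries a host to contact, while an ISBN-based URN carries nothing but a name.

```python
from urllib.parse import urlsplit

# Hypothetical identifiers, chosen only for illustration.
url = urlsplit("http://example.com/books/moby-dick")
urn = urlsplit("urn:isbn:0451450523")


def describe(parts):
    """Summarise a parsed URI as (scheme, authority, path)."""
    return (parts.scheme, parts.netloc, parts.path)


# The URL names a host that a program can contact to fetch the resource.
print(describe(url))

# The URN has no authority at all: everything after "urn:" is just a name.
print(describe(urn))
```

The empty authority in the second result is the whole point: there is nowhere for a program to go, only something to call the book.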

General URI Syntax

A common misconception is that the first section of the URI specifies a protocol. It actually denotes the URI’s scheme. Some very popular schemes (like http) are tied directly to a protocol (Hypertext Transfer Protocol), but this doesn’t have to be the case (the file scheme isn’t tied to a protocol, for example).

The most basic and abstract definition of a URI, as defined by the IETF in RFC 3986, is actually quite simple. The generic syntax (brackets denote optional parts) is:

<scheme name> : <hierarchical part> [ ? <query part>] [ # <fragment part>]

That’s it. However, things get complicated because each scheme can define its own syntax for the right side of that colon; the only real rule is that it must respect the ‘?’ and ‘#’ characters used to introduce the query and fragment parts. Anyone can create a URI scheme that fits their exact needs. If intended to be a standard, that scheme can be registered with the IANA. Many schemes add the familiar “//” after the first colon, but it isn’t required. If you work on the web, you are probably familiar with the schemes http, https, ftp, file, and mailto. There are many, many more: the Wikipedia page on the subject lists just how many there are.
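The generic syntax above can be tried out directly. The sketch below uses the regular expression that RFC 3986 gives in its Appendix B for splitting a URI into components; the split_uri helper and the two sample URIs are my own, for illustration only. Note that the regular expression separates the hierarchical part into an optional authority (introduced by “//”) and a path.

```python
import re

# Regular expression from RFC 3986, Appendix B.
URI_PATTERN = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
)


def split_uri(uri):
    """Split a URI into its generic components (hypothetical helper).

    Components that are absent come back as None; the path is always
    a string, possibly empty.
    """
    match = URI_PATTERN.match(uri)  # every group is optional, so this always matches
    return {
        "scheme": match.group(2),
        "authority": match.group(4),
        "path": match.group(5),
        "query": match.group(7),
        "fragment": match.group(9),
    }


# A scheme that uses "//" and one that does not (both made up).
web = split_uri("http://example.com/search?q=uri#results")
mail = split_uri("mailto:someone@example.com")
print(web)
print(mail)
```

The mailto example has no “//”, so it has no authority component; everything after the colon lands in the path, and it is up to the scheme to say what that text means.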

Authorities and Paths

While the hierarchical part of a URI can be custom defined, in practice it often has two parts: the authority and the path.

The authority part identifies a host. On the web, this is typically a domain name or IP address. It bears mentioning that domain names have their own syntax and rules: the syntax comes from the DNS specifications published by the IETF, while the namespace itself is coordinated by ICANN. The authority part can also include login credentials or a port number.
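As a rough illustration of what can live inside the authority, the sketch below uses Python's standard urllib.parse.urlsplit on a made-up FTP URI. The user name, password, host, and port are all invented for the example.

```python
from urllib.parse import urlsplit

# Hypothetical URI with credentials and an explicit port in its authority.
parts = urlsplit("ftp://alice:s3cret@files.example.org:2121/pub/report.txt")

authority = {
    "raw": parts.netloc,        # the authority exactly as written
    "username": parts.username,
    "password": parts.password,
    "host": parts.hostname,
    "port": parts.port,         # parsed to an int, or None if absent
}
print(authority)
```

Everything between the “//” and the next “/” belongs to the authority; the path only starts at /pub/report.txt.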

The path is expressed using the typical “/” notation we are all used to, and is conceptually like directories in a file system. In some cases, paths do indeed correspond to files on a server, but not as a rule. Often paths are meant to model some domain, as with REST services, but there are no rules for structuring paths. They are arbitrary logical hierarchies defined by whoever is designing them.
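Here is a small sketch of treating a path as a logical hierarchy rather than a file location. The API host and the customers/orders layout are hypothetical, as is the path_segments helper; a real service could structure its paths any way it likes.

```python
from urllib.parse import unquote, urlsplit


def path_segments(uri):
    """Return the decoded, non-empty segments of a URI's path (hypothetical helper)."""
    path = urlsplit(uri).path
    # Split on "/" first, then decode percent-escapes within each segment.
    return [unquote(segment) for segment in path.split("/") if segment]


# A made-up REST-style URI: order 7 belonging to customer 42.
segments = path_segments("https://api.example.com/customers/42/orders/7?expand=items")
print(segments)
```

Nothing in the URI itself says that a customers directory exists on the server; the hierarchy only models the relationship between a customer and their orders.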

URI Diagram

Keeping in mind that schemes can be custom defined, the general URI format is illustrated (using an example URI from Wikipedia) below:

The parts of a URI


Comments

  1. #1 by Jason on January 3, 2012 - 9:55 am

    Very informative post! Thanks for the info. :)
