A Robust Title Casing Algorithm

I’ve been thinking a lot about title casing lately. I’ve been tagging my huge music library, which exposes me to all the odd variations found in band, song, and album names. Additionally, I’ve recently written an algorithm to support the new Orchard CMS Coverflow module I’ve been working on. This post outlines the logic of the algorithm and includes the C# source code at the bottom.

The Basic Rules

In general, title casing means capitalizing the first letter of each word in a string, like this: “My Important Title.” Creating an algorithm to do this is trivial, but unfortunately there are exceptions to this rule.
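To show just how trivial the basic rule is, here is a minimal sketch of it. The class name NaiveTitleCase is made up for this illustration and is not part of the library code; it only splits on spaces and knows nothing about the exceptions.

```csharp
using System;
using System.Linq;

public static class NaiveTitleCase
{
    // Capitalizes the first letter of every space-separated word and
    // lowercases the rest. No exception lists, no punctuation handling.
    public static string Apply(string input)
    {
        var words = input.Split(' ')
            .Select(w => w.Length == 0
                ? w
                : char.ToUpperInvariant(w[0]) + w.Substring(1).ToLowerInvariant());
        return string.Join(" ", words);
    }
}
```

Calling NaiveTitleCase.Apply("my important title") returns "My Important Title", but it also capitalizes every small word, which is where the exceptions come in.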

The first is that English has a handful of words that we agree not to capitalize in titles. The words that I came up with and included are:

{ the, of, or, and, an, a, in, is, are, to, on }

Unless they are the first or the last word in the title, these words should be lowercased. See the results for yourself: “The Lord Of The Rings” vs. “The Lord of the Rings.” Notice that “the” is capitalized as the first word but lowercased in the middle of the title.

The last word is important too. Take the string “…and the band played on.” The correct title casing should be “…And the Band Played On.”  The last “on” is capitalized because it is at the end. Contrast this with “hop on pop,” which should be cased “Hop on Pop.”
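The first-word and last-word rule can be sketched roughly as follows. This is an illustration only, assuming words are separated by single spaces; the class name SmallWordTitleCase is hypothetical, and punctuation is not handled at this point.

```csharp
using System;
using System.Collections.Generic;

public static class SmallWordTitleCase
{
    // The lowercase list given in this post.
    private static readonly HashSet<string> LowercaseWords =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        { "the", "of", "or", "and", "an", "a", "in", "is", "are", "to", "on" };

    public static string Apply(string input)
    {
        string[] words = input.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
        for (int i = 0; i < words.Length; i++)
        {
            // The first and last words are capitalized no matter what.
            bool isEdge = i == 0 || i == words.Length - 1;
            string lower = words[i].ToLowerInvariant();
            words[i] = !isEdge && LowercaseWords.Contains(lower)
                ? lower
                : char.ToUpperInvariant(lower[0]) + lower.Substring(1);
        }
        return string.Join(" ", words);
    }
}
```

With this sketch, "hop on pop" becomes "Hop on Pop", while "and the band played on" becomes "And the Band Played On".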

Exceptions for Specifically Cased Words

The list of words to lowercase isn’t the only list of special words we need to consider. Perhaps more important in today’s brand-conscious world are casing exceptions. These are quite common, like Apple’s “i” products: iPhone, iPad, iPod, etc. If you title case those names, they become downright unrecognizable: “Iphone.” At Planet Telex, we’ve built websites for DEMOGala, ScriptSave, WellDyneRx, BioClaim, and others who wouldn’t be happy to have their brand incorrectly cased as all lowercase except for the first letter.

An example from my music library is the band MUTEMATH. The correct branding is all caps. If my algorithm makes it “Mutemath,” not only is it wrong, it’s totally lame. The nuances don’t stop there, though. Consider the band “Portugal. The Man.” Yes, that period is supposed to be there, and the “The” should also be capitalized. That is how the band does it, but it is also natural to the English language: we expect a capital letter after a period. If my algorithm generates “Portugal. the Man,” it is also incorrect and lame.

So a successful casing algorithm needs two lists of special words: one to specify words to lowercase when in the middle of the title, and the other to specify words that should be cased specifically, like “MUTEMATH” and “iPhone”.
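One way to picture the two lists working together on a single word is sketched below. The names WordCaser, CaseWord, and forceCapital are assumptions made for this example, the special-word entries are just samples from this post, and checking the special list first is a design choice of the sketch rather than a confirmed detail of the library.

```csharp
using System;
using System.Collections.Generic;

public static class WordCaser
{
    // Words lowercased in the middle of a title (the list from this post).
    private static readonly HashSet<string> LowercaseWords =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        { "the", "of", "or", "and", "an", "a", "in", "is", "are", "to", "on" };

    // Sample entries only; a real list would be much longer and configurable.
    private static readonly Dictionary<string, string> SpecialWords =
        new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase)
        {
            { "mutemath", "MUTEMATH" },
            { "iphone", "iPhone" },
            { "ipad", "iPad" },
            { "ipod", "iPod" },
        };

    // forceCapital is true for the first and last word of the title.
    public static string CaseWord(string word, bool forceCapital)
    {
        if (word.Length == 0) return word;

        // Brand casing wins over every other rule in this sketch.
        string special;
        if (SpecialWords.TryGetValue(word, out special)) return special;

        string lower = word.ToLowerInvariant();
        if (!forceCapital && LowercaseWords.Contains(lower)) return lower;
        return char.ToUpperInvariant(lower[0]) + lower.Substring(1);
    }
}
```

Because the special list is keyed case-insensitively, "IPHONE", "Iphone", and "iphone" all come back as "iPhone".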

Nuances in Punctuation

A robust title casing algorithm also needs to know which word-separating symbols should trigger exceptions to the general lowercase rule, as in “Portugal. The Man” or “Pinion/Terrible Lie” (which an algorithm that didn’t respect the “/” character could turn into “Pinion/terrible Lie”).

To surmount this complexity, I’ve created two lists of characters that separate words: a list of “weak” separators and a list of “strong” separators. All of these characters act as flags that separate one word from another. The difference, as the names imply, is that after a “strong” separator the following word should be capitalized, even if it is in the lowercase list.

The two weak separators are the space and comma. There are more strong ones:

{ . ? ! ( ) { } [ ] < > / & }
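The two separator lists could be represented along these lines. The SeparatorKind enum and the Separators class are hypothetical names for this sketch; only the characters themselves come from the lists above.

```csharp
using System.Collections.Generic;

public enum SeparatorKind { None, Weak, Strong }

public static class Separators
{
    // Weak separators end a word but do not affect the next word's casing.
    private static readonly HashSet<char> WeakChars = new HashSet<char> { ' ', ',' };

    // Strong separators end a word and force the next word to be capitalized.
    private static readonly HashSet<char> StrongChars = new HashSet<char>
        { '.', '?', '!', '(', ')', '{', '}', '[', ']', '<', '>', '/', '&' };

    public static SeparatorKind Classify(char c)
    {
        if (StrongChars.Contains(c)) return SeparatorKind.Strong;
        if (WeakChars.Contains(c)) return SeparatorKind.Weak;
        return SeparatorKind.None;
    }
}
```

Any character that is in neither list is treated as part of a word.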

Algorithm Overview

With the assistance of the lists I’ve defined as well as a few helper methods, the basic algorithm iterates over each character, building words and then adding them when separators are encountered. A separate function handles applying the rules of casing to a single word, so the iterator function simply has to control it.

The biggest complexity is dealing with the possible variations in punctuation. The least obvious rule, which has several lines of explanation in my example, is that if a strong separator is encountered, spaces must be discounted until the next word is written. This way, the word “and” following both a “)” character and then a space character is correctly uppercased.
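To make the flow concrete, here is a compact sketch of the loop described above. It is not the library code: the class name TitleCaseSketch, the helper CaseWord, the capitalizeNext flag, and the way the last word is re-cased at the end are all assumptions made for this illustration, and the special-word entries are samples.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class TitleCaseSketch
{
    private static readonly HashSet<string> LowercaseWords =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        { "the", "of", "or", "and", "an", "a", "in", "is", "are", "to", "on" };

    // Sample entries only.
    private static readonly Dictionary<string, string> SpecialWords =
        new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase)
        { { "mutemath", "MUTEMATH" }, { "iphone", "iPhone" } };

    private const string WeakSeparators = " ,";
    private const string StrongSeparators = ".?!(){}[]<>/&";

    private static string CaseWord(string word, bool forceCapital)
    {
        string special;
        if (SpecialWords.TryGetValue(word, out special)) return special;
        string lower = word.ToLowerInvariant();
        if (!forceCapital && LowercaseWords.Contains(lower)) return lower;
        return char.ToUpperInvariant(lower[0]) + lower.Substring(1);
    }

    public static string Apply(string input)
    {
        var output = new StringBuilder(input.Length);
        var word = new StringBuilder();
        bool capitalizeNext = true; // the first word is always capitalized
        int lastStart = -1;         // where the most recent word sits in output
        int lastLength = 0;         // how many characters it occupies there
        string lastRaw = null;      // that word before casing

        // One extra iteration with a sentinel space flushes the final word.
        for (int i = 0; i <= input.Length; i++)
        {
            char c = i < input.Length ? input[i] : ' ';
            bool strong = StrongSeparators.IndexOf(c) >= 0;
            bool weak = WeakSeparators.IndexOf(c) >= 0;
            if (!strong && !weak) { word.Append(c); continue; }

            if (word.Length > 0)
            {
                lastRaw = word.ToString();
                string cased = CaseWord(lastRaw, capitalizeNext);
                lastStart = output.Length;
                lastLength = cased.Length;
                output.Append(cased);
                word.Clear();
                capitalizeNext = false; // reset only once a word is written
            }

            // A strong separator sets the flag. A weak one leaves it alone,
            // so the space in ") and" does not cancel the capital.
            if (strong) capitalizeNext = true;
            if (i < input.Length) output.Append(c);
        }

        // The last word is capitalized even if it is on the lowercase list.
        if (lastRaw != null)
        {
            output.Remove(lastStart, lastLength)
                  .Insert(lastStart, CaseWord(lastRaw, true));
        }
        return output.ToString();
    }
}
```

In this sketch, TitleCaseSketch.Apply("portugal. the man") returns "Portugal. The Man" and "pinion/terrible lie" returns "Pinion/Terrible Lie". The flag is cleared only when a word is actually written, which is how spaces after a strong separator end up being discounted.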

The Code

The following code is a slightly revised version of the code included in the Planet Telex .NET Library. Some formatting is changed to better fit on the page, and the class name has been contrived for this example. Download or fork the source code at our Planet Telex GitHub account.

 
