A Robust Title Casing Algorithm


Posted in Software Development on March 1, 2012

I’ve been thinking a lot about title casing lately. I’ve been tagging my huge music library, which exposes me to all the odd variations in band, song, and album names. I’ve also recently written an algorithm to support the new Orchard CMS Coverflow module I’ve been working on. This post outlines the logic of the algorithm and includes the C# source code at the bottom.

The Basic Rules

In general, title casing means capitalizing the first letter of each word in a string, like this: “My Important Title.” Creating an algorithm to do this is trivial, but unfortunately there are exceptions to this rule.

The first is that English has a handful of words that we agree not to capitalize in titles. The list I came up with and included is:

{ the, of, or, and, an, a, in, is, are, to, on }

Unless they are the first or the last word in the title, these words should be lowercased. See the results for yourself: “The Lord Of The Rings” vs. “The Lord of the Rings.” Notice that the first word, “the,” is capitalized, but in the middle of the title it is lowercased.

The last word is important too. Take the string “…and the band played on.” The correct title casing should be “…And the Band Played On.”  The last “on” is capitalized because it is at the end. Contrast this with “hop on pop,” which should be cased “Hop on Pop.”
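As a rough illustration of the rules so far, here is a minimal sketch in Python (not the C# from the Planet Telex library). The function name `basic_title_case` is hypothetical, and the sketch assumes words are separated only by spaces, so it ignores punctuation entirely.

```python
# Words that stay lowercase unless they are the first or last word.
# This is the list given in the post.
LOWERCASE_WORDS = {"the", "of", "or", "and", "an", "a", "in", "is", "are", "to", "on"}


def basic_title_case(title):
    """Title-case a space-separated string (hypothetical helper).

    Assumption: whitespace is the only word separator. Runs of
    whitespace collapse to a single space because of str.split().
    """
    words = title.split()
    last = len(words) - 1
    cased = []
    for index, word in enumerate(words):
        lower = word.lower()
        # Only words strictly between the first and last may stay lowercase.
        if lower in LOWERCASE_WORDS and 0 < index < last:
            cased.append(lower)
        else:
            cased.append(lower[:1].upper() + lower[1:])
    return " ".join(cased)


print(basic_title_case("the lord of the rings"))
print(basic_title_case("hop on pop"))
print(basic_title_case("and the band played on"))
```

Because this version splits only on spaces, a title such as “portugal. the man” would still come out wrong; handling that case needs the separator rules.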

Exceptions for Specifically Cased Words

The list of words to lowercase isn’t the only list of special words we need to consider. Perhaps more important in today’s brand-conscious world are casing exceptions. These are quite common, like Apple’s “i” products: iPhone, iPad, iPod, etc. If you title case those names, they become downright unrecognizable: “Iphone.” At Planet Telex, we’ve built websites for DEMOGala, ScriptSave, WellDyneRx, BioClaim, and others who wouldn’t be happy to have their brand incorrectly cased as all lowercase except for the first letter.

An example from my music library is the band MUTEMATH. The correct branding is all caps. If my algorithm makes it “Mutemath,” not only is it wrong, it’s totally lame. The nuances don’t stop there, though. Consider the band “Portugal. The Man.” Yes, that period is supposed to be there, and the “The” should also be capitalized. That is how the band does it, but it is also natural in English: we expect a capital letter after a period. If my algorithm generates “Portugal. the Man,” it is also incorrect and lame.

So a successful casing algorithm needs two lists of special words: one to specify words to lowercase when in the middle of the title, and the other to specify words that should be cased specifically, like “MUTEMATH” and “iPhone”.
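The two lists can be combined in a single-word casing function. The following is a sketch in Python, not the library’s C# code. The names `SPECIAL_CASINGS` and `case_word` and the `force_capital` parameter are assumptions made for illustration, and the special-casing table holds only a few of the names mentioned above.

```python
# Words lowercased in the middle of a title (the list from the post).
LOWERCASE_WORDS = {"the", "of", "or", "and", "an", "a", "in", "is", "are", "to", "on"}

# Exact casings keyed by their lowercase form. Illustrative entries only;
# a real table would be maintained per project.
SPECIAL_CASINGS = {
    "iphone": "iPhone",
    "ipad": "iPad",
    "ipod": "iPod",
    "mutemath": "MUTEMATH",
}


def case_word(word, force_capital=False):
    """Apply the casing rules to one word (hypothetical helper).

    force_capital is True when the caller knows the word is first,
    last, or otherwise must not be lowercased.
    """
    key = word.lower()
    # Brand-style exceptions win over every other rule.
    if key in SPECIAL_CASINGS:
        return SPECIAL_CASINGS[key]
    if key in LOWERCASE_WORDS and not force_capital:
        return key
    return key[:1].upper() + key[1:]


print(case_word("IPHONE"))
print(case_word("mutemath"))
print(case_word("The"))
print(case_word("the", force_capital=True))
```

Checking the special-casing table first is a design choice in this sketch: it means a brand name keeps its casing even when it is the first or last word of the title.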

Nuances in Punctuation

A robust title casing algorithm needs to be aware of which symbols that separate words should trigger exceptions to the general lowercase rule, like “Portugal. The Man” or “Pinion/Terrible Lie” (which could produce “Pinion/terrible Lie” in an algorithm that didn’t respect the “/” character).

To manage this complexity, I’ve created two lists of characters that separate words: a list of “weak” separators and a list of “strong” separators. As their names imply, all of these characters act as flags that separate one word from another; the difference is that after a “strong” separator, the following word should be capitalized, even if it is in the lowercase list.

The two weak separators are the space and comma. There are more strong ones:

{ . ? ! ( ) { } [ ] < > / & }
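To make the distinction concrete, this small Python sketch (an illustration, not the post’s C# code) scans a string and reports which words follow a strong separator and would therefore be capitalized regardless of the lowercase list. The function name `words_after_strong_separator` is hypothetical.

```python
# Separator sets as described in the post.
WEAK_SEPARATORS = set(" ,")
STRONG_SEPARATORS = set(".?!(){}[]<>/&")


def words_after_strong_separator(text):
    """Return the words that come after a strong separator.

    Weak separators between the strong separator and the word are
    skipped, so "portugal. the man" still reports "the".
    """
    found = []
    current = []
    after_strong = False
    # A trailing space guarantees the final word is flushed.
    for ch in text + " ":
        if ch in STRONG_SEPARATORS or ch in WEAK_SEPARATORS:
            if current:
                if after_strong:
                    found.append("".join(current))
                current = []
                # The flag is spent once a word has been emitted.
                after_strong = False
            if ch in STRONG_SEPARATORS:
                after_strong = True
        else:
            current.append(ch)
    return found


print(words_after_strong_separator("pinion/terrible lie"))
print(words_after_strong_separator("portugal. the man"))
print(words_after_strong_separator("hop on pop"))
```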

Algorithm Overview

With the assistance of the lists I’ve defined as well as a few helper methods, the basic algorithm iterates over each character, building words and then adding them when separators are encountered. A separate function handles applying the casing rules to a single word; the iterator function simply has to drive it.

The biggest complexity is dealing with the possible variations in punctuation. The least obvious rule, which has several lines of explanation in my example, is that if a strong separator is encountered, spaces must be discounted until the next word is written. This way, the word “and” following both a “)” character and then a space character is correctly uppercased.
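The overview above can be sketched end to end as follows. This is a Python approximation written for this explanation, not the Planet Telex C# source, so the structure and the names (`title_case`, `case_word`, `SPECIAL_CASINGS`) are assumptions. It treats any character that is not in either separator list as part of a word.

```python
LOWERCASE_WORDS = {"the", "of", "or", "and", "an", "a", "in", "is", "are", "to", "on"}
SPECIAL_CASINGS = {"iphone": "iPhone", "mutemath": "MUTEMATH"}  # illustrative entries
WEAK_SEPARATORS = set(" ,")
STRONG_SEPARATORS = set(".?!(){}[]<>/&")


def case_word(word, force_capital):
    """Apply the casing rules to a single word."""
    key = word.lower()
    if key in SPECIAL_CASINGS:
        return SPECIAL_CASINGS[key]
    if key in LOWERCASE_WORDS and not force_capital:
        return key
    return key[:1].upper() + key[1:]


def title_case(text):
    """Iterate over characters, building words and emitting them at separators."""
    tokens = []           # each token is [text, is_word, force_capital]
    current = []          # characters of the word being built
    force_capital = True  # the first word is always capitalized

    def flush_word():
        nonlocal force_capital
        if current:
            tokens.append(["".join(current), True, force_capital])
            current.clear()
            # The flag is cleared only once a word has been written, so
            # spaces after a strong separator do not cancel it.
            force_capital = False

    for ch in text:
        if ch in STRONG_SEPARATORS:
            flush_word()
            tokens.append([ch, False, False])
            force_capital = True
        elif ch in WEAK_SEPARATORS:
            flush_word()
            tokens.append([ch, False, False])
        else:
            current.append(ch)
    flush_word()

    # The last word is capitalized even if it is in the lowercase list.
    for token in reversed(tokens):
        if token[1]:
            token[2] = True
            break

    return "".join(
        case_word(value, force) if is_word else value
        for value, is_word, force in tokens
    )


print(title_case("portugal. the man"))
print(title_case("pinion/terrible lie"))
print(title_case("one (two) and three"))
print(title_case("and the band played on."))
```

One known gap in this sketch: a leading character such as “…” is not in either separator list, so it becomes part of the first word and “…and” would not be capitalized. Handling that would need extra punctuation rules beyond the two lists described here.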

The Code

The following code is a slightly revised version of the code included in the Planet Telex .Net Library. Some formatting is changed to better fit on the page, and the class name has been contrived for this example. Download or fork the source code at our Planet Telex GitHub account.


Comments

  1. #1 by Stephen Ringle on February 27, 2016 - 5:20 am

    Would it be possible for you to provide me with either a translation of your “Robust Title Case Algorithm” into pseudo code, or into VBA (Visual Basic for Applications)? I must admit, I am an amateur programmer. But I do a lot of volunteer work in MS Access for my local Historical Society. I’m working on data cleanup and your algorithm would be a big help. I just can’t read and understand your source code language well enough to translate it myself. Many thanks for any help you can offer.
    Steve.

  2. #2 by Mark Meikle on April 29, 2016 - 9:15 am

    Thanks so much for writing this. I’m looking forward to using it, but am having trouble because I’m not sure how to set up the PlanetTelex library this appears to depend on. Can you provide me with some instructions?

  3. #3 by Mark Meikle on April 29, 2016 - 9:24 am

    Also, my project is in VB. Will I be able to use a C# class with my VB project?

  4. #4 by Rob Dixon on April 29, 2016 - 10:41 am

    That is a pretty old .Net library we haven’t maintained for years, although the few methods it has viz. regex and string parsing remain solid. The project is on GitHub here: https://github.com/planettelex/dotnet-pt-library

  5. #5 by Rob Dixon on April 29, 2016 - 10:45 am

    Yes, .Net compiles everything to an intermediate language and assemblies can be consumed by several languages, including VB and C#.

    That said, you’d be better off simply copying and pasting the code. If you feel you can’t convert the C# to VB yourself (which I would recommend to better understand how it works), you can use a handy service like this Telerik one to do it: http://converter.telerik.com/

  6. #6 by Rob Dixon on April 29, 2016 - 10:48 am

    Sorry, the explanation above the code, the code, and the code comments will have to suffice.

  7. #7 by Rob Dixon on April 29, 2016 - 10:58 am

    Also, Mark, if you are referring to usage of “Resources,” that isn’t something from the Planet Telex .Net library; it is part of .Net itself. Simply go to project properties in Visual Studio, and you will see “Resources” in the left menu about halfway down. That is where you define those in whatever assembly has this code.

  8. #8 by Jim Taylor on May 31, 2016 - 4:38 pm

    This is a helpful review of what’s involved in title casing. Thanks. However, the list of uncapitalized words contains words that are almost always capitalized. “Is” and “are” are verbs, which are capitalized in titles, so they should not be in the list. (Just as “be”, “was”, “go”, etc. are not in the list.)

    FWIW, here’s a list provided by the U.S. Government Printing Office: a, an, the, at, by, for, in, of, on, to, up, and, as, but, or, nor (https://www.gpo.gov/fdsys/pkg/GPO-STYLEMANUAL-2008/pdf/GPO-STYLEMANUAL-2008.pdf).
