Helical Alpha Code

nameparse: PHP name parser

Once upon a time (03Mar2007), I wanted to write a library card catalogue. However, my author-parsing routine was leaving something to be desired. It required names to be input in a very rigid form, which I wasn't sure I wanted to do late at night (or after several hours of cataloguing new accessions). Well, "Free and Open Source Software", right? Wrong. There is a huge market for name parsing, apparently, since there are dozens of commercial products claiming to make sense of any name format extant.

I eventually found a free one, and written in PHP, no less! normalize_name, written by Jed Hartman, is along the lines of what I was looking for (announcement, source). Unfortunately, it is somewhat limited for my purposes: it normalizes names to first and last name only, discards titles, does not differentiate between first and middle names, and requires names to be in "spoken" format (first followed by last). However, he had some ideas which I incorporated into nameparse.php, especially in the recognition of prefixes (prefices?), suffixes (suffices?), and multi-word last names.

nameparse.php can recognize names in "[title]first[middles]last[,][suffix]" and "last,first[middles][,][suffix]" forms, which, when you think about it, cover most if not all well-formed name input formats. nameparse.php handles last names of arbitrary complexity, such as "bin Laden", "van der Vort", and "Garcia y Vega", as well as middle names of arbitrary size and complexity, differentiating between most last names and the first or middle names or initials preceding them.
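To make the two input forms concrete, here is a minimal sketch of how one might tell them apart by looking at what follows the first comma. This is not the logic nameparse.php actually uses; the function name guess_name_form() and the suffix list are assumptions made up for illustration.

```php
<?php
// Hypothetical helper, NOT part of nameparse.php: guess which of the two
// input forms a name string uses.
function guess_name_form(string $name): string
{
    // Illustrative suffix list (an assumption; a real parser needs a fuller one).
    $suffixes = ['jr', 'sr', 'ii', 'iii', 'iv', 'phd', 'md', 'esq'];

    $comma = strpos($name, ',');
    if ($comma === false) {
        return 'first-last';    // no comma at all: spoken order
    }

    // Take the segment between the first comma and the next one (if any),
    // stripped of surrounding spaces and periods, in lower case.
    $rest  = explode(',', substr($name, $comma + 1));
    $after = strtolower(trim($rest[0], " \t."));

    // "John Smith, Jr." contains a comma, but only a suffix follows it.
    if (in_array($after, $suffixes, true)) {
        return 'first-last';
    }
    return 'last-first';
}
```

With this sketch, guess_name_form('John Smith, Jr.') reports 'first-last', while guess_name_form('Velasquez y Garcia, Dr. Juan Q. Xavier III') reports 'last-first'. The hard part, splitting first, middle, and multi-word last names within each form, is what the real parser is for.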

An example of names correctly parsed by nameparse.php:

To use it, simply include() or require() nameparse.php and call parse_name($string) on any name. parse_name() returns an associative array containing whichever of the name segments "title", "first", "middle", "last", and "suffix" were found. Do note that no spelling, capitalization, or punctuation of titles, prefixes, or suffixes is normalized. That is, every token remains as entered: nameparse.php is a semantic parser only. If you want orthographic or other normalization, you'll have to postprocess the output. However, since the name is now semantically parsed, such postprocessing is (for applications which require it) simple.

print_r(parse_name('Velasquez y Garcia, Dr. Juan Q. Xavier III'));

yields . . .

    [title] => Dr.
    [first] => Juan
    [middle] => Q. Xavier
    [suffix] => III
    [last] => Velasquez y Garcia
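
As an illustration of the kind of postprocessing mentioned earlier, here is a minimal sketch that turns a parsed array into a catalogue-style "Last, First Middle, Suffix" heading. The function format_catalogue_name() is hypothetical and not part of nameparse.php; the input array is simply copied from the print_r() output above, so the sketch runs without the parser.

```php
<?php
// Hypothetical postprocessor, NOT part of nameparse.php: build a
// "Last, First Middle, Suffix" heading from a parse_name()-style array.
function format_catalogue_name(array $parts): string
{
    // Segments that were not found may be absent, so default each to ''.
    $get = function (string $key) use ($parts): string {
        return isset($parts[$key]) ? trim($parts[$key]) : '';
    };

    // Given names: first plus any middle names, skipping empty segments.
    $given = implode(' ', array_filter([$get('first'), $get('middle')], 'strlen'));

    // Join the non-empty pieces; the title is deliberately left out.
    $pieces = array_filter([$get('last'), $given, $get('suffix')], 'strlen');
    return implode(', ', $pieces);
}

// Array copied from the print_r() output above.
$parsed = [
    'title'  => 'Dr.',
    'first'  => 'Juan',
    'middle' => 'Q. Xavier',
    'suffix' => 'III',
    'last'   => 'Velasquez y Garcia',
];

echo format_catalogue_name($parsed), "\n";
// prints: Velasquez y Garcia, Juan Q. Xavier, III
```

Dropping the title and the comma before the suffix are choices made for this sketch only; a real catalogue would follow whatever heading rules it has adopted.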

There you have it. The first drop-in Free and Open Source name parser in PHP. Now to rewrite it in C . . .

If you want test vectors and a test harness, they're available in the "package" section.

nameparse | Download Source | Download Package