Once upon a time (03Mar2007), I wanted to write a library card catalogue. However, my author-parsing routine was leaving something to be desired. It required names to be input in a very rigid form, which I wasn't sure I wanted to do late at night (or after several hours of cataloguing new accessions). Well, "Free and Open Source Software", right? Wrong. There is a huge market for name parsing, apparently, since there are dozens of commercial products claiming to make sense of any name format extant.
I eventually found a free one, written in PHP, no less!
normalize_name, written by Jed Hartman, is along the lines of what
I was looking for.
Unfortunately, it is somewhat limited for my purposes. It normalizes names to
first and last name, discarding titles, not differentiating between first and
middle names, and requiring names to be in "spoken" format (first followed by
last). However, he had some ideas which I incorporated into
nameparse.php, especially in the recognition of prefixes
(prefices?), suffixes (suffices?), and multi-word last names.
nameparse.php can recognize names in
"[title]first[middles]last[,][suffix]" and "last,first[middles][,][suffix]"
forms, which, when you think about it, cover most if not all well-formed names.
nameparse.php handles last names of arbitrary
complexity, such as "bin Laden", "van der Vort", and "Garcia y Vega", as well as
middle names of arbitrary size and complexity, differentiating between most last
names and the first or middle names or initials preceding them.
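To make the two ideas above concrete, here is a minimal sketch of how a parser might tell the "last, first" form from the spoken form, and how it might find where a multi-word last name begins. This is not the code in nameparse.php: the function names, the particle list, and the suffix list are all assumptions made for illustration, and the lists are far from complete.

```php
<?php
// Illustrative sketch only: NOT the code in nameparse.php.
// These lists are assumed for the example and are far from complete.
const SURNAME_PARTICLES = ['van', 'von', 'der', 'de', 'la', 'bin', 'al'];
const SURNAME_JOINERS   = ['y'];                 // as in "Garcia y Vega"
const KNOWN_SUFFIXES    = ['jr', 'sr', 'ii', 'iii', 'iv'];

// Decide whether a name is in "last, first" form. A comma alone is not
// enough, because "John Smith, Jr." also contains one.
function is_inverted_form(string $name): bool
{
    $parts = array_map('trim', explode(',', $name));
    if (count($parts) < 2) {
        return false;                            // no comma: spoken form
    }
    $afterComma = strtolower(rtrim($parts[1], '.'));
    return !in_array($afterComma, KNOWN_SUFFIXES, true);
}

// Given the tokens of a spoken-form name (title and suffix already
// removed), return the index at which the last name starts.
function last_name_start(array $tokens): int
{
    $i = count($tokens) - 1;                     // final token is always in the last name
    while ($i > 1) {                             // always leave token 0 as a first name
        $prev = strtolower($tokens[$i - 1]);
        if (in_array($prev, SURNAME_PARTICLES, true)) {
            $i -= 1;                             // absorb "van", "der", "bin", ...
        } elseif (in_array($prev, SURNAME_JOINERS, true) && $i > 2) {
            $i -= 2;                             // absorb "y" and the surname before it
        } else {
            break;
        }
    }
    return $i;
}

// Split a spoken-form name into given names and last name.
function split_spoken_name(string $name): array
{
    $tokens = preg_split('/\s+/', trim($name));
    $start  = last_name_start($tokens);
    return [
        'given' => implode(' ', array_slice($tokens, 0, $start)),
        'last'  => implode(' ', array_slice($tokens, $start)),
    ];
}
```

Under these assumptions, split_spoken_name('Jan van der Vort') puts "Jan" in the given names and "van der Vort" in the last name, and is_inverted_form('John Smith, Jr.') is false because the text after the comma is only a suffix. A real parser needs much longer lists and has to cope with ambiguous tokens, which is where most of the work lies.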
An example of names correctly parsed by nameparse.php:
To use, simply include
nameparse.php and call
parse_name($string) on any name string.
parse_name() returns an associative array holding whichever of the name
segments "title", "first", "middle", "last", and "suffix" it found. Do note
that no spelling, capitalization, or punctuation of titles, prefixes, or
suffixes is normalized. That is, every token remains as entered:
nameparse.php is a semantic parser only. If you want orthographic
or other normalization, you'll have to postprocess the output. However,
since the name is now semantically parsed, such postprocessing is (for
applications which require it) simple.
print_r(parse_name('Velasquez y Garcia, Dr. Juan Q. Xavier III'));
yields . . .
Array
(
    [title] => Dr.
    [first] => Juan
    [middle] => Q. Xavier
    [suffix] => III
    [last] => Velasquez y Garcia
)
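As for the postprocessing mentioned earlier, a minimal sketch might look like the following. Both functions are hypothetical helpers, not part of nameparse.php; they assume only the array shape shown in the output above, and the choice to strip trailing punctuation from titles and suffixes is just one possible normalization.

```php
<?php
// Hypothetical helpers, not part of nameparse.php. They assume only the
// array keys shown above: "title", "first", "middle", "last", "suffix".

// Collapse stray whitespace everywhere, and drop trailing punctuation
// from the title and suffix ("Dr." becomes "Dr").
function normalize_parsed_name(array $parts): array
{
    $out = [];
    foreach ($parts as $key => $value) {
        $value = trim(preg_replace('/\s+/', ' ', $value));
        if ($key === 'title' || $key === 'suffix') {
            $value = rtrim($value, '.,');
        }
        $out[$key] = $value;
    }
    return $out;
}

// Build a "Last, First Middle, Suffix" key, the sort of heading a card
// catalogue might file under. Missing segments are simply skipped.
function catalogue_key(array $parts): string
{
    $given  = trim(($parts['first'] ?? '') . ' ' . ($parts['middle'] ?? ''));
    $pieces = [$parts['last'] ?? '', $given, $parts['suffix'] ?? ''];
    $pieces = array_filter($pieces, function ($p) { return $p !== ''; });
    return implode(', ', $pieces);
}

// The array literal mirrors the parse_name() output shown above.
$parsed = [
    'title'  => 'Dr.',
    'first'  => 'Juan',
    'middle' => 'Q. Xavier',
    'suffix' => 'III',
    'last'   => 'Velasquez y Garcia',
];
$clean = normalize_parsed_name($parsed);
echo catalogue_key($clean), "\n";   // prints "Velasquez y Garcia, Juan Q. Xavier, III"
```

Because the segments arrive already labeled, each rule can target one key without guessing what a token means, which is the point of doing the semantic parse first.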
There you have it. The first drop-in Free and Open Source name parser in PHP. Now to rewrite it in C . . .
If you want test vectors and a test harness, they're available in the "package" section.