Categorization by Character-Level Models: Exploiting the Sound Symbolism of Proper Names
Master’s thesis, Symbolic Systems Program, Stanford University
Christopher D. Manning, Advisor
June 11, 2003

Download PDF (52 pages) 

A character-level hidden markov model (HMM)After four years as an undergraduate at Stanford, I wasn’t ready to leave yet. There were more classes I wanted to take, and I wanted to do more research. Since I was in the Symbolic Systems program, I was taking a mix of Computer Science, Linguistics, Psychology, and Philosophy classes for my major. I was particularly interested in CS and Linguistics, and I wanted to take many of the graudate-level classes in each department, so I really needed a fifth year at school.

During my senior year, I had started doing some NLP research with Chris Manning, which I was really enjoying. When I took his CS224N class, I did a final paper with Steve “Sensei” Patel in which we built a model to recognize unknown words as drug, company, person, place, and movie names based on their composition (e.g. “cotramoxizole” looks like a drug word, and “InterTrust” looks like company name, and we trained our model to learn these patterns). Our model performed very well–in fact, it did better than our friends on the same tests!–and Chris asked me if I’d like to develop this research further with him. After the work we did during my senior year, he offered to fund me as a research assistant during my fifth year.

Stanford has this amazing co-terminal master’s program where you can start taking master’s classes before you finish your undergraduate degree, and so you end up getting both degrees in five years (some people, like my wife, even manage to squeeze both degrees into four years, but like I said, I wasn’t ready to leave yet). The symbolic systems program had just started offering a co-term, but it was research-based (in some departments you just have to take more classes) and so one requirement was you had to have a professor sponsoring your research and vouching that you were serious. The timing was perfect, and I was selected as one of a few students to do a research MS in SSP that year.

(That summer, I also met the founders of Plaxo and started working “part time” building some NLP tools for them. That’s another story, but let me say it’s really not possible to do research and a startup at the same time and do both of them well.)

While working in the Stanford NLP group, I spent a lot of time with Dan Klein, one of Chris’s star PhD students, who’s now a professor at Berkeley. He had a major influence on my work, as well as on me personally. During my co-term year, I also started working with a CS master’s student named Huy Nguyen. We became good friends and he’s now an engineer at Plaxo (hmm, I wonder how that happened ;)).

I wrote quite a few academic NLP papers during my time at Stanford, some of which got published and some of which didn’t. The original paper I did with Chris based on my CS224N project got rejected, but it ended up forming the core of the model Dan, Huy, and I used at the CoNLL-03 competition, which was very successful and has since been widely cited.

My thesis represents the culmination of the work I did at Stanford. It’s central thrust is that you can tell a surprisingly great deal about a proper name by looking at its composition at the character-level. Most NLP systems just treat words as opaque symbols (“dog” = x1, “cat” = x2, etc.) and treat all unknown (previously unseen) words as a generic UNK word (that’s really all you can do if you’re only gathering statistics at the word-level). As a result, these systems often perform poorly when dealing with unknown words, which is increasingly common as they are applied to the untamed world-wide-web or to domains like medicine and biology that are full of specific technical words.

My research looked at a variety of ways you could exploit regularities in the character sequences of unknown words to segment and classify them semantically, even though you’d never seen them before. In addition to presenting experimental results in a number of domains and in multiple lanugages, I also investigated why there appears to be this sound-symbolic regularity in naming, looking at language evolution and professional brand-name creation in particular.

When my thesis was complete, I had to decide whether to apply to a PhD program to continue my research or to instead join Plaxo full-time as an engineer. As you probably know, I ended up choosing Plaxo, mainly because I really believed in the founders and the company’s vision, but also because I wanted to do something tangible that would have immediate impact in the real-world. But I still think that someday I might like to go back to school and continue doing NLP research. The way I look at it, I can’t lose: by the time I’m ready to go back, either all the interesting problems in NLP will have already been solved–in which case the world will be a truly amazing place to live–or there will still be plenty for me left to work on. 🙂

Liked this post? Follow this blog to get more.