Joseph Smarr

Thoughts on web development, tech, and life.


My first NLP research paper

Classifying Unknown Proper Noun Phrases Without Context
Technical Report dbpubs/2002-46
Stanford University
April 9, 2002

Download PDF (9 pages)

Download PPT (1.3MB; presentation of the paper to the NLP group)

As I describe in my post about my master’s thesis, I started doing research in Natural Language Processing after Chris Manning, the professor that taught my NLP class at Stanford, asked me to further develop the work I did for my class project. He helped me clean up my model, suggested some improvements, and taught me the official way to write and style a professional academic paper (I narrowly avoided having to write it in LaTeX!). I was proud of the final paper, but it wasn’t accepted (I believe we submitted it to EMNLP 02).

This was the start of a series of lessons I learned at Stanford about the difference between what I personally found interesting (and how I wanted to explain it) and what the academic establishment (that decides what papers are published by peer review) thought the rules and conventions had to be for “serious academic work”. While I got better at “playing the game” during my time at Stanford–and to be fair, some of it was actually good and helpful in terms of how to be precise, avoid overstating results, and so on–I still feel that the academic community has lost sight of its original aspirations in some important ways.

At its best, academic research embarks on grand challenges that will take many years to accomplish but whose results will change society in profound ways. It’s a long-term investment for a long-term gain. NLP has no shortage of these lofty goals, including the ability to carry on a natural conversation with your computer, high-quality machine translation of text in foreign languages, the ability to automatically summarize large quantities of text, and so on. But in practice I have found that in most of these areas, the sub-community that is ostensibly working on one of these problems has actually constructed its own version of the problem, along with its own notions of what’s important and what isn’t, that doesn’t always ground out in the real world at the end of the day. This limits progress when work that could contribute to the original goal is not seen as important in the current academic formulation. And since, in most cases, the final challenge is not yet solvable, it’s often difficult to offer empirical counter-evidence to the opinions of the establishment as to whether a piece of work will or will not end up making an important difference.

I found this particularly vexing because my intuition is driven strongly by playing with a system, noting its current shortcomings, and then devising clever ways to overcome them. Some of the shortcomings I perceived were not considered shortcomings in the academic version of these challenges, and thus my interest in improving those aspects fell largely on deaf ears.

For instance, I did a fair amount of work in information extraction, which is about extracting structured information from free text (e.g. finding the title, author, and price of a book on an Amazon web page or determining which company bought which other one and for how much in a Reuters news article). The academic formulation of this problem is to run your system fully autonomously over a collection of pages, and your score is based on how many mistakes you make. There are two kinds of mistakes–extracting the wrong piece of information, or not extracting anything when you should have–and both are usually counted as equally bad (the main score used in papers is F1, which is the harmonic average of precision and recall, which measure those two types of errors respectively). If your paper doesn’t show a competitive F1, it’s difficult to convince the community that you’re advancing the state-of-the-art, and thus it’s difficult to get it published.
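
To make that scoring concrete, here is a toy sketch in Python (my own illustration, with made-up book fields, not anything from an actual evaluation) of how precision, recall, and F1 are typically computed for an extraction task:

    def precision_recall_f1(predicted, gold):
        """Score a set of predicted extractions against the gold (correct) set."""
        predicted, gold = set(predicted), set(gold)
        true_positives = len(predicted & gold)
        precision = true_positives / len(predicted) if predicted else 0.0  # fraction of extractions that were right
        recall = true_positives / len(gold) if gold else 0.0               # fraction of right answers we found
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        return precision, recall, f1

    # The system gets the title and author right but extracts the wrong price,
    # so precision = recall = F1 = 2/3.
    print(precision_recall_f1(
        predicted=[("title", "Ulysses"), ("author", "James Joyce"), ("price", "$1.00")],
        gold=[("title", "Ulysses"), ("author", "James Joyce"), ("price", "$12.99")]))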

However, in many real-world applications, the computer is not being run completely autonomously, and mistakes and omissions are not equally costly. In fact, if you’re trying to construct a high-quality database of information starting from free text, I’d say the general rule is that people are ultimately responsible for creating the output (the computer program is a means to that end), and that the real challenge is to see how much text you can automatically extract given that what you do extract has to be extremely high quality. In most cases, returning garbage data is much worse than not being able to cover every piece of information possible, and if humans can clean up the computer’s output, they will definitely want to do so. Thus the real-world challenges are maximizing recall at a fixed high level of precision (not maximizing F1) and accurately estimating confidence scores for each piece of information extracted (so the human can focus on just cleaning up the tricky parts), neither of which fit cleanly into the academic conception of the problem. And this is to say nothing about how quickly or robustly the systems can process the information they’re extracting, which would clearly also be of utmost importance in a functioning system.
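
As a sketch of what that alternative objective looks like (using hypothetical confidence-scored extractions, not any real system’s output), here is how you might pick the confidence threshold that maximizes recall subject to a minimum precision:

    def maximize_recall_at_precision(scored_predictions, num_gold, min_precision=0.95):
        """scored_predictions is a list of (confidence, is_correct) pairs.
        Return (recall, threshold) for the confidence cutoff that maximizes recall
        while keeping precision >= min_precision, or None if no cutoff qualifies."""
        ranked = sorted(scored_predictions, key=lambda p: p[0], reverse=True)
        best, correct = None, 0
        for i, (confidence, is_correct) in enumerate(ranked, start=1):
            correct += int(is_correct)
            precision, recall = correct / i, correct / num_gold
            if precision >= min_precision and (best is None or recall > best[0]):
                best = (recall, confidence)
        return best

    # Four confidence-scored extractions against five gold answers: keeping everything
    # scored at least 0.80 yields recall 0.6 at precision 1.0, so this prints (0.6, 0.8).
    print(maximize_recall_at_precision(
        [(0.99, True), (0.95, True), (0.80, True), (0.40, False)],
        num_gold=5, min_precision=0.9))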

I witnessed firsthand this difference between the problem academics are trying to solve and the solution that real applications need when I started working for Plaxo. A core component of the original system was the ability to let people e-mail you their current contact info (either in free text, like “hey, i got a new cell phone…” or in the signature blocks at the bottom of messages) and automatically extract that information and stick it in your address book. This would clearly be very useful if it worked well (the status quo is you have to copy-and-paste it all manually, and as a result, most people just leave that information sitting in e-mail), and it clearly fits the real-world description above (sticking garbage in your address book is unacceptable, whereas failing to extract 100% of the info is still strictly better than not doing anything). None of the academic systems being worked on had a chance of doing a good job at this problem, and so I had to write a custom solution involving a lot of complicated regular expressions and other pattern-matching code. My system ended up working very well–and very quickly (it could process a typical message in under 50 msec, whereas most academic systems are a “start it running and then go for coffee” kind of affair)–and developing it required a lot of clever ideas, but it was certainly nothing I could get an academic paper published about.
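
I can’t reproduce the actual Plaxo code, but the flavor of that pattern-matching approach was roughly the following toy sketch (the regular expressions here are simplified illustrations I made up for this post; the real system had far more patterns plus scoring and disambiguation logic):

    import re

    # Toy patterns for illustration only; real messages and signature blocks are much messier.
    EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
    PHONE_RE = re.compile(r"(?:\+?1[\s.-]?)?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}")

    def extract_contact_info(message_text):
        """Pull candidate e-mail addresses and phone numbers out of a free-text message."""
        return {
            "emails": EMAIL_RE.findall(message_text),
            "phones": PHONE_RE.findall(message_text),
        }

    print(extract_contact_info(
        "hey, i got a new cell phone: 650-555-1234. "
        "also, my new work e-mail is jane@example.com -- update your address book!"))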

The irony cuts both ways–when I tried to solve the real problem, I couldn’t get published, but the work that was published didn’t help. And yet the academic community could surely do a much better job of solving the real problem if only they hadn’t decided it wasn’t the problem they were interested in. I only bring this up because I am a big believer in the power and potential of academic research, and I still optimistically hope that its impact could be that much greater if its goals were more closely aligned with the ultimate problems it’s trying to solve. By bridging the gap between academia and companies, both should be able to benefit tremendously.

If you’ve read this far in the hope of knowing more about the contents of my first NLP paper, I’m sorry to say it has nothing to do with information extraction, and certainly nothing to do with the academic/real-world divide. But it’s a neat paper (and probably shorter than this blog post!) and despite its not being published, the work it describes ended up influencing other work that I and people at the Stanford NLP group did, some of which did end up gaining a fair bit of notoriety in academic circles.

My Stanford Master’s Thesis

Categorization by Character-Level Models: Exploiting the Sound Symbolism of Proper Names
Master’s thesis, Symbolic Systems Program, Stanford University
Christopher D. Manning, Advisor
June 11, 2003

Download PDF (52 pages) 

[Figure: A character-level hidden Markov model (HMM)]

After four years as an undergraduate at Stanford, I wasn’t ready to leave yet. There were more classes I wanted to take, and I wanted to do more research. Since I was in the Symbolic Systems program, I was taking a mix of Computer Science, Linguistics, Psychology, and Philosophy classes for my major. I was particularly interested in CS and Linguistics, and I wanted to take many of the graduate-level classes in each department, so I really needed a fifth year at school.

During my senior year, I had started doing some NLP research with Chris Manning, which I was really enjoying. When I took his CS224N class, I did a final paper with Steve “Sensei” Patel in which we built a model to recognize unknown words as drug, company, person, place, and movie names based on their composition (e.g. “cotramoxizole” looks like a drug word, and “InterTrust” looks like a company name, and we trained our model to learn these patterns). Our model performed very well–in fact, it did better than our friends on the same tests!–and Chris asked me if I’d like to develop this research further with him. After the work we did during my senior year, he offered to fund me as a research assistant during my fifth year.

Stanford has this amazing co-terminal master’s program where you can start taking master’s classes before you finish your undergraduate degree, and so you end up getting both degrees in five years (some people, like my wife, even manage to squeeze both degrees into four years, but like I said, I wasn’t ready to leave yet). The Symbolic Systems program had just started offering a co-term, but it was research-based (in some departments you just have to take more classes), so one requirement was that you had to have a professor sponsoring your research and vouching that you were serious. The timing was perfect, and I was selected as one of a few students to do a research MS in SSP that year.

(That summer, I also met the founders of Plaxo and started working “part time” building some NLP tools for them. That’s another story, but let me say it’s really not possible to do research and a startup at the same time and do both of them well.)

While working in the Stanford NLP group, I spent a lot of time with Dan Klein, one of Chris’s star PhD students, who’s now a professor at Berkeley. He had a major influence on my work, as well as on me personally. During my co-term year, I also started working with a CS master’s student named Huy Nguyen. We became good friends and he’s now an engineer at Plaxo (hmm, I wonder how that happened ;)).

I wrote quite a few academic NLP papers during my time at Stanford, some of which got published and some of which didn’t. The original paper I did with Chris based on my CS224N project got rejected, but it ended up forming the core of the model Dan, Huy, and I used at the CoNLL-03 competition, which was very successful and has since been widely cited.

My thesis represents the culmination of the work I did at Stanford. Its central thrust is that you can tell a surprising amount about a proper name by looking at its composition at the character level. Most NLP systems just treat words as opaque symbols (“dog” = x1, “cat” = x2, etc.) and treat all unknown (previously unseen) words as a generic UNK word (that’s really all you can do if you’re only gathering statistics at the word level). As a result, these systems often perform poorly when dealing with unknown words, a problem that is increasingly common as they are applied to the untamed world-wide-web or to domains like medicine and biology that are full of specific technical words.

My research looked at a variety of ways you could exploit regularities in the character sequences of unknown words to segment and classify them semantically, even though you’d never seen them before. In addition to presenting experimental results in a number of domains and in multiple languages, I also investigated why there appears to be this sound-symbolic regularity in naming, looking at language evolution and professional brand-name creation in particular.
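
The basic intuition is easy to sketch. Here is a toy character n-gram classifier in Python (nothing like the actual models in the thesis, and trained on a tiny hand-picked list of names purely for illustration) that guesses which category an unseen name most resembles:

    import math
    from collections import Counter

    def char_ngrams(word, n=3):
        """Character n-grams with boundary markers; e.g. 'ol' -> ['^ol', 'ol$']."""
        padded = "^" + word.lower() + "$"
        return [padded[i:i + n] for i in range(len(padded) - n + 1)]

    def train(examples):
        """examples is a list of (name, category) pairs; returns per-category n-gram counts."""
        models = {}
        for name, category in examples:
            models.setdefault(category, Counter()).update(char_ngrams(name))
        return models

    def classify(name, models, smoothing=0.5):
        """Pick the category whose n-gram distribution best explains the name (add-k smoothing)."""
        best_category, best_score = None, float("-inf")
        for category, counts in models.items():
            total, vocab = sum(counts.values()), len(counts) + 1
            score = sum(math.log((counts[g] + smoothing) / (total + smoothing * vocab))
                        for g in char_ngrams(name))
            if score > best_score:
                best_category, best_score = category, score
        return best_category

    # Tiny hand-picked training set; the real experiments used thousands of labeled names.
    models = train([("amoxicillin", "drug"), ("ibuprofen", "drug"), ("loratadine", "drug"),
                    ("InterTrust", "company"), ("MicroStrategy", "company"), ("NetGravity", "company")])
    print(classify("cotramoxizole", models))  # prints 'drug' -- it shares n-grams like 'amo', 'mox', 'oxi'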

When my thesis was complete, I had to decide whether to apply to a PhD program to continue my research or to instead join Plaxo full-time as an engineer. As you probably know, I ended up choosing Plaxo, mainly because I really believed in the founders and the company’s vision, but also because I wanted to do something tangible that would have immediate impact in the real-world. But I still think that someday I might like to go back to school and continue doing NLP research. The way I look at it, I can’t lose: by the time I’m ready to go back, either all the interesting problems in NLP will have already been solved–in which case the world will be a truly amazing place to live–or there will still be plenty for me left to work on. 🙂

FOAF Workshop Talk

Technical and Privacy Challenges for Integrating FOAF into Existing Applications
FOAF Workshop
Galway, Ireland
September 2, 2004

Full paper (HTML)

Download PPT (2.1MB)

FOAF stands for friend-of-a-friend and it’s an open standard for describing your contact information and who you know. When social networking sites started exploding, many people were annoyed that they had to keep entering this information over and over again, and they also wanted to maintain ownership of their own information. I got excited about the potential for FOAF at Plaxo because we’re always looking for new ways to help users get their data in and out of other applications and services. Unlike most social networks, we don’t benefit from keeping your data trapped in a walled garden–quite the opposite! When I started looking more seriously at implementing FOAF in Plaxo, I noticed a number of issues–both technical (e.g. how to handle authentication) and privacy-related (e.g. do I have a right to publish contact info about the people in my address book, or is that their call?)–that I thought the FOAF community should be talking about. After writing a blog post for Plaxo about the potential for FOAF and its challenges (which turned out to be our most popular post for quite some time), I expanded it into a full paper, which I presented at the FOAF Workshop in Galway, Ireland.

I went to Ireland a few days early and spent them in Dublin, which I absolutely loved. Ireland has this enchanting mix of old- and new-world culture, it’s all iridescently green, and the people were all friendly. I took the train cross-country to Galway, which is also a very cool town.
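
To give a feel for what FOAF data actually contains, here is a minimal sketch using the rdflib Python library (my choice purely for illustration, assuming a recent rdflib is installed; the names and address are made up) that builds and serializes a tiny FOAF profile:

    from rdflib import Graph, BNode, Literal, URIRef
    from rdflib.namespace import FOAF, RDF

    g = Graph()
    me, friend = BNode(), BNode()

    # A person, their contact info, and one "knows" link -- the core of a FOAF profile.
    g.add((me, RDF.type, FOAF.Person))
    g.add((me, FOAF.name, Literal("Jane Example")))
    g.add((me, FOAF.mbox, URIRef("mailto:jane@example.com")))
    g.add((me, FOAF.knows, friend))
    g.add((friend, RDF.type, FOAF.Person))
    g.add((friend, FOAF.name, Literal("John Example")))

    print(g.serialize(format="turtle"))

The privacy questions above are about exactly this kind of data: publishing your own name and mbox is one thing, but publishing a foaf:knows entry (or worse, contact details) for everyone in your address book is quite another.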

I haven’t heard much about FOAF lately, though I believe the project is still being worked on by some people. I had high hopes that Marc Canter’s FOAFnet project (a subset of FOAF that lets you import and export your social network data from a web site) would be simple and sexy enough to gain widespread adoption, but it doesn’t look like that ever happened, and most people don’t seem to be outraged that they have to maintain separate profiles and contact lists on MySpace, Facebook, LinkedIn, and so on. Maybe one day these sites will all sync with Plaxo, but until then they continue to be separate walled gardens that own your data.

Mashup University Talk

Smarten Up Your Address Book with a Plaxo Mashup
Mashup University (part of Mashup Camp 2)
Mountain View, CA (Computer History Museum)
July 12, 2006

Download PPT (1.7MB)

Watch the complete video of my talk (QT, 78MB)

Plaxo sponsored me to give a talk at the beginning of MashupCamp2 (alongside speakers from Microsoft, AOL, and Intel) during its new “Mashup University” section. I talked about the Plaxo widget and our APIs and why they’re a useful ingredient for many sites. MashupCamp itself was also great fun, and it made me a strong believer in the style of un-conferences (where the schedule is formed ad-hoc by the participants when the conference starts), something I’ve since used at Plaxo to cut down on scheduling meetings. Rather than trying to get on everyone’s calendar, we just reserve Tuesday afternoons for an internal unconference we call Meataxo (yes, the spelling is intentional–we had to do something to make the idea of a meeting marathon sound fun :)).

I’ve covered my talk at MashupU in more detail on Plaxo’s blog, and Joe “Duck” Hunkins also wrote a great summary.

Cross-Site Ajax (OSCON 2006)

Cross-Site Ajax: Challenges and Techniques for Building Rich Web 2.0 Mashups
O’Reilly Open Source Convention (OSCON) 2006
Portland, OR
July 26, 2006

Download PPT (1.8MB)

This was the first OSCON I ever attended. I had a great time and I met a lot of amazing people. I’m definitely going back next year. Much of what I discussed in the talk came from work I did on the Plaxo widget, and the point of this talk was to share the techniques I’d learned and also to raise awareness and debate of the larger issues and privacy/technology tradeoffs involved. I’ve covered this talk in more detail on Plaxo’s blog (thanks Plaxo for sending me!). Kevin Yank also blogged a summary of the talk.

Joseph, why a blog, and why now?

You know, I used to care about my presence on the web. Or, rather, I actually used to do something about it. In fact, I’ve had a personal web page since 1993, when I was a subfreshman at Uni High in Urbana, IL (the page was called mosaic.home.html, which kinda dates it). Sadly, I can’t seem to find a copy of my website’s earliest incarnation, but I do still have other old versions of my web site from Uni High, NCSA, and Stanford.

After I graduated from Stanford in 2003, they promptly took back my Leland account, and my web site went with it (do they really need the extra logins and disk quota that badly?!). [If I’d been a pure CS student, I would have gotten to keep a cs.stanford.edu account, but being in Symbolic Systems–even though I took as many CS classes as most CS students–I was not afforded that luxury.] So my web presence regressed to a collection of Google search results, incomplete profile pages, and bylines on blog posts I did for Plaxo.

It’s ironic that my move to becoming a professional web developer coincided with the first time since I was 12 that I didn’t have a personal web page. But honestly, it was just because I was too busy and/or too lazy to find more web space and get started again.

Whilst I was busy working on Plaxo’s web site instead of my own, an additional variable entered the equation: blogs. My first serious exposure to blogs came when Mark Fletcher, who was working at Plaxo in the early days, started talking about how he was going to build a web-based blog aggregator he called Bloglines. I remember at the time thinking to myself, “Mark, you’re a smart guy, and the only time-sink bigger than reading these random geeks’ rants in the first place is actually building a tool to help you read more of them!” Of course, in retrospect, the move was prescient (I shouldn’t be too surprised: Mark has a track record for timing the democratization of geek tools, having done the same during the bubble with mailing lists at ONElist/eGroups) and I now spend more time in Bloglines than any other web site.

Most of the blogs I read are not personal blogs, but rather news about companies I’m interested in (e.g. TiVo, Netflix, Zvents), technical articles about web development (e.g. Ajaxian, A List Apart, dojo.foo), or mentions of Plaxo across the blogosphere (Bloglines search feeds do this well, as do Technorati and Google). Most people I know that have started personal blogs post infrequently, and unless they’re a personal friend of mine, I’m not usually interested in that much of what they’re saying (it’s just not that relevant to my life).

So, why a blog, and why now? The direct reason is that I made a New Year’s resolution (sorry, Rikk, I know you hate those) to do a better job of work-life balance, and setting up a personal web site is one of those “life” things I haven’t been spending enough time on lately (spending more time with my wife is another). But I’ve also had more cause to want to put things online lately. I’ve been giving more talks, going to more conferences, and working on more cutting-edge web development that I want to discuss and share with my colleagues across the web. Mainly I’ve done so through Plaxo’s corporate blog, but there’s a lot I want to talk about that’s probably too specific and long for that forum (you may have noticed my inability to keep blog posts short by now). And when Mark Jen (who, I’m happy to say, I helped hire at Plaxo when Google foolishly let him go) told me he could hook me up with space at DreamHost and even set up WordPress for me (Matt, you’ve done a bang-up job with that project!), I decided I really had no excuse not to go for it. So here I am!

I really will try to keep this site fresh and meaningful, and I encourage you to pressure me if I don’t live up to that goal. I started keeping a list of ideas for things to write about and it’s already several pages long, so all I need is the time/discipline to regularly spend some time on it. The few friends of mine that do regularly post about their lives and share photos make me feel much closer and more connected to them than I otherwise would, and I hope I can return the favor. One realization that made me want to start a personal blog even after seeing so many rantings-of-some-random-dude is that just because anyone on the web can read your blog doesn’t mean you have to write for a global audience. If a few people that want to keep in touch with you can do so from your blog, you’ve done them and yourself a service, and if it’s not relevant to everyone else, they can read something else.

That being said, a lot of what I hope to write about is ideas I’ve had about web development, tech, and entrepreneurship in general, which I hope will be of interest even to people that don’t know me personally. I’ll try to do a good job of tagging my posts so you can focus on the ones you care about. Let me know if you have any suggestions about how I can design/run this site better. I hope I can manage it like any good startup project: ship early and often, listen to your users, and rev quickly. 🙂
