Classifying Unknown Proper Noun Phrases Without Context
Technical Report dbpubs/2002-46
April 9, 2002
Download PDF (9 pages)
Download PPT (1.3MB; presentation of the paper to the NLP group)
As I describe in my post about my master’s thesis, I started doing research in Natural Language Processing after Chris Manning, the professor that taught my NLP class at Stanford, asked me to further develop the work I did for my class project. He helped me clean up my model, suggested some improvements, and taught me the official way to write and style a professional academic paper (I narrowly avoided having to write it in LaTeX!). I was proud of the final paper, but it wasn’t accepted (I believe we submitted it to EMNLP 02).
This was the start of a series of lessons I learned at Stanford about the difference between what I personally found interesting (and how I wanted to explain it) and what the academic establishment (that decides what papers are published by peer review) thought the rules and conventions had to be for “serious academic work”. While I got better at “playing the game” during my time at Stanford–and to be fair, some of it was actually good and helpful in terms of how to be precise, avoid overstating results, and so on–I still feel that the academic community has lost sight of their original aspirations in some important ways.
At its best, academic research embarks on grand challenges that will take many years to accomplish but whose results will change society in profound ways. It’s a long-term investment for a long-term gain. NLP has no shortage of these lofty goals, including the ability to carry on a natural conversation with your computer, high quality machine-translation of text in foreign languages, the ability to automatically summarize large quantities of text, and so on. But in practice I have found that in most of these areas, the sub-community that is ostensibly working one of these problems has actually constructed its own version of the problem, along with its own notions of what’s important and what isn’t, that doesn’t always ground out in the real world at the end of the day. This limits progress when work that could contribute to the original goal is not seen as important in the current academic formulation. And since, in most cases, the final challenge is not yet solvable, it’s often difficult to offer empirical counter-evidence to the opinions of the establishment as to whether a piece of work will or will not end up making an important difference.
I found this particularly vexing because my intuition is driven strongly by playing with a system, noting its current shortcomings, and then devising clever ways to overcome them. Some of the shortcomings I perceived were not considered shortcomings in the academic version of these challenges, and thus my interest in improving those aspects fell largely on deaf ears.
For instance, I did a fair amount of work in information extraction, which is about extracting structured information from free text (e.g. finding the title, author, and price of a book on an amazon web page or determining which company bought which other one and for how much in a Reuters news article). The academic formulation of this problem is to run your system fully autonomously over a collection of pages, and your score is based on how many mistakes you make. There are two kinds of mistakes–extracting the wrong piece of information, or not extracting anything when you should have–and both are usually counted as equally bad (the main score used in papers is F1, which is the harmonic average of precision and recall, which measure those two types of errors respectively). If your paper doesn’t show a competitive F1, it’s difficult to convince the community that you’re advancing the state-of-the-art, and thus it’s difficult to get it published.
However, in many real-world applications, the computer is not being run completely autonomously, and mistakes and omissions are not equally costly. In fact, if you’re trying to construct a high-quality database of information starting from free text, I’d say the general rule is that people are ultimately responsible for creating the output (the computer program is a means to that end), and that the real challenge is to see how much text you can automatically extract given that what you do extract has to be extremely high quality. In most cases, returning garbage data is much worse than not being able to cover every piece of information possible, and if humans can clean up the computer’s output, they will definitely want to do so. Thus the real-world challenges are maximizing recall at a fixed high-level of precision (not maximizing F1) and accurately estimating confidence scores for each piece of information extracted (so the human can focus on just cleaning up the tricky parts), neither of which fit cleanly into the academic conception of the problem. And this is to say nothing about how quickly or robustly the systems can process the information they’re extracting, which would clearly also be of utmost importance in a functioning system.
I witnessed firsthand this difference between the problem academics are trying to solved and the solution that real applications need when I started working for Plaxo. A core component of the original system was the ability to let people e-mail you their current contact info (either in free text, like “hey, i got a new cell phone…” or in the signature blocks at the bottom of messages) and automatically extract that information and stick it in your address book. This would clearly be very useful if it worked well (the status quo is you have to copy-and-paste it all manually, and as a result, most people just leave that information sitting in e-mail), and it clearly fits the real-world description above (sticking garbage in your address book is unaccepatble, whereas failing to extract 100% of the info is still strictly better than not doing anything). None of the academic systems being worked on had a chance of doing a good job at this problem, and so I had to write a custom solution involving a lot of complicated regular expressions and other pattern-matching code. My system ended up working very well–and very quickly (it could process a typical message in under 50 msec, whereas most academic systems are a “start it running and then go for coffee” kind of affair)–and developing it required a lot of clever ideas, but it was certainly nothing I could get an academic paper published about.
The irony cuts both ways–when I tried to solve the real problem, I couldn’t get published, but the work that was published didn’t help. And yet the academic community could surely do a much better job of solving the real problem if only they hadn’t decided it wasn’t the problem they were interested in. I only bring this up because I am a big believer in the power and potential of academic research, and I still optimistically hope that its impact could be that much greater if its goals were more closely aligned with the ultimate problems they’re trying to solve. By bridging the gap between academia and companies, both should be able to benefit tremendously.
If you’ve read this far in the hope of knowing more about the contents of my first NLP paper, I’m sorry to say it has nothing to do with information extraction, and certainly nothing to do with the academic/real-world divide. But it’s a neat paper (and probably shorter than this blog post!) and despite its not being published, the work it describes ended up influencing other work that I and people at the Stanford NLP group did, some of which did end up gaining a fair bit of notoriety in academic circles.