Sources of Success for Boosted Wrapper Induction
Journal of Machine Learning Research, Volume 5
Written October 2001, published December 2004

Download PDF (29 pages)

Download PPT (900KB; presentation at Stanford’s Seminar for Computational Learning and Adaptation)

I co-wrote this paper during the first summer I started doing NLP research, but it didn’t see the light of day until a year after I’d finished my Master’s degree. Yeesh!

It all started when I decided to spend the Summer of 2001 (between my junior and senior years at Stanford) doing research at UC San Diego with Charles Elkan. I’d met Charles through my dad on an earlier visit to UCSD, and his research exhibited exactly the mix of drive for deep understanding and desire to solve real-world problems that I was looking for. I was also working at the time for a startup called MedExpert that was using AI to help provide personalized medical advice. Since one of the major challenges was digesting the staggering volume of medical literature, MedExpert agreed to fund my summer research in information extraction. So I joined the UCSD AI lab for the summer and started working on tools for extracting information from text, a field that I would end up devoting most of my subsequent research to in one form or another.

As it happened, one of Charles’s PhD students, Dave Kauchak, was also working on information extraction, and he had recently gotten interested in a technique called Boosted Wrapper Induction. So Dave, Charles, and I ended up writing a lengthy paper that analyzed how BWI worked and how to improve it, including some specific work on medical literature using data from Mark Craven. By the end of the summer we had some interesting results and a well-written paper (or so I thought), and I was looking forward to being a published author.

Then the fun began. We submitted the paper for publication in an AI journal (it was too long to be a conference paper) and it got rejected, but with a request to re-submit once we had addressed a list of requested changes. Many of the changes seemed to be more about style than substance, but we decided to make them anyway, and in the process we ran some additional experiments to shore up any perceived weaknesses (by this time I was back at Stanford and Dave was TAing classes, so re-running old research was not at the top of our wish list). Finally, we submitted our revised paper to a new set of reviewers, who came back with a different set of issues they felt we had to fix first.

To make a long story short, we kept fiddling with it until finally, long after I had stopped personally working on this paper (and NLP altogether, for that matter), I got an e-mail from Dave saying the paper had finally been accepted and would be published in the highly respected Journal of Machine Learning Research. It was hard to believe, but sure enough, at the end of 2004, more than three years after we first wrote the paper, it finally saw the light of day. It was the paper that would not die.

Charles had long since published an earlier version of the paper as a technical report, so at least our work was able to have a somewhat more timely impact while it was “in the machine”. I’m glad it finally did get published, and I know that academic journals are notoriously slow, but given how fast the frontier of computer science and NLP is moving, waiting 3+ years to release a paper is almost comical. I can’t wait until this fall to get the new issue and find out what Dave did the following summer. :p
