Joseph Smarr

Thoughts on web development, tech, and life.


Has it really been five years already??

My Stanford class book page

I can’t believe it, but Stanford is already telling me to get ready for my five-year college reunion this fall. Five years–that’s as long as I was in college (including my Master’s degree), but these five years sure went by a lot faster than the previous five! Then again, I just passed my five-year anniversary at Plaxo (the math is a bit funny because I started working at Plaxo before I finished my MS, which btw is not advisable for one’s sanity).

Anyway, as part of the reunion they asked everyone to make a page for a “class book” that they’ll be distributing. It’s a one-pager where you share some of your Stanford memories and give an update on your life since graduating. I think they expected most people to draw their class book page by hand and snail-mail it in or use their web-based pseudo-WYSIWYG editor, but I wanted a bit more control. So I downloaded the template PDF and opened it in Adobe Illustrator, which converted it to line-art (wow–product compatibility, who knew?!). Then I was able to add the type and graphics in Illustrator and save the final copy back out to a PDF.

For me, life since Stanford meant three things: doing NLP research (this is the reunion for my undergrad class), working at Plaxo, and getting married. As scary as it is to consider that five years have gone by already, when I actually stop to think of all the wonderful things that have happened since then, I consider myself extremely fortunate. I couldn’t be happier. In fact, I could really use another five years like this one!

One quick technical note: Since I embedded lots of photos in my class book page at their original resolution (I just scaled them down in Illustrator so they would still print at high quality), the file ended up being almost 200MB. When I first exported it as a PDF, I kept all the default options, including “preserve Illustrator editing capabilities” and the resulting PDF was 140MB. Clearly I could not e-mail this to Stanford nor post it on my web site. So I tried again, unchecked the Illustrator option, and also went into the compression settings and told it to use JPEG for the color images (which of course the originals were, but the default PDF option is to use 8-bit ZIP). This made a huge difference and the PDF was only 3MB but still high resolution. I also tried the compression option “Average downsampling at 300 dpi” for color images, but that essentially took out all the resolution in the images, so as soon as you magnified the document at all, they were very pixelated (looked more like 72 dpi to me). Apparently just telling it to use JPEG with the original images is plenty.

Handling GeoCoding Errors From Yahoo Maps

One of the best features of Yahoo’s AJAX Maps API is its ability to geo-code full-text mailing addresses into lat/long on-the-fly, so you can say for instance “draw a map of 1300 Crittenden Lane, Mountain View, CA 94043“. (By now, Google and MSFT also offer geocoding, but Yahoo had it way earlier because they embraced JSON-P at a time when everyone else was still scratching their heads).

Yahoo does a pretty good job of geocoding addresses even if the format is a bit weird or some of the info is missing, but of course they can’t always figure it out, especially if the address is garbage to start with. Their API indicates (somewhat cryptically) that you can capture an event when geocoding completes, but they don’t tell you what data you get or how to deal with it. Since there doesn’t appear to be much discussion of this on the Internets, I thought I’d provide the answer here for the record.

When creating a YMap, you can register a callback function to get triggered every time geocoding completes, like so:

var map = new YMap( ... );
YEvent.Capture(map, EventsList.onEndGeoCode, myCallback);
...
function myCallback(resultObj) { ... }

Yahoo’s API claims that you can also pass an optional private context object, but as far as I can tell they never send it back to you. Of course you can always use a closure around your callback function to achieve the same thing.
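
For example, something like this does the trick (just a quick sketch; the makeGeoCodeCallback name and the context object are purely illustrative):

function makeGeoCodeCallback(context) {
  // 'context' is whatever private data you want available in the callback
  return function (resultObj) {
    // the closure captures 'context', so no help from Yahoo is needed
    console.log(context.label, resultObj.success);
  };
}

YEvent.Capture(map, EventsList.onEndGeoCode, makeGeoCodeCallback({ label: 'office map' }));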

Now for the part they don’t tell you: your callback is called with a single resultObj argument. You can figure out the contents of this argument by writing your callback to take an argument and then calling console.dir(resultObj) to print out its full nested structure in the unbelievably useful Firebug (Joe, you’re my hero!). Here’s what you’ll see:

var resultObj = {
  success: 1, /* 1 for success, 0 for failure */
  /* Original address you tried to geo-code */
  Address: "1300 Crittenden Lane Mountain View, CA 94043",
  GeoPoint: {
    /* This is a YGeoPoint, which also has a bunch of functions you can call */
    Lat: 37.424663,
    Long: -122.07248
  },
  ThisMap: { /* reference to the YMap */ }
};

So in your callback function you just test for resultObj.success, and if the geocoding failed, you can show an appropriate error message.

One trick I found for showing an error message is that you can embed a hidden div with an error message inside the map-holder div you pass to the YMap constructor, and YMap won’t get rid of it. If you use absolute positioning and give it a z-index, you can then show it when geocoding fails and get a nice “Map not available” right where the map would normally be.
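
Putting the two together, the whole setup looks roughly like this (a sketch only–the 'mapHolder' and 'mapError' ids are placeholders for whatever markup you use, and 'mapError' is the hidden, absolutely-positioned error div described above):

var map = new YMap(document.getElementById('mapHolder'));

YEvent.Capture(map, EventsList.onEndGeoCode, function (resultObj) {
  // show the overlay when geocoding fails, hide it when it succeeds
  var errorDiv = document.getElementById('mapError');
  errorDiv.style.display = resultObj.success ? 'none' : 'block';
});

// drawing the map with a full-text address triggers geocoding,
// which in turn fires onEndGeoCode when it completes
map.drawZoomAndCenter('1300 Crittenden Lane, Mountain View, CA 94043', 3);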

Here’s a working example of handling geocoding and showing an error message. Thanks Yahoo! for the great API, and hopefully some of this info will find its way into the next rev of your API docs. :)

PS: Special thanks to Mark Jen for finding me a decent code-writing plug-in for Windows Live Writer! Boy did I struggle with getting WordPress not to mangle my code in the eval post!

The origins of Lunch 2.0

In honor of the first officially-sanctioned Lunch 2.0 at Yahoo today, I thought I would finally write something on how and why we started this valley phenomenon:

Living in Silicon Valley is expensive, and the traffic on 101 sucks. So why not telecommute from, say, somewhere in the Midwest? What does living out here get you that working remotely doesn’t? Well, for one, all the other cool companies are out here. And, more importantly, the smart, innovative people behind those companies all live and work out here. But except for hiring employees, we rarely take advantage of that fact. We read about these companies in the blogs, and we use their products, and we’d probably all love to see how these companies and people live and work, but we don’t. Even though they’re like 5 minutes away from us, and they’re full of people just like us that would love to see how we live and work too!

And though many silicon valley companies are ostensibly at least somewhat in competition with one another, I think in most aspects we’re all kindred spirits fighting the same fight–trying to transform the world through technology and build a successful, functioning organization in the process. We all face the same issues: prioritizing features, hiring, nurturing a happy and productive work environment, dealing with growth, dealing with meetings and process (how much is too much? How little is too little?) and so on. Yet we rarely talk about these things, mainly because we’re all so busy trying to figure them out on our own. While traditional conferences may fill this need to some degree, they’re usually too big, too expensive, too impersonal, and too infrequent to appeal to most working people in the valley. But lunch is a perfect venue to get together, “talk shop”, and see how each other is set up. Everyone has to eat, it’s an informal setting, and it tends to be a manageable size. And silicon valley is such a small, closely connected world that we know people at all the companies we care about within a degree or two of separation.

So initially, we just started doing this ourselves, e.g. “hey, you know so-and-so at Yahoo, can we go meet him for lunch next week?” or “my friend has this new startup and they just got an office, let’s go see them”. We thought others would be interested to see what we had seen, so we took photos and posted them online (in the process, coining “lunch 2.0” since we needed a name for the site, and it felt like a web-2.0 approach to the problem of meeting people). We also blogged upcoming events, but mainly just as an alternative for managing a mailing list of our few friends that wanted to come to these events. As we told our friends what we were doing, more and more wanted to come too, so we just pointed them to the blog, not thinking much of it.

The “we” in this case was initially me, Mark Jen (yes, that Mark; he joined Plaxo right after leaving Google), Terry Chay from Plaxo (now at Tagged), and Terry’s friend Dave at Yahoo. Mark and I started having more lunches out at friends’ companies, and Terry said he and Dave had been trying to do the same, so we quickly joined forces. Terry now tells people he was the “VC of lunch 2.0” because he plunked down the five bucks or so for the lunch20.com domain name. 🙂

The first company to realize that officially hosting lunch 2.0 would be a good thing was SimplyHired in early March ’06. Previously, we all just went to lunch with friends at Yahoo!, Google, and so on, but no one from the company officially “hosted” it, and certainly no one paid for us to eat. But Kay Kuo at SimplyHired wanted to get the word out about her company, so they ordered a bunch of food, gave us a tour of their office, demoed their site, and even gave us some free t-shirts! The event was a huge success, both for SimplyHired and for the people that came. Soon after, other companies started offering to host their own lunch 2.0 events. Mainly this was because someone from that company had attended a previous lunch 2.0 event, gotten excited, and gone back to tell their company they should do the same. Early lunch 2.0 hosts were Meebo, Plaxo, AOL, JotSpot, and Zvents.

Another big milestone was in May ’06, when some people from Microsoft’s Silicon Valley Center got permission to host a lunch 2.0 event at their campus. This was definitely the most prominent company to host lunch 2.0 so far, and they did an amazing job, including paying for our lunch at the MSFT cafeteria, providing a tour of their six-building campus, and bringing a lot of their own engineers to the event. By this point, lunch 2.0 had picked up enough of its own momentum that our roles as stewards changed from mainly trying to find and convince new people to host events to just coordinating times and logistics for companies that came to us and wanted to host. That trend has continued thus far, and shows no signs of slowing yet.

Other important milestones in lunch 2.0 history:

  • When JotSpot hosted lunch 2.0, something like 45 people showed up. Previously the biggest event had around 20 people, so this was the first time we thought “whoa, this thing is really getting out there”.
  • Meebo hosted a lunch 2.0 early in the summer and invited all summer interns in the valley to come. They had about 6 employees at the time and were sub-leasing a small amount of office space from another startup. About 80 people showed up, completely filling the office and spilling out onto the street.
  • Zazzle hosted an outdoor BBQ at their office and attracted a record crowd of about 150 people. They also set up tables with umbrellas, a professional BBQ setup and buffet line, custom-printed posters and banners, and even custom-printed lunch 2.0 t-shirts for all attendees.
  • Jeremiah from Hitachi Data Systems organized a combination lunch 2.0 and “web expo” at their executive briefing center. There were about 300 attendees, and we picked 10 data-intensive startups to bring laptops and set up an informal web expo where they could demo their products and talk about how they dealt with large amounts of data.

Going forward, it’s great to see that some of these events have gotten so large, but we also want to make sure that smaller startups can host lunch 2.0 events without feeling like they have to handle a ton of people or spend a lot of money. There are still plenty of cool companies in the area that we’ve never been to yet, so we’re hoping to keep doing lunch 2.0 for the foreseeable future.

Returning to those initial observations about making the most of living in the valley, I think the best thing that’s come from lunch 2.0 is that we’ve met so many other great people in the area, seen how they work, and they’ve met us in return. I feel more connected to what we’re all doing here, and I feel that I’m taking better advantage of the time and space in which we’re all living.

Fixing eval() to use global scope in IE

[Note: This is the first in what I hope will become a series of technical articles on the lessons I’ve learned “from the trenches” of my web development work at Plaxo. Non-techy readers are invited to skip any articles categorized under “Web development”. :)]

Update: This article has been picked up by Ajaxian, and it’s sparked an interesting discussion there. 

At Plaxo I’ve been working on a new (soon to be released) version of Plaxo Online (our web-based address book, calendar, and more) that is very ambitious both technically and in terms of user experience. We’re currently deep into performance tuning and bug fixing, and we’ve already learned a lot of interesting things, most of which I hope to share on this blog. The first lesson is how to correctly eval() code in the global scope (e.g. so functions you define inside the eval’d code can be used outside).

When we built the first version of the new site, we combined all the JavaScript into one giant file as part of our deployment process. The total codebase was huge and it had the predictable effect that initial page-load time was terrible because the user’s CPU was solidly spiked for several seconds while the poor browser choked through the massive amount of code it had to parse. So we started loading a lot of our code on-demand (packaging it into several logical chunks of related files and using dojo’s package/loader system to pull in the code as needed).

All was well until we started defining global functions in the loaded JavaScript. (We did this mainly for event handler code so we didn’t have to spend time walking the DOM and finding all the clickable nodes after injecting innerHTML to hook them up to the right scoped functions.) In Firefox, everything kept working fine, but in IE, none of the global functions were callable outside of the module being loaded on-demand (you would get a typically cryptic IE error that in effect said those global functions weren’t defined). It seemed clear that when the code being loaded got eval’d, the functions weren’t making it into the global scope of the page in IE. What was unclear was how to fix this.

Here’s a simplified version of the situation we faced:

function loadMyFuncModule() {
  // imagine this was loaded via XHR/etc
  var code = 'function myFunc() { alert("myFunc"); }';
  return eval(code); // doesn't work in FF or IE
}

function runApp() {
  loadMyFuncModule(); // load extra code "on demand"
  myFunc(); // execute newly loaded code
}

The thing to note above is that just calling eval() doesn’t stick the code in global scope in either browser. Dojo’s loader code solves this in Firefox by creating a dj_global variable that points to the global scope and then calling eval on dj_global if possible:

function loadMyFuncModule() {
  // imagine this was loaded via XHR/etc
  var code = 'function myFunc() { alert("myFunc"); }';
  var dj_global = this; // global scope object
  return dj_global.eval ? dj_global.eval(code) : eval(code);
}

This works in Firefox but not in IE (eval is not an object method in IE). So what to do? The answer turns out to be that you can use a proprietary IE method window.execScript to eval code in the global scope (thanks to Ryan “Roger” Moore on our team for figuring this out). The only thing to note about execScript is that it does NOT return any value (unlike eval). However when we’re just loading code on-demand, we aren’t returning anything so this doesn’t matter.

The final working code looks like this:

function loadMyFuncModule() {
  // imagine this was loaded via XHR/etc
  var code = 'function myFunc() { alert("myFunc"); }';
  var dj_global = this; // global scope reference
  if (window.execScript) {
    window.execScript(code); // eval in global scope for IE
    return null; // execScript doesn't return anything
  }
  return dj_global.eval ? dj_global.eval(code) : eval(code);
}

function runApp() {
  loadMyFuncModule(); // load extra code "on demand"
  myFunc(); // execute newly loaded code
}

And once again all is well in the world. Hopefully this is the type of thing that will be hidden under the hood in future versions of dojo and similar frameworks, but for the time being it may well impact you if you’re loading code on demand. So may this article save you much time scratching your head and swearing at IE. 🙂

(PS: Having found the magic term execScript, I was then able to find some related articles on this topic by Dean Edwards and Jeff Watkins. However, many of the details are buried in the comments, so I hope this article will increase both the findability and conciseness of this information).

How to stay current with my blog

If you would like to find out when I post something new to my web site, here are three ways to do it (ranging from least work to most useful):

  1. Just check josephsmarr.com periodically. Newest stories are at the top, and you can use the calendar and category links on the right sidebar to see what I’ve written.
  2. Subscribe via e-mail. If you want to receive an e-mail for each post I write, enter your address in the “subscribe via e-mail” section on the right sidebar and click “subscribe”.

  3. Subscribe via Bloglines (or another feed reader). If you’re reading other blogs besides mine, you should really consider using a tool to aggregate them. Bloglines is my favorite–it’s web-based (no downloads) and it shows you how many articles are unread in each blog you read (just like an e-mail program shows you unread messages). I have a “Subscribe with Bloglines” button on my sidebar, or you should be able to enter josephsmarr.com into most feed readers to subscribe. If your feed reader requires the actual RSS URL, it’s http://josephsmarr.com/feed/.

If you’re not really a “blog reader”, I recommend subscribing via e-mail. It will essentially turn my blog into a mailing list you can subscribe to. And you can still click through to leave a comment when the fancy strikes you.

Guy Kawasaki likes my Plaxo widget!

This was a pleasant surprise to wake up to yesterday: Guy Kawasaki, the widely read startup guru whose top-ten lists of do’s and don’ts for entrepreneurs are gospel here in the valley, posted a new top-ten list called “The Top Ten Stupid Ways to Hinder Market Adoption”. Number 8 caught my attention in particular:

8. Requirement to re-type email addresses. How about the patent-pending, curve-jumping, VC-funded Web 2.0 company that wants you to share content but requires you to re-type the email addresses of your friends?

I have 7,703 email addresses in Entourage. I am not going to re-type them into the piece-of-shiitake, done-as-an-afterthought address book that companies build into their products. If nothing else, companies can use this cool tool from Plaxo or allow text imports into the aforementioned crappy address book. When do you suppose a standard format will emerge for transferring contacts?

Wow, Guy is telling startups that if they don’t use the widget I built for Plaxo, they’re stupidly hindering market adoption! He’s also selling it exactly the way I would–you already have an address book, so sites should let you use it rather than foisting their own half-baked version with none of your contacts on you. 🙂

On a personal note, Guy was one of the “industry thought leaders” I listened to at Stanford during the bubble who cemented my zeal for doing startups (ah, the euphoric pre-crash days of sitting in a packed Stanford auditorium on Friday afternoons for IE292 and listening to people like Mike McCue and Marc Andreessen explain convincingly why the tech sector was still vastly under-valued). Guy is a fun and charismatic speaker and his delivery is always very punchy.

What I remember most vividly from his talk (now nearly 7 years ago) was his “Stanford Shopping Mall Test” for picking VCs–if you saw the VC across the way in the (large, open-air) Stanford shopping mall, would you (a) run over to see him, (b) say hi if you happened to bump into him, or (c) get in your car and drive to another mall. You should only pick VCs for whom you answer (a). I don’t know if it’s true, but it sure sounded good! I’ve since read his Art of the Start, which contained similarly memorable advice, such as “flow with the go”, meaning if/when people adopt your technology for some purpose other than you originally envisioned, embrace the change instead of resisting it.

Given the impact that Guy’s had on me and most of my cohort here in the valley, it was certainly a trip to see him evangelize something I worked on. I went into work with a little extra bounce in my (admittedly already quite bouncy) step. 🙂

The paper that would not die

Sources of Success for Boosted Wrapper Induction
Journal of Machine Learning Research, Volume 5
Written October 2001, published December 2004

Download PDF (29 pages)

Download PPT (900KB; presentation at Stanford’s Seminar for Computational Learning and Adaptation)

I co-wrote this paper during the first summer I started doing NLP research, but it didn’t see the light of day until a year after I’d finished my Master’s degree. Yeesh!

It all started when I decided to spend the Summer of 2001 (between my junior and senior years at Stanford) doing research at UC San Diego with Charles Elkan. I’d met Charles through my dad on an earlier visit to UCSD, and his research exhibited exactly the mix of drive for deep understanding and desire to solve real-world problems that I was looking for. I was also working at the time for a startup called MedExpert that was using AI to help provide personalized medical advice. Since one of the major challenges was digesting the staggering volume of medical literature, MedExpert agreed to fund my summer research in information extraction. So I joined the UCSD AI lab for the summer and started working on tools for extracting information from text, a field that I would end up devoting most of my subsequent research to in one form or another.

As it happened, one of Charles’s PhD students, Dave Kauchak, was also working on information extraction, and he had recently gotten interested in a technique called Boosted Wrapper Induction. So Dave, Charles, and I ended up writing a lengthy paper that analyzed how BWI worked and how to improve it, including some specific work on medical literature using data from Mark Craven. By the end of the summer we had some interesting results, a well-written paper (or so I thought), and I was looking forward to being a published author.

Then the fun began. We submitted the paper for publication in an AI journal (it was too long to be a conference paper) and it got rejected, but with a request to re-submit it once we had made a list of changes. Many of the changes seemed to be more about style than substance, but we decided to make them anyway, and in the process we ran some additional experiments to shore up any perceived weaknesses (by this time I was back at Stanford and Dave was TAing classes, so re-running old research was not at the top of our wish list). Finally we submitted our revised paper to a new set of reviewers, who came back with a different set of issues they felt we had to fix first.

To make a long story short, we kept fiddling with it until finally, long after I had stopped personally working on this paper (and NLP altogether, for that matter), I got an e-mail from Dave saying the paper had finally been accepted, and would be published in the highly respected Journal of Machine Learning Research. It was hard to believe, but sure enough at the end of 2004–more than three years since we first wrote the paper–it finally saw the light of day. It was the paper that would not die.

Charles had long since published an earlier version of the paper as a technical report, so at least our work was able to have a bit more timely of an impact while it was “in the machine”. I’m glad it finally did get published, and I know that academic journals are notoriously slow, but given how fast the frontier of computer science and NLP is moving, waiting 3+ years to release a paper is almost comical. I can’t wait until this fall to get the new issue and find out what Dave did the following summer. :p

A nifty NLP paper that never made it

Conditional Estimation of HMMs for Information Extraction
Submitted to ACL 2003
Sapporo, Japan
July 2003

Download PDF (8 pages)

Download PPT (500KB; presentation to NLP group, including work discussed in this paper)

A conditionally-trained HMM in a toy domain

This is another paper I wrote that didn’t get accepted for publication. Like my character-level paper, it was interesting and useful but not well targeted to the mindset and appetite of the academic NLP community. Also like my other paper, the work here ended up helping us build our CoNLL named-entity recognition model, which performed quite well and became a well-cited paper. If for no other reason, this paper is worth looking at because it contains a number of neat diagrams and graphs (as well as some fancy math that I can barely comprehend any more, heh).

One reason why I think this paper failed to find acceptance is that it wasn’t trying to get a high-score in extraction accuracy. Rather it was trying to use smaller models and simpler data to gain a deeper understanding of what’s working well and what’s not. When you build a giant HMM and run it on 1000 pages of text, it does so-so and there’s not a lot you can learn about what went wrong. It’s way too complex and detailed to look at and grok what it did and didn’t learn. Our approach was to start with a highly restricted toy domain and minimal model so we could see exactly what was going on and test various hypotheses. We then scaled the models up slightly to show that the results held in the real world, but we never tried to beat the state-of-the-art numbers. Sadly, it’s a lot harder to get a paper published when your final numbers aren’t competitive, even if the paper contributes some useful knowledge in the process.

It seems both odd and unfortunate to me that academic NLP, which is supposedly doing basic scientific research for the long-term interest, is culturally focused more on engineering and tweaking systems that can boost the numbers by a few percent than on really trying to understand what’s going on under the covers. After all, most of these systems aren’t close to human-level performance, and the current generation of technology is unlikely to get us there, so just doing a little better is a bit like climbing a tree to get to the moon (to quote Hubert Dreyfus, who famously said as much about the field of AI in general).

If companies are trying to use AI in the real world, their interest is performance first, understanding second (make it work). But in academia, it should be just the opposite–careful study of techniques and investigation of hypotheses with the aim of making breakthroughs in understanding today that will lead to high-performance systems in the future. But I guess the reality is that it’s much easier (in any discipline) to pick a metric and compete for the high score. (The race for a 3.6GHz processor to out-do the 3.5GHz competition in consumer desktop computers comes to mind, when both computers are severely bottlenecked on disk-IO and memory size and rarely stress the CPU in either case. Ok, that was either a lucid metaphor or complete gibberish, depending on who you are. :))

In any event, I enjoyed doing this research, and I’m proud of the paper we wrote.

Information Extraction for the Semantic Web

Finding Educational Resources on the Web: Exploiting Automatic Extraction of Metadata
Workshop on Adaptive Text Extraction and Mining
Cavtat-Dubrovnik, Croatia
September 22, 2003

Download PDF (4 pages)

The Semantic Web is a great idea: expose all of the information on the web in a machine-readable format, and intelligent agents will then be able to read it and act on your behalf (“Computer: When can I fly to San Diego? Where can I stay that has a hot tub? Ok, book it and add it to my calendar”). There’s just one problem: the humans writing web pages are writing them for other humans, and no one is labeling them for computers. (A notable exception is blogs, like this one, whose authoring tools also generate machine-readable versions in RSS or Atom that can be consumed by sites like Bloglines. In a way, Bloglines is one of the few sites making good on the vision of the Semantic Web.)

What do people do when they’re looking for a piece of information, say a list of syllabi for NLP classes? There’s no database that lists that type of information in a structured and curated form. Rather, there are a set of web pages that describe these classes, and they’re all a bit different. But most of them contain similar information–the title of the class, the course number, the professor, and so on. So, in a way, these pages do constitute a database of information, it just takes more work to access it.

That’s where NLP comes in. One of the ways we were using information extraction in the Stanford NLP group was to automatically extract structured information from web pages and represent it in a semantic web format like DAML+OIL or RDF. The idea is that you send your intelligent agent out to the web (“find me a bunch of NLP classes”) and when it comes across a page that looks promising, it first looks for semantic web markup. If it can’t find any (which will usually be the case for the foreseeable future), it tries running the information extraction engine on the site to pull out the relevant data anyway. If the site allows it, it could then write that data back in a machine-readable format so the web becomes semantically richer the more agents go looking for information.

Specifically, we built a plug-in to the Protege tool developed by the Stanford Medical Informatics group. Protege is a Java-based tool for creating and managing ontologies (a form of knowledge representation used by the semantic web). Our plug-in let you load a web page, run our information extraction tools on it, and add the extracted semantic information to your ontology. You could build up a collection of general-purpose information extraction tools (either hand-built or trained from data) and then use them as you found web pages you wanted to annotate.

Cynthia Thompson, a visiting professor for the year, used this system to find and extract information about educational materials on the web as part of the Edutella project. It ended up working well, and this paper was accepted to the Workshop on Adaptive Text Extraction and Mining as part of the annual European Conference on Machine Learning (ECML). I declined the offer to go to Croatia for the conference (though I’m sure it would have been a memorable experience), but I’m glad that my work contributed to this project.

My most famous NLP paper (CoNLL-03)

Named Entity Recognition with Character-Level Models
HLT-NAACL CoNLL-03 Shared Task
Edmonton, Canada
June 1, 2003

Download PDF (4 pages)

Download PPT (3.8MB; presentation at CoNLL-03)

Every year the Conference on Computational Natural Language Learning (CoNLL) has a “shared task” where they define a specific problem to solve, provide a standard data set to train your models on, and then host a competition for researchers to see who can get the best score. In 2003 the shared task was named-entity recognition (labeling person, place, and organization names in free text) with the twist that they were going to run the final models on a foreign language that wouldn’t be disclosed until the day of the competition. This meant that your model had to be flexible enough to learn from training data in a language it had never seen before (and thus you couldn’t hard-code English rules like “CEO of X” -> “X is an organization”).

Even though my first paper on character-level models got rejected, we kept working on it in the Stanford NLP group because we knew we were on to something. Since one of the major strengths of the model was its ability to distinguish different types of proper names based on their composition (i.e. it recognized that people’s names and company names usually look different), this seemed like an ideal task in which it could shine (see my master’s thesis for more on this work). By this time, I’d started working with Dan Klein, and he was able to take my model to the next level by combining it with a discriminatively trained maximum-entropy sequence model that allowed us to try lots of different character-level features without worrying about violating independence assumptions (a common problem with generative models like my original version). Dan’s also just brilliant and relentless when it comes to analyzing the errors a model is making and then iteratively refining it to perform better and better. The final piece of the puzzle came from my HMM work with Huy Nguyen, which let us combine segmentation (finding the boundaries of proper names in text) and classification (figuring out which type of proper name it is) into a single model.

Our paper was accepted (yay!) and Dan and I flew to Canada to present our work. This was my first NLP conference and it was awesome to meet all these famous researchers whom I’d previously read and learned from. Luckily for me, Dan was just about to finish his PhD, and he was actively being courted by the top NLP programs, so by sticking with him I quickly met most of the important people in the field. Statistical NLP attracts a fascinating mix of people with strong math backgrounds, interest in language, and a passion for empirical (data-driven) research, so this was an amazing group of people to interact with.

On the last day of the conference (CoNLL was held inside HLT-NAACL, which were two larger NLP conferences that had also merged), the big day had come at last. My first presentation as an NLP researcher (Dan let me give the talk on behalf of our team), and the announcement of the competition results. There were 16 entries in the competition. In English (the language we had been given ahead of time), our model got the 3rd highest score; in German (the secret language), our model came in 2nd, though the difference between our model and the one in 1st place was not statistically significant. In other words, had the test data been slightly different, we might easily have had the highest score.

Doing so well was certainly gratifying, but what made us even happier was the fact that our model was far simpler and purer than most in the competition. For instance, the model that got first place in both languages was itself a combination of four separate classifiers, and in addition to the training data provided by the conference, it also used a large external list of known person, place, and organization names (called a gazetteer). While piling so much on certainly helped eke out a slightly higher score, it also makes it harder to learn any general results about what pieces contributed and how that might be applied in the future.

In contrast, our model was almost exclusively a demonstration of the valuable information contained in character-level features. Despite leaving out many of the bells-and-whistles used by other systems, our model performed well because we gave it good features and let it combine them well. As a wise man once said, “let the data do the talking”. Perhaps because of the simplicity of our model and its novel use of character features, our paper has been widely cited, and is certainly the most recognized piece of research I did while at Stanford. It makes me smile because the core of the work never got accepted for publication, but it managed to live on and make an impact regardless.

