Thoughts on web development, tech, and life.

Author: jsmarr

The origins of Lunch 2.0

In honor of the first officially-sanctioned Lunch 2.0 at Yahoo today, I thought I would finally write something on how and why we started this valley phenomenon:

Living in Silicon Valley is expensive, and the traffic on 101 sucks. So why not telecommute from, say, somewhere in the Midwest? What does living out here get you that working remotely doesn’t? Well, for one, all the other cool companies are out here. And, more importantly, the smart, innovative people behind those companies all live and work out here. But except for hiring employees, we rarely take advantage of that fact. We read about these companies in the blogs, and we use their products, and we’d probably all love to see how these companies and people live and work, but we don’t. Even though they’re like 5 minutes away from us, and they’re full of people just like us that would love to see how we live and work too!

And though many silicon valley companies are ostensibly at least somewhat in competition with one another, I think in most aspects we’re all kindred spirits fighting the same fight: trying to transform the world through technology and build a successful, functioning organization in the process. We all face the same issues: prioritizing features, hiring, nurturing a happy and productive work environment, dealing with growth, dealing with meetings and process (how much is too much? How little is too little?) and so on. Yet we rarely talk about these things, mainly because we’re all so busy trying to figure them out on our own. While traditional conferences may fill this need to some degree, they’re usually too big, too expensive, too impersonal, and too infrequent to appeal to most working people in the valley. But lunch is a perfect venue to get together, “talk shop”, and see how everyone else is set up. Everyone has to eat, it’s an informal setting, and it tends to be a manageable size. And silicon valley is such a small, closely connected world, that we know people at all the companies we care about within a degree or two of separation.

So initially, we just started doing this ourselves, e.g. “hey, you know so-and-so at Yahoo, can we go meet him for lunch next week?” or “my friend has this new startup and they just got an office, let’s go see them”. We thought others would be interested to see what we had seen, so we took photos and posted them online (in the process, coining “lunch 2.0” since we needed a name for the site, and it felt like a web-2.0 approach to the problem of meeting people). We also blogged upcoming events, but mainly just as an alternative for managing a mailing list of our few friends that wanted to come to these events. As we told our friends what we were doing, more and more wanted to come too, so we just pointed them to the blog, not thinking much of it.

The “we” in this case was initially me, Mark Jen (yes, that Mark; he joined Plaxo right after leaving Google), and Terry Chay from Plaxo (now at Tagged), and Terry’s friend Dave at Yahoo. Mark and I started having more lunches out at friends’ companies and Terry said he and Dave had been trying to do the same, so we quickly joined forces. Terry now tells people he was the “VC of lunch 2.0” because he plunked down the 5-bucks or so for the lunch20.com domain name. 🙂

The first company to realize that officially hosting lunch 2.0 would be a good thing was SimplyHired in early March ’06. Previously, we all just went to lunch with friends at Yahoo!, Google, and so on, but no one from the company officially “hosted” it, and certainly no one paid for us to eat. But Kay Kuo at SimplyHired wanted to get the word out about her company, so they ordered a bunch of food, gave us a tour of their office, demoed their site, and even gave us some free t-shirts! The event was a huge success, both for SimplyHired and for the people that came. Soon after, other companies started offering to host their own lunch 2.0 events. Mainly this was because someone from that company had attended a previous lunch 2.0 event, gotten excited, and gone back to tell their company they should do the same. Early lunch 2.0 hosts were Meebo, Plaxo, AOL, JotSpot, and Zvents.

Another big milestone was in May 06, when some people from Microsoft’s Silicon Valley Center got permission to host a lunch 2.0 event at their campus. This was definitely the most prominent company to host lunch 2.0 so far, and they did an amazing job, including paying for our lunch at the MSFT cafeteria, providing a tour of their 6-bldg campus, and bringing a lot of their own engineers to the event. By this point, lunch 2.0 had picked up enough of its own momentum that our roles as stewards changed from mainly trying to find and convince new people to host events to just coordinating times and logistics for companies that came to us and wanted to host. That trend has continued thus far, and shows no signs of slowing yet.

Other important milestones in lunch 2.0 history:

  • When JotSpot hosted lunch 2.0, something like 45 people showed up. Previously the biggest event had around 20 people, so this was the first time we thought “whoa, this thing is really getting out there”.
  • Meebo hosted a lunch 2.0 early in the summer and invited all summer interns in the valley to come. They had about 6 employees at the time and were sub-leasing a small amount of office space from another startup. About 80 people showed up, completely filling the office and spilling out onto the street.
  • Zazzle hosted an outdoor BBQ at their office and attracted a record crowd of about 150 people. They also set up tables with umbrellas, a professional BBQ setup and buffet line, custom-printed posters and banners, and even custom-printed lunch 2.0 t-shirts for all attendees.
  • Jeremiah from Hitachi Data Systems organized a combination lunch 2.0 and “web expo” at their executive briefing center. There were about 300 attendees, and we picked 10 data-intensive startups to bring laptops and set up an informal web expo where they could demo their products and talk about how they dealt with large amounts of data.

Going forward, it’s great to see that some of these events have gotten so large, but we also want to make sure that smaller startups can host lunch 2.0 events without feeling like they have to handle a ton of people or spend a lot of money. There are still plenty of cool companies in the area that we’ve never been to yet, so we’re hoping to keep doing lunch 2.0 for the foreseeable future.

Returning to those initial observations about making the most of living in the valley, I think the best thing that’s come from lunch 2.0 is that we’ve met so many other great people in the area, seen how they work, and they’ve met us in return. I feel more connected to what we’re all doing here, and I feel that I’m taking better advantage of the time and space in which we’re all living.

Fixing eval() to use global scope in IE

[Note: This is the first in what I hope will become a series of technical articles on the lessons I’ve learned “from the trenches” of my web development work at Plaxo. Non-techy readers are invited to skip any articles categorized under “Web development”. :)]

Update: This article has been picked up by Ajaxian, and it’s sparked an interesting discussion there. 

At Plaxo I’ve been working on a new (soon to be released) version of Plaxo Online (our web-based address book, calendar, and more) that is very ambitious both technically and in terms of user experience. We’re currently deep into performance tuning and bug fixing, and we’ve already learned a lot of interesting things, most of which I hope to share on this blog. The first lesson is how to correctly eval() code in the global scope (e.g. so functions you define inside the eval’d code can be used outside).

When we built the first version of the new site, we combined all the JavaScript into one giant file as part of our deployment process. The total codebase was huge and it had the predictable effect that initial page-load time was terrible because the user’s CPU was solidly spiked for several seconds while the poor browser choked through the massive amount of code it had to parse. So we started loading a lot of our code on-demand (packaging it into several logical chunks of related files and using dojo’s package/loader system to pull in the code as needed).

All was well until we started defining global functions in the loaded JavaScript. (We did this mainly for event handler code so we didn’t have to spend time walking the DOM and finding all the clickable nodes after injecting innerHTML to hook them up to the right scoped functions.) In Firefox, everything kept working fine, but in IE, none of the global functions were callable outside of the module being loaded on-demand (you would get a typically cryptic IE error that in effect said those global functions weren’t defined). It seemed clear that when the code being loaded got eval’d, the functions weren’t making it into the global scope of the page in IE. What was unclear was how to fix this.

Here’s a simplified version of the situation we faced:

function loadMyFuncModule() {
  // imagine this was loaded via XHR/etc
  var code = 'function myFunc() { alert("myFunc"); }';
  return eval(code); // doesn't work in FF or IE
}

function runApp() {
  loadMyFuncModule(); // load extra code "on demand"
  myFunc(); // execute newly loaded code
}

The thing to note above is that just calling eval() doesn’t stick the code in global scope in either browser. Dojo’s loader code solves this in Firefox by creating a dj_global variable that points to the global scope and then calling eval on dj_global if possible:

function loadMyFuncModule() {
  // imagine this was loaded via XHR/etc
  var code = 'function myFunc() { alert("myFunc"); }';
  var dj_global = this; // global scope object
  return dj_global.eval ? dj_global.eval(code) : eval(code);
}

This works in Firefox but not in IE (eval is not an object method in IE). So what to do? The answer turns out to be that you can use a proprietary IE method window.execScript to eval code in the global scope (thanks to Ryan “Roger” Moore on our team for figuring this out). The only thing to note about execScript is that it does NOT return any value (unlike eval). However when we’re just loading code on-demand, we aren’t returning anything so this doesn’t matter.

The final working code looks like this:

function loadMyFuncModule() {
  // imagine this was loaded via XHR/etc
  var code = 'function myFunc() { alert("myFunc"); }';
  var dj_global = this; // global scope reference
  if (window.execScript) {
    window.execScript(code); // eval in global scope for IE
    return null; // execScript doesn't return anything
  }
  return dj_global.eval ? dj_global.eval(code) : eval(code);
}

function runApp() {
  loadMyFuncModule(); // load extra code "on demand"
  myFunc(); // execute newly loaded code
}

And once again all is well in the world. Hopefully this is the type of thing that will be hidden under the hood in future versions of dojo and similar frameworks, but for the time being it may well impact you if you’re loading code on demand. So may this article save you much time scratching your head and swearing at IE. 🙂

(PS: Having found the magic term execScript, I was then able to find some related articles on this topic by Dean Edwards and Jeff Watkins. However much of the details are buried in the comments, so I hope this article will increase both the findability and conciseness of this information).
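The branching in the final listing can also be folded into one reusable helper. What follows is only a sketch under a few assumptions: globalEval is a hypothetical name (not a dojo function), the module code returns a string instead of calling alert() so it can run outside a browser, and the fallback relies on "indirect eval", which engines following ES5 or later evaluate in the global scope. Older engines may behave differently, so treat this as an illustration rather than a drop-in replacement.

```javascript
// Hypothetical helper: evaluate a string of code in the global scope.
function globalEval(code) {
  if (typeof window !== 'undefined' && window.execScript) {
    window.execScript(code); // old IE: runs in global scope, returns nothing
    return null;
  }
  // Indirect eval: calling eval through another reference (not the plain
  // identifier `eval`) makes ES5+ engines evaluate in the global scope.
  var indirectEval = eval;
  return indirectEval(code);
}

function loadMyFuncModule() {
  // imagine this was loaded via XHR/etc
  var code = 'function myFunc() { return "myFunc"; }';
  return globalEval(code);
}

function runApp() {
  loadMyFuncModule(); // load extra code "on demand"
  return myFunc();    // myFunc is now a global function
}

var result = runApp();
```

The typeof check on window keeps the helper from throwing in environments that have no window object at all.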

How to stay current with my blog

If you would like to find out when I post something new to my web site, here are three ways to do it (ranging from least work to most useful):

  1. Just check josephsmarr.com periodically. Newest stories are at the top, and you can use the calendar and category links on the right sidebar to see what I’ve written.
  2. Subscribe via e-mail. If you want to receive an e-mail for each post I write, enter your address in the “subscribe via e-mail” section on the right sidebar and click “subscribe”.
  3. Subscribe via Bloglines (or another feed reader). If you’re reading other blogs besides mine, you should really consider using a tool to aggregate them. Bloglines is my favorite–it’s web-based (no downloads) and it shows you how many articles are unread in each blog you read (just like an e-mail program shows you unread messages). I have a “Subscribe with Bloglines” button on my sidebar, or you should be able to enter josephsmarr.com into most feed readers to subscribe. If your feed reader requires the actual RSS URL, it’s http://josephsmarr.com/feed/.

If you’re not really a “blog reader”, I recommend subscribing via e-mail. It will essentially turn my blog into a mailing list you can subscribe to. And you can still click through to leave a comment when the fancy strikes you.

Guy Kawasaki likes my Plaxo widget!

This was a pleasant surprise to wake up to yesterday: Guy Kawasaki, the widely read startup guru whose top-ten lists of do’s and don’ts for entrepreneurs are gospel here in the valley, posted a new top-ten list called “The Top Ten Stupid Ways to Hinder Market Adoption”. Number 8 caught my attention in particular:

8. Requirement to re-type email addresses. How about the patent-pending, curve-jumping, VC-funded Web 2.0 company that wants to you to share content but requires you to re-type the email addresses of your friends?

I have 7,703 email addresses in Entourage. I am not going to re-type them into the piece-of-shiitake, done-as-an-afterthought address book that companies build into their products. If nothing else, companies can use this cool tool from Plaxo or allow text imports into the aforementioned crappy address book. When do you suppose a standard format will emerge for transferring contacts?

Wow, Guy is telling startups that if they don’t use the widget I built for Plaxo, they’re stupidly hindering market adoption! He’s also selling it exactly the way I would–you already have an address book, so sites should let you use it rather than foisting their own half-baked version with none of your contacts on you. 🙂

On a personal note, Guy was one of the “industry thought leaders” I listened to at Stanford during the bubble who cemented my zeal for doing startups (ah, the euphoric pre-crash days of sitting in a packed Stanford auditorium on Friday afternoons for IE292 and listening to people like Mike McCue and Marc Andreessen explain convincingly why the tech sector was still vastly under-valued). Guy is a fun and charismatic speaker and his delivery is always very punchy.

What I remember most vividly from his talk (now nearly 7 years ago) was his “Stanford Shopping Mall Test” for picking VCs–if you saw the VC across the way in the (large, open-air) Stanford shopping mall, would you (a) run over to see him, (b) say hi if you happened to bump into him, or (c) get in your car and drive to another mall. You should only pick VCs for whom you answer (a). I don’t know if it’s true, but it sure sounded good! I’ve since read his Art of the Start, which contained similarly memorable advice, such as “flow with the go”, meaning if/when people adopt your technology for some purpose other than you originally envisioned, embrace the change instead of resisting it.

Given the impact that Guy’s had on me and most of my cohort here in the valley, it was certainly a trip to see him evangelize something I worked on. I went into work with a little extra bounce in my (admittedly already quite bouncy) step. 🙂

The paper that would not die

Sources of Success for Boosted Wrapper Induction
Journal of Machine Learning Research, Volume 5
Written October 2001, published December 2004

Download PDF (29 pages)

Download PPT (900KB; presentation at Stanford’s Seminar for Computational Learning and Adaptation)

I co-wrote this paper during the first summer I started doing NLP research, but it didn’t see the light of day until a year after I’d finished my Master’s degree. Yeesh!

It all started when I decided to spend the Summer of 2001 (between my junior and senior years at Stanford) doing research at UC San Diego with Charles Elkan. I’d met Charles through my dad on an earlier visit to UCSD, and his research exhibited exactly the mix of drive for deep understanding and desire to solve real-world problems that I was looking for. I was also working at the time for a startup called MedExpert that was using AI to help provide personalized medical advice. Since one of the major challenges was digesting the staggering volume of medical literature, MedExpert agreed to fund my summer research in information extraction. So I joined the UCSD AI lab for the summer and started working on tools for extracting information from text, a field that I would end up devoting most of my subsequent research to in one form or another.

As it happened, one of Charles’s PhD students, Dave Kauchak, was also working on information extraction, and he had recently gotten interested in a technique called Boosted Wrapper Induction. So Dave, Charles, and I ended up writing a lengthy paper that analyzed how BWI worked and how to improve it, including some specific work on medical literature using data from Mark Craven. By the end of the summer we had some interesting results, a well-written paper (or so I thought), and I was looking forward to being a published author.

Then the fun began. We submitted the paper for publication in an AI journal (it was too long to be a conference paper) and it got rejected, but with a request to re-submit it once we had made a list of changes. Many of the changes seemed to be more about style than substance, but we decided to make them anyway, and in the process we ran some additional experiments to shore up any perceived weaknesses (by this time I was back at Stanford and Dave was TAing classes, so re-running old research was not at the top of our wish list). Finally we submitted our revised paper to a new set of reviewers, who came back with a different set of issues they felt we had to fix first.

To make a long story short, we kept fiddling with it until finally, long after I had stopped personally working on this paper (and NLP altogether, for that matter), I got an e-mail from Dave saying the paper had finally been accepted, and would be published in the highly respected Journal of Machine Learning Research. It was hard to believe, but sure enough at the end of 2004–more than three years since we first wrote the paper–it finally saw the light of day. It was the paper that would not die.

Charles had long since published an earlier version of the paper as a technical report, so at least our work was able to have a more timely impact while it was “in the machine”. I’m glad it finally did get published, and I know that academic journals are notoriously slow, but given how fast the frontier of computer science and NLP is moving, waiting 3+ years to release a paper is almost comical. I can’t wait until this fall to get the new issue and find out what Dave did the following summer. :p

A nifty NLP paper that never made it

Conditional Estimation of HMMs for Information Extraction
Submitted to ACL 2003
Sapporo, Japan
July 2003

Download PDF (8 pages)

Download PPT (500KB; presentation to NLP group, including work discussed in this paper)

Figure: A conditionally-trained HMM in a toy domain

This is another paper I wrote that didn’t get accepted for publication. Like my character-level paper, it was interesting and useful but not well targeted to the mindset and appetite of the academic NLP community. Also like my other paper, the work here ended up helping us build our CoNLL named-entity recognition model, which performed quite well and became a well-cited paper. If for no other reason, this paper is worth looking at because it contains a number of neat diagrams and graphs (as well as some fancy math that I can barely comprehend any more, heh).

One reason why I think this paper failed to find acceptance is that it wasn’t trying to get a high-score in extraction accuracy. Rather it was trying to use smaller models and simpler data to gain a deeper understanding of what’s working well and what’s not. When you build a giant HMM and run it on 1000 pages of text, it does so-so and there’s not a lot you can learn about what went wrong. It’s way too complex and detailed to look at and grok what it did and didn’t learn. Our approach was to start with a highly restricted toy domain and minimal model so we could see exactly what was going on and test various hypotheses. We then scaled the models up slightly to show that the results held in the real world, but we never tried to beat the state-of-the-art numbers. Sadly, it’s a lot harder to get a paper published when your final numbers aren’t competitive, even if the paper contributes some useful knowledge in the process.

It seems both odd and unfortunate to me that academic NLP, which is supposedly doing basic scientific research for the long-term interest, is culturally focused more on engineering and tweaking systems that can boost the numbers by a few percent than on really trying to understand what’s going on under the covers. After all, most of these systems aren’t close to human-level performance, and the current generation of technology is unlikely to get us there, so just doing a little better is a bit like climbing a tree to get to the moon (to quote Hubert Dreyfus, who famously said as much about the field of AI in general).

If companies are trying to use AI in the real world, their interest is performance first, understanding second (make it work). But in academia, it should be just the opposite–careful study of techniques and investigation of hypotheses with the aim of making breakthroughs in understanding today that will lead to high-performance systems in the future. But I guess the reality is that it’s much easier (in any discipline) to pick a metric and compete for the high score. (The race for a 3.6GHz processor to out-do the 3.5GHz competition in consumer desktop computers comes to mind, when both computers are severely bottlenecked on disk-IO and memory size and rarely stress the CPU in either case. Ok, that was either a lucid metaphor or complete gibberish, depending on who you are. :))

In any event, I enjoyed doing this research, and I’m proud of the paper we wrote.

Information Extraction for the Semantic Web

Finding Educational Resources on the Web: Exploiting Automatic Extraction of Metadata
Workshop on Adaptive Text Extraction and Mining
Cavtat-Dubrovnik, Croatia
September 22, 2003

Download PDF (4 pages)

The Semantic Web is a great idea: expose all of the information on the web in a machine-readable format, and intelligent agents will then be able to read it and act on your behalf (“Computer: When can I fly to San Diego? Where can I stay that has a hot tub? Ok, book it and add it to my calendar”). There’s just one problem: the humans writing web pages are writing them for other humans, and no one is labeling them for computers. (A notable exception is blogs, like this one, whose authoring tools also generate machine-readable versions in RSS or Atom that can be consumed by sites like Bloglines. In a way, Bloglines is one of the few sites making good on the vision of the Semantic Web.)

What do people do when they’re looking for a piece of information, say a list of syllabi for NLP classes? There’s no database that lists that type of information in a structured and curated form. Rather, there is a set of web pages that describe these classes, and they’re all a bit different. But most of them contain similar information–the title of the class, the course number, the professor, and so on. So, in a way, these pages do constitute a database of information; it just takes more work to access it.

That’s where NLP comes in. One of the ways we were using information extraction in the Stanford NLP group was to automatically extract structured information from web pages and represent it in a semantic web format like DAML+OIL and RDF. The idea is that you send your intelligent agent out to the web (“find me a bunch of NLP classes”) and when it comes across a page that looks promising, it first looks for semantic web markup. If it can’t find any (which will usually be the case for the foreseeable future), it tries running the information extraction engine on the site to pull out the relevant data anyway. If the site allows it, it could then write that data back in a machine-readable format so the web becomes semantically richer the more agents go looking for information.

Specifically, we built a plug-in to the Protege tool developed by the Stanford Medical Informatics group. Protege is a Java-based tool for creating and managing ontologies (a form of knowledge representation used by the semantic web). Our plug-in let you load a web page, run our information extraction tools on it, and add the extracted semantic information to your ontology. You could build up a collection of general-purpose information extraction tools (either hand-built or trained from data) and then use them as you found web pages you wanted to annotate.

Cynthia Thompson, a visiting professor for the year, used this system to find and extract information about educational materials on the web as part of the Edutella project. It ended up working well, and this paper was accepted to the Workshop on Adaptive Text Extraction and Mining as part of the annual European Conference on Machine Learning (ECML). I declined the offer to go to Croatia for the conference (though I’m sure it would have been a memorable experience), but I’m glad that my work contributed to this project.

My most famous NLP paper (CoNLL-03)

Named Entity Recognition with Character-Level Models
HLT-NAACL CoNLL-03 Shared Task
Edmonton, Canada
June 1, 2003

Download PDF (4 pages)

Download PPT (3.8MB; presentation at CoNLL-03)

Every year the Conference on Computational Natural Language Learning (CoNLL) has a “shared task” where they define a specific problem to solve, provide a standard data set to train your models on, and then host a competition for researchers to see who can get the best score. In 2003 the shared task was named-entity recognition (labeling person, place, and organization names in free text) with the twist that they were going to run the final models on a foreign language that wouldn’t be disclosed until the day of the competition. This meant that your model had to be flexible enough to learn from training data in a language it had never seen before (and thus you couldn’t hard-code English rules like “CEO of X” -> “X is an organization”).

Even though my first paper on character-level models got rejected, we kept working on it in the Stanford NLP group because we knew we were on to something. Since one of the major strengths of the model was its ability to distinguish different types of proper names based on their composition (i.e. it recognized that people’s names and company names usually look different), this seemed like an ideal task in which it could shine (see my master’s thesis for more on this work). By this time, I’d started working with Dan Klein, and he was able to take my model to the next level by combining it with a discriminatively trained maximum-entropy sequence model that allowed us to try lots of different character-level features without worrying about violating independence assumptions (a common problem with generative models like my original version). Dan’s also just brilliant and relentless when it comes to analyzing the errors a model is making and then iteratively refining it to perform better and better. The final piece of the puzzle came from my HMM work with Huy Nguyen, which let us combine segmentation (finding the boundaries of proper names in text) and classification (figuring out which type of proper name it is) into a single model.
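To give a rough feel for what “character-level features” means, here is a small sketch that pulls character n-grams out of a name. It is only an illustration, not the feature set from the paper: the function name charNGrams, the ^ and $ boundary markers, and the choice of n-gram sizes are all assumptions made for this example.

```javascript
// Hypothetical helper: list the character n-grams of a string, for every
// n between minN and maxN, with markers for the start and end of the string.
function charNGrams(text, minN, maxN) {
  var padded = '^' + text + '$'; // boundary markers make prefixes/suffixes distinct
  var grams = [];
  for (var n = minN; n <= maxN; n++) {
    for (var i = 0; i + n <= padded.length; i++) {
      grams.push(padded.slice(i, i + n));
    }
  }
  return grams;
}

var features = charNGrams('Inc', 2, 3);
// features is ['^I', 'In', 'nc', 'c$', '^In', 'Inc', 'nc$']
```

A classifier that sees substrings like these can pick up on the fact that a token ending in “nc” looks more like part of a company name than a person’s name, without needing any language-specific rules.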

Our paper was accepted (yay!) and Dan and I flew to Canada to present our work. This was my first NLP conference and it was awesome to meet all these famous researchers whom I’d previously read and learned from. Luckily for me, Dan was just about to finish his PhD, and he was actively being courted by the top NLP programs, so by sticking with him I quickly met most of the important people in the field. Statistical NLP attracts a fascinating mix of people with strong math backgrounds, interest in language, and a passion for empirical (data-driven) research, so this was an amazing group of people to interact with.

On the last day of the conference (CoNLL was held inside HLT-NAACL, which were two larger NLP conferences that had also merged), the big day had come at last. My first presentation as an NLP researcher (Dan let me give the talk on behalf of our team), and the announcement of the competition results. There were 16 entries in the competition. In English (the language we had been given ahead of time), our model got the 3rd highest score; in German (the secret language), our model came in 2nd, though the difference between our model and the one in 1st place was not statistically significant. In other words, had the test data been slightly different, we might easily have had the highest score.

Doing so well was certainly gratifying, but what made us even happier was the fact that our model was far simpler and purer than most in the competition. For instance, the model that got first place in both languages was itself a combination of four separate classifiers, and in addition to the training data provided by the conference, it also used a large external list of known person, place, and organization names (called a gazetteer). While piling so much on certainly helped eke out a slightly higher score, it also makes it harder to learn any general results about what pieces contributed and how that might be applied in the future.

In contrast, our model was almost exclusively a demonstration of the valuable information contained in character-level features. Despite leaving out many of the bells-and-whistles used by other systems, our model performed well because we gave it good features and let it combine them well. As a wise man once said, “let the data do the talking”. Perhaps because of the simplicity of our model and its novel use of character features, our paper has been widely cited, and is certainly the most recognized piece of research I did while at Stanford. It makes me smile because the core of the work never got accepted for publication, but it managed to live on and make an impact regardless.

My first NLP research paper

Classifying Unknown Proper Noun Phrases Without Context
Technical Report dbpubs/2002-46
Stanford University
April 9, 2002

Download PDF (9 pages)

Download PPT (1.3MB; presentation of the paper to the NLP group)

As I describe in my post about my master’s thesis, I started doing research in Natural Language Processing after Chris Manning, the professor that taught my NLP class at Stanford, asked me to further develop the work I did for my class project. He helped me clean up my model, suggested some improvements, and taught me the official way to write and style a professional academic paper (I narrowly avoided having to write it in LaTeX!). I was proud of the final paper, but it wasn’t accepted (I believe we submitted it to EMNLP 02).

This was the start of a series of lessons I learned at Stanford about the difference between what I personally found interesting (and how I wanted to explain it) and what the academic establishment (that decides what papers are published by peer review) thought the rules and conventions had to be for “serious academic work”. While I got better at “playing the game” during my time at Stanford–and to be fair, some of it was actually good and helpful in terms of how to be precise, avoid overstating results, and so on–I still feel that the academic community has lost sight of their original aspirations in some important ways.

At its best, academic research embarks on grand challenges that will take many years to accomplish but whose results will change society in profound ways. It’s a long-term investment for a long-term gain. NLP has no shortage of these lofty goals, including the ability to carry on a natural conversation with your computer, high-quality machine translation of text in foreign languages, the ability to automatically summarize large quantities of text, and so on. But in practice I have found that in most of these areas, the sub-community that is ostensibly working on one of these problems has actually constructed its own version of the problem, along with its own notions of what’s important and what isn’t, that doesn’t always ground out in the real world at the end of the day. This limits progress when work that could contribute to the original goal is not seen as important in the current academic formulation. And since, in most cases, the final challenge is not yet solvable, it’s often difficult to offer empirical counter-evidence to the opinions of the establishment as to whether a piece of work will or will not end up making an important difference.

I found this particularly vexing because my intuition is driven strongly by playing with a system, noting its current shortcomings, and then devising clever ways to overcome them. Some of the shortcomings I perceived were not considered shortcomings in the academic version of these challenges, and thus my interest in improving those aspects fell largely on deaf ears.

For instance, I did a fair amount of work in information extraction, which is about extracting structured information from free text (e.g. finding the title, author, and price of a book on an Amazon web page or determining which company bought which other one and for how much in a Reuters news article). The academic formulation of this problem is to run your system fully autonomously over a collection of pages, and your score is based on how many mistakes you make. There are two kinds of mistakes–extracting the wrong piece of information, or not extracting anything when you should have–and both are usually counted as equally bad (the main score used in papers is F1, which is the harmonic mean of precision and recall, which measure those two types of errors respectively). If your paper doesn’t show a competitive F1, it’s difficult to convince the community that you’re advancing the state of the art, and thus it’s difficult to get it published.
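To make the scoring concrete, here is a minimal sketch of how precision, recall, and F1 fall out of raw error counts. The function name and the two example systems are made up for illustration; they are not taken from any particular evaluation.

```python
def precision_recall_f1(true_positives, false_positives, false_negatives):
    """Return (precision, recall, F1) from raw extraction counts.

    false_positives: wrong pieces of information that were extracted
    false_negatives: correct pieces of information that were missed
    """
    extracted = true_positives + false_positives
    expected = true_positives + false_negatives
    precision = true_positives / extracted if extracted else 0.0
    recall = true_positives / expected if expected else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # Harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical systems scored against 100 expected facts.
cautious = precision_recall_f1(50, 1, 50)   # rarely wrong, misses half
sloppy = precision_recall_f1(80, 40, 20)    # finds more, wrong a third of the time
print(round(cautious[2], 3), round(sloppy[2], 3))  # 0.662 0.727
```

With these invented counts, F1 ranks the sloppy system above the cautious one, even though one in three of its extractions is wrong.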

However, in many real-world applications, the computer is not being run completely autonomously, and mistakes and omissions are not equally costly. In fact, if you’re trying to construct a high-quality database of information starting from free text, I’d say the general rule is that people are ultimately responsible for creating the output (the computer program is a means to that end), and that the real challenge is to see how much text you can automatically extract given that what you do extract has to be extremely high quality. In most cases, returning garbage data is much worse than not being able to cover every piece of information possible, and if humans can clean up the computer’s output, they will definitely want to do so. Thus the real-world challenges are maximizing recall at a fixed, high level of precision (not maximizing F1) and accurately estimating confidence scores for each piece of information extracted (so the human can focus on just cleaning up the tricky parts), neither of which fits cleanly into the academic conception of the problem. And this is to say nothing about how quickly or robustly the systems can process the information they’re extracting, which would clearly also be of utmost importance in a functioning system.
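One way to picture "maximizing recall at a fixed level of precision" is as choosing a confidence cutoff on a labeled development set. The sketch below assumes each extraction comes with a confidence score and a correct/incorrect label; the function name, data, and 0.9 target are all hypothetical.

```python
def best_recall_at_precision(scored, total_true, min_precision):
    """Pick the confidence cutoff that maximizes recall while keeping
    precision at or above min_precision.

    scored: list of (confidence, is_correct) pairs for extracted items
    total_true: number of facts that should have been extracted (> 0)
    Returns (threshold, precision, recall), or None if no cutoff qualifies.
    """
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    best = None
    correct = 0
    for kept, (confidence, is_correct) in enumerate(ranked, start=1):
        correct += int(is_correct)
        # Items with equal confidence must be kept or dropped together.
        if kept < len(ranked) and ranked[kept][0] == confidence:
            continue
        precision = correct / kept
        recall = correct / total_true
        # Recall never decreases as the cutoff is lowered, so the last
        # qualifying cutoff is the one with the highest recall.
        if precision >= min_precision:
            best = (confidence, precision, recall)
    return best


# Made-up development data: six extractions, five facts to find.
dev = [(0.99, True), (0.95, True), (0.90, True),
       (0.80, False), (0.70, True), (0.60, False)]
print(best_recall_at_precision(dev, total_true=5, min_precision=0.9))
# (0.9, 1.0, 0.6)
```

Everything below the chosen cutoff is what a human would review by hand, which is why well-calibrated confidence scores matter so much in this setting.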

I witnessed firsthand this difference between the problem academics are trying to solve and the solution that real applications need when I started working for Plaxo. A core component of the original system was the ability to let people e-mail you their current contact info (either in free text, like “hey, i got a new cell phone…” or in the signature blocks at the bottom of messages) and automatically extract that information and stick it in your address book. This would clearly be very useful if it worked well (the status quo is you have to copy-and-paste it all manually, and as a result, most people just leave that information sitting in e-mail), and it clearly fits the real-world description above (sticking garbage in your address book is unacceptable, whereas failing to extract 100% of the info is still strictly better than not doing anything). None of the academic systems being worked on had a chance of doing a good job at this problem, and so I had to write a custom solution involving a lot of complicated regular expressions and other pattern-matching code. My system ended up working very well–and very quickly (it could process a typical message in under 50 msec, whereas most academic systems are a “start it running and then go for coffee” kind of affair)–and developing it required a lot of clever ideas, but it was certainly nothing I could get an academic paper published about.
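To give a flavor of the precision-first, pattern-matching style (this is a toy sketch, not the Plaxo code, and the patterns, function name, and sample message are all invented), a deliberately strict extractor might look like this. It only accepts phone numbers written with separators, so it skips anything ambiguous rather than guessing.

```python
import re

# Deliberately strict, hypothetical patterns: better to miss an unusual
# format than to put garbage in someone's address book.
EMAIL = re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b")
PHONE = re.compile(r"(?<![\d-])\(?(\d{3})\)?[ .-]?(\d{3})[ .-](\d{4})(?!\d)")


def extract_contact_info(message):
    """Return de-duplicated, normalized e-mail addresses and US-style
    phone numbers found in a free-text message."""
    emails = sorted({m.group(0).lower() for m in EMAIL.finditer(message)})
    phones = sorted({"-".join(m.groups()) for m in PHONE.finditer(message)})
    return {"emails": emails, "phones": phones}


message = (
    "hey, i got a new cell phone: (650) 555-0123\n"
    "--\n"
    "Jane Doe\n"
    "jane@example.com | 650.555.0199\n"
)
print(extract_contact_info(message))
```

A bare run of digits such as an order number is ignored because the phone pattern requires a separator before the last four digits. That costs some recall and buys precision.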

The irony cuts both ways–when I tried to solve the real problem, I couldn’t get published, but the work that was published didn’t help. And yet the academic community could surely do a much better job of solving the real problem if only it hadn’t decided that wasn’t the problem it was interested in. I only bring this up because I am a big believer in the power and potential of academic research, and I still optimistically hope that its impact could be that much greater if its goals were more closely aligned with the ultimate problems it’s trying to solve. If the gap between academia and companies were bridged, both should be able to benefit tremendously.

If you’ve read this far in the hope of knowing more about the contents of my first NLP paper, I’m sorry to say it has nothing to do with information extraction, and certainly nothing to do with the academic/real-world divide. But it’s a neat paper (and probably shorter than this blog post!) and despite its not being published, the work it describes ended up influencing other work that I and people at the Stanford NLP group did, some of which did end up gaining a fair bit of recognition in academic circles.

My Stanford Master’s Thesis

Categorization by Character-Level Models: Exploiting the Sound Symbolism of Proper Names
Master’s thesis, Symbolic Systems Program, Stanford University
Christopher D. Manning, Advisor
June 11, 2003

Download PDF (52 pages) 

[Figure: A character-level hidden Markov model (HMM)]

After four years as an undergraduate at Stanford, I wasn’t ready to leave yet. There were more classes I wanted to take, and I wanted to do more research. Since I was in the Symbolic Systems program, I was taking a mix of Computer Science, Linguistics, Psychology, and Philosophy classes for my major. I was particularly interested in CS and Linguistics, and I wanted to take many of the graduate-level classes in each department, so I really needed a fifth year at school.

During my senior year, I had started doing some NLP research with Chris Manning, which I was really enjoying. When I took his CS224N class, I did a final paper with Steve “Sensei” Patel in which we built a model to recognize unknown words as drug, company, person, place, and movie names based on their composition (e.g. “cotrimoxazole” looks like a drug word, and “InterTrust” looks like a company name, and we trained our model to learn these patterns). Our model performed very well–in fact, it did better than our friends on the same tests!–and Chris asked me if I’d like to develop this research further with him. After the work we did during my senior year, he offered to fund me as a research assistant during my fifth year.

Stanford has this amazing co-terminal master’s program where you can start taking master’s classes before you finish your undergraduate degree, and so you end up getting both degrees in five years (some people, like my wife, even manage to squeeze both degrees into four years, but like I said, I wasn’t ready to leave yet). The Symbolic Systems program had just started offering a co-term, but it was research-based (in some departments you just have to take more classes) and so one requirement was you had to have a professor sponsoring your research and vouching that you were serious. The timing was perfect, and I was selected as one of a few students to do a research MS in SSP that year.

(That summer, I also met the founders of Plaxo and started working “part time” building some NLP tools for them. That’s another story, but let me say it’s really not possible to do research and a startup at the same time and do both of them well.)

While working in the Stanford NLP group, I spent a lot of time with Dan Klein, one of Chris’s star PhD students, who’s now a professor at Berkeley. He had a major influence on my work, as well as on me personally. During my co-term year, I also started working with a CS master’s student named Huy Nguyen. We became good friends and he’s now an engineer at Plaxo (hmm, I wonder how that happened ;)).

I wrote quite a few academic NLP papers during my time at Stanford, some of which got published and some of which didn’t. The original paper I did with Chris based on my CS224N project got rejected, but it ended up forming the core of the model Dan, Huy, and I used at the CoNLL-03 competition, which was very successful and has since been widely cited.

My thesis represents the culmination of the work I did at Stanford. Its central thrust is that you can tell a surprising amount about a proper name by looking at its composition at the character level. Most NLP systems just treat words as opaque symbols (“dog” = x1, “cat” = x2, etc.) and treat all unknown (previously unseen) words as a generic UNK word (that’s really all you can do if you’re only gathering statistics at the word level). As a result, these systems often perform poorly when dealing with unknown words, which are increasingly common as these systems are applied to the untamed world-wide-web or to domains like medicine and biology that are full of specific technical words.
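The flavor of the idea can be shown with a toy classifier over character trigrams. To be clear, this is a simple naive Bayes sketch with add-one smoothing, not the models from the thesis, and the class name and the handful of training words are invented for illustration.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(word, n=3):
    """Character n-grams with start (^) and end ($) boundary markers."""
    padded = "^" * (n - 1) + word.lower() + "$"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class CharNgramClassifier:
    """Naive Bayes over character n-grams with add-one smoothing."""

    def __init__(self, n=3):
        self.n = n
        self.ngram_counts = defaultdict(Counter)  # label -> n-gram counts
        self.class_counts = Counter()             # label -> number of words
        self.vocab = set()                        # all n-grams seen in training

    def train(self, examples):
        for word, label in examples:
            self.class_counts[label] += 1
            for gram in char_ngrams(word, self.n):
                self.ngram_counts[label][gram] += 1
                self.vocab.add(gram)

    def classify(self, word):
        total_words = sum(self.class_counts.values())
        best_label, best_score = None, float("-inf")
        for label, count in self.class_counts.items():
            counts = self.ngram_counts[label]
            # Denominator reserves one extra slot for never-seen n-grams.
            denom = sum(counts.values()) + len(self.vocab) + 1
            score = math.log(count / total_words)
            for gram in char_ngrams(word, self.n):
                score += math.log((counts[gram] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Tiny, made-up training set; a real system would use thousands of names.
model = CharNgramClassifier()
model.train([
    ("amoxicillin", "drug"), ("penicillin", "drug"),
    ("ampicillin", "drug"), ("ibuprofen", "drug"),
    ("microsoft", "company"), ("intersoft", "company"),
    ("netsoft", "company"), ("trustcorp", "company"),
])
print(model.classify("oxacillin"), model.classify("datasoft"))  # drug company
```

Neither test word appears in the training data, so a word-level model would map both to UNK; the character trigrams (“lin”, “in$”, “sof”, “ft$”) are what carry the signal.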

My research looked at a variety of ways you could exploit regularities in the character sequences of unknown words to segment and classify them semantically, even though you’d never seen them before. In addition to presenting experimental results in a number of domains and in multiple languages, I also investigated why there appears to be this sound-symbolic regularity in naming, looking at language evolution and professional brand-name creation in particular.

When my thesis was complete, I had to decide whether to apply to a PhD program to continue my research or to instead join Plaxo full-time as an engineer. As you probably know, I ended up choosing Plaxo, mainly because I really believed in the founders and the company’s vision, but also because I wanted to do something tangible that would have immediate impact in the real world. But I still think that someday I might like to go back to school and continue doing NLP research. The way I look at it, I can’t lose: by the time I’m ready to go back, either all the interesting problems in NLP will have already been solved–in which case the world will be a truly amazing place to live–or there will still be plenty left for me to work on. 🙂


© 2024 Joseph Smarr
