Finding Educational Resources on the Web: Exploiting Automatic Extraction of Metadata
Workshop on Adaptive Text Extraction and Mining
Cavtat-Dubrovnik, Croatia
September 22, 2003

Download PDF (4 pages)

The Semantic Web is a great idea: expose all of the information on the web in a machine-readable format, and intelligent agents will then be able to read it and act on your behalf (“Computer: When can I fly to San Diego? Where can I stay that has a hot tub? Ok, book it and add it to my calendar”). There’s just one problem: the humans writing web pages are writing them for other humans, and no one is labeling them for computers. (A notable exception is blogs, like this one, whose authoring tools also generate machine-readable versions in RSS or Atom that can be consumed by sites like Bloglines. In a way, Bloglines is one of the few sites making good on the vision of the Semantic Web.)

What do people do when they’re looking for a piece of information, say a list of syllabi for NLP classes? There’s no database that lists that type of information in a structured and curated form. Rather, there is a set of web pages that describe these classes, and they’re all a bit different. But most of them contain similar information: the title of the class, the course number, the professor, and so on. So, in a way, these pages do constitute a database of information; it just takes more work to access it.
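To make that concrete, here is a rough sketch of the kind of structured record hiding inside each course page. The field names and example values are illustrative only, not the actual schema we used:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class CoursePage:
    """One course homepage, reduced to the fields most syllabus pages share.

    The fields here are illustrative; in our system the schema came from
    the ontology, not from a hard-coded class like this.
    """
    title: str                    # e.g. "Natural Language Processing"
    course_number: Optional[str]  # e.g. "CS 224N"
    instructor: Optional[str]     # e.g. the professor's name
    url: str                      # the page the record was extracted from

# The implicit "database row" buried in one page's human-readable HTML
# (hypothetical example values).
example = CoursePage(
    title="Natural Language Processing",
    course_number="CS 224N",
    instructor="Christopher Manning",
    url="https://example.edu/cs224n/",
)
```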

That’s where NLP comes in. One of the ways we were using information extraction in the Stanford NLP group was to automatically extract structured information from web pages and represent it in a Semantic Web format like DAML+OIL and RDF. The idea is that you send your intelligent agent out to the web (“find me a bunch of NLP classes”) and when it comes across a page that looks promising, it first looks for Semantic Web markup. If it can’t find any (which will usually be the case for the foreseeable future), it tries running the information extraction engine on the site to pull out the relevant data anyway. If the site allows it, it could then write that data back in a machine-readable format so the web becomes semantically richer the more agents go looking for information.
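The control flow is roughly the following sketch. The markup check and the extractor below are crude placeholders standing in for the real components, shown only to make the order of operations concrete:

```python
import re
from typing import Optional, Union

def find_semantic_markup(html: str) -> Optional[str]:
    """Return embedded RDF markup if the page already carries any (placeholder check)."""
    match = re.search(r"<rdf:RDF.*?</rdf:RDF>", html, re.DOTALL)
    return match.group(0) if match else None

def run_extractor(html: str) -> dict:
    """Stand-in for the information extraction engine: just grab the page title."""
    match = re.search(r"<title>(.*?)</title>", html, re.DOTALL | re.IGNORECASE)
    return {"title": match.group(1).strip() if match else None}

def annotate_page(html: str) -> Union[str, dict]:
    # 1. Prefer machine-readable markup if the page already has it.
    markup = find_semantic_markup(html)
    if markup is not None:
        return markup
    # 2. Otherwise fall back to information extraction over the HTML.
    return run_extractor(html)

print(annotate_page("<html><title>CS 224N: Natural Language Processing</title></html>"))
```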

Specifically, we built a plug-in to the Protege tool developed by the Stanford Medical Informatics group. Protege is a Java-based tool for creating and managing ontologies (a form of knowledge representation used by the Semantic Web). Our plug-in let you load a web page, run our information extraction tools on it, and add the extracted semantic information to your ontology. You could build up a collection of general-purpose information extraction tools (either hand-built or trained from data) and then use them as you found web pages you wanted to annotate.
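The plug-in itself was Java code against the Protege API, but the core step of turning extracted fields into ontology instances looks roughly like this sketch, written here with Python's rdflib and an invented namespace purely for illustration:

```python
from rdflib import Graph, Literal, Namespace, RDF, URIRef

# Invented namespace and property names for this sketch; the real plug-in
# worked against whatever ontology was loaded in Protege.
COURSE = Namespace("http://example.org/courses#")

def add_course(graph: Graph, url: str, fields: dict) -> None:
    """Add one extracted course record to the RDF graph as a Course instance."""
    course = URIRef(url)
    graph.add((course, RDF.type, COURSE.Course))
    if fields.get("title"):
        graph.add((course, COURSE.title, Literal(fields["title"])))
    if fields.get("instructor"):
        graph.add((course, COURSE.instructor, Literal(fields["instructor"])))

g = Graph()
add_course(g, "https://example.edu/cs224n/",
           {"title": "Natural Language Processing", "instructor": "Christopher Manning"})
print(g.serialize(format="turtle"))
```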

Cynthia Thompson, a visiting professor for the year, used this system to find and extract information about educational materials on the web as part of the Edutella project. It ended up working well, and this paper was accepted to the Workshop on Adaptive Text Extraction and Mining as part of the annual European Conference on Machine Learning (ECML). I declined the offer to go to Croatia for the conference (though I’m sure it would have been a memorable experience), but I’m glad that my work contributed to this project.
