Wikipedia API?

August 11, 2007

A team of German university researchers is putting together an API for Wikipedia. They describe it as follows: “DBpedia.org is a community effort to extract structured information from Wikipedia and to make this information available on the Web. DBpedia allows you to ask sophisticated queries against Wikipedia and to link other datasets on the Web to Wikipedia data.”

What they seem to have actually done is extract data from Wikipedia and make it available online through their API, with SPARQL as the query language against this data.
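To make that concrete, here is a sketch of the kind of SPARQL query you could run against their public endpoint at http://dbpedia.org/sparql. The dbpedia2: property namespace and the birthPlace property name are my assumptions about how the extracted infobox data is modelled, so treat this as illustrative rather than tested:

    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    PREFIX dbpedia2: <http://dbpedia.org/property/>

    # Find articles whose infobox lists Berlin as a birth place,
    # along with their English labels. Property names are assumed.
    SELECT ?person ?name
    WHERE {
      ?person dbpedia2:birthPlace <http://dbpedia.org/resource/Berlin> .
      ?person rdfs:label ?name .
      FILTER (lang(?name) = "en")
    }
    LIMIT 10

Results come back in whatever format you ask the endpoint for, which is what makes this usable as a web API rather than just a data dump.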

From their Introduction:

Wikipedia is by far the largest publicly available encyclopedia on the Web. Wikipedia editions are available in over 100 languages, with the English one accounting for more than 1.6 million articles. Wikipedia has the problem that its search capabilities are limited to full-text search, which allows only very limited access to this valuable knowledge base.

Semantic Web technologies enable expressive queries against structured information on the Web. The Semantic Web has the problem that there is not much RDF data online yet and that up-to-date terms and ontologies are missing for many application domains.

The DBpedia.org project approaches both problems by extracting structured information from Wikipedia and by making this information available on the Semantic Web. DBpedia.org allows you to ask sophisticated queries against Wikipedia and to link other datasets on the Web to DBpedia data.

Wikipedia articles consist mostly of free text, but also contain different types of structured information, such as infobox templates, categorisation information, images, geo-coordinates and links to external Web pages. This structured information can be extracted from Wikipedia and can serve as a basis for enabling sophisticated queries against Wikipedia content.

The DBpedia.org project uses the Resource Description Framework (RDF) as a flexible data model for representing extracted information and for publishing it on the Web. We use the SPARQL query language to query this data.

The DBpedia dataset currently consists of around 91 million RDF triples, which have been extracted from the English, German, French, Spanish, Italian, Portuguese, Polish, Swedish, Dutch, Japanese and Chinese versions of Wikipedia. The DBpedia dataset describes 1,600,000 concepts, including at least 58,000 persons, 70,000 places, 35,000 music albums and 12,000 films. It contains 557,000 links to images, 1,300,000 links to relevant external web pages, 207,000 Wikipedia categories and 75,000 YAGO categories.
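Those category counts suggest an obvious way in: querying by category membership. Here is another sketch, this time listing articles filed under a single Wikipedia category. The skos:subject property and the specific category URI are assumptions on my part about how they expose the category links:

    PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

    # List articles linked to one Wikipedia category.
    # skos:subject and the category URI are assumed, not verified.
    SELECT ?film
    WHERE {
      ?film skos:subject
            <http://dbpedia.org/resource/Category:German_films> .
    }
    LIMIT 20

If queries like these line up with their actual schema, you get Wikipedia as a queryable database rather than a pile of pages, which is exactly what the project is promising.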