Tag The Web

Title: Tag The Web
Team: Jerry Fernandes Medeiros, Bernardo Pereira Nunes


Short description:
The main purpose of this project is to create a universal classification of the Web based on common sense rather than on a traditional classification system created by domain experts. That is, we believe it is easier for an ordinary user to access and retrieve information from a Web classified by people for people rather than by experts. The proposed method is a general-purpose classification that can classify any text-based content on the Web, from scientific articles to tweets. At the current stage, the method can classify any content in English and can be accessed at http://www.TagTheWeb.com.br.


Long description:

TAG THE WEB Project

Tag the Web is a project that aims to create a classification for the whole Web based on common sense and Linked Open Data.

WIKIPEDIA

Wikipedia is the largest encyclopedia freely available on the Web. It has been developed and curated by a large number of users over time and represents the common sense about facts, people and the broadest range of topics currently found on the Web.

The Wikipedia project was created in 2001 as an evolution of Nupedia, which was written only by experts, and is now maintained by the Wikimedia Foundation. Wikipedia content is available in almost all languages, and the English version alone has more than 5.4M articles. On average, 10 edits per second are performed by approximately 30M registered users all over the world.

One of the outstanding features of Wikipedia is the categorization system used to classify its internal content. Very briefly, there is a finite number of top categories that represent the whole Wikipedia content. These top categories, as well as their subcategories, are not fixed and are maintained and curated by Wikipedia users.

As in many classification methods, such as the Dewey Decimal Classification and the Library of Congress Classification, an article in Wikipedia can belong to one or more top categories, which in some sense represent the topics it covers. Within Wikipedia, the primary purpose of this classification is to facilitate the search for relevant information.

Note that although the structure of Wikipedia categories forms a taxonomy, it is not represented by a simple tree of categories and subcategories but by a dense graph, which allows multiple simultaneous categorizations of topics. That is, one category may have multiple parents. The category “Semantics” is a good example of this complex structure, since it is a subcategory of “Communication Studies”, “Cybernetics”, “Interdisciplinary Subfields of Sociology” and “Philosophy of Language”.

The Wikipedia categorization scheme is a collaboratively constructed thesaurus used for indexing the content of Wikipedia pages. Therefore, we can say that it represents the common sense about everything contained in Wikipedia. It is a classification made by people for people and not by experts for ordinary people. It is also noteworthy how rich this kind of information is for several tasks performed by users on the Web, for instance search, information retrieval, recommendation, clustering, etc.

The main purpose of this project is to create a universal classification of the Web based on common sense rather than on a traditional classification system created by domain experts. That is, we believe it is easier for an ordinary user to access and retrieve information from a Web classified by people for people rather than by experts. The proposed method is a general-purpose classification that can classify any text-based content on the Web, from scientific articles to tweets. At the current stage, the method can classify any content in English and can be accessed at http://www.TagTheWeb.com.br.

APPROACH OVERVIEW

As the basis for our approach, we consider the relationships between Wikipedia categories as a directed graph. Let G=(V, E) be a graph, where V is the set of vertices representing Wikipedia categories and E is the set of edges representing the relationships between two categories. To build the category graph we relied on a dump of the English version of Wikipedia from October 20th, resulting in a graph containing 1,475,806 vertices and 4,091,416 edges. Administrative categories, used only for Wikipedia's internal organization, were removed. Examples of these categories include, but are not limited to, categories beginning with Articles_needing_ and WikiProject_.
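
As an illustration of this filtering step, a minimal sketch in Python could look like the following; the helper names and the in-memory edge representation are assumptions for illustration, and only the two prefixes mentioned above come from the project description.

# Sketch: dropping administrative categories before the graph is built.
# Only the two prefixes below are taken from the project description; a real
# run would use the full list of administrative prefixes.
ADMIN_PREFIXES = ("Articles_needing_", "WikiProject_")

def is_administrative(category_name):
    # True for categories used only for Wikipedia's internal organization.
    return category_name.startswith(ADMIN_PREFIXES)

def filter_edges(edges):
    # Keep only (subcategory, parent) pairs where both ends are content categories.
    return [(child, parent) for child, parent in edges
            if not is_administrative(child) and not is_administrative(parent)]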

As the goal of this step is the creation of a graph of Wikipedia categories, we decided to use Neo4j, a free graph database with proper documentation and an active community. The graph in Neo4j was built with a single node label, Category, with two properties, categoryName and categoryID, and a single relationship type, SUBCATEGORY_OF, with no properties.
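
A minimal sketch of how such a graph could be loaded with the official Neo4j Python driver is shown below; the connection details and the shape of the edge records are assumptions, not the project's actual loading code.

# Sketch: loading category edges into Neo4j with the official Python driver.
# The URI, credentials and edge format are placeholders for illustration.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

LOAD_EDGE = """
MERGE (child:Category {categoryID: $child_id})
  ON CREATE SET child.categoryName = $child_name
MERGE (parent:Category {categoryID: $parent_id})
  ON CREATE SET parent.categoryName = $parent_name
MERGE (child)-[:SUBCATEGORY_OF]->(parent)
"""

def load_edges(edges):
    # edges: iterable of (child_id, child_name, parent_id, parent_name) tuples.
    with driver.session() as session:
        for child_id, child_name, parent_id, parent_name in edges:
            session.run(LOAD_EDGE, child_id=child_id, child_name=child_name,
                        parent_id=parent_id, parent_name=parent_name)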

For a better understanding of how Neo4j handles the graph, consider the following query as an example:

MATCH (a:Category)-[r:SUBCATEGORY_OF]->(b:Category {categoryName: "Carnivores"}) RETURN a

In this example, Neo4j fetches all nodes of type Category linked to another node, also of type Category, whose categoryName property has the value "Carnivores".

FINGERPRINT (Content “Tagging”)

The rich structure of the Wikipedia category graph has contributed to making it one of the largest semantic taxonomies in existence. That said, our main goal is to take advantage of this body of knowledge to automatically categorize any text-based content on the Web following the common sense of Wikipedia contributors.

We developed a universal categorization method based on four steps:
1. Text Annotation;
2. Categories Extraction;
3. Topic Attribution; and,
4. Fingerprint Generation.

To make it easier to understand, let us illustrate the fingerprint process step by step below:

Step 1 – Text Annotation

When dealing with the Web of Documents, we are essentially working with unstructured data, which in turn hinders data manipulation and the identification of atomic elements in texts. To alleviate this problem, information extraction (IE) methods, such as Named-Entity Recognition (NER) and name resolution, are employed. These tools automatically extract structured information from unstructured data and link it to external knowledge bases in the Linked Open Data (LOD) cloud, in our case DBpedia.

For instance, after processing the following Web resource using an IE tool: “I agree with Barack Obama that the whole episode should be investigated.”, the entity “Barack Obama” is annotated, classified as “person” and linked to the DBpedia resource <http://dbpedia.org/resource/Barack_Obama>, where structured information about him is available.

We use the DBpedia Spotlight tool <http://dbpedia-spotlight.github.io/demo/> to extract and enrich entities found in the Web resources.
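
A minimal sketch of how a text could be annotated through the public DBpedia Spotlight REST endpoint is shown below; the endpoint URL, confidence threshold and response handling are assumptions based on the public demo, not necessarily the setup used by the project.

# Sketch: annotating text with the public DBpedia Spotlight REST endpoint.
# Endpoint, confidence value and response fields are assumptions.
import requests

SPOTLIGHT_URL = "https://api.dbpedia-spotlight.org/en/annotate"

def annotate(text, confidence=0.5):
    # Return the DBpedia resource URIs of the entities found in the text.
    response = requests.get(
        SPOTLIGHT_URL,
        params={"text": text, "confidence": confidence},
        headers={"Accept": "application/json"},
    )
    response.raise_for_status()
    resources = response.json().get("Resources", [])
    return [r["@URI"] for r in resources]

For example, annotate("I agree with Barack Obama that the whole episode should be investigated.") would be expected to return a list including http://dbpedia.org/resource/Barack_Obama.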

Note that our method is language independent as long as we have a solid repository of entities (such as DBpedia) and a proper annotation tool (such as Spotlight, Alchemy or WikipediaMiner). However, the set of entities that can be identified by the annotation process is limited to the number of known entities in the dataset, in our case, the English DBpedia dataset.

Step 2 – Categories Extraction
Given the entities found in the previous step as a starting point, the categories extraction step begins by traversing the entity relationships to find a more general representation of each entity, i.e., its categories. All categories associated with the entities identified in the source of information are extracted.

For instance, for each extracted and enriched entity in a Web resource, we explore its relationships through the predicate [dcterms:subject], which by definition represents the categories of an entity. To retrieve the topics, we use the SPARQL query language for RDF over the DBpedia SPARQL endpoint, where we navigate up the DBpedia hierarchy to retrieve broader semantic relations between the entities and their topics.

Note that an entity/concept can be found at different levels of the hierarchical categories of DBpedia, and hence this approach may retrieve topics at different category levels.
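
A minimal sketch of this extraction step, assuming the public DBpedia SPARQL endpoint and the SPARQLWrapper library, could look as follows; it illustrates the idea rather than reproducing the project's actual queries.

# Sketch: retrieving the categories (dcterms:subject) of one entity from the
# public DBpedia SPARQL endpoint. Endpoint and library are assumptions.
from SPARQLWrapper import SPARQLWrapper, JSON

ENDPOINT = "https://dbpedia.org/sparql"

def categories_of(entity_uri):
    # Return the Wikipedia categories directly associated with a DBpedia entity.
    sparql = SPARQLWrapper(ENDPOINT)
    sparql.setQuery(f"""
        PREFIX dcterms: <http://purl.org/dc/terms/>
        SELECT ?category WHERE {{ <{entity_uri}> dcterms:subject ?category }}
    """)
    sparql.setReturnFormat(JSON)
    results = sparql.query().convert()
    return [b["category"]["value"] for b in results["results"]["bindings"]]

For example, categories_of("http://dbpedia.org/resource/Barack_Obama") would be expected to return categories such as Presidents_of_the_United_States.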

Step 3 – Topic Attribution
In the third step we want to understand how the resource being tagged is related to the top level of Wikipedia's overall classification system.

We consider the paths from all categories associated with the entities in a Web page to all container categories of Wikipedia, i.e., the ones that are subcategories of Main_topic_classifications, to generate a fingerprint for each Web resource.

Knowing all categories associated with the entities in a given Web resource, we can then assign each entity to the top-level categories it belongs to. This process consists of navigating the category graph from each category extracted in the previous step towards the top of the graph along all the shortest paths between the category and the container Main_topic_classifications.

Each time the source category reaches one of the top-level categories, we update the influence of this top category in the composition of the resource classification.
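
One plausible reading of this traversal, expressed as a Cypher shortest-path query issued from Python, is sketched below; the property values (e.g. the exact spelling of the container's categoryName) and the weighting scheme are assumptions, so this approximates the idea rather than reproducing the project's implementation.

# Sketch: counting how often each top-level category is reached from the
# categories extracted in the previous step. Names and weights are assumptions.
from collections import Counter
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

INFLUENCE_QUERY = """
MATCH (start:Category {categoryName: $start})
MATCH (top:Category)-[:SUBCATEGORY_OF]->(:Category {categoryName: 'Main_topic_classifications'})
MATCH path = allShortestPaths((start)-[:SUBCATEGORY_OF*]->(top))
RETURN top.categoryName AS topCategory, count(path) AS paths
"""

def top_category_influence(extracted_categories):
    # Accumulate, per top-level category, how many shortest paths reach it.
    influence = Counter()
    with driver.session() as session:
        for category in extracted_categories:
            for record in session.run(INFLUENCE_QUERY, start=category):
                influence[record["topCategory"]] += record["paths"]
    return influence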

Step 4 – Fingerprint Generation
The last step is the fingerprint generation based on the influence of each top category on the resource being tagged. We store the calculated classification as a multidimensional vector, making it easy to retrieve and compare documents, for instance using a straightforward similarity metric such as cosine similarity.
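
For illustration, a fingerprint comparison with cosine similarity can be as simple as the following sketch, which assumes the fingerprints are stored as equally sized numeric vectors over the top-level categories.

# Sketch: cosine similarity between two fingerprint vectors.
import math

def cosine_similarity(a, b):
    # a, b: equally sized numeric vectors over the top-level categories.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)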

FINAL REMARKS
We believe that this challenge will give us the opportunity to showcase our work and demonstrate its usefulness in several areas, for instance education, information retrieval, the Semantic Web, recommendation systems, digital libraries, and so on.

This project is currently being developed at the Federal University of the State of Rio de Janeiro and the Pontifical Catholic University of Rio de Janeiro, in Brazil.

Country: Brazil

 
