PLEASE NOTE:

We are currently releasing the second version of Dexter, so this documentation may be out of date. We are working to update it! In the meantime you can browse (and play with) the new REST API documentation :)

Developing

You can use Dexter in two different ways:

  • Using the REST API, after downloading the jar and its resources;
  • Using the Java API.

The REST API

Start a REST Server

Click on this link to download Dexter. The archive takes around 2 gigabytes and contains the Dexter binary (dexter.jar) and the model that Dexter uses for annotating.

The current model was generated from the 04/03/2013 English Wikipedia dump, available here (we plan to release updated models for English and other languages).

Once the download is finished, untar the package and, from the dexter directory, run

java -Xmx3000m -jar dexter.jar

You will need at least 3 GB of RAM and Java 7. The framework should be available within a few seconds at the address:

http://localhost:8080/

The first query will take a while because Dexter has to load the whole model into main memory.

The API

Currently Dexter supports these functions:

| Method | Params | Description | Example |
|---|---|---|---|
| rest/annotate | text: the text to annotate; n (optional, default=5): the maximum number of entities to annotate | Performs entity linking on the given text, annotating at most n entities | example |
| rest/get-desc | id: the wiki-id of an entity; title-only (optional, default=false): if set to true, returns only the label of the entity | Given the wiki-id of an entity, returns the label and a snippet containing some sentences that describe the entity (the snippet is retrieved from the Lucene index if present, otherwise by calling the Wikipedia API) | example |
| rest/spot | text: the text to spot | Performs only the first step of the entity linking process, i.e., finds all the mentions that could refer to an entity | example |
| rest/graph/get-source-entities | wid: the id of the entity; asWikiNames (optional): true or false | Returns all the entities whose corresponding Wikipedia page contains a link to the given entity. By default it returns the wikiIds; if asWikiNames is set to true, it returns the titles of the pages | example for Maradona |
| rest/graph/get-target-entities | wid: the id of the entity; asWikiNames (optional): true or false | Returns all the entities whose corresponding Wikipedia page is linked by the given entity. By default it returns the wikiIds; if asWikiNames is set to true, it returns the titles of the pages | example for Maradona |
| rest/graph/get-entity-categories | wid: the id of the entity; asWikiNames (optional): true or false | Returns all the categories of the given entity. By default it returns the wikiIds; if asWikiNames is set to true, it returns the titles of the pages | example for Maradona |
| rest/graph/get-belonging-entities | wid: the id of the category; asWikiNames (optional): true or false | Returns all the entities belonging to the given category. By default it returns the wikiIds; if asWikiNames is set to true, it returns the titles of the pages | example for Category 1982 FIFA World Cup players |
| rest/graph/get-parent-categories | wid: the id of the category; asWikiNames (optional): true or false | Returns all the parent categories of the given category. By default it returns the wikiIds; if asWikiNames is set to true, it returns the titles of the pages | example for Category 1982 FIFA World Cup players |
| rest/graph/get-child-categories | wid: the id of the category; asWikiNames (optional): true or false | Returns all the child categories of the given category. By default it returns the wikiIds; if asWikiNames is set to true, it returns the titles of the pages | example for Category 1982 FIFA World Cup |

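
The functions listed above can also be called without the Java client. The following is only a rough sketch: it assumes that the parameters are passed as URL query parameters of a GET request and that the server runs at http://localhost:8080 as described earlier. The class and method names (DexterUrlSketch, annotateUrl, sourceEntitiesUrl, fetch) and the wid value 12345 are made up for illustration and are not part of Dexter.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UnsupportedEncodingException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;

final class DexterUrlSketch {

    /** Builds the URL for rest/annotate (assumes GET query parameters). */
    static String annotateUrl(String server, String text, int n) {
        return server + "/rest/annotate?text=" + encode(text) + "&n=" + n;
    }

    /** Builds the URL for rest/graph/get-source-entities. */
    static String sourceEntitiesUrl(String server, int wid, boolean asWikiNames) {
        return server + "/rest/graph/get-source-entities?wid=" + wid
                + "&asWikiNames=" + asWikiNames;
    }

    /** Percent-encodes a parameter value; spaces become '+'. */
    private static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    /** Performs the GET request and returns the response body as text. */
    static String fetch(String url) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setRequestMethod("GET");
        StringBuilder body = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
            String line;
            while ((line = in.readLine()) != null) {
                body.append(line).append('\n');
            }
        } finally {
            conn.disconnect();
        }
        return body.toString();
    }

    public static void main(String[] args) {
        String server = "http://localhost:8080";
        // Only the URLs are printed here; pass them to fetch(...) once a server is running.
        System.out.println(annotateUrl(server, "Maradona played in Naples", 5));
        System.out.println(sourceEntitiesUrl(server, 12345, true)); // 12345 is a placeholder wid
    }
}
```

Keeping the URL construction separate from the network call makes it easy to check the request before sending it to a running server.
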
Using the Java API Client

Download the Dexter source code:

git clone https://github.com/diegoceccarelli/dexter
cd dexter
git submodule init
git submodule update

The project is built with Maven, so to compile it go to the main folder of the project (dexter) and run:

mvn install 

Once the installation is complete, add the following dependency to your Maven project:

<dependency>
	<groupId>it.cnr.isti.hpc</groupId>
	<artifactId>dexter-webapp</artifactId>
	<version>1.0.0</version>
</dependency>

Then you will be able to call the REST API from your Java project using the DexterRestClient, as in the following example:

// Connect to the public demo server
DexterRestClient client = new DexterRestClient(
		"http://dexterdemo.isti.cnr.it:8080/rest");
String text = "Dexter is an American television drama series which debuted on Showtime on October 1, 2006. The series centers on Dexter Morgan (Michael C. Hall), a blood spatter pattern analyst for the fictional Miami Metro Police Department (based on the real life Miami-Dade Police Department) who also leads a secret life as a serial killer. Set in Miami, the show's first season was largely based on the novel Darkly Dreaming Dexter, the first of the Dexter series novels by Jeff Lindsay. It was adapted for television by screenwriter James Manos, Jr., who wrote the first episode. ";
// Full entity linking: spot the mentions and link them to entities
AnnotatedDocument ad = client.annotate(text);
System.out.println(ad);
// Spotting only: find the mentions that could refer to an entity
SpottedDocument sd = client.spot(text);
System.out.println(sd);
// Label and description of the entity with the given wiki-id
ArticleDescription desc = client.getDesc(5981816);
System.out.println(desc);

If you downloaded the framework and started it on your machine, you can also call your own service by changing the server URL:

DexterRestClient client = new DexterRestClient(
		"http://localhost:8080/rest");

Managing the Java project

Install

You can install the Java project by checking it out from GitHub:

git clone https://github.com/diegoceccarelli/dexter
cd dexter
git submodule init
git submodule update
 

The project is built with Maven, so to compile it go to the main folder of the project (dexter) and run:

mvn install 

The compilation should terminate with no errors. You will still need the model 'data' folder provided in dexter.tar. You can put it wherever you want, but you will have to indicate its location in the project.properties files contained in the subfolders dexter-core and dexter-webapp.

Dexter is organized in several submodules, briefly described below:

Json-Wikipedia

(see the javadoc)

Json-Wikipedia contains code to convert the Wikipedia XML dump into a JSON dump.

java -cp target/json-wikipedia-1.0.0-jar-with-dependencies.jar it.cnr.isti.hpc.wikipedia.cli.MediawikiToJsonCLI -input wikipedia-dump.xml.bz -output wikipedia-dump.json[.gz] -lang [en|it]

or

./scripts/convert-xml-dump-to-json.sh [en|it] wikipedia-dump.xml.bz wikipedia-dump.json[.gz]

Both commands write the JSON version of the dump to wikipedia-dump.json. Each line of the file contains one article of the dump encoded in JSON. Each JSON line can be deserialized into an Article object, which represents an enriched version of the wikitext page. The Article object contains:

  • the title (e.g., Leonardo Da Vinci);
  • the wikititle (used in Wikipedia as key, e.g., Leonardo_Da_Vinci);
  • the namespace and the integer namespace in the dump;
  • the timestamp of the article;
  • the type, if it is a standard article, a redirection, a category and so on;
  • if the article is not in English, the title of the corresponding English article;
  • a list of tables that appear in the article;
  • a list of lists that appear in the article;
  • a list of internal links that appear in the article;
  • if the article is a redirect, the pointed article;
  • a list of section titles in the article;
  • the text of the article, divided in paragraphs;
  • the categories and the templates of the article;
  • the list of attributes found in the templates;
  • a list of terms highlighted in the article;
  • if present, the infobox.
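
Since the dump holds one JSON article per line, it can be processed as a stream without loading the whole file. The sketch below illustrates the idea with the standard library only. It assumes the title is stored under a "title" key (the field names in the sample lines are guesses, not taken from a real dump) and, to stay dependency-free, pulls it out with a regular expression. In a real project a JSON library would deserialize each line into the Article class instead.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class JsonDumpSketch {

    // Matches "title":"..." and captures the raw (still JSON-escaped) value.
    // The key name "title" is an assumption about the dump format.
    private static final Pattern TITLE =
            Pattern.compile("\"title\"\\s*:\\s*\"((?:[^\"\\\\]|\\\\.)*)\"");

    /** Returns the raw title found in one JSON line, or null if there is none. */
    static String extractTitle(String jsonLine) {
        Matcher m = TITLE.matcher(jsonLine);
        return m.find() ? m.group(1) : null;
    }

    /** Reads the dump one line (one article) at a time and collects the titles. */
    static List<String> readTitles(BufferedReader dump) throws IOException {
        List<String> titles = new ArrayList<>();
        String line;
        while ((line = dump.readLine()) != null) {
            if (line.trim().isEmpty()) {
                continue; // skip blank lines
            }
            String title = extractTitle(line);
            if (title != null) {
                titles.add(title);
            }
        }
        return titles;
    }

    public static void main(String[] args) throws IOException {
        // Two made-up lines standing in for wikipedia-dump.json.
        String sample =
                "{\"title\":\"Leonardo Da Vinci\",\"wikiTitle\":\"Leonardo_Da_Vinci\"}\n"
              + "{\"title\":\"Mona Lisa\",\"wikiTitle\":\"Mona_Lisa\"}\n";
        try (BufferedReader dump = new BufferedReader(new StringReader(sample))) {
            System.out.println(readTitles(dump));
        }
    }
}
```

For a real dump, the StringReader would be replaced by a reader over the file, wrapped in a java.util.zip.GZIPInputStream when the output was written as .gz.
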

Dexter-Core

(see the javadoc)

The core implements the pipeline for generating the entity linking model from a Wikipedia dump. It also provides all the tools needed to write an entity linking method.

The most important objects to understand are:

  • the Spot object, which represents a mention that may refer to one or more candidate entities;
  • the Entity object, which represents an entity;
  • the SpotMatch object, which represents a particular mention in a given text;
  • the EntityMatch object, which represents a particular match of an entity in a document.

It also defines some important interfaces for performing the linking:

  • Spotter, which defines the method spot that, given a text, returns a list of SpotMatches;
  • Disambiguator, which, given a list of SpotMatches, returns a list of EntityMatches.

It is possible to write new Spotters or Disambiguators and use them in Dexter by putting the jars in the folder dexter/libs, or inside the folder dexter-webapp/src/main/webapp/WEB-INF/lib, and then selecting them in the project.properties file, e.g.,

disambiguator.class=it.cnr.isti.hpc.wikiminer.Wikiminer
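
To illustrate how a class named in a properties file can be turned into an object at runtime, here is a generic sketch based on java.util.Properties and reflection. It is not Dexter's actual loading code: the class PluginLoaderSketch and its method instantiate are hypothetical, and java.util.ArrayList stands in for a real Disambiguator class because the Dexter classes are not on the classpath of this example.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

final class PluginLoaderSketch {

    /**
     * Looks up a class name under the given key and creates an instance
     * through its public no-argument constructor.
     */
    static Object instantiate(Properties props, String key) {
        String className = props.getProperty(key);
        if (className == null) {
            throw new IllegalArgumentException("Missing property: " + key);
        }
        try {
            return Class.forName(className).getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot instantiate " + className, e);
        }
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the content of project.properties.
        Properties props = new Properties();
        props.load(new StringReader("disambiguator.class=java.util.ArrayList\n"));

        Object plugin = instantiate(props, "disambiguator.class");
        System.out.println(plugin.getClass().getName());
    }
}
```

With this kind of lookup, a class can only be selected if its jar is on the classpath, which is why the jars have to be placed in one of the lib folders mentioned above.
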

By default, Dexter ships with one Spotter (based on the dictionary of the anchors in Wikipedia) and one Disambiguator. The Disambiguator implements the Occam's razor principle: it resolves the ambiguity of a spot by choosing the entity with the largest probability of being represented by the spot. This probability is called commonness, and it is computed as the ratio between the number of links that point to the entity using the spot as anchor and the total number of links that have the spot as anchor.
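
The commonness ratio can be made concrete with a small, self-contained sketch. The link counts below are invented for illustration, and the class CommonnessSketch with its methods is hypothetical; it mirrors the definition given above rather than Dexter's actual Disambiguator code.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class CommonnessSketch {

    /**
     * commonness = links pointing to the entity with this spot as anchor
     *              / all links that have this spot as anchor.
     * The map holds, for one spot, the link count of each candidate entity.
     */
    static double commonness(Map<String, Integer> linksPerEntity, String entity) {
        int total = 0;
        for (int count : linksPerEntity.values()) {
            total += count;
        }
        Integer count = linksPerEntity.get(entity);
        if (total == 0 || count == null) {
            return 0.0;
        }
        return (double) count / total;
    }

    /** Picks the candidate with the highest link count, i.e. the highest commonness. */
    static String mostCommonEntity(Map<String, Integer> linksPerEntity) {
        String best = null;
        int bestCount = -1;
        for (Map.Entry<String, Integer> e : linksPerEntity.entrySet()) {
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Made-up counts of links whose anchor text is "Dexter".
        Map<String, Integer> dexter = new LinkedHashMap<>();
        dexter.put("Dexter (TV series)", 60);
        dexter.put("Dexter, Michigan", 30);
        dexter.put("Dexter's Laboratory", 10);

        System.out.println(mostCommonEntity(dexter));
        System.out.println(commonness(dexter, "Dexter (TV series)"));
    }
}
```

With these counts the spot "Dexter" resolves to the TV series, whose commonness is 60 out of 100 links.
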

Creating a New Model

TODO

Dexter-Webapp

(see the javadoc)

Finally, you will be able to start the web app, with the interface and the REST API, by going into the folder dexter-webapp and running

mvn jetty:run -DskipTests

Entity Linking Models

TODO

Hpc Utils

(see the javadoc)

TODO