This is an old revision of the document!
This page contains some useful information about the Information Retrieval project.
In the project, you will have to develop a Relatedness function, i.e. a function that, given two entities, returns their semantic relatedness. Entities are expressed as Wikipedia pages. For example:
R(Tree, Tree) = 1.0
R(Tree, Leaf) = 0.9
R(Tree, Darth_Vader) = 0.0
Note that TagMe is built upon such a function.
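The examples above suggest two properties of a relatedness function: scores lie in [0, 1], and an entity is maximally related to itself. Below is a minimal sketch of a guard that enforces both. It assumes (this is not stated by the project libraries) that out-of-range scores should simply be clamped; the class and method names are hypothetical.

```java
/** Hypothetical helper, not part of the project libraries. */
final class RelatednessContract {
    private RelatednessContract() {}

    /**
     * Post-processes a raw score for the pair (a, b):
     * identical page IDs get 1.0, NaN becomes 0.0,
     * and every other value is clamped into [0, 1].
     */
    static float sanitize(int a, int b, float raw) {
        if (a == b) {
            return 1.0f;
        }
        if (Float.isNaN(raw)) {
            return 0.0f;
        }
        return Math.max(0.0f, Math.min(1.0f, raw));
    }

    public static void main(String[] args) {
        System.out.println(sanitize(7, 7, 0.2f));  // same page ID
        System.out.println(sanitize(7, 8, 1.4f));  // too large, clamped
        System.out.println(sanitize(7, 8, -0.3f)); // too small, clamped
    }
}
```

Such a guard could be applied to the value computed inside rel before returning it.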
To assist your development, you are provided with:
We suggest you perform the following steps to get into the environment.
ssh username@ferrax-2.itc.unipi.it
ls /home/irproject/*lib*
irproject.IRProjectHelper irproject.IRProjectExperiments
IRProjectHelper gives you access to the Wikipedia graph, while IRProjectExperiments runs two experiments to test your function.
free -h
df -h
As our first relatedness function, we will make a dummy one that always returns 0.
Create a file DummyRel.java with your favorite text editor:
import it.acubelab.tagme.RelatednessMeasure;

public class DummyRel extends RelatednessMeasure {
    public DummyRel(String lang) {
        super(lang);
    }

    @Override
    public float rel(int a, int b) {
        return 0.0f; // Dummy measure: every pair is unrelated.
    }
}
Then save and exit.
Create a directory bin/ to store your compiled files, set an environment variable for the path of the libraries, and compile:
mkdir bin
export IRLIB=../irproject/lib/*:../irproject/tagme_libs/*:../irproject/tagme_preproc_lib/*
javac -cp $IRLIB -d bin/ *.java
Create a Main.java file and paste the following code:
import it.acubelab.tagme.config.TagmeConfig;
import it.acubelab.tagme.RelatednessMeasure;
import irproject.IRProjectExperiments;

public class Main {
    public static void main(String[] args) throws Exception {
        TagmeConfig.init("/home/irproject/config.xml");
        String groupName = ""; // Insert your group name here.
        String groupPw = ""; // Insert your group password here.
        RelatednessMeasure rel = new DummyRel("en");
        IRProjectExperiments.launchTagMeExperiment(groupName, groupPw, rel);
        IRProjectExperiments.launchRelatednessExperiment(groupName, groupPw, rel);
    }
}
java -cp $IRLIB:./bin Main
On the first launch, the program will have to query Wikipedia and retrieve some data. Don't worry: this data gets cached, and if you run the program again, the output will be much shorter. Running the program again will generate the following output:
Results for the Evaluation of TagMe:
F1:0.524198
Precision:0.577957
Recall:0.479590
Time spent for the annotations:20.935064
Results for the Evaluation of the Relatedness function:
Quadratic distance:0.369874
Absolute distance:0.408762
Look back at the scoreboard page. Your group should have appeared.
Let's have a closer look at these numbers. There are two kinds of results:
You don't really have to care about how the first three numbers are computed (if interested, read here), but keep in mind that the higher those numbers are, the better TagMe is performing.
Let's focus on the last two figures. The test of the relatedness function is performed against a dataset of 311 pairs (entity, entity). For each pair, the dataset provides a relatedness found by humans. Your task is to make a relatedness function that computes a relatedness as close as possible to that given by those humans.
Let P be the list of pairs, H(p) the relatedness found by the humans for a pair of entities p, and R(p) the relatedness found by your function for p. The measures we employ are defined as:
In other words, they compute the distance between your and their relatedness, and do an average. Quadratic distance penalizes bigger mistakes more than Absolute distance.
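A plausible formalization of this description, assuming both measures are plain averages over P (an assumption consistent with the text, not a definition taken from the evaluation code):

```latex
\[
\text{Quadratic distance} = \frac{1}{|P|} \sum_{p \in P} \bigl(H(p) - R(p)\bigr)^2
\qquad
\text{Absolute distance} = \frac{1}{|P|} \sum_{p \in P} \bigl|H(p) - R(p)\bigr|
\]
```

Under this reading, a single error of 0.5 contributes 0.25 to the quadratic sum, while five errors of 0.1 contribute only 0.05 in total, although both cases add 0.5 to the absolute sum.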
Add to the end of your main method a line like:
IRProjectExperiments.dumpRelatednessExperiment(groupName, rel);
This will dump, for each pair of entities, the expected relatedness (the one given by humans) and that returned by your function.
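With the dumped values at hand, both distances can be recomputed offline, for instance to see which pairs contribute most to the error. The sketch below assumes that the measures are the mean squared and the mean absolute difference (the exact definitions used by IRProjectExperiments are not shown here) and that the dump has already been parsed into two parallel arrays. The class name and the toy values are made up.

```java
/** Hypothetical offline re-computation of the two distances. */
final class DistanceSketch {
    private DistanceSketch() {}

    /** Mean of (expected - actual)^2 over all pairs. */
    static double quadraticDistance(double[] expected, double[] actual) {
        checkLengths(expected, actual);
        double sum = 0.0;
        for (int i = 0; i < expected.length; i++) {
            double diff = expected[i] - actual[i];
            sum += diff * diff;
        }
        return sum / expected.length;
    }

    /** Mean of |expected - actual| over all pairs. */
    static double absoluteDistance(double[] expected, double[] actual) {
        checkLengths(expected, actual);
        double sum = 0.0;
        for (int i = 0; i < expected.length; i++) {
            sum += Math.abs(expected[i] - actual[i]);
        }
        return sum / expected.length;
    }

    private static void checkLengths(double[] expected, double[] actual) {
        if (expected.length == 0 || expected.length != actual.length) {
            throw new IllegalArgumentException("arrays must be non-empty and of equal length");
        }
    }

    public static void main(String[] args) {
        double[] human = {1.0, 0.5, 0.0}; // made-up human judgments
        double[] mine = {0.0, 0.0, 0.0};  // what a constant-zero function returns
        System.out.println("Quadratic distance: " + quadraticDistance(human, mine));
        System.out.println("Absolute distance: " + absoluteDistance(human, mine));
    }
}
```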
Before starting to implement your function, we suggest having a look at a few articles:
You can use irproject.IRProjectHelper to access some pre-computed data that you may find useful to develop your function. Note that we do not suggest limiting yourself to these methods: if you need a new one, implement it!
These are the methods provided by IRProjectHelper:
public static int[] getInlinks(int page_id);
public static int[] getOutlinks(int page_id);
public static int TitleToId(String title);
public static String getCategoryTitle(int catId);
public static IntSet getAllWids();
public static boolean isDisambiguation(int pageId);
public static boolean isNormalPage(int page_id);
public static boolean isPerson(int pageId);
public static int[] getCategories(int pageId);
public static int dereference(int pageId);
public static float linkProbability(String anchor);
public static float commonness(String anchor, int pageId);
Most of them are self-explanatory.
getInlinks and getOutlinks respectively give the pages pointing to and pointed to by page_id.
TitleToId turns a page title into its ID.
getAllWids returns the set of all the Wikipedia IDs.
isDisambiguation tells you if pageId is a disambiguation page.
isNormalPage tells you if pageId is a page describing one single concept.
isPerson tries to guess if the page is about a person.
getCategories returns all the categories pageId is part of.
getCategoryTitle turns a category ID into its title.
dereference turns a redirect link into its actual page.
linkProbability, given an anchor text, returns how many times that text appears as a link in the whole Wikipedia.
commonness, given an anchor and a page ID, returns how many times that anchor points to that page ID.
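As an illustration of how these methods can feed a relatedness function, here is a sketch of the link-based measure proposed by Milne and Witten (2008), computed from two in-link arrays and the total number of pages. It is one well-known option, not the expected solution. The arrays stand in for what getInlinks might return; the class name, the clamping to [0, 1], and the handling of empty in-link sets are assumptions of this sketch.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical sketch; it does not call the project libraries. */
final class InlinkRelatednessSketch {
    private InlinkRelatednessSketch() {}

    /**
     * Milne-Witten style relatedness:
     * 1 - (log(max(|A|,|B|)) - log(|A n B|)) / (log(W) - log(min(|A|,|B|))),
     * where A and B are the in-link sets and W is the total number of pages.
     * Returns 0 when the sets share no page; the result is clamped to [0, 1].
     */
    static float relatedness(int[] inlinksA, int[] inlinksB, int totalPages) {
        Set<Integer> a = toSet(inlinksA);
        Set<Integer> b = toSet(inlinksB);
        int common = 0;
        for (int id : b) {
            if (a.contains(id)) {
                common++;
            }
        }
        if (common == 0) {
            return 0.0f; // also covers empty in-link sets
        }
        int max = Math.max(a.size(), b.size());
        int min = Math.min(a.size(), b.size());
        if (totalPages < max) {
            throw new IllegalArgumentException("totalPages is smaller than an in-link set");
        }
        double denominator = Math.log(totalPages) - Math.log(min);
        if (denominator == 0.0) {
            return 1.0f; // every page links to both entities
        }
        double distance = (Math.log(max) - Math.log(common)) / denominator;
        return (float) Math.max(0.0, Math.min(1.0, 1.0 - distance));
    }

    /** Copies an ID array into a set, dropping duplicates. */
    private static Set<Integer> toSet(int[] ids) {
        Set<Integer> set = new HashSet<>();
        for (int id : ids) {
            set.add(id);
        }
        return set;
    }

    public static void main(String[] args) {
        int[] inA = {1, 2, 3, 4}; // made-up page IDs
        int[] inB = {3, 4, 5, 6};
        System.out.println(relatedness(inA, inB, 16));
    }
}
```

In a real measure, the two arrays would presumably come from getInlinks(a) and getInlinks(b), and the total from the size of getAllWids(); caching the sets may matter if the function is called many times.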
The submission will happen on Feb 9, 12:00 am. You have to leave in your home directory a Main.java that runs the experiments with your relatedness function and prints the results. We need to understand from the code how your function works. Please remove all unnecessary data and code from your home directory. If needed, please leave a short README explaining how to produce your results.
You will make a pitch (5-minute presentation) on Feb 10, 9:30 @ Aula Seminari Ovest, quickly explaining your idea and results.
In Main.java, simply comment out the call to launchTagMeExperiment.