====== Information Retrieval project 2013 ======

This page contains some useful information about the Information Retrieval project.

In the project, you will have to develop a Relatedness function, i.e. a function that, given two entities, returns their semantic relatedness. Entities are expressed as Wikipedia pages. For example:

R(Tree, Tree) = 1.0\\
R(Tree, Leaf) = 0.9\\
R(Tree, Darth_Vader) = 0.0

Note that [[http://tagme.di.unipi.it|TagMe]] is built upon such a function.

To assist your development, you are provided with:
  * Access to ferrax-2.itc.unipi.it
  * A helper class that provides a few methods you may find useful
  * An experiment class that launches two experiments for you
  * A [[http://ferrax-2.itc.unipi.it|site]] that keeps track of the development of each group.

===== Walkthrough =====

We suggest you perform the following steps to get into the environment.

==== First access ====

  - Ask x@y (where x is 'cornolti' and y is 'di.unipi.it') for a username and a password to access ferrax-2.
  - Log into ferrax-2 with:
    ssh username@ferrax-2.itc.unipi.it
  - Have a look at the jar libraries that you may need to use:
    ls /home/irproject/*lib*
  - Note the presence of ir2013project.jar. This jar provides you with two classes:
    irproject.IRProjectHelper
    irproject.IRProjectExperiments
''IRProjectHelper'' gives you access to the Wikipedia graph, while ''IRProjectExperiments'' runs two experiments to test your function.
  - Check the available RAM and the hard disk usage. Keep an eye on these values: if they run too low, problems may occur.
    free -h
    df -h

==== My first relatedness function ====

As our first relatedness function, we will make a dummy one that always returns 0.

  - Open ''DummyRel.java'' with your favorite text editor.
  - Copy and paste the following code:
    import it.acubelab.tagme.RelatednessMeasure;
    // Dummy measure: every pair of entities gets relatedness 0.
    public class DummyRel extends RelatednessMeasure {
        public DummyRel(String lang) {
            super(lang);
        }
        @Override
        public float rel(int a, int b) {
            return 0.0f;
        }
    }
Then save and exit.
  - Make a directory called ''bin/'' to store your compiled files, set an environment variable for the path of the libraries, and compile:
    mkdir bin
    export IRLIB=../irproject/lib/*:../irproject/tagme_libs/*:../irproject/tagme_preproc_lib/*
    javac -cp $IRLIB -d bin/ *.java
  - Cool. You have made your first relatedness function. Let's test it. Create a ''Main.java'' file and paste the following code:
    import it.acubelab.tagme.config.TagmeConfig;
    import it.acubelab.tagme.RelatednessMeasure;
    import irproject.IRProjectExperiments;
    public class Main {
        public static void main(String[] args) throws Exception {
            TagmeConfig.init("/home/irproject/config.xml");
            String groupName = ""; // Insert your group name here.
            String groupPw = ""; // Insert your group password here.
            RelatednessMeasure rel = new DummyRel("en");
            IRProjectExperiments.launchTagMeExperiment(groupName, groupPw, rel);
            IRProjectExperiments.launchRelatednessExperiment(groupName, groupPw, rel);
        }
    }
  - Before testing your relatedness function, let's have a look at the [[http://ferrax-2.itc.unipi.it|scoreboard page]]. This page shows the achievements of the other groups. It also shows the baseline given by TagMe.
  - We are ready to launch. Enter:
    java -cp $IRLIB:./bin Main
On the first launch, the program will have to query Wikipedia and retrieve some data. Don't worry: this data gets cached, and if you run the program again, the output will be much shorter.
Running the program again will generate the following output:

    Results for the Evaluation of TagMe:
    F1:0.524198 Precision:0.577957 Recall:0.479590
    Time spent for the annotations:20.935064
    Results for the Evaluation of the Relatedness function:
    Quadratic distance:0.369874 Absolute distance:0.408762

Look back at the [[http://ferrax-2.itc.unipi.it|scoreboard page]]. Your group should have appeared.

Let's have a closer look at these numbers. There are two kinds of results:
  * The evaluation of the performance of TagMe using your relatedness function, and
  * The evaluation of the function itself.

You don't really have to care about how the first three numbers are computed (if interested, read [[http://research.google.com/pubs/archive/40749.pdf|here]]), but keep in mind that the higher those numbers are, the better TagMe is performing.

Let's focus on the last two figures. The test of the relatedness function is performed against a dataset of 311 pairs ''(entity, entity)''. For each pair, the dataset provides a relatedness assigned by humans. Your task is to make a relatedness function that computes a relatedness as close as possible to that given by those humans.\\
Let ''P'' be the list of pairs, ''H(p)'' the relatedness assigned by the humans to a pair of entities ''p'', and ''R(p)'' the relatedness computed by your function for ''p''. The measures we employ are defined as:
  * Absolute distance: {{:magistraleinformatica:ir:ir13:abs_distance.gif?200|}}
  * Quadratic distance: {{:magistraleinformatica:ir:ir13:quad_distance.gif?200|}}

In other words, both measures take the distance between your relatedness and the human one for each pair, and average it over all pairs. Quadratic distance penalizes bigger mistakes more than Absolute distance.
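The two measures can also be tried offline on a handful of pairs. The following is a minimal sketch, assuming that Absolute distance is the mean of |H(p) - R(p)| and Quadratic distance is the mean of (H(p) - R(p))^2 over all pairs in ''P''; the formula images remain the authoritative definition. The class ''RelatednessDistances'', its methods, and the toy scores are hypothetical and are not part of ''ir2013project.jar''.

```java
import java.util.Locale;

// Hypothetical helper, not part of ir2013project.jar.
public class RelatednessDistances {

    // Assumed definition: mean of |H(p) - R(p)| over all pairs.
    public static double absoluteDistance(double[] human, double[] predicted) {
        checkInput(human, predicted);
        double sum = 0.0;
        for (int i = 0; i < human.length; i++) {
            sum += Math.abs(human[i] - predicted[i]);
        }
        return sum / human.length;
    }

    // Assumed definition: mean of (H(p) - R(p))^2 over all pairs.
    public static double quadraticDistance(double[] human, double[] predicted) {
        checkInput(human, predicted);
        double sum = 0.0;
        for (int i = 0; i < human.length; i++) {
            double diff = human[i] - predicted[i];
            sum += diff * diff;
        }
        return sum / human.length;
    }

    // Both arrays must be non-empty and describe the same list of pairs.
    private static void checkInput(double[] human, double[] predicted) {
        if (human.length == 0 || human.length != predicted.length) {
            throw new IllegalArgumentException("Need two non-empty arrays of equal length");
        }
    }

    public static void main(String[] args) {
        // Toy scores for four pairs (made up for illustration).
        double[] human = {1.0, 0.5, 0.0, 0.5};
        double[] predicted = {0.5, 0.5, 0.5, 0.0};
        System.out.println(String.format(Locale.ROOT,
                "Absolute distance:%.4f", absoluteDistance(human, predicted)));
        System.out.println(String.format(Locale.ROOT,
                "Quadratic distance:%.4f", quadraticDistance(human, predicted)));
    }
}
```

With these toy values the program prints an absolute distance of 0.3750 and a quadratic distance of 0.1875. Under this assumed definition, and with scores in [0, 1], the quadratic distance can never exceed the absolute one, which matches the output of the dummy function above.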
==== Having a closer look at the results ====

Add to the end of your ''main'' method a line like:

    IRProjectExperiments.dumpRelatednessExperiment(groupName, rel);

This will dump, for each pair of entities, the expected relatedness (the one given by humans) and the one returned by your function.

===== Developing your own function =====

Before starting to implement your function, we suggest having a glance at a few articles:
  * 

==== Using ''IRProjectHelper'' ====

You can use ''irproject.IRProjectHelper'' to access some pre-computed data that you may find useful to develop your function. Note that we do not suggest limiting your scope to these methods: if you need more methods, ask Marco. You may need to implement them!

To use ''IRProjectHelper'', please refer to the [[http://ferrax-2.itc.unipi.it/static/javadoc/index.html|javadoc]].

===== Submitting your project =====

The submission will happen on Feb 9, 12:00 am. You have to leave in your home directory a ''Main.java'' that runs the experiments with your relatedness function and prints the results. We need to understand from the code how your function works. Please remove all unnecessary data and code from your home directory. If needed, please leave a short ''README'' explaining how to produce your results.

You will make a pitch (5min presentation) on Feb 11, 9:30 @ Aula Seminari Ovest, quickly explaining your idea and results.

===== Final remarks =====

  * Before starting to implement, do some brainstorming and think about smart solutions.
  * We suggest subscribing to this page to be updated with the latest news.
  * While implementing your function, you may want to focus on the test of the Relatedness function rather than the whole TagMe (which is way slower!). In the example ''Main.java'', simply comment out the call to ''launchTagMeExperiment''.
  * You may ask for feedback at any time by e-mail.
  * We encourage the development of good ideas rather than good results (but we like good results!)
  * Numbers are big: do not over-engineer, but be careful with complexity.
  * Tools like [[http://linux.die.net/man/1/scp|scp]] and [[http://linux.die.net/man/1/sshfs|sshfs]] may make your life easier.
  * There could be bugs: contact Marco in case something is not working as you expect.
  * You are responsible for what happens with your account: keep it secret, keep it safe, and don't misuse it.