magistraleinformatica:ir:ir16:start

**Teacher**: Paolo Ferragina**Course ID**: 289AA**CFU:**6 (first semester)**Language:**English**Lectures Schedule:**tuesday 09-11 (N1), wednesday 09-11 (L1).**Question time:**after lectures or appointment.**Official Lecture's Log:**Here it is the registro.- News about this course will be distributed via a Tweeter-channel

Study, design and analysis of IR systems which are efficient and effective to process, mine, search, cluster and classify documents, coming from textual as well as any unstructured domain. In the lectures, we will:

- study and analyze the main components of a modern search engine: Crawler, Parser, Compressor, Indexer, Query resolver, Query and Document annotator, Results Ranker;

- dig into some basic algorithmic techniques which are now ubiquitous in any IR application for data compression, indexing and sketching;

- describe few other IR tools which are used either as a component of a search engine or as independent tools and build up the previous algorithmic techniques, such as: Classification, Clustering, Recommendation, Random Sampling, Locality Sensitive Hashing.

The exam will consist of two parts: (1) a lab test on the libraries and tools learned in class; (2) a written test plus an oral discussion on the exercises.

Date | Room | Text |
---|---|---|

10/01/2017 | L1 (9:30) | Lab test, data |

12/01/2017 | L1 (9:00) | text |

01/02/2017 | L1 (9:30) | Lab test |

12/06/2017 | L1 (15:00) | text + written test on Lucene |

29/06/2017 | A (9:00) | text + written test on Lucene |

27/07/2017 | L1 (9:00) | text + written test on Lucene |

05/09/2017 | N1 (15:00) | text |

**[MRS]**C.D. Manning, P. Raghavan, H. Schutze.*Introduction to Information Retrieval*. Cambridge University Press, 2008. [ link ]- Some copies of papers pointed out below
- Some notes by Prof. Paolo Ferragina.

Date | Argument | Refs | |
---|---|---|---|

19/09/16 | Introduction to the course: modern IR, not just search engines! Boolean retrieval model. Matrix document-term. Inverted list: dictionary + postings. How to implement an AND, OR and NOT queries, and their time complexities. The structure of a search engine. | Slides. Chapter 1 of [MRS] | |

20/09/16 | Web search engine: difficulties in their design and their ephocs. The Web graph: some useful structural properties (such as Bow Tie). Crawling: problems and algorithmic structure. An example: Mercator. | Slides. Sections 19.1, 19.2, 19.4, 20.1, 20.2 of [MRS]. | |

27/09/16 | Few useful algorithmic techniques for crawling the Web (and not only that!): Bloom Filter and Consistent Hashing. Web graph properties and its compressed storage. | Slides. Sect 19.1 and 19.2, 20.3 and 20.4 of [MRS]. For doubts on Bloom Filter see paper. | |

28/09/16 | Compressed storage of documents: LZ-based compression. Storage and Transmission of single/group of file(s): Delta compression (Zdelta), File Synchronization (rsync, zsync), and Set Reconciliation. | Slides | |

04/10/16 | Parsing: tokenization, normalization, lemmatization, stemming, thesauri. Statistical properties of texts: Zipf law: classical and generalized, Heaps law, Luhn's consideration. | Slides. Sect. 2.1, 2.2 and 5.1 of [MRS]. | |

05/10/16 | The issue of hierarchical memories: I/O-model. Index construction: multi-way mergesort, BSBI and SPIMI. Sketch on MapReduce. Distributed indexing: Term-based vs Doc-based partitioning. Dynamic indexing: two indexes, a cascade of indexes. | Slides. Chapter 4 of [MRS]. | |

11/10/16 | Exact-duplicate documents: Karp-Rabin's rolling hash (with properties and error probability). Introduction to Locality-Sensitive Hashing (LSH): similarity problem among users based on binary/real features, cosine-similarity among vectors of real-features, near-document similarity via shingles + min-wise hashing + Jaccard similarity. | Slides. Sect 19.6 of [MRS] | |

12/10/16 | Locality-sensitive hashing: basics, hamming distance, Jaccard similarity, cosine-similarity, sketch of the main theorem. | ||

18/10/16 | Posting list compression, codes: gamma, delta, variable bytes, PForDelta and Elias-Fano. | Slides. Sez 5.3 of [MRS] and Ferragina's notes (only the coders presented in class). | |

19/10/16 | Rank and Select data structures, two approaches: the case of B untouched and extra o(B) bits, and the case of Elias-Fano's approach with B compressed. Succinct representation of binary trees and its navigational operations (heap like notation). | Slides. | |

25/10/16 | Exact search: hashing with chaining, univeral hashing, cuckoo hashing. Prefix search: compacted trie, front coding, 2-level indexing. Edit distance via brute-force approach, or Dynamic Programming (possibly weighted). | Slides. | |

26/10/16 | Overlap measure with k-gram index. Edit distance with k-gram index. One-error match. Wild-card queries (permuterm, k-gram). Phonetic match. Context-sensitive match. | Slides. Chap 3 of [MRS]. | |

08/11/16 | Query processing: skip pointers (with solution based on dynamic programming), caching, phrase queries. Zone index and tiered index. The auto-complete problem and its solutions for the top-1, top-2, .., top-k strings. | Slide. Sect. 2.3 and 2.4 of [MRS] | |

09/11/16 | Text-based ranking: dice, jaccard, tf-idf. Vector space model. Storage of tf-idf and use for computing document-query similarity. Fast top-k retrieval: high idf, champion lists, many query-terms, fancy hits, clustering. | Slides. Sect 6.2 and 6.3 and 7 from [MRS]. | |

15/11/16 | Exact Top-K: WAND and blocked-WAND. Relevance feedback, Rocchio, pseudo-relevance feedback, query expansion. Performance measures: precision, recall, F1 and user happiness. | Slides. Chap 8 and 9 of [MRS] | |

16/11/16 | Random Walks. Link-based ranking: pagerank, topic-based pagerank, personalized pagerank, CoSim rank. | Slides | |

22/11/16 | HITS. Recommendation systems and Web advertising. | Slides. | |

23/11/16 | Projections to smaller spaces: Latent Semantic Indexing (LSI). Random Projections: Johnson-Linderstauss Lemma and its applications. | Slides. Chap 18 from [MRS]. | |

29/11/16 | Semantic-annotation tools: basics, Wikipedia structure, TAGME and other annotators. How to evaluate those systems. | Slides. | |

30/11/16 | Various approaches to text representation and their applications. | ||

06/12/16 | Hands-on Lab. You need to configure your laptop as follows: Linux system (may be a virtual machine) with debian-like OS (e.g. `Ubuntu 16.10` ), working Internet connection from the Polo's room, at least 5GB of free disk and 2GB RAM. You need to install the following python packages: `scrapy` , `pylucene` (version 3.5, newer versions are not supported) and `lxml` . They can be installed with: `sudo apt-get update` `sudo apt-get install python-scrapy python-lucene python-lxml` We have created the lib_test.py script to check if your system has everything in place for the lab. You may run it ( `python lib_test.py` ) on your system: it must reach the end of the script w/o errors. In collaboration with Marco Cornolti (cornolti@di.unipi.it). | Slides (crawling) | |

07/12/16 | Hands-on Lab. Introduction to Lucene. In collaboration with Marco Cornolti (cornolti@di.unipi.it). | Slides (Lucene) | |

13/12/16 | Hands-on Lab. Advanced topics and use of Lucene. In collaboration with Marco Cornolti (cornolti@di.unipi.it). | Slides (Lucene's fields) | |

14/12/16 | Exercises |

magistraleinformatica/ir/ir16/start.txt · Ultima modifica: 05/09/2017 alle 13:10 (23 mesi fa) da Paolo Ferragina