Strumenti Utente

Strumenti Sito


magistraleinformatica:ir:ir18:start

Information Retrieval - Academic Year 2018/2019

General Information

  • Teacher : Paolo Ferragina
  • Course ID: 289AA
  • CFU: 6 (first semester)
  • Language: English
  • Question time: After lectures it will be announced via Twitter
  • Official Lecture's Log: Here it is the registro.
  • News about this course will be distributed via a Tweeter-channel

Goals

Study, design and analysis of IR systems which are efficient and effective to process, mine, search, cluster and classify documents, coming from textual as well as any unstructured domain. In the lectures, we will:

  • study and analyze the main components of a modern search engine: Crawler, Parser, Compressor, Indexer, Query resolver, Query and Document annotator, Results Ranker;
  • dig into some basic algorithmic techniques which are now ubiquitous in any IR application for data compression, indexing and sketching;
  • describe few other IR tools which are used either as a component of a search engine or as independent tools and build up the previous algorithmic techniques, such as: Classification, Clustering, Recommendation, Random Sampling, Locality Sensitive Hashing.

Week Schedule of the Lectures

Week Schedule
Day Time Slot Room
Monday 11:00 - 13:00 L
Tuesday 9:00 - 11:00 L1

Exam

The exam will consist of a written test plus an oral discussion on the exercises.

Date Room Text Notes
29 oct 2018, 16:00-18:00 C1, L1 First Midterm exam: text, results, solution Students that got a rank >= 16 can participate to the second midterm exam.
Friday 9th November, hr 16:00, room L1, correction of the written exam.
19 dec 2018, 16:00-18:00 A1, L1 Second Midterm exam: text, results, solution Correction and registration will occur in Sala Seminari Ovest, Dip. Informatica, Wednesday 9th January, hr 17:00. Score “30 e lode” is assigned only to the students who got in both exams the score 30. The score is lost if the student participates to one of the next exams (just sitting is enough !). The score can be registered in any of the following exam dates (even in the summer), but PLEASE do not write your name in the ESAMI platform if you want to register your exam score, just show yourself in one of those dates.
18 jan 2018, 09:00-11:00 A1 text, results, solution Correction and registration will occur in Room C, Polo Fibonacci, Thursday 24th January, hr 9:00. The score is lost if the students participate to one of the next exams (just sitting is enough !). The score can be registered also in any one of the following exam dates (even in the summer), but PLEASE do not write your name in the ESAMI platform if you want to register your exam score, just show yourself in one of those dates.
15 feb 2018, 09:00-11:00 A1 text, results, solution Correction and registration will occur in Sala Seminari EST, Dip. Informatica, Monday 18th February, hr 16:00. The score can be registered in any of the following exam dates (even in the summer). The score is lost if the student participates to one of the next exams (just sitting is enough !).
4 apr 2019, 11:00-13:00 text Results: Sabiu (30 L), Cardia (30). Correction and registration will occur in Ferragina's office: Monday 15th april at 14:30 (sharp), otherwise in any other following exam's date.
18 jun 2019, 09:00-13:00 L1 text Results. If a student wishes to take also the Algorithm Engineering exam, s/he has to write to Prof. Ferragina for scheduling it in the same morning and avoid overlapping.
16 jul 2019, 09:00-13:00 L1 text, solution If a student wishes to take also the Algorithm Engineering exam, s/he has to write to Prof. Ferragina for scheduling it in the same morning and avoid overlapping.
11 sep 2019, 09:00-13:00 L1 text Results. If a student wishes to take also the Algorithm Engineering exam, s/he has to write to Prof. Ferragina for scheduling it in the same morning and avoid overlapping.

Books

  • [MRS] C.D. Manning, P. Raghavan, H. Schutze. Introduction to Information Retrieval. Cambridge University Press, 2008. [ link ]
  • Some copies of papers (linked below).
  • Some notes by Prof. Paolo Ferragina (linked below).

Content of the Lectures

Date Argument Refs
17.09.2018 Introduction to the course: modern IR, not just search engines! Boolean retrieval model. Matrix document-term. Inverted list: dictionary + postings. How to implement an AND, OR and NOT queries, and their time complexities. Slides.
Chapter 1 of [MRS]
18.09.2018 Web search engine: its structure, difficulties in their design and their epochs. The Web graph: some useful structural properties (such as Bow Tie). Crawling: problems and algorithmic structure. An example: Mercator. Slides.
Sections 19.1, 19.2, 19.4, 20.1, 20.2 of [MRS].
24.09.2018 Few useful algorithmic techniques for crawling the Web (and not only that!): Bloom Filter and Spectral Bloom Filter. Slides.
For doubts on Bloom Filter see paper.
25.09.2018 Other useful algorithmic techniques for crawling the Web (and not only that!): Consistent Hashing. Web graph properties and its compressed storage. Slides of the previous lecture.
Sect 19.1 and 19.2, 20.3 and 20.4 of [MRS].
01.10.2018 Locality-sensitive hashing: basics, hamming distance, proof of the probability bounds. Slides
02.10.2018 Exact-duplicate documents: Karp-Rabin's rolling hash (with properties and error probability). Near-duplicate documents: Shingling, Jaccard similarity, min-hashing (with prob property), LSH on integer vectors, cosine-similarity among vectors of real-features. slides.
Sect 19.6 of [MRS]
08/10/2018 The issue of hierarchical memories: I/O-model. Index construction: multi-way mergesort, BSBI and SPIMI. Sketch on MapReduce. Distributed indexing: Term-based vs Doc-based partitioning. Dynamic indexing: two indexes, a cascade of indexes. slides.
Chapter 4 of [MRS].
09/10/2018 Compressed storage of documents: LZ-based compression. Storage and Transmission of single/group of file(s): Delta compression (Zdelta), File Synchronization (rsync, zsync), and Set Reconciliation. Slides.
Reading a paper.
15/10/2018 Parsing: tokenization, normalization, lemmatization, stemming, thesauri. Statistical properties of texts: Zipf law: classical and generalized, Heaps law, Luhn's consideration. Slides.
Sect. 2.1, 2.2 and 5.1 of [MRS].
16/10/2018 Keyword extraction: statistical, chi-square test, Rake tool. Exact search: hashing. Prefix search: compacted trie, front coding, 2-level indexing.
22/10/2018 Exercises.
23/10/2018 Exercises.
Midterm exam
05/11/2018 Edit distance via brute-force approach, or Dynamic Programming (possibly weighted). Overlap measure with k-gram index. Edit distance with k-gram index. One-error match. Wild-card queries (permuterm, k-gram). Phonetic match. Context-sensitive match. Slides (new version). Chap 3 of [MRS].
06/11/2018 Query processing: soft-AND, skip pointers, caching, phrase queries. Tiered index. Posting list compression, codes: gamma, variable bytes (t-nibble), PForDelta and Elias-Fano. Slide query, Slide integer compressors.
Sect. 2.3 and 2.4 and 5.3 of [MRS] and Ferragina's notes (only the coders presented in class).
09/11/2018 Correction of the midterm exam
12/11/2018 Rank and Select data structures, two approaches: the case of B untouched and extra o(B) bits, and the case of Elias-Fano's approach with B compressed. Slides.
13/11/2018 Succinct representation of binary trees and its navigational operations: heap like notation and LOUDS. Slides.
19/11/2018 Text-based ranking: dice, jaccard, tf-idf. Vector space model. Storage of tf-idf and use for computing document-query similarity. Fast top-k retrieval: high idf, champion lists, many query-terms, fancy hits, clustering. Slides.
Sect 6.2 and 6.3 and 7 from [MRS].
20/11/2018 Exact Top-K: WAND and blocked-WAND. Relevance feedback, Rocchio, pseudo-relevance feedback, query expansion. Performance measures: precision, recall, F1 and user happiness. Chap 8 and 9 of [MRS]
26/11/2018 Random Walks. Link-based ranking: pagerank, topic-based pagerank, personalized pagerank, CoSim rank. HITS. Slides. Chap 21 of [MRS]
27/11/2018 Exercises
3/12/2018 Projections to smaller spaces: Latent Semantic Indexing (LSI). Random Projections: Johnson-Linderstauss Lemma and its applications. Slides.
Chap 18 from [MRS].
4/12/2018 Semantic-annotation tools: basics, Wikipedia structure, TAGME and other annotators. Various approaches to text representation and their applications. Exercises Slides
10/12/2018 A glimpse on Sport Analytics, and discussion of possible topics for a master thesis. Slides and Thesis ideas.
11/12/2018 Exercises.
magistraleinformatica/ir/ir18/start.txt · Ultima modifica: 16/07/2019 alle 10:19 (5 settimane fa) da Paolo Ferragina