Strumenti Utente

Strumenti Sito


magistraleinformatica:ir:ir19:start

Questa è una vecchia versione del documento!


Information Retrieval - Academic Year 2019/2020

General Information

  • Teacher : Paolo Ferragina
  • Course ID: 289AA
  • CFU: 6 (first semester)
  • Language: English
  • Question time: Monday 15-17, or appointment
  • Official Lecture's Log: Here it is the registro.
  • News about this course will be distributed via a Telegram channel


Goals

Study, design and analysis of IR systems which are efficient and effective to process, mine, search, cluster and classify documents, coming from textual as well as any unstructured domain. In the lectures, we will:

  • study and analyze the main components of a modern search engine: Crawler, Parser, Compressor, Indexer, Query resolver, Query and Document annotator, Results Ranker;
  • dig into some basic algorithmic techniques which are now ubiquitous in any IR application for data compression, indexing and sketching;
  • describe few other IR tools which are used either as a component of a search engine or as independent tools and build up the previous algorithmic techniques, such as: Classification, Clustering, Recommendation, Random Sampling, Locality Sensitive Hashing.


Schedule of the Lectures

Week Schedule
Day Time Slot Room
Monday 11:00 - 13:00 L1
Tuesday 9:00 - 11:00 L1


Exams

The exam will consist of a written test plus an oral discussion on the exercises.

Date Room Text Notes
26/11/19, 9:00-11:00 First Midterm exam text, results, solution Students that got a rank >= 16 can participate to the second midterm exam. To the average of the two midterms will be added +2.
16/12/19, 11:00-13:00 room on exam's site
(Second MidTerm)
text, solution, results. Score “30 e lode” has been assigned only to the students who got rank 32 (I used ceiling and not floor of the average, and summed 2!). The score is lost if the student participates to one of the next exams (just sitting is enough !). The score can be registered in any of the following exam dates (in the summer, too), but PLEASE do not write your name in the ESAMI platform if you want to register your exam score, just show yourself in one of those dates.
13/01/20, 14:00-17:00 room on exam's site text, solution, results. The score is lost if the student participates to one of the next exams (just sitting is enough !). The score can be registered in any of the following exam dates (even in the summer), but PLEASE do not write your name in the ESAMI platform if you want to register your exam score, just show yourself in one of those dates.
10/02/20, 14:00-17:00 room on exam's site text, solution, results. The score is lost if the student participates to one of the next exams (just sitting is enough !). The score can be registered in any of the following exam dates (even in the summer), but PLEASE do not write your name in the ESAMI platform if you want to register your exam score, just show yourself in one of those dates.


Materials for study

  • [MRS] C.D. Manning, P. Raghavan, H. Schutze. Introduction to Information Retrieval. Cambridge University Press, 2008. [ link ]
  • Some copies of papers or notes (linked below).


Lectures

Date Argument Refs
16.09.2019 Introduction to the course: modern IR, not just search engines! Boolean retrieval model. Matrix document-term. Inverted list: dictionary + postings. How to implement an AND, OR and NOT queries, and their time complexities. Slides.
Chapter 1 of [MRS]
17.09.2018 Web search engine: its structure, difficulties in their design and their epochs. The Web graph: some useful structural properties (such as Bow Tie). Crawling: problems and algorithmic structure. An example: Mercator. Slides.
Sections 19.1, 19.2, 19.4, 20.1, 20.2 of [MRS].
23.09.2019 Few useful algorithmic techniques for crawling the Web (and not only that!): Bloom Filter and Spectral Bloom Filter. Some other uses of Bloom Filter. Slides. For doubts on Bloom Filter see paper.
24.09.2019 Other useful algorithmic techniques for crawling the Web (and not only that!): Consistent Hashing. Slides of the previous lecture.
Sect 19.1 and 19.2 of [MRS].
30.09.2019 Locality-sensitive hashing: basics, hamming distance, proof of the probability bounds. Slides
01.10.2019 Exact-duplicate documents: Karp-Rabin's rolling hash (with properties and error probability). Near-duplicate documents: Shingling, Jaccard similarity, min-hashing (with prob property), LSH on integer vectors, cosine-similarity among vectors of real-features. Slides.
Sect 19.6 of [MRS]
07.10.2019 Web graph properties and its compressed storage. The issue of hierarchical memories: I/O-model. Index construction: multi-way mergesort, BSBI and SPIMI. Slides on WebGraph are at the end of slides of 23.09.
Slides this lecture.
Sect 20.3 and 20.4 of [MRS]. Chapter 4 of [MRS]
08.10.2019 Sketch on MapReduce. Distributed indexing: Term-based vs Doc-based partitioning. Dynamic indexing: two indexes, a cascade of indexes. Compressed storage of documents: LZ-based compression. Slides.
Chapter 4 of [MRS].
14.10.2019 Storage and Transmission of single/group of file(s): Delta compression (Zdelta), File Synchronization (rsync, zsync), and File Reconciliation. Reading a paper.
15.10.2019 Parsing: tokenization, normalization, lemmatization, stemming, thesauri. Statistical properties of texts: Zipf law: classical and generalized, Heaps law, Luhn's consideration. Keyword extraction: statistical, chi-square test, Rake tool. Slides.
Sect. 2.1, 2.2 and 5.1 of [MRS].
21.10.2019 Exact search: hashing. Prefix search: compacted trie, front coding, 2-level indexing. Slide.
22.10.2019 Edit distance via brute-force approach, or Dynamic Programming (possibly weighted). Overlap measure with k-gram index. Edit distance with k-gram index. One-error match. Wild-card queries (permuterm, k-gram). Phonetic match. Context-sensitive match. Chap 3 of [MRS].
28.10.2019 Exercises
29.10.2019 Seminar on Sports Analytics by Luca Pappalardo, slides
04.11.2019 Exercises
05.11.2019 Query processing: soft-AND, skip pointers, caching, phrase queries. Tiered index. Posting list compression, codes: gamma, variable bytes (t-nibble), PForDelta and Elias-Fano. Sect. 2.3 and 2.4 and 5.3 of [MRS] and Ferragina's notes (only the coders presented in class). Slides and Slides.
18.11.2019 Cancellata per allerta meteo
19.11.2019 Text-based ranking: dice, jaccard, tf-idf. Vector space model. Storage of tf-idf and use for computing document-query similarity. Fast top-k retrieval: high idf, champion lists, many query-terms, fancy hits, clustering. Sect 6.2 and 6.3 and 7 from [MRS]. Slides.
25.11.2019 Exact Top-K: WAND and blocked-WAND. Relevance feedback, Rocchio, pseudo-relevance feedback, query expansion. Performance measures: precision, recall, F1 and user happiness. Chap 8 and 9 of [MRS]
26.11.2019 First MidTerm exam
02.12.2019 Random Walks. Link-based ranking: pagerank, topic-based pagerank, personalized pagerank, HITS. Chap 21 of [MRS]. Slides
03.12.2019 Projections to smaller spaces: Latent Semantic Indexing (LSI). Random Projections: Johnson-Linderstauss Lemma and its applications. Chap 18 from [MRS]. Slides.
05.12.2019 Semantic-annotation tools: basics, Wikipedia structure, TAGME and other annotators. Various approaches to text representation and their applications. An example of a sophisticate system: Wiser. Slides tagme and wiser.
09.12.2019 Canceled
10.12.2019 Canceled
12.12.2019 Exercises on the second part of the course
13.12.2019 Exercises on the second part of the course
16.12.2019 Second MidTerm exam
17.12.2019 Canceled



magistraleinformatica/ir/ir19/start.1576570840.txt.gz · Ultima modifica: 17/12/2019 alle 08:20 (4 anni fa) da Paolo Ferragina