Strumenti Utente

Strumenti Sito


magistraleinformatica:ir:ir12:start

Information Retrieval - Academic Year 2012/2013

General Information

  • Teacher : Paolo Ferragina
  • Course ID: 289AA
  • CFU: 6 (first semester)
  • Language: English
  • Lectures Schedule: Monday and Tuesday 14-16, both in room L1
  • Question time: by appointment.
  • Official Lecture's Log: The schedule and content of the lectures is available with the official registro.
  • News about this course will be distributed via a Tweeter-channel

Goals

Study, design and analysis of IR systems which are efficient and effective to process, mine, search, cluster and classify documents, coming from textual, html or XML data collections. In particular, we will:

  • describe the main components of a modern search engine: Crawler, Parser, Compressor, Indexer, Query resolver, Results Ranker, Results Classifier/Clusterer;
  • present and use in the Lab some interesting Open-Source Tools for IR applications, such as Lucene and Web graph;
  • introduce some basic algorithmic techniques which are now ubiquitous in any IR application for data classification, compression, clustering, projection, and sketching.

Exam

Project assigned during the course, plus an oral discussion concerning with the project and the course.

Date Text of the exam Results
10 January 2013 text .
12 February 2013 text .
25 June 2013 text .
16 July 2013 text .

Books, notes, ...

  • [MRS] C.D. Manning, P. Raghavan, H. Schutze. Introduction to Information Retrieval. Cambridge University Press, 2008. [ link ]
  • [MG] Chapter 2 “Text compression” of Managing Gigabytes, I.H. Witten and A. Moffat and T.C. Bell, Morgan Kauffman, Second edition, 1999.

Content of the Lectures

Date Argument Refs
24/09/2012 Boolean retrieval model. Matrix document-term. Inverted list: dictionary + postings. How to implement an AND, OR and NOT queries, and their time complexities. The issue of hierarchical memories: I/O-model. Chapter 1 of [MRS]
25/09/2012 Parsing: tokenization, normalization, lemmatization, stemming, thesauri. Statistical properties of texts: Zipf law: classical and generalized, Heaps law, Luhn's consideration. Sect. 2.1, 2.2 and 5.1 of [MRS]. Slides of this day
01/10/2012 NO Lecture
02/10/2012 Index construction: multi-way mergesort, BSBI and SPIMI. Chapter 4 of [MRS]. Slides.
08/10/2012 Distributed and dynamic indexing. Query processing: skip pointers, caching, phrase queries Chapter 4, Sect. 2.3, 2.4 of [MRS]. Slides.
09/10/2012 wild-cards (permuterm, k-gram, dynamic programming). Zone indexes. Sect. 3.2, 3.3, 6.1 of [MRS].
15/10/2012 Posting list compression, codes: gamma, delta, variable bytes, PForDelta. Sect. 5.3 of [MRS]. Read also this paper.
16/10/2012 Notion of Entropy and prefix-free codes. Huffman and Arithmetic coding. LZ77 and gzip. Hashing with chaining, cuckoo hashing. pag. 21-36, 52-56, 74-79 of [MG]. Slides. Paper on cuckoo, just description (no proof).
18/10/2012 Bloom filter and its applications. Prefix-search: trie, front-coding and two-level indexing. Text-based ranking: dice, jaccard, tf-idf. Vector space model. Storage of tf-idf and use for computing document-query similarity. Sect 6.2, 6.3 from [MRS]. Paper on bloom filter. Slides.
22/10/2012 Fast top-k retrieval: high idf, champion lists, many query-terms, fancy hits, clustering. Relevance feedback, Rocchio, pseudo-relevance feedback, query expansion. Chap 7 and 9 from [MRS]. Slides.
23/10/2012 Web search engines: difficulties for their design, the web-graph and its properties, how to compute the SCC I/O-efficiently, the size of the web and the estimation of the relative sizes of SE, the storage of the web-graph. Chap 19.1 and 19.2, 19.5, 20.4 from [MRS]. Slides.
25/10/2012 Crawling and consistent hashing. Link-based ranking: pagerank and HITS and weighted variants. Evaluation measures: precision, recall, F1. Recommendation Systems. Chap 20.1, 20.2, 21 and 8 from [MRS]. Slides.
29/10/2012 Some notes on XML search engines and Web advertising. chap 10 from [MRS]. Slides.
30/10/2012 Latent Semantic Indexing and Random projections. Exact and near-document duplication: shingling, fingerprinting and min-wise permutations. Semantic-annotation tools. chap 18 and 19.6 from [MRS]. Slides.
05/11/2012 Software project description
08/11/2012 Clustering: flat, hierarchical, soft, hard. K-means, optimal bisect, hierarchical - max, min, avg, centroid. Co-clustering. Chap 16 and 17 of [MRS], slides.
12/11/2012 Framework map-reduce, hadoop, with examples. By Claudio Lucchese, slides and some code.
13/11/2012 Beyond hadoop: pig and giraffe. By Fabrizio Silvestri, slides.
19/11/2012 Project discussion
20/11/2012 Lab at home
26/11/2012 Lab at home
27/11/2012 Lab at home
03/12/2012 Lab at home
04/12/2012 Lab at home
10/11/2012 Lab at home
11/11/2012 Lab at home
17/11/2012 Lab at home
18/11/2012 Lab at home
magistraleinformatica/ir/ir12/start.txt · Ultima modifica: 16/07/2013 alle 08:29 (11 anni fa) da Paolo Ferragina