This is an old version of the document!


Information Retrieval - Academic Year 2010/2011

General Information

  • Teacher: Paolo Ferragina
  • Course ID: 346AA
  • CFU: 6 (first semester)
  • Language: English
  • Lecture schedule: Monday 14-16 (room C1) and Thursday 9-11 (room I1)
  • Question time: Monday or Thursday, 11-12.30, Room 295 (Ferragina's office), Dept. of Computer Science.
  • Official lecture log: the schedule and content of the lectures are available in the official register (registro).
  • News about this course will be distributed via a Twitter channel with the hashtag #ir2010

Goals

Study, design and analysis of IR systems that efficiently and effectively process, mine, search, cluster and classify documents drawn from textual, HTML or XML data collections. In particular, we will:

  • describe the main components of a modern search engine: Crawler, Parser, Compressor, Indexer, Query resolver, Results Ranker, Results Classifier/Clusterer;
  • present and use in the Lab some open-source tools for IR applications, such as Lucene and the Web-Graph library;
  • introduce some basic algorithmic techniques that are now ubiquitous in IR applications for data classification, compression, clustering, projection, and sketching.

Exam

A project assigned during the course, plus an oral discussion covering the project and the course material.

Exam dates: February 1 and 16, time slot 9-11, Rooms A and C at Polo Fibonacci.

Books, notes, ...

  • [MRS] C.D. Manning, P. Raghavan, H. Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008.
  • [WMB] I.H. Witten, A. Moffat, T.C. Bell. Managing Gigabytes. Morgan Kaufmann, second edition, 1999. Chapter 2, “Text compression”.

Content of the Lectures

| Date | Argument | Refs | Speaker |
| 20/10/2010 | Introduction: large data collections and new algorithmic challenges. | slides | |
| 25/10/2010 | Boolean retrieval: Inverted Lists = Dictionary + Postings. Query resolution via list-intersection. | Chap 1 in [MRS], and slides | |
| 28/10/2010 | Index construction: BSBI, SPIMI, distributed (Map-Reduce), dynamic. | Chap 4 in [MRS] and notes (drop "Snow Plow"), slides | |
| 04/11/2010 | The term vocabulary: parsing, tokenization, lemmatization, stemming, thesauri, soundex, etc. Statistical models for texts: Zipf, Luhn. | Chap 2.1, 2.2 and 5.1 in [MRS], slides | |
| 08/11/2010 | Faster boolean query processing: skips and phrase queries. String search: Hash Tables, Cuckoo hashing. | Chap 2.3, 2.4, and slides | |
| 11/11/2010 | Min-Ordered Perfect Hash. Prefix search: Tries, 2-level indexing, Front-coding. | notes and slides | |
| 15/11/2010 | Tolerant retrieval. | Chap 3 and 5.2 in [MRS], and slides | |
| 18/11/2010 | The Bloom Filter. | slides are enough | |
| 22/11/2010 | Compressed storage of inverted lists (IL). | Chap 5 in [MRS] and slides | |
| 02/12/2010 | Document storage via Huffman, its canonical variant and Huffword. LZ-compression: LZ77, LZ78, gzip. | All data compression material can be found in Chap 2 in [WMB], and slides | |
| 06/12/2010 | Scoring, term weighting and the vector space model. Relevance feedback and pseudo-relevance. Query expansion. Top-k retrieval. | Chap 6, 7 and 9 in [MRS] | |
| 09/12/2010 | Zone indexes. Recommendation Systems (sketch). Quality of the results: Precision, Recall, F-measure. | Chap 8 in [MRS] and slides | |
| 13/12/2010 | Lucene in action. | slides and web site | Ugo Scaiella |
| 16/12/2010 | Self-evaluation at home on Lucene: a small project. | | |
| 20/12/2010 | Self-evaluation at home on Lucene: a small project. | | |
| 10/01/2011 | The Web Graph: Properties, storage (compression). | 19.1-19.2 and 20.4 in [MRS] | |
| | Lab: the Web-Graph library. | web site | Ugo Scaiella |
| 13/01/2011 | Web search-engines: structure, crawling, link-based ranking. | Chap 20 and 21 in [MRS] | |
| 6 | Automated text categorization. | paper | Fabrizio Sebastiani |
| 2 | Latent Semantic Indexing and Random Projections. | Chap 18 in [MRS] | |
| 4 | Clustering: sketch of k-means, MST-based, max-cut, spectral. | Chap 16 and 17 in [MRS] | |
Last modified: 16/12/2010 at 17:55 by Paolo Ferragina