====== Information Retrieval - Academic Year 2010/2011 ====== ===== General Information ===== * ** Teacher **: [[http://www.di.unipi.it/~ferragin/|Paolo Ferragina]] * **Course ID**: 289AA * **CFU:** 6 (first semester) * **Language:** English * **Lectures Schedule:** Monday 14-16 (room C1) and Thursday 9-11 (room I1) * **Question time:** Monday or Thursday at 11-12.30, Room 295 (Ferragina's office), Dept of Computer Science. * **Official Lecture's Log:** The schedule and content of the lectures is available with the official [[http://unimap.unipi.it/registri/dettregistriNEW.php?re=45105::::&ri=4673|registro]]. * News about this course will be distributed via a [[http://twitter.com/FerraginaTeach | Tweeter-channel]] with hashtag ''#ir2010'' ===== Goals ===== Study, design and analysis of IR systems which are efficient and effective to process, mine, search, cluster and classify documents, coming from textual, html or XML data collections. In particular, we will: * describe the main components of a modern search engine: Crawler, Parser, Compressor, Indexer, Query resolver, Results Ranker, Results Classifier/Clusterer; * present and use in the Lab some interesting Open-Source Tools for IR applications, such as Lucene and Web graph; * introduce some basic algorithmic techniques which are now ubiquitous in any IR application for data classification, compression, clustering, projection, and sketching. ===== Exam ===== Project assigned during the course, plus an oral discussion concerning with the project and the course material. ^ Date ^ Text of the exam ^ | 01/02/2011 | {{:magistraleinformatica:ir:ir10:ir110201.doc|text}} | | 21/02/2011 | {{:magistraleinformatica:ir:ir10:ir110221.doc|text}} | | 24/06/2011 | {{:magistraleinformatica:ir:ir10:ir110624.doc|text}} | | 20/07/2011 | {{:magistraleinformatica:ir:ir10:ir110720.doc|text}} | | 01/09/2011 | {{:magistraleinformatica:ir:ir10:ir110901.doc|text}} | ===== Books, notes, ... ===== * C.D. Manning, P. Raghavan, H. Schutze. //Introduction to Information Retrieval//. Cambridge University Press, 2008. [ [[http://nlp.stanford.edu/IR-book/information-retrieval-book.html| link]] ] * Chapter 2 on "Text compression" of //Managing Gigabytes//, I.H. Witten and A. Moffat and T.C. Bell, Morgan Kauffman, Second edition, 1999. ===== Content of the Lectures ===== ^ Date ^ Argument ^ Refs ^ Speaker ^ | 20/10/2010 | Introduction: large data collections and new algorithmic challanges? | {{:magistraleinformatica:ir:ir10:01._introduction.ppt|slides}} | | | 25/10/2010 | Boolean retrieval: Inverted Lists = Dictionary + Postings. Query resolution via list-intersection. | Chap 1 in [MRS], and {{:magistraleinformatica:ir:ir10:02._boolean_retrieval.ppt|slides}} | | | 28/10/2010 | Index construction: BSBI, SPIMI, distributed (Map-Reduce), dynamic. | Chap 4 in [MRS] and {{:magistraleinformatica:ir:ir10:notes.pdf|notes (drop "Snow Plow")}}, {{:magistraleinformatica:ir:ir10:03._index_construction.ppt|slides}} | | | 04/11/2010 | The term vocabulary: parsing, tokenization, lemmatization, stemming, Thesauri, soundex, etc.. Statistical models for texts: Zipf, Luhn. | Chap 2.1, 2.2 and 5.1 in [MRS], {{:magistraleinformatica:ir:ir10:04._dictionary.ppt|slides}} | | | 08/11/2010 | Faster boolean query processing: skips and phrase queries. String search: Hash Tables, Cuckoo hashing. | Chap 2.3, 2.4, and slides | | | 11/11/2010 | Min-Ordered Perfect Hash. Prefix search: Tries, 2-level indexing, Front-coding. | {{:magistraleinformatica:ir:ir10:mg-minordhash.pdf|notes}} and {{:magistraleinformatica:ir:ir10:06._dictionary_indexing.ppt|slides}}| | 15/11/2010 | Tolerant retrieval. | Chap 3 and 5.2 in [MRS], and {{:magistraleinformatica:ir:ir10:07._tolerant_retrieval.ppt|slides }} | | | 18/11/2010 | The Bloom Filter. | {{:magistraleinformatica:ir:ir10:08._bloomfilter.ppt|slides}} are enough | | | 22/11/2010 | IL compressed storage | chaps 5 in [MRS] and {{:magistraleinformatica:ir:ir10:09._ilcompression.ppt|slides}} | | | 02/12/2010 | Document storage via Huffman, its canonical variant and Huffword. LZ-compression: LZ77, LZ78, gzip. | all data compression's material can be found in {{:magistraleinformatica:ir:ir10:mg-compress.pdf|Chap 2}} in [WMB] and {{:magistraleinformatica:ir:ir10:10._datacompression.ppt|slides}} | | | 06/12/2010 | Scoring, term weighting and the vector space model. Relevance feedback and pseudo-relevance. Query expansion. top-k retrieval | chap 6, 7 and 9 in [MRS] | | | 09/12/2010 | Zone indexes. Recommendation Systems (sketch). Quality of the results: Precision, Recall, F-measure. | chap 8 in [MRS] and {{:magistraleinformatica:ir:ir10:11._tfidf.ppt|slides}} | | | 13/12/2010 | Lucene in action. | {{:magistraleinformatica:ir:ir10:lucene_in_action3.ppt|slides}} and [[ http://lucene.apache.org/java/docs/ | web site]] | Ugo Scaiella | | 16/12/2010 | Self-evaluation at home on Lucene: a small project. | | | | 20/12/2010 | Self-evaluation at home on Lucene: a small project. | | | | 10/01/2011 | Web search-engines: Web features and structure, advertising (sketch), crawling, consistent hash. | 19.1-19.4 in [MRS], also this {{:magistraleinformatica:ir:ir10:capitolo5.pdf|note}}, and chap 20 | | | 13/01/2011 | Web search-engines: connectivity server, doc-duplicate detection (shingling and min-hash) and link-based ranking. | 19.6 and 21 in [MRS], {{:magistraleinformatica:ir:ir10:12._websearch-esteso.ppt|slides}} | | | 17/01/2011 | Text Classification via Supervised Learning: definition and applications. | | [[http://nmis.isti.cnr.it/sebastiani/ | Fabrizio Sebastiani]] | | 20/01/2011 | Automated text categorization: the machine learning approach, the indexing step. | | [[http://nmis.isti.cnr.it/sebastiani/ | Fabrizio Sebastiani]] | | 24/01/2011 | Automated text categorization: Methods for the inductive construction of a classifier | {{:magistraleinformatica:ir:ir10:slidescorsoferraginashort.pdf|slides}} | [[http://nmis.isti.cnr.it/sebastiani/ | Fabrizio Sebastiani]] |