====== Information Retrieval - Academic Year 2021/2022 ======

====== General Information ======

  * **Teacher**: [[http://www.di.unipi.it/~ferragin/|Paolo Ferragina]]
  * **Course ID**: 289AA
  * **CFU:** 6 (first semester)
  * **Language:** English
  * **Question time:** Monday 15-17, or by appointment (given the COVID-19 situation, this will take place via video-conference in the virtual room of the course)
  * **Official lecture log:** available in the [[https://unimap.unipi.it/registri/dettregistriNEW.php?re=3323819::::&ri=9142| register]].
  * News about this course will be distributed via a [[ https://t.me/CourseIR2021 | Telegram channel]]

\\

====== Goals ======

Study, design and analysis of IR systems that efficiently and effectively process, mine, search, cluster and classify documents coming from textual as well as other unstructured domains.

In the lectures, we will:
  * study and analyze the main components of a modern search engine: Crawler, Parser, Compressor, Indexer, Query resolver, Query and Document annotator, Results Ranker;
  * dig into some basic algorithmic techniques which are now ubiquitous in any IR application for data compression, indexing and sketching;
  * describe a few other IR tools that are used either as components of a search engine or as independent tools, and that build upon the previous algorithmic techniques, such as: Classification, Clustering, Recommendation, Random Sampling, Locality Sensitive Hashing.
\\

====== Schedule of the Lectures ======

^ Week Schedule ^^^
^ Day ^ Time Slot ^ Room ^
| Monday | 11:00 - 13:00 | Room A1 (Polo Fibonacci), and the [[https://teams.microsoft.com/l/team/19%3a8eqPWcXdlXhkpfn0gu2d8I9MCuayTtY_e0Q63ksxeos1%40thread.tacv2/conversations?groupId=796e29ec-8a72-42f1-8e39-663ec2cbacd7&tenantId=c7456b31-a220-47f5-be52-473828670aa1|virtual room]] of the course |
| Tuesday | 9:00 - 11:00 | Room A1 (Polo Fibonacci), and the [[https://teams.microsoft.com/l/team/19%3a8eqPWcXdlXhkpfn0gu2d8I9MCuayTtY_e0Q63ksxeos1%40thread.tacv2/conversations?groupId=796e29ec-8a72-42f1-8e39-663ec2cbacd7&tenantId=c7456b31-a220-47f5-be52-473828670aa1|virtual room]] of the course |

====== Exams ======

The exam consists of a written test plus an oral discussion. As last year, I plan to hold one midterm exam in November and one in December. The two midterms (if graded >= 18) will be combined with an oral exam to be held in January, which can **increase or decrease** the final grade (by a few points, given that the oral is somewhat "averaged" with the grade of the written exam).

^ Date ^ Room ^ Text ^ Notes ^
| 09/11/2021, start at 09:00 | room A1 and virtually | {{ :magistraleinformatica:ir:ir21:ir211109.pdf |text}}, {{ :magistraleinformatica:ir:ir21:informationretrieval-2022-comp1.pdf |results}}, {{ :magistraleinformatica:ir:ir21:ir_2021_11_09_tutto_.pdf |solution}} | The **midterm exam** will consist of a set of exercises and will last 45 minutes. The part of the program covered by the exercises is detailed in the list of lectures below. |
| 14/12/2021, start at 09:00 | room C and virtually | {{ :magistraleinformatica:ir:ir21:ir211214.pdf |text}}, {{ :magistraleinformatica:ir:ir21:results-finalterm-2021.pdf |results}}, {{ :magistraleinformatica:ir:ir21:ir_14.12.2021_soluzione_.pdf |solution}} | The **FinalTerm exam** will have the same structure as the midterm, but only students who scored >= 18 in the first MidTerm can participate. 
Students have to register at the [[https://forms.office.com/r/gFmFVWVWzp|following form]] by December 7th, 2021.\\ The oral exam will take place remotely on December 20th, starting at 9:00, in the Teams room of the course. |
| 17/01/2022, start at 09:00 | room C1 | {{ :magistraleinformatica:ir:ir21:ir220117.pdf |text}}, {{ :magistraleinformatica:ir:ir21:ir-jan2022.pdf |results}}, {{ :magistraleinformatica:ir:ir21:ir220117_soluzione_.pdf |solution}} | |
| 07/02/2022, start at 09:00 | room A1 | {{ :magistraleinformatica:ir:ir21:ir220207.pdf |text}}, {{ :magistraleinformatica:ir:ir21:ir090222_res.pdf |results}}, {{ :magistraleinformatica:ir:ir21:ir220207_soluzione_.pdf |solution}} | |
| 13/06/2022, start at 09:00 | room L1 | {{ :magistraleinformatica:ir:ir21:ir220613.pdf |text}} | |

====== Materials for study ======

  * **[MRS]** C.D. Manning, P. Raghavan, H. Schütze. //Introduction to Information Retrieval//. Cambridge University Press, 2008. [ [[http://nlp.stanford.edu/IR-book/information-retrieval-book.html| link]] ]
  * Some copies of papers or notes (linked below).

\\

====== Lectures ======

For the video-lectures of this year, please look at the following agenda. [[https://web.microsoftstream.com/group/d6ebab24-d618-46c4-93d7-ced9fb302f65|Video-lectures of the last academic year]] are available too (click on “Videos”, and then sort them by name).

\\

^ Date ^ Topic ^ Refs ^
| 13.09.2021 | Introduction to the course: modern IR, not just search engines! Boolean retrieval model. Document-term matrix. Inverted list: dictionary + postings. How to implement AND, OR and NOT queries, and their time complexities. 
| {{ :magistraleinformatica:ir:ir21:lect_01-introduction.ppt |Slides}} and [[https://unipiit.sharepoint.com/sites/a__td_50472/Shared%20Documents/General/Recordings/Lezione%20del%202021_09_13-20210913_112523-Meeting%20Recording.mp4?web=1|Video]]\\ Chapter 1 of [MRS] |
| 14.09.2021 | Skip pointers, zone indexes. Web search engines: their structure, design difficulties and epochs. The Web graph: some useful structural properties (such as the Bow Tie). | {{ :magistraleinformatica:ir:ir21:lect_02-crawling_part_a_.ppt |Slides}} and [[https://unipiit.sharepoint.com/sites/a__td_50472/Shared%20Documents/General/Recordings/Lezione%20del%202021_09_14-20210914_090233-Meeting%20Recording.mp4?web=1|video]].\\ Sections 19.1, 19.2, 19.4 of [MRS]. |
| 20.09.2021 | Crawling: problems and algorithmic structure. An example: Mercator. The Bloom filter: definition, time/space complexity and error bound. | {{ :magistraleinformatica:ir:ir21:lect_03-crawling_part_b_.ppt |Slides}} and [[https://unipiit.sharepoint.com/sites/a__td_50472/Shared%20Documents/General/Recordings/Lezione%20del%202021_09_20-20210920_110806-Meeting%20Recording.mp4?web=1|video]].\\ Sections 20.1, 20.2 of [MRS].\\ For questions on Bloom filters, see this [[http://didawiki.di.unipi.it/lib/exe/fetch.php/magistraleinformatica/ir/ir12/reading-bloomfilter.pdf|paper]]. |
| 21.09.2021 | Spectral Bloom Filter. | [[https://unipiit.sharepoint.com/sites/a__td_50472/Shared%20Documents/General/Recordings/Lezione%20del%202021_09_21-20210921_090227-Meeting%20Recording.mp4?web=1|video]] |
| 27.09.2021 | Consistent Hashing. Web graph compression: properties of the web, power laws, and compressing the adjacency lists. 
| Sect 19.1 and 19.2 of [MRS], plus this [[https://itnext.io/introducing-consistent-hashing-9a289769052e|page]] and [[http://web.stanford.edu/class/cs168/l/l1.pdf|note]] for consistent hashing.\\ Sect 20.3 and 20.4 of [MRS].\\ [[https://unipiit.sharepoint.com/sites/a__td_50472/Shared%20Documents/General/Recordings/Lezione%20del%202021_09_27-20210927_110244-Meeting%20Recording.mp4?web=1|video]] |
| 28.09.2021 | Locality-sensitive hashing: basics, Hamming distance, proof of the probability bounds. Use in an off-line and in an on-line setting. Comparison between LSH and K-means for the clustering problem. | {{ :magistraleinformatica:ir:ir21:lect_04-lsh_technique.ppt |Slides}} and [[https://unipiit.sharepoint.com/sites/a__td_50472/Shared%20Documents/General/Recordings/Lezione%20del%202021_09_28-20210928_090511-Meeting%20Recording.mp4?web=1|video]] |
| 04.10.2021 | Exact-duplicate documents: Karp-Rabin's rolling hash (with properties and error probability). Near-duplicate documents: shingling, Jaccard similarity, min-hashing (with its probabilistic property), LSH on integer vectors. Cosine similarity among vectors of real-valued features. | {{ :magistraleinformatica:ir:ir21:lect_05-deduplication.ppt |Slides}} and [[https://unipiit.sharepoint.com/sites/a__td_50472/Shared%20Documents/General/Recordings/Lezione%20del%202021_10_04-20211004_111011-Registrazione%20della%20riunione.mp4?web=1|video]]. \\ Sect 19.6 of [MRS] |
| 05.10.2021 | Exercises on LSH and shingling. | [[https://unipiit.sharepoint.com/sites/a__td_50472/Shared%20Documents/General/Recordings/Lezione%20del%202021_10_05-20211005_091000-Registrazione%20della%20riunione.mp4?web=1|Video]]. \\ Last year's video ([[https://www.dropbox.com/s/thty80qwy4axz4m/2020-10-05%20IR%20parte%201.mp4?dl=0|part 1]] and [[https://www.dropbox.com/s/l0j2qm1laxgb7lm/2020-10-05%20IR%20parte%202.mp4?dl=0|part 2]]). |
| 11.10.2021 | The issue of hierarchical memories: I/O-model. Index construction: multi-way mergesort, BSBI and SPIMI. 
Sketch of MapReduce. Distributed indexing: Term-based vs Doc-based partitioning. Dynamic indexing: two indexes, a cascade of indexes. | {{ :magistraleinformatica:ir:ir21:lect_06-constructionannotated.pdf |Slides}}.\\ Chapter 4 of [MRS].\\ Video [[https://unipiit.sharepoint.com/sites/a__td_50472/Shared%20Documents/General/Recordings/Lezione%20del%202021_10_11-20211011_112619-Registrazione%20della%20riunione.mp4|part 1]] and [[https://unipiit.sharepoint.com/sites/a__td_50472/Shared%20Documents/General/Recordings/Lezione%20del%202021_10_11-20211011_115848-Registrazione%20della%20riunione.mp4|part 2]] |
| 12.10.2021 | Parsing: tokenization, normalization, lemmatization, stemming, thesauri. Statistical properties of texts: Zipf's law (classical and generalized), Heaps' law, Luhn's consideration. Dictionary compression: Front coding. | Slides {{ :magistraleinformatica:ir:ir21:lect_08-parsing_and_text_lawsa4.pdf |one}} and {{ :magistraleinformatica:ir:ir21:dictionary-compression.pdf |two}}.\\ Sect. 2.1, 2.2, 5.1 and 5.2 of [MRS].\\ Video [[https://unipiit.sharepoint.com/sites/a__td_50472/Shared%20Documents/General/Recordings/Lezione%20del%202021_10_12-20211012_090006-Registrazione%20della%20riunione.mp4|part 1]] and [[https://unipiit.sharepoint.com/sites/a__td_50472/Shared%20Documents/General/Recordings/Lezione%20del%202021_10_12-20211012_100936-Registrazione%20della%20riunione.mp4|part 2]] |
| 19.10.2021 | Exact search: hashing. Prefix search: compacted trie, front coding, 2-level indexing. Edit distance via brute-force approach, or Dynamic Programming (possibly weighted). One-error match. Overlap measure with k-gram index. Edit distance with k-gram index. Wild-card queries (permuterm, k-gram). Phonetic match. Context-sensitive match. 
| {{ :magistraleinformatica:ir:ir21:lect_09-dict_search.ppt |Slides}} and video ([[https://unipiit.sharepoint.com/sites/a__td_50472/Shared%20Documents/General/Recordings/Lezione%202021_10_19-20211019_090035-Registrazione%20della%20riunione.mp4|part 1]] and [[https://unipiit.sharepoint.com/sites/a__td_50472/Shared%20Documents/General/Recordings/Lezione%202021_10_19-20211019_100000-Registrazione%20della%20riunione.mp4|part 2]]).\\ Chap 3 of [MRS]. |
| 25.10.2021 | Keyword extraction: statistical, chi-square test, Rake tool. Query processing: soft-AND, skip pointers, caching, phrase queries. Tiered index. | {{ :magistraleinformatica:ir:ir21:lect_08-parsing_and_text_laws.ppt |Slides 1}} and {{ :magistraleinformatica:ir:ir21:lect_10-query_resolver.ppt |Slides 2}}.\\ Sect. 2.3 and 2.4 of [MRS]\\ [[https://unipiit.sharepoint.com/sites/a__td_50472/Shared%20Documents/General/Recordings/Lezione%20del%2025_10_2021-20211025_111251-Registrazione%20della%20riunione.mp4|Video]]. |
| 26.10.2021 | Exercises | [[https://unipiit.sharepoint.com/sites/a__td_50472/Shared%20Documents/General/Recordings/Lezione%20del%2026_10_2021-20211026_091036-Registrazione%20della%20riunione.mp4|Video]] |
| 02.11.2021 | Exercises | [[https://unipiit.sharepoint.com/sites/a__td_50472/Shared%20Documents/General/Recordings/Lezione%20del%202021_11_02-20211102_090303-Meeting%20Recording.mp4|Video]] |
| 08.11.2021 | Exercises | [[https://unipiit.sharepoint.com/sites/a__td_50472/Shared%20Documents/General/Recordings/Lecture%202021_11_08-20211108_110502-Meeting%20Recording.mp4|Video]] |
| 09.11.2021 | **MidTerm Exam** (see above for details; the topics are those covered up to this point in the agenda). | |
| 15.11.2021 | Compressed storage of documents: LZ-based compression. Storage and transmission of a single file or a group of files: Delta compression (Zdelta), File Synchronization (rsync, zsync). 
| Suggested reading: a [[https://www.dropbox.com/s/tsb6j1rmrx3e5zr/Lect%2007%20-%20reading%20-%20rsync%20and%20zsync.pdf?dl=0|paper]].\\ {{ :magistraleinformatica:ir:ir21:lect_07-compression_docs_and_rsync.ppt |Slides}}, [[https://unipiit.sharepoint.com/sites/a__td_50472/Shared%20Documents/General/Recordings/Lezione%20del%202021_11_15-20211115_110235-Meeting%20Recording.mp4?web=1|Video]]. |
| 16.11.2021 | Posting list compression, codes: gamma, variable bytes (t-nibble), PForDelta, Elias-Fano indexing. Text-based ranking: Dice, Jaccard, tf-idf. Vector space model and cosine similarity doc-doc and query-doc. | Sect. 5.3 of [MRS] and {{:magistraleinformaticanetworking/ae/ae2014/chap_9.pdf|Ferragina's notes}} (only the coders presented in class).\\ {{ :magistraleinformatica:ir:ir21:lect_11_-_compression_integers_new_.ppt |Slides}} and [[https://unipiit.sharepoint.com/sites/a__td_50472/Shared%20Documents/General/Recordings/Lezione%20del%202021_11_16-20211116_092024-Meeting%20Recording.mp4?web=1|Video]]. |
| 22.11.2021 | Storage of tf-idf and use for computing document-query similarity. Fast top-k retrieval: high idf, champion lists, many query-terms, fancy hits, clustering. | Sect. 6.2 and 6.3, and Chap. 7 of [MRS].\\ [[https://www.dropbox.com/s/iyrlc81wuzbtewu/lect%2012-text%20ranking.ppt?dl=0|Slides]] |
| 23.11.2021 | Exact Top-K: WAND and blocked-WAND. Relevance feedback, Rocchio, pseudo-relevance feedback, query expansion. Performance measures: precision, recall, F1 and user happiness. | Sect. 8.1-8.3 and Chap. 9 of [MRS].\\ [[https://unipiit.sharepoint.com/sites/a__td_50472/Shared%20Documents/General/Recordings/Lezione%20del%202021_11_23-20211123_090451-Meeting%20Recording.mp4|Video]] |
| 29.11.2021 | Random Walks. Link-based ranking: PageRank, topic-based PageRank, personalized PageRank. Application to Text Summarization. | Chap 21 of [MRS]. 
[[https://www.dropbox.com/s/mb6y2k93lba9j10/lect%2013-Web%20ranking.ppt?dl=0|Slides]] and [[https://unipiit.sharepoint.com/sites/a__td_50472/Shared%20Documents/General/Recordings/Lezione%20del%202021_11_29-20211129_110458-Meeting%20Recording.mp4|video]] |
| 30.11.2021 | HITS. Projections to smaller spaces: Latent Semantic Indexing (LSI). Sketch of the ideas underlying Entity Linkers and Knowledge Graphs. | Chap 18 of [MRS].\\ [[https://www.dropbox.com/s/ylcnklittbc4wne/lect%2014-LSI%20and%20random%20proj%20-%20shorter.ppt?dl=0|Slides]] and [[https://unipiit.sharepoint.com/sites/a__td_50472/Shared%20Documents/General/Recordings/Lezione%20del%202021_11_30-20211130_090133-Meeting%20Recording.mp4?web=1|video]]. |
| 06.12.2021 | Elasticsearch, with lab: students are required to bring their own laptop to class, with [[https://docs.docker.com/get-docker/|Docker]] already installed and the [[https://www.elastic.co/guide/en/elasticsearch/reference/current/docker.html|ElasticSearch]] image already downloaded via **docker pull** (i.e. the first step of "Pulling the image"). | |
| 07.12.2021 | Exercises | [[https://unipiit.sharepoint.com/sites/a__td_50472/Shared%20Documents/General/Recordings/Lezione%20del%202021_12_07-20211207_090823-Meeting%20Recording.mp4?web=1|Video]] |
| 13.12.2021 | Exercises | [[https://unipiit.sharepoint.com/sites/a__td_50472/Shared%20Documents/General/Recordings/Meeting%20del%202021_12_13-20211213_110529-Meeting%20Recording.mp4?web=1|Video]] |
| 14.12.2021 | **FinalTerm exam**. The topics will be those covered **after** the MidTerm exam. Students have to register at the [[https://forms.office.com/r/gFmFVWVWzp|following form]] by December 7th, 2021. | |