====== Algorithm and Data Structures for Data-intensive Applications A.A. 2024-2025 ====== **Teachers**: Paolo Ferragina and Giovanni Manzini **CFU:** 9 (first semester). **Course ID:** 0002A. **Language:** English. **Question time:** Monday: 15-17, or by appointment (also in video-conferencing through the [[|virtual room]] of the course) **News:** about this course will be distributed via a [[ | Telegram channel]]. **Official lectures schedule:** The schedule and content of the lectures are available below and with the [[https://unimap.unipi.it/registri/dettregistriNEW.php?re=::::&ri=9142|official registro]]. \\ ===== Week Schedule of the Lectures ===== ^ Week Schedule ^^^ ^ Day ^ Time Slot ^ Room ^ | Monday | 11:00 - 13:00 | Room PS4 | | Tuesday | 11:00 - 13:00 | Room PS4 | | Wednesday | 11:00 - 13:00 | Room Fib C | ===== Exam ===== The exam will consist of a written test including two parts: **exercises and "oral" questions**. The exam is passed if in both parts the student gets a sufficient score (expressed in 30), which is then averaged. The first (exercises) and the second (theory questions) parts of the exam can be split into different exam dates, even of different exam sessions. The exam dates are the ones indicated in the calendar on ESAMI. In the case that the second part is not passed or the student abandons the exam, (s)he can keep the rank of the first exam, but this may occur just once. The second time this happens, the rank of the first part is dropped, and the student has to do both parts again. ^ Dates ^ Room ^ Text ^ Notes ^ | 00/00/2024, start at 00:00 | room 00 | text, results | | ===== Goals ===== The goal of this course is to enrich and strengthen the knowledge about basic algorithms and data structures, learned in undergraduate courses, with further models, methodologies, tools, and techniques to design, analyze, experiment, and improve solutions for managing massive datasets of multimodal types: integer, string, vector, text, time series, and (labeled) graph. These knowledge and skills constitute the building blocks upon which Big Data platforms and applications are built (e.g., key-value stores, graph DBs, vector DBs, Transformers, …), so this theoretical part will provide students with the “methodological toolbox” for designing and evaluating their big data solutions, and for choosing the right library to adopt for their implementation. The theoretical lectures will be complemented by some hands-on experience in coding with relevant programming libraries that offer the above-mentioned building blocks. The course provides most of the algorithmic prerequisites for the other courses of this master’s degree. ===== Syllabus ===== * Key issues about I/Os, streaming, and hierarchical memory in managing massive datasets. * Algorithmic and data structure issues concerning atomic versus variable-length items. * Two fundamental tools: sorting and permuting. * Key-value stores: From basic to advanced hash tables and tries. * Textual search engines: compressing and accessing posting lists (integer sequences) and text collections. Deduplicating similar/equal texts, clustering text collections. * Bio-Informatics engines: compressing and indexing arbitrary text for substring search, exact or approximate. * Vector DBs: nearest neighbor search, hamming, or Euclidean distance. * Advanced storage: Streaming and random access to compressed raw files, time series, and (labeled) graphs. Lectures will include in-depth discussions on the practical efficiency of the proposed algorithms and data structures, plus hands-on experience with coding solutions for some topics. ===== Background and Notes of the Course ===== I strongly suggest refreshing your knowledge about basic Algorithms and Data Structures by looking at the well-known book [[https://mitpress.mit.edu/books/introduction-algorithms-third-edition|Introduction to Algorithms]], Cormen-Leiserson-Rivest-Stein (third edition). Specifically, I suggest you look at the chapters 2, 3, 4, 6, 7, 8, 10, 11 (no perfect hash), 12 (no randomly built), 15 (no optimal BST), 18, 22 (no strongly connected components). Also, you could look at the [[http://videolectures.net/mit6046jf05_introduction_algorithms/|Video Lectures]] by Erik Demaine and Charles Leiserson, specifically Lectures 1-7, 9-10, and 15-17. The content of the course will be covered in part by the book published by //Cambridge University Press// as [https://www.cambridge.org/core/books/pearls-of-algorithm-engineering/95061352D7263CCCBD4F243018236EB2|Pearls of Algorithm Engineering], and in part by notes and GitHub refs that will provide the materials of hands-on sessions. \\ ===== Lectures (TO BE CHANGED SIGNIFICANTLY) ===== ^ Date ^ Lecture ^ Biblio ^ | | Key issues about I/Os, streaming, and hierarchical memory in managing massive datasets. | | | | Algorithmic and data structure issues concerning atomic versus variable-length items. | | | | Two fundamental tools: sorting and permitting. | | | | Key-value stores: From basic to advanced hash tables and tries. | | | | Textual search engines: compressing and accessing posting lists (integer sequences) and text collections. Deduplicating similar/equal texts, clustering text collections. | | | | Bio-Informatics engines: compressing and indexing arbitrary text for substring search, exact or approximate. | | | | Vector DBs: nearest neighbor search, hamming, or Euclidean distance. | | | | Advanced storage: Streaming and random access to compressed raw files, time series, and (labeled) graphs. | |