====== Algorithms and data structures for data-intensive applications ====== **Teacher**: [[https://www.santannapisa.it/en/paolo-ferragina|Paolo Ferragina]] e Giovanni Manzini **CFU:** 9. **Language:** English. **Degree:** [[https://didattica.di.unipi.it/en/master-programme-in-informatics-for-digital-health/|Informatics for Digital Health]] \\ ====== Goals ====== The goal of this course is to enrich and strengthen the knowledge about basic algorithms and data structures, learned in undergraduate courses, with further models, methodologies, tools, and techniques to design, analyze, experiment, and improve solutions for managing massive datasets of multimodal types: integer, string, vector, text, time series and (labeled) graph. These knowledge and skills constitute the building blocks upon which Big Data platforms and applications are built (e.g., key-value stores, graph DBs, vector DBs, Transformers, …), so this theoretical part will provide students with the “methodological toolbox” for designing and evaluating their big data solutions, and for choosing the right library to adopt for their implementation. The theoretical lectures will be complemented by some hands-on experience in coding with relevant programming libraries that offer the above-mentioned building blocks. The course provides most of the algorithmic prerequisites for the other courses of this master’s degree. ====== Syllabus ====== * Key issues about I/Os, streaming, and hierarchical memory in managing massive datasets. * Algorithmic and data structure issues concerning atomic versus variable-length items. * Two fundamental tools: sorting and permuting. * Key-value stores: From basic to advanced hash tables and tries. * Textual search engines: compressing and accessing posting lists (integer sequences) and text collections. Deduplicating similar/equal texts, clustering text collections. * Bio-Informatics engines: compressing and indexing arbitrary text for substring search, exact or approximate. * Vector DBs: nearest neighbor search, hamming, or Euclidean distance. * Advanced storage: Streaming and random access to compressed raw files, time series, and (labeled) graphs. Lectures will include in-depth discussions on the practical efficiency of the proposed algorithms and data structures, plus hands-on experience with coding solutions for some topics. ====== Background====== If you wish to refresh your mind on Algorithms and Data Structures, I suggest you to follow the [[http://videolectures.net/mit6046jf05_introduction_algorithms/|Video Lectures]] by Erik Demaine and Charles Leiserson, specifically Lectures 1-5, 7 and 10. There it is missing the part on basic graph problems (representation, DFS, BFS, topological sort) which you may browse in any book, such as [[http://mitpress.mit.edu/catalog/item/default.asp?ttype=2&tid=11866|Introduction to Algorithms]] by Cormen-Leiserson-Rivest-Stein, third edition. ====== Current year ====== * [[.ad2da2024:|Academic Year 2024-2025]] ====== Previous years ======