Indice

Algorithm and Data Structures for Data-intensive Applications A.A. 2024-2025

Teachers: Paolo Ferragina and Giovanni Manzini

CFU: 9 (first semester).

Course ID: 0002A.

Language: English.

Question time: Monday: 15-17, or by appointment (also in video-conferencing through the virtual room of the course)

News: about this course will be distributed via a Telegram channel.

Official lectures schedule: The schedule and content of the lectures are available below and with the official registro.

Week Schedule of the Lectures

Week Schedule
Day Time Slot Room
Monday 11:00 - 13:00 Room PS4
Tuesday 11:00 - 13:00 Room PS4
Wednesday 11:00 - 13:00 Room Fib C

Exam

The exam will consist of a written test including two parts: exercises and “oral” questions. The exam is passed if in both parts the student gets a sufficient score (expressed in 30), which is then averaged.

The first (exercises) and the second (theory questions) parts of the exam can be split into different exam dates, even of different exam sessions. The exam dates are the ones indicated in the calendar on ESAMI. In the case that the second part is not passed or the student abandons the exam, (s)he can keep the rank of the first exam, but this may occur just once. The second time this happens, the rank of the first part is dropped, and the student has to do both parts again.

Dates Room Text Notes
00/00/2024, start at 00:00 room 00 text, results

Goals

The goal of this course is to enrich and strengthen the knowledge about basic algorithms and data structures, learned in undergraduate courses, with further models, methodologies, tools, and techniques to design, analyze, experiment, and improve solutions for managing massive datasets of multimodal types: integer, string, vector, text, time series, and (labeled) graph.

These knowledge and skills constitute the building blocks upon which Big Data platforms and applications are built (e.g., key-value stores, graph DBs, vector DBs, Transformers, …), so this theoretical part will provide students with the “methodological toolbox” for designing and evaluating their big data solutions, and for choosing the right library to adopt for their implementation.

The theoretical lectures will be complemented by some hands-on experience in coding with relevant programming libraries that offer the above-mentioned building blocks. The course provides most of the algorithmic prerequisites for the other courses of this master’s degree.

Syllabus

Lectures will include in-depth discussions on the practical efficiency of the proposed algorithms and data structures, plus hands-on experience with coding solutions for some topics.

Background and Notes of the Course

I strongly suggest refreshing your knowledge about basic Algorithms and Data Structures by looking at the well-known book Introduction to Algorithms, Cormen-Leiserson-Rivest-Stein (third edition). Specifically, I suggest you look at the chapters 2, 3, 4, 6, 7, 8, 10, 11 (no perfect hash), 12 (no randomly built), 15 (no optimal BST), 18, 22 (no strongly connected components). Also, you could look at the Video Lectures by Erik Demaine and Charles Leiserson, specifically Lectures 1-7, 9-10, and 15-17.

The content of the course will be covered in part by the book published by Cambridge University Press as [https://www.cambridge.org/core/books/pearls-of-algorithm-engineering/95061352D7263CCCBD4F243018236EB2|Pearls of Algorithm Engineering], and in part by notes and GitHub refs that will provide the materials of hands-on sessions.


Lectures (TO BE CHANGED SIGNIFICANTLY)

Date Lecture Biblio
Key issues about I/Os, streaming, and hierarchical memory in managing massive datasets.
Algorithmic and data structure issues concerning atomic versus variable-length items.
Two fundamental tools: sorting and permitting.
Key-value stores: From basic to advanced hash tables and tries.
Textual search engines: compressing and accessing posting lists (integer sequences) and text collections. Deduplicating similar/equal texts, clustering text collections.
Bio-Informatics engines: compressing and indexing arbitrary text for substring search, exact or approximate.
Vector DBs: nearest neighbor search, hamming, or Euclidean distance.
Advanced storage: Streaming and random access to compressed raw files, time series, and (labeled) graphs.