Differences

These are the differences between the selected revision and the current version of the page.

--- magistraleinformatica:dmi:start [18/09/2024 at 09:33 (17 months ago)] – Update teaching material, and office hours Mattia Setzu
+++ magistraleinformatica:dmi:start [18/12/2025 at 14:09 (7 weeks ago)] (current version) – Anna Monreale
@@ Line 1: / Line 1: @@
-====== Data Mining (309AA) - 9 CFU A.Y. 2024/2025 ======
+====== Data Mining (309AA) - 9 CFU A.Y. 2025/2026 ======
 **Instructors:**
@@ Line 12: / Line 12: @@
   * * **Lorenzo Mannocci**
     * University of Pisa
-    * [[lorenzo.mannocci@phd.unipi.it]]
+    * [[lorenzo.mannocci@di.unipi.it]]
 ====== News ======
-  * [14.09.2024] ** The lectures will start on 19th September 2024**
+  * [18-11-2025]: Project deadline available: January 5th, 2026.
+  * [23-09-2025]: Please register yourself and your group for the project. Group registration is available [[https://docs.google.com/spreadsheets/d/1Xl8Hd-giIuJQw0x2NDkXjbGZ2REGF-OukqC5XGU6pzA/edit?gid=0#gid=0|here]].
 ====== Learning Goals ======
-     * Fundamental concepts of data knowledge and discovery.
+The Data Mining course tackles the analysis of large collections of data, and the extraction of information and patterns. It aims to explore core components of the Knowledge Discovery from Data (KDD) process, and focuses on:
-     * Data understanding
+  * Data understanding
-     * Data preparation
+  * Data cleaning, preparation, and transformation
-     * Clustering
+  * Data analysis: outlier detection and data representation
-     * Classification
+  * Data clustering
-     * Pattern Mining and Association Rules
+  * Pattern extraction: itemset, rules, association rules, and sequential patterns
-     * Outlier Detection
+  * Inference models: trees, and ensemble models
-     * Time Series Analysis
+  * Responsible data use: privacy and interpretability
-     * Sequential Pattern Mining
-     * Ethical Issues
 ====== Schedule ======
@@ Line 34: / Line 33: @@
 ^  Day of Week  ^  Hour  ^  Room  ^
-|  Tuesday   |  11:00 - 13:00  |  Room C1  |
+|  Tuesday   |  11:00 - 13:00  |  Room C  |
-|  Thursday  |  09:00 - 11:00  |  Room C  |
+|  Wednesday |  14:00 - 16:00  |  Room C  |
-|  Friday    |  09:00 - 11:00  |  Room C1  |
+|  Thursday  |  14:00 - 16:00  |  Room A1  |
 **Office hours - Ricevimento:**
-  * Anna Monreale: TBD
+  * Anna Monreale: TBD. Online via Teams or in my office (appointment by email).
   * Mattia Setzu: Infos on [[https://unimap.unipi.it/cercapersone/dettaglio.php?ri=177323&template=dett_didattica.tpl|Unimap]]
-A [[https://teams.microsoft.com/l/team/19%3Aq8IK5DrzMwEE5TxVhuw4QdYEVFJ06KVITI5jSJTmaJ81%40thread.tacv2/conversations?groupId=5fae2fa6-38fd-414f-a0c9-ffbd8e6f0710&tenantId=c7456b31-a220-47f5-be52-473828670aa1|Teams Channel]] will be used ONLY to post news, Q&A, and other stuff related to the course. The lectures will be only in presence and will **NOT** be live-streamed, but recordings of the lecture or of the previous years will be made available here for non-attending students.
+A [[ https://teams.microsoft.com/l/team/19%3Ai_Ge38xXm8FdnepLNud6ddbz_OECbBPRKfA1UKbUsQo1%40thread.tacv2/conversations?groupId=41e56778-e965-462a-9fef-250df0ee7055&tenantId=c7456b31-a220-47f5-be52-473828670aa1|Teams Channel]] will be used ONLY to post news, Q&A, and other stuff related to the course. The lectures will be only in presence and will **NOT** be live-streamed.
 ====== Teaching Material ======
@@ Line 69: / Line 69: @@
 The slides used in the course will be inserted in the calendar after each class. Some are part of the slides provided by the textbook's authors [[http://www-users.cs.umn.edu/~kumar/dmbook/index.php#item4|Slides per "Introduction to Data Mining"]].
+===== Past Exercises and past exams of similar courses  =====
+  * Exercises on Clustering: {{ :dm:ex._clustering.pdf |}}
+  * Some text of past exams of a similar course: {{ :dm:2017-1-19.pdf |}}, {{ :dm:2017-9-6.pdf |}}, {{ :dm:2016-05-30-dm1-seconda.pdf |}}, {{ :dm:dm2_exam.2017.06.13_solutions.pdf |}}, {{ :dm:dm2_exam.2017.07.04_solutions.pdf |}}, {{ :dm:dm2_mid-term_exam.2017.06.06_solutions.pdf |}}
+  * Some exercises (partially with solutions) on **sequential patterns** and **time series** can be found in the following texts of exams from the last years: {{ :dm:dm2_exam.2015.04.13.results.pdf|}}, {{ :dm:dm2_exam.2016.04.4_sol.pdf |}}, {{ :dm:dm2_exam.2016.04.5_sol.pdf |}}, {{ :dm:dm2_exam.2016.06.20_sol.pdf |}}, {{ :dm:dm2_exam.2016.07.08_sol.pdf |}}
+   * Some very old exercises (part of them with solutions) are available here, most of them in Italian, not all of them on topics covered in this year's program: {{tdm:verifica2006.pdf|Verifica 2006}}, {{tdm:verifica2005.pdf|Verifica 2005 (con soluzioni)}}, {{tdm:verifica2004.pdf|Verifica 2004}}, {{dm:verifica.05.06.2007.pdf|Verifica 5 giugno 2007}}, {{dm:verifica.26.06.2007.pdf|Verifica 26 giugno 2007}}, {{dm:verifica.24.07.2007_corretto.pdf|Verifica 24 luglio 2007}} (and {{:dm:soluzioni.2008.04.03.pdf|Soluzioni}}), {{:dm:dm-tdm.appello_2008_07_18_parte1.pdf|Verifica 18 luglio 2008 - parte 1}}, {{:dm:dm-tdm.appello_2008_07_18_parte2.pdf|Verifica 18 luglio 2008 - parte 2}},{{:dm:appello.2010.06.01_soluzioni.pdf| Exam with solution 2010-06-01}},{{:dm:appello.2010.06.22_soluzioni.pdf|Exam with solution 2010-06-22}}, {{:dm:appello.2010.09.09_soluzioni.pdf|Exam with solution 2010-09-09}},{{:dm:appello.2010.07.13_soluzioni.pdf| Exam with solution 2010-07-13}}
-**Software**
-Software material available in the [[https://github.com/data-mining-UniPI/teaching23|Github repository]] (available in the coming days).
-====== Class Calendar (2024/2025) ======
+====== Class Calendar (2025/2026) ======
 ===== First Semester  =====
-^ ^ Day ^ Topic ^ Learning material ^ References ^ Video Lectures ^ Teacher ^
+^ ^ Day ^ Topic ^ Teaching material ^ References ^ Teacher ^
-|    |  17.09  | Candeled   |  |   |
+|1.  |  18.09  | Course Overview. Introduction to Data Mining |  {{ :magistraleinformatica:dmi:intro_dm.pdf |Introduction to DM}} | Chap. 1 Kumar Book | Setzu  |
-|1.  |  19.09  | Overview. Introduction to KDD   |  |Chap. 1 Kumar Book | | |
+|    |  23.09  | Canceled for Teacher's health issues         |   |  | |
+|2.  |  24.09  | Data Understanding + Data Preparation        | {{ :magistraleinformatica:dmi:data_understanding.pdf |}} {{ :magistraleinformatica:dmi:data_preparation_and_cleaning.pdf | Data Preparation}}| Chap. 2 Kumar Book and additional resource of Kumar Book: [[https://www-users.cs.umn.edu/~kumar001/dmbook/data_exploration_1st_edition.pdf|Data Exploration Chap.]] If you have the first edition of the Kumar book, this is Chap. 3 |Setzu |
+|3.  |  25.09  | Data representation      |{{ :magistraleinformatica:dmi:data_representation.pdf |}} | References: Introduction to linear algebra (Sections 1, 3.1, 4.2, 6.1, 6.4, 6.5, 7.3), [[https://www.jmlr.org/papers/volume9/vandermaaten08a/vandermaaten08a.pdf|t-SNE paper]], [[https://arxiv.org/abs/1802.03426 | UMAP paper (Section 3)]]  |Setzu |
+|4.  |  30.09  | Data Cleaning + Transformations. PyLab: Data Understanding     | {{ :magistraleinformatica:dmi:5-data_cleaning_transformation.pdf | Data Cleaning & Transformations}}| | Monreale, Mannocci |
+|5.  |  01.10  | PyLab: Data Understanding + Preparation    |{{ :magistraleinformatica:dmi:1_basics_and_understanding.ipynb.zip |}} {{ :magistraleinformatica:dmi:2_feature_engineering_and_data_representation.ipynb.zip |}} {{ :magistraleinformatica:dmi:data_notebook.zip |}}| | Monreale, Mannocci |
+|6.  |  02.10  | Similarities + Introduction to Clustering and Centroid-based clustering  | {{ :magistraleinformatica:dmi:6-data_similarity.pdf |}} {{ :magistraleinformatica:dmi:6-basic_cluster_analysis-intro.pdf |}} {{ :magistraleinformatica:dmi:8-basic_cluster_analysis-kmeans.pdf |}}| | Monreale |
+|7.  |  07.10  | K-means   | {{:magistraleinformatica:dmi:8-basic_cluster_analysis-kmeans.pdf |}}| | Monreale |
+|8.  |  08.10  | Hierarchical Clustering + Density Based Clustering + Validity   | {{ :magistraleinformatica:dmi:9-basic_cluster_analysis-hierarchical.pdf |}} {{ :magistraleinformatica:dmi:8.basic_cluster_analysis-dbscan-validity.pdf |}} |  | Monreale |
+| 9. | 14.10 | Clustering evaluation and Python notebooks | {{ https://didawiki.cli.di.unipi.it/lib/exe/fetch.php/magistraleinformatica/dmi/12-basic_cluster_analysis-validity.pdf | Clustering validation}} {{ :magistraleinformatica:dmi:3_clustering.ipynb.zip |}} | | Setzu, Mannocci |
+| 10. | 15.10 | Anomaly detection | {{ https://github.com/data-mining-UniPI/teaching25/blob/lectures/anomaly%20detection/Anomaly%20detection.html.pdf | Slides }} | | Setzu |
+| 11. | 16.10 | Anomaly detection | {{ https://github.com/data-mining-UniPI/teaching25/blob/lectures/anomaly%20detection/Anomaly%20detection.html.pdf | Slides }}, {{ https://github.com/data-mining-UniPI/teaching25/blob/main/notebooks/outliers.ipynb | Notebook }}, {{ https://github.com/data-mining-UniPI/teaching25/blob/main/notebooks/isolation_forest.py | Rule extraction from isolation forests }} | | Setzu |
+|12.  |  21.10  | Variants of K-means + Association Rule Mining | {{ :magistraleinformatica:dmi:11-basic_cluster_analysis-kmeans-variants.pdf |}} {{ :magistraleinformatica:dmi:17_association_analysis2023.pdf |}} | | Monreale  |
+|13.  |  22.10  | Association Rule Mining: Apriori | {{ :magistraleinformatica:dmi:17_association_analysis2023.pdf |}} | | Monreale  |
+|14.  |  23.10  | Association Rule Mining: CORELS | {{ https://github.com/data-mining-UniPI/teaching25/blob/lectures/rule_mining/Rule%20extraction.html.pdf | Slides }}, {{ https://corels.cs.ubc.ca/corels/index.html | Online tool }} | | Setzu  |
+|15.  |  28.10  | Visual Analytics  | {{ :magistraleinformatica:dmi:dm_intro_dataviz_vegaaltair.pdf |Slides}} {{ :magistraleinformatica:dmi:1_bis_basics_and_understanding_altair.ipynb.zip | Code for data visualization with Altair}}| |Monreale, Rinzivillo|
+|16.  |  29.10  | Association Rule Mining: FP-Growth + Sequential Pattern Mining| {{ :magistraleinformatica:dmi:17_2023-fp-growth.pdf |FP-Growth}}{{ :magistraleinformatica:dmi:18_sequential_patterns_2024.pdf |SPM}}| |Monreale|
+|     |  30.10  | Lecture is canceled| | | |
+|17.  |  04.11  | Sequential Pattern Mining with time constraints + Python Lab: FPM + SPM.| For SPM the same set of slides used in the previous lecture {{ :magistraleinformatica:dmi:5_patternmining.ipynb.zip |}} | | Monreale|
+|18.  |  05.11  | Supervised learning and classification | {{ :magistraleinformatica:dmi:supervisinglearning.pdf | Slides}}| | Setzu |
+|19.  |  06.11  | Classification: Decision Trees | {{ :magistraleinformatica:dmi:2025-dt_classification.pdf |Decision Trees }} [[https://unipiit.sharepoint.com/:v:/s/a__td_69096/ESuyvNgtPWxPoLRspBH9q3IB2cvZE9o6a0DRZFQP2gbNww?e=VqxUqL|Video]] | | Monreale |
+|20.  |  07.11  | Classification: Decision Trees |  | | Monreale |
+|21.  |  11.11  | Classification: Decision Trees & evaluation +  Decision Rules| {{ :magistraleinformatica:dmi:classificationmodelevaluation-2025.pdf |Evaluation}} {{ :magistraleinformatica:dmi:10-rule-based-classifiers.pdf | Decision Rules}}  | | Monreale |
+|22.  |  12.11  | Classification: Decision Rules  + Instance based methods + Q&A for Project work| {{ :magistraleinformatica:dmi:10-knn.pdf |}} | | Monreale |
+|23.  |  13.11  | Exercises: DT simulation, Clustering, sequences | {{ :magistraleinformatica:dmi:dt-learning-simulation.pdf |}} {{ :magistraleinformatica:dmi:learnedtree.pdf |}}{{ :magistraleinformatica:dmi:2025-ex-clustering.pdf |}} {{ :magistraleinformatica:dmi:ex-sequences.pdf |}}| | Monreale |
+|24.  |  18.11  | Advanced Decision Trees, GAMs, and ensemble models | {{ https://github.com/data-mining-UniPI/teaching25/blob/lectures/machine%20learning/Supervised%20tasks.html.pdf | Slides }} | | Setzu |
+|25.  |  25.11  | Neural networks | {{ https://github.com/data-mining-UniPI/teaching25/blob/lectures/machine%20learning/Networks.pdf | Slides }} | | Setzu |
+|26.  |  26.11  | Time series, Python Supervised Learning & Imbalanced Scenarios | {{ https://github.com/data-mining-UniPI/teaching25/blob/lectures/time%20series/Time%20series.html.pdf | Slides }} {{ :magistraleinformatica:dmi:supervised_learning.zip |}} {{ :magistraleinformatica:dmi:data_notebook.zip |}} | | Setzu, Mannocci |
+|27.  |  27.11  | Time series, Python Supervised Learning & Imbalanced Scenarios | {{ https://github.com/data-mining-UniPI/teaching25/blob/lectures/time%20series/Time%20series.html.pdf | Slides }}, {{ https://github.com/data-mining-UniPI/teaching25/blob/lectures/time%20series/Time%20series.html | Slides in HTML (w/ working animation) }} | | Setzu |
+|28.  |  02.12  | Shapelet-based Classification, Motif discovery | {{ :magistraleinformatica:dmi:23_time_series_motif-shapelets2023.pdf |Slides}} | {{ :magistraleinformatica:dmi:shaplet.pdf |}} {{ :magistraleinformatica:dmi:matrixprofile.pdf |}} [[https://www.cs.ucr.edu/~eamonn/MatrixProfile.html|Papers and resources on motifs]] |Monreale |
+|29.  |  03.12  | Py: Time Series|{{ :magistraleinformatica:dmi:timeseries.zip |}}| | Monreale, Mannocci |
+|30.  |  04.12  | Responsible AI: introduction and EU Regulations | {{ :magistraleinformatica:dmi:19_rai_privacy2025.pdf | Slides}}| |Monreale |
+|31.  |  09.12  | Responsible AI: privacy. | Same slides as the previous lecture | {{ :magistraleinformatica:dmi:chap-anonymity.pdf |}} [[https://arxiv.org/abs/1610.05820|MIA attack against ML]]| Monreale|
+|32.  |  10.12  | Responsible AI: Explainable AI |{{ :magistraleinformatica:dmi:20_explainability_2025.pdf |XAI}}|[[https://christophm.github.io/interpretable-ml-book/|Digital book where students can find some basic XAI models and notions]] {{ :magistraleinformatica:dmi:xai-taxonomy-survey.pdf | XAI Survey describing the taxonomy and dimensions of XAI}} {{ :magistraleinformatica:dmi:lore-j.pdf | LORE approach}}, {{ :magistraleinformatica:dmi:abele-approach.pdf |ABELE approach}}{{ :magistraleinformatica:dmi:lasts_-_explaining_any_time_series_classifier_2_.pdf |LASTS}} [[https://arxiv.org/abs/1705.07874|SHAP]][[https://arxiv.org/abs/1602.04938|LIME]]|Monreale|
+|33.  |  11.12  | XAI Python Notebook + Private and explainable FL, Assessing privacy in XAI  | {{ :magistraleinformatica:dmi:xai-tutorial.ipynb.zip |XAI Notebook}} {{ :magistraleinformatica:dmi:11-dic-2025-xai.pdf | Slides}} |{{ :magistraleinformatica:dmi:glor-flex_local_to_global_rule-based_explanations_fl.pdf |GLOR-FLEX}} {{ :magistraleinformatica:dmi:fastshap-ex-pri.pdf |FASTSHAP++}} [[https://www.tdp.cat/issues21/tdp.a534a24.pdf|REVEAL]]|Naretto|
+|34.  |  16.12  |Project Presentations - second check - ONLINE - **MANDATORY**|
+|35.  |  17.12  |Project Presentations - second check - ONLINE - **MANDATORY**|
+|36.  |  18.12  |Project Presentations - second check - ONLINE - **MANDATORY**|
+====== Exam ======
+The exam can be taken in one of two ways:
+**Project track**:
+  * Project (70% of the final score) to be delivered after the end of the course
+  * Oral exam (30% of the final score)
+During the course, you will have some “Project presentation” sessions wherein you’ll briefly (~3 minutes) present your work, and receive feedback from the lecturers. These sessions do not contribute to your grade.
+**Written test track**
+  * Written exam (70% of the final score): taken during the exam sessions after the end of the course; it can include both theoretical questions and exercises.
+  * Oral exam (30% of the final score)
+Note that a passing grade for the project/written exam is required to be admitted to the oral exam.
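The weighting above can be sketched in a few lines of Python. This is only an illustration: the 30-point scale and the pass mark of 18 are assumptions not stated on this page, and `final_score` is a hypothetical helper, not an official grading tool.

```python
from typing import Optional


def final_score(first_part: float, oral: float, pass_mark: float = 18.0) -> Optional[float]:
    """Combine the project (or written exam) score with the oral score.

    first_part: score of the project or of the written exam (weight 70%).
    oral: score of the oral exam (weight 30%).
    pass_mark: ASSUMED passing threshold; the page does not state it.

    Returns None when the first part is below the pass mark, because in
    that case the student is not admitted to the oral exam.
    """
    if first_part < pass_mark:
        return None
    return 0.7 * first_part + 0.3 * oral


if __name__ == "__main__":
    print(final_score(28, 24))
    print(final_score(17, 30))
```

The same function covers both tracks, since the project and the written exam carry the same 70% weight.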
+**Project Guidelines:**
+A project consists of data analyses based on the use of data mining tools.
+The project has to be carried out by a team of 3 students, using Python. The guidelines require addressing specific tasks. Results must be reported in a single paper of at most 25 pages of text, including figures. The students must deliver both the paper (single column) and well-commented Python notebooks.
+Specifically, if any of these tasks appear in the project track, make sure to focus on the following:
+**Data understanding**
+  * An analysis of all variables, their relations, distributions, and quality
+  * Feature imputation and/or selection, where needed
+  * The engineering of additional features, including the aforementioned analyses
+**Clustering Analysis**
+  * A properly justified feature selection phase
+  * Tackling all clustering families, exploring their respective hyperparameters
+  * An analysis of the best clusterings per family, including cluster description
+  * A comparison of the best clusterings per family
+**Anomaly detection**
+  * A selection of outliers through appropriate algorithms
+  * An interpretation of such outliers
+  * An analysis of the impact of the outliers on the previously performed data understanding
+**Time series analysis**
+  * Appropriate representation choice for the task at hand
+**Supervised learning**
+  * Feature selection
+  * Test different families of models
+  * Proper model validation, including both model performance and model complexity
+  * Comparison of the best models of each family
+**Explainability**
+  * Justified selection of instances to explain
+  * Analysis of the explanations
+**Project and Deadlines**
+Information about the dataset to be analyzed and project description:
+  * **Dataset.** https://drive.google.com/file/d/1K9garfm03-PFUMYyOenH9kqEJ7D5RrmD/view?usp=sharing
+  * **Project description.** {{ :magistraleinformatica:dmi:data_mining_project.pdf |}}
+  * **Project description Task 4.** {{ :magistraleinformatica:dmi:data_mining_project2.pdf |}}
+  * **Dataset Task 4.** https://drive.google.com/file/d/1Li2roWMoREN6_nKy-trB7pXWDA1xkAzh/view?usp=sharing
+  * **Project description Task 5.**
+  * **Project Questions & Answers.** {{ :magistraleinformatica:dmi:25-26-data_mining_project_includingt5.pdf |Complete Project Description}}
+  * **Deadline.** January 5th, 2026.
+  * **Delivery instructions.** The final deadline of the project is **5th January 2026 at 23:59**. This deadline is **STRICT**: no extension is possible because the winter exam session starts afterwards. **Groups that do not deliver the project by 5th January will need to take the written exam during the exam sessions.** Each group must deliver by email to anna.monreale@unipi.it, mattia.setzu@unipi.it, lorenzo.mannocci@di.unipi.it a zipped folder named DM_GroupID.zip containing 4 folders and 1 pdf file: a folder named DM_GroupID_TASK1, containing the source code of data understanding; a folder named DM_GroupID_TASK2, containing the source code of data clustering; a folder named DM_GroupID_TASK3, containing the source code of classification and explanation analysis; a folder named DM_GroupID_TASK4, containing the source code of time series analysis; and a pdf file of at most 25+2 pages, including figures, discussing the results of the tasks (25 pages for tasks 1-4 and 2 pages for task 5). The name of this file must be DM_Report_GroupID.pdf. The file must contain the list of authors (i.e., members of the group). **The subject of the email must be “DMProject25_GroupID”.**
+  * **How to book for the exam colloquium?** On https://esami.unipi.it/ you can find the dates for the exam: one in January and one in February. Each student must register for one of the two dates. These are not the dates of the colloquium, but we will use the list of registered students to organize the exam dates. We will share with you a calendar for the oral exam.
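A group may want to check its archive against the delivery instructions before sending it. The sketch below uses only the Python standard library. It assumes that `GroupID` in the required names is replaced by the group's own identifier, and that the four folders and the report may sit either at the top level of the zip or inside one enclosing folder; neither point is spelled out on this page. The helper names (`expected_names`, `missing_items`, `check_archive`) are hypothetical, and the page limit is not checked.

```python
import zipfile
from pathlib import PurePosixPath


def expected_names(group_id):
    """Names required by the delivery instructions, with GroupID substituted (assumption)."""
    return {
        "archive": f"DM_{group_id}.zip",
        "folders": [f"DM_{group_id}_TASK{n}" for n in range(1, 5)],
        "report": f"DM_Report_{group_id}.pdf",
        "subject": f"DMProject25_{group_id}",
    }


def missing_items(entry_names, group_id):
    """Return the required folders and report that are absent from a list of zip entries."""
    expected = expected_names(group_id)
    paths = [PurePosixPath(name) for name in entry_names]
    # Every path component seen in the archive, at any nesting depth.
    # Simplification: a plain file with a folder's name would also count.
    components = {part for path in paths for part in path.parts}
    missing = [folder for folder in expected["folders"] if folder not in components]
    if not any(path.name == expected["report"] for path in paths):
        missing.append(expected["report"])
    return missing


def check_archive(zip_source, group_id):
    """Open a zip (path or file-like object) and list what is missing from it."""
    with zipfile.ZipFile(zip_source) as archive:
        return missing_items(archive.namelist(), group_id)


if __name__ == "__main__":
    entries = ["DM_7_TASK1/eda.ipynb", "DM_7_TASK2/clustering.ipynb", "DM_Report_7.pdf"]
    print(missing_items(entries, "7"))
```

An empty list from `check_archive` means all four task folders and the report were found; it says nothing about their contents.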
-====== Exams ======
-TBD
 ====== Previous years =====
+[[DM-INF 2024-2025]]
 [[DM-INF 2023-2024]]