Data Mining A.A. 2020/21

DM1 - Data Mining: Foundations (6 CFU)

Instructors:

Teaching Assistant

DM2 - Data Mining: Advanced Topics and Applications (6 CFU)

Instructors:

News

Learning Goals

Hours and Rooms

DM1

Classes

Day of Week    Hour             Room
Monday         14:00 - 16:00    MS Teams
Wednesday      16:00 - 18:00    MS Teams

Office hours:

DM2

Classes

Day of Week    Hour             Room
Monday         14:00 - 16:00    MS Teams
Wednesday      16:00 - 18:00    MS Teams

Office hours:

Learning Material

Textbook

Slides

Software

Class Calendar (2020/2021)

First Semester (DM1 - Data Mining: Foundations)

Day Room Topic Learning material Instructor
1. 16.09.2020 14:00-16:00 MS Teams Introduction. Course Overview Introduction DM Pedreschi
2. 23.09.2020 16:00-18:00 MS Teams Data Understanding Slides DU Slides on Descriptive Statistics Pedreschi
3. 28.09.2020 14:00-16:00 MS Teams Data Understanding Pedreschi
4. 30.09.2020 16:00-18:00 MS Teams Data Preparation Slides DP Pedreschi
5. 05.10.2020 14:00-16:00 MS Teams Lab: Introduction to Python and Knime Python Introduction, Knime simple workflow Lecture 5 part 1, Lecture 5 part 2 Guidotti, Citraro
6. 07.10.2020 16:00-18:00 MS Teams Lab: Data Understanding & Preparation Dataset: Iris, Titanic, Knime: 01_data_understanding.zip Python: titanic_data_understanding2.ipynb.zip Lecture 6 part 1, Lecture 6 part 2 Guidotti, Citraro
7. 12.10.2020 14:00-16:00 MS Teams Clustering: Intro & K-means Slides clustering 1 Nanni
8. 14.10.2020 16:00-18:00 MS Teams Clustering: Hierarchical methods Slides clustering 2 Nanni
9. 19.10.2020 14:00-16:00 MS Teams Clustering: Density-based methods and exercises Slides clustering 3, Clustering exercises Nanni
10. 21.10.2020 16:00-18:00 MS Teams Clustering: Validation methods and exercises Slides clustering 4 Nanni
11. 26.10.2020 14:00-16:00 MS Teams Lab: Clustering Knime , Python Iris Python Titanic Citraro
12. 28.10.2020 16:00-18:00 MS Teams Classification: Intro and Decision Trees Slides classification Nanni
02.11.2020 14:00-16:00 No Lecture. Project Week.
04.11.2020 16:00-18:00 No Lecture. Project Week.
13. 09.11.2020 14:00-16:00 MS Teams Classification: Decision Trees/2 Nanni
14. 11.11.2020 16:00-18:00 MS Teams Classification: Decision Trees/3 Nanni
15. 16.11.2020 14:00-16:00 MS Teams Classification: Decision Trees/4 Sample exercise Nanni
16. 18.11.2020 16:00-18:00 MS Teams Classification: Decision Trees/5 + Exercises Exercises 1, Exercises 2 Nanni
17. 23.11.2020 14:00-16:00 MS Teams Classification: KNN Slides, Exercise 1 (KNN only), Exercise 2 Nanni
18. 25.11.2020 16:00-18:00 MS Teams Lab: Classification knime_classification python_classification python_classification2 Citraro
19. 02.12.2020 16:00-18:00 MS Teams Pattern & Association Rule Mining - Apriori algorithm for frequent itemset mining 2-dm2-restructured_assoc-2020.pdf Pedreschi
20. 07.12.2020 14:00-16:00 MS Teams Pattern & Association Rule Mining - Rule mining and evaluation, Closed and maximal itemsets, Multi-dimensional, Quantitative and Multi-level association rules Pedreschi
21. 14.12.2020 14:00-16:00 Lab: Pattern Mining knime_pattern python_pattern https://anaconda.org/conda-forge/pyfim, http://www.borgelt.net/pyfim.html ex-frequentpatterns-ar.pdf Citraro

Second Semester (DM2 - Data Mining: Advanced Topics and Applications)

Day Room Topic Learning material Instructor Recordings
1. 15.02.2021 14:00-16:00 MS Teams Introduction, CRISP, KNN Intro, CRISP, KNN Guidotti 1stPart, 2ndPart
2. 17.02.2021 16:00-18:00 MS Teams Performance Evaluation Eval, occupancy_data, KNN_Eval_Notebook Guidotti Dataset, Lecture
3. 22.02.2021 14:00-16:00 MS Teams Imbalanced Learning ImbLearn, DimRed_notebook, ImbLearn_notebook Guidotti 1stPart, 2ndPart
4. 23.02.2021 16:00-18:00 MS Teams Anomaly Detection MLE, Anomaly Detection, Anomaly_notebook Guidotti 1st Part, 2nd Part
5. 01.03.2021 14:00-16:00 MS Teams Anomaly Detection Anomaly Detection, Anomaly_notebook Guidotti 1st Part, 2nd Part
6. 03.03.2021 16:00-18:00 MS Teams Anomaly Detection Anomaly Detection, Anomaly_notebook, Extended Isolation Forest link Guidotti 1st Part, 2nd Part
7. 08.03.2021 14:00-16:00 MS Teams Naive Bayes Classifier NBC, NBC_notebook, Ex1_Miro, Ex2_Miro Guidotti 1st Part, 2nd Part
10.03.2021 16:00-18:00 Lecture “Da Pisa al Fermilab di Chicago: Viaggio verso un rivoluzionario computer quantistico” (From Pisa to Fermilab in Chicago: a journey towards a revolutionary quantum computer) by Prof. Anna Grassellino Link Guidotti
8. 15.03.2021 14:00-16:00 MS Teams Linear and Logistic Regression, Rule-based Classifiers Regression, RuleBased, Regression_Notebook Guidotti 1stPart, 2ndPart
9. 17.03.2021 16:00-18:00 MS Teams Rule-based Classifiers, Support Vector Machines RuleBased, RuleBased_Notebook, SVM, SVM_Notebook Guidotti 1st Part, 2nd Part
10. 22.03.2021 14:00-16:00 MS Teams (Nonlinear) Support Vector Machines, Linear Perceptron SVM, SVM_Notebook, Linear Perceptron Guidotti 1st Part, 2nd Part
11. 24.03.2021 16:00-18:00 MS Teams Neural Networks, Deep Neural Networks Neural Network, NN_Notebook Guidotti 1st Part, 2nd Part
- 25.03.2021 15:00-17:00 MS Teams Neural Networks Forward and Backpropagation Example, Case Study Music NN_Implementation, Case Study Guidotti 1st Part, 2nd Part
12. 29.03.2021 14:00-16:00 MS Teams Neural Networks (Training Tricks), Ensemble Classifiers Ensemble Classifiers Guidotti 1st Part, 2nd Part
13. 31.03.2021 16:00-18:00 MS Teams Ensemble Classifiers Ensemble Classifiers, Ensemble_Notebook Guidotti 1st Part, 2nd Part
14. 12.04.2021 14:00-16:00 MS Teams Time Series Similarity Time Series Similarity Guidotti 1st Part, 2nd Part
15. 14.04.2021 16:00-18:00 MS Teams Time Series Similarity, Approximation and Clustering Time Series Similarity, Time Series Approximation and Clustering Guidotti 1st Part, 2nd Part
16. 19.04.2021 14:00-16:00 MS Teams Time Series Motifs TS_Similarty_Notebook, Time Series Motifs, TS Datasets, Keras Accuracy Guidotti 1st Part, 2nd Part
17. 21.04.2021 16:00-18:00 MS Teams Time Series Classification Time Series Classification, TS_Plot, TS_Similarty_Notebook (updated) Guidotti 1st Part, 2nd Part, Office Hours
18. 26.04.2021 14:00-16:00 MS Teams Time Series Classification Time Series Classification, TS_Shapelet_Motif_Notebook, TS_classification_Notebook, TS_from_MP3_Notebook Guidotti 1st Part, 2nd Part, Tutorial MP3
19. 28.04.2021 16:00-18:00 MS Teams Sequential Pattern Mining Sequential Pattern Mining Guidotti 1st Part, 2nd Part
20. 03.05.2021 14:00-16:00 MS Teams Sequential Pattern Mining (Timing Constraints) Sequential Pattern Mining, SPM_Notebook, TS_extraction_RMS, RMSE_TS Dataset Guidotti 1st Part, 2nd Part, Tutorial RMSE
21. 05.05.2021 16:00-18:00 MS Teams Advanced Clustering Methods Advanced Clustering Methods Guidotti 1st Part, 2nd Part
22. 10.05.2021 14:00-16:00 MS Teams Transactional Clustering Methods Transactional Clustering Methods, ACM_notebooks Guidotti Hint Clus TS 1st Part, 2nd Part
23. 12.05.2021 16:00-18:00 MS Teams Explainable Artificial Intelligence XAI, ACM_Notebook Guidotti ACM_Notebook 1st Part, 2nd Part
24. 17.05.2021 14:00-16:00 MS Teams Explainable Artificial Intelligence XAI, XAI_Notebook Guidotti 1st Part, 2nd Part

Exams

Exam DM1

The exam is composed of two parts:

Tasks of the project:

  1. Data Understanding: Explore the dataset with the analytical tools studied and write a concise “data understanding” report that describes the data semantics and assesses data quality, the distributions of the variables, and the pairwise correlations. (see Guidelines for details)
  2. Clustering analysis: Explore the dataset using various clustering techniques. Carefully describe your decisions for each algorithm and the advantages of the different approaches. (see Guidelines for details)
  3. Classification: Explore the dataset using classification trees. Use them to predict the target variable. (see Guidelines for details)
  4. Association Rules: Explore the dataset using frequent pattern mining and association rule extraction. Then use the rules to predict a variable, either to replace missing values or to predict the target variable. (see Guidelines for details)
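
As an illustration of the quantities involved in task 4, the following is a minimal sketch in plain Python of support, confidence and frequent itemset enumeration. The transactions, item names and function names are hypothetical and are not taken from the project dataset or the guidelines. The enumeration is brute force, not the Apriori algorithm covered in class, and is only meant to make the definitions concrete.

```python
from itertools import combinations

# Toy transactions (hypothetical data, not the project dataset).
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]


def count(itemset, transactions):
    """Number of transactions that contain every item of the itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions)


def support(itemset, transactions):
    """Fraction of transactions that contain the itemset."""
    return count(itemset, transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Confidence of the rule antecedent -> consequent."""
    both = set(antecedent) | set(consequent)
    return count(both, transactions) / count(antecedent, transactions)


def frequent_itemsets(transactions, min_support, max_size=3):
    """Brute-force enumeration of itemsets with support >= min_support.

    This checks every candidate up to max_size items; it does not use
    Apriori pruning and is only practical for tiny examples.
    """
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, max_size + 1):
        for candidate in combinations(items, size):
            s = support(candidate, transactions)
            if s >= min_support:
                result[frozenset(candidate)] = s
    return result


rule_support = support({"diapers", "beer"}, transactions)
rule_confidence = confidence({"diapers"}, {"beer"}, transactions)
frequent = frequent_itemsets(transactions, min_support=0.6)
```

A rule such as `{"diapers"} -> {"beer"}` can then be used for prediction by assigning the consequent whenever a record contains the antecedent, provided the confidence is high enough for the purpose at hand. For the project itself, a dedicated library such as PyFIM (linked in the class calendar) is the more practical choice.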

Guidelines for the project are here.

Exam DM part II (DMA)

Exam Rules

Exam Booking Periods

Exam Booking Agenda

The link to the agenda for booking a slot for the exam is displayed at the end of the registration. During the exam your camera must remain on and you must be able to share your screen. The exam may require the use of the Miro platform (https://miro.com/app/dashboard/).

The exam is composed of two parts:

Project Guidelines

N.B. When “solving the classification task”, remember (i) to test, when needed, different criteria for estimating the parameters of the algorithms, and (ii) to evaluate the classifiers (e.g., Accuracy, F1, Lift Chart) in order to compare the results obtained with an imbalanced-learning technique against those obtained on the “original” dataset.
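
To see why point (ii) asks for more than accuracy, here is a minimal sketch in plain Python that compares accuracy and F1 for two sets of predictions on an imbalanced test set. The labels and predictions are made up for illustration: `pred_original` stands for a hypothetical classifier trained on the original data that always predicts the majority class, and `pred_rebalanced` for a hypothetical classifier trained after rebalancing. Neither comes from the course material.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for the given positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    return tp, fp, fn, tn


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_score(y_true, y_pred, positive=1):
    """F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when the denominator is 0."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred, positive)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical imbalanced test set: 5 positives, 95 negatives.
y_true = [1] * 5 + [0] * 95

# Hypothetical predictions (illustrative only).
pred_original = [0] * 100                              # always predicts the majority class
pred_rebalanced = [1] * 4 + [0] * 1 + [1] * 10 + [0] * 85  # finds 4 of 5 positives, 10 false alarms

for name, pred in [("original", pred_original), ("rebalanced", pred_rebalanced)]:
    print(name, "accuracy:", accuracy(y_true, pred), "F1:", round(f1_score(y_true, pred), 3))
```

In this toy setting the majority-class predictor has the higher accuracy but an F1 of zero on the minority class, which is why the comparison should report class-sensitive measures alongside accuracy.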

Exam Dates

Exam Sessions

Session Date Time Room Notes Marks
1. 16.01.2019 14:00 - 18:00 MS Teams Please use the online system for registration: https://esami.unipi.it/

Past Exams

Reading About the "Data Scientist" Job

… a new kind of professional has emerged, the data scientist, who combines the skills of software programmer, statistician and storyteller/artist to extract the nuggets of gold hidden under mountains of data. Hal Varian, Google’s chief economist, predicts that the job of statistician will become the “sexiest” around. Data, he explains, are widely available; what is scarce is the ability to extract wisdom from them.

Data, data everywhere. The Economist, Special Report on Big Data, Feb. 2010.

Previous years