Data Mining A.A. 2019/20

DM 1: Foundations of Data Mining (6 CFU)

Instructors:

DM 2: Advanced Topics on Data Mining and Applications (6 CFU)

Instructors:

DM: Data Mining (9 CFU)

Instructors:

News

Learning goals

… a new kind of professional has emerged, the data scientist, who combines the skills of software programmer, statistician and storyteller/artist to extract the nuggets of gold hidden under mountains of data. Hal Varian, Google’s chief economist, predicts that the job of statistician will become the “sexiest” around. Data, he explains, are widely available; what is scarce is the ability to extract wisdom from them.

Data, data everywhere. The Economist, Special Report on Big Data, Feb. 2010.

The wide availability of data from relational databases, the web, and other sources motivates the study of data analysis techniques that enable a better understanding of the data and an easier use of the results in decision-making processes. The course aims to introduce the basic concepts of the knowledge extraction process, the main data mining techniques, and the related algorithms. Particular emphasis is placed on methodological aspects, presented through several classes of paradigmatic applications such as Market Basket Analysis, market segmentation, and fraud detection. Finally, the course introduces the privacy and ethical issues inherent in the use of inference techniques on data, of which the analyst must be aware. The course consists of the following parts:

  1. the basic concepts of the knowledge extraction process: data understanding and preparation, data types, data measures and similarity;
  2. the main data mining techniques (association rules, classification, and clustering), studied in both their formal and implementation aspects;
  3. some case studies in marketing and customer management support, fraud detection, and epidemiological studies;
  4. the final part of the course introduces the privacy and ethical issues inherent in the use of inference techniques on data, of which the analyst must be aware.

Reading about the "data scientist" job

Hours and Rooms

DM1 & DM

Classes

| Day of Week | Hour | Room |
| --- | --- | --- |
| Monday | 14:00 - 16:00 | Room E1 |
| Wednesday | 16:00 - 18:00 | Room A1 |
| Friday | 11:00 - 13:00 | Room C1 |

Office hours:

DM 2

Classes

| Day of Week | Hour | Room |
| --- | --- | --- |
| Monday | 09:00 - 11:00 | C |
| Wednesday | 16:00 - 18:00 | C1 |

Office hours:

Learning Material

Textbook

Slides of the classes

Past Exams

* Exercises on Clustering: ex._clustering.pdf

* Some texts of past exams on DM1 (6 CFU):

* Some solutions of past exams containing exercises on KNN and Naive Bayes classifiers DM1 (9CFU):

* Some exercises (partially with solutions) on sequential patterns and time series can be found in the following texts of exams from the last years:

Data mining software

Class calendar (2019/2020)

First part of course, first semester (DM1 - Data mining: foundations & DM - Data Mining)

Day Topic Learning material Instructor
1. 16.09 14:00-16:00 Overview. Introduction to KDD Course Overview Introduction DM Pedreschi
18.09 16:00-18:00 Lecture canceled (Event at Scuola S. Anna Information in News Section of this page) Pedreschi
2. 20.09 11:00-13:00 Introduction to KDD: technologies, Application and Data Pedreschi
3. 23.09 14:00-16:00 Data Understanding (from the Berthold book) Slides DU Slides on Descriptive Statistics, useful for clarifying some statistical notions. Unfortunately this material is only in Italian. Monreale
4. 25.09 16:00-18:00 Data Preparation Slides DP Monreale
27.09 11:00-13:00 Climate Strike
5. 30.09 14:00-16:00 Introduction to Python. Python Introduction Monreale
6. 02.10 16:00-18:00 Clustering: Introduction + Centroid-based clustering, K-means Clustering: Intro and K-means Pedreschi
7. 04.10 11:00-13:00 Lab: Data Understanding & Preparation in Knime Knime: 01_data_understanding.zip Data: Titanic File Monreale
8. 07.10 14:00-16:00 Lab: DU Python + Project presentation Python: titanic_data_understanding2.ipynb.zip Monreale
9. 09.10 16:00-18:00 Clustering: K-means + Hierarchical 5.basic_cluster_analysis-hierarchical.pdf Monreale
10. 11.10 11:00-13:00 Lecture canceled for the Internet Festival Pedreschi
11. 14.10 14:00-16:00 Clustering: DBSCAN & VALIDITY 6.basic_cluster_analysis-dbscan-validity.pdf Pedreschi
12. 16.10 16:00-18:00 Exercises on Clustering Tool for Dm ex: Didactic Data Mining Ex. Clustering PDF Ex. Clustering PPTX Monreale
13. 18.10 11:00-13:00 Lab: Clustering clustering_knime clustering_python Monreale
14. 21.10 14:00-16:00 Classification 7.chap3_basic_classification-2019.pdf A visual intro to machine learning Pedreschi
15. 23.10 16:00-18:00 Classification Pedreschi
16. 25.10 11:00-13:00 Classification Pedreschi
17. 28.10 14:00-16:00 Lab: Classification knime_classification python_classification Monreale
18. 30.10 16:00-18:00 Exercises Classification + Discussion Clustering ex-classification.pdf Monreale
19. 04.11 14:00-15:00 Pattern Mining Note: the lecture will end at 15:00 to allow participation in the Informatica50 event (see news) slides Pedreschi
20. 06.11 16:00-18:00 Pattern Mining Pedreschi
08-14.11 Project work
21. 15.11 11:00-13:00 Exercises and Lab on Pattern Mining knime_pattern python_pattern https://anaconda.org/conda-forge/pyfim, http://www.borgelt.net/pyfim.html ex-frequentpatterns-ar.pdf Monreale
18.11 14:00-16:00 Lecture canceled due to weather conditions
20.11 16:00-18:00 Lecture canceled
22. 22.11 11:00-13:00 Exercises Classification Monreale
The following classes are dedicated to DM (9 CFU)
23. 25.11 14:00-16:00 Alternative methods for classification/1 K-Nearest Neighbors & Naive Bayes Pedreschi
24. 27.11 16:00-18:00 Alternative methods for classification/2 Wisdom of the crowd & Ensemble methods: Bagging, Random Forest & Boosting Galton's "Vox Populi" 1907 Nature paper Pedreschi
25. 29.11 11:00-13:00 Alternative methods for classification/3 Recap Ensemble methods & Hints to Rule-based classification Pedreschi
26. 02.12 14:00-16:00 Alternative Methods for Pattern Mining + Ex on KNN and NB fp-growth.pdf KNN & NB Monreale
27. 04.12 16:00-18:00 Alternative Methods for Clustering 1-alternative-clustering-2019.pdf 2-transactionalclustering-2019.pdf Monreale
28. 06.12 11:00-13:00 Sequential Pattern Mining Sequential patterns Pedreschi
29. 09.12 14:00-16:00 Exercises on sequential pattern mining & ROCK exsequentialpatternmining.pdf ex-clustering-rock.pdf Monreale
30. 11.12 16:00-18:00 Black Box Explanations 2019-dm_xai.pdf Material: LORE LIME Survey ABELE Monreale
31. 13.12 11:00-13:00 Exercises on written exam - all students 9_cfu_ex.pdf ex_clustering_fpm_dt.pdf hierarchical_max_sim.pdf Monreale
32. 16.12 13:30-16:00 Mid-term Test (Rooms A, E1, C1) Monreale
33. 18.12 16:00-18:00 Privacy in DM. Project. privacydt.pdf Overview on Privacy Privacy by design Monreale

Second part of course, second semester (DM2 - Advanced Topics on Data Mining and Applications)

Day Room Topic Learning material Instructor (Guidotti)
1. 17.02.2020 09:00-11:00 C Introduction, Instance-based and Bayesian Classifiers Intro, Libraries, Instance-Based and Bayesian Classifiers
2. 19.02.2020 16:00-18:00 C1 Linear and Logistic Regression, Dimensionality Reduction, Exercises KNN and Naive Bayes Regression, Dimensionality Reduction, Ex_KNN_NB_Lift, Appendix
3. 24.02.2020 09:00-11:00 C Imbalanced Learning, Performance Evaluation and Rule-based Classifiers Imbalanced Learning Rule-based Classifiers
4. 26.02.2020 16:00-18:00 C1 Exercises Lift, ROC, KNN and Naive Bayes. Lab KNN and Naive Bayes. Ex_KNN_NB_Lift, Lab_KNN_NB, Data Preparation, Churn Dataset, Iris Dataset
5. 02.03.2020 09:00-11:00 C Lab Regression, Dimensionality Reduction, Imbalanced Learning + CAT1 Regression, Dimensionality Reduction, Imbalanced Learning Airquality Dataset
6. 04.03.2020 16:00-18:00 C1 CRISP-DM, SVM, Intro NN CRISP-DM, SVM, NN
7. 09.03.2020 09:00-11:00 online Neural Network, Exercises NN NN , Ex_NN_Ensemble
8. 11.03.2020 16:00-18:00 online Neural Network, Exercises NN, Deep Neural Network, Intro Ensemble, Exercises Ensemble NN , DNN Ex_NN_Ensemble
9. 16.03.2020 09:00-11:00 online Ensemble Classifiers, Exercises Ensemble Ensemble, Ex_NN_Ensemble
10. 18.03.2020 16:00-18:00 online Lab SVM, Neural Network, Ensemble Lab_SVM_NN_RF
11. 23.03.2020 09:00-11:00 online Time Series Similarity, Ex DTW Time Series Similarity, Ex_DTW
12. 25.03.2020 16:00-18:00 online Time Series Motif/Shapelet, Ex Matrix Profile Time Series Motif/Shapelet, Ex_MP
13. 30.03.2020 09:00-11:00 online Time Series Stationarity and Forecasting Time Series Forecasting
14. 01.04.2020 16:00-18:00 online Lab Time Series Lab_TS
15. 06.04.2020 09:00-11:00 online Time Series Classification, Lab Time Series Time Series Classification, Lab_TS, Data Partitioning
- 08.04.2020 Reading/Project Week
- 15.04.2020 Reading/Project Week
16. 20.04.2020 09:00-11:00 online Sequential Pattern Mining SPM
17. 22.04.2020 16:00-18:00 online SPM Time Constraints, Exercises, Lab Ex_SPM, Lab_SPM
18. 27.04.2020 09:00-11:00 online Advanced Clustering, Ex, SPM, Lab EM, X-Means Advanced Clustering , Lab_AC
19. 29.04.2020 16:00-18:00 online Transactional Clustering, Ex TC, Lab K-Mode Ex_SPM_TC
20. 04.05.2020 09:00-11:00 online Anomaly Detection, Ex AD Anomaly Detection , Ex_AD
21. 06.05.2020 16:00-18:00 online Anomaly Detection, Ex AD, Lab AD Anomaly Detection , Ex_AD, Lab_AD
22. 11.05.2020 09:00-11:00 online Ethics: Privacy Privacy
23. 13.05.2020 16:00-18:00 online Ethics: Explainability Explainability
24. 18.05.2020 09:00-11:00 online Ethics: Local Explainability, Inspection, Transparent Methods, Lab Explainability, Lab_XAI
- 20.05.2020 Reading/Project Week
- 25.05.2020 Reading/Project Week
- 27.05.2020 Reading/Project Week

Exams

Exam DM part I (DMF)

RULES FOR EXAMS for COMPUTER SCIENCE - 9CFU: EXAM RULES Summer Session - 9 CFU

RULES FOR EXAMS for DATA SCIENCE & BI and DIGITAL HUMANITIES - DM1(6CFU): EXAM RULES Summer Session - DM1(6CFU)

The exam is composed of two parts:

Tasks of the project:

  1. Data Understanding: Explore the dataset with the analytical tools studied and write a concise “data understanding” report describing data semantics, assessing data quality, the distribution of the variables and the pairwise correlations. (see Guidelines for details)
  2. Clustering analysis: Explore the dataset using various clustering techniques. Carefully describe your decisions for each algorithm and the advantages provided by the different approaches. (see Guidelines for details)
  3. Classification: Explore the dataset using classification trees. Use them to predict the target variable. (see Guidelines for details)
  4. Association Rules: Explore the dataset using frequent pattern mining and association rule extraction. Then use the rules to predict a variable, either to replace missing values or to predict the target variable. (see Guidelines for details)
  5. ADDITIONAL TASK for DM 9 CFU (OPTIONAL): Computer science students (DM 9 CFU) may deliver one additional project task, chosen from the following, for a bonus of 3 points:
    1. Classification: Compare the results of decision tree classification with those of KNN and Naive Bayes, also analysing the runtime of the training and test phases.
    2. Clustering: Is it possible to apply EM clustering? Does the quality of the clustering result improve?

Guidelines for the project are here.
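
As a rough illustration of the frequent pattern mining and association rule task listed above, the following is a minimal Apriori-style sketch in plain Python. It is not the course's reference implementation (the labs use KNIME and pyfim); the function names `frequent_itemsets` and `association_rules`, the toy transactions, and the thresholds are all assumptions made for this example.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for every itemset with support >= min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)
    # Level 1 candidates: every single item that occurs in the data.
    candidates = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while candidates:
        # Count the support of each candidate and keep the frequent ones.
        level = {}
        for c in candidates:
            support = sum(1 for t in transactions if c <= t) / n
            if support >= min_support:
                level[c] = support
        frequent.update(level)
        # Join step: merge frequent k-itemsets into (k+1)-item candidates.
        k += 1
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
    return frequent


def association_rules(frequent, min_confidence):
    """Return (antecedent, consequent, support, confidence) tuples."""
    rules = []
    for itemset, support in frequent.items():
        for r in range(1, len(itemset)):
            for items in combinations(sorted(itemset), r):
                antecedent = frozenset(items)
                # Subsets of a frequent itemset are frequent, so the lookup succeeds.
                confidence = support / frequent[antecedent]
                if confidence >= min_confidence:
                    rules.append((antecedent, itemset - antecedent, support, confidence))
    return rules


# Hypothetical toy transactions, used only to exercise the functions.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]

frequent = frequent_itemsets(transactions, min_support=0.6)
rules = association_rules(frequent, min_confidence=0.8)

# With these toy data the only rule reaching confidence 0.8 is {beer} -> {diapers}.
for antecedent, consequent, support, confidence in rules:
    print(sorted(antecedent), "->", sorted(consequent), support, confidence)
```

To use rules for prediction, as the task asks, one possible approach is to keep only the rules whose consequent is a value of the variable of interest and, for each record, apply the matching rule with the highest confidence.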

Exam DM part II (DMA)

The exam is composed of three parts:

Exam sessions

Mid-term exams

|  | Date | Hour | Place | Notes | Marks |
| --- | --- | --- | --- | --- | --- |
| DM1: First Mid-term 2018 | 16.12.2019 | 13:30-16:00 | Room E1, C1, A | Please, use the system for registration: https://esami.unipi.it/ |  |

Regular exam sessions

| Session | Date | Time | Room | Notes | Marks |
| --- | --- | --- | --- | --- | --- |
| 1 | 16.01.2019 | 14:00 - 18:00 | Room E |  |  |
| 2 | 06.02.2019 | 14:00 - 18:00 | Room E |  |  |
| 3 | 19.06.2019 | 09:00 - 13:00 | Room A1 | Oral exam on DM1 by 15 July. If you cannot take it by that date, you can take the oral exam in September. | Results |
| 4 | 10.07.2019 | 09:00 - 13:00 | Room A1 | Oral exam on DM1 by 15 July. If you cannot take it by that date, you can take the oral exam in September. | Results |
| 5 | 08.06.2020 | 09:00 - 18:00 | Microsoft Teams | From 08/06 to 25/06. Please register (here) and select your slot here. Remember to submit the project one week before the exam. It would be helpful if you submitted the project by 01/06. |  |
| 6 | 26.06.2020 | 09:00 - 18:00 | Microsoft Teams | From 26/06 to 16/07. Please register (here) and select your slot here. Remember to submit the project one week before the exam. It would be helpful if you submitted the project by 21/06. |  |
| 7 | 17.07.2020 | 09:00 - 18:00 | Microsoft Teams | From 17/07 to 29/07. Please register (here) and select your slot at the agenda link, which will be available from 12/07 only to those registered for the exam. Remember to submit the project one week before the exam. It would be helpful if you submitted the project by 10/07. It is mandatory to submit the project before 15/07. |  |

Previous years