Data Mining A.A. 2021/22

DM1 - Data Mining: Foundations (6 CFU)

Instructors:

Dino Pedreschi
- KDDLab, Università di Pisa
- http://www-kdd.isti.cnr.it
- dino [dot] pedreschi [at] unipi [dot] it

Mirco Nanni
- KDDLab, ISTI - CNR, Pisa
- http://www-kdd.isti.cnr.it
- mirco [dot] nanni [at] isti [dot] cnr [dot] it

Teaching Assistant

Salvatore Citraro
- KDDLab, Università di Pisa
- http://www-kdd.isti.cnr.it
- salvatore [dot] citraro [at] phd [dot] unipi [dot] it

DM2 - Data Mining: Advanced Topics and Applications (6 CFU)

Instructors:

Riccardo Guidotti
- KDDLab, Università di Pisa
- https://kdd.isti.cnr.it/people/guidotti-riccardo
- riccardo [dot] guidotti [at] di [dot] unipi [dot] it

Teaching Assistant

Francesco Spinnato
- KDDLab, Scuola Normale Superiore
- https://kdd.isti.cnr.it/people/spinnato-francesco
- francesco [dot] spinnato [at] sns [dot] it

News

[24.02.2022] Project Groups link
[28.04.2022] The exams are held in person, subject to the availability of adequate spaces. For particular categories of students (students with disabilities and international or Erasmus students), the exams are offered remotely upon request. The student submits the request when completing the registration form for the exam or, after the registration deadline has closed, by filling in the form available at the following link: http://su.unipi.it/DichiarazioneEsameRemoto.
[05.05.2022] Rules for DM2 exam available here.
[26.05.2022] It is now possible to book slots for oral exams here. The exams will be in Aula X1.
[29.08.2022] The exams for the September session will be held in Sala Polifunzionale 308 in the Department of Computer Science.

Learning Goals

DM1
- Fundamental concepts of knowledge discovery from data
- Data understanding
- Data preparation
- Clustering
- Classification
- Pattern Mining and Association Rules

DM2
- Outlier Detection
- Dimensionality Reduction
- Regression
- Advanced Classification
- Time Series Analysis
- Sequential Pattern Mining
- Advanced Clustering
- Transactional Clustering
- Ethical Issues

Hours and Rooms

DM1

Classes

Day of Week	Hour	Room
Monday	11:00 - 13:00	Aula C / MS Teams
Thursday	11:00 - 13:00	Aula A1 / MS Teams

Office hours:

Prof. Pedreschi: Monday 16:00 - 18:00, Online
Prof. Nanni: appointment by email, Online

DM2

Classes

Day of Week	Hour	Room
Monday	11:00 - 13:00	MS Teams
Thursday	11:00 - 13:00	MS Teams

Office hours:

Room 268 Dept. of Computer Science
Tuesday: 15-17, Room: MS Teams
Appointment by email

Learning Material

Textbooks

Pang-Ning Tan, Michael Steinbach, Vipin Kumar. Introduction to Data Mining. Addison Wesley, ISBN 0-321-32136-7, 2006
- http://www-users.cs.umn.edu/~kumar/dmbook/index.php
- Chapters 3, 5 and 7 are also available at the publisher's website.
Berthold, M.R., Borgelt, C., Höppner, F., Klawonn, F. Guide to Intelligent Data Analysis. Springer Verlag, 1st Edition, 2010. ISBN 978-1-84882-259-7
Laura Igual et al. Introduction to Data Science: A Python Approach to Concepts, Techniques and Applications. 1st Edition, 2017.
Jake VanderPlas. Python Data Science Handbook: Essential Tools for Working with Data. 1st Edition.

Slides

The slides used in the course will be added to the calendar after each class. Most of them are taken from the slides provided by the textbook's authors: Slides for "Introduction to Data Mining".

Software

Python - Anaconda (version 3.7): Anaconda is the leading open data science platform powered by Python. Download page (the following libraries are already included)
Scikit-learn: Python library with tools for data mining and data analysis. Documentation page
Pandas: pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language. Documentation page
KNIME The Konstanz Information Miner. Download page
WEKA Data Mining Software in Java. University of Waikato, New Zealand. Download page
Didactic Data Mining DDM

Class Calendar (2021/2022)

First Semester (DM1 - Data Mining: Foundations)

	Day	Room	Topic	Learning material	Recording	Instructor
1.	16.09.2021 11:00-12:45	Aula Fib A1	Introduction.	Introducing DM1 Project-work guidelines (updated 22.11.2021)	Lecture 1	Pedreschi
2.	20.09.2021 11:00-12:45	Aula Fib C	Course overview	Overview of contents	Lecture 2	Pedreschi
3.	23.09.2021 11:00-12:45	Aula Fib A1	Data Understanding	Slides	Lecture 3	Pedreschi
4.	27.09.2021 11:00-12:45	Aula Fib C	Data Preparation	Slides	Lecture 4	Pedreschi
5.	30.09.2021 11:00-12:45	Aula Fib A1	Lab: Data Understanding & Preparation – Python	Python Introduction Dataset: Iris Hands-On Python (Iris)	Lecture 5	Citraro
6.	04.10.2021 11:00-12:45	Aula Fib C	Lab: Data Understanding & Preparation – Python (cont.) & KNIME	Dataset: Titanic Hands-On Python (Titanic), Titanic DU+DP (complete) KNIME: Intro, KNIME DU+DP	Lecture 6	Citraro
7.	07.10.2021 11:00-12:45	Aula Fib A1	Clustering: Intro & K-means	Clustering intro and k-means [revised version]	Lecture 7	Nanni
	~~11.10.2021 11:00-12:45~~	~~Aula Fib C~~
8.	14.10.2021 11:00-12:45	Aula Fib A1	Clustering: k-means		Lecture 8	Nanni
	~~18.10.2021 11:00-12:45~~	~~Aula Fib C~~
9.	21.10.2021 11:00-12:45	Aula Fib A1	Clustering: Hierarchical methods	Clustering: Hierarchical Methods	Lecture 9	Nanni
10.	25.10.2021 11:00-12:45	Aula Fib C	Clustering: density-based methods & exercises	Clustering: Density-based methods	Lecture 10	Nanni
11.	28.10.2021 11:00-12:45	Aula Fib A1	Lab: Clustering	Python Hands-On Clust. (Iris) Python Titanic Knime	Lecture 11	Citraro
12.	04.11.2021 11:00-12:45	Aula Fib A1	Classification: intro and decision trees	Classification and decision trees (updated 11.11.2021)	Lecture 12	Nanni
13.	08.11.2021 11:00-12:45	Aula Fib C	Classification: decision trees/2		Lecture 13	Nanni
14.	11.11.2021 11:00-12:45	Aula Fib A1	Classification: decision trees/3		Lecture 14	Nanni
15.	15.11.2021 11:00-12:45	Aula Fib C	Classification: decision trees/4		Lecture 15	Nanni
16.	18.11.2021 11:00-12:45	Aula Fib A1	Classification: decision trees exercises	Exercise	Lecture 16	Nanni
17.	22.11.2021 11:00-12:45	Aula Fib C	Lab: Classification	knime_classification Hands_on_Python_Titanic Python_Iris	Online: TBD Lecture 17 (offline)	Citraro
18.	25.11.2021 11:00-12:45	Aula A1	Pattern Mining - 1	Slides	Lecture 18	Pedreschi
19.	29.11.2021 11:00-12:45	Aula C	Pattern Mining - 2		Lecture 19	Pedreschi
20.	02.12.2021 11:00-12:45	Aula A1	Lab: Pattern Mining	Apriori Exercise Hands_on_Python_Titanic KNIME	Lecture 20	Citraro

Second Semester (DM2 - Data Mining: Advanced Topics and Applications)

	Day	Room Teams	Topic	Learning material	Instructor	Recordings
01.	14.02.2022 11:00–13:00	C	Introduction, CRISP, Evaluation, KNN	Intro, CRISP, Eval, KNN, Notebook_KNN_Eval	Guidotti	link
02.	17.02.2022 11:00–13:00	A1	Imbalanced Learning, Evaluation	ImbLearn Eval, ImbLearn	Guidotti	link
03.	21.02.2022 11:00–13:00	C	Dimensionality Reduction	DimRed, Notebook_DimRed	Guidotti	link
04.	24.02.2022 11:00–13:00	A1	Outlier Detection (part 1)	Outlier Detection, Notebook_OutlierDetection	Guidotti	link
05.	28.02.2022 11:00–13:00	C	Outlier Detection (part 2)	Outlier Detection, Notebook_OutlierDetection	Guidotti	link
06.	03.03.2022 11:00–13:00	A1	Outlier Detection (part 3)	Outlier Detection, Notebook_OutlierDetection	Guidotti	link
07.	07.03.2022 11:00–13:00	C	Naive Bayes Classifier, Linear Regression	NBC , Notebook_NBC, LinReg	Guidotti	link
08.	10.03.2022 11:00–13:00	A1	Linear Regression, Gradient Descent, Maximum Likelihood Estimation, Odds	LinReg, GradDes, MLE, Odds	Guidotti	link
09.	14.03.2022 11:00–13:00	C	Logistic Regression, Support Vector Machines	LogReg, SVM, Notebook_LR, Notebook_SVM	Guidotti	link
10.	17.03.2022 11:00–13:00	A1	Linear and Logistic Perceptron	Perceptron	Guidotti	link1, link2
11.	21.03.2022 11:00–13:00	C	Neural Networks	NeuralNetwork, Notebook_NN, Notebook_NN_impl	Guidotti	link
12.	24.03.2022 11:00–13:00	A1	Ensemble Classifiers, Bagging, Random Forest	EnsembleClassifiers, Notebook_ENS	Guidotti	link
13.	28.03.2022 11:00–13:00	C	Boosting, Gradient Boost	GBM	Guidotti	link
14.	31.03.2022 11:00–13:00	A1	XGBoost, LightGBM	GBM, Notebook_GBM	Guidotti	link
15.	04.04.2022 11:00–13:00	C	Time Series Introduction, Distance Functions	TS_Intro_Distances, Notebook_TS_Sim, Notebook_TS_DTW_Impl, Notebook_TS_DTW_Constr_Impl	Guidotti	link
16.	07.04.2022 11:00–13:00	A1	Time Series Approximations, Clustering	TS_Approx_Clustering, Notebook_TS_ApproxClus	Guidotti	link
17.	11.04.2022 11:00–13:00	C	Time Series Motifs, Discord, Matrix Profile	TS_MatrixProfile, TS_MatrixProfile	Guidotti	link
18.	14.04.2022 11:00–13:00	A1	Time Series Classification	TS_Classification Notebook_TSC, Notebook_TSC_SoA	Guidotti	link
19.	21.04.2022 11:00–13:00	A1	Sequential Pattern Mining	SPM	Guidotti	link
20.	28.04.2022 11:00–13:00	A1	Sequential Pattern Mining	SPM, Notebook_SPM	Guidotti	link
21.	02.05.2022 11:00–13:00	C	Advanced Clustering Approaches	Advanced_Clustering, Notebook_AC	Guidotti	link
22.	05.05.2022 11:00–13:00	A1	Transactional Clustering	Transactional Clustering, Notebook_TC	Guidotti	link
23.	09.05.2022 11:00–13:00	C	Explainable Artificial Intelligence	Explainability, Notebook_XAI	Guidotti	link
24.	12.05.2022 11:00–13:00	A1	Explainable Artificial Intelligence	Explainability, Notebook_XAI	Guidotti	link

Exams

Exam DM1

The exam is composed of two parts:

An oral exam, which includes: (1) discussing the project report; (2) discussing topics presented during the classes, including the theory and practical exercises.

A project, which consists of exercises requiring the use of data mining tools for data analysis. Exercises include data understanding, clustering analysis, frequent pattern mining, and classification (guidelines with more details will be provided). The project must be carried out in groups of 3 to 4 people, using KNIME, Python, or a combination of the two. The results of the different tasks must be reported in a single paper, whose total length must not exceed 20 pages of text including figures. The paper must be emailed to datamining [dot] unipi [at] gmail [dot] com. Please use “[DM1 2021-2022] Project” in the subject.

Project 1

Assigned: 30/09/2021
MidTerm Deadline: 21/11/2021 (half project required, i.e., Data understanding & Preparation and at least 2 clustering algorithms)
Final Deadline: 14/01/2022 (complete project required)
Data: choose between Glasgow Norms, Seismic Bumps

Project 2

Assigned: After Project 1 Final Deadline
Data: Hr-Analytics
Deadline: one week before the oral exam

Exam DM part II (DMA)

Exam Rules

Rules for DM2 exam available here.

Exam Booking Periods

3rd Appello: 08/05/2022 00:00 - 05/06/2022 23:59
4th Appello: 29/05/2022 00:00 - 26/06/2022 23:59
5th Appello: 19/06/2022 00:00 - 17/07/2022 23:59
6th Appello: 07/07/2022 00:00 - 31/08/2022 23:59

Exam Booking Agenda

Agenda Link: here
3rd Appello: starts 07/06/2022
4th Appello: starts 28/06/2022
5th Appello: starts 19/07/2022
6th Appello: starts 05/09/2022
Important! If you book a slot in the agenda on a date between 07/06/2022 and 27/06/2022, you MUST be registered for the 3rd appello; if you book a date between 28/06/2022 and 18/07/2022, you must be registered for the 4th appello; if you book a date on or after 19/07/2022, you must be registered for the 5th appello.

For online exams the camera must remain on and you must be able to share your screen. Online exams may require the use of the Miro platform (https://miro.com/app/dashboard/).

The exam is composed of two parts:

A project, which consists of applying the methods and algorithms presented during the classes to solve exercises on a given dataset. The project must be carried out by at most 3 people. The results of the different tasks must be reported in a single paper. The total length of this paper must be at most 30 pages (suggested 25) of text including figures, plus 1 cover page (minimum font size 11, minimum line spacing 1). The project must be delivered at least 7 days before the oral exam to riccardo [dot] guidotti [at] unipi [dot] it AND francesco [dot] spinnato [at] sns [dot] it with subject “[DM2 Project]”.

An oral exam, which includes: (1) discussing topics presented during the classes, including the theory of the parts already covered by the written exam; (2) solving simple exercises using the Miro platform; (3) discussing the project report with a group presentation.

Dataset: the data is about Human Activity Recognition
- Data can be downloaded here
- Submission Draft 1: 20/04/2022 23:59 Italian Time (we expect Modules 1 and 2)
- Submission Draft 2: 20/05/2022 23:59 Italian Time (we expect Modules 1, 2 and 3)
- Final Submission: one week before the oral exam.

Project Guidelines

Module 1 - Imbalanced Learning, Dimensionality Reduction, Anomaly Detection
1. Explore and prepare the dataset. You are allowed to take inspiration from existing notebooks you can find online and to define your own research perspective (from choosing a subset of variables to the class to predict…). You are welcome to create new variables and to perform all the pre-processing steps the dataset needs.
2. Define one or more (simple) classification tasks and solve them with Decision Tree and KNN.
3. Identify the top 1% outliers: adopt at least three different methods from different families (e.g., density-based, angle-based… ) and compare the results. Deal with the outliers by removing them from the dataset or by treating the anomalous variables as missing values and employing replacement techniques. In this second case, you should check that the outliers are not outliers anymore. Justify your choices in every step.
4. Analyze the value distribution of the class to predict with respect to point 2; if it is imbalanced, leave it as it is, otherwise turn the dataset into an imbalanced version (e.g., 96% - 4%, for binary classification). Then solve the classification task using the Decision Tree or the KNN by adopting various imbalanced learning techniques.
5. Exploit and test different dimensionality reduction techniques for (i) visualization in two dimensions, (ii) improving classification performance, (iii) improving outlier detection.
6. Draw your conclusions about the techniques adopted in this analysis.
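
As an illustration of point 3, the following is a minimal sketch of one distance-based outlier score: each point is scored by the distance to its k-th nearest neighbour, and the top 1% of the scores are flagged. It is only one possible family of methods, not the method prescribed by the course; the function names, the value k=3 and the toy grid dataset are assumptions made for this example.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def knn_outlier_scores(points, k=3):
    """Score each point by the distance to its k-th nearest neighbour.

    Assumes len(points) > k. Larger scores mean more isolated points.
    """
    scores = []
    for i, p in enumerate(points):
        dists = sorted(euclidean(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores


def top_fraction_outliers(points, fraction=0.01, k=3):
    """Return the indices of the highest-scoring points (at least one)."""
    scores = knn_outlier_scores(points, k)
    n_out = max(1, round(fraction * len(points)))
    ranked = sorted(range(len(points)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:n_out])


# Toy data (hypothetical): a regular 10x10 grid plus one distant point.
data = [(float(x), float(y)) for x in range(10) for y in range(10)]
data.append((50.0, 50.0))
outliers = top_fraction_outliers(data, fraction=0.01, k=3)
print(outliers)
```

The same ranking step (sort the scores, keep the top fraction) can be reused with scores coming from density-based or angle-based methods, which makes the results of the three families directly comparable.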

Module 2 - Advanced Classification Methods
1. Solve the classification task defined in Module 1 (or define new ones) with the other classification methods analyzed during the course: Naive Bayes Classifier, Logistic Regression, Support Vector Machines, Neural Networks, Ensemble Methods, Gradient Boosting Machines, and evaluate each classifier with the techniques presented in DM1 (accuracy, precision, recall, F1-score, ROC curve). Perform hyper-parameter tuning phases and justify your choices.
2. Besides the numerical evaluation, draw your conclusions about the various classifiers, e.g., for Neural Networks: what are the parameter sets or the convergence criteria that avoid overfitting? For Ensemble classifiers, how does the number of base models impact the classification performance? For any classifier, what is the minimum amount of data required to guarantee an acceptable level of performance? Is this level the same for every classifier? What does the feature importance of Random Forests reveal?
3. Select two continuous attributes, define a simple linear univariate regression problem and try to solve it using different techniques reporting various evaluation measures. Plot the two-dimensional dataset. Then generalize to multiple linear regression and observe how the performance varies. Solve it using linear regressions, regularized linear regressions (such as Lasso and Ridge) but also machine learning approaches such as Gradient Boosting Machines.
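
The univariate regression in point 3 can be sketched with the closed-form least-squares solution. This is a minimal illustration under assumed toy data, not the required implementation; the function names and the values of x and y are invented for the example, and in the project a library such as scikit-learn would normally be used instead.

```python
def fit_simple_linear_regression(x, y):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between targets and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy data (hypothetical): two continuous attributes, roughly y = 2x.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

slope, intercept = fit_simple_linear_regression(x, y)
y_hat = [slope * xi + intercept for xi in x]
print(slope, intercept, r2_score(y, y_hat), mean_absolute_error(y, y_hat))
```

Reporting more than one measure (here R2 and MAE) makes it easier to compare this baseline with regularized and gradient boosting regressors later on.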

Module 3 - Time Series Analysis
1. Prepare a dataset on which you can run time series clustering, motif/anomaly discovery, and classification.
2. On the dataset created, compute classification with KNN based on Euclidean/Manhattan and DTW distances and compare the results.
3. To perform the clustering you can choose among different distance functions and clustering algorithms. Remember that you can reduce the dimensionality through time series approximation. Analyze the clusters and highlight similarities and differences.
4. Analyze the dataset for finding motifs and/or anomalies. Visualize and discuss them and their relationship with other features.
5. Solve the classification task on the time series dataset(s) and evaluate each result. In particular, you should use shapelet-based classifiers and structural-based classifiers. Analyze the shapelets retrieved and discuss if there are any similarities/differences with motifs and/or shapelets.
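
To make the comparison in point 2 concrete, here is a minimal sketch of nearest-neighbour classification with Euclidean and unconstrained DTW distances. The function names, labels and toy series are assumptions for illustration only; the quadratic DTW below has no warping window and is meant for short series.

```python
import math
from collections import Counter


def euclidean_distance(a, b):
    """Point-wise Euclidean distance between two equal-length series."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_distance(a, b):
    """Dynamic Time Warping with squared point cost and no window constraint."""
    n, m = len(a), len(b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = (a[i - 1] - b[j - 1]) ** 2
            # Extend the cheapest of the three admissible predecessor cells.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return math.sqrt(cost[n][m])


def knn_predict(train_series, train_labels, query, distance, k=1):
    """Majority vote among the k training series closest to the query."""
    ranked = sorted(zip(train_series, train_labels),
                    key=lambda pair: distance(pair[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy data (hypothetical): one series with a peak, one flat series.
train = [[0, 0, 1, 2, 1, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0]]
labels = ["peak", "flat"]
query = [0, 0, 0, 0, 0, 1, 2, 1]  # the same peak, shifted in time

pred_euclidean = knn_predict(train, labels, query, euclidean_distance)
pred_dtw = knn_predict(train, labels, query, dtw_distance)
print(pred_euclidean, pred_dtw)
```

On this toy query the Euclidean distance prefers the flat series because the peak is misaligned, while DTW warps the time axis and recovers the "peak" label.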

Module 4 - Sequential Patterns and Advanced Clustering
1. Sequential Pattern Mining: Convert the time series into a discrete format (e.g., by using SAX) and extract the most frequent sequential patterns (of length at least 3 or 4) using different values of support, then discuss the most interesting sequences.
2. Advanced Clustering: On a dataset already prepared for one of the previous tasks in Module 1 or Module 2, run at least one clustering algorithm presented in the advanced clustering lectures (e.g., X-Means, Bisecting K-Means, OPTICS). Discuss the results that you find, analyzing the clusters and reporting internal validation measures (e.g., SSE, silhouette).
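
The discretization step mentioned in point 1 can be sketched as follows: z-normalize the series, reduce it with Piecewise Aggregate Approximation (PAA), then map each segment mean to a symbol using breakpoints that split the standard normal distribution into equiprobable regions. This is a simplified sketch, not a reference SAX implementation; the function names and toy series are assumptions, the series length is assumed to be divisible by the number of segments, and `statistics.NormalDist` requires Python 3.8 or later.

```python
from bisect import bisect_right
from statistics import NormalDist, mean, pstdev


def znormalize(series):
    """Rescale a series to zero mean and unit standard deviation."""
    mu = mean(series)
    sigma = pstdev(series)
    if sigma == 0:
        return [0.0] * len(series)
    return [(v - mu) / sigma for v in series]


def paa(series, n_segments):
    """Piecewise Aggregate Approximation: mean of equal-sized segments.

    Assumes len(series) is divisible by n_segments.
    """
    size = len(series) // n_segments
    return [mean(series[i * size:(i + 1) * size]) for i in range(n_segments)]


def sax(series, n_segments, alphabet="abcd"):
    """Convert a numeric series into a SAX word over the given alphabet."""
    a = len(alphabet)
    # Breakpoints splitting N(0, 1) into `a` equiprobable regions.
    breakpoints = [NormalDist().inv_cdf(i / a) for i in range(1, a)]
    segments = paa(znormalize(series), n_segments)
    return "".join(alphabet[bisect_right(breakpoints, v)] for v in segments)


# Toy series (hypothetical).
word_steps = sax([1, 1, 2, 2, 8, 8, 9, 9], n_segments=4)
word_ramp = sax([0, 1, 2, 3, 4, 5, 6, 7], n_segments=4)
print(word_steps, word_ramp)
```

The resulting words can then be treated as sequences of items and passed to a sequential pattern mining algorithm with different support thresholds.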

Module 5 - Explainability
1. Try to use one or more explanation methods (e.g., TREPAN, LIME, LORE, SHAP, Counterfactual Explainers, etc.) to illustrate the reasons for the classification in one of the steps of the previous tasks.

N.B. When “solving the classification task”, remember, (i) to test, when needed, different criteria for the parameter estimation of the algorithms, and (ii) to evaluate the classifiers (e.g., Accuracy, F1, Lift Chart) in order to compare the results obtained with an imbalanced technique against those obtained from using the “original” dataset.
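
The note above can be illustrated with a small sketch showing why accuracy alone is misleading on an imbalanced dataset such as the 96% - 4% split mentioned in Module 1. The function name and the two sets of predictions are invented for the example and do not come from any real classifier.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1 for a binary classification task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical imbalanced test set: 96 negatives followed by 4 positives.
y_true = [0] * 96 + [1] * 4

# Baseline that always predicts the majority class.
majority = binary_metrics(y_true, [0] * 100)

# Hypothetical model: 2 false positives, 3 of the 4 positives found.
model = binary_metrics(y_true, [0] * 94 + [1] * 2 + [1] * 3 + [0])

print(majority)
print(model)
```

The majority baseline reaches 0.96 accuracy with zero recall, while the second set of predictions is only slightly more accurate but has a much higher F1, which is why the comparison should rely on class-sensitive measures.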

Exam Dates

Exam Sessions

Session	Date	Time	Room	Notes
1.	11.01.2022	14:00 - 18:00	MS Teams	Please, use the system for registration: https://esami.unipi.it/
3.	07.06.2022			Please, use the system for registration: https://esami.unipi.it/
4.	28.06.2022			Please, use the system for registration: https://esami.unipi.it/
5.	19.07.2022			Please, use the system for registration: https://esami.unipi.it/
6.	05.09.2022			Please, use the system for registration: https://esami.unipi.it/

Past Exams

Past exam texts can be found on the old pages of the course. Please do not consider these exercises the only way of testing your knowledge. Exercises may be changed and updated every year and will be published together with the lecture slides.

Reading About the "Data Scientist" Job

… a new kind of professional has emerged, the data scientist, who combines the skills of software programmer, statistician and storyteller/artist to extract the nuggets of gold hidden under mountains of data. Hal Varian, Google’s chief economist, predicts that the job of statistician will become the “sexiest” around. Data, he explains, are widely available; what is scarce is the ability to extract wisdom from them.

Data, data everywhere. The Economist, Special Report on Big Data, Feb. 2010.

Data, data everywhere. The Economist, Feb. 2010 download
Data scientist: The hot new gig in tech, CNN & Fortune, Sept. 2011 link
Welcome to the yotta world. The Economist, Sept. 2011 download
Data Scientist: The Sexiest Job of the 21st Century. Harvard Business Review, Sept 2012 link
Il futuro è già scritto in Big Data. Il Sole 24 Ore, Sept 2012 link
Special issue of Crossroads - The ACM Magazine for Students - on Big Data Analytics download
Peter Sondergaard, Gartner, Says Big Data Creates Big Jobs. Oct 22, 2012: YouTube video
Towards Effective Decision-Making Through Data Visualization: Six World-Class Enterprises Show The Way. White paper at FusionCharts.com. download
