Data Mining A.A. 2020/21

DM1 - Data Mining: Foundations (6 CFU)

Instructors:

Dino Pedreschi
- KDDLab, Università di Pisa
- http://www-kdd.isti.cnr.it
- dino [dot] pedreschi [at] unipi [dot] it

Mirco Nanni
- KDDLab, ISTI - CNR, Pisa
- http://www-kdd.isti.cnr.it
- mirco [dot] nanni [at] isti [dot] cnr [dot] it

Teaching Assistant

Salvatore Citraro
- KDDLab, Università di Pisa
- http://www-kdd.isti.cnr.it
- salvatore [dot] citraro [at] phd [dot] unipi [dot] it

DM2 - Data Mining: Advanced Topics and Applications (6 CFU)

Instructors:

Riccardo Guidotti
- KDDLab, Università di Pisa
- https://kdd.isti.cnr.it/people/guidotti-riccardo
- riccardo [dot] guidotti [at] di [dot] unipi [dot] it

News

[30.05.2021] Agenda link for DM2 is here.
[20.05.2021] The project must be delivered to riccardo [dot] guidotti [at] unipi [dot] it AND salvatore [dot] citraro [at] phd [dot] unipi [dot] it with subject “[DM2 Project] Draft 2” or “[DM2 Project] Final”
[20.05.2021] CAT4 answers are available here.
[13.05.2021] CAT4 is available here.
[05.05.2021] CAT3 answers are available here.
[28.05.2021] CAT3 is available here.
[14.04.2021] CAT2 answers are available here.
[08.04.2021] CAT2 is available here.
[06.04.2021] The project must be delivered to riccardo [dot] guidotti [at] unipi [dot] it AND salvatore [dot] citraro [at] phd [dot] unipi [dot] it with subject “[DM2 Project] Draft 1”
[08.03.2021] CAT1 answers are available here.
[01.03.2021] CAT1 is available here.
[15.02.2021] Groups should be registered here
[11.02.2021] The course will be held online on MS Teams.
[11.02.2020] The first lesson will be held on 15/02/2021.

Learning Goals

DM1
- Fundamental concepts of data knowledge and discovery.
- Data understanding
- Data preparation
- Clustering
- Classification
- Pattern Mining and Association Rules
- Clustering

DM2
- Outlier Detection
- Regression and Forecasting
- Advanced Classification
- Time Series Analysis
- Sequential Pattern Mining
- Advanced Clustering
- Transactional Clustering
- Ethical Issues

Hours and Rooms

DM1

Classes

Day of Week	Hour	Room
Monday	14:00 - 16:00	MS Teams
Wednesday	16:00 - 18:00	MS Teams

Office hours - Ricevimento:

Prof. Pedreschi: Monday 16:00 - 18:00, Online
Prof. Nanni: appointment by email, Online

DM 2

Classes

Day of Week	Hour	Room
Monday	14:00 - 16:00	MS Teams
Wednesday	16:00 - 18:00	MS Teams

Office Hours - Ricevimento:

Room 268 Dept. of Computer Science
Tuesday: 15-17, Room: MS Teams
Appointment by email

Learning Material -- Materiale didattico

Textbook -- Libro di Testo

Pang-Ning Tan, Michael Steinbach, Vipin Kumar. Introduction to Data Mining. Addison Wesley, ISBN 0-321-32136-7, 2006
- http://www-users.cs.umn.edu/~kumar/dmbook/index.php
- I capitoli 4, 6, 8 sono disponibili sul sito del publisher. – Chapters 4,6 and 8 are also available at the publisher's Web site.
Berthold, M.R., Borgelt, C., Höppner, F., Klawonn, F. GUIDE TO INTELLIGENT DATA ANALYSIS. Springer Verlag, 1st Edition., 2010. ISBN 978-1-84882-259-7
Laura Igual et al. Introduction to Data Science: A Python Approach to Concepts, Techniques and Applications. 1st ed. 2017 Edition.
Jake VanderPlas. Python Data Science Handbook: Essential Tools for Working with Data. 1st Edition.

Slides

The slides used in the course will be inserted in the calendar after each class. Most of them are part of the slides provided by the textbook's authors Slides per "Introduction to Data Mining".

Software

Python - Anaconda (3.7 version!!!): Anaconda is the leading open data science platform powered by Python. Download page (the following libraries are already included)
Scikit-learn: python library with tools for data mining and data analysis Documentation page
Pandas: pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language. Documentation page
KNIME The Konstanz Information Miner. Download page
WEKA Data Mining Software in JAVA. University of Waikato, New Zealand Download page
Didactic Data Mining DDM

Class Calendar (2020/2021)

First Semester (DM1 - Data Mining: Foundations)

	Day	Room	Topic	Learning material	Instructor
1.	16.09.2020 14:00-16:00	MS Teams	Introduction.	Course Overview Introduction DM	Pedreschi
2.	23.09.2020 16:00-18:00	MS Teams	Data Understanding	Slides DU Slides on Descriptive Statistics	Pedreschi
3.	28.09.2020 14:00-16:00	MS Teams	Data Understanding		Pedreschi
4.	30.09.2020 16:00-18:00	MS Teams	Data Preparation	Slides DP	Pedreschi
5.	05.10.2020 14:00-16:00	MS Teams	Lab: Introduction to Python and Knime	Python Introduction, Knime simple workflow Lecture 5 part 1, Lecture 5 part 2	Guidotti, Citraro
6.	07.10.2020 16:00-18:00	MS Teams	Lab: Data Understanding & Preparation	Dataset: Iris, Titanic, Knime: 01_data_understanding.zip Python: titanic_data_understanding2.ipynb.zip Lecture 6 part 1, Lecture 6 part 2	Guidotti, Citraro
7.	12.10.2020 14:00-16:00	MS Teams	Clustering: Intro & K-means	Slides clustering 1	Nanni
8.	14.10.2020 16:00-18:00	MS Teams	Clustering: Hierarchical methods	Slides clustering 2	Nanni
9.	19.10.2020 14:00-16:00	MS Teams	Clustering: Density-based methods and exercises	Slides clustering 3, Clustering exercises	Nanni
10.	21.10.2020 16:00-18:00	MS Teams	Clustering: Validation methods and exercises	Slides clustering 4	Nanni
11.	26.10.2020 14:00-16:00	MS Teams	Lab: Clustering	Knime , Python Iris Python Titanic	Citraro
12.	28.10.2020 16:00-18:00	MS Teams	Classification: Intro and Decision Trees	Slides classification	Nanni
	02.11.2020 14:00-16:00		No Lecture. Project Week.
	04.11.2020 16:00-18:00		No Lecture. Project Week.
13.	09.11.2020 14:00-16:00	MS Teams	Classification: Decision Trees/2		Nanni
14.	11.11.2020 16:00-18:00	MS Teams	Classification: Decision Trees/3		Nanni
15.	16.11.2020 14:00-16:00	MS Teams	Classification: Decision Trees/4	Sample exercise	Nanni
16.	18.11.2020 16:00-18:00	MS Teams	Classification: Decision Trees/5 + Exercises	Exercises 1, Excercises 2	Nanni
17.	23.11.2020 14:00-16:00	MS Teams	Classification: KNN	Slides, Exercise 1 (KNN only), Exercise 2	Nanni
18.	25.11.2020 16:00-18:00	MS Teams	Lab: Clustering	knime_classification python_classification python_classification2	Citraro
19.	02.12.2020 16:00-18:00	MS Teams	Pattern & Association Rule Mining - Apriori algorithm for frequent itemset mining	2-dm2-restructured_assoc-2020.pdf	Pedreschi
20.	07.12.2020 14:00-16:00	MS Teams	Pattern & Association Rule Mining - Rule mining and evaluation, Closed and maximal itemsets, Multi-dimensional, Quantitative and Multy-level association rules		Pedreschi
21.	14.12.2020 14:00-16:00		Lab Pattern Mining	knime_pattern python_pattern https://anaconda.org/conda-forge/pyfim, http://www.borgelt.net/pyfim.html ex-frequentpatterns-ar.pdf	Citraro

Second Semester (DM2 - Data Mining: Advanced Topics and Applications)

	Day	Room	Topic	Learning material	Instructor	Recordings
1.	15.02.2021 14:00-16:00	MS Teams	Introduction, CRIPS, KNN	Intro, CRISP, KNN	Guidotti	1stPart, 2ndPart
2.	17.02.2021 16:00-18:00	MS Teams	Performance Evaluation	Eval, occupancy_data, KNN_Eval_Notebook	Guidotti	Dataset, Lecture
3.	22.02.2021 14:00-16:00	MS Teams	Imbalanced Learning	ImbLearn, DimRed_notebook, ImbLearn_notebook	Guidotti	1stPart, 2ndPart
4.	23.02.2021 16:00-18:00	MS Teams	Anomaly Detection	MLE, Anomaly Detection, Anomaly_notebook	Guidotti	1st Part, 2nd Part
5.	01.03.2021 14:00-16:00	MS Teams	Anomaly Detection	Anomaly Detection, Anomaly_notebook	Guidotti	1st Part, 2nd Part
6.	03.02.2021 16:00-18:00	MS Teams	Anomaly Detection	Anomaly Detection, Anomaly_notebook, Extended Isolation Forest link	Guidotti	1st Part, 2nd Part
7.	08.03.2021 14:00-16:00	MS Teams	Naive Bayes Classifier	NBC, NBC_notebook, Ex1_Miro, Ex2_Miro	Guidotti	1st Part, 2nd Part
	10.02.2021 16:00-18:00		Lezione sul tema “Da Pisa al Fermilab di Chicago: Viaggio verso un rivoluzionario computer quantistico” della prof.ssa Anna Grassellino	Link	Guidotti
8.	15.03.2021 14:00-16:00	MS Teams	Linear and Logistic Regression, Rule-based Classifiers	Regression, RuleBased, Regression_Notebook	Guidotti	1stPart, 2ndPart
9.	17.03.2021 16:00-18:00	MS Teams	Rule-based Classifiers, Support Vector Machines	RuleBased, RuleBased_Notebook, SVM, SVM_Notebook	Guidotti	1st Part, 2nd Part
10.	22.03.2021 14:00-16:00	MS Teams	(Nonlinear) Support Vector Machines, Linear Perceptron	SVM, SVM_Notebook, Linear Perceptron	Guidotti	1st Part, 2nd Part
11.	24.03.2021 16:00-18:00	MS Teams	Neural Networks, Deep Neural Networks	Neural Network, NN_Notebook	Guidotti	1st Part, 2nd Part
-	25.03.2021 15:00-17:00	MS Teams	Neural Networks Forward and Backpropagation Example, Case Study Music	NN_Implementation, Case Study	Guidotti	1st Part, 2nd Part
12.	29.03.2021 14:00-16:00	MS Teams	Neural Networks (Training Tricks), Ensemble Classifiers	Ensemble Classifiers	Guidotti	1st Part, 2nd Part
13.	31.03.2021 16:00-18:00	MS Teams	Ensemble Classifiers	Ensemble Classifiers, Ensemble_Notebook	Guidotti	1st Part, 2nd Part
14.	12.04.2021 14:00-16:00	MS Teams	Time Series Similarity	Time Series Similarity	Guidotti	1st Part, 2nd Part
15.	14.04.2021 16:00-18:00	MS Teams	Time Series Similarity, Approximation and Clustering	Time Series Similarity, Time Series Approximation and Clustering	Guidotti	1st Part, 2nd Part
16.	19.04.2021 14:00-16:00	MS Teams	Time Series Motifs	TS_Similarty_Notebook, Time Series Motifs, TS Datasets, Keras Accuracy	Guidotti	1st Part, 2nd Part
17.	21.04.2021 16:00-18:00	MS Teams	Time Series Classification	Time Series Classification, TS_Plot, TS_Similarty_Notebook (updated)	Guidotti	1st Part, 2nd Part, Office Hours
18.	26.04.2021 14:00-16:00	MS Teams	Time Series Classification	Time Series Classification, TS_Shapelet_Motif_Notebook, TS_classification_Notebook, TS_from_MP3_Notebook	Guidotti	1st Part, 2nd Part, Tutorial MP3
19.	28.04.2021 16:00-18:00	MS Teams	Sequential Pattern Mining	Sequential Pattern Mining	Guidotti	1st Part, 2nd Part
20.	03.05.2021 14:00-16:00	MS Teams	Sequential Pattern Mining (Timing Constraints)	Sequential Pattern Mining, SPM_Notebook, TS_extraction_RMS, RMSE_TS Dataset	Guidotti	1st Part, 2nd Part, Tutorial RMSE
21.	05.05.2021 16:00-18:00	MS Teams	Advanced Clustering Methods	Advanced Clustering Methods	Guidotti	1st Part, 2nd Part
22.	10.05.2021 14:00-16:00	MS Teams	Transactional Clustering Methods	Transactional Clustering Methods, ACM_notebooks	Guidotti	Hint Clus TS 1st Part, 2nd Part
23.	12.05.2021 16:00-18:00	MS Teams	Explainable Artificial Intelligence	XAI, ACM_Notebook	Guidotti	ACM_Notebook 1st Part, 2nd Part
24.	17.05.2021 14:00-16:00	MS Teams	Explainable Artificial Intelligence	XAI, XAI_Notebook	Guidotti	1st Part, 2nd Part

Exams

Exam DM1

The exam is composed of two parts:

An oral exam , that includes: (1) discussing the project report; (2) discussing topics presented during the classes, including the theory and practical exercises.

A project consists in exercises that require the use of data mining tools for analysis of data. Exercises include: data understanding, clustering analysis, frequent pattern mining, and classification (see the guidelines for more details). The project has to be performed by min 3, max 4 people. It has to be performed by using Knime, Python or a combination of them. The results of the different tasks must be reported in a unique paper. The total length of this paper must be max 20 pages of text including figures. The paper must be emailed to datamining [dot] unipi [at] gmail [dot] com. Please, use “[DM1 2020-2021] Project” in the subject.

Tasks of the project:

Data Understanding: Explore the dataset with the analytical tools studied and write a concise “data understanding” report describing data semantics, assessing data quality, the distribution of the variables and the pairwise correlations. (see Guidelines for details)
Clustering analysis: Explore the dataset using various clustering techniques. Carefully describe your's decisions for each algorithm and which are the advantages provided by the different approaches. (see Guidelines for details)
Classification: Explore the dataset using classification trees. Use them to predict the target variable. (see Guidelines for details)
Association Rules: Explore the dataset using frequent pattern mining and association rules extraction. Then use them to predict a variable either for replacing missing values or to predict target variable. (see Guidelines for details)

Project 1
1. Dataset: IBM-HR
2. Assigned: 16/09/2020
3. Midterm Deadline: 21/11/2020 (half project required, i.e., data understanding and at least two clustering algorithms)
4. Final Deadline: ~~07/01/2021~~ 14/01/2021(complete project required)
5. Data: here
6. Description: IBM-HR
7. (please download the data from here and not from the link with the description as we are using a different version of the data)

Project 2
1. Dataset: Bank Loan Status
2. Assigned: 15/01/2021
3. Deadline: 4 days before the oral exam
4. This dataset must be used for all tasks. For the classification task, you have to split the dataset into train and test set and the class to predict is the variable “Loan Status”.
5. This dataset is valid for all the exam sessions until September.
6. Download the dataset Bank Loan Status dataset (in CSV format, zipped)

Guidelines for the project are here.

Exam DM part II (DMA)

Exam Rules

Rules for DM2 exam available here.

Exam Booking Periods

3rd Appello: 04/05/2021 00:00 - 29/05/2021 23:59
4th Appello: 25/05/2021 00:00 - 19/06/2021 23:59
5th Appello: 15/06/2021 00:00 - 10/07/2021 23:59

Exam Booking Agenda

Agenda Link: here
3rd Appello: starts 03/06/2021
4th Appello: starts 24/06/2021
5th Appello: starts 15/07/2021
Important! if you book in the agenda in data in days between 03/06/2021 and 23/06/2021 you MUST be registered for the 3rd appello, if you book in the agenda in data in days between 24/06/2021 and 14/07/2021 you must be registered for the 4th appello, if you book in the agenda in data in days after 15/07/2021 you must be registered for the 5th appello.

The link to the agenda for booking a slot for the exam is displayed at the end of the registration. During the exam the camera must remain open and you must be able to share your screen. For the exam could be required the usage of the Miro platform (https://miro.com/app/dashboard/).

The exam is composed of two parts:

A project, that consists in employing the methods and algorithms presented during the classes for solving exercises on a given dataset. The project has to be realized by max 3 people. The results of the different tasks must be reported in a unique paper. The total length of this paper must be max 30 pages (suggested 25) of text including figures + 1 cover page (minimum font 11, minimum interline 1). The project must be delivered at least 7 days before the oral exam. The project must be delivered to riccardo [dot] guidotti [at] unipi [dot] it AND salvatore [dot] citraro [at] phd [dot] unipi [dot] it with subject “[DM2 Project]”

An oral exam, that includes: (1) discussing topics presented during the classes, including the theory of the parts already covered by the written exam; (2) resolving simple exercises using the Miro platform; (3) discussing the project report with a group presentation;

Dataset: the data is about Music Analysis and can be downloaded here: github (or here uci)
- Data can be downloaded here fma_metadata.zip
- Submission Draft 1: 19/04/2020 23:59 Italian Time (we expect Module 1 and Module 2)
- Submission Draft 2: 22/05/2020 23:59 Italian Time (we expect Module 3)
- Final Submission: one week before the oral exam.

Project Guidelines

Module 1 - Introduction, Imbalanced Learning and Anomaly Detection
1. Explore and prepare the dataset. You are allowed to take inspiration from the associated GitHub repository and figure out your personal research perspective (from choosing a subset of variables to the class to predict…). You are welcome in creating new variables and performing all the pre-processing steps the dataset needs.
2. Define one or more (simple) classification tasks and solve it with Decision Tree and KNN. You decide the target variable.
3. Identify the top 1% outliers: adopt at least three different methods from different families (e.g., density-based, angle-based… ) and compare the results. Deal with the outliers by removing them from the dataset or by treating the anomalous variables as missing values and employing replacement techniques. In this second case, you should check that the outliers are not outliers anymore. Justify your choices in every step.
4. Analyze the value distribution of the class to predict with respect to point 2; if it is unbalanced leave it as it is, otherwise turn the dataset into an imbalanced version (e.g., 96% - 4%, for binary classification). Then solve the classification task using the Decision Tree or the KNN by adopting various techniques of imbalanced learning.
5. Draw your conclusions about the techniques adopted in this analysis.

Module 2 - Advanced Classification Methods
1. Solve the classification task defined in Module 1 (or define new ones) with the other classification methods analyzed during the course: Naive Bayes Classifier, Logistic Regression, Rule-based Classifiers, Support Vector Machines, Neural Networks, Ensemble Methods and evaluate each classifier with the techniques presented in Module 1 (accuracy, precision, recall, F1-score, ROC curve). Perform hyper-parameter tuning phases and justify your choices.
2. Besides the numerical evaluation draw your conclusions about the various classifiers, e.g. for Neural Networks: what are the parameter sets or the convergence criteria which avoid overfitting? For Ensemble classifiers how the number of base models impacts the classification performance? For any classifier which is the minimum amount of data required to guarantee an acceptable level of performance? Is this level the same for any classifier? What is revealing the feature importance of Random Forests?
3. Select two continuous attributes, define a regression problem and try to solve it using different techniques reporting various evaluation measures. Plot the two-dimensional dataset. Then generalize to multiple linear regression and observe how the performance varies.

Module 3 - Time Series Analysis
1. Select the feature(s) you prefer and use it (them) as a time series. You can use the temporal information provided by the authors’ datasets, but you are also welcome in exploring the .mp3 files to build your own dataset of time series according to your purposes. You should prepare a dataset on which you can run time series clustering; motif/anomaly discovery and classification.
2. On the dataset created, compute clustering based on Euclidean/Manhattan and DTW distances and compare the results. To perform the clustering you can choose among different distance functions and clustering algorithms. Remember that you can reduce the dimensionality through approximation. Analyze the clusters and highlight similarities and differences.
3. Analyze the dataset for finding motifs and/or anomalies. Visualize and discuss them and their relationship with other features.
4. Solve the classification task on the time series dataset(s) and evaluate each result. In particular, you should use shapelet-based classifiers. Analyze the shapelets retrieved and discuss if there are any similarities/differences with motifs and/or shapelets.

Module 4 - Sequential Patterns and Advanced Clustering
1. Sequential Pattern Mining: Convert the time series into a discrete format (e.g., by using SAX) and extract the most frequent sequential patterns (of at least length 3/4) using different values of support, then discuss the most interesting sequences.
2. Advanced Clustering: On a dataset already prepared for one of the previous tasks in Module 1 or Module 2, run at least one clustering algorithm presented in the advanced clustering lectures (e.g. X-Means, Bisecting K-Means, OPTICS). Discuss the results that you find analyzing the clusters and reporting external validation measures (e.g SSE, silhouette).
3. Transactional Clustering: By using categorical features, or by turning a dataset with continuous variables into a dataset with categorical variables (e.g. by using binning), run at least one clustering algorithm presented in the transactional clustering lectures (e.g. K-Modes, ROCK). Discuss the results that you find analyzing the clusters and reporting external validation measures (e.g SSE, silhouette).

Module 5 - Explainability (optional)
1. Try to use one or more explanation methods (e.g., LIME, LORE, SHAP, etc.) to illustrate the reasons for the classification in one of the steps of the previous tasks.

N.B. When “solving the classification task”, remember, (i) to test, when needed, different criteria for the parameter estimation of the algorithms, and (ii) to evaluate the classifiers (e.g., Accuracy, F1, Lift Chart) in order to compare the results obtained with an imbalanced technique against those obtained from using the “original” dataset.

Exam Dates

Exam Sessions

Session	Date	Time	Room	Notes	Marks
1.	16.01.2019	14:00 - 18:00	MS Teams	Please, use the system for registration: https://esami.unipi.it/

Past Exams

Past exams texts can be found in old pages of the course. Please do not consider these exercises as a unique way of testing your knowledge. Exercises can be changed and updated every year and will be published together with the slides of the lectures.

Reading About the "Data Scientist" Job

… a new kind of professional has emerged, the data scientist, who combines the skills of software programmer, statistician and storyteller/artist to extract the nuggets of gold hidden under mountains of data. Hal Varian, Google’s chief economist, predicts that the job of statistician will become the “sexiest” around. Data, he explains, are widely available; what is scarce is the ability to extract wisdom from them.

Data, data everywhere. The Economist, Special Report on Big Data, Feb. 2010.

Data, data everywhere. The Economist, Feb. 2010 download
Data scientist: The hot new gig in tech, CNN & Fortune, Sept. 2011 link
Welcome to the yotta world. The Economist, Sept. 2011 download
Data Scientist: The Sexiest Job of the 21st Century. Harvard Business Review, Sept 2012 link
Il futuro è già scritto in Big Data. Il SOle 24 Ore, Sept 2012 link
Special issue of Crossroads - The ACM Magazine for Students - on Big Data Analytics download
Peter Sondergaard, Gartner, Says Big Data Creates Big Jobs. Oct 22, 2012: YouTube video

Towards Effective Decision-Making Through Data Visualization: Six World-Class Enterprises Show The Way. White paper at FusionCharts.com. download

DidaWiki

Indice