Data Mining, Academic Year 2019/20

DM 1: Foundations of Data Mining (6 CFU)

Instructors:

Dino Pedreschi
- KDD Laboratory, Università di Pisa and ISTI - CNR, Pisa
- http://www-kdd.isti.cnr.it
- dino [dot] pedreschi [at] unipi [dot] it

DM 2: Advanced Topics on Data Mining and Applications (6 CFU)

Instructors:

Riccardo Guidotti
- KDDLab, Università di Pisa, Pisa
- https://kdd.isti.cnr.it/people/guidotti-riccardo
- riccardo [dot] guidotti [at] di [dot] unipi [dot] it

DM: Data Mining (9 CFU)

Instructors:

Dino Pedreschi, Anna Monreale
- KDD Laboratory, Università di Pisa, Pisa
- http://www-kdd.isti.cnr.it
- dino [dot] pedreschi [at] unipi [dot] it
- anna [dot] monreale [at] unipi [dot] it

News

[04.07.2020] Third DM2 exam session from 17/07 to 29/07. Please register (here) before 12/07 and select your slot at the agenda link that will be available from 12/07. Remember to submit the project one week before the exam. It is mandatory to submit the project before 15/07. Doodle will not be used for this session. Every slot can accept up to 4 students (in this case you have to register individually). Slots span from 17/07 to 29/07 inclusive.
[21.06.2020] To help us grade the projects and organize the oral exams, everyone has to submit the project with the occupancy detection dataset before midnight of the 15th of July 2020. Another dataset will be published after this deadline, and submissions after the 15th of July must use the new dataset. The rule that the project must be submitted at least ONE WEEK before the oral exam remains valid.
[12.06.2020] New Doodle is available for booking the DM2 exam here.
[22.05.2020] In the section “Exam DM part I (DMF)” of this page you can find the new rules for the exams of DM (9 CFU) of Computer Science and DM1 (6 CFU) of Data Science & BI and Digital Humanities.
[14.04.2020] CAT4 for self-evaluation is available here (it will not be considered for the final evaluation). Report your final mark here. It is recommended to do it before 18th May 2020. Solutions are available here.
[06.05.2020] Submission Draft 2 deadline: 25/05/2020. We expect to find Tasks 2 and 3 completed; if you have started working on Tasks 4 and 5, that is welcome. We do not care about form and layout: what matters now is the content and proof that you have continued analyzing the data as required.
[01.05.2020] Keras Accuracy here.
[30.04.2020] DM2 Exam Rules here.
[14.04.2020] CAT3 for self-evaluation is available here (it will not be considered for the final evaluation). Report your final mark here. It is recommended to do it before 9th April 2020. Solutions are available here.
[08.04.2020] Submission Draft 1 deadline: 16/04/2020. We expect to find Task 1 completed and Task 2 at a good stage (let's say 60-70%); if you have started working on Task 3, that is welcome. We do not care about form and layout: what matters now is the content and proof that you have started analyzing the data as required.
[06.04.2020] Reading material available here.
[18.03.2020] CAT2 for self-evaluation is available here (it will not be considered for the final evaluation). Report your final mark here. It is recommended to do it before Sunday 22nd March 2020. Solutions are available here.
[05.03.2020] From Monday 9 March, lectures will be held online using Microsoft Teams. You can find instructions to join the course here (Italian) and here (English). The code for joining the 420AA DATA MINING Team is rc6b0ko. Microsoft Teams will replace in-person lectures and office hours. The material will be uploaded as usual on the DidaWiki web page.
[04.03.2020] In-person lectures and office hours are suspended.
[02.03.2020] CAT1 Results: CAT1
[24.02.2020] Project Dataset Change: Occupancy Detection
[17.01.2020] Declare Project Groups (max 3 people) by next Monday, 24 February, by adding your information at link
[12.01.2020] Project evaluation and Proposed final Mark Results
[28.12.2019] Results of Midterm-Test December 2019: Results. Students who did not pass the midterm test can take the written exam during the winter sessions. Once we have the project mark, we will compute the average of the written and project marks.
[06.12.2019] Exercises on Clustering: ex._clustering.pdf
[04.11.2019] The lecture of Monday, November 4, ends at 15:00 to allow participation in the Informatica 50 event “Ora che comanda lui, quando tutto è basato sul software” (in Italian), at 15:30 in the Aula Magna Storica of Palazzo della Sapienza, UNIPI. Full information: event website
[03.10.2019] Please fill in the spreadsheet with the name of the group (Group1, Group2, …) and the list of students in the group.
[26.09.2019] Global Climate Strike: the teachers of the DM course will join the Global Climate Strike tomorrow, Friday September 27, so tomorrow's lecture is cancelled.
[18.09.2019] Event: “Privacy: limite o opportunità? Gli esempi delle Nuove Tecnologie e dei Dati Sanitari” Information here.

Learning goals

… a new kind of professional has emerged, the data scientist, who combines the skills of software programmer, statistician and storyteller/artist to extract the nuggets of gold hidden under mountains of data. Hal Varian, Google’s chief economist, predicts that the job of statistician will become the “sexiest” around. Data, he explains, are widely available; what is scarce is the ability to extract wisdom from them.

Data, data everywhere. The Economist, Special Report on Big Data, Feb. 2010.

The wide availability of data from relational databases, the web, and other sources motivates the study of data analysis techniques that allow a better understanding and easier use of results in decision-making processes. The goal of the course is to provide an introduction to the basic concepts of the knowledge discovery process, the main data mining techniques, and the related algorithms. Particular emphasis is placed on methodological aspects, presented through some paradigmatic classes of applications such as market basket analysis, market segmentation, and fraud detection. Finally, the course introduces the privacy and ethical aspects of using inference techniques on data, of which the analyst must be aware. The course consists of the following parts:

the basic concepts of the knowledge discovery process: data understanding and preparation, types of data, data measures and similarity;
the main data mining techniques (association rules, classification and clustering), studied in both their formal and implementation aspects;
some case studies in marketing and customer management support, fraud detection, and epidemiological studies;
the last part of the course introduces the privacy and ethical aspects of using inference techniques on data, of which the analyst must be aware.

Reading about the "data scientist" job

Data, data everywhere. The Economist, Feb. 2010 download
Data scientist: The hot new gig in tech, CNN & Fortune, Sept. 2011 link
Welcome to the yotta world. The Economist, Sept. 2011 download
Data Scientist: The Sexiest Job of the 21st Century. Harvard Business Review, Sept 2012 link
Il futuro è già scritto in Big Data. Il Sole 24 Ore, Sept 2012 link
Special issue of Crossroads - The ACM Magazine for Students - on Big Data Analytics download
Peter Sondergaard, Gartner, Says Big Data Creates Big Jobs. Oct 22, 2012: YouTube video

Towards Effective Decision-Making Through Data Visualization: Six World-Class Enterprises Show The Way. White paper at FusionCharts.com. download

Hours and Rooms

DM1 & DM

Classes

Day of Week	Hour	Room
Monday	14:00 - 16:00	Room E1
Wednesday	16:00 - 18:00	Room A1
Friday	11:00 - 13:00	Room C1

Office hours:

Prof. Pedreschi: Monday, 14:00 - 17:00, Department of Computer Science
Prof. Monreale: Monday, 09:00 - 11:00, Department of Computer Science

DM 2

Classes

Day of Week	Hour	Room
Monday	09:00 - 11:00	C
Wednesday	16:00 - 18:00	C1

Office hours:

Room 268 Dept. of Computer Science
Thursday: 15-17, Room: 286
Appointment by email

Learning Material

Textbooks

Pang-Ning Tan, Michael Steinbach, Vipin Kumar. Introduction to Data Mining. Addison Wesley, ISBN 0-321-32136-7, 2006
- http://www-users.cs.umn.edu/~kumar/dmbook/index.php
- Chapters 4, 6 and 8 are also available on the publisher's web site.
Berthold, M.R., Borgelt, C., Höppner, F., Klawonn, F. Guide to Intelligent Data Analysis. Springer Verlag, 1st Edition, 2010. ISBN 978-1-84882-259-7
Laura Igual et al. Introduction to Data Science: A Python Approach to Concepts, Techniques and Applications. 1st ed. 2017 Edition.
Jake VanderPlas. Python Data Science Handbook: Essential Tools for Working with Data. 1st Edition.

Slides of the classes

The slides used in the course will be added to the calendar after each class. Most of them are taken from the slides provided by the textbook's authors: Slides for "Introduction to Data Mining".

Past Exams

* Exercises on Clustering: ex._clustering.pdf

* Some text of past exams on DM1 (6CFU):

2017-1-19.pdf, 2017-9-6.pdf, 2016-05-30-dm1-seconda.pdf

* Some solutions of past exams containing exercises on KNN and Naive Bayes classifiers DM1 (9CFU):

dm2_exam.2017.06.13_solutions.pdf, dm2_exam.2017.07.04_solutions.pdf, dm2_mid-term_exam.2017.06.06_solutions.pdf

* Some exercises (partially with solutions) on sequential patterns and time series can be found in the following texts of exams from the last years:

dm2_exam.2015.04.13.results.pdf, dm2_exam.2016.04.4_sol.pdf, dm2_exam.2016.04.5_sol.pdf, dm2_exam.2016.06.20_sol.pdf, dm2_exam.2016.07.08_sol.pdf

Some very old exercises (some of them with solutions) are available here. Most of them are in Italian, and not all of them are on topics covered in this year's program.

Data mining software

KNIME The Konstanz Information Miner. Download page
Python - Anaconda (version 3.7): Anaconda is the leading open data science platform powered by Python. Download page (the following libraries are already included)
Scikit-learn: python library with tools for data mining and data analysis Documentation page
Pandas: pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language. Documentation page
WEKA Data Mining Software in JAVA. University of Waikato, New Zealand Download page

Class calendar (2019/2020)

First part of course, first semester (DM1 - Data mining: foundations & DM - Data Mining)

	Day	Topic	Learning material	Instructor
1.	16.09 14:00-16:00	Overview. Introduction to KDD	Course Overview Introduction DM	Pedreschi
	18.09 16:00-18:00	Lecture cancelled (event at Scuola S. Anna; see the News section of this page)		Pedreschi
2.	20.09 11:00-13:00	Introduction to KDD: technologies, Application and Data		Pedreschi
3.	23.09 14:00-16:00	Data Understanding (from Berthold's book)	Slides DU Slides on Descriptive Statistics, useful for clarifying some statistical notions. Unfortunately this material is only in Italian.	Monreale
4.	25.09 16:00-18:00	Data Preparation	Slides DP	Monreale
	27.09 11:00-13:00	Climate Strike
5.	30.09 14:00-16:00	Introduction to Python.	Python Introduction	Monreale
6.	02.10 16:00-18:00	Clustering: Introduction + Centroid-based clustering, K-means	Clustering: Intro and K-means	Pedreschi
7.	04.10 11:00-13:00	Lab: Data Understanding & Preparation in Knime	Knime: 01_data_understanding.zip Data: Titanic File	Monreale
8.	07.10 14:00-16:00	Lab: DU Python + Project presentation	Python: titanic_data_understanding2.ipynb.zip	Monreale
9.	09.10 16:00-18:00	Clustering: K-means + Hierarchical	5.basic_cluster_analysis-hierarchical.pdf	Monreale
10.	11.10 11:00-13:00	Cancelled for Internet Festival		Pedreschi
11.	14.10 14:00-16:00	Clustering: DBSCAN & VALIDITY	6.basic_cluster_analysis-dbscan-validity.pdf	Pedreschi
12.	16.10 16:00-18:00	Exercises on Clustering	Tool for Dm ex: Didactic Data Mining Ex. Clustering PDF Ex. Clustering PPTX	Monreale
13.	18.10 11:00-13:00	Lab: Clustering	clustering_knime clustering_python	Monreale
14.	21.10 14:00-16:00	Classification	7.chap3_basic_classification-2019.pdf A visual intro to machine learning	Pedreschi
15.	23.10 16:00-18:00	Classification		Pedreschi
16.	25.10 11:00-13:00	Classification		Pedreschi
17.	28.10 14:00-16:00	Lab: Classification	knime_classification python_classification	Monreale
18.	30.10 16:00-18:00	Exercises Classification + Discussion Clustering	ex-classification.pdf	Monreale
19.	04.11 14:00-15:00	Pattern Mining	Note: the lecture will terminate at 15:00 to allow for the participation of the Informatica50 event (see news) slides	Pedreschi
20.	06.11 16:00-18:00	Pattern Mining		Pedreschi
	08-14.11	Project work
21.	15.11 11:00-13:00	Exercises and Lab on Pattern Mining	knime_pattern python_pattern https://anaconda.org/conda-forge/pyfim, http://www.borgelt.net/pyfim.html ex-frequentpatterns-ar.pdf	Monreale
	18.11 14:00-16:00	Cancelled due to weather conditions
	20.11 16:00-18:00	Cancelled
22.	22.11 11:00-13:00	Exercises Classification		Monreale
		The following classes are dedicated to DM (9 CFU)
23.	25.11 14:00-16:00	Alternative methods for classification/1	K-Nearest Neighbors & Naive Bayes	Pedreschi
24.	27.11 16:00-18:00	Alternative methods for classification/2	Wisdom of the crowd & Ensemble methods: Bagging, Random Forest & Boosting Galton's "Vox Populi" 1907 Nature paper	Pedreschi
25.	29.11 11:00-13:00	Alternative methods for classification/3	Recap Ensemble methods & Hints to Rule-based classification	Pedreschi
26.	02.12 14:00-16:00	Alternative Methods for Pattern Mining + Ex on KNN and NB	fp-growth.pdf KNN & NB	Monreale
27.	04.12 16:00-18:00	Alternative Methods for Clustering	1-alternative-clustering-2019.pdf 2-transactionalclustering-2019.pdf	Monreale
28.	06.12 11:00-13:00	Sequential Pattern Mining	Sequential patterns	Pedreschi
29.	09.12 14:00-16:00	Exercises on sequential pattern mining & ROCK	exsequentialpatternmining.pdf ex-clustering-rock.pdf	Monreale
30.	11.12 16:00-18:00	Black Box Explanations	2019-dm_xai.pdf Material: LORE LIME Survey ABELE	Monreale
31.	13.12 11:00-13:00	Exercises on written exam - all students	9_cfu_ex.pdf ex_clustering_fpm_dt.pdf hierarchical_max_sim.pdf	Monreale
32.	16.12 13:30-16:00	Mid-term Test (Rooms A, E1, C1)		Monreale
33.	18.12 16:00-18:00	Privacy in DM. Project.	privacydt.pdf Overview on Privacy Privacy by design	Monreale

Second part of course, second semester (DM2 - Advanced Topics on Data Mining and Applications)

	Day	Room (Aula)	Topic	Learning material
1.	17.02.2020 09:00-11:00	C	Introduction, Instance-based and Bayesian Classifiers	Intro, Libraries, Instance-Based and Bayesian Classifiers
2.	19.02.2020 16:00-18:00	C1	Linear and Logistic Regression, Dimensionality Reduction, Exercises KNN and Naive Bayes	Regression, Dimensionality Reduction, Ex_KNN_NB_Lift, Appendix
3.	24.02.2020 09:00-11:00	C	Imbalanced Learning, Performance Evaluation and Rule-based Classifiers	Imbalanced Learning Rule-based Classifiers
4.	26.02.2020 16:00-18:00	C1	Exercises Lift, ROC, KNN and Naive Bayes. Lab KNN and Naive Bayes.	Ex_KNN_NB_Lift, Lab_KNN_NB, Data Preparation, Churn Dataset, Iris Dataset
5.	02.03.2020 09:00-11:00	C	Lab Regression, Dimensionality Reduction, Imbalanced Learning + CAT1	Regression, Dimensionality Reduction, Imbalanced Learning Airquality Dataset
6.	04.03.2020 16:00-18:00	C1	CRISP-DM, SVM, Intro NN	CRISP-DM, SVM, NN
7.	09.03.2020 09:00-11:00	online	Neural Network, Exercises NN	NN , Ex_NN_Ensemble
8.	11.03.2020 16:00-18:00	online	Neural Network, Exercises NN, Deep Neural Network, Intro Ensemble, Exercises Ensemble	NN , DNN Ex_NN_Ensemble
9.	16.03.2020 09:00-11:00	online	Ensemble Classifiers, Exercises Ensemble	Ensemble, Ex_NN_Ensemble
10.	18.03.2020 16:00-18:00	online	Lab SVM, Neural Network, Ensemble	Lab_SVM_NN_RF
11.	23.03.2020 09:00-11:00	online	Time Series Similarity, Ex DTW	Time Series Similarity, Ex_DTW
12.	25.03.2020 16:00-18:00	online	Time Series Motif/Shapelet, Ex Matrix Profile	Time Series Motif/Shapelet, Ex_MP
13.	30.03.2020 09:00-11:00	online	Time Series Stationarity and Forecasting	Time Series Forecasting
14.	01.04.2020 16:00-18:00	online	Lab Time Series	Lab_TS
15.	06.04.2020 09:00-11:00	online	Time Series Classification, Lab Time Series	Time Series Classification, Lab_TS, Data Partitioning
-	08.04.2020		Reading/Project Week
-	15.04.2020		Reading/Project Week
16.	20.04.2020 09:00-11:00	online	Sequential Pattern Mining	SPM
17.	22.04.2020 16:00-18:00	online	SPM Time Constraints, Exercises, Lab	Ex_SPM, Lab_SPM
18.	27.04.2020 09:00-11:00	online	Advanced Clustering, Ex, SPM, Lab EM, X-Means	Advanced Clustering , Lab_AC
19.	29.04.2020 16:00-18:00	online	Transactional Clustering, Ex TC, Lab K-Mode	Ex_SPM_TC
20.	04.05.2020 09:00-11:00	online	Anomaly Detection, Ex AD	Anomaly Detection , Ex_AD
21.	06.05.2020 16:00-18:00	online	Anomaly Detection, Ex AD, Lab AD	Anomaly Detection , Ex_AD, Lab_AD
22.	11.05.2020 09:00-11:00	online	Ethics: Privacy	Privacy
23.	13.05.2020 16:00-18:00	online	Ethics: Explainability	Explainability
24.	18.05.2020 09:00-11:00	online	Ethics: Local Explainability, Inspection, Transparent Methods, Lab	Explainability, Lab_XAI
-	20.05.2020		Reading/Project Week
-	25.05.2020		Reading/Project Week
-	27.05.2020		Reading/Project Week

Exams

Exam DM part I (DMF)

RULES FOR EXAMS for COMPUTER SCIENCE - 9CFU: EXAM RULES Summer Session - 9 CFU

RULES FOR EXAMS for DATA SCIENCE & BI and DIGITAL HUMANITIES - DM1(6CFU): EXAM RULES Summer Session - DM1(6CFU)

The exam is composed of two parts:

An oral exam, which includes: (1) discussing the project report; (2) discussing topics presented during the classes, including the theory of the practical parts. It is optional for students who pass the written part through the mid-term test ONLY.

A project, consisting of exercises that require the use of data mining tools for data analysis. Exercises include: data understanding, clustering analysis, frequent pattern mining, and classification. The project has to be carried out by a group of min 3, max 4 people, using Knime, Python or a combination of them. The results of the different tasks must be reported in a single paper. The total length of this paper must be max 20 pages of text including figures. The paper must be emailed to datamining [dot] unipi [at] gmail [dot] com. Please use “[DM 2019-2020] Project” in the subject.

Tasks of the project:

Data Understanding: Explore the dataset with the analytical tools studied and write a concise “data understanding” report describing data semantics, assessing data quality, the distribution of the variables and the pairwise correlations. (see Guidelines for details)
Clustering analysis: Explore the dataset using various clustering techniques. Carefully describe your decisions for each algorithm and the advantages provided by the different approaches. (see Guidelines for details)
Classification: Explore the dataset using classification trees. Use them to predict the target variable. (see Guidelines for details)
Association Rules: Explore the dataset using frequent pattern mining and association rule extraction. Then use the rules to predict a variable, either to replace missing values or to predict the target variable. (see Guidelines for details)
ADDITIONAL TASK for DM 9 CFU (OPTIONAL): Computer science students (DM 9 CFU) can choose to deliver an additional project task, selected from the following, for a bonus of 3 points:
1. Classification: Compare results of classification by decision tree with KNN, Naive Bayesian, analysing also the runtime at training and test phase.
2. Clustering: Is it possible to apply EM clustering? Does the quality of the clustering result improve?
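
The association rules task above relies on two measures, support and confidence. The following is a minimal plain-Python sketch of how they are computed; the function names and the toy transactions are hypothetical and are not taken from the course datasets. In the project itself a dedicated tool (Knime, or a Python library such as pyfim, listed in the class calendar) would normally be used instead.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimate of P(consequent | antecedent) over the transactions."""
    joint = support(transactions, set(antecedent) | set(consequent))
    base = support(transactions, antecedent)
    return joint / base if base > 0 else 0.0


# Hypothetical toy transactions, for illustration only.
toy = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
```

With these definitions, a rule such as {bread} -> {milk} is kept only if both `support(toy, {"bread", "milk"})` and `confidence(toy, {"bread"}, {"milk"})` exceed the thresholds chosen for the analysis.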

Project 1
1. Dataset: Carvana Data
2. Assigned: 07/10/2019
3. Deadline: ~~05/01/2020~~ 08/01/2020
4. Link: https://www.kaggle.com/t/712fc5e264e748afb0e0616f56f3c102

Project 2
1. Dataset: Bank Loan Status
2. Assigned: 09/01/2020
3. Deadline: 4 days before the oral exam
4. This dataset will be used for all tasks. For the classification task, you have to split the dataset into train and test set and the class to predict is the variable “Loan Status”.
5. This dataset will be valid for all the exam sessions until September.
6. Download the dataset Bank Loan Status dataset (in CSV format, zipped)
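
The train/test split required for the classification task can be sketched as follows. This is a plain-Python illustration under stated assumptions: the rows are assumed to be already loaded as a list of dictionaries, the function name `stratified_split` and the 30% test fraction are hypothetical choices, and only the label column name “Loan Status” comes from the project description. Libraries such as scikit-learn offer an equivalent ready-made utility.

```python
import random
from collections import defaultdict


def stratified_split(rows, label_key, test_fraction=0.3, seed=0):
    """Split `rows` (a list of dicts) into train and test lists,
    keeping roughly the same class proportions in both parts."""
    by_class = defaultdict(list)
    for row in rows:
        by_class[row[label_key]].append(row)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, test = [], []
    for members in by_class.values():
        members = members[:]   # copy so the input list is not reordered
        rng.shuffle(members)
        n_test = round(len(members) * test_fraction)
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return train, test
```

A call such as `stratified_split(rows, "Loan Status")` returns the two lists; splitting per class keeps a rare class from ending up entirely in one of the two parts.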

Guidelines for the project are here.

Exam DM part II (DMA)

The exam is composed of three parts:

A written exam, with exercises and questions about methods and algorithms presented during the classes. It can be substituted with ongoing tests held during the course.

A project, which consists in employing the methods and algorithms presented during the classes to solve exercises on a given dataset. The project has to be carried out by a group of max 3 people, using Knime, Python, other software or a combination of them. The results of the different tasks must be reported in a single paper. The total length of this paper must be max 30 pages (suggested 25) of text including figures + 1 cover page (minimum font 11, minimum interline 1). The project must be delivered at least 2 days before the oral exam.

An oral exam, that includes: (1) discussing topics presented during the classes, including the theory of the parts already covered by the written exam; (2) discussing the project report with a group presentation.

Dataset: the data is about Occupancy Detection and can be downloaded here: dataset.
* Submission Draft 1: 16/04/2020 23:59 Italian Time
* Submission Draft 2: 25/05/2020 23:59 Italian Time
* Final Submission: one week before the oral exam.

Dataset 2: the data is about Air Quality and can be downloaded here: dataset. The dataset does not have a target variable for classification. Thus, define a target variable, for instance “is weekend”, set to “true” for weekend days and “false” for the others.
Final Submission: one week before the oral exam or by 30/11/2020.
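
The suggested “is weekend” target can be derived from each record's date, as in the following sketch. The function names and the default date format `%d/%m/%Y` are assumptions made for illustration; the actual format must be checked against the downloaded file.

```python
from datetime import date, datetime


def is_weekend(day):
    """True for Saturday and Sunday (weekday() returns 5 or 6)."""
    return day.weekday() >= 5


def is_weekend_from_text(text, fmt="%d/%m/%Y"):
    """Parse a date string and flag weekend days.

    The default format is an assumption, not taken from the dataset.
    """
    return is_weekend(datetime.strptime(text, fmt).date())
```

For example, 30/11/2020 was a Monday, so `is_weekend(date(2020, 11, 30))` is false while the two preceding days are flagged as weekend.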

Project Task 1 - Basic Classifiers and Evaluation
1. Prepare the dataset in order to build several basic classifiers able to predict room occupancy from the available variables. You are welcome to create new variables.
2. Solve the classification task with k-NN (testing values of k), Naive Bayes, Logistic Regression, Decision Tree using cross-validation and/or random/grid search for parameter estimation.
3. Evaluate each classifier using Accuracy, Precision, Recall, F1, ROC, AUC and Lift Chart.
4. Try to reduce the dimensionality of the dataset using the methods studied (or new ones). Test PCA and try to solve the classification task in two dimensions. Plot the dataset in the two new dimensions and compare the class separation visible in the data with the decision boundaries of the trained algorithms.
5. Analyze the value distribution of the class to predict and turn the dataset into an imbalanced version reaching a strong majority-minority distribution (e.g. 96%-4%). Then solve again the classification task adopting the various techniques studied (or new ones).
6. Select two continuous attributes, define a regression problem and try to solve it using different techniques reporting various evaluation measures. Plot the two-dimensional dataset. Then generalize to multiple linear regression and observe how the performance varies.
7. Draw your conclusions about the basic classifiers and techniques adopted in this analysis.
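
Step 2 of Task 1 asks to test several values of k for k-NN using cross-validation. The sketch below shows the idea in plain Python, assuming numeric feature tuples and Euclidean distance; the function names, the candidate values of k and the interleaved fold assignment are illustrative choices, not a prescribed method. In practice the same search is usually done with a library grid search, and the data should be shuffled first if its rows are ordered.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k):
    """Majority label among the k training points nearest to `query`."""
    order = sorted(range(len(train_X)),
                   key=lambda i: math.dist(train_X[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def cv_accuracy(X, y, k, n_folds=5):
    """Mean accuracy of k-NN over `n_folds` interleaved folds."""
    scores = []
    for fold in range(n_folds):
        test_idx = [i for i in range(len(X)) if i % n_folds == fold]
        train_idx = [i for i in range(len(X)) if i % n_folds != fold]
        tr_X = [X[i] for i in train_idx]
        tr_y = [y[i] for i in train_idx]
        correct = sum(knn_predict(tr_X, tr_y, X[i], k) == y[i]
                      for i in test_idx)
        scores.append(correct / len(test_idx))
    return sum(scores) / n_folds


def best_k(X, y, candidates=(1, 3, 5, 7)):
    """Candidate k with the highest cross-validated accuracy."""
    return max(candidates, key=lambda k: cv_accuracy(X, y, k))
```

Choosing k on cross-validated accuracy rather than on training accuracy avoids always selecting k = 1, which fits the training set perfectly by construction.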

Project Task 2 - Advanced Classifiers and Evaluation
1. Using the dataset for classification prepared for Task 1 build several advanced classifiers able to predict room occupancy from the available variables. In particular, you are required to use SVM (linear and non-linear), NN (Single and Multilayer Perceptron), DNN (design at least two different architectures), Ensemble Classifier (RandomForest, AdaBoost and a Bagging technique in which you can select a base classifier of your choice with a justification).
2. Evaluate each classifier using Accuracy, Precision, Recall, F1, ROC, etc; Draw your conclusion about the classifiers.
3. Highlight in the report different aspects typical of each classifier. For instance for SVM: is a linear model the best way to shape the decision boundary? Or for NN: what are the parameter sets or the convergence criteria suggesting you are avoiding overfitting? How many iterations/base classifiers are needed to allow a good estimation using an ensemble method? Which is the feature importance for the Random Forest?
4. You are NOT required to experiment also in the imbalanced case but if you do it is not considered a mistake.
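
Tasks 1 and 2 both ask for Accuracy, Precision, Recall and F1. As a reminder of how these follow from the confusion matrix of a binary classifier, here is a small plain-Python sketch; the function name `binary_metrics` and the convention of returning 0.0 when a denominator is zero are assumptions made for this example.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}
```

On an imbalanced dataset, accuracy alone can look high even when recall on the minority class is poor, which is why the tasks ask for all of these measures together.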

Project Task 3 - Time Series Analysis and Forecasting/Classification
1. Exploit the temporal information of the dataset by preparing it for a univariate framework of analysis, i.e. select a feature and use it as your time series. You are welcome to use more than one reliable temporal split to obtain more time series of the same feature. You are also welcome to create more than one dataset using more than one feature, and to report the results on the feature you prefer or on several of them. Analyze these datasets to find motifs and/or anomalies and shapelets. Visualize and discuss them and their relationship with the class of the time series.
2. On the dataset(s) created, compute clustering based on Euclidean/Manhattan and DTW distances and compare the results. To perform the clustering you can choose among different similarity methods, i.e., shape-based, feature-based, approximation-based, compression-based, etc. Finally, analyze the clusters and the clustering and highlight similarities and differences.
3. Apply forecasting methods on the dataset(s) created. Make sure to preprocess the time series adequately for the method used (e.g., exponential smoothing or autoregression), for instance by checking stationarity and removing trend and seasonality, possibly with the help of a statistical significance test.
4. Solve the classification task on the univariate dataset created using different approaches, i.e., traditional classification, shapelet-based, feature-based, etc.
5. Solve the classification task considering the whole dataset as a multivariate dataset. Develop the classification process you prefer (e.g. exploiting shapelets, traditional classifiers, CNN, or RNN) to maximize accuracy and F1-score.
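
Step 2 of Task 3 compares Euclidean/Manhattan distances with DTW. The sketch below implements both a Manhattan distance and the classic dynamic-programming DTW in plain Python, using the absolute difference as point-wise cost and no warping window; these choices and the function names are assumptions for illustration, and a dedicated time series library would be faster on real data.

```python
def manhattan(a, b):
    """Sum of point-wise absolute differences (equal-length series)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def dtw_distance(a, b):
    """Dynamic time warping cost between two numeric sequences,
    using the absolute difference as the point-wise cost."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = best cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(cost[i - 1][j],      # repeat b[j-1]
                                    cost[i][j - 1],      # repeat a[i-1]
                                    cost[i - 1][j - 1])  # advance both
    return cost[n][m]
```

Two series with the same shape but shifted in time, such as `[1, 2, 3, 3]` and `[1, 1, 2, 3]`, have a positive Manhattan distance but a DTW cost of zero, which is the kind of difference the clustering comparison is meant to expose.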

Project Task 4 - Sequential Pattern Mining
1. Convert the time series into a discrete format (e.g., SAX) in order to prepare the data for the task.
2. Using different values of support, extract the most frequent sequential patterns (of at least length 3/4), then discuss the most interesting sequences.
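
The SAX conversion mentioned in step 1 can be sketched as three operations: z-normalisation, Piecewise Aggregate Approximation (PAA), and mapping each segment mean to a symbol through breakpoints that split the standard normal distribution into equiprobable regions. The plain-Python version below is a simplified illustration; the function name, the default alphabet and the handling of constant series are assumptions, and it requires at least as many points as segments.

```python
from bisect import bisect_right
from statistics import NormalDist, mean, pstdev


def sax(series, n_segments, alphabet="abcd"):
    """Convert a numeric series into a SAX word of length `n_segments`.

    Requires len(series) >= n_segments.
    """
    mu, sigma = mean(series), pstdev(series)
    # z-normalise; a constant series is mapped to all zeros.
    z = [(x - mu) / sigma if sigma > 0 else 0.0 for x in series]

    # PAA: mean of each of the (nearly) equal-sized segments.
    n = len(z)
    bounds = [i * n // n_segments for i in range(n_segments + 1)]
    paa = [mean(z[bounds[i]:bounds[i + 1]]) for i in range(n_segments)]

    # Breakpoints cutting the standard normal into equiprobable regions.
    size = len(alphabet)
    breakpoints = [NormalDist().inv_cdf(i / size) for i in range(1, size)]
    return "".join(alphabet[bisect_right(breakpoints, v)] for v in paa)
```

Each time series becomes a short string, and the resulting collection of strings can be passed to a sequential pattern mining algorithm with the chosen support thresholds.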

Project Task 5 - Outlier Detection and Explainability
1. From the original dataset (i.e. not the time series built on Task 3 or sequences of Task 4, nor the preprocessed dataset used in Tasks 1 and 2), identify the top 1% outliers.
2. Adopt at least three different methods belonging to different families (i.e. statistical/depth-based, distance-based, density-based, angle-based, …) and compare the results.
3. (Optional) Try to use an explanation method to illustrate the reasons for the classification in one of the steps of the previous Tasks (if you want to try LORE, please ask riccardo.guidotti@unipi.it for the code).
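
As an illustration of one distance-based approach to step 1, the sketch below scores each point by the distance to its k-th nearest neighbour and returns the top 1% highest-scoring points. It is a plain-Python example under stated assumptions: numeric feature tuples, Euclidean distance, hypothetical function names, and a default of k = 5. The quadratic cost makes it suitable only for small samples, and the task still requires methods from at least two other families.

```python
import math


def knn_outlier_scores(points, k=5):
    """Score each point by the distance to its k-th nearest neighbour."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q)
                       for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores


def top_outliers(points, fraction=0.01, k=5):
    """Indices of the highest-scoring points (at least one is returned)."""
    scores = knn_outlier_scores(points, k)
    n_out = max(1, round(len(points) * fraction))
    ranked = sorted(range(len(points)),
                    key=lambda i: scores[i], reverse=True)
    return ranked[:n_out]
```

Features on different scales should be normalised before computing distances, otherwise the variable with the largest range dominates the score.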

Exam sessions

Mid-term exams

	Date	Hour	Place	Notes	Marks
DM1: First Mid-term 2018	16.12.2019	13:30-16:00	Room E1, C1, A	Please, use the system for registration: https://esami.unipi.it/

Regular exam sessions

Session	Date	Time	Room	Notes	Marks
1.	16.01.2019	14:00 - 18:00	Room E
2.	06.02.2019	14:00 - 18:00	Room E
3.	19.06.2019	09:00 - 13:00	Room A1	Oral exam on DM1 by 15 July. If you cannot take it by that date, you can take the oral exam in September.	Results
4.	10.07.2019	09:00 - 13:00	Room A1	Oral exam on DM1 by 15 July. If you cannot take it by that date, you can take the oral exam in September.	Results
5.	08.06.2020	09:00 - 18:00	Microsoft Teams	From 08/06 to 25/06. Please register (here) and select your slot here. Remember to submit the project one week before the exam. It would be helpful if you submitted the project by 01/06.
6.	26.06.2020	09:00 - 18:00	Microsoft Teams	From 26/06 to 16/07. Please register (here) and select your slot here. Remember to submit the project one week before the exam. It would be helpful if you submitted the project by 21/06.
7.	17.07.2020	09:00 - 18:00	Microsoft Teams	From 17/07 to 29/07. Please register (here) and select your slot at the agenda link that will be available from 12/07 only for those registered for the exam. Remember to submit the project one week before the exam. It would be helpful if you submitted the project by 10/07. It is mandatory to submit the project before 15/07.
