Data Mining A.A. 2018/19

DM 1: Foundations of Data Mining (6 CFU)

Instructors:

Teaching assistant:

DM 2: Advanced topics on Data Mining and case studies (6 CFU)

Instructors:

DM: Data Mining (9 CFU)

Instructors:

Teaching assistant:

News

Learning goals

… a new kind of professional has emerged, the data scientist, who combines the skills of software programmer, statistician and storyteller/artist to extract the nuggets of gold hidden under mountains of data. Hal Varian, Google’s chief economist, predicts that the job of statistician will become the “sexiest” around. Data, he explains, are widely available; what is scarce is the ability to extract wisdom from them.

Data, data everywhere. The Economist, Special Report on Big Data, Feb. 2010.

The wide availability of data from relational databases, the web and other sources motivates the study of data analysis techniques that allow a better understanding and an easier use of the results in decision-making processes. The goal of the course is to provide an introduction to the basic concepts of the knowledge extraction process, to the main data mining techniques and to the related algorithms. Particular emphasis is given to methodological aspects, presented through some classes of paradigmatic applications such as Market Basket Analysis, market segmentation and fraud detection. Finally, the course introduces the privacy and ethical aspects of using inference techniques on data, which the analyst must be aware of. The course consists of the following parts:

  1. the basic concepts of the knowledge extraction process: data understanding and preparation, forms of data, data measures and similarity;
  2. the main data mining techniques (association rules, classification and clustering), studied in both their formal and implementation aspects;
  3. some case studies in marketing and customer management support, fraud detection and epidemiological studies;
  4. an introduction to the privacy and ethical aspects of using inference techniques on data, which the analyst must be aware of.

Reading about the "data scientist" job

  • Data, data everywhere. The Economist, Feb. 2010 download
  • Data scientist: The hot new gig in tech, CNN & Fortune, Sept. 2011 link
  • Welcome to the yotta world. The Economist, Sept. 2011 download
  • Data Scientist: The Sexiest Job of the 21st Century. Harvard Business Review, Sept 2012 link
  • Il futuro è già scritto in Big Data. Il Sole 24 Ore, Sept 2012 link
  • Special issue of Crossroads - The ACM Magazine for Students - on Big Data Analytics download
  • Peter Sondergaard, Gartner, Says Big Data Creates Big Jobs. Oct 22, 2012: YouTube video
  • Towards Effective Decision-Making Through Data Visualization: Six World-Class Enterprises Show The Way. White paper at FusionCharts.com. download

Hours and Rooms

DM1 & DM

Classes

Day of Week Hour Room
Monday 14:00 - 16:00 Room C1
Wednesday 14:00 - 16:00 Room C1
Friday 11:00 - 13:00 Room C1

Office hours:

  • Prof. Pedreschi: Monday 14:00 - 16:00, Dept. of Computer Science
  • Prof. Monreale: by appointment, Room 374/DO, Dept. of Computer Science.
  • Dr. Guidotti: class-appointment (see calendar)

DM 2

Classes

Day of week Hour Room
Thursday 14 - 16 A1
Friday 16 - 18 C1

Office hours:

  • Nanni : appointment by email, c/o ISTI-CNR

Learning Material

Textbook

  • Pang-Ning Tan, Michael Steinbach, Vipin Kumar. Introduction to Data Mining. Addison Wesley, ISBN 0-321-32136-7, 2006
  • Berthold, M.R., Borgelt, C., Höppner, F., Klawonn, F. GUIDE TO INTELLIGENT DATA ANALYSIS. Springer Verlag, 1st Edition., 2010. ISBN 978-1-84882-259-7
  • Laura Igual et al. Introduction to Data Science: A Python Approach to Concepts, Techniques and Applications. 1st ed. 2017 Edition.

Slides of the classes

The slides used during the course will be added to the calendar at the end of each lecture. Most of them are taken from those provided by the authors of the textbook: Slides for "Introduction to Data Mining"

Past Exams

* Some texts of past exams on DM1 (6 CFU):

* Some solutions of past exams containing exercises on KNN and Naive Bayes classifiers DM1 (9 CFU):

* Some exercises (partially with solutions) on sequential patterns and time series can be found in the following exam texts from past years:

Data mining software

Class calendar (2018/2019)

First part of course, first semester (DM1 - Data mining: foundations & DM - Data Mining)

Day Room Topic Learning material Instructor
1. 19.09 14:00-16:00 C1 Overview. Introduction. 1.2018-dm-overview.pdf Pedreschi
2. 20.09 16:00-18:00 C1 Introduction Pedreschi
21.09 11:00-13:00 C1 Lecture canceled Pedreschi
3. 24.09 14:00-16:00 C1 KDD Process & Applications. Data Understanding. DM + Applications DU Monreale
4. 26.09 14:00-16:00 C1 Data Understanding. Data Preparation Monreale
5. 28.09 11:00-13:00 C1 Introduction to Python, Knime intro_knime intro_python Monreale/Guidotti
6. 01.10 14:00-16:00 C1 Data Preparation Data Preparation Monreale
7. 03.10 14:00-16:00 C1 Clustering: Introduction and Centroid-based clustering 4.basic_cluster_analysis-intro-kmeans.pdf Monreale
05.10 11:00-13:00 C1 Lecture canceled
8. 08.10 14:00-16:00 C1 Knime - Python: Data Understanding du_knime du_python Guidotti
9. 10.10 14:00-16:00 C1 Clustering: K-means & Hierarchical 5.basic_cluster_analysis-hierarchical.pdf Pedreschi
12.10 11:00-13:00 C1 Lecture canceled for IF
10. 15.10 14:00-16:00 C1 Clustering: DBSCAN 6.basic_cluster_analysis-dbscan-validity.pdf Pedreschi
11. 17.10 14:00-16:00 C1 Clustering: Validity Pedreschi
12. 19.10 11:00-13:00 C1 Discussion on Projects - DU Guidotti
13. 22.10 14:00-16:00 C1 Exercises for mid-term test Tool for Dm ex: Didactic Data Mining Ex. Clustering PDF Ex. Clustering PPTX Monreale
14. 24.10 14:00-16:00 C1 Knime - Python: Clustering clustering_knime clustering_python Guidotti
15. 26.10 11:00-13:00 C1 Exercises for mid-term test Ex. Clustering PPTX - complete Ex. Clustering PDF - complete Exercises DU ex-silhouette.pdf Monreale
16. 05.11 14:00-16:00 C1 Classification/1 7.chap3_basic_classification.ppt Monreale
17. 07.11 14:00-16:00 C1 Classification/2 Monreale
09.11 11:00-13:00 C1 CANCELED
18. 12.11 14:00-16:00 C1 LAB: Classification knime_classification python_classification Guidotti
19. 14.11 14:00-16:00 C1 Pattern Mining Explanation of classification/ML models Pattern mining Intro Apriori Algorithm for Pattern/AR Mining Pedreschi
20. 16.11 11:00-13:00 C1 Pattern Mining Pedreschi
21. 19.11 14:00-16:00 C1 Exercises for the mid-term ex-second-midterm.pdf Monreale
22. 21.11 14:00-16:00 C1 Lab Pattern Mining + Discussion Clustering knime_pattern python_pattern https://anaconda.org/conda-forge/pyfim, https://pypi.org/project/fim/, http://www.borgelt.net/pyfim.html Guidotti/Pedreschi
The next lectures are dedicated to the DM of 9 credits
23. 23.11 11:00-13:00 C1 Alternative methods for Pattern Mining. Privacy in DM fp-growth.pdf Monreale
24. 26.11 14:00-16:00 C1 Alternative methods for Clustering. Privacy in DM 1-alternative-clustering.pdf Monreale
25. 28.11 14:00-16:00 C1 Privacy in DM. Transactional Clustering 2-transactionalclustering.pdf privacydt.pdf Papers on Clustering Monreale
26. 30.11 11:00-13:00 C1 Alternative methods for classification/1 K-Nearest Neighbors & Naive Bayes Pedreschi
27. 03.12 14:00-16:00 C1 Alternative methods for classification/2 Ensemble methods Wisdom of the crowd & Ensemble methods Galton's Vox Populi Pedreschi
28. 05.12 14:00-16:00 C1 Alternative methods for classification/3 Pedreschi
29. 07.12 11:00-13:00 C1 Exercises on clustering and classification CLOPE K-mode KNN & NB Monreale
30. 10.12 14:00-16:00 C1 Exercises on Second part - all students Monreale
31. 12.12 14:00-16:00 C1 Final Discussion on Project - all students Pedreschi/Guidotti
32. 14.12 11:00-13:00 C1 Cancelled

Second part of course, second semester (DMA - Data mining: advanced topics and case studies)

Day Room Topic Learning material Instructor (default: Nanni)
1. 21.02.2019 14:00-16:00 A1 Introduction + Sequential patterns/1 Introduction, Sequential patterns
2. 22.02.2019 16:00-18:00 C1 Sequential patterns/2
3. 01.03.2019 16:00-18:00 C1 Sequential patterns/3 Sample exercises (fixed)
4. 07.03.2019 14:00-16:00 A1 Sequential patterns/4 Sequential pattern tools: Link to SPMF + Sample datasets, Python2 GSP educational implementation(source), PrefixSpan-py (requires Python3)
5. 08.03.2019 16:00-18:00 C1 Time series/1 Time series
6. 14.03.2019 14:00-16:00 A1 Time series/2 Overview on DM for time series, DTW paper by Sakoe and Chiba, 1978
7. 15.03.2019 16:00-18:00 C1 Time series/3
8. 21.03.2019 14:00-16:00 A1 Time series/4 Preprocessing in Python DTW in Python
9. 22.03.2019 16:00-18:00 C1 Time series/5
10. 28.03.2019 14:00-16:00 A1 Exercises for mid-term exam Exercises from past exams
11. 29.03.2019 16:00-18:00 C1 Exercises for mid-term exam Exercises from past exams (with some solutions)
04.04.2019 16:00-18:00 A1 + E mid-term exam
11. 11.04.2019 14:00-16:00 A1 Classification: alternative methods/1 kNN and Bayes classifier
12. 12.04.2019 16:00-18:00 C1 Classification: alternative methods/2 NN and SVM, Exercises
02.05.2019 14:00-16:00 A1 Cancelled
13. 03.05.2019 16:00-18:00 C1 Classification: alternative methods/3
14. 09.05.2019 14:00-16:00 A1 Classification: alternative methods/4 Ex. on NNs and SVM, Ex. on KNN and Naive Bayes
15. 10.05.2019 16:00-18:00 C1 Classification: Model Evaluation Model performances
16. 16.05.2019 14:00-16:00 A1 Classification: Model Evaluation Unbalanced data, Classification weights
17. 17.05.2019 16:00-18:00 C1 Classification: alternative methods/5 Ensembles, Homeworks!
18. 23.05.2019 14:00-16:00 A1 Exercises + Outlier detection/1 Ex. on Lift chart, Ex. on Ensembles, Outlier detection
19. 24.05.2019 16:00-18:00 C1 Outlier detection/2 Ex. on outliers, Ex. from past exams
20. 31.05.2019 16:00-18:00 C1 Due to a strike, the lesson will not take place. For your convenience, here is some material you can use: Examples of classification and validation in Python, Examples of outlier detection in Python, CRISP-DM guidelines. Feel free to contact me if you need clarifications. Remark: the CRISP-DM model will not be part of the exam program.
06.06.2019 16:00-18:00 E (+A1) mid-term exam 2nd mid-term of last year and its solutions (careful: they were not double-checked).

Exams

Exam DM part I (DMF)

The exam is composed of three parts:

  • A written exam, with exercises and questions about methods and algorithms presented during the classes. It can be substituted with the first and second mid-term tests of November and December.
  • An oral exam (optional), that includes: (1) discussing the project report with a group presentation; (2) discussing topics presented during the classes, including the theory of the parts already covered by the written exam. It is optional for students who pass the written part through the mid-term tests alone.
  • A project consisting of exercises that require the use of data mining tools for data analysis. Exercises include: data understanding, clustering analysis, frequent pattern mining, and classification. The project has to be carried out in groups of min 3, max 4 people, using Knime, Python or a combination of them. The results of the different tasks must be reported in a single paper. The total length of this paper must be max 20 pages of text including figures. The paper must be emailed to datamining [dot] unipi [at] gmail [dot] com. Please use “[DM 2018-2019] Project 2” in the subject. Students who decide to do the project during the summer exam sessions will find the project dataset online after 31/05/2019. In this case the project must be delivered at least 2 days before the oral exam.

Tasks of the project:

  1. Data Understanding (Collective discussion on: 19/10/2018): Explore the dataset with the analytical tools studied and write a concise “data understanding” report describing data semantics, assessing data quality, the distribution of the variables and the pairwise correlations. (see Guidelines for details)
  2. Clustering analysis (Collective discussion on: 21/11/2018): Explore the dataset using various clustering techniques. Carefully describe your decisions for each algorithm and the advantages provided by the different approaches. (see Guidelines for details)
  3. Classification (Collective discussion on: 12/12/2018): Explore the dataset using classification trees and random forest. Use them to predict the target variable. (see Guidelines for details)
  4. Association Rules (Collective discussion on: 12/12/2018): Explore the dataset using frequent pattern mining and association rule extraction. Then use the rules to predict a variable, either to replace missing values or to predict the target variable. (see Guidelines for details)

Guidelines for the project are here.
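
As a rough illustration of the Association Rules task, the Python sketch below mines frequent itemsets with a level-wise, Apriori-style search and derives rules from them. It is a teaching sketch under stated assumptions, not the tool expected for the project: the function names, the toy transactions and the thresholds are hypothetical, and support is expressed as an absolute count.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Level-wise (Apriori-style) search; min_support is an absolute count."""
    transactions = [frozenset(t) for t in transactions]
    items = sorted({i for t in transactions for i in t})
    current = [frozenset([i]) for i in items]
    frequent = {}
    k = 1
    while current:
        # Count in how many transactions each candidate itemset occurs.
        counts = {c: sum(c <= t for t in transactions) for c in current}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        k += 1
        # Join step: unions of frequent (k-1)-itemsets that have exactly k items.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        current = [c for c in candidates
                   if all(frozenset(s) in level for s in combinations(c, k - 1))]
    return frequent


def rules(frequent, min_confidence):
    """Return (antecedent, consequent, confidence) triples meeting min_confidence."""
    out = []
    for itemset, support in frequent.items():
        for r in range(1, len(itemset)):
            for ante in combinations(sorted(itemset), r):
                ante = frozenset(ante)
                # confidence(A -> B) = support(A and B) / support(A)
                conf = support / frequent[ante]
                if conf >= min_confidence:
                    out.append((ante, itemset - ante, conf))
    return out


# Hypothetical toy data, for illustration only.
baskets = [{"bread", "milk"}, {"bread", "beer"},
           {"bread", "milk", "beer"}, {"milk"}]
freq = frequent_itemsets(baskets, min_support=2)
strong = rules(freq, min_confidence=0.9)
```

On this toy data the only rule with confidence of at least 0.9 is beer → bread. A rule whose consequent is the target variable (or an attribute with missing values) can then be used for prediction, as the task asks.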

Exam DM part II (DMA)

The exam is composed of three parts:

  • A written exam, with exercises and questions about methods and algorithms presented during the classes. It can be substituted with the first and second mid-term tests of April and June.
  • An oral exam, that includes: (1) discussing the project report with a group presentation; (2) discussing topics presented during the classes, including the theory of the parts already covered by the written exam.
  • A project consisting of exercises that require the use of data mining tools for data analysis. Exercises include: sequential patterns, time series, classification (alternative methods and validation), outlier detection. The project has to be carried out in groups of max 3 people, using Knime, Python, other software or a combination of them. The results of the different tasks must be reported in a single paper. The total length of this paper must be max 20 pages of text including figures. The project must be delivered at least 2 days before the oral exam.
    • Dataset: the data is a time series dataset on air quality, which can be downloaded here: Dataset.
    • Task 1: Time series: Consider only the attribute “PT08.S1(CO)” and split the corresponding time series into daily series, deleting those with too many missing values (value = -200) and fixing the others in some way. Also make sure that all time series have 24 values. Compute a clustering (with an algorithm of your choice) based on DTW and on Euclidean distance and compare the results.
    • Task 2: Sequential patterns: discover contiguous sequential patterns of at least length 4. Before that, time series should be discretized in some way.
    • Task 3: Classification methods: define a target variable “WE” for the time series dataset, set to “true” for weekend days and “false” for the others. Test the K-NN classification method using DTW as distance measure, and at least one other classification method using the 24 values as separate variables.
    • Task 4: Outlier detection: from the original dataset (i.e. the raw records with all attributes, not the time series built only on the “PT08.S1(CO)” attribute), identify the top 1% outliers. Adopt at least two different methods belonging to different families (i.e. model-based, distance-based, density-based, angle-based, …) to identify the 1% of input records with the highest likelihood of being outliers, and compare the results. Before doing the analysis, the records containing missing values should be deleted to avoid trivial results.
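
As a rough illustration of Task 1, the Python sketch below splits an hourly series into 24-value daily series, drops days with too many missing values, and compares two days with DTW and with Euclidean distance. It is only a sketch under stated assumptions: the function names, the `max_missing` threshold and the fill strategy (carry the closest earlier valid value forward) are hypothetical choices that the task leaves open, and the series is assumed to start at hour 0 with no missing rows.

```python
from math import inf

MISSING = -200  # missing-value marker stated in the task description


def fill_missing(day):
    """Replace each missing value with the closest earlier valid value.

    Leading gaps are filled with the first valid value of the day.
    """
    valid = [v for v in day if v != MISSING]
    if not valid:
        raise ValueError("day has no valid values")
    filled, last = [], valid[0]
    for v in day:
        if v != MISSING:
            last = v
        filled.append(last)
    return filled


def split_daily(values, max_missing=6):
    """Split an hourly series into complete 24-value days.

    Days with more than max_missing gaps are dropped (hypothetical threshold);
    a trailing incomplete day is ignored.
    """
    days = []
    for start in range(0, len(values) - 23, 24):
        day = list(values[start:start + 24])
        if sum(v == MISSING for v in day) > max_missing:
            continue
        days.append(fill_missing(day))
    return days


def dtw(a, b):
    """Unconstrained dynamic time warping with absolute-difference cost."""
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three admissible predecessor cells.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def euclidean(a, b):
    """Point-by-point Euclidean distance between two equal-length series."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
```

A pairwise distance matrix built with `dtw` or `euclidean` over the daily series can then be passed to any clustering algorithm that accepts precomputed distances, and the same `dtw` function can serve as the distance in the K-NN classifier of Task 3.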

Exam sessions

Mid-term exams

Exam Date Hour Place Notes Marks
DM1: First Mid-term 2018 30.10.2018 11-13 Room C1, L1, N1 Please use the system for registration: https://esami.unipi.it/ Results
DM1: Second Mid-term 2018 18.12.2018 11-13 Room C1, L1, N1 Please use the system for registration: https://esami.unipi.it/
DM2: First Mid-term 2019 04.04.2019 16-18 Room A1, E Please use the system for registration: https://esami.unipi.it/ Text + Solutions, Results
DM2: Second Mid-term 2019 06.06.2019 16-18 Room E (+ A1 if needed) Please use the system for registration: https://esami.unipi.it/ Text, Results

Regular exam sessions

Session Date Time Room Notes Marks
1. 16.01.2019 14:00 - 18:00 Room E
2. 06.02.2019 14:00 - 18:00 Room E
3. 19.06.2019 09:00 - 13:00 Room A1 Oral exam on DM1 by 15 July. If you cannot take it by that date, you can take the oral exam in September. Results
4. 10.07.2019 09:00 - 13:00 Room A1 Oral exam on DM1 by 15 July. If you cannot take it by that date, you can take the oral exam in September. Results

Extra sessions A.Y. 2017/18

Date Time Room Notes Results

Previous years

dm/dm.2018-19.txt · Last modified: 04/11/2022 at 12:13 by Salvatore Ruggieri