Data Mining, Academic Year 2013/14

Instructors:

Dino Pedreschi, Fosca Giannotti, Mirco Nanni
- KDD Laboratory, Università di Pisa and ISTI - CNR, Pisa
- http://www-kdd.isti.cnr.it
- dino [dot] pedreschi [at] di [dot] unipi [dot] it
- fosca [dot] giannotti [at] isti [dot] cnr [dot] it
- mirco [dot] nanni [at] isti [dot] cnr [dot] it

Teaching assistant:

Anna Monreale
- KDD Laboratory, Università di Pisa and ISTI - CNR, Pisa
- annam [at] di [dot] unipi [dot] it

News

[17/10/2014] Extra exam session for Academic Year 2013/2014: Friday, November 7, 2014, 9:00-11:00, room C1
[01/09/2014] Starting time of next exam session (Sept. 9th, 2014) has been moved to 3:30 P.M.
[13/07/2014] Results of DM II (written exam) available (Link)
[26/06/2014] The rooms for the second session of exams (30/6-2/7) changed
[11/06/2014] Results of DM I (written exam) available (Link)
[22/05/2014] Evaluation of Homework 6 (DM2) is online
[24/04/2014] The deadline for the second exercise is postponed to May 5th, 2014.
[15/04/2014] The email address datamining [dot] unipi [at] gmail [dot] com must be used only for submitting exercises!
[15/04/2014] Tomorrow, Wednesday, April 16, 2014, from 14:00 to 16:45: office hours for clarifications on the classification report (Office 375, Dipartimento di Informatica)
[07/04/2014] Evaluation of homework 5 (DM2) is online
[07/04/2014] The text for the second exercise, on sequential patterns, has been released. Deadline: 21/04/2014.
[11/03/2014] Detailed instructions for exams have been published. See: Instructions for exam AA 2013-14. Notice: the proposed seminars can replace only the second exercise, and not the first one.
[30/01/2014] The next exam session will be on Monday 10/02/2014 at 9.30 in Room C. On that date students can take either the written exam or the oral exam. On Monday we will also set further dates for the oral exam, so all students are invited to attend in order to choose the date of their exam.
[20/01/2014] Evaluation of the first written test is online
[07/01/2014] Evaluation of the second homework is online
[09/12/2013] Evaluation of the first homework is online
[11/10/2013] Remember to register as a user of this wiki and subscribe to receive a message when this wiki is updated!

Learning goals

… a new kind of professional has emerged, the data scientist, who combines the skills of software programmer, statistician and storyteller/artist to extract the nuggets of gold hidden under mountains of data. Hal Varian, Google’s chief economist, predicts that the job of statistician will become the “sexiest” around. Data, he explains, are widely available; what is scarce is the ability to extract wisdom from them.

Data, data everywhere. The Economist, Special Report on Big Data, Feb. 2010.

The wide availability of data from relational databases, the web, or other sources motivates the study of data analysis techniques that allow a better understanding and an easier use of the results in decision-making processes. The goal of the course is to provide an introduction to the basic concepts of the knowledge extraction process, to the main data mining techniques, and to the related algorithms. Particular emphasis is given to methodological aspects, presented through some paradigmatic classes of applications such as Market Basket Analysis, market segmentation, and fraud detection. Finally, the course introduces the privacy and ethical issues inherent in the use of inference techniques on data, of which the analyst must be aware. The course consists of the following parts:

the basic concepts of the knowledge extraction process: data understanding and preparation, forms of data, data measures and similarity;
the main data mining techniques (association rules, classification and clustering). Both the formal and the implementation aspects of these techniques are studied;
some case studies in marketing and customer management support, fraud detection, and epidemiological studies;
the last part of the course introduces the privacy and ethical issues inherent in the use of inference techniques on data, of which the analyst must be aware.

Reading about the "data scientist" job

Data, data everywhere. The Economist, Feb. 2010 download
Data scientist: The hot new gig in tech, CNN & Fortune, Sept. 2011 link
Welcome to the yotta world. The Economist, Sept. 2011 download
Data Scientist: The Sexiest Job of the 21st Century. Harvard Business Review, Sept 2012 link
Il futuro è già scritto in Big Data. Il Sole 24 Ore, Sept 2012 link
Special issue of Crossroads - The ACM Magazine for Students - on Big Data Analytics download
Peter Sondergaard, Gartner, Says Big Data Creates Big Jobs. Oct 22, 2012: YouTube video

Towards Effective Decision-Making Through Data Visualization: Six World-Class Enterprises Show The Way. White paper at FusionCharts.com. download

Hours and rooms

Classes: DM 1

Day	Time	Room
Thursday	14:00 - 16:00	Room B
Friday	14:00 - 16:00	Room A1

Classes: DM 2

Day	Time	Room
Monday	9:00 - 11:00	Room N1
Wednesday	9:00 - 11:00	Room L1

Office hours:

Prof. Pedreschi: Monday 14:30 - 17:30, Dipartimento di Informatica
Giannotti/Nanni: appointment by email, c/o ISTI-CNR

Learning Material

Textbook

Pang-Ning Tan, Michael Steinbach, Vipin Kumar. Introduction to Data Mining. Addison Wesley, ISBN 0-321-32136-7, 2006
- http://www-users.cs.umn.edu/~kumar/dmbook/index.php
- Chapters 4, 6 and 8 are also available at the publisher's Web site.

Slides of the classes

The slides used in the course will be inserted in the calendar after each class. Most of them are taken from the slides provided by the textbook's authors: Slides for "Introduction to Data Mining".

Past exam texts

In addition to the texts and (where available) the solutions of the exam sessions of recent years, the following exercises proposed in previous years are available.

Data mining software

KNIME The Konstanz Information Miner. Download page

WEKA Data Mining Software in JAVA. University of Waikato, New Zealand Download page

Class calendar (2013-2014)

First part of course, first semester (DMF - Data mining: foundations)

	Day	Room	Topic	Learning material	Instructor
1.	26.09.2013 14:00-16:00	B	Intro: data mining & knowledge discovery process	Textbook, Chapt. 1 dm_intro-2011.pdf	Pedreschi
2.	27.09.2013 14:00-16:00	A1	Intro: data mining & knowledge discovery process	Textbook, Chapt. 1 dm_intro-2011.pdf	Pedreschi
3.	03.10.2013 14:00-16:00	B	Data: types and basic measures	Textbook, Chapt. 2 chap2_data_new.pdf	Pedreschi
4.	10.10.2013 14:00-16:00	B	Data: types and basic measures	Textbook, Chapt. 2 chap2_data_new.pdf	Pedreschi
5.	11.10.2013 14:00-16:00	A1	Exploratory data analysis and data understanding.	Textbook, Chapt. 3 chap3_data_exploration.pdf	Pedreschi
6.	17.10.2013 14:00-16:00	B	Frequent Pattern Mining.	Textbook, Chapt. 6 2-3tdm-restructured_assoc_2013.pdf	Giannotti
7.	18.10.2013 14:00-16:00	A1	Frequent Pattern Mining.	Textbook, Chapt. 6 2-3tdm-restructured_assoc_2013.pdf	Giannotti
8.	24.10.2013 14:00-16:00	B	Association Rule Mining.		Giannotti
9.	25.10.2013 14:00-16:00	A1	Association Rule Mining and Knime	Textbook, Chapt. 6 Example on AR Knime	Monreale
10.	31.10.2013 14:00-16:00	B	Classification and predictive methods	Textbook, Chapt. 4 chap4_basic_classification.pdf	Pedreschi
11.	14.11.2013 14:00-16:00	B	Classification. Decision trees	Textbook, Chapt. 4 chap4_basic_classification.pdf	Pedreschi
12.	15.11.2013 14:00-16:00	A1	Classification. Decision trees	Textbook, Chapt. 4 chap4_basic_classification.pdf	Pedreschi
13.	21.11.2013 14:00-16:00	B	Classification. Rule-based and bayesian methods	Textbook, Chapt. 4 chap4_basic_classification.pdf	Pedreschi
14.	22.11.2013 14:00-16:00	A1	Classification. Validation and Weka Lab		Pedreschi
15.	28.11.2013 14:00-16:00	B	Classification. Validation and Weka Lab. Clustering: introduction.	Textbook, Chapt. 8 dm2014_clustering_intro.pdf	Nanni
16.	29.11.2013 14:00-16:00	A1	Clustering analysis. Centroid-based methods	Textbook, Chapt. 8 dm2014_clustering_kmeans.pdf	Nanni
17.	05.12.2013 14:00-16:00	B	Clustering analysis. Hierarchical methods	Textbook, Chapt. 8 dm2014_clustering_hierarchical.pdf	Nanni
18.	06.12.2013 14:00-16:00	A1	Clustering analysis. Density-based methods	Textbook, Chapt. 8 dm2014_clustering_dbscan.pdf	Nanni
19.	12.12.2013 14:00-16:00	B	Clustering analysis. Validation and Weka Lab	Textbook, Chapt. 8 dm2014_clustering_validation.pdf	Nanni
20.	13.12.2013 14:00-16:00	A1	Wrap-up. Presentation of Second Semester syllabus		Nanni

Second part of course, second semester (DMA - Data mining: advanced topics and case studies)

	Day	Room	Topic	Learning material	Instructor
1.	17.02.2014 9:00-11:00	N1	Introduction + Advanced Classification Methods / 1	Textbook, Chapt. 5 chap5_alternative_classification.pdf	Pedreschi
2.	19.02.2014 9:00-11:00	L1	Advanced Classification Methods / 2		Pedreschi
3.	24.02.2014 9:00-11:00	N1	Advanced Classification Methods / 3		Pedreschi
4.	26.02.2014 9:00-11:00	L1	Case study- CRM1- Customer Segmentation - CRISP	1.dm2-intro-airmiles-stulong-crisp.ppt.pdf	Giannotti
5.	3.03.2014 9:00-11:00	N1	Sequential patterns / 1	2.dm2_association_analysis_in_short_sequentialpatterns.ppt.pdf	Giannotti
6.	5.03.2014 9:00-11:00	L1	Case Study: CRM on retail selling / 1 - Churn analysis	2.dm3_churn-analysis.ppt.pdf	Giannotti
7.	10.03.2014 9:00-11:00	N1	Sequential patterns / 2	3.dm2_sequentialpatterns.ppt.pdf	Giannotti
	12.03.2014 9:00-11:00	L1	Suspended
8.	17.03.2014 9:00-11:00	N1	Graph mining	graph_mining_2014_fixed.pdf	Nanni
9.	19.03.2014 9:00-11:00	L1	Case Study: CRM on retail selling - Promotions/ 1	dm2_crm_promotional-sales_2014.pdf Paper on promotions	Giannotti
10.	24.03.2014 9:00-11:00	N1	Time series / 1	time_series_from_keogh_tutorial.pdf	Nanni
11.	26.03.2014 9:00-11:00	L1	Case Study: CRM on retail selling - Promotions / 2		Giannotti
12.	07.04.2014 9:00-11:00	N1	Time series / 2		Nanni
13.	09.04.2014 9:00-11:00	L1	Case Study: Geo-marketing	Geo-churn, crm2014_pennacchioli_bigdata13.pdf	Nanni
14.	14.04.2014 9:00-11:00	N1	Spatial/Spatiotemporal analysis / 1	7.dm2_mobilitydatamining_.pptx.pdf chap06_mobility_data_mining-1.pdf	Giannotti
15.	16.04.2014 9:00-11:00	L1	Spatial/Spatiotemporal analysis / 2 & Projects presentation	dm2_projects_2014.pdf	Giannotti & Nanni
16.	28.04.2014 9:00-11:00	N1	Case study: Mobility / 1	Mobility case studies 1	Giannotti
17.	30.04.2014 9:00-11:00	L1	Platform M_Atlas		Nanni
18.	05.05.2014 9:00-11:00	N1	Students' short seminars	Mining changes in customer behavior in retail marketing., An e-customer behavior model with online analytical mining for internet marketing planning.	Nanni
19.	07.05.2014 9:00-11:00	L1	Case study: Mobility / 2	Mobility case studies 2	Nanni
20.	12.05.2014 9:00-11:00	N1	Ethical Issues in Data Analytics	Privacy: Regulations and Privacy Aware Data Mining	Giannotti
21.	14.05.2014 9:00-11:00	L1	Ethical Issues / Fraud Detection Case Study		Giannotti
22.	19.05.2014 9:00-11:00	N1	Projects discussion		Giannotti/Nanni

Exam rules

Exam DM part I

The exam consists of a written test and an oral test:

The written test consists essentially of exercises on the methods and algorithms covered in class. The texts of past exam sessions are regularly put online and can be taken as a general reference. The written test can be replaced by the two mid-term tests: if both are passed, the average of their grades is the grade with which the student is admitted to the oral test, unless the written exam is taken again, in which case the most recent grade cancels the previous ones (for better or for worse). It is not possible to make up a single mid-term test during the regular exam sessions. For academic year 2013-2014, the mid-term tests are replaced by a series of exercises that will be proposed during the course.
The oral test covers the more theoretical aspects of the course (definitions, methods, algorithms, etc.) presented in class, or consists of a discussion of literature agreed upon with the instructors.

Exam DM part II

The exam is composed of three parts:

A written exam, with exercises and questions about classification (advanced topics), sequential patterns and graph mining. As an alternative, this written exam can be replaced by two exercises that are provided during the course and that must be completed and returned within the specified deadlines:
- Exercise 1 on “Advanced Classification Methods” – deadline: March 17, 2014.
- Exercise 2 on “Sequential patterns, Graph mining & Time series” – deadline: May 5th, 2014. This exercise can be replaced by a short seminar to be given during classes in specified time slots. Each seminar will be held jointly by 2 students, and will consist of a 15-minute presentation summarizing one of the papers listed here: Elenco_Articoli_2014. Interested students should contact the teachers at mirco [dot] nanni [at] isti [dot] cnr [dot] it and state their preferences about the paper to present and the date of presentation; these preferences will be followed as far as possible when assigning the seminars.

A project, assigned among those proposed during the classes – see: dm2_projects_2014.pdf. Interested students should (1) write to mirco [dot] nanni [at] isti [dot] cnr [dot] it to communicate the composition of their group (max. 3 people per group); (2) carry out the project using the data of the assigned project and following the requirements provided with them; and (3) send the teachers a short report that summarizes the analysis process and the (main) results. The deadline is 2 days before the oral exam.

An oral exam, which includes: (1) a discussion of the project through a group presentation (15 minutes per group); (2) a discussion of the topics presented during the classes.

Exercises 2013-2014

Exercises DM First Part

Data Understanding: CarDrivers dataset. Assigned on: 31.10.2013. To be completed by: 25.11.2013. Send papers (3 pages max of text, figures excluded) by email to datamining [dot] unipi [at] gmail [dot] com. Use “[DM] exercise 1” in the subject. Download the Car Drivers dataset (in CSV format, zipped). The dataset contains a number of variables describing the driving habits of a population of car drivers, in terms of number, length and duration of trips, probability of travelling on highways, in cities, at night, entropy of trips over roads, places, or in time, radius of gyration (average distance from mean position or most frequent location L1), and so on. Explore the dataset with the analytical tools of KNIME or Weka (or whatever you like) and write a concise “data understanding” report describing data semantics and assessing data quality, the distribution of the variables and the pairwise correlations. A description of the variables is here.
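For students who prefer scripting to KNIME or Weka, the checks requested above (missing values, distribution of each variable, pairwise correlations) can be sketched with the Python standard library alone. This is only an illustrative sketch: the column names and the values in SAMPLE are invented and are not taken from the Car Drivers dataset.

```python
import csv
import io
import math
from itertools import combinations


def read_columns(csv_text):
    """Return {column name: list of float-or-None}, one entry per row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    columns = {name: [] for name in reader.fieldnames}
    for row in reader:
        for name in columns:
            try:
                columns[name].append(float(row[name]))
            except (TypeError, ValueError):
                columns[name].append(None)  # missing or non-numeric cell
    return columns


def summarize(values):
    """Basic data-quality and distribution figures for one column."""
    present = sorted(v for v in values if v is not None)
    n = len(present)
    middle = n // 2
    median = present[middle] if n % 2 else (present[middle - 1] + present[middle]) / 2
    return {
        "missing": len(values) - n,
        "min": present[0],
        "median": median,
        "max": present[-1],
        "mean": sum(present) / n,
    }


def pearson(xs, ys):
    """Pearson correlation over the rows where both values are present."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    cov = sum((x - mx) * (y - my) for x, y in pairs)
    sx = math.sqrt(sum((x - mx) ** 2 for x, _ in pairs))
    sy = math.sqrt(sum((y - my) ** 2 for _, y in pairs))
    return cov / (sx * sy)


# Toy data: hypothetical column names and invented values.
SAMPLE = """trips,total_km,night_share
10,120,0.1
20,250,0.3
30,,0.2
40,470,0.4
"""

columns = read_columns(SAMPLE)
for name, values in columns.items():
    print(name, summarize(values))
for a, b in combinations(columns, 2):
    print(a, b, round(pearson(columns[a], columns[b]), 3))
```

The sketch assumes every column is numeric and not constant; categorical or constant variables would need separate handling (for instance frequency tables instead of correlations).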

Market Basket Analysis: SuperMarket dataset. Assigned on: 31.10.2013. To be completed by: 25.11.2013. Send papers (3 pages max of text, figures excluded) by email to datamining [dot] unipi [at] gmail [dot] com. Use “[DM] exercise 1” in the subject. Download the SuperMarket dataset (in CSV format, zipped). Given a database of customer transactions of a supermarket, find the sets of items that are frequently purchased together and analyse the most interesting association rules that can be derived from the frequent patterns. Provide a short document which illustrates the input dataset, the adopted frequent pattern algorithm and the association rule analysis, discussing your findings related to the most interesting rules. The database is composed of two files: (1) transactions.csv, containing the customer transactions, where each row contains a SCONTRINO_ID (transaction code) and a COD_MKT_ID (the code of the item purchased); (2) segments-description.csv, containing the full description of each item. For each COD_MKT_ID you can find information about the CATEGORY, SECTOR, AREA, SEGMENT and so on. Perform the analysis at the segment level.
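The pipeline described above (map each item code to its segment, group items by receipt, find frequent itemsets, derive rules) can be sketched as follows. It is a minimal Apriori-style sketch on in-memory toy data, not a replacement for the KNIME or Weka nodes: the item codes, segment names and thresholds are invented, and reading the two CSV files is left out because their exact layout is not described here.

```python
from collections import defaultdict
from itertools import combinations

# Toy stand-ins for transactions.csv and segments-description.csv (invented values).
transactions = [  # (SCONTRINO_ID, COD_MKT_ID)
    (1, 101), (1, 202), (1, 303),
    (2, 101), (2, 202),
    (3, 102), (3, 202),
    (4, 101), (4, 303),
]
segment_of = {101: "PASTA", 102: "PASTA", 202: "TOMATO SAUCE", 303: "MILK"}


def baskets_by_segment(rows, segment_of):
    """Group item codes by receipt and lift each code to its segment."""
    baskets = defaultdict(set)
    for receipt_id, item_code in rows:
        baskets[receipt_id].add(segment_of[item_code])
    return list(baskets.values())


def frequent_itemsets(baskets, min_support):
    """Level-wise (Apriori-style) search; returns {frozenset: support}."""
    n = len(baskets)
    items = {item for basket in baskets for item in basket}
    candidates = [frozenset([item]) for item in items]
    frequent = {}
    while candidates:
        level = {}
        for cand in candidates:
            support = sum(1 for b in baskets if cand <= b) / n
            if support >= min_support:
                level[cand] = support
        frequent.update(level)
        # Join frequent k-itemsets into (k+1)-candidates and keep only those
        # whose k-subsets are all frequent (anti-monotonicity of support).
        size = len(next(iter(level))) + 1 if level else 0
        joined = {a | b for a in level for b in level if len(a | b) == size}
        candidates = [c for c in joined
                      if all(frozenset(s) in level for s in combinations(c, size - 1))]
    return frequent


def rules(frequent, min_confidence):
    """Yield (antecedent, consequent, support, confidence, lift)."""
    for itemset, support in frequent.items():
        for k in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(itemset, k)):
                consequent = itemset - antecedent
                confidence = support / frequent[antecedent]
                if confidence >= min_confidence:
                    lift = confidence / frequent[consequent]
                    yield antecedent, consequent, support, confidence, lift


baskets = baskets_by_segment(transactions, segment_of)
frequent = frequent_itemsets(baskets, min_support=0.5)
for antecedent, consequent, support, confidence, lift in rules(frequent, 0.7):
    print(sorted(antecedent), "->", sorted(consequent), support, confidence, lift)
```

Working at the segment level shrinks the item vocabulary, so two different product codes in the same segment count as the same item within a receipt; this is why the baskets are built as sets.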

Classification: Census Dataset. Assigned on: 29.11.2013. To be completed by: 13.12.2013. Send papers (3 pages max of text, figures excluded) by email to datamining [dot] unipi [at] gmail [dot] com. Use “[DM] exercise 3” in the subject. Download the Adult Dataset here: http://archive.ics.uci.edu/ml/datasets/Census+Income, where you can also find the data description. Objective: find decision trees that predict whether each individual has an income higher or lower than 50,000 dollars. The paper has to illustrate the input dataset, some analyses for data understanding, the adopted classification methodology, and the validation and interpretation of the decision tree.
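To help interpret the trees produced by Weka or KNIME, the following sketch shows the core step of decision tree induction: choosing the threshold on a numeric attribute that maximizes information gain. It covers a single split only, not a full tree learner, and the records are invented; the attribute name and class labels are assumed to follow the data description and should be checked against it.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def best_threshold(values, labels):
    """Return (threshold, information gain) of the best split 'value <= threshold'."""
    base = entropy(labels)
    best = (None, 0.0)
    distinct = sorted(set(values))
    # Candidate thresholds: midpoints between consecutive distinct values.
    for low, high in zip(distinct, distinct[1:]):
        threshold = (low + high) / 2
        left = [lab for v, lab in zip(values, labels) if v <= threshold]
        right = [lab for v, lab in zip(values, labels) if v > threshold]
        weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
        gain = base - weighted
        if gain > best[1]:
            best = (threshold, gain)
    return best


# Invented toy records; names assumed from the data description.
education_num = [9, 10, 13, 14, 9, 16, 13, 10]
income = ["<=50K", "<=50K", ">50K", ">50K", "<=50K", ">50K", "<=50K", "<=50K"]

threshold, gain = best_threshold(education_num, income)
print(f"split at education_num <= {threshold}, gain = {gain:.3f} bits")
```

A tree learner repeats this choice recursively on each partition, over all attributes, until a stopping or pruning criterion is met; validation on held-out data is then needed to judge the resulting tree.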

Clustering analysis: WarLogs Dataset. Assigned on: 20.12.2013. To be completed by: 13.01.2014. Send papers (3 pages max of text, figures excluded) by email to datamining [dot] unipi [at] gmail [dot] com. Use “[DM] exercise 4” in the subject. Download the dataset here. The dataset is in CSV format: warlogs.csv.zip. A description of the variables is here. Problem: the exercise requires two distinct clustering analyses on the dataset: 1) one aimed at grouping events based on their impact on the population and on the forces involved (casualties, captured or wounded units, etc.); 2) the other aimed at grouping events based on their location, in order to discover geographic areas where events are denser. Optionally, the temporal dimension can be involved in the process (e.g. to split the dataset, or directly as an additional attribute in the clustering). Each cluster should be properly explored and characterized in comparison with the others.
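For the second analysis (dense geographic areas), a density-based method such as DBSCAN is one natural option among those covered in class. The sketch below is a minimal, quadratic-time DBSCAN on invented 2D points; it uses plain Euclidean distance, which is only a rough approximation for latitude/longitude pairs, and the values of eps and min_pts are arbitrary.

```python
import math

NOISE = -1


def region_query(points, i, eps):
    """Indices of all points within distance eps of points[i] (itself included)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Return one cluster id per point; NOISE marks points in no cluster."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbours = region_query(points, i, eps)
        if len(neighbours) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbours if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: reachable but not core
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbours = region_query(points, j, eps)
            if len(j_neighbours) >= min_pts:  # j is a core point: keep expanding
                queue.extend(j_neighbours)
    return labels


# Invented coordinates: two dense groups and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (5, 5)]
print(dbscan(points, eps=1.5, min_pts=3))
```

On real data the neighbourhood search would need an index to scale, and eps is usually chosen by inspecting the sorted distances to the k-th nearest neighbour.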

Exercises DM Second Part

Advanced classification analysis: WarLogs Dataset. Assigned on: 03.03.2014. To be completed by: 17.03.2014. Send papers (3 pages max of text, figures excluded) by email to datamining [dot] unipi [at] gmail [dot] com. Use “[DM] exercise 5” in the subject. Download the dataset here. The dataset is in CSV format: warlogs.csv.zip. A description of the variables is here. Problem: consider again the WarLogs dataset and define your own classification problem, choosing the target variable yourself. Perform the preprocessing needed for the chosen classification problem, and solve it adopting at least 3 different methods from the categories studied (trees, rules, Bayesian, SVM, ensemble, kNN). Discuss the relative performance of the obtained classifiers according to the various quality measures.
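When comparing the classifiers, it helps to compute several quality measures from the same predictions, since accuracy alone can be misleading on unbalanced classes. The sketch below computes accuracy, precision, recall and F1 for one positive class; the labels and the two prediction lists are invented, and the model names are only placeholders.

```python
def quality(y_true, y_pred, positive):
    """Accuracy, precision, recall and F1 with respect to one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Invented test-set labels and predictions of two placeholder models.
y_true = ["yes", "yes", "yes", "no", "no", "no", "no", "no"]
predictions = {
    "tree": ["yes", "yes", "no", "no", "no", "no", "no", "yes"],
    "majority": ["no"] * 8,  # always predicts the most frequent class
}

for name, y_pred in predictions.items():
    print(name, quality(y_true, y_pred, positive="yes"))
```

In this toy case the majority baseline still reaches an accuracy above 0.6 while its recall on the positive class is zero, which is the kind of difference the discussion of quality measures should bring out.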

Sequential pattern analysis: WarLogs Dataset. Assigned on: 07.04.2014. To be completed by: 05.05.2014. Send papers (3 pages max of text, figures excluded) by email to datamining [dot] unipi [at] gmail [dot] com. Use “[DM] exercise 6” in the subject. Download the dataset here in CSV format: warlogs.csv.zip. A description of the variables is here. Problem: build a dataset of sequences that describe, for each day and for each geographical area, the sequence of events that happened there. The geographical areas to adopt can be those indicated in the “region” attribute already in the dataset, or they can be obtained by partitioning the territory in some other way, for instance to obtain more balanced areas. The events to consider can be, for instance, represented by the “category” or “type” attributes in the dataset, or they can be computed from other information (kind of casualties, number of wounded or killed victims, etc.). Use this dataset to extract a set of frequent sequential patterns. Tools for sequential patterns: among the possible alternatives, we suggest adopting one of the following:
- Weka: use the GeneralizedSequentialPatterns associator. The input dataset should contain, for each line, a pair <sequence ID><Event ID>, and the lines should be temporally ordered (there is no explicit timestamp in the data). Here is an example: sequence_data.csv.zip.
- Spam: command-line tool, that can be downloaded here (binaries for Windows and Linux, including sample input file). Notice that the input should contain only numeric (integer) values, therefore some coding is needed. Also, input sequences longer than 64 transactions are not allowed, therefore they should be truncated.
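The data preparation described above can be sketched as follows: group events by day and area, order them in time, then emit either (sequence ID, event ID) pairs or integer-coded sequences truncated to 64 elements. The event tuples are invented and their field layout is an assumption; the sketch does not reproduce the exact input file syntax of either tool, which should be checked against the sample files provided with them.

```python
from collections import defaultdict

MAX_LEN = 64  # maximum sequence length stated above for the command-line tool

# Invented toy events: (day, time, area, event kind). Field layout is assumed.
events = [
    ("2009-01-01", "08:10", "north", "type_b"),
    ("2009-01-01", "06:30", "north", "type_a"),
    ("2009-01-01", "09:00", "south", "type_b"),
    ("2009-01-02", "12:00", "north", "type_b"),
]


def build_sequences(events):
    """Map (day, area) -> list of event kinds in temporal order."""
    sequences = defaultdict(list)
    for day, _time, area, kind in sorted(events):  # sorts by day, then time
        sequences[(day, area)].append(kind)
    return dict(sequences)


def to_pairs(sequences):
    """Rows of (sequence ID, event ID), temporally ordered within each sequence."""
    rows = []
    for seq_id, key in enumerate(sorted(sequences), start=1):
        rows.extend((seq_id, kind) for kind in sequences[key])
    return rows


def encode_as_integers(sequences, max_len=MAX_LEN):
    """Replace event names by integer codes and truncate long sequences."""
    codes = {}
    encoded = {}
    for key in sorted(sequences):
        encoded[key] = [codes.setdefault(kind, len(codes) + 1)
                        for kind in sequences[key][:max_len]]
    return encoded, codes


sequences = build_sequences(events)
for row in to_pairs(sequences):
    print(row)
encoded, codes = encode_as_integers(sequences)
print(encoded, codes)
```

Keeping the code table returned by encode_as_integers makes it possible to translate the mined integer patterns back into readable event names in the report.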

Exam sessions

Mid-term tests / Exercises

	Date	Time	Place	Notes	Grades
Exercise I and Exercise II

Regular exam sessions

Session	Date	Time	Room	Notes	Results
1.	Thursday 16 January 2014	9.30	TBD	A1
2.	Monday 10 February 2014	9.30	TBD	C
3.	Thursday 20 February 2014	14.00	TBD	Pedreschi's office
4.	Tuesday 25 February 2014	14.00	TBD	Pedreschi's office
5.	Monday 9 June 2014	9.00	N1	If needed, exams will continue on 10/6 and 11/6 in room L1	Data Mining I: Results of written exam, June 9th, 2014
6.	Monday 30 June 2014	9.00	N	If needed, exams will continue on 1/7 and 2/7 in rooms P and E	Data Mining II: Results of written exam, June 30th, 2014
7.	Monday 21 July 2014	9.00	L1	If needed, exams will continue on 10/6 and 11/6 in room L1
8.	Tuesday 9 September 2014	15.30	C1

Session	Date	Time	Room	Notes	Results
1.	Monday 19 January 2015	9.00	C
2.	Monday 16 February 2015	9.00	C

Extra sessions

Date	Time	Room	Notes	Results
7 November 2014	9:00-11:00	C1
