Data Mining, Academic Year 2013/14
Instructors - Docenti:
Teaching assistant - Assistente:
News
[17/10/2014] Extraordinary exam session for academic year 2013/2014: Friday 7 November 2014, 9:00-11:00, room C1
[01/09/2014] Starting time of next exam session (Sept. 9th, 2014) has been moved to 3:30 P.M.
[13/07/2014] Results of DM II (written exam) available (Link)
[26/06/2014] The rooms for the second session of exams (30/6-2/7) changed
[11/06/2014] Results of DM I (written exam) available (Link)
[22/05/2014] Evaluation of Homework 6 (DM2) is online
[24/04/2014] The deadline for the second exercise is postponed to May 5th, 2014.
[15/04/2014] The email address datamining [dot] unipi [at] gmail [dot] com must be used only for submitting exercises!
[15/04/2014] Tomorrow, Wednesday 16 April 2014, from 14:00 to 16:45: office hours for clarifications on the classification report (Office 375, Dipartimento di Informatica)
[07/04/2014] Evaluation of homework 5 (DM2) is online
[07/04/2014] The text for the second exercise, on sequential patterns, has been released. Deadline: 21/04/2014.
[11/03/2014] Detailed instructions for exams have been published. See: Instructions for exam AA 2013-14. Notice: the proposed seminars can replace only the second exercise, and not the first one.
[30/01/2014] The next exam session will be on Monday 10/02/2014 at 9.30 in Room C. On that date students can take either the written exam or the oral exam. On the same Monday we will also set further dates for the oral exam, so we invite all students to attend in order to choose their exam date.
[20/01/2014] Evaluation of the first written test is online
[07/01/2014] Evaluation of the second homework is online
[09/12/2013] Evaluation of the first homework is online
[11/10/2013] Remember to register as a user of this wiki and subscribe to receive a message when this wiki is updated!
Learning goals -- Obiettivi del corso
… a new kind of professional has emerged, the data scientist, who combines the skills of software programmer, statistician and storyteller/artist to extract the nuggets of gold hidden under mountains of data. Hal Varian, Google’s chief economist, predicts that the job of statistician will become the “sexiest” around. Data, he explains, are widely available; what is scarce is the ability to extract wisdom from them.
Data, data everywhere. The Economist, Special Report on Big Data, Feb. 2010.
The wide availability of data from relational databases, the web and other sources motivates the study of data analysis techniques that enable a better understanding and an easier use of results in decision-making processes. The course aims to provide an introduction to the basic concepts of the knowledge extraction process, to the main data mining techniques and to the related algorithms. Particular emphasis is placed on methodological aspects, presented through paradigmatic classes of applications such as Market Basket Analysis, market segmentation and fraud detection. Finally, the course introduces the privacy and ethical issues inherent in the use of inference techniques on data, of which the analyst must be aware. The course consists of the following parts:
the basic concepts of the knowledge extraction process: data understanding and preparation, forms of data, data measures and similarity;
the main data mining techniques (association rules, classification and clustering), studied in both their formal and implementation aspects;
case studies in marketing and customer management support, fraud detection and epidemiological studies;
a final part introducing the privacy and ethical issues inherent in the use of inference techniques on data, of which the analyst must be aware.
Reading about the "data scientist" job
Data, data everywhere. The Economist, Feb. 2010 (download)
Data scientist: The hot new gig in tech, CNN & Fortune, Sept. 2011 (link)
Welcome to the yotta world. The Economist, Sept. 2011 (download)
Data Scientist: The Sexiest Job of the 21st Century. Harvard Business Review, Sept. 2012 (link)
Il futuro è già scritto in Big Data. Il Sole 24 Ore, Sept. 2012 (link)
Special issue of Crossroads - The ACM Magazine for Students - on Big Data Analytics (download)
Peter Sondergaard, Gartner, Says Big Data Creates Big Jobs. Oct 22, 2012 (YouTube video)
Towards Effective Decision-Making Through Data Visualization: Six World-Class Enterprises Show The Way. White paper at FusionCharts.com (download)
Hours - Orario e Aule
Classes - Lezioni: DM 1
Day | Time | Room |
Thursday | 14:00 - 16:00 | Room B |
Friday | 14:00 - 16:00 | Room A1 |
Classes - Lezioni: DM 2
Day | Time | Room |
Monday | 9:00 - 11:00 | Room N1 |
Wednesday | 9:00 - 11:00 | Room L1 |
Office hours - Ricevimento:
Prof. Pedreschi: Monday 14:30 - 17:30, Dipartimento di Informatica
Giannotti/Nanni: appointment by email, c/o ISTI-CNR
Learning Material -- Materiale didattico
Textbook -- Libro di Testo
Pang-Ning Tan, Michael Steinbach, Vipin Kumar. Introduction to Data Mining. Addison Wesley, ISBN 0-321-32136-7, 2006
Chapters 4, 6 and 8 are also available at the publisher's Web site.
Slides of the classes -- Slides del corso
The slides used in class will be added to the calendar after each lecture. Most of them are taken from those provided by the textbook authors:
Slides for "Introduction to Data Mining".
Past exam texts
Data mining software
Class calendar - Calendario delle lezioni (2013-2014)
First part of course, first semester (DMF - Data mining: foundations)
| Day | Room | Topic | Learning material | Instructor |
1. | 26.09.2013 14:00-16:00 | B | Intro: data mining & knowledge discovery process | Textbook, Chapt. 1 dm_intro-2011.pdf | Pedreschi |
2. | 27.09.2013 14:00-16:00 | A1 | Intro: data mining & knowledge discovery process | Textbook, Chapt. 1 dm_intro-2011.pdf | Pedreschi |
3. | 03.10.2013 14:00-16:00 | B | Data: types and basic measures | Textbook, Chapt. 2 chap2_data_new.pdf | Pedreschi |
4. | 10.10.2013 14:00-16:00 | B | Data: types and basic measures | Textbook, Chapt. 2 chap2_data_new.pdf | Pedreschi |
5. | 11.10.2013 14:00-16:00 | A1 | Exploratory data analysis and data understanding. | Textbook, Chapt. 3 chap3_data_exploration.pdf | Pedreschi |
6. | 17.10.2013 14:00-16:00 | B | Frequent Pattern Mining. | Textbook, Chapt. 6 2-3tdm-restructured_assoc_2013.pdf | Giannotti |
7. | 18.10.2013 14:00-16:00 | A1 | Frequent Pattern Mining. | Textbook, Chapt. 6 2-3tdm-restructured_assoc_2013.pdf | Giannotti |
8. | 24.10.2013 14:00-16:00 | B | Association Rule Mining. | | Giannotti |
9. | 25.10.2013 14:00-16:00 | A1 | Association Rule Mining and Knime | Textbook, Chapt. 6 Example on AR Knime | Monreale |
10. | 31.10.2013 14:00-16:00 | B | Classification and predictive methods | Textbook, Chapt. 4 chap4_basic_classification.pdf | Pedreschi |
11. | 14.11.2013 14:00-16:00 | B | Classification. Decision trees | Textbook, Chapt. 4 chap4_basic_classification.pdf | Pedreschi |
12. | 15.11.2013 14:00-16:00 | A1 | Classification. Decision trees | Textbook, Chapt. 4 chap4_basic_classification.pdf | Pedreschi |
13. | 21.11.2013 14:00-16:00 | B | Classification. Rule-based and Bayesian methods | Textbook, Chapt. 4 chap4_basic_classification.pdf | Pedreschi |
14. | 22.11.2013 14:00-16:00 | A1 | Classification. Validation and Weka Lab | | Pedreschi |
15. | 28.11.2013 14:00-16:00 | B | Classification. Validation and Weka Lab. Clustering: introduction. | Textbook, Chapt. 8 dm2014_clustering_intro.pdf | Nanni |
16. | 29.11.2013 14:00-16:00 | A1 | Clustering analysis. Centroid-based methods | Textbook, Chapt. 8 dm2014_clustering_kmeans.pdf | Nanni |
17. | 05.12.2013 14:00-16:00 | B | Clustering analysis. Hierarchical methods | Textbook, Chapt. 8 dm2014_clustering_hierarchical.pdf | Nanni |
18. | 06.12.2013 14:00-16:00 | A1 | Clustering analysis. Density-based methods | Textbook, Chapt. 8 dm2014_clustering_dbscan.pdf | Nanni |
19. | 12.12.2013 14:00-16:00 | B | Clustering analysis. Validation and Weka Lab | Textbook, Chapt. 8 dm2014_clustering_validation.pdf | Nanni |
20. | 13.12.2013 14:00-16:00 | A1 | Wrap-up. Presentation of Second Semester syllabus | | Nanni |
Second part of course, second semester (DMA - Data mining: advanced topics and case studies)
Exam rules
DM exam, part I
The exam consists of a written test and an oral test:
The written test consists mainly of exercises on the methods and algorithms covered in class. The texts of past exam sessions are regularly put online and can be used as a general reference. The written test can be replaced by the two midterm tests: if both are passed, the average of their grades is the grade with which the student is admitted to the oral test, unless the student sits the written exam again, in which case the most recent grade replaces the earlier ones (for better or worse). It is not possible to retake a single midterm test during the regular exam sessions. For academic year 2013-2014, the midterm tests are replaced by a series of exercises that will be assigned during the course.
The oral test covers the more theoretical aspects of the course (definitions, methods, algorithms, etc.) treated in class, or consists of a discussion of readings agreed upon with the instructors.
DM exam, part II
The exam is composed of three parts:
A written exam, with exercises and questions about classification (advanced topics), sequential patterns and graph mining. Alternatively, the written exam can be replaced by two exercises that are assigned during the course and must be completed and returned within the specified deadlines.
A project, assigned among those proposed during the classes; see dm2_projects_2014.pdf. Interested students should (1) write to mirco [dot] nanni [at] isti [dot] cnr [dot] it to communicate the composition of their group (max. 3 people per group); (2) carry out the project on the data of the assigned project, following the requirements provided with it; and (3) send the teachers a short report that summarizes the analysis process and the main results, at least 2 days before the oral exam.
Exercises 2013-2014
Esercizi DM parte I -- Exercises DM First Part
Data Understanding: CarDrivers dataset. Assigned on: 31.10.2013. To be completed by: 25.11.2013. Send papers (3 pages max of text, figures excluded) by email to datamining [dot] unipi [at] gmail [dot] com. Use “[DM] exercise 1” in the subject. Download the Car Drivers dataset (in CSV format, zipped). The dataset contains a number of variables describing the driving habits of a population of car drivers, in terms of number, length and duration of trips, probability of travelling on highways, in cities or at night, entropy of trips over roads, places or time, radius of gyration (average distance from the mean position or from the most frequent location L1), and so on. Explore the dataset with the analytical tools of KNIME or Weka (or any tool you like) and write a concise “data understanding” report describing data semantics and assessing data quality, the distribution of the variables and the pairwise correlations. Descriptions of the variables are here.
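As a rough illustration of the "pairwise correlations" step, the sketch below computes Pearson correlation coefficients for every pair of numeric columns in plain Python. The column names and values are hypothetical placeholders, not taken from the Car Drivers dataset; KNIME and Weka provide equivalent functionality.

```python
import math
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")  # correlation is undefined for a constant column
    return cov / math.sqrt(var_x * var_y)


def pairwise_correlations(columns):
    """Map each unordered pair of column names to its Pearson correlation."""
    return {
        (a, b): pearson(columns[a], columns[b])
        for a, b in combinations(sorted(columns), 2)
    }


# Hypothetical columns, for illustration only.
SAMPLE = {
    "n_trips": [10.0, 20.0, 30.0, 40.0],
    "total_km": [100.0, 200.0, 300.0, 400.0],
    "night_ratio": [0.4, 0.3, 0.2, 0.1],
}

corr = pairwise_correlations(SAMPLE)
for pair, value in corr.items():
    print(pair, round(value, 3))
```

Pairs with coefficients close to +1 or -1 are candidates for redundancy and are worth discussing in the report.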
Market Basket Analysis: SuperMarket dataset. Assigned on: 31.10.2013. To be completed by: 25.11.2013. Send papers (3 pages max of text, figures excluded) by email to datamining [dot] unipi [at] gmail [dot] com. Use “[DM] exercise 1” in the subject. Download the SuperMarket dataset (in CSV format, zipped). Given a database of customer transactions of a supermarket, find the sets of items that are frequently purchased together and analyse the most interesting association rules that can be derived from the frequent patterns. Provide a short document that illustrates the input dataset, the adopted frequent pattern algorithm and the association rule analysis, discussing your findings on the most interesting rules. The database is composed of two files: (1) transactions.csv, containing the customer transactions, where each row contains a SCONTRINO_ID (transaction code) and a COD_MKT_ID (the code of the item purchased); (2) segments-description.csv, containing the full description of each item. For each COD_MKT_ID you can find information about the CATEGORY, SECTOR, AREA, SEGMENT and so on. Perform the analysis at the segment level.
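The workflow described above can be sketched in a few lines of Python: group rows by SCONTRINO_ID, map each COD_MKT_ID to its SEGMENT, count itemset supports and derive rules with their confidence. This is a brute-force enumeration limited to pairs, not the Apriori implementation found in KNIME or Weka, and the sample rows, item codes and segment names are invented for illustration.

```python
from collections import defaultdict
from itertools import combinations


def group_transactions(rows, item_to_segment):
    """Turn (SCONTRINO_ID, COD_MKT_ID) rows into one set of segments per receipt."""
    baskets = defaultdict(set)
    for receipt_id, item_id in rows:
        baskets[receipt_id].add(item_to_segment[item_id])
    return list(baskets.values())


def frequent_itemsets(baskets, min_support, max_size=2):
    """Return {itemset: support} for itemsets of up to max_size items."""
    n = len(baskets)
    result = {}
    for size in range(1, max_size + 1):
        counts = defaultdict(int)
        for basket in baskets:
            for itemset in combinations(sorted(basket), size):
                counts[itemset] += 1
        for itemset, count in counts.items():
            if count / n >= min_support:
                result[itemset] = count / n
    return result


def association_rules(freq, min_confidence):
    """Derive single-item rules lhs -> rhs from the frequent pairs."""
    rules = []
    for itemset, support in freq.items():
        if len(itemset) != 2:
            continue
        a, b = itemset
        for lhs, rhs in ((a, b), (b, a)):
            # Every subset of a frequent itemset is frequent, so the lookup is safe.
            confidence = support / freq[(lhs,)]
            if confidence >= min_confidence:
                rules.append((lhs, rhs, support, confidence))
    return rules


# Invented sample data: (SCONTRINO_ID, COD_MKT_ID) rows and an item-to-SEGMENT map.
ROWS = [(1, "a1"), (1, "b1"), (2, "a2"), (2, "b1"), (3, "a1"), (4, "c1"), (4, "b1")]
SEGMENTS = {"a1": "PASTA", "a2": "PASTA", "b1": "SAUCE", "c1": "WINE"}

baskets = group_transactions(ROWS, SEGMENTS)
freq = frequent_itemsets(baskets, min_support=0.5)
rules = association_rules(freq, min_confidence=0.6)
for lhs, rhs, support, confidence in rules:
    print(f"{lhs} -> {rhs}  support={support:.2f}  confidence={confidence:.2f}")
```

Mapping items to segments before counting is what "the segment level" amounts to here: two different pasta products in the same receipt count once.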
Classification: Census Dataset. Assigned on: 29.11.2013. To be completed by: 13.12.2013. Send papers (3 pages max of text, figures excluded) by email to datamining [dot] unipi [at] gmail [dot] com. Use “[DM] exercise 3” in the subject. Download the Adult Dataset here: http://archive.ics.uci.edu/ml/datasets/Census+Income, where you can also find the data description. Objective: find decision trees that predict whether each individual's income is higher or lower than 50,000 dollars. The paper has to illustrate the input dataset, some data understanding analyses, the adopted classification methodology, and the validation and interpretation of the decision tree.
Clustering analysis: WarLogs Dataset. Assigned on: 20.12.2013. To be completed by: 13.01.2014. Send papers (3 pages max of text, figures excluded) by email to datamining [dot] unipi [at] gmail [dot] com. Use “[DM] exercise 4” in the subject. Download the dataset (in CSV format): warlogs.csv.zip. Descriptions of the variables are here.
Problem: the exercise requires two distinct clustering analyses on the dataset: 1) one aimed at grouping events based on their impact on the population and on the forces involved (casualties, captured or wounded units, etc.); 2) the other aimed at grouping events based on their location, in order to discover geographic areas where events are denser. Optionally, the temporal dimension can be involved in the process (e.g. to split the dataset, or directly as an additional attribute in the clustering). Each cluster should be properly explored and characterized in comparison with the others.
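For the location-based analysis, a minimal k-means sketch in plain Python is shown below. It is only one of the methods covered in the course (a density-based method such as DBSCAN may suit the "denser areas" goal better), the coordinates are hypothetical, and plain Euclidean distance on latitude/longitude is an approximation that ignores the curvature of the Earth.

```python
import math


def nearest(point, centroids):
    """Index of the centroid closest to point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))


def kmeans(points, initial_centroids, iterations=20):
    """Plain k-means starting from the given initial centroids."""
    centroids = [tuple(c) for c in initial_centroids]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        new_centroids = []
        for old, members in zip(centroids, clusters):
            if members:
                # New centroid: coordinate-wise mean of the cluster members.
                new_centroids.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                new_centroids.append(old)  # keep the centroid of an empty cluster
        if new_centroids == centroids:
            break  # assignments can no longer change
        centroids = new_centroids
    labels = [nearest(p, centroids) for p in points]
    return centroids, labels


# Hypothetical (latitude, longitude) pairs, for illustration only.
POINTS = [(31.6, 65.7), (31.7, 65.8), (31.5, 65.6), (34.5, 69.2), (34.6, 69.1)]

centroids, labels = kmeans(POINTS, initial_centroids=[POINTS[0], POINTS[3]])
print(labels)
print(centroids)
```

Fixing the initial centroids makes the run reproducible; with random initialization several restarts would normally be compared.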
Esercizi DM parte II - DM exercises Part 2
Advanced classification analysis: WarLogs Dataset. Assigned on: 03.03.2014. To be completed by: 17.03.2014. Send papers (3 pages max of text, figures excluded) by email to datamining [dot] unipi [at] gmail [dot] com. Use “[DM] exercise 5” in the subject. Download the dataset (in CSV format): warlogs.csv.zip. Descriptions of the variables are here.
Problem: Consider again the WarLogs dataset and define your own classification problem, choosing the target variable yourself. Perform the preprocessing needed for the chosen classification problem, and solve it with at least 3 different methods from the categories studied (trees, rules, Bayesian, SVM, ensemble, kNN). Discuss the relative performance of the resulting classifiers according to the various quality measures.
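To compare classifiers on "the various quality measures", the usual starting point is the confusion matrix. The sketch below derives accuracy, precision, recall and F1 for one class treated as positive; the class labels and predictions are invented, and Weka and KNIME report the same figures directly.

```python
def binary_metrics(y_true, y_pred, positive):
    """Accuracy, precision, recall and F1, treating `positive` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Invented labels: true classes and the predictions of a hypothetical classifier.
Y_TRUE = ["ATTACK", "ATTACK", "OTHER", "OTHER", "ATTACK", "OTHER"]
Y_PRED = ["ATTACK", "OTHER", "OTHER", "OTHER", "ATTACK", "ATTACK"]

metrics = binary_metrics(Y_TRUE, Y_PRED, positive="ATTACK")
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Running the same function on the predictions of each of the three chosen methods, on the same test set, gives directly comparable figures.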
Sequential pattern analysis: WarLogs Dataset. Assigned on: 07.04.2014. To be completed by: 05.05.2014. Send papers (3 pages max of text, figures excluded) by email to datamining [dot] unipi [at] gmail [dot] com. Use “[DM] exercise 6” in the subject. Download the dataset (in CSV format): warlogs.csv.zip. Descriptions of the variables are here.
Problem: Build a dataset of sequences that describe, for each day and for each geographical area, the sequence of events that happened there. The geographical areas to adopt can be those indicated in the “region” attribute already in the dataset, or they can be obtained by partitioning the territory in some other way, for instance to obtain more balanced areas. The events to consider can be, for instance, represented by the “category” or “type” attributes in the dataset, or they can be computed from other information (kind of casualties, number of wounded or killed victims, etc.). Use this dataset to extract a set of frequent sequential patterns.
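The construction of the sequence dataset can be sketched as follows: group events by (day, region) and order each group by time. The "region" and "category" attributes are mentioned above; the "timestamp" field name, its format and all sample values are assumptions made for illustration and may not match the actual columns of warlogs.csv.

```python
from collections import defaultdict


def build_sequences(events):
    """Map each (day, region) pair to its time-ordered list of event categories."""
    grouped = defaultdict(list)
    for event in events:
        day = event["timestamp"][:10]  # assumes "YYYY-MM-DD HH:MM" strings
        grouped[(day, event["region"])].append(
            (event["timestamp"], event["category"])
        )
    return {
        key: [category for _, category in sorted(items)]
        for key, items in sorted(grouped.items())
    }


# Invented events; field names other than "region" and "category" are hypothetical.
EVENTS = [
    {"timestamp": "2009-01-01 14:30", "region": "RC SOUTH", "category": "Direct Fire"},
    {"timestamp": "2009-01-01 08:00", "region": "RC SOUTH", "category": "IED Explosion"},
    {"timestamp": "2009-01-01 09:15", "region": "RC EAST", "category": "Direct Fire"},
    {"timestamp": "2009-01-02 10:00", "region": "RC SOUTH", "category": "Direct Fire"},
]

sequences = build_sequences(EVENTS)
for key, sequence in sequences.items():
    print(key, sequence)
```

Swapping "region" for a custom partition of the territory only requires changing the grouping key.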
Tools for sequential patterns. Among the possible alternatives, we suggest adopting one of the following:
Weka: use the GeneralizedSequentialPatterns associator. The input dataset should contain, on each line, a pair <sequence ID><Event ID>, and the lines should be temporally ordered (there is no explicit timestamp in the data). Here is an example: sequence_data.csv.zip.
Spam: a command-line tool that can be downloaded here (binaries for Windows and Linux, including a sample input file). Notice that the input should contain only numeric (integer) values, therefore some coding is needed. Also, input sequences longer than 64 transactions are not allowed, therefore they should be truncated.
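The preprocessing both tools need can be sketched as below: event labels are mapped to integer codes, sequences are truncated to 64 elements, and the result is flattened into temporally ordered (sequence ID, event ID) pairs. This sketch treats each event as one transaction, which is an assumption; the exact input file layout expected by each tool is not reproduced here and should be taken from the sample files mentioned above.

```python
MAX_LENGTH = 64  # limit on sequence length stated for the Spam tool


def encode_sequences(sequences, max_length=MAX_LENGTH):
    """Replace event labels with integer codes and truncate long sequences."""
    codes = {}
    encoded = []
    for sequence in sequences:
        row = []
        for event in sequence[:max_length]:
            if event not in codes:
                codes[event] = len(codes) + 1  # codes start at 1
            row.append(codes[event])
        encoded.append(row)
    return encoded, codes


def to_pairs(encoded):
    """Flatten into (sequence ID, event ID) pairs, keeping temporal order."""
    return [
        (sequence_id, event_id)
        for sequence_id, sequence in enumerate(encoded, start=1)
        for event_id in sequence
    ]


# Invented sequences; the second one is deliberately longer than the limit.
RAW = [["Direct Fire", "IED Explosion", "Direct Fire"], ["IED Explosion"] * 70]

encoded, codes = encode_sequences(RAW)
pairs = to_pairs(encoded)
print(codes)
print(pairs[:3])
```

Keeping the `codes` dictionary makes it possible to translate the mined patterns back into readable event names.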
Exam sessions
Midterm tests / Exercises
| Date | Time | Place | Notes | Grades |
Exercise I and Exercise II | | | | | |
Appelli regolari / Exam sessions
Session | Date | Time | Room | Notes | Results |
1. | Thursday 16 January 2014 | 9.30 | TBD | A1 | |
2. | Monday 10 February 2014 | 9.30 | TBD | C | |
3. | Thursday 20 February 2014 | 14.00 | TBD | Pedreschi's office | |
4. | Tuesday 25 February 2014 | 14.00 | TBD | Pedreschi's office | |
5. | Monday 9 June 2014 | 9.00 | N1 | If needed, exams will continue on 10/6 and 11/6 in room L1 | Data Mining I: Results of written exam, June 9th, 2014 |
6. | Monday 30 June 2014 | 9.00 | N | If needed, exams will continue on 1/7 and 2/7 in rooms P and E | Data Mining II: Results of written exam, June 30th, 2014 |
7. | Monday 21 July 2014 | 9.00 | L1 | If needed, exams will continue on 10/6 and 11/6 in room L1 | |
8. | Tuesday 9 September 2014 | 15.30 | C1 | | |
Session | Date | Time | Room | Notes | Results |
1. | Monday 19 January 2015 | 9.00 | C | | |
2. | Monday 16 February 2015 | 9.00 | C | | |
Date | Time | Room | Notes | Results |
7 November 2014 | 9:00-11:00 | C1 | | |
Previous years' editions