Data Mining A.Y. 2015/16

DM 1: Foundations of Data Mining

Instructors:

Teaching assistant:

DM 2: Advanced topics on Data Mining and case studies

Instructors:

News

  • [21/06/2016] The results of the written exam of 20 June 2016 for DM1 and DM2 are out: Results
  • [03/06/2016] The results of the written exam for DM1 and DM2 are out: Results
  • [30/05/2016] The DM1 project is out
  • [10/05/2016] The topics for the final projects are out: Projects DM II, 2015-16
  • [01/03/2016] The dates of the mid-term test and the regular summer exam sessions are out!
  • [09/02/2016] The results of the exam of 8 Feb 2016 are available online.
  • [08/02/2016] Oral Exam Sessions 11/02/2016 at 11.00, 22/02/2016 at 11.00 in Pedreschi's office. Send an email to pedre@di.unipi.it, monreale@di.unipi.it and guidotti@di.unipi.it to register for the oral exam if you have not already registered during the written exam of 08/02/2016.
  • [05/02/2016] The time schedule for the February oral exam will be decided during the written exam of 08/02/2016 in room A1 from 9.00 to 12.00. You can also sign up for the oral exam during the written exam. Note that this information is already available on this page, in the Exam section.
  • [05/02/2016] San Francisco Crime: list of projects received. If you submitted the project but your name does not appear in the list, please send your project again (as a single PDF file) to guidotti@di.unipi.it. Projects received: Buonaccorsi_Carta_Galassi
  • [01/02/2016] During the February exam session, students may take the written exam to improve their mark on one or both of the two DM1 mid-term tests. All students who attended the DM1 course in the 2015-2016 academic year must register for the written exam by emailing anna.monreale@unipi.it, pedre@di.unipi.it and guidotti@di.unipi.it by 4th February. They do not need to register online, since online registration would require them to fill in the course evaluation for DM2 without having attended it.
  • [22/01/2016] The oral exam session of 25/01/2016 will start at 14.00 instead of 15.00.
  • [19/01/2016] The results of the exam of 18 Jan 2016 are available online.
  • [18/01/2016] Oral Exam Sessions 20/01/2016 at 11.00, 25/01/2016 at 15.00 in Pedreschi's office. Send an email to pedre@di.unipi.it and guidotti@di.unipi.it to register for the oral exam if you have not already registered during the written exam of 18/01/2016.
  • [12/01/2016] Titanic Disaster Classification Top 5.

Rank Group Score
1 Rizzi-Romano-Scigliuzzo 0.8134
2 Criscolo-Quintini-Trafficante 0.80383
3 Bazzali-Borghi-Giannella 0.79904
3 Deidda-Policardo-Salamida 0.79904
3 DelleMacchie-Iavarone-Rambelli 0.79904
3 Kocan-Erdem 0.79904
3 Stili-Strazzulla-Gaggioli 0.79904
4 Calamia-Ortolani-Tardelli 0.79426
5 Abedini-Baltakiene 0.78947
5 Loconte-Spontella-Di Modugno 0.7894

  • [08/01/2016] During the January and February exam sessions, students may take the written exam to improve their mark on one or both of the two DM1 mid-term tests. All students who attended the DM1 course in the 2015-2016 academic year must register for the written exam by emailing anna.monreale@unipi.it, pedre@di.unipi.it and guidotti@di.unipi.it by 14th January or 4th February. They do not need to register online, since online registration would require them to fill in the course evaluation for DM2 without having attended it.
  • [07/01/2016] Titanic Disaster: list of projects received. If you submitted the project but your name does not appear in the list, please send your project again (as a single PDF file) to guidotti@di.unipi.it.

Abedini_Baltakiene, Alzetta_Miaschi_Semplici, Bambini_Catania_Incorvaia, Bazzali_Borghi_Giannella, Boncoraglio_Delicto_Veshi, Calamia_Ortolani_Tardelli, Criscolo_Quintini_Trafficante, Deidda_Policardo_Salamida, DelleMacchie_Iavarone_Rambelli, Donati, Dossena_Grossi_LaPerna, Fuccio_Furlan_LaPusata, Gentile_Miliani_Rossi, Giacalone_Montisci_Salerno, Kocan_Erdem, LaCroce, Loconte_Spontella_DiModugno, Rizzi_Romano_Scigliuzzo, Russo, Stili_Strazzulla_Gaggioli, Xu

  • [07/01/2016] A new project is now available! Details are in the Exams section.
  • [22/12/2015] The results of the first mid-term test are online. If you cannot find your name in the file, please send me an email. During the January and February exam sessions it is possible to take only one of the two parts of the written exam.
  • [04/12/2015] The lesson planned for 7th Dec 2015 is canceled.
  • [23/11/2015] Every student who wants to take the second mid-term test MUST register for the exam at https://esami.unipi.it/
  • [19/11/2015] The results of the first mid-term test are online. If you cannot find your name in the file, please send me an email.
  • [15/09/2015] The first lesson of Data Mining I will take place on Friday, Sept. 25th, in room A1.

Learning goals

… a new kind of professional has emerged, the data scientist, who combines the skills of software programmer, statistician and storyteller/artist to extract the nuggets of gold hidden under mountains of data. Hal Varian, Google’s chief economist, predicts that the job of statistician will become the “sexiest” around. Data, he explains, are widely available; what is scarce is the ability to extract wisdom from them.

Data, data everywhere. The Economist, Special Report on Big Data, Feb. 2010.

The wide availability of data from relational databases, the web and other sources motivates the study of data analysis techniques that allow a better understanding and an easier use of results in decision-making processes. The goal of the course is to provide an introduction to the basic concepts of the knowledge discovery process, to the main data mining techniques and to the related algorithms. Particular emphasis is placed on methodological aspects, presented through some paradigmatic classes of applications such as market basket analysis, market segmentation and fraud detection. Finally, the course introduces the privacy and ethical aspects of using inference techniques on data, which the analyst must be aware of. The course consists of the following parts:

  1. the basic concepts of the knowledge discovery process: data understanding and preparation, forms of data, measures and similarity of data;
  2. the main data mining techniques (association rules, classification and clustering), studied in both their formal and implementation aspects;
  3. some case studies in marketing and customer management support, fraud detection and epidemiological studies;
  4. an introduction to the privacy and ethical aspects of using inference techniques on data, which the analyst must be aware of.

Reading about the "data scientist" job

  • Data, data everywhere. The Economist, Feb. 2010 download
  • Data scientist: The hot new gig in tech, CNN & Fortune, Sept. 2011 link
  • Welcome to the yotta world. The Economist, Sept. 2011 download
  • Data Scientist: The Sexiest Job of the 21st Century. Harvard Business Review, Sept 2012 link
  • Il futuro è già scritto in Big Data. Il Sole 24 Ore, Sept 2012 link
  • Special issue of Crossroads - The ACM Magazine for Students - on Big Data Analytics download
  • Peter Sondergaard, Gartner, Says Big Data Creates Big Jobs. Oct 22, 2012: YouTube video
  • Towards Effective Decision-Making Through Data Visualization: Six World-Class Enterprises Show The Way. White paper at FusionCharts.com. download

Hours and rooms

DM 1

Classes

Day Hour Room
Monday 16:00 - 18:00 Room C
Friday 14:00 - 16:00 Room A1

Office hours:

  • Prof. Pedreschi/Monreale: Monday 14:00 - 16:00, Dipartimento di Informatica

DM 2

Classes

Day of week Hour Room
Monday 9:00 - 11:00 Room N1
Thursday 9:00 - 11:00 Room A1

Office hours:

  • Nanni / Monreale: appointment by email, c/o ISTI-CNR

Learning material

Textbooks

  • Pang-Ning Tan, Michael Steinbach, Vipin Kumar. Introduction to Data Mining. Addison Wesley, ISBN 0-321-32136-7, 2006
  • Berthold, M.R., Borgelt, C., Höppner, F., Klawonn, F. Guide to Intelligent Data Analysis. Springer Verlag, 1st Edition, 2010. ISBN 978-1-84882-259-7

Slides of the classes

  • The slides used in the course will be added to the calendar after each class. Most of them are taken from the slides provided by the textbook's authors: Slides for "Introduction to Data Mining".

Past exam texts

Data mining software

  • KNIME The Konstanz Information Miner. Download page
  • R: a language and environment for statistical computing
  • Scikit-learn: Python library with tools for data mining and data analysis
  • WEKA Data Mining Software in Java. University of Waikato, New Zealand Download page

Class calendar (2015-2016)

First part of course, first semester (DMF - Data mining: foundations)

Day Room Topic Learning material Instructor
1. 21.09.2015 16:00-18:00 C Canceled -
2. 25.09.2015 14:00-16:00 A1 Overview 1.dm-overview.pdf Pedreschi/Monreale
3. 28.09.2015 16:00-18:00 C Introduction 2.dm_ml_introduction.pdf Pedreschi
4. 02.10.2015 14:00-16:00 A1 Introduction 2.dm_ml_introduction.pdf Monreale
5. 05.10.2015 16:00-18:00 C Data Understanding 3.dataunderstanding.pdf 3.data-understanting-appendix.pdf Monreale
6. 09.10.2015 14:00-16:00 A1 Data Preparation 4.data_preparation.pdf Monreale
7. 12.10.2015 16:00-18:00 C Clustering analysis. Centroid-based methods. dm2014_clustering_intro.pdf dm2014_clustering_kmeans.pdf Monreale
8. 16.10.2015 14:00-16:00 A1 Clustering analysis. Hierarchical methods. Tutorial Knime dm2014_clustering_hierarchical.pdf knime_slides_mains.pdf Monreale
9. 19.10.2015 16:00-18:00 C Clustering Analysis. Density Based Clustering and Validation dm2014_clustering_dbscan.pdf dm2014_clustering_validation.pdf Monreale
10. 21.10.2015 16:00-18:00 C Exercises on Data Understanding. exercises-dm1.pdf Monreale
11. 23.10.2015 14:00-16:00 A1 Exercises on Clustering. HC with Group Average exercises-clustering.pdf Monreale/Guidotti
12. 26.10.2015 16:00-18:00 C Knime Exercises datamanipulation.zip knime_clustering_iris.zip Pedreschi/Guidotti
13. 30.10.2015 14:00-16:00 A1 R and Python Exercises manipulation-clystering-r.zip manipulation-clustering-py.zip Pedreschi/Guidotti
02.11.2015-06.11.2015 First Mid-term test: 6th November 14:00-16:00 Room A
14. 09.11.2015 16:00-18:00 C Classification chap4_basic_classification.pdf Monreale
15. 13.11.2015 14:00-16:00 A1 Classification Monreale
16. 16.11.2015 16:00-18:00 C Classification Monreale
17. 20.11.2015 14:00-16:00 A1 Classification Monreale
18. 23.11.2015 16:00-18:00 C Exercises on Classification. Knime Exercises knime_classification_iris.zip knime_classification_adult.zip knime_classification_over_adult.zip Guidotti/Monreale
19. 27.11.2015 14:00-16:00 A1 Frequent Patterns & Association Rules 4-5tdm-restructured_assoc.pdf Monreale
20. 30.11.2015 16:00-18:00 C Canceled
21. 04.12.2015 14:00-16:00 A1 Canceled
22. 07.12.2015 16:00-18:00 C Canceled Pedreschi
23. 11.12.2015 14:00-16:00 A1 Exercises on Patterns. Knime Exercises knime_pattern.zip Guidotti / Pedreschi
24. 14.12.2015 16:00-18:00 C python-classification-pattern.zip r-classification-patterns.zip Guidotti / Pedreschi
16.12.2015-18.12.2015 Second Mid-term test

Second part of course, second semester (DMA - Data mining: advanced topics and case studies)

Day Room Topic Learning material Instructor
1. 22.02.2016 09:00-11:00 N1 Introduction + Sequential Patterns / 1 sequential_patterns.pdf, textbook Ch. 7.4 Nanni & Pedreschi
2. 25.02.2016 09:00-11:00 A1 Sequential Patterns / 2
3. 29.02.2016 09:00-11:00 A1 Sequential Patterns / Exercises Link to SPMF, a tool for seq. patterns and sample dataset. Exercises: Text 1 and Text 2
4. 03.03.2016 09:00-11:00 A1 Advanced Classification Methods / 1 alternative_classification_1_dino_03.03.2016.pdf Pedreschi
5. 07.03.2016 09:00-11:00 A1 Advanced Classification Methods / 2 alternative_classification_2_dino_07.03.2016.pdf Pedreschi
6. 10.03.2016 09:00-11:00 A1 Advanced Classification Methods / Tools and Exercises exercises_classification.pdf sample_knime_workflows.zip
7. 14.03.2016 09:00-11:00 A1 Advanced Classification Methods / Exercises Exercises (also) on classification from 2014-15
8. 17.03.2016 09:00-11:00 A1 Time Series / 1 time_series_from_keogh_tutorial.pdf
9. 21.03.2016 09:00-11:00 A1 Time Series / 2
10. 24.03.2016 09:00-11:00 A1 Time Series / Exercises Some exercises from past exams: (Sequences and time series) (Classification)
25-29.03.2016 EASTER HOLIDAYS
04.04.2016 09:00-13:00 TBD Midterm tests
11. 07.04.2016 09:00-11:00 A1 Case study: CRM - Customer Segmentation + CRISP-DM Customer segmentation CRISP-DM
12. 11.04.2016 09:00-11:00 A1 Case study: CRM - Churn Analysis Intro_CRM Churn External_Churn
13. 14.04.2016 09:00-11:00 A1 Case study: CRM - Promotions and Sophistication Promotions Sophistication
14. 18.04.2016 09:00-11:00 A1 Mobility Data Analysis / 1 Preprocessing Patterns and models
15. 21.04.2016 09:00-11:00 A1 Mobility Data Analysis / 2 Individual/Collective models GSM_DM
16. 28.04.2016 09:00-11:00 A1 Case study: Mobility Data Analysis Case studies
17. 02.05.2016 09:00-11:00 A1 Complements: Ethical Issues / 1 slides Monreale
18. 05.05.2016 09:00-11:00 A1 Complements: Ethical Issues / 2 Monreale
19. 09.05.2016 09:00-11:00 A1 Projects presentation Projects
20. 12.05.2016 09:00-11:00 A1 Complements: Outlier Detection Slides from SDM2010 tutorial
21. 16.05.2016 09:00-11:00 A1 Projects discussion

Exams

Exam DM part I (DMF)

The exam is composed of three parts:

  • A written exam, with exercises and questions about methods and algorithms presented during the classes. It can be replaced by the first and second mid-term tests of November and December.
  • An oral exam, that includes: (1) discussing the project report with a group presentation; (2) discussing topics presented during the classes, including the theory of the parts already covered by the written exam.
  • A project, consisting of exercises that require the use of data mining tools for data analysis. Exercises include: data understanding, market basket analysis, clustering analysis and classification. The project has to be carried out by groups of at most 3 people, using Knime, R, Python or a combination of them. The results of the different tasks must be reported in a single paper. The total length of this paper must be at most 10 pages of text, figures excluded. The project must be delivered at least 2 days before the oral exam. The paper must be emailed to datamining [dot] unipi [at] gmail [dot] com. Please use “[DM 2015-2016] Project” in the subject. Tasks of the project:
    1. Data Understanding: San Francisco Crime Data Set. Assigned on: 30.05.2016. Download the data set (train.csv) here: https://www.kaggle.com/c/sf-crime/data (in CSV format), where you can also find the data description. From 1934 to 1963, San Francisco was infamous for housing some of the world's most notorious criminals on the inescapable island of Alcatraz. Today, the city is known more for its tech scene than its criminal past. But, with rising wealth inequality, housing shortages, and a proliferation of expensive digital toys riding BART to work, there is no scarcity of crime in the city by the bay. This dataset contains incidents derived from the SFPD Crime Incident Reporting system. The data ranges from 1/1/2003 to 5/13/2015. The training set and test set rotate every week, meaning weeks 1, 3, 5, 7… belong to the test set and weeks 2, 4, 6, 8… belong to the training set. Explore the dataset with the analytical tools studied and write a concise “data understanding” report describing data semantics, assessing data quality, the distribution of the variables and the pairwise correlations.
    2. Clustering analysis: San Francisco Crime Data Set. Assigned on: 30.05.2016. Perform the clustering analysis on the above data (train.csv), with any of the studied methods, using an appropriate subset of variables. Determine an adequate number of clusters, if any, and try to explain the properties of the discovered clusters (or else, argue why this dataset does not exhibit a clustering structure).
    3. Association Rule Mining: San Francisco Crime Data Set. Assigned on: 30.05.2016. Given the above data (train.csv), find the frequent itemsets, then analyse and discuss the most interesting association rules that can be derived from them.
    4. Classification: San Francisco Crime Data Set. Assigned on: 30.05.2016. Given time, location and additional information, build decision trees to predict the category of crime that occurred. Use the dataset “train.csv” to train the model, then use the file “test.csv” as the test set, checking your accuracy on the Kaggle website. The paper has to illustrate the adopted classification methodology and the decision tree validation and interpretation. To obtain a score for your model on “test.csv”, fit your model as usual on “train.csv”. When you think your model is well trained, run the prediction on “test.csv”. You have to produce a .csv file in the same format as the file “sampleSubmission.csv” on the Kaggle website. Then upload this file to Kaggle and you will receive a score indicating the accuracy of your model. Report your score in the final paper.
    5. Hint 1! For those using Knime: since the train.csv file may be too big to be managed with Knime, you can work on a sample of (test.csv), but select a permanent sample (e.g. a subset of the temporal window or a particular geographical area) rather than a random one, and specify your selection in the final report.
    6. Hint 2! For those using Knime: exploit the OSM Map View node to visualize the San Francisco Map with the crimes.
    7. Hint 3! Classification task: besides a single multi-class classifier, try building a sequence of binary classifiers (e.g. the first tree decides whether the crime is PROSTITUTION or not; if it is not, the second one decides whether it is KIDNAPPING or not; if it is not, the third one decides whether it is a ROBBERY or not…).
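
The chain of binary classifiers suggested in Hint 3 could be organised roughly as in the sketch below. This is only an illustration under stated assumptions: the one-feature majority learner stands in for a real decision tree, and the `zone` feature, the function names and the ten toy records are all made up here, not taken from train.csv.

```python
from collections import Counter


def train_stage(rows, labels, target, feature):
    """Toy binary learner for the question 'is the crime `target` or not?'.

    Returns the feature values for which `target` is the strict majority
    label. In the project, each stage would be a decision tree instead.
    """
    total, positive = Counter(), Counter()
    for row, label in zip(rows, labels):
        total[row[feature]] += 1
        if label == target:
            positive[row[feature]] += 1
    return {value for value in total if 2 * positive[value] > total[value]}


def train_cascade(rows, labels, order, feature):
    """Train one binary stage per category in `order`, in that order."""
    remaining = list(zip(rows, labels))
    stages = []
    for target in order:
        stage_rows = [row for row, _ in remaining]
        stage_labels = [label for _, label in remaining]
        accepted = train_stage(stage_rows, stage_labels, target, feature)
        stages.append((target, accepted))
        # Only the records this stage rejected are passed on to the next one.
        remaining = [(r, l) for r, l in remaining if r[feature] not in accepted]
    # Records rejected by every stage get the most common leftover label.
    leftover = Counter(label for _, label in remaining)
    fallback = leftover.most_common(1)[0][0] if leftover else None
    return stages, fallback


def predict(cascade, row, feature):
    """Return the category of the first stage that answers 'yes'."""
    stages, fallback = cascade
    for target, accepted in stages:
        if row.get(feature) in accepted:
            return target
    return fallback


# Made-up toy records: (hypothetical zone, crime category).
toy = [("A", "PROSTITUTION"), ("A", "PROSTITUTION"), ("A", "ROBBERY"),
       ("B", "ROBBERY"), ("B", "ROBBERY"), ("B", "KIDNAPPING"),
       ("C", "KIDNAPPING"), ("C", "KIDNAPPING"),
       ("D", "VANDALISM"), ("D", "VANDALISM")]
rows = [{"zone": zone} for zone, _ in toy]
labels = [label for _, label in toy]

cascade = train_cascade(rows, labels, ["PROSTITUTION", "KIDNAPPING", "ROBBERY"], "zone")
print(predict(cascade, {"zone": "B"}, "zone"))
```

The order of the stages matters, because each stage is trained only on what the earlier ones rejected, so it is worth comparing a few orderings against the single multi-class tree.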

Guidelines for the project are here.
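
For those working in Python rather than Knime, the "permanent sample" of Hint 1 (a fixed temporal window instead of a random draw) might be selected as in the following sketch. It assumes a timestamp column named `Dates` in the format `YYYY-MM-DD HH:MM:SS`; check both against the data description on Kaggle. The function name `select_window` and the four inline rows are invented for illustration.

```python
import csv
import io
from datetime import datetime


def select_window(rows, start, end, date_field="Dates", fmt="%Y-%m-%d %H:%M:%S"):
    """Keep the rows whose timestamp t satisfies start <= t < end.

    The same window always yields the same subset, so the sample is
    reproducible and can be described exactly in the final report.
    """
    return [row for row in rows
            if start <= datetime.strptime(row[date_field], fmt) < end]


# Made-up rows standing in for the real CSV file (column names are assumptions).
sample_csv = """Dates,Category
2014-12-31 23:50:00,ROBBERY
2015-01-01 00:10:00,ASSAULT
2015-03-15 12:00:00,VANDALISM
2015-05-13 23:53:00,ROBBERY
"""

rows = list(csv.DictReader(io.StringIO(sample_csv)))
subset = select_window(rows, datetime(2015, 1, 1), datetime(2015, 4, 1))
print([row["Category"] for row in subset])
```

With the real file, `io.StringIO(sample_csv)` would be replaced by an opened file handle; a geographical filter would follow the same pattern on the coordinate or district columns.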

Exam DM part II (DMA)

The exam is composed of three parts:

  • A written exam, with exercises and questions about classification (advanced topics), sequential patterns and time series.
  • A project, assigned among those proposed during the classes, or proposed by the students themselves. In the latter case, they are invited to submit a short project proposal (max. 1 page) describing the data to use and the analysis objectives. The work done for the project should be summarized in a report, to be sent to the teachers at least 2 days before the oral exam. The proposed projects are the following:
    1. Market basket: Individual vs collective purchase behaviours
    2. Online services: Churn analysis on LastFM listenings
    3. Mobility: Taxi cabs & criminality in San Francisco
  • An oral exam, that includes: (1) discussing the project report with a group presentation (15 minutes for the whole group); (2) discussing topics presented during the classes, including the theory of the parts already covered by the written exam.

Exam sessions

Mid-term exams

Date Hour Place Notes Marks
First Mid-term 2015 Friday 06.11.2015 14.00 Room A Results
Second Mid-term 2015 Wednesday 16.12.2015 11.00 Room A1 Results
Date Hour Place Notes Marks
Mid-term 2016 Monday 04.04.2016 9.00 Room A1 Results

Regular exam sessions

Session Date Time Room Notes Results
1. Monday, 18 January 2016 9.00 A1 The dates for the oral exam will be set on the same day.
2. Monday, 08 February 2016 9.00 A1 The dates for the oral exam will be set on the same day.
3. Monday, 30 May 2016 9.00 C The dates for the oral exam will be set on the same day. DM1: Written exam results DM2: Written exam results
4. Monday, 20 June 2016 9.00 C The dates for the oral exam will be set on the same day.
5. Friday, 08 July 2016 9.00 C The dates for the oral exam will be set on the same day.
6. Monday, 05 Sept 2016 9.00 C The dates for the oral exam will be set on the same day.

Extra sessions A.Y. 2014/15

Date Time Room Notes Results
6 November 2015 14:00-16:00 Room A
04 April 2016 9.00-13:00 Room A1

Previous years' editions

dm/dm.2015-16.txt · Last modified: 21/09/2016 at 22:12 by Anna Monreale