====== Data Mining A.A. 2013/14 ======

Instructors - Docenti:
  * **Dino Pedreschi, Fosca Giannotti, Mirco Nanni**
  * KDD Laboratory, Università di Pisa and ISTI - CNR, Pisa
  * [[http://www-kdd.isti.cnr.it]]
  * [[dino.pedreschi@di.unipi.it]]
  * [[fosca.giannotti@isti.cnr.it]]
  * [[mirco.nanni@isti.cnr.it]]

Teaching assistant - Assistente:
  * **Anna Monreale**
  * KDD Laboratory, Università di Pisa and ISTI - CNR, Pisa
  * [[annam@di.unipi.it]]

====== News ======
  * **[17/10/2014] Extra exam session for Academic Year 2013/2014: Friday 7 November 2014, 9:00-11:00, room C1**
  * [01/09/2014] The starting time of the next exam session (Sept. 9th, 2014) has been moved to 3:30 P.M.
  * [13/07/2014] Results of DM II (written exam) available ([[dm:results_20140630|Link]])
  * [26/06/2014] The rooms for the second session of exams (30/6-2/7) have changed
  * [11/06/2014] Results of DM I (written exam) available ([[dm:results_20140609|Link]])
  * [22/05/2014] Evaluation of Homework 6 (DM2) is here: {{:dm:SPM-voti.pdf|online}}
  * [24/04/2014] The deadline for the second exercise has been postponed to May 5th, 2014.
  * [15/04/2014] The email address datamining [dot] unipi [at] gmail [dot] com must be used only for submitting exercises!
  * [15/04/2014] Tomorrow, Wednesday 16 April 2014, from 14:00 to 16:45: office hours for clarifications on the classification report (Office 375, Dipartimento di Informatica)
  * [07/04/2014] Evaluation of Homework 5 (DM2) is {{:dm:2014-Val-ES5.pdf|online}}
  * [07/04/2014] The text for the second exercise, on sequential patterns, has been released. Deadline: 21/04/2014.
  * [11/03/2014] Detailed instructions for exams have been published. See: [[dm/start?&#esame_dm_parte_ii|Instructions for exam AA 2013-14]]. Notice: the proposed seminars can replace only the __second__ exercise, and not the first one.
  * [30/01/2014] The next exam session will be on Monday 10/02/2014 at 9.30 in Room C.
Note that on that date students can take either the written exam or the oral exam. On Monday we will also set further dates for the oral exam, so all students are invited to attend in order to choose their exam date.
  * [20/01/2014] Results of the first written test are {{:dm:ris-verifica-2014-01-16.pdf|online}}
  * [07/01/2014] Evaluation of the second homework is {{:dm:2014-valutazioni-cla.pdf|online}}
  * [09/12/2013] Evaluation of the first homework is {{:dm:2013-valutazioni-finali.pdf|online}}
  * [11/10/2013] Remember to register as a user of this wiki and subscribe to receive a message when this wiki is updated!

====== Learning goals -- Obiettivi del corso ======

**... a new kind of professional has emerged, the data scientist, who combines the skills of software programmer, statistician and storyteller/artist to extract the nuggets of gold hidden under mountains of data. Hal Varian, Google’s chief economist, predicts that the job of statistician will become the "sexiest" around. Data, he explains, are widely available; what is scarce is the ability to extract wisdom from them.**

//Data, data everywhere. The Economist, Special Report on Big Data, Feb. 2010.//

The wide availability of data from relational databases, the web and other sources motivates the study of data analysis techniques that allow a better understanding and an easier use of the results in decision-making processes. The goal of the course is to provide an introduction to the basic concepts of the knowledge extraction process, to the main data mining techniques and to the related algorithms. Particular emphasis is given to methodological aspects, presented through classes of paradigmatic applications such as Market Basket Analysis, market segmentation and fraud detection.
Finally, the course introduces the privacy and ethical aspects inherent in the use of inference techniques on data, of which the analyst must be aware.

The course consists of the following parts:
  - the basic concepts of the knowledge extraction process: data understanding and preparation, types of data, data measures and similarity;
  - the main data mining techniques (association rules, classification and clustering), studied in both their formal and their implementation aspects;
  - some case studies in marketing and customer relationship management, fraud detection and epidemiological studies;
  - the last part of the course introduces the privacy and ethical aspects inherent in the use of inference techniques on data, of which the analyst must be aware.

===== Reading about the "data scientist" job =====
  * Data, data everywhere. The Economist, Feb. 2010 {{:dm:economist--010.pdf|download}}
  * Data scientist: The hot new gig in tech, CNN & Fortune, Sept. 2011 [[http://tech.fortune.cnn.com/2011/09/06/data-scientist-the-hot-new-gig-in-tech/|link]]
  * Welcome to the yotta world. The Economist, Sept. 2011 {{:dm:economist-2012-dm.pdf|download}}
  * Data Scientist: The Sexiest Job of the 21st Century. Harvard Business Review, Sept 2012 [[http://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century/ar/1|link]]
  * Il futuro è già scritto in Big Data. Il Sole 24 Ore, Sept 2012 [[http://www.ilsole24ore.com/art/tecnologie/2012-09-21/futuro-scritto-data-155044.shtml?uuid=AbOQCOhG|link]]
  * Special issue of Crossroads - The ACM Magazine for Students - on Big Data Analytics {{:dm:crossroadsxrds2012fall-dl.pdf|download}}
  * Peter Sondergaard, Gartner, Says Big Data Creates Big Jobs. Oct 22, 2012: [[https://www.youtube.com/watch?v=mXLy3nkXQVM|YouTube video]]
  * Towards Effective Decision-Making Through Data Visualization: Six World-Class Enterprises Show The Way. White paper at FusionCharts.com.
[[http://www.fusioncharts.com/whitepapers/downloads/Towards-Effective-Decision-Making-Through-Data-Visualization-Six-World-Class-Enterprises-Show-The-Way.pdf|download]]

====== Hours - Orario e Aule ======

**Classes - Lezioni: DM 1**

^ Day ^ Time ^ Room ^
| Thursday | 14:00 - 16:00 | Aula B |
| Friday | 14:00 - 16:00 | Aula A1 |

**Classes - Lezioni: DM 2**

^ Day ^ Time ^ Room ^
| Monday | 9:00 - 11:00 | Aula N1 |
| Wednesday | 9:00 - 11:00 | Aula L1 |

**Office hours - Ricevimento:**
  * Prof. Pedreschi: Monday 14:30 - 17:30, Dipartimento di Informatica
  * Giannotti/Nanni: appointment by email, c/o ISTI-CNR

====== Learning Material -- Materiale didattico ======

===== Textbook -- Libro di Testo =====
  * Pang-Ning Tan, Michael Steinbach, Vipin Kumar. **Introduction to Data Mining**. Addison Wesley, ISBN 0-321-32136-7, 2006
  * [[http://www-users.cs.umn.edu/~kumar/dmbook/index.php]]
  * Chapters 4, 6 and 8 are also available at the publisher's Web site.

===== Slides of the classes -- Slides del corso =====
  * The slides used in the course will be added to the calendar after each class. Most of them are taken from the slides provided by the textbook's authors: [[http://www-users.cs.umn.edu/~kumar/dmbook/index.php#item4|Slides for "Introduction to Data Mining"]].

===== Testi di esame =====
  * In addition to the texts and (where available) solutions of the exam sessions of recent years, the following exercises from earlier years are available.
* {{tdm:verifica2006.pdf|Verifica 2006}}, {{tdm:verifica2005.pdf|Verifica 2005 (con soluzioni)}}, {{tdm:verifica2004.pdf|Verifica 2004}} * {{dm:verifica.05.06.2007.pdf|Verifica 5 giugno 2007}}, {{dm:verifica.26.06.2007.pdf|Verifica 26 giugno 2007}}, {{dm:verifica.24.07.2007_corretto.pdf|Verifica 24 luglio 2007}} (e {{dm:verifica.24.07.2007_soluzioni.pdf|Soluzioni}}) * {{:dm:verifica.2008.04.03.pdf|Verifica 3 aprile 2008}} (e {{:dm:soluzioni.2008.04.03.pdf|Soluzioni}}), {{:dm:dm-tdm.appello_2008_07_18_parte1.pdf|Verifica 18 luglio 2008 - parte 1}}, {{:dm:dm-tdm.appello_2008_07_18_parte2.pdf|Verifica 18 luglio 2008 - parte 2}} ===== Data mining software===== * **[[http://www.knime.org | KNIME ]] The Konstanz Information Miner. [[http://www.knime.org/download-desktop| Download page ]]** * **[[http://www.cs.waikato.ac.nz/ml/weka/ | WEKA ]] Data Mining Software in JAVA. University of Waikato, New Zealand [[http://www.cs.waikato.ac.nz/ml/weka/ | Download page ]]** ====== Class calendar - Calendario delle lezioni (2013-2014) ====== ==== First part of course, first semester (DMF - Data mining: foundations) ==== ^ ^ Day ^ Aula ^ Topic ^ Learning material ^ Instructor ^ |1.| 26.09.2013 14:00-16:00 | B | Intro: data mining & knowledge discovery process | Textbook, Chapt. 1 {{:dm:dm_intro-2011.pdf|}} | Pedreschi | |2.| 27.09.2013 14:00-16:00 | A1 | Intro: data mining & knowledge discovery process | Textbook, Chapt. 1 {{:dm:dm_intro-2011.pdf|}} | Pedreschi | |3.| 03.10.2013 14:00-16:00 | B | Data: types and basic measures | Textbook, Chapt. 2 {{:dm:chap2_data_new.pdf|}} | Pedreschi | |4.| 10.10.2013 14:00-16:00 | B | Data: types and basic measures | Textbook, Chapt. 2 {{:dm:chap2_data_new.pdf|}} | Pedreschi | |5.| 11.10.2013 14:00-16:00 | A1 | Exploratory data analysis and data understanding. | Textbook, Chapt. 3 {{:dm:chap3_data_exploration.pdf|}} | Pedreschi | |6.| 17.10.2013 14:00-16:00 | B | Frequent Pattern Mining. | Textbook, Chapt. 
6 {{:dm:2-3tdm-restructured_assoc_2013.pdf|}} | Giannotti | |7.| 18.10.2013 14:00-16:00 | A1 | Frequent Pattern Mining. | Textbook, Chapt. 6 {{:dm:2-3tdm-restructured_assoc_2013.pdf|}} | Giannotti | |8.| 24.10.2013 14:00-16:00 | B | Association Rule Mining. | | Giannotti | |9.| 25.10.2013 14:00-16:00 | A1 |Association Rule Mining and Knime | Textbook, Chapt. 6 {{:dm:ar_example.pdf|Example on AR}} {{:dm:knime_dataunder_ar.pdf|Knime}}| Monreale | |10.| 31.10.2013 14:00-16:00 | B | Classification and predictive methods | Textbook, Chapt. 4 {{:dm:chap4_basic_classification.pdf|}} | Pedreschi | |11.| 14.11.2013 14:00-16:00 | B | Classification. Decision trees | Textbook, Chapt. 4 {{:dm:chap4_basic_classification.pdf|}} | Pedreschi | |12.| 15.11.2013 14:00-16:00 | A1 | Classification. Decision trees |Textbook, Chapt. 4 {{:dm:chap4_basic_classification.pdf|}} | Pedreschi | |13.| 21.11.2013 14:00-16:00 | B | Classification. Rule-based and bayesian methods |Textbook, Chapt. 4 {{:dm:chap4_basic_classification.pdf|}} | Pedreschi | |14.| 22.11.2013 14:00-16:00 | A1 | Classification. Validation and Weka Lab | | Pedreschi | |16.| 28.11.2013 14:00-16:00 | B | Classification. Validation and Weka Lab. Clustering: introduction. | Textbook, Chapt. 8 {{:dm:dm2014_clustering_intro.pdf|}} | Nanni | |15.| 29.11.2013 14:00-16:00 | A1 | Clustering analysis. Centroid-based methods | Textbook, Chapt. 8 {{:dm:dm2014_clustering_kmeans.pdf|}} | Nanni | |16.| 05.12.2013 14:00-16:00 | B | Clustering analysis. Hierarchical methods| Textbook, Chapt. 8 {{:dm:dm2014_clustering_hierarchical.pdf|}} | Nanni | |17.| 06.12.2013 14:00-16:00 | A1 | Clustering analysis. Density-based methods | Textbook, Chapt. 8 {{:dm:dm2014_clustering_dbscan.pdf|}} | Nanni | |18.| 12.12.2013 14:00-16:00 | B | Clustering analysis. Validation and Weka Lab | Textbook, Chapt. 8 {{:dm:dm2014_clustering_validation.pdf|}} | Nanni | |19.| 13.12.2013 14:00-16:00 | A1 | Wrap-up. 
Presentation of Second Semester syllabus| | Nanni | ==== Second part of course, second semester (DMA - Data mining: advanced topics and case studies) ==== ^ ^ Day ^ Aula ^ Topic ^ Learning material ^ Instructor ^ |1.| 17.02.2014 9:00-11:00 | N1 | Introduction + Advanced Classification Methods / 1 | Textbook, Chapt. 5 {{:dm:chap5_alternative_classification.pdf|}} | Pedreschi | |2.| 19.02.2014 9:00-11:00 | L1 | Advanced Classification Methods / 2 | | Pedreschi | |3.| 24.02.2014 9:00-11:00 | N1 | Advanced Classification Methods / 3 | | Pedreschi | |4.| 26.02.2014 9:00-11:00 | L1 | Case study- CRM1- Customer Segmentation - CRISP| {{:dm:1.dm2-intro-airmiles-stulong-crisp.ppt.pdf|}} | Giannotti | |5.| 3.03.2014 9:00-11:00 | N1 | Sequential patterns / 1 | {{:dm:2.dm2_association_analysis_in_short_sequentialpatterns.ppt.pdf|}} | Giannotti | |6.| 5.03.2014 9:00-11:00 | L1 | Case Study: CRM on retail selling / 1 - Churn analysis| {{:dm:2.dm3_churn-analysis.ppt.pdf|}} | Giannotti | |7.| 10.03.2014 9:00-11:00 | N1 | Sequential patterns / 2 | {{:dm:3.dm2_sequentialpatterns.ppt.pdf|}} | Giannotti | | | 12.03.2014 9:00-11:00 | L1 | Suspended| |8.| 17.03.2014 9:00-11:00 | N1 | Graph mining | {{:dm:graph_mining_2014_fixed.pdf|}} | Nanni | |9.| 19.03.2014 9:00-11:00 | L1 | Case Study: CRM on retail selling - Promotions/ 1 | {{:dm:dm2_crm_promotional-sales_2014.pdf|}} {{:dm:nanni-spinsanti.forecast_promotions.2010-a1-010.pdf|Paper on promotions}}| Giannotti | |10.| 24.03.2014 9:00-11:00 | N1 | Time series / 1 | {{:dm:time_series_from_keogh_tutorial.pdf|}} | Nanni | |11.| 26.03.2014 9:00-11:00 | L1 | Case Study: CRM on retail selling - Promotions / 2 | | Giannotti | |12.| 07.04.2014 9:00-11:00 | N1 | Time series / 2 | | Nanni | |13.| 09.04.2014 9:00-11:00 | L1 | Case Study: Geo-marketing | {{:dm:3.dm2012_st_events.pdf|Geo-churn}}, {{:dm:crm2014_pennacchioli_bigdata13.pdf|}} | Nanni | |14.| 14.04.2014 9:00-11:00 | N1 | Spatial/Spatiotemporal analysis / 1| 
{{:dm:7.dm2_mobilitydatamining_.pptx.pdf|}} {{:dm:chap06_mobility_data_mining-1.pdf|}}| Giannotti |
|15.| 16.04.2014 9:00-11:00 | L1 | Spatial/Spatiotemporal analysis / 2 & Projects presentation| {{:dm:dm2_projects_2014.pdf|}} | Giannotti & Nanni |
|16.| 28.04.2014 9:00-11:00 | N1 | Case study: Mobility / 1 | {{:dm:mobilitydatamining_case_studies_1.pdf|Mobility case studies 1}} | Giannotti |
|17.| 30.04.2014 9:00-11:00 | L1 | Platform M_Atlas | | Nanni |
|18.| 05.05.2014 9:00-11:00 | N1 | **Students' short seminars** | {{:dm:miningcustomerbehavior.pdf|Mining changes in customer behavior in retail marketing.}}, {{:dm:online-customerbehavior.pdf|An e-customer behavior model with online analytical mining for internet marketing planning.}} | Nanni |
|19.| 07.05.2014 9:00-11:00 | L1 | Case study: Mobility / 2 | {{:dm:dm2_2014_gsm_data_mining.pdf|Mobility case studies 2}} | Nanni |
|20.| 12.05.2014 9:00-11:00 | N1 | Ethical Issues in Data Analytics | {{:dm:9.dm2_privacyregulation_technology.pdf|Privacy: Regulations and Privacy Aware Data Mining}} | Giannotti |
|21.| 14.05.2014 9:00-11:00 | L1 | Ethical Issues / Fraud Detection Case Study | | Giannotti |
|22.| 19.05.2014 9:00-11:00 | N1 | Projects discussion | | Giannotti/Nanni |

====== Modalità di esame ======

===== Esame DM parte I =====

The exam consists of a written test and an oral test:
  * The **written test** consists mainly of exercises on the methods and algorithms covered in class. The texts of past exam sessions are regularly published online and can be used as a general reference. The written test can be replaced by the two mid-term tests: if both are passed, the average of their marks is the mark with which the student is admitted to the oral test, unless the student takes the written exam again, in which case the most recent mark replaces the earlier ones (for better or for worse).
It is not possible to retake a single mid-term test during the regular exam sessions. For A.Y. 2013-2014, the mid-term tests are replaced by a series of exercises assigned during the course.
  * The **oral test** covers the more theoretical aspects of the course (definitions, methods, algorithms, etc.) presented in class, or consists of a discussion of readings agreed with the instructors.

===== Esame DM parte II =====

[ Italian ] L'esame consta di tre parti:
  * Un **esame scritto**, con domande ed esercizi su classificazione (aspetti avanzati), pattern sequenziali e graph mining. L'esame scritto può essere sostituito da due piccole prove in itinere, che vanno consegnate entro i termini comunicati durante il corso:
    * **Esercitazione 1** su "Advanced Classification Methods" -- consegna: 17 marzo 2014.
    * **Esercitazione 2** su "Sequential patterns, Graph mining & Time series" -- consegna: 5 maggio 2014. Questa seconda esercitazione può essere sostituita da un seminario da tenere nei giorni di lezione appositamente indicati. Ogni seminario sarà tenuto congiuntamente da 2 studenti e consisterà in una presentazione di **15 minuti** che riassume uno degli articoli del seguente elenco: [[dm:Elenco_Articoli_2014|]]. Gli interessati sono pregati di contattare i docenti all'indirizzo [[mirco.nanni@isti.cnr.it]] esprimendo le proprie preferenze sull'articolo da presentare e sulla data di presentazione, le quali, nei limiti del possibile, saranno seguite dai docenti per l'assegnazione dei seminari.
  * Un **progetto**, assegnato tra quelli presentati a lezione -- vedi: {{:dm:dm2_projects_2014.pdf|}}. Gli interessati sono pregati di (1) scrivere a [[mirco.nanni@isti.cnr.it]] comunicando i nomi del proprio gruppo (max. 3); (2) svolgere il progetto utilizzando i dati del progetto assegnato e seguendo la traccia acclusa; e (3) inviare ai docenti una relazione che riassuma procedimento e risultati del progetto stesso, almeno 2 giorni prima di sostenere l'esame orale.
  * Un **orale**, che include: (1) discussione del progetto svolto tramite presentazione di gruppo (15min per gruppo); (2) discussione degli argomenti trattati a lezione.

[ English ] The exam is composed of three parts:
  * A **written exam**, with exercises and questions about classification (advanced topics), sequential patterns and graph mining. As an alternative, this written exam can be replaced by two exercises that are provided during the course and that must be completed and returned within the specified deadlines:
    * **Exercise 1** on "Advanced Classification Methods" -- deadline: March 17, 2014.
    * **Exercise 2** on "Sequential patterns, Graph mining & Time series" -- deadline: May 5th, 2014. This exercise can be replaced by a short seminar to be given during classes in specified time slots. Each seminar will be held jointly by 2 students, and will consist of a 15-minute presentation summarizing one of the papers listed here: [[dm:Elenco_Articoli_2014|]]. Interested students should contact the teachers at [[mirco.nanni@isti.cnr.it]], stating their preferences about the paper to present and the date of presentation; these preferences will be followed as far as possible when assigning the seminars.
  * A **project**, assigned among those proposed during the classes -- see: {{:dm:dm2_projects_2014.pdf|}}. Interested students should (1) write to [[mirco.nanni@isti.cnr.it]] to communicate the composition of their group (max. 3 people per group); (2) carry out the project using the data of the assigned project and following the requirements provided with it; and (3) send the teachers a short report that summarizes the analysis process and the main results, at least 2 days before the oral exam.
  * An **oral exam**, which includes: (1) a discussion of the project through a group presentation (15 minutes per group); (2) a discussion of the topics presented during the classes.
====== Esercizi 2013-2014 ======

===== Esercizi DM parte I -- Exercises DM First Part =====
  * **Data Understanding: CarDrivers dataset. Assigned on: 31.10.2013. To be completed by: 25.11.2013. Send papers (3 pages max of text, figures excluded) by email to [[datamining.unipi@gmail.com]]. Use "[DM] exercise 1" in the subject.** Download the {{:dm:cardrivers.rar|Car Drivers dataset}} (in CSV format, compressed). The dataset contains a number of variables describing the driving habits of a population of car drivers, in terms of number, length and duration of travels, probability of travelling on highways, in cities, at night, entropy of travels over roads, places, or in time, radius of gyration (average distance from mean position or most frequent location L1), and so on. Explore the dataset with the analytical tools of KNIME or Weka (or whatever you like) and write a concise "data understanding" report describing the data semantics and assessing data quality, the distribution of the variables and the pairwise correlations. Descriptions of the variables are [[dm:start:dataunderstanding|here]].
  * **Market Basket Analysis: SuperMarket dataset. Assigned on: 31.10.2013. To be completed by: 25.11.2013. Send papers (3 pages max of text, figures excluded) by email to [[datamining.unipi@gmail.com]]. Use "[DM] exercise 1" in the subject.** Download the {{:dm:transactions.zip|SuperMarket dataset}} (in CSV format, zipped). Given a database of customer transactions of a supermarket, find the sets of frequently co-purchased items and analyse the most interesting association rules that can be derived from the frequent patterns. Provide a short document that illustrates the input dataset, the adopted frequent pattern algorithm and the association rule analysis, discussing your findings on the most interesting rules.
The database is composed of two files: (1) transactions.csv, containing the customer transactions, where each row contains a SCONTRINO_ID (transaction code) and a COD_MKT_ID (the code of the item purchased); (2) segments-description.csv, containing the full description of each item. For each COD_MKT_ID you can find information about the CATEGORY, SECTOR, AREA, SEGMENT and so on. Perform the analysis at the segment level.
  * **Classification: Census Dataset. Assigned on: 29.11.2013. To be completed by: 13.12.2013. Send papers (3 pages max of text, figures excluded) by email to datamining [dot] unipi [at] gmail [dot] com. Use "[DM] exercise 3" in the subject.** Download the Adult Dataset here: [[http://archive.ics.uci.edu/ml/datasets/Census+Income]], where you can also find the data description. Objective: find decision trees that predict whether each individual has an income higher or lower than 50,000 dollars. The paper has to illustrate the input dataset, some data understanding analyses, the adopted classification methodology, and the validation and interpretation of the decision trees.
  * **Clustering analysis: WarLogs Dataset. Assigned on: 20.12.2013. To be completed by: 13.01.2014. Send papers (3 pages max of text, figures excluded) by email to datamining [dot] unipi [at] gmail [dot] com. Use "[DM] exercise 4" in the subject.** Download the dataset, in CSV format, here: {{:dm:warlogs.csv.zip| warlogs.csv.zip}}. Descriptions of the variables are [[dm:warlogs2013-14|here]]. **Problem**: the exercise requires performing two distinct clustering analyses on the dataset: 1) one aimed at grouping events based on their impact on the population and on the forces involved -- casualties, captured or wounded units, etc.; 2) the other aimed at grouping events based on their location, in order to discover geographic areas where events are denser. Optionally, the temporal dimension can be involved in the process (e.g.
to split the dataset, or directly as an additional attribute in the clustering). Each cluster should be properly explored and characterized in comparison with the others.

===== Esercizi DM parte II - DM exercises Part 2 =====
  * **Advanced classification analysis: WarLogs Dataset. Assigned on: 03.03.2014. To be completed by: 17.03.2014. Send papers (3 pages max of text, figures excluded) by email to datamining [dot] unipi [at] gmail [dot] com. Use "[DM] exercise 5" in the subject.** Download the dataset, in CSV format, here: {{:dm:warlogs.csv.zip| warlogs.csv.zip}}. Descriptions of the variables are [[dm:warlogs2013-14|here]]. **Problem**: Consider again the WarLogs dataset and define your own classification problem, choosing the target variable yourself. Perform the preprocessing needed for the chosen classification problem, and solve it with at least 3 different methods from the categories studied (trees, rules, Bayesian, SVM, ensemble, kNN). Discuss the relative performance of the classifiers obtained according to the various quality measures.
  * **Sequential pattern analysis: WarLogs Dataset. Assigned on: 07.04.2014. To be completed by: 05.05.2014. Send papers (3 pages max of text, figures excluded) by email to datamining [dot] unipi [at] gmail [dot] com. Use "[DM] exercise 6" in the subject.** Download the dataset, in CSV format, here: {{:dm:warlogs.csv.zip| warlogs.csv.zip}}. Descriptions of the variables are [[dm:warlogs2013-14|here]]. **Problem**: Build a dataset of sequences that describe, **for each day** and **for each geographical area**, the sequence of **events** that happened there. The **geographical areas** to adopt can be those indicated in the "region" attribute already in the dataset, or they can be obtained by partitioning the territory in some other way, for instance to obtain more balanced areas.
The **events** to consider can be, for instance, represented by the "category" or "type" attributes in the dataset, or they can be computed from other information (kind of casualties, number of wounded or killed victims, etc.). Use this dataset to extract a set of frequent sequential patterns.

**Tools for sequential patterns.** Among the possible alternatives, we suggest adopting one of the following:
  * **Weka**: use the GeneralizedSequentialPatterns associator. The input dataset should contain one pair per line, and the lines should be temporally ordered (there is no explicit timestamp in the data). Here is an example: {{:dm:sequence_data.csv.zip|}}.
  * **Spam**: command-line tool, which can be downloaded {{:dm:spam_bin.zip|here}} (binaries for Windows and Linux, including a sample input file). Notice that the input should contain only numeric (integer) values, so some encoding is needed. Also, input sequences longer than 64 transactions are not allowed, so they should be truncated.

====== Appelli di esame ======

===== Verifiche intermedie/Esercizi =====

^ ^ Date ^ Time ^ Place ^ Notes ^ Marks ^
| Exercise I and Exercise II | | | | | |

===== Appelli regolari / Exam sessions =====

^ Session ^ Date ^ Time ^ Room ^ Notes ^ Results ^
| 1. | Thursday 16 January 2014 | 9.30 | TBD | A1 | |
| 2. | Monday 10 February 2014 | 9.30 | TBD | C | |
| 3. | Thursday 20 February 2014 | 14.00 | TBD | Pedreschi's office | |
| 4. | Tuesday 25 February 2014 | 14.00 | TBD | Pedreschi's office | |
| 5. | Monday 9 June 2014 | 9.00 | N1 | If needed, exams will continue on 10/6 and 11/6 in room L1 | [[Results_20140609]] |
| 6. | Monday 30 June 2014 | 9.00 | N | If needed, exams will continue on 1/7 and 2/7 in rooms P and E | [[Results_20140630]] |
| 7. | Monday 21 July 2014 | 9.00 | L1 | If needed, exams will continue on 10/6 and 11/6 in room L1 | |
| 8. | Tuesday 9 September 2014 | 15.30 | C1 | | |

^ Session ^ Date ^ Time ^ Room ^ Notes ^ Results ^
| 1.
| **Monday 19 January 2015** | 9.00 | C | | |
| 2. | **Monday 16 February 2015** | 9.00 | C | | |

===== Appelli straordinari / Extra sessions =====

^ Date ^ Time ^ Room ^ Notes ^ Results ^
| 7 November 2014 | 9:00-11:00 | C1 | | |

====== Edizioni anni precedenti ======
  * [[dm.2012-13]]
  * [[dm.2011-12]]
  * [[dm.2010-11]]
  * [[dm.2009-10]]
  * [[dm.2008-09]]
  * [[dm.2007-08]]
  * [[dm.2006-07]]
  * [[PhDWorkshop2011]]
  * [[SNA.Ingegneria2011]]
  * [[SNA.IMT.2011]]
  * [[MAINS.SANTANNA.2011-12]]
  * [[MAINS.SANTANNA.DM4CRM.2012]]