====== Data Mining A.A. 2019/20 ====== ===== DM 1: Foundations of Data Mining (6 CFU) ===== Instructors - Docenti: * **Dino Pedreschi** * KDD Laboratory, Università di Pisa and ISTI - CNR, Pisa * [[http://www-kdd.isti.cnr.it]] * [[dino.pedreschi@unipi.it]] ===== DM 2: Advanced Topics on Data Mining and Applications (6 CFU) ===== Instructors: * **Riccardo Guidotti** * KDDLab, Università di Pisa, Pisa * [[https://kdd.isti.cnr.it/people/guidotti-riccardo]] * [[riccardo.guidotti@di.unipi.it]] ===== DM: Data Mining (9 CFU) ===== Instructors: * **Dino Pedreschi, Anna Monreale** * KDD Laboratory, Università di Pisa, Pisa * [[http://www-kdd.isti.cnr.it]] * [[dino.pedreschi@unipi.it]] * [[anna.monreale@unipi.it]] ====== News ====== * **[04.07.2020] Third DM2 exam session from 17/07 to 29/07**. Please register ([[https://esami.unipi.it/|here]]) before 12/07 and select your slot at the agenda link that will be available from 12/07. We remind you to submit the project one week before the exam. It is mandatory to submit the project before 15/07. Doodle will not be used for this session. Every slot can accept up to 4 students (in this case you have to register individually). Slots span from 17/07 to 29/07 inclusive. * **[21.06.2020] To help us correct the projects and organize the oral exams, everyone has to submit the project with the occupancy detection dataset before midnight of the 15th of July 2020. Another dataset will be published after this deadline, and submissions after the 15th of July must use the new dataset. The rule that the project must be submitted at least ONE WEEK before the oral exam remains valid.** * [12.06.2020] A new Doodle is available for booking the DM2 exam [[https://doodle.com/poll/k4teyt6nwsbf4kbb|here]]. * [22.05.2020] In the section "Exam DM part I (DMF)" of this page you can find the new rules for the exams of DM (9 CFU) of Computer Science and DM1 (6 CFU) of Data Science & BI and Digital Humanities. 
* [14.04.2020] CAT4 for self-evaluation is available {{ :dm:autocat4.pdf | here}} (it will not be considered for the final evaluation). Report your final mark [[https://docs.google.com/spreadsheets/d/1_57y5ELInFsCFkaVrf0_rhMm3K1wvnIwKFgSZcXFukQ/edit?usp=sharing|here]]. It is recommended to do it before 18th May 2020. Solutions are available {{ :dm:autocat_4_answ.pdf | here}}. * [06.05.2020] Submission Draft 2 deadline: 25/05/2020. We expect to find Tasks 2 and 3 completed, and any initial work on Tasks 4 and 5 is welcome. We do not care about form and presentation: what matters now is the content and evidence that you continued analyzing the data as required. * [01.05.2020] Keras Accuracy {{ :dm:keras_custom_accuracy_metrics.ipynb.zip | here}}. * [30.04.2020] DM2 Exam Rules {{ :dm:dm2_exam_rules.pdf | here}}. * [14.04.2020] CAT3 for self-evaluation is available {{ :dm:autocat3.pdf | here}} (it will not be considered for the final evaluation). Report your final mark [[https://docs.google.com/spreadsheets/d/1_57y5ELInFsCFkaVrf0_rhMm3K1wvnIwKFgSZcXFukQ/edit?usp=sharing|here]]. It is recommended to do it before 9th April 2020. Solutions are available {{ :dm:autocat_3_answ.pdf | here}}. * [08.04.2020] Submission Draft 1 deadline: 16/04/2020. We expect to find Task 1 completed, Task 2 at a good stage (say 60-70%), and any initial work on Task 3 is welcome. We do not care about form and presentation: what matters now is the content and evidence that you started analyzing the data as required. * [06.04.2020] Reading material available {{ :dm:material.zip | here}}. * [18.03.2020] CAT2 for self-evaluation is available {{ :dm:cat2_auto.pdf | here}} (it will not be considered for the final evaluation). Report your final mark [[https://docs.google.com/spreadsheets/d/1_57y5ELInFsCFkaVrf0_rhMm3K1wvnIwKFgSZcXFukQ/edit?usp=sharing|here]]. It is recommended to do it before Sunday 22nd March 2020. Solutions are available {{ :dm:cat2_auto_sol.pdf | here}}. 
* [05.03.2020] From Monday, March 9, lectures will be held online using Microsoft Teams. You can find instructions to join the course [[https://www.unipi.it/index.php/docenti2/item/download/20208_47b650be35bfa446df6426066966ec76|here - ita]], [[https://www.unipi.it/index.php/docenti2/item/download/20213_d3732df8d98c2f4c45f99726c751e079|here - eng]]. The code for joining the 420AA DATA MINING Team is rc6b0ko. Microsoft Teams will be used to replace in-person lectures and office hours. The material will be uploaded as usual on the DidaWiki web page. * [04.03.2020] In-person lectures and office hours are suspended. * [02.03.2020] CAT1 Results: {{ :dm:cat1_evaluation.pdf | CAT1}} * [24.02.2020] Project Dataset Change: [[http://archive.ics.uci.edu/ml/datasets/Occupancy+Detection+|Occupancy Detection]] * [17.01.2020] Declare project groups (max 3 people) by Monday, 24 February, adding your information at this [[https://docs.google.com/spreadsheets/d/1_57y5ELInFsCFkaVrf0_rhMm3K1wvnIwKFgSZcXFukQ/edit?usp=sharing|link]] * [12.01.2020] Project evaluation and proposed final mark: {{ :dm:2019-projecteva-proposal.pdf | Results}} * [28.12.2019] Results of the Mid-term Test of December 2019: {{ :dm:results-dm-dec-2019.pdf | Results}}. Students who did not pass the mid-term test can take the written exam during the winter session. Once we have the project marks, we will compute the average of the written and project marks. * [06.12.2019] Exercises on Clustering: {{ :dm:ex._clustering.pdf |}} * [04.11.2019] The lecture of Monday, November 4, ends at 15:00 to allow participation in the Informatica 50 event "Ora che comanda lui, quando tutto è basato sul software" (in Italian), h 15:30 at the Aula Magna Storica del Palazzo della Sapienza, UNIPI. 
Full information: [[https://www.unipi.it/index.php/informatica50-eventi/event/4726-ora-che-comanda-lui-quando-tutto-e-basato-sul-software|event web site]] * [03.10.2019] Please fill in the [[https://docs.google.com/spreadsheets/d/1oBz19UhXHcXox7QPfJ1g1jWTsNRW27cK7tc6MZrxvZE/edit?usp=sharing|spreadsheet]] with the name of the group (Group1, Group2, ...) and the list of students composing it. * [26.09.2019] Global Climate Strike: the teachers of the DM course will join the Global Climate Strike tomorrow, Friday, September 27, so the lecture is cancelled. * [18.09.2019] Event: "Privacy: limite o opportunità? Gli esempi delle Nuove Tecnologie e dei Dati Sanitari" {{ :dm:locandina_seminario_18_settembre_def_2.pdf | Information here}}. ====== Learning goals -- Obiettivi del corso ====== ** ... a new kind of professional has emerged, the data scientist, who combines the skills of software programmer, statistician and storyteller/artist to extract the nuggets of gold hidden under mountains of data. Hal Varian, Google’s chief economist, predicts that the job of statistician will become the "sexiest" around. Data, he explains, are widely available; what is scarce is the ability to extract wisdom from them. ** //Data, data everywhere. The Economist, Special Report on Big Data, Feb. 2010.// The wide availability of data from relational databases, the web, and other sources motivates the study of data analysis techniques that allow a better understanding of the data and an easier use of the results in decision-making processes. The goal of the course is to provide an introduction to the basic concepts of the knowledge extraction process, to the main data mining techniques, and to the related algorithms. Particular emphasis is placed on methodological aspects, presented through paradigmatic classes of applications such as market basket analysis, market segmentation, and fraud detection. Finally, the course introduces the privacy and ethical issues inherent in the use of data inference techniques, of which the analyst must be aware. The course consists of the following parts: - the basic concepts of the knowledge extraction process: data study and preparation, forms of data, measures and similarity of data; - the main data mining techniques (association rules, classification and clustering), covering both their formal and implementation aspects; - some case studies in marketing, customer relationship management, fraud detection and epidemiological studies; - the last part of the course introduces the privacy and ethical issues raised by data inference techniques, of which the analyst must be aware. ===== Reading about the "data scientist" job ===== * Data, data everywhere. The Economist, Feb. 2010 {{:dm:economist--010.pdf|download}} * Data scientist: The hot new gig in tech, CNN & Fortune, Sept. 2011 [[http://tech.fortune.cnn.com/2011/09/06/data-scientist-the-hot-new-gig-in-tech/|link]] * Welcome to the yotta world. The Economist, Sept. 2011 {{:dm:economist-2012-dm.pdf|download}} * Data Scientist: The Sexiest Job of the 21st Century. Harvard Business Review, Sept 2012 [[http://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century/ar/1|link]] * Il futuro è già scritto in Big Data. Il Sole 24 Ore, Sept 2012 [[http://www.ilsole24ore.com/art/tecnologie/2012-09-21/futuro-scritto-data-155044.shtml?uuid=AbOQCOhG|link]] * Special issue of Crossroads - The ACM Magazine for Students - on Big Data Analytics {{:dm:crossroadsxrds2012fall-dl.pdf|download}} * Peter Sondergaard, Gartner, Says Big Data Creates Big Jobs. Oct 22, 2012: [[https://www.youtube.com/watch?v=mXLy3nkXQVM|YouTube video]] * Towards Effective Decision-Making Through Data Visualization: Six World-Class Enterprises Show The Way. White paper at FusionCharts.com. 
[[http://www.fusioncharts.com/whitepapers/downloads/Towards-Effective-Decision-Making-Through-Data-Visualization-Six-World-Class-Enterprises-Show-The-Way.pdf|download]] ====== Hours - Orario e Aule ====== ===== DM1 & DM ===== **Classes - Lezioni** ^ Day of Week ^ Hour ^ Room ^ | Lunedì/Monday | 14:00 - 16:00 | Aula E1 | | Mercoledì/Wednesday | 16:00 - 18:00 | Aula A1 | | Venerdì/Friday | 11:00 - 13:00 | Aula C1 | **Office hours - Ricevimento:** * Prof. Pedreschi: Lunedì/Monday h 14:00 - 17:00, Dipartimento di Informatica * Prof. Monreale: Lunedì/Monday h 09:00 - 11:00, Dipartimento di Informatica ===== DM 2 ===== **Classes - Lezioni** ^ Day of Week ^ Hour ^ Room ^ | Monday | 09:00 - 11:00 | C | | Wednesday | 16:00 - 18:00 | C1 | **Office hours - Ricevimento:** * Room 268, Dept. of Computer Science * Thursday: 15-17, Room 286 * Appointment by email ====== Learning Material -- Materiale didattico ====== ===== Textbook -- Libro di Testo ===== * Pang-Ning Tan, Michael Steinbach, Vipin Kumar. **Introduction to Data Mining**. Addison Wesley, ISBN 0-321-32136-7, 2006 * [[http://www-users.cs.umn.edu/~kumar/dmbook/index.php]] * Chapters 4, 6 and 8 are available at the publisher's web site. -- I capitoli 4, 6, 8 sono disponibili sul sito del publisher. * Berthold, M.R., Borgelt, C., Höppner, F., Klawonn, F. **Guide to Intelligent Data Analysis.** Springer Verlag, 1st Edition, 2010. ISBN 978-1-84882-259-7 * Laura Igual et al. **Introduction to Data Science: A Python Approach to Concepts, Techniques and Applications**. 1st ed. 2017 Edition. * Jake VanderPlas. **[[http://shop.oreilly.com/product/0636920034919.do|Python Data Science Handbook: Essential Tools for Working with Data]]**. 1st Edition. ===== Slides of the classes -- Slides del corso ===== * The slides used in the course will be added to the calendar after each class. 
Most of them are part of the slides provided by the textbook's authors [[http://www-users.cs.umn.edu/~kumar/dmbook/index.php#item4|Slides for "Introduction to Data Mining"]]. ===== Past Exams ===== * Exercises on Clustering: {{ :dm:ex._clustering.pdf |}} * Some texts of past exams for **DM1 (6CFU)**: * {{ :dm:2017-1-19.pdf |}}, {{ :dm:2017-9-6.pdf |}}, {{ :dm:2016-05-30-dm1-seconda.pdf |}} * Some solutions of past exams containing exercises on KNN and Naive Bayes classifiers, **DM (9CFU)**: * {{ :dm:dm2_exam.2017.06.13_solutions.pdf |}}, {{ :dm:dm2_exam.2017.07.04_solutions.pdf |}}, {{ :dm:dm2_mid-term_exam.2017.06.06_solutions.pdf |}} * Some exercises (partially with solutions) on **sequential patterns** and **time series** can be found in the following texts of exams from past years: * {{ :dm:dm2_exam.2015.04.13.results.pdf|}}, {{ :dm:dm2_exam.2016.04.4_sol.pdf |}}, {{ :dm:dm2_exam.2016.04.5_sol.pdf |}}, {{ :dm:dm2_exam.2016.06.20_sol.pdf |}}, {{ :dm:dm2_exam.2016.07.08_sol.pdf |}} * Some very old exercises (some with solutions) are available here; most of them are in Italian, and not all of them cover topics in this year's program: * {{tdm:verifica2006.pdf|Verifica 2006}}, {{tdm:verifica2005.pdf|Verifica 2005 (con soluzioni)}}, {{tdm:verifica2004.pdf|Verifica 2004}} * {{dm:verifica.05.06.2007.pdf|Verifica 5 giugno 2007}}, {{dm:verifica.26.06.2007.pdf|Verifica 26 giugno 2007}}, {{dm:verifica.24.07.2007_corretto.pdf|Verifica 24 luglio 2007}} (e {{dm:verifica.24.07.2007_soluzioni.pdf|Soluzioni}}) * {{:dm:verifica.2008.04.03.pdf|Verifica 3 aprile 2008}} (e {{:dm:soluzioni.2008.04.03.pdf|Soluzioni}}), {{:dm:dm-tdm.appello_2008_07_18_parte1.pdf|Verifica 18 luglio 2008 - parte 1}}, {{:dm:dm-tdm.appello_2008_07_18_parte2.pdf|Verifica 18 luglio 2008 - parte 2}} * {{:dm:appello.2010.06.01_soluzioni.pdf| Exam with solution 2010-06-01}} {{:dm:appello.2010.06.22_soluzioni.pdf|Exam with solution 2010-06-22}} {{:dm:appello.2010.09.09_soluzioni.pdf|Exam with solution 2010-09-09}} {{:dm:appello.2010.07.13_soluzioni.pdf| Exam with solution 2010-07-13}} ===== Data mining software ===== * [[http://www.knime.org | KNIME ]] The Konstanz Information Miner. [[http://www.knime.org/download-desktop| Download page ]] * Python - Anaconda (version 3.7!): Anaconda is the leading open data science platform powered by Python. [[https://www.anaconda.com/distribution/| Download page]] (the following libraries are already included) * Scikit-learn: Python library with tools for data mining and data analysis. [[http://scikit-learn.org/stable/ | Documentation page]] * Pandas: an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language. [[http://pandas.pydata.org/ | Documentation page]] * [[http://www.cs.waikato.ac.nz/ml/weka/ | WEKA ]] Data Mining software in Java. University of Waikato, New Zealand. [[http://www.cs.waikato.ac.nz/ml/weka/ | Download page ]] ====== Class calendar - Calendario delle lezioni (2019/2020) ====== ===== First part of course, first semester (DM1 - Data mining: foundations & DM - Data Mining) ===== ^ ^ Day ^ Topic ^ Learning material ^ Instructor ^ |1.| 16.09 14:00-16:00 | Overview. Introduction to KDD | {{ :dm:1.dm-overview-corso.pdf | Course Overview}} {{ :dm:2.introduction-short.pdf | Introduction DM}} | Pedreschi | | | 18.09 16:00-18:00 | Lecture cancelled (event at Scuola S. Anna; information in the News section of this page) | | Pedreschi | |2.| 20.09 11:00-13:00 | Introduction to KDD: technologies, applications and data | | Pedreschi | |3.| 23.09 14:00-16:00 | Data Understanding (from the Berthold book!) | {{ :dm:3.dataunderstanding-2019.pdf |Slides DU}} {{ :dm:2-statistica_descrittiva.pdf |Slides on Descriptive Statistics}} useful for clarifying some statistical notions. Unfortunately this material is only in Italian. 
| Monreale | |4.| 25.09 16:00-18:00 | Data Preparation |{{ :dm:3.dm_ml_data_preparation.pdf | Slides DP}} | Monreale | | | 27.09 11:00-13:00 | Climate Strike | | | |5.| 30.09 14:00-16:00 | Introduction to Python | {{ :dm:python_basics.ipynb.zip |Python Introduction}} | Monreale | |6.| 02.10 16:00-18:00 | Clustering: Introduction + Centroid-based clustering, K-means | {{ :dm:4.basic_cluster_analysis-intro-kmeans.pdf |Clustering: Intro and K-means}} | Pedreschi | |7.| 04.10 11:00-13:00 | Lab: Data Understanding & Preparation in Knime | Knime: {{ :dm:01_data_understanding.zip |}} Data: {{ :dm:titanic.csv.zip | Titanic File}} | Monreale | |8.| 07.10 14:00-16:00 | Lab: DU Python + Project presentation | Python: {{ :dm:titanic_data_understanding2.ipynb.zip |}}| Monreale | |9.| 09.10 16:00-18:00 | Clustering: K-means + Hierarchical |{{ :dm:5.basic_cluster_analysis-hierarchical.pdf |}} | Monreale | |10.| 11.10 11:00-13:00 | Cancelled for the Internet Festival | | Pedreschi | |11.| 14.10 14:00-16:00 | Clustering: DBSCAN & Validity | {{ :dm:6.basic_cluster_analysis-dbscan-validity.pdf |}}| Pedreschi | |12.| 16.10 16:00-18:00 | Exercises on Clustering | Tool for DM exercises: [[http://matlaspisa.isti.cnr.it:5055/Help|Didactic Data Mining ]] {{ :dm:ex-clustering.pdf | Ex. Clustering PDF}} {{ :dm:ex-clustering.zip |Ex. Clustering PPTX}}| Monreale | |13.| 18.10 11:00-13:00 | Lab: Clustering | {{ :dm:knime_clustering.zip |clustering_knime}} {{ :dm:python_clustering-iris.zip |clustering_python}} | Monreale | |14.| 21.10 14:00-16:00 | Classification | {{ :dm:7.chap3_basic_classification-2019.pdf |}} [[http://www.r2d3.us/visual-intro-to-machine-learning-part-1/|A visual intro to machine learning]] | Pedreschi | |15.| 23.10 16:00-18:00 | Classification | | Pedreschi | |16.| 25.10 11:00-13:00 | Classification | | Pedreschi | |17.| 28.10 14:00-16:00 | Lab: Classification | {{ :dm:knime_classification.zip | knime_classification}} {{ :dm:python_classification.zip | python_classification}} | Monreale | |18.| 30.10 16:00-18:00 | Exercises Classification + Discussion Clustering | {{ :dm:ex-classification.pdf |}}| Monreale| |19.| 04.11 14:00-15:00 | Pattern Mining | Note: the lecture will end at 15:00 to allow participation in the Informatica50 event (see news) {{ :dm:8.tdm-patterns-assrules.pdf |slides}} | Pedreschi | |20.| 06.11 16:00-18:00 | Pattern Mining | | Pedreschi | | | 08-14.11 | Project work | | | |21.| 15.11 11:00-13:00 | Exercises and Lab on Pattern Mining | {{ :dm:pattern_knime.zip |knime_pattern}} {{ :dm:pattern_python.zip |python_pattern}} https://anaconda.org/conda-forge/pyfim, http://www.borgelt.net/pyfim.html {{ :dm:ex-frequentpatterns-ar.pdf |}}| Monreale | | | 18.11 14:00-16:00 | Cancelled due to weather conditions | | | | | 20.11 16:00-18:00 | Cancelled | | | |22.| 22.11 11:00-13:00 | Exercises Classification | | Monreale | | | | **The next classes are dedicated to DM of 9 CFU** | | | |23.| 25.11 14:00-16:00 | Alternative methods for classification/1 | {{ :dm:lezioneadvancedclassificationmethods1-knn_nb.pdf | K-Nearest Neighbors & Naive Bayes }} |Pedreschi| |24.| 27.11 16:00-18:00 | Alternative methods for classification/2 | {{ :dm:ensemblemethod_wisdomofthecrowd.pdf | Wisdom of the crowd & Ensemble methods: Bagging, Random Forest & Boosting }} {{ :dm:voxpopuli-galton-1907.pdf | Galton's "Vox Populi" 1907 Nature paper}} |Pedreschi| |25.| 29.11 11:00-13:00 | Alternative methods for classification/3 | {{ :dm:lezioneadvancedclassificationmethods3_rules-ensemble.pdf | Recap Ensemble methods & Hints to Rule-based classification}} |Pedreschi| |26.| 02.12 14:00-16:00 | Alternative Methods for Pattern Mining + Ex on KNN and NB | {{ :dm:fp-growth.pdf |}} {{ :dm:ex-classification-knn-nb.pdf | KNN & NB}} |Monreale | |27.| 04.12 16:00-18:00 | Alternative Methods for Clustering | {{ :dm:1-alternative-clustering-2019.pdf |}} {{ :dm:2-transactionalclustering-2019.pdf |}} |Monreale | |28.| 06.12 11:00-13:00 | Sequential Pattern Mining | {{ :dm:sequential_patterns_2019.pdf |Sequential patterns}} |Pedreschi| |29.| 09.12 14:00-16:00 | Exercises on sequential pattern mining & ROCK | {{ :dm:exsequentialpatternmining.pdf |}} {{ :dm:ex-clustering-rock.pdf |}} |Monreale | |30.| 11.12 16:00-18:00 | Black Box Explanations | {{ :dm:2019-dm_xai.pdf |}} Material: [[https://arxiv.org/pdf/1805.10820.pdf|LORE]] [[https://www.kdd.org/kdd2016/papers/files/rfp0573-ribeiroA.pdf| LIME]] [[http://delivery.acm.org/10.1145/3240000/3236009/a93-guidotti.pdf?ip=94.38.73.6&id=3236009&acc=OA&key=4D4702B0C3E38B35%2E4D4702B0C3E38B35%2E4D4702B0C3E38B35%2ED544636226B69D47&__acm__=1576196869_06b3353aae4fe3bd8ea30d9c9c5356eb|Survey]] {{ :dm:pkdd_2019_abele_cr.pdf |ABELE}}| Monreale | |31.| 13.12 11:00-13:00 | Exercises on written exam - all students | {{ :dm:9_cfu_ex.pdf |}} {{ :dm:ex_clustering_fpm_dt.pdf |}} {{ :dm:hierarchical_max_sim.pdf |}}| Monreale | |32.| 16.12 13:30-16:00 | Mid-term Test (Rooms A, E1, C1) | |Monreale | |33.| 18.12 16:00-18:00 | Privacy in DM. Project. 
| {{ :dm:privacydt.pdf |}} {{ :dm:allegato1_chapter.pdf | Overview on Privacy}} {{ :dm:capprivacy.pdf | Privacy by design}}| Monreale | ===== Second part of course, second semester (DM2 - Advanced Topics on Data Mining and Applications) ===== ^ ^ Day ^ Room (Aula) ^ Topic ^ Learning material ^ Instructor (Guidotti)^ |1.| 17.02.2020 09:00-11:00 | C | Introduction, Instance-based and Bayesian Classifiers | {{ :dm:00_dm2_intro_2020.pdf | Intro}}, {{ :dm:00_dm2_python_libraries4data_science_2020.pdf | Libraries}}, {{ :dm:01_dm2_knn_naive_bayes_2020.pdf | Instance-Based and Bayesian Classifiers}} | | |2.| 19.02.2020 16:00-18:00 | C1 | Linear and Logistic Regression, Dimensionality Reduction, Exercises KNN and Naive Bayes | {{ :dm:02_dm2_linear_regression_2020.pdf | Regression}}, {{ :dm:02_dm2_pca_2020.pdf | Dimensionality Reduction}}, {{ :dm:01_ex_knn_naive_bayes_lift.pdf | Ex_KNN_NB_Lift}}, {{ :dm:appendix_kumar.pdf | Appendix }} | | |3.| 24.02.2020 09:00-11:00 | C | Imbalanced Learning, Performance Evaluation and Rule-based Classifiers | {{ :dm:03_dm2_imbalanced_data_2020.pdf | Imbalanced Learning}} {{ :dm:04_dm2_rule_based_classifiers_2020.pdf | Rule-based Classifiers}}| | |4.| 26.02.2020 16:00-18:00 | C1 | Exercises Lift, ROC, KNN and Naive Bayes. Lab KNN and Naive Bayes. 
| {{ :dm:01_ex_knn_naive_bayes_lift.pdf | Ex_KNN_NB_Lift}}, {{ :dm:01_knn_naive_bayes.ipynb.zip | Lab_KNN_NB}}, {{ :dm:data_preparation.py.zip | Data Preparation}}, {{ :dm:churn.csv.zip | Churn Dataset}}, {{ :dm:iris.csv.zip | Iris Dataset}}| | |5.| 02.03.2020 09:00-11:00 | C | Lab Regression, Dimensionality Reduction, Imbalanced Learning + CAT1 |{{ :dm:02_linear_logistic_regression.ipynb.zip | Regression}}, {{ :dm:03_dimensionality_reduction.ipynb.zip | Dimensionality Reduction}}, {{ :dm:04_imbalaned_data.ipynb.zip | Imbalanced Learning}} {{ :dm:airquality.csv.zip | Airquality Dataset}}| | |6.| 04.03.2020 16:00-18:00 | C1 | CRISP-DM, SVM, Intro NN | {{ :dm:05_dm2_crispdm_2020.pdf | CRISP-DM}}, {{ :dm:05_dm2_svm_2020.pdf | SVM}}, {{ :dm:06_dm2_neural_networks_2020.pdf | NN }}| | |7.| 09.03.2020 09:00-11:00 | online | Neural Network, Exercises NN | {{ :dm:06_dm2_neural_networks_2020.pdf | NN }}, {{ :dm:02_ex_nn_ensemble.pdf | Ex_NN_Ensemble}} | |8.| 11.03.2020 16:00-18:00 | online | Neural Network, Exercises NN, Deep Neural Network, Intro Ensemble, Exercises Ensemble | {{ :dm:06_dm2_neural_networks_2020.pdf | NN }}, {{ :dm:07_dm2_deep_neural_networks.pdf | DNN}} {{ :dm:02_ex_nn_ensemble.pdf | Ex_NN_Ensemble}}| | |9.| 16.03.2020 09:00-11:00 | online | Ensemble Classifiers, Exercises Ensemble | {{ :dm:08_dm2_ensemble_2020.pdf | Ensemble}}, {{ :dm:02_ex_nn_ensemble.pdf | Ex_NN_Ensemble}} | |10.| 18.03.2020 16:00-18:00 | online | Lab SVM, Neural Network, Ensemble | {{ :dm:lab_svm_nn_rf.zip | Lab_SVM_NN_RF}} | | |11.| 23.03.2020 09:00-11:00 | online | Time Series Similarity, Ex DTW | {{ :dm:11_dm2_time_series_similarity_2020.pdf | Time Series Similarity}}, {{ :dm:03_ex_ts.pdf | Ex_DTW}} | |12.| 25.03.2020 16:00-18:00 | online | Time Series Motif/Shapelet, Ex Matrix Profile |{{ :dm:12_dm2_time_series_motif.pdf | Time Series Motif/Shapelet}}, {{ :dm:03_ex_ts.pdf | Ex_MP}} | | |13.| 30.03.2020 09:00-11:00 | online | Time Series Stationarity and Forecasting | {{ 
:dm:13_dm2_time_series_forecasting_2020.pdf | Time Series Forecasting }} | |14.| 01.04.2020 16:00-18:00 | online | Lab Time Series | {{ :dm:lab_ts.zip | Lab_TS}} | | |15.| 06.04.2020 09:00-11:00 | online | Time Series Classification, Lab Time Series | {{ :dm:14_dm2_time_series_classification_2020.pdf | Time Series Classification}}, {{ :dm:lab_ts.zip | Lab_TS}}, {{ :dm:20_dm2_data_partitioning.pdf | Data Partitioning}} | | | - | 08.04.2020 | | Reading/Project Week | | | | - | 15.04.2020 | | Reading/Project Week | | | |16.| 20.04.2020 09:00-11:00 | online | Sequential Pattern Mining | {{ :dm:15_dm2_sequential_patterns_2020.pdf | SPM }} | | |17.| 22.04.2020 16:00-18:00 | online | SPM Time Constraints, Exercises, Lab | {{ :dm:04_ex_seq_clus.pdf | Ex_SPM}}, {{ :dm:12_sequential_pattern_mining.ipynb.zip | Lab_SPM}} | | |18.| 27.04.2020 09:00-11:00 | online | Advanced Clustering, Ex, SPM, Lab EM, X-Means | {{ :dm:16_dm2_advanced_clustering_2020.pdf | Advanced Clustering }}, {{ :dm:13_advanced_clustering.ipynb.zip | Lab_AC}} | | |19.| 29.04.2020 16:00-18:00 | online | Transactional Clustering, Ex TC, Lab K-Mode | {{ :dm:04_ex_seq_clus.pdf | Ex_SPM_TC}} | | |20.| 04.05.2020 09:00-11:00 | online | Anomaly Detection, Ex AD | {{ :dm:17_dm2_anomaly_outliers_2020.pdf | Anomaly Detection }}, {{ :dm:05_ex_outliers.pdf | Ex_AD}} | | |21.| 06.05.2020 16:00-18:00 | online | Anomaly Detection, Ex AD, Lab AD | {{ :dm:17_dm2_anomaly_outliers_2020.pdf | Anomaly Detection }}, {{ :dm:05_ex_outliers.pdf | Ex_AD}}, {{ :dm:14_outlier_detection.ipynb.zip | Lab_AD}} | | |22.| 11.05.2020 09:00-11:00 | online | Ethics: Privacy | {{ :dm:18_dm2_ethics_privacy_2020.pdf | Privacy}} | | |23.| 13.05.2020 16:00-18:00 | online | Ethics: Explainability | {{ :dm:19_dm2_explainability_2020.pdf | Explainability}} | | |24.| 18.05.2020 09:00-11:00 | online | Ethics: Local Explainability, Inspection, Transparent Methods, Lab | {{ :dm:19_dm2_explainability_2020.pdf | Explainability}}, {{ 
:dm:15_explainable_machine_learning.ipynb.zip | Lab_XAI}} | | | - | 20.05.2020 | | Reading/Project Week | | | | - | 25.05.2020 | | Reading/Project Week | | | | - | 27.05.2020 | | Reading/Project Week | | | ====== Exams ====== ===== Exam DM part I (DMF) ===== **RULES FOR EXAMS for COMPUTER SCIENCE - 9CFU**: {{ :dm:rules_dm_9cfu_.pdf | EXAM RULES Summer Session - 9 CFU}} **RULES FOR EXAMS for DATA SCIENCE & BI and DIGITAL HUMANITIES - DM1(6CFU)**: {{ :dm:rules_dm1_6cfu_.pdf |EXAM RULES Summer Session - DM1(6CFU)}} The exam is composed of two parts: * An **oral exam**, which includes: (1) discussing the project report; (2) discussing topics presented during the classes, including the theory behind the practical parts. The oral exam is optional for students who passed the written part through the mid-term test alone. * A **project**, consisting of exercises that require the use of data mining tools for data analysis. The exercises include: data understanding, clustering analysis, frequent pattern mining, and classification. The project has to be carried out by groups of min 3, max 4 people, using Knime, Python, or a combination of them. The results of the different tasks must be reported in a single paper of at most 20 pages of text, including figures. The paper must be emailed to [[datamining.unipi@gmail.com]]. Please use “[DM 2019-2020] Project” in the subject. Tasks of the project: - **Data Understanding:** Explore the dataset with the analytical tools studied and write a concise “data understanding” report describing the data semantics, assessing data quality, and analyzing the distribution of the variables and the pairwise correlations. (see Guidelines for details) - **Clustering analysis:** Explore the dataset using various clustering techniques. Carefully describe your decisions for each algorithm and the advantages provided by the different approaches. (see Guidelines for details) - **Classification:** Explore the dataset using classification trees. Use them to predict the target variable. (see Guidelines for details) - **Association Rules:** Explore the dataset using frequent pattern mining and association rules extraction. Then use the rules to predict a variable, either to replace missing values or to predict the target variable. (see Guidelines for details) - **ADDITIONAL TASK for DM 9 CFU (OPTIONAL)**: Students of computer science (DM 9 CFU) may deliver one additional project task, selected among the following, for a bonus of up to 3 points: - Classification: Compare the results of decision tree classification with KNN and Naive Bayes, also analysing the runtime of the training and test phases. - Clustering: Is it possible to apply EM clustering? Does the quality of the clustering result improve? * Project 1 - Dataset: **Carvana Data** - Assigned: 07/10/2019 - Deadline: 05/01/2020 08/01/2020 - Link: https://www.kaggle.com/t/712fc5e264e748afb0e0616f56f3c102 * Project 2 - Dataset: **Bank Loan Status** - Assigned: 09/01/2020 - Deadline: 4 days before the oral exam - This dataset will be used for all tasks. For the classification task, you have to split the dataset into a train and a test set; the class to predict is the variable "Loan Status". - This dataset will be valid for all the exam sessions until September. - Download the dataset {{:dm:credit_2020.zip|Bank Loan Status dataset}} (in CSV format, zipped) **Guidelines for the project are [[:dm:start:guidelines|here]].** ===== Exam DM part II (DMA) ===== The exam is composed of three parts: * A **written exam**, with exercises and questions about the methods and algorithms presented during the classes. It can be replaced by ongoing tests held during the course. * A **project**, which consists of employing the methods and algorithms presented during the classes to solve exercises on a given dataset. The project has to be carried out by groups of max 3 people. 
It has to be carried out using Knime, Python, other software, or a combination of them. The results of the different tasks must be reported in a single paper. The total length of this paper must be at most 30 pages (suggested 25) of text, including figures, plus 1 cover page (minimum font size 11, minimum line spacing 1). The project must be delivered at least 2 days before the oral exam. * An **oral exam**, which includes: (1) discussing topics presented during the classes, including the theory of the parts already covered by the written exam; (2) discussing the project report with a group presentation. * **Dataset**: the data is about Occupancy Detection and can be downloaded here: [[http://archive.ics.uci.edu/ml/datasets/Occupancy+Detection+|dataset]]. * Submission Draft 1: 16/04/2020 23:59 Italian Time * Submission Draft 2: 25/05/2020 23:59 Italian Time * Final Submission: one week before the oral exam. * **Dataset 2**: the data is about Air Quality and can be downloaded here: [[https://archive.ics.uci.edu/ml/datasets/Air+Quality|dataset]]. The dataset does not have a target variable for classification. Thus, define one, for instance "is weekend", set to "true" for weekend days and "false" for the others. * Final Submission: one week before the oral exam, and in any case by 30/11/2020. * **Project Task 1** - Basic Classifiers and Evaluation - Prepare the dataset in order to build several basic classifiers able to predict room occupancy from the available variables. You are welcome to create new variables. - Solve the classification task with k-NN (testing several values of k), Naive Bayes, Logistic Regression and Decision Tree, using cross-validation and/or random/grid search for parameter estimation. - Evaluate each classifier using Accuracy, Precision, Recall, F1, ROC, AUC and Lift Chart. - Try to reduce the dimensionality of the dataset using the methods studied (or new ones). Test PCA and try to solve the classification task in two dimensions. Plot the dataset in the two new dimensions and observe the decision boundaries of the trained algorithms. - Analyze the value distribution of the class to predict and turn the dataset into an imbalanced version with a strong majority-minority distribution (e.g. 96%-4%). Then solve the classification task again, adopting the various techniques studied (or new ones). - Select two continuous attributes, define a regression problem and try to solve it using different techniques, reporting various evaluation measures. Plot the two-dimensional dataset. Then generalize to multiple linear regression and observe how the performance varies. - Draw your conclusions about the basic classifiers and techniques adopted in this analysis. * **Project Task 2** - Advanced Classifiers and Evaluation - Using the dataset for classification prepared for Task 1, build several advanced classifiers able to predict room occupancy from the available variables. In particular, you are required to use SVM (linear and non-linear), NN (Single and Multilayer Perceptron), DNN (design at least two different architectures), and Ensemble Classifiers (Random Forest, AdaBoost, and a Bagging technique in which you can select a base classifier of your choice, with a justification). - Evaluate each classifier using Accuracy, Precision, Recall, F1, ROC, etc., and draw your conclusions about the classifiers. - Highlight in the report different aspects typical of each classifier. For instance, for SVM: is a linear model the best way to shape the decision boundary? For NN: which parameter sets or convergence criteria suggest you are avoiding overfitting? How many iterations/base classifiers are needed for a good estimation with an ensemble method? What is the feature importance for the Random Forest? - You are NOT required to also experiment with the imbalanced case, but doing so is not considered a mistake. 
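The train/evaluate/compare loop that Tasks 1 and 2 ask for can be sketched in a few lines with scikit-learn (the library used in the course labs). The snippet below is only an illustrative sketch: it uses a synthetic dataset generated with `make_classification` in place of the occupancy data, and the model parameters are arbitrary placeholders, not recommended settings for the project.

```python
# Minimal sketch: train two advanced classifiers (SVM and Random Forest)
# and compare them on held-out data. Synthetic data stands in for the
# occupancy dataset; all parameters here are illustrative only.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score

X, y = make_classification(n_samples=1000, n_features=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

# SVMs are sensitive to feature scales, so standardize using statistics
# computed on the training set only (to avoid information leakage).
scaler = StandardScaler().fit(X_train)
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)

models = {
    "SVM (RBF)": SVC(kernel="rbf", C=1.0),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
}
for name, clf in models.items():
    # Use the scaled features for the SVM, raw features for the forest.
    Xtr = X_train_s if "SVM" in name else X_train
    Xte = X_test_s if "SVM" in name else X_test
    clf.fit(Xtr, y_train)
    y_pred = clf.predict(Xte)
    print(f"{name}: accuracy={accuracy_score(y_test, y_pred):.3f}, "
          f"f1={f1_score(y_test, y_pred):.3f}")

# Feature importance, one of the aspects Task 2 asks you to discuss,
# is available directly from the fitted Random Forest.
print("RF feature importances:", models["Random Forest"].feature_importances_)
```

In a real report the same loop would be wrapped in cross-validation and a grid search over the hyperparameters, and the remaining metrics (Precision, Recall, ROC/AUC) added via `sklearn.metrics`.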
  * **Project Task 3** - Time Series Analysis and Forecasting/Classification
    - Exploit the temporal information of the dataset by preparing it for a univariate framework of analysis, i.e., select a feature and use it as your time series. You are welcome to use more than one reliable temporal split in order to obtain several time series of the same feature, and to create more than one dataset using more than one feature, reporting the results on the feature(s) you prefer. Analyze such datasets to find motifs, anomalies, and/or shapelets. Visualize and discuss them and their relationship with the class of the time series.
    - On the dataset(s) created, compute clustering based on Euclidean/Manhattan and DTW distances and compare the results. To perform the clustering you can choose among different similarity methods, e.g., shape-based, feature-based, approximation-based, compression-based, etc. Finally, analyze the clusters and the clustering, and highlight similarities and differences.
    - Apply forecasting methods on the dataset(s) created. Make sure to preprocess the time series adequately according to the method used (e.g., exponential smoothing or autoregression), checking stationarity and removing trends and seasonality, possibly with the help of a statistical significance test.
    - Solve the classification task on the univariate dataset(s) created using different approaches, e.g., traditional classification, shapelet-based, feature-based, etc.
    - Solve the classification task considering the whole dataset as a multivariate dataset. Develop the classification process you prefer (e.g., exploiting shapelets, traditional classifiers, CNN, or RNN) to maximize accuracy and F1-score.
  * **Project Task 4** - Sequential Pattern Mining
    - Convert the time series into a discrete format (e.g., SAX) in order to prepare the data for the task.
    - Using different values of support, extract the most frequent sequential patterns (of length at least 3/4), then discuss the most interesting sequences.
  * **Project Task 5** - Outlier Detection and Explainability
    - From the original dataset (i.e., not the time series built in Task 3 or the sequences of Task 4, nor the preprocessed dataset used in Tasks 1 and 2), identify the top 1% outliers.
    - Adopt at least three different methods belonging to different families (e.g., statistical/depth-based, distance-based, density-based, angle-based, …) and compare the results.
    - (Optional) Try to use an explanation method to illustrate the reasons for the classification in one of the steps of the previous Tasks (if you want to try LORE, please ask riccardo.guidotti@unipi.it for the code).
====== Appelli di esame / Exams ======
===== Mid-term exams =====
^ ^ Date ^ Hour ^ Place ^ Notes ^ Marks ^
| DM1: First Mid-term 2018 | 16.12.2019 | 13:30-16:00 | Rooms E1, C1, A | Please use the registration system: https://esami.unipi.it/ | |
===== Appelli regolari / Exam sessions =====
^ Session ^ Date ^ Time ^ Room ^ Notes ^ Marks ^
| 1. | 16.01.2019 | 14:00 - 18:00 | Room E | | |
| 2. | 06.02.2019 | 14:00 - 18:00 | Room E | | |
| 3. | 19.06.2019 | 09:00 - 13:00 | Room A1 | Oral exam on DM1 within 15 July. If you cannot make that date, you can take the oral exam in September. | {{ :dm:results.2019.06.19.pdf |Results}} |
| 4. | 10.07.2019 | 09:00 - 13:00 | Room A1 | Oral exam on DM1 within 15 July. If you cannot make that date, you can take the oral exam in September. | {{ :dm:results.2019.07.10.pdf |Results}} |
| 5. | 08.06.2020 | 09:00 - 18:00 | Microsoft Teams | From 08/06 to 25/06. Please register [[https://esami.unipi.it/|here]] and select your slot [[https://doodle.com/poll/4n47aer3ynr3m5zi|here]]. We remind you to submit the project one week before the exam. It would be helpful if you submit the project by 01/06. | |
| 6. | 26.06.2020 | 09:00 - 18:00 | Microsoft Teams | From 26/06 to 16/07. Please register [[https://esami.unipi.it/|here]] and select your slot [[https://doodle.com/poll/k4teyt6nwsbf4kbb|here]]. We remind you to submit the project one week before the exam. It would be helpful if you submit the project by 21/06. | |
| 7. | 17.07.2020 | 09:00 - 18:00 | Microsoft Teams | From 17/07 to 29/07. Please register [[https://esami.unipi.it/|here]] and select your slot at the agenda link that will be available from 12/07, only for those registered for the exam. We remind you to submit the project one week before the exam. It would be helpful if you submit the project by 10/07. It is mandatory to submit the project before 15/07. | |
====== Previous years ======
  * [[dm.2018-19]]
  * [[dm.2017-18]]
  * [[dm.2016-17]]
  * [[dm.2015-16]]
  * [[dm.2014-15]]
  * [[dm.2013-14]]
  * [[dm.2012-13]]
  * [[dm.2011-12]]
  * [[dm.2010-11]]
  * [[dm.2009-10]]
  * [[dm.2008-09]]
  * [[dm.2007-08]]
  * [[dm.2006-07]]
  * [[PhDWorkshop2011]]
  * [[SNA.Ingegneria2011]]
  * [[SNA.IMT.2011]]
  * [[MAINS.SANTANNA.2011-12]]
  * [[MAINS.SANTANNA.DM4CRM.2012]]
  * [[MAINS.SANTANNA.DM4CRM.2016]]
  * [[MAINS.SANTANNA.DM4CRM.2017 | Data Mining for Customer Relationship Management 2017]]
  * [[MAINS.SANTANNA.DM4CRM.2018]]
  * [[MAINS.SANTANNA.DM4CRM.2019]]
  * [[SDM2018 | Instructions for camera ready and copyright transfer]]
  * [[DM-SAM | Storie dell'Altro Mondo]]
  * [[DM-I40 | Master Industry 4.0]]