====== Data Mining A.A. 2022/23 ======

===== DM1 - Data Mining: Foundations (6 CFU) =====

Instructors:

  * Dino Pedreschi
    * KDDLab, Università di Pisa
    * http://www-kdd.isti.cnr.it
    * dino [dot] pedreschi [at] unipi [dot] it
  * Riccardo Guidotti
    * KDDLab, Università di Pisa
    * https://kdd.isti.cnr.it/people/guidotti-riccardo
    * riccardo [dot] guidotti [at] di [dot] unipi [dot] it

Teaching Assistant:

  * Francesco Spinnato
    * KDDLab, Scuola Normale Superiore
    * https://kdd.isti.cnr.it/people/spinnato-francesco
    * francesco [dot] spinnato [at] sns [dot] it

===== DM2 - Data Mining: Advanced Topics and Applications (6 CFU) =====

Instructors:

  * Riccardo Guidotti
    * KDDLab, Università di Pisa
    * https://kdd.isti.cnr.it/people/guidotti-riccardo
    * riccardo [dot] guidotti [at] di [dot] unipi [dot] it

Teaching Assistant:

  * Francesco Spinnato
    * KDDLab, Scuola Normale Superiore
    * https://kdd.isti.cnr.it/people/spinnato-francesco
    * francesco [dot] spinnato [at] sns [dot] it

====== News ======

  * [23.02.2023] Spinnato Booking Agenda: here
  * [20.02.2023] Project Groups link
  * [23.11.2022] To make up for skipped and suspended lectures, two extra lectures are scheduled in unusual slots: Wed 7th Dec 14.00-16.00 Room A1 and Wed 14th Dec 14.00-16.00 Room A1.
  * [15.09.2022] Project Groups link
  * [15.09.2022] MS Teams link
  * [15.09.2022] Lectures will be held in person only. Recordings of the lectures from past years can be found at the bottom of this web page.

====== Learning Goals ======

  * DM1
    * Fundamental concepts of data knowledge and discovery.
    * Data understanding
    * Data preparation
    * Clustering
    * Classification
    * Pattern Mining and Association Rules
    * Sequential Pattern Mining
  * DM2
    * Outlier Detection
    * Dimensionality Reduction
    * Regression
    * Advanced Classification and Regression
    * Time Series Analysis
    * Transactional Clustering
    * Explainability

====== Hours and Rooms ======

===== DM1 =====

Classes

^ Day of Week ^ Hour ^ Room ^
| Monday | 11:00 - 13:00 | A1 |
| Thursday | 11:00 - 13:00 | A1 |

Office hours:

  * Prof. Pedreschi
    * Monday 16:00 - 18:00
    * Online
  * Prof. Guidotti
    * Wednesday 15:00 - 17:00 or by appointment via email
    * Room 363 Dept. of Computer Science or MS Teams

===== DM2 =====

Classes

^ Day of Week ^ Hour ^ Room ^
| Monday | 09:00 - 11:00 | C1 |
| Tuesday | 09:00 - 11:00 | C1 |

Office hours:

  * Wednesday 16:30 - 18:00 or by appointment via email
  * Room 363 Dept. of Computer Science or MS Teams

====== Learning Material ======

===== Textbook =====

  * Pang-Ning Tan, Michael Steinbach, Vipin Kumar. Introduction to Data Mining. Addison Wesley, ISBN 0-321-32136-7, 2006
    * http://www-users.cs.umn.edu/~kumar/dmbook/index.php
    * Chapters 3, 5 and 7 are also available at the publisher's Web site.
  * Berthold, M.R., Borgelt, C., Höppner, F., Klawonn, F. Guide to Intelligent Data Analysis. Springer Verlag, 1st Edition, 2010. ISBN 978-1-84882-259-7
  * Laura Igual et al. Introduction to Data Science: A Python Approach to Concepts, Techniques and Applications. 1st ed. 2017 Edition.
  * Jake VanderPlas. Python Data Science Handbook: Essential Tools for Working with Data. 1st Edition.

===== Slides =====

  * The slides used in the course will be inserted in the calendar after each class. Most of them are part of the slides provided by the textbook's authors: Slides for "Introduction to Data Mining".
===== Software =====

  * Python - Anaconda (>3.7): Anaconda is the leading open data science platform powered by Python. Download page (the following libraries are already included)
    * Scikit-learn: Python library with tools for data mining and data analysis. Documentation page
    * Pandas: pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language. Documentation page

Other software for Data Mining:

  * KNIME: The Konstanz Information Miner. Download page
  * WEKA: Data Mining Software in Java. University of Waikato, New Zealand. Download page
  * Didactic Data Mining DDM

====== Class Calendar (2022/2023) ======

===== First Semester (DM1 - Data Mining: Foundations) =====

^ ^ Day ^ Time ^ Room ^ Topic ^ Learning Material ^ Lecturer ^
| 01. | 15.09.2022 | 11-13 | A1 | Overview, Intro, KDD and CRISP | Intro | Pedreschi/Guidotti |
| | 19.09.2022 | 11-13 | | No Lecture | | |
| 02. | 22.09.2022 | 11-13 | A1 | Project Guidelines & Intro to Python | Project Guidelines, Intro Python | Spinnato |
| | 26.09.2022 | 11-13 | | No Lecture | | |
| 03. | 29.09.2022 | 11-13 | A1 | Data Understanding | Data Understanding | Pedreschi |
| 04. | 03.10.2022 | 11-13 | A1 | Data Understanding & Data Preparation | Data Preparation | Pedreschi |
| 05. | 06.10.2022 | 11-13 | A1 | Lab. Data Understanding | Data Und Python | Spinnato/Guidotti |
| | 10.10.2022 | 11-13 | | No Lecture | | |
| 06. | 13.10.2022 | 11-13 | A1 | Data Preparation, Similarity | Data Similarity, Data Und Python | Pedreschi |
| 07. | 17.10.2022 | 11-13 | A1 | Intro Clustering, K-Means | Intro Clustering, K-Means | Pedreschi |
| 08. | 20.10.2022 | 11-13 | A1 | K-Means | K-Means | Pedreschi |
| 09. | 24.10.2022 | 11-13 | A1 | Hierarchical & Density-based | Hierarchical, Density | Pedreschi |
| 10. | 27.10.2022 | 11-13 | A1 | Lab. Clustering | Clustering Python | Spinnato/Guidotti |
| | 30.10.2022 | 11-13 | | No Lecture | | |
| 11. | 03.11.2022 | 11-13 | A1 | Exercises Clustering | Exercises Clustering | Guidotti |
| 12. | 07.11.2022 | 11-13 | A1 | Intro Classification | Intro Classification, kNN | Guidotti |
| 13. | 10.11.2022 | 11-13 | A1 | Eval Measures, Exercises kNN | Intro Classification, kNN | Guidotti |
| 14. | 14.11.2022 | 11-13 | A1 | Decision Tree | Decision Trees | Guidotti |
| 15. | 17.11.2022 | 11-13 | A1 | Decision Tree, Exercises DT | Decision Trees, Ex DT | Guidotti |
| 16. | 22.11.2022 | 11-13 | A1 | Decision Tree | Decision Trees | Guidotti |
| 17. | 24.11.2022 | 11-13 | A1 | Naive Bayes Classifier | NBC | Guidotti |
| 18. | 28.11.2022 | 11-13 | A1 | Lab. Classification | Classification Python | Spinnato/Guidotti |
| 19. | 01.12.2022 | 11-13 | A1 | Intro Regression | Intro Regression | Guidotti |
| 20. | 05.12.2022 | 11-13 | A1 | Pattern Mining | Pattern Mining | Pedreschi |
| 21. | 07.12.2022 | 14-16 | A1 | Pattern Mining | Pattern Mining | Pedreschi |
| | 08.12.2022 | 11-13 | | No Lecture | | |
| 22. | 12.12.2022 | 11-13 | A1 | Exercises Apriori | Exercises Apriori, Solutions | Guidotti |
| 23. | 14.12.2022 | 14-16 | A1 | Pattern Mining (FP-Growth) | Pattern Mining | Guidotti |
| 24. | 15.12.2022 | 11-13 | A1 | Lab. Pattern Mining | Pattern Mining Python | Spinnato/Guidotti |

===== Second Semester (DM2 - Data Mining: Advanced Topics and Applications) =====

^ ^ Day ^ Room ^ Topic ^ Learning Material ^ Lecturer ^
| 01. | 20.02.2023 09:00–11:00 | C1 | Course Overview, Imbalanced Learning | Intro, ImbLearn, LabImbLearn | Guidotti |
| 02. | 21.02.2023 09:00–11:00 | C1 | Dimensionality Reduction | DimRed, LabDimRed | Guidotti |
| 03. | 27.02.2023 09:00–11:00 | C1 | Outlier Detection: Taxonomy, Stat. & Depth-based | OutDet | Guidotti |
| 04. | 28.02.2023 09:00–11:00 | C1 | Outlier Detection: Distance & Density-based | OutDet | Guidotti |
| 05. | 06.03.2023 09:00–11:00 | C1 | Outlier Detection: Ensemble & Model-based | OutDet, LabOutDet | Guidotti |
| 06. | 07.03.2023 09:00–11:00 | C1 | Gradient Descent, Maximum-Likelihood Estimation | GD, MLE | Guidotti |
| 07. | 13.03.2023 09:00–11:00 | C1 | Odds, Odds Ratio, Logistic Regression | Odds, LogReg, LabLogReg | Guidotti |
| 08. | 14.03.2023 09:00–11:00 | C1 | SVM | SVM, LabSVM | Guidotti |
| 09. | 20.03.2023 09:00–11:00 | C1 | Neural Networks (Perceptron) | Perceptron | Guidotti |
| 10. | 21.03.2023 09:00–11:00 | C1 | (Deep) Neural Networks | NeuralNetwork | Guidotti |
| | 27.03.2023 09:00–11:00 | | No Lecture | | |
| | 28.03.2023 09:00–11:00 | C1 | Office Hours (in class) | LabNN | Spinnato |
| 11. | 03.04.2023 09:00–11:00 | C1 | Ensemble Models: Bagging & Random Forest | Ensemble | Guidotti |
| 12. | 04.04.2023 09:00–11:00 | C1 | Ensemble Models: Boosting | Ensemble, LabEnsemble | Guidotti |
| | 10.04.2023 09:00–11:00 | | No Lecture | | |
| | 11.04.2023 09:00–11:00 | | No Lecture | | Guidotti |
| 13. | 17.04.2023 09:00–11:00 | C1 | Ensemble Models: Gradient Boosting Machines | GBM | Guidotti |
| 14. | 18.04.2023 09:00–11:00 | C1 | Ensemble Models: Gradient Boosting Machines | GBM, LabGMB | Guidotti |
| | 24.04.2023 09:00–11:00 | | No Lecture | | |
| | 25.04.2023 09:00–11:00 | | No Lecture | | |
| | 01.05.2023 09:00–11:00 | | No Lecture | | |
| 15. | 02.05.2023 09:00–11:00 | C1 | Time Series Similarity & Distance | TS_Sim | Guidotti |
| 16. | 08.05.2023 09:00–11:00 | C1 | Time Series Clustering & Approximations | TS_Apprx_Clust, LabTSSimClus | Guidotti |
| 17. | 09.05.2023 09:00–11:00 | C1 | Time Series Patterns | TS_MatrixProfile, LabTS_MP | Guidotti |
| 18. | 15.05.2023 09:00–11:00 | C1 | Time Series Classification | TS_Classification, LabTS_Clf | Guidotti |
| 19. | 16.05.2023 09:00–11:00 | C1 | Sequential Pattern Mining | SeqPatternMining | Guidotti |
| 20. | 22.05.2023 09:00–11:00 | C1 | Sequential Pattern Mining | SeqPatternMining | Guidotti |
| 21. | 23.05.2023 09:00–11:00 | C1 | Transactional Clustering | TransactionalClustering | Guidotti |
| 22. | 24.05.2023 09:00–11:00 | C1 | Explainable AI | XAI | Guidotti |
| 23. | 25.05.2023 09:00–11:00 | C1 | Explainable AI | XAI, LabXAI | Guidotti |
| 24. | 26.05.2023 09:00–11:00 | C1 | Rule-based Models | RuleBasedClassifier | Guidotti |

====== Exams ======

How and Where: The exam is oral only and takes place in the teacher's office or in a previously designated classroom. It will be held online, on the 420AA Data Mining course channel, only at the student's request and in accordance with current legislation.

When: The start dates of the three exams are or will be published on the online platform https://esami.unipi.it/. Within each session, we will identify dates and slots so as to distribute the oral exams. The dates and slots for taking the exam will be published on the course page by the end of May. Each student must also register on https://esami.unipi.it/. The exam can be taken only after the project has been delivered, and the project must be delivered one week before the date on which you want to take the exam. Group oral discussions that follow the project groups are preferred, so that the discussion of the project can be held in parallel, but it is not mandatory to take the oral exam together with the other members of the group. If the oral exam is not passed, it cannot be retaken for 20 days. If the project is not considered sufficient, it must be redone on a new dataset or on a substantially updated version of the current one.

What: The oral test will evaluate the practical understanding of the algorithms. The exam will evaluate three aspects.

  - Understanding of the theoretical aspects of the topics addressed during the course. The student may be required to write formulas or pseudocode. During the explanations, the student can use pen and paper (if online, the student can use the Miro graphic system https://miro.com/).
  - Understanding of the algorithms illustrated during the course and their practical implementation. You will be asked to perform one or more simple exercises. The text will be shown on the teacher's screen and/or copied to Miro. The student will have to use pen and paper (if online, Miro https://miro.com/) to show how the exercise is solved.
  - Discussion of the project, with questions from the teacher regarding unclear aspects and questionable steps or choices.

Final Mark: for the 12-credit exam, the final mark is the average of the DM1 and DM2 marks.

Exam Booking Periods

  * Exam portal link: here
  * 1st Exam Call: 11/12/2022 00:00 - 05/01/2023 23:59
  * 2nd Exam Call: 01/01/2023 00:00 - 26/01/2023 23:59
  * 3rd Exam Call: 01/05/2023 09:00 - 26/05/2023 23:59
  * 4th Exam Call: 22/05/2023 09:00 - 16/06/2023 23:59
  * 5th Exam Call: 1/06/2023 09:00 - 06/07/2023 23:59
  * 6th Exam Call: 30/07/2023 09:00 - 24/08/2023 23:59

Exam Booking Agenda

  * Agenda DM1 Link: here
  * Agenda DM2 Link: here
  * 1st Exam Call: starts 10/01/2023 - Room: X1
  * 2nd Exam Call: starts 31/01/2023 - Room: X1
  * 3rd Exam Call: starts 31/05/2023 - Room: X1
  * 4th Exam Call: starts 21/06/2023 - Room: X1
  * 5th Exam Call: starts 11/07/2023 - Room: X1
  * 6th Exam Call: starts 29/08/2023 - Room: X1 & X2

Do not forget to fill in the course evaluation!

===== Exam DM1 =====

The exam is composed of two parts:

  * An oral exam, which includes: (1) discussing the project report; (2) discussing topics presented during the classes, including the theory and practical exercises.
  * A project, which consists of exercises requiring the use of data mining tools for the analysis of data. Exercises include: data understanding, clustering analysis, pattern mining, and classification (guidelines will be provided for more details). The project has to be carried out by a group of min 2, max 3 people, using Python or any other data mining software. The results of the different tasks must be reported in a single paper. The total length of this paper must be max 20 pages of text including figures. The paper must be emailed to francesco [dot] spinnato [at] sns [dot] it and riccardo [dot] guidotti [at] unipi [dot] it. Please use “[DM1 2022-2023] Project” in the subject.
  * Dataset
    - Assigned: 15/09/2022
    - MidTerm Submission: 28/11/2022 (extended) (half project required, i.e., Data Understanding & Preparation and Clustering)
    - Final Submission: 31/12/2022 or one week before the oral exam (complete project required).
    - Dataset: RAVDESS
    - Link original pages: zenodo, kaggle1, kaggle2

DM1 Project Guidelines: see Project Guidelines.

===== Exam DM2 =====

The exam is composed of two parts:

  * An oral exam, which includes: (1) discussing the project report; (2) discussing topics presented during the classes, including the theory and practical exercises.
  * A project, which consists of exercises requiring the use of data mining tools for the analysis of data. Exercises include: imbalanced learning, dimensionality reduction, outlier detection, advanced classification/regression methods, time series analysis/clustering/classification (guidelines will be provided for more details). The project has to be carried out by min 1, max 3 people, using Python or any other data mining software. The results of the different tasks must be reported in a single paper. The total length of this paper must be max 30 pages of text including figures. The paper must be emailed to francesco [dot] spinnato [at] sns [dot] it and riccardo [dot] guidotti [at] unipi [dot] it. Please use “[DM2 2022-2023] Project” in the subject.
  * Dataset
    - Assigned: 20/02/2023
    - MidTerm Submission: 30/04/2023 (Modules 1 and 2)
    - Final Submission: one week before the oral exam (complete project required).
    - Dataset: RAVDESS2
    - Link original pages: zenodo, kaggle1, kaggle2

DM2 Project Guidelines: see Project Guidelines.

====== Exam Dates ======

===== Exam Sessions =====

^ Session ^ Date ^ Room ^ Notes ^ Marks ^
| 1. | 10.01.2023 | | Please, use the system for registration: https://esami.unipi.it/ | |
| 2. | 31.01.2023 | | Please, use the system for registration: https://esami.unipi.it/ | |
| 3. | 31.05.2023 | | Please, use the system for registration: https://esami.unipi.it/ | |
| 4. | 15.06.2023 | | Please, use the system for registration: https://esami.unipi.it/ | |
| 5. | 12.07.2023 | | Please, use the system for registration: https://esami.unipi.it/ | |
| 6. | ??.??.2023 | | Please, use the system for registration: https://esami.unipi.it/ | |

===== Past Exams =====

  * The texts of past exams can be found in the old pages of the course. Please do not consider these exercises as the only way of testing your knowledge. Exercises can be changed and updated every year and will be published together with the slides of the lectures.

===== Reading About the “Data Scientist” Job =====

… a new kind of professional has emerged, the data scientist, who combines the skills of software programmer, statistician and storyteller/artist to extract the nuggets of gold hidden under mountains of data. Hal Varian, Google’s chief economist, predicts that the job of statistician will become the “sexiest” around. Data, he explains, are widely available; what is scarce is the ability to extract wisdom from them.

Data, data everywhere. The Economist, Special Report on Big Data, Feb. 2010.

  * Data, data everywhere. The Economist, Feb. 2010 download
  * Data scientist: The hot new gig in tech, CNN & Fortune, Sept. 2011 link
  * Welcome to the yotta world. The Economist, Sept. 2011 download
  * Data Scientist: The Sexiest Job of the 21st Century. Harvard Business Review, Sept 2012 link
  * Il futuro è già scritto in Big Data. Il Sole 24 Ore, Sept 2012 link
  * Special issue of Crossroads - The ACM Magazine for Students - on Big Data Analytics download
  * Peter Sondergaard, Gartner, Says Big Data Creates Big Jobs. Oct 22, 2012: YouTube video
  * Towards Effective Decision-Making Through Data Visualization: Six World-Class Enterprises Show The Way. White paper at FusionCharts.com. download

====== Previous years ======

  * Data Mining A.A. 2021/22
  * Data Mining A.A. 2020/21
  * Data Mining A.A. 2019/20
  * Data Mining A.A. 2018/19
  * Data Mining A.A. 2017/18
  * Data Mining A.A. 2016/17
  * Data Mining A.A. 2015/16
  * Data Mining A.A. 2014/15
  * Data Mining A.A. 2013/14
  * Data Mining A.A. 2012/13
  * Data Mining A.A. 2011/12
