====== Data Mining (309AA) - 9 CFU A.Y. 2023/2024 ======

**Instructor:**
  * **Anna Monreale**
    * KDDLab, Università di Pisa
    * [[anna.monreale@unipi.it]]

**Teaching Assistant:**
  * **Lorenzo Mannocci**
    * University of Pisa
    * [[lorenzo.mannocci@phd.unipi.it]]

====== News ======

  * [05.09.2023] **The lectures will start on 27th September 2023**

====== Learning Goals ======

  * Fundamental concepts of data knowledge and discovery.
  * Data understanding
  * Data preparation
  * Clustering
  * Classification
  * Pattern Mining and Association Rules
  * Outlier Detection
  * Time Series Analysis
  * Sequential Pattern Mining
  * Ethical Issues

====== Hours and Rooms ======

**Classes**

^ Day of Week ^ Hour ^ Room ^
| Wednesday | 09:00 - 11:00 | Room C1 |
| Thursday | 09:00 - 11:00 | Room C1 |
| Friday | 09:00 - 11:00 | Room C |

**Office hours:**

  * Anna Monreale: Tuesday, 11:00-13:00, online via Teams or at the Department of Computer Science, room 374/E (please request an appointment by email).
  * Lorenzo Mannocci: TBD

A [[https://teams.microsoft.com/l/team/19%3ajujTZ5yI6IyKkRl1YEGY0Iisg7RhlW1YTam_NO3-OOE1%40thread.tacv2/conversations?groupId=2ce9fd1a-3f23-47b0-92cd-8652f8be9ed6&tenantId=c7456b31-a220-47f5-be52-473828670aa1|Teams Channel]] will be used ONLY to post news, Q&A, and other material related to the course.

Lectures are held in person only and will **NOT** be live-streamed, but recordings of the lectures, or of those from previous years, will be made available here for non-attending students.

====== Learning Material ======

===== Textbook =====

  * Pang-Ning Tan, Michael Steinbach, Vipin Kumar. **Introduction to Data Mining**. Addison Wesley, ISBN 0-321-32136-7, 2006
    * [[http://www-users.cs.umn.edu/~kumar/dmbook/index.php]]
    * Chapters 4, 6 and 8 are also available at the publisher's Web site.
  * Laura Igual et al. **Introduction to Data Science: A Python Approach to Concepts, Techniques and Applications**. 1st ed. 2017 Edition.
  * Jake VanderPlas. **[[http://shop.oreilly.com/product/0636920034919.do| Python Data Science Handbook: Essential Tools for Working with Data.]]** 1st Edition.
  * For Python Notions: {{ :magistraleinformatica:dmi:python_basics.ipynb.zip | Very basic notions on Python}}

===== Slides =====

  * The slides used in the course will be added to the calendar after each class. Most of them are part of the slides provided by the textbook's authors [[http://www-users.cs.umn.edu/~kumar/dmbook/index.php#item4|Slides for "Introduction to Data Mining"]].

===== Software =====

  * Python - Anaconda (version 3.7 or later): Anaconda is a widely used open data science platform powered by Python. [[https://www.anaconda.com/distribution/| Download page]] (the following libraries are already included)
  * Scikit-learn: Python library with tools for data mining and data analysis [[http://scikit-learn.org/stable/ | Documentation page]]
  * Pandas: an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language. [[http://pandas.pydata.org/ | Documentation page]]

====== Class Calendar (2023/2024) ======

===== First Semester =====

^ ^ Day ^ Topic ^ Learning material ^ References ^ Video Lectures ^
|1. | 27.09 | Overview. Introduction to KDD | {{ :magistraleinformatica:dmi:1-overview-2023.pdf |}} {{ :magistraleinformatica:dmi:1-intro-dm.pdf |}} | Chap. 1 Kumar Book | [[https://unipiit.sharepoint.com/:v:/s/a__td_59044/EYjxO1YANqtMnr8upJa3X4oBp3wEsdjef8iSXN7LL7jcxQ?e=Jd80j9|Introduction DM - Video1]] [[https://unipiit.sharepoint.com/:v:/s/a__td_59044/Eesf2mgGU1hMjMH4qH_xJewBKtee3TWrullu269byR2bnA?e=JJ4AUx|Introduction DM - Video2]] |
|2. | 28.09 | Data Understanding | {{ :magistraleinformatica:dmi:2-data_understanding.pdf |}} | Chap. 2 Kumar Book and additional resource from the Kumar Book: [[https://www-users.cs.umn.edu/~kumar001/dmbook/data_exploration_1st_edition.pdf|Exploring Data]]. If you have the first edition of the Kumar Book, this is Chap. 3 | |
|3. | 29.09 | Data Understanding & Data Preparation | {{ :magistraleinformatica:dmi:3-data_preparation.pdf |}} | Chap. 2 Kumar Book and additional resource from the Kumar Book: [[https://www-users.cs.umn.edu/~kumar001/dmbook/data_exploration_1st_edition.pdf|Exploring Data]]. If you have the first edition of the Kumar Book, this is Chap. 3 | |
|4. | 04.10 | Data Preparation & Data Similarities | {{ :magistraleinformatica:dmi:4-data_similarity.pdf |}} | Data Similarity is in Chap. 2 | [[https://unipiit.sharepoint.com/:v:/s/a__td_59044/EWaYURxnzPdIiLiqjkS4LM8B8sme_xmm0LwtK9EptuP0Jg?e=dsZojO|DP+Similarities]] The last minutes of the lecture were not recorded because of a connection problem |
|5. | 05.10 | Python Lab: Data Understanding | {{ :magistraleinformatica:dmi:dataunderstanding.zip | DU notebooks and data}} | | [[https://unipiit.sharepoint.com/:v:/s/a__td_59044/EYWSZBIG7X1MoFOev5Th_cIBprLLN-AwSBMamgGzNju0Sw?e=jzdPx8|Python Lab on DU]] |
| | 06.10 | Cancelled | | | |
|6. | 11.10 | Introduction to Clustering. Centroid-based Clustering: K-means algorithm. | {{ :magistraleinformatica:dmi:5-basic_cluster_analysis-intro.pdf |}} {{ :magistraleinformatica:dmi:6.1-basic_cluster_analysis-kmeans.pdf |}} | Chap. 7 Kumar Book | [[https://unipiit.sharepoint.com/:v:/s/a__td_54794/EV-fDd75MIxGmazA79kFHCYBI78yYwqy7AFE5h9MN2rRqg?e=YVgdjS|Video 1: Introduction to Clustering + K-means - Part 1]] - Video of previous years |
|7. | 12.10 | Centroid-based Clustering: K-means variants. | {{ :magistraleinformatica:dmi:6.2-basic_cluster_analysis-kmeans-variants.pdf |}} | Chap. 7 Kumar Book {{ :magistraleinformatica:dmi:clusteringmixturemodels.pdf |}} {{ :magistraleinformatica:dmi:xmeans.pdf |}} | [[https://unipiit.sharepoint.com/:v:/s/a__td_54794/ETySd1UWIzxCoAKilzaXO_MBW8oXZZCjf5FEhyywGIdJBg?e=Xq2jdo|Video 2: Introduction to Clustering + K-means - Part 2]] [[https://unipiit.sharepoint.com/:v:/s/a__td_54794/EQTbbvqF2kJOgEsFQ1WF48cBjWf2wgTCbOjxcQzn9MyVzw?e=KQ7gEZ|Video 1: Center-based clustering - Bisecting K-means, Xmeans, EM]] - Videos of previous years |
| | 13.10 | Suspension of teaching | | | Recording in Teams Channel |
|8. | 18.10 | Hierarchical and Density-based Clustering | {{ :magistraleinformatica:dmi:7.basic_cluster_analysis-hierarchical.pdf |}} {{ :magistraleinformatica:dmi:8.basic_cluster_analysis-dbscan-validity.pdf |}} | Chap. 7 Kumar Book | Recording in Teams Channel |
|9. | 19.10 | Clustering Validity & Python Lab: K-means Clustering | {{ :magistraleinformatica:dmi:8.basic_cluster_analysis-dbscan-validity.pdf |}} | Chap. 7 Kumar Book | Recording in Teams Channel |
|10. | 20.10 | Python Lab: Density-based and Hierarchical Clustering + Introduction to Classification | {{ :magistraleinformatica:dmi:clustering.zip | Notebook on Clustering}} {{ :magistraleinformatica:dmi:9.chap3_basic_classification-2023.pdf |}} | Chap. 3 Kumar Book | Recording in Teams Channel |
|11. | 25.10 | Decision Trees & Classifier Evaluation | Same slides as previous lecture | Chap. 3 Kumar Book | Recording in Teams Channel |
|12. | 26.10 | Classifier Evaluation | Same slides as previous lecture | Chap. 3 Kumar Book | |
|13. | 27.10 | Rule-based Classifiers | {{ :magistraleinformatica:dmi:10-rule-based-classifiers.pdf |}} | Chap. 4 Kumar Book | Recording in Teams Channel |
|14. | 02.11 | Rule-based Classifiers + Instance-based Classifiers | {{ :magistraleinformatica:dmi:10-knn.pdf |}} | Chap. 4 Kumar Book | Recording in Teams Channel |
|15. | 03.11 | Naive Bayesian Classifier. SVM. Ensemble Classifiers | {{ :magistraleinformatica:dmi:11_2023-naive_bayes.pdf |}} {{ :magistraleinformatica:dmi:14_svm_2023.pdf |}} {{ :magistraleinformatica:dmi:13_ensemble_2023.pdf |}} | Chap. 4 Kumar Book | Recording in Teams Channel |
|16. | 08.11 | Python Lab: Classification | {{ :magistraleinformatica:dmi:classification.zip |}} | | Recording in Teams Channel |
|17. | 09.11 | NN Classifiers | {{ :magistraleinformatica:dmi:15_neural_networks_2023.pdf |}} | Chap. 4 Kumar Book | Recording in Teams Channel |
|18. | 10.11 | Python Lab: NN & Imbalanced Classification | {{ :magistraleinformatica:dmi:imbalanced_classification.zip |}} | | Recording in Teams Channel |
|19. | 15.11 | Association Rule Mining: Apriori | {{ :magistraleinformatica:dmi:17_association_analysis.pdf |}} | Chap. 5 Kumar Book | Recording in Teams Channel |
|20. | 16.11 | Association Rule Mining: Evaluation and FP-Growth | {{ :magistraleinformatica:dmi:17_2023-fp-growth.pdf |}} | Chap. 5 Kumar Book | Recording in Teams Channel |
|21. | 17.11 | Sequential Pattern Mining | {{ :magistraleinformatica:dmi:18_sequential_patterns_2023.pdf |}} | Chap. 6 Kumar Book | Recording in Teams Channel |
|22. | 22.11 | Sequential Pattern Mining: timing constraints. Time Series Analysis: Similarities, Distances and Transformations | {{ :magistraleinformatica:dmi:22_time_series_similarity_2023.pdf |}} | [[https://cs.gmu.edu/~jessica/BookChapterTSMining.pdf |Overview on Time Series]] | Recording in Teams Channel |
|23. | 23.11 | Time Series Analysis: Shapelet & Motif | {{ :magistraleinformatica:dmi:23_time_series_motif-shapelets2023.pdf |}} | {{ :magistraleinformatica:dmi:shaplet.pdf |}} | Recording in Teams Channel |
|24. | 24.11 | Time Series Analysis: Shapelet & Motif; Introduction to Ethics and Privacy | Same slides as previous lecture and {{ :magistraleinformatica:dmi:19_ethics_privacy_2023_intro.pdf |}} | {{ :magistraleinformatica:dmi:matrixprofile.pdf |}} [[https://www.cs.ucr.edu/~eamonn/MatrixProfile.html|Papers and resources on motifs]] | Recording in Teams Channel |
|25. | 29.11 | Python Lab: ARM, SPM, Time series transformations | {{ :magistraleinformatica:dmi:ar_spm.zip |}} {{ :magistraleinformatica:dmi:timeseries.zip |}} | | Recording in Teams Channel |
|26. | 30.11 | Python Lab: Time series analysis | Notebooks in the zip file of the previous lecture | | Recording in Teams Channel |
|27. | 01.12 | Privacy in AI and Big Data Analytics | {{ :magistraleinformatica:dmi:19_ethics_privacy2023.pdf |}} This set of slides also includes the introduction from the lecture of 24.11.2023 | {{ :magistraleinformatica:dmi:chap-anonymity.pdf |}} {{ :magistraleinformatica:dmi:prudence.pdf |}} {{ :magistraleinformatica:dmi:chapter-ppdm.pdf |}} | Recording in Teams Channel |
|28. | 06.12 | Explainable AI | {{ :magistraleinformatica:dmi:20_explainability_2023.pdf |}} | {{ :magistraleinformatica:dmi:lore-tabular.pdf |}} {{ :magistraleinformatica:dmi:xai-survey.pdf |}} {{ :magistraleinformatica:dmi:imagexai.pdf |}} {{ :magistraleinformatica:dmi:timeseriesxai.pdf |}} | Recording in Teams Channel |
|29. | 07.12 | Explainable AI | {{ :magistraleinformatica:dmi:21_anomaly_detection_2023.pdf |}} {{ :magistraleinformatica:dmi:anomaly_detection.zip |}} | | Recording in Teams Channel |
|30. | 13.12 | Anomaly Detection | {{ :magistraleinformatica:dmi:21_anomaly_detection_2023.pdf |}} | | Recording in Teams Channel |
|31-32. | 14.12 09-11 | Python Lab on AD + Python Lab on XAI | {{ :magistraleinformatica:dmi:anomaly_detection.zip |}} | | Recording in Teams Channel |
|33. | 15.12 09-11 | Python Lab on XAI + Paper Presentation | | | |
|34. | 18.12 09-11 | Paper Presentation | | | |
|35. | 20.12 09-11 | Paper Presentation | | | |
|36. | 21.12 09-11 | Paper Presentation | | | |

====== Exams ======

**Project**

A project consists of data analyses based on the use of data mining tools. The project must be carried out by a team of 3 students, using Python. The guidelines require addressing specific tasks. Results must be reported in a single paper of at most 25 pages of text, including figures. Students must deliver both the paper (single column) and well-commented Python notebooks.

  * The first part of the project consists of the **assignments** described here: {{ :magistraleinformatica:dmi:project_description_dm23-pub.pdf | Project Description}}
    - **Dataset: {{ :magistraleinformatica:dmi:gun-data.zip | Dataset Files}}**
    - **Deadline**: the first part must be delivered by November 26th, 2023 (originally November 19th, 2023). Send it by email to: anna.monreale@unipi.it, lorenzo.mannocci@phd.unipi.it
  * The second part of the project consists of the assignment described here: {{ :magistraleinformatica:dmi:project_description_dm23-pub-updated.pdf |Updated Project Description}}
    - **Deadline**: Jan 8, 2024
  * The third part of the project consists of the assignment described here: {{ :magistraleinformatica:dmi:project_description_dm23-pub-complete.pdf |Updated Project Description}}
    - **Deadline**: Jan 8, 2024

**Students who did not deliver the above project by Jan 8, 2024 must request a new project from the teachers by email. The assigned project will require about 20 days of work and, once delivered, will be discussed during the oral exam.**

**Paper Presentation (OPTIONAL)**

Students may present a research paper (made available by the teacher) during the last week of the course. This presentation is OPTIONAL: students who choose to do it can skip the oral exam with open questions on the entire program. They only need to present the project (see next point) and answer open questions on the topics not covered by the project. The paper presentation can be given by the group or by a single person.

**Oral Exam**

  * **Project presentation** (with slides), 10-15 minutes: mandatory for all students, with questions to verify understanding of the details of any part of the project.
  * **Open questions on the entire program**: for students who do not opt for the paper presentation.
  * **Open questions on the topics not covered by the project**: only for students who opt for the paper presentation.
  * Group presentations of the project are preferred. If this is not possible, please contact me to find a solution.

**How to book the exam colloquium?**

At https://esami.unipi.it/ you can find the dates for the exam: one for January and one for February. Each student must register for one of the 2 dates. These are not the dates of the colloquium or of the project delivery, but we will use the list of registered students to organize the exam dates. After the registration deadline we will share a calendar for the oral exam.

====== Previous years ======

  * [[DM-INF 2022-2023]]
  * [[DM-INF 2021-2022]]
  * [[DM-INF 2020-2021]]
  * [[http://didawiki.cli.di.unipi.it/doku.php/dm/dm.2019-20|DM-2019/20]]