Data Mining (309AA) - 9 CFU A.Y. 2022/2023

Instructor:

Teaching Assistant:

News

  • [28.10.2022] The lectures on 16 and 17 November are canceled.
  • [09.09.2022] The lectures will be held in person only and will NOT be live-streamed, but recordings of the lectures, or of the previous years' lectures, will be made available here for non-attending students.

Learning Goals

  • Fundamental concepts of knowledge discovery in data
  • Data understanding
  • Data preparation
  • Clustering
  • Classification
  • Pattern Mining and Association Rules
  • Outlier Detection
  • Time Series Analysis
  • Sequential Pattern Mining
  • Ethical Issues

Hours and Rooms

Classes

Day of Week | Hour | Room
Wednesday | 09:00 - 11:00 | Room E
Thursday | 11:00 - 13:00 | Room C
Friday | 09:00 - 11:00 | Room C

Office hours: Anna Monreale: Tuesday 11:00-13:00, online via Teams or at the Department of Computer Science, room 374/E (please request an appointment by email). Francesca Naretto: TBD

A Teams channel will be used ONLY to post news, Q&A, and other course-related material. The lectures will be held in person only and will NOT be live-streamed, but recordings of the lectures, or of the previous years' lectures, will be made available here for non-attending students.

Learning Material

Textbook

Slides

Software

  • Python - Anaconda (version 3.7 or later): Anaconda is an open data science platform powered by Python. Download page (the following libraries are already included)
  • Scikit-learn: a Python library with tools for data mining and data analysis. Documentation page
  • Pandas: an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language. Documentation page
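
To confirm that a local installation meets the requirements listed above, a short check like the following may help. It is only a sketch and not part of the official course material: the function name `check_environment` and the constants are hypothetical, and it assumes the libraries are importable under their usual names (`sklearn` for Scikit-learn, `pandas` for Pandas).

```python
import sys
from importlib import import_module

MIN_PYTHON = (3, 7)  # minimum Python version stated in the Software list
REQUIRED_LIBRARIES = ["sklearn", "pandas"]  # import names of Scikit-learn and Pandas


def check_environment(libraries=REQUIRED_LIBRARIES, min_python=MIN_PYTHON):
    """Map each requirement to its installed version, or None if missing or too old."""
    report = {}

    # Python itself: report the version only if it satisfies the minimum.
    current = sys.version_info[:3]
    report["python"] = ".".join(map(str, current)) if current >= min_python else None

    # Each library: try to import it and read its version string if it has one.
    for name in libraries:
        try:
            module = import_module(name)
            report[name] = getattr(module, "__version__", "unknown")
        except ImportError:
            report[name] = None
    return report


if __name__ == "__main__":
    for requirement, version in check_environment().items():
        status = version if version is not None else "MISSING or too old"
        print(f"{requirement}: {status}")
```

Running the script inside the Anaconda environment prints one line per requirement. Any line reporting "MISSING or too old" indicates that the corresponding package should be installed or updated before the Python labs.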

Class Calendar (2022/2023)

First Semester

Day | Topic | Learning material | References | Video Lectures
1. 15.09 11:00-13:00 | Overview. Introduction to KDD | 1-overview.pdf, 1-intro-dm.pdf | Chap. 1 Kumar Book | Video 1: Course Overview; Video 2: Introduction DM (the recording of the Introduction had some audio issues, so the corresponding part of the A.Y. 2021/22 lecture is published instead)
2. 16.09 09:00-11:00 | Data Understanding | 2-data_understanding.pdf | Chap. 2 Kumar Book and additional resource of Kumar Book: Exploring Data. If you have the first ed. of Kumar, this is Chap. 3 | Video 1: Data Understanding - Part 1; Video 2: Data Understanding - Part 2
3. 21.09 09:00-11:00 | Data Understanding & Data Preparation | 3-data_preparation.pdf | Chap. 2 Kumar Book and additional resource of Kumar Book: Exploring Data. If you have the first ed. of Kumar, this is Chap. 3 | Video: Data Understanding & Data Preparation
4. 22.09 11:00-13:00 | Data Preparation + Data Similarities | 4-data_similarity.pdf | Data Similarity is in Chap. 2 | Video 1: Data Preparation + Data Similarities - Part 1; Video 2: Data Preparation + Data Similarities - Part 2
5. 23.09 09:00-11:00 | Introduction to Clustering. Center-based clustering: K-means | 5-basic_cluster_analysis-intro.pdf, 6.1-basic_cluster_analysis-kmeans.pdf | Clustering is in Chap. 7 | Video 1: Introduction to Clustering + K-means - Part 1; Video 2: Introduction to Clustering + K-means - Part 2
6. 28.09 09:00-11:00 | Python Lab: Data Understanding & Data Preparation | Notebook DU tips | | Video 1: Python Lab: DU - Part 1; Video 2: Python Lab: DU - Part 2
7. 29.09 11:00-13:00 | Hierarchical clustering | 7.basic_cluster_analysis-hierarchical.pdf | | Video: Project Description + Hierarchical Clustering
30.09 09:00-11:00 | Lecture Canceled | | |
8. 05.10 09:00-11:00 | Density based clustering. Clustering validity. | 8.basic_cluster_analysis-dbscan-validity.pdf | Chap. 7 Kumar Book |
9. 06.10 11:00-13:00 | Center-based clustering: Bisecting K-means, Xmeans, EM | 6.2-basic_cluster_analysis-kmeans-variants.pdf | Chap. 7 Kumar Book, clusteringmixturemodels.pdf, xmeans.pdf | Video 1: Center-based clustering - Bisecting K-means, Xmeans, EM; Video 2: Clustering Lab.
10. 07.10 09:00-11:00 | Python Lab - Clustering | Notebook Clustering Tips | | Video: Clustering Lab. - Part 2
11. 12.10 09:00-11:00 | Classification Problem. Decision Trees | 9.chap3_basic_classification-2022.pdf | Chap. 3 Kumar Book | Video Lecture - Part 1; Video Lecture - Part 2
12. 13.10 11:00-13:00 | Decision Trees & Classifier Evaluation | same slides as the previous lecture | Chap. 3 Kumar Book | Video Lecture - Part 1; Video Lecture - Part 2
13. 14.10 09:00-11:00 | Classifier Evaluation | same slides as the previous lecture | Chap. 3 Kumar Book | Video Lecture
14. 19.10 09:00-11:00 | Rule based Classifiers | 10-rule-based-clussifiers-2022.pdf, 10-knn-2022.pdf | Rule based classifiers: Chap. 5.1, KNN: Chap. 4.2 - Kumar Book | Video 1: Rule based classifiers; Video 2: KNN
15. 20.10 11:00-13:00 | DT - simulation of the learning algorithm | DT Exercise | | Video 1: DT-EX; Video 2: DT-EX
16. 21.10 09:00-11:00 | Naive Bayesian Classifier. SVM. Ensemble Classifiers | 11_2022-naive_bayes.pdf, 14_svm_2022.pdf, 13_ensemble_2022.pdf | Chap. 4 - Kumar Book | Video 1; Video 2
17. 26.10 09:00-11:00 | Ensemble Classifiers + NN Classifiers + Project Discussion | same slides as the previous lecture | Chap. 4 - Kumar Book | Video 1
18. 27.10 11:00-13:00 | NN Classifiers + Python Lab: Classification | 15_neural_networks_2021.pdf, Classification Notebook, Adult Dataset | | Video 1; Video 2
19. 28.10 09:00-11:00 | Python Lab: Classification | Classification Notebook (same as previous lecture) | | Video
20. 02.11 09:00-11:00 | Python Lab: NN & Imbalanced Classification | | | Unfortunately, the video is not available due to technical issues
21. 03.11 11:00-13:00 | Association Rule Mining | 17_association_analysis2021.pdf | Chap. 5 - Kumar Book | Video
22. 04.11 09:00-11:00 | FP-Growth - Sequential Pattern Mining | 17_2021-fp-growth.pdf, 18_sequential_patterns_2021.pdf | Chap. 5 & Chap. 6 - Kumar Book | Video 1; Video 2
23. 09.11 09:00-11:00 | Sequential Pattern Mining. Intro to Time Series | Slides on SPM (see previous lecture) | | Video 1; Video 2
24. 10.11 11:00-13:00 | Time Series Similarities | 22_time_series_similarity_2022.pdf | Overview on Time Series | Video
25. 11.11 09:00-11:00 | Time Series Transformations - Clustering - Classification | Slides on transformations (previous lecture), 23_time_series_motif-2022_2.pdf | | Video
26. 18.11 09:00-11:00 | Shapelets & Motif. Lab: Association Rules | Slides on shapelets & motif (previous lecture), arm-spm.zip | matrixprofile.pdf, Papers on Matrix Profile, shaplet.pdf | Video 1: Shapelets & Motif; Video 2: Lab ARM
27. 23.11 09:00-11:00 | Python: Sequential Pattern Mining & Time Series | For SPM see notebooks of previous lecture. timeseries-py.zip | | Video
28. 24.11 11:00-13:00 | Python: Time Series. Ethics & Privacy | 19_ethics_privacy2021.pdf | | Video 1; Video 2
29. 25.11 09:00-11:00 | Privacy | same slides as the last lecture | | Video
30. 30.11 09:00-11:00 | Explainability | 20_explainability_2021.pdf | | Video
31. 01.12 11:00-13:00 | Anomaly Detection + Python: XAI | | | Video
32. 02.12 09:00-11:00 | Python: XAI + AD | | | Video
33. 07.12 09:00-11:00 | Paper Presentation | | |
34. 09.12 09:00-11:00 | Paper Presentation | | |
35. 14.12 09:00-11:00 | Paper Presentation | | |
36. 15.12 11:00-13:00 | Paper Presentation | | |

Exams

Project

The project consists of data analyses carried out with data mining tools. It must be done in Python by a team of 3 students. The guidelines specify the tasks to address. Results must be reported in a single paper of at most 25 pages, including figures. Students must deliver both the paper (single column) and well-commented Python notebooks.

  1. Dataset: Twitter Data
  2. Deadline: the first part has to be delivered by November 12, 2022 (previously November 5, 2022). Send an email to: anna.monreale@unipi.it, francesca.naretto@sns.it, lorenzo.mannocci@phd.unipi.it
  1. Note that the document also contains the rules for the delivery and the final exam!
  2. Deadline: Jan 8, 2023

Students who did not deliver the above project by Jan 8, 2023 must request a new project from the teachers by email. The assigned project will require about 2 weeks of work and, after delivery, will be discussed during the oral exam.

Paper Presentation (OPTIONAL)

Students may present a research paper (made available by the teacher) during the last week of the course. This presentation is OPTIONAL: students who choose to give the paper presentation can skip the oral exam with open questions and only need to present the project (see next point). The paper presentation can be given by the group or by a single person.

Oral Exam

  • Project presentation (with slides) – 10-15 minutes: mandatory for all the students
  • Open questions on the entire program: required, except for students who opted for the paper presentation.

Previous years

magistraleinformatica/dmi/start.txt · Last modified: 01/12/2022 at 07:47 (19 hours ago) by Anna Monreale