Data Mining (309AA) - 9 CFU A.Y. 2021/2022
Instructor:
- Anna Monreale
- KDDLab, Università di Pisa
Teaching Assistant:
- Francesca Naretto
- KDDLab, SNS, Pisa
News
- [28.10.2021] Lecture of Friday 29.10.2021 will be canceled
- [23.09.2021] Please fill in this document: Student-Lists and Project groups. Instructions for the GroupID are available on Teams.
- [06.09.2021] The first lecture of this course will take place on Thursday, 16 Sept 2021.
- [08.09.2021] People who intend to attend the course online should use this link: https://teams.microsoft.com/l/team/19%3aWKvq4kg0XbKZ5pEeiZcarbBXPCYsTvTwMkKZs2PWiHA1%40thread.tacv2/conversations?groupId=aea1385b-6721-4d90-a169-c97f7d066eca&tenantId=c7456b31-a220-47f5-be52-473828670aa1
Learning Goals
- Fundamental concepts of data knowledge and discovery.
- Data understanding
- Data preparation
- Clustering
- Classification
- Pattern Mining and Association Rules
- Outlier Detection
- Time Series Analysis
- Sequential Pattern Mining
- Ethical Issues
Hours and Rooms
Classes
Day of Week | Hour | Room |
---|---|---|
Wednesday | 14:00 - 16:00 | Room C - Online |
Thursday | 14:00 - 16:00 | Room C - Online |
Friday | 09:00 - 11:00 | Room A1 - Online |
Office hours
- Anna Monreale: Wednesday, 11:00-13:00, online using Teams (appointment by email)
- Francesca Naretto: Monday, 15:00-18:00, online using Teams (appointment by email)
Learning Material
Textbook
- Pang-Ning Tan, Michael Steinbach, Vipin Kumar. Introduction to Data Mining. Addison Wesley, ISBN 0-321-32136-7, 2006
- Chapters 4,6 and 8 are also available at the publisher's Web site.
- Berthold, M.R., Borgelt, C., Höppner, F., Klawonn, F. GUIDE TO INTELLIGENT DATA ANALYSIS. Springer Verlag, 1st Edition, 2010. ISBN 978-1-84882-259-7
- Laura Igual et al. Introduction to Data Science: A Python Approach to Concepts, Techniques and Applications. 1st ed. 2017 Edition.
- Jake VanderPlas. Python Data Science Handbook: Essential Tools for Working with Data. 1st Edition.
- For Python Notions: Very basic notions on Python
Slides
- The slides used in the course will be added to the calendar after each class. Most of them are taken from the slides provided by the textbook's authors: Slides for "Introduction to Data Mining".
Software
- Python - Anaconda (version 3.7!): Anaconda is the leading open data science platform powered by Python. Download page (the following libraries are already included)
- Scikit-learn: Python library with tools for data mining and data analysis. Documentation page
- Pandas: pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language. Documentation page
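A quick way to check that the installation works is to run a few lines that use both libraries. The following is only a minimal sketch: the toy data, the column names `x`, `y` and `cluster`, and the choice of k-means are illustrative assumptions, not part of the course material.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Toy data (made up for illustration): two well-separated groups of 2-D points.
df = pd.DataFrame({
    "x": [0.0, 0.1, 0.2, 5.0, 5.1, 5.2],
    "y": [0.0, 0.2, 0.1, 5.0, 5.2, 5.1],
})

# Fit k-means with k=2; random_state fixes the initialisation for reproducibility.
model = KMeans(n_clusters=2, n_init=10, random_state=0)
df["cluster"] = model.fit_predict(df[["x", "y"]])

# Each row now carries the label of the cluster it was assigned to.
print(df)
```

If the script prints a table with a `cluster` column, pandas and scikit-learn are installed and importable.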
Class Calendar (2021/2022)
First Semester
# | Day | Topic | Learning material | References | Video Lectures |
---|---|---|---|---|---|
| | 15.09 14:15-16:00 | Lecture canceled | | | |
1. | 16.09 14:15-16:00 | Overview. Introduction to KDD | 2021-1-overview.pdf 1-intro-dm.pdf | Chap. 1 Kumar Book | Video 1 Video 2 |
2. | 17.09 09:00-10:45 | Data Understanding | Slides DU | Chap. 2 Kumar Book and additional resource of Kumar Book: Exploring Data (this is Chap. 3 in the first edition of the Kumar Book) | Video 1 Video 2 |
3. | 22.09 14:15-16:00 | Data Understanding + Data Preparation | 3-data_preparation.pdf | Chap. 2 Kumar Book | Video |
4. | 23.09 14:15-16:00 | Data Preparation + Data Similarities. | 4-data_similarity.pdf | Data Similarity is in Chap. 2 | |
5. | 24.09 09:00-10:45 | Introduction to Clustering. Center-based clustering: kmeans | 5-basic_cluster_analysis-intro.pdf 6.1-basic_cluster_analysis-kmeans.pdf | Clustering is in Chap. 7 | |
6. | 29.09 14:15-16:00 | Hierarchical clustering | 7.basic_cluster_analysis-hierarchical.pdf | Chap. 7 Kumar Book | |
7. | 30.09 14:15-16:00 | Density based clustering. Clustering validity. Lab. DU | 8.basic_cluster_analysis-dbscan-validity.pdf Notebook DU tips Another Notebook on DU | Chap. 7 Kumar Book | |
8. | 01.10 09:00-10:45 | Python Lab - Clustering | Notebook Clustering Tips | ||
9. | 06.10 14:15-16:00 | Center-based clustering: Bisecting K-means, Xmeans, EM | 6.2-basic_cluster_analysis-kmeans-variants.pdf | Chap. 7 Kumar Book, clusteringmixturemodels.pdf xmeans.pdf | |
10. | 07.10 14:15-16:00 | Classification Problem. Decision Trees | 9.chap3_basic_classification-2020.pdf | Chap. 3 Kumar Book | |
| | 08.10 09:00-10:45 | Lecture canceled | | | |
11. | 13.10 14:15-16:00 | Decision Trees + Classifier Evaluation | same slides of the previous lecture | Chap. 3 Kumar Book | |
12. | 14.10 14:15-16:00 | Evaluation Methods for Classification Models | same slides of the previous lecture | Chap. 3 Kumar Book | |
13. | 15.10 09:00-10:45 | Statistical tool for model evaluation + Rule based classification | 10-rule-based-clussifiers.pdf | Chap. 3 Kumar Book + Chap. 4 Kumar Book | |
14. | 20.10 14:15-16:00 | Rule based classification + Instance-based Classification | 10-knn.pdf | Chap. 4 Kumar Book | |
15. | 21.10 14:15-16:00 | Exercise on DT learning + Naive Bayesian Classifier | 11_2021-naive_bayes.pdf 2021-dt-ex.pdf | Chap. 4 Kumar Book | |
16. | 22.10 09:00-10:45 | SVM & Ensemble Classifiers | 14_svm_2020.pdf 13_ensemble_2020.pdf | Chap. 4 Kumar Book | |
17. | 27.10 14:15-16:00 | Neural Networks | 15_neural_networks_2021.pdf | Chap. 4 Kumar Book | |
18. | 28.10 14:15-16:00 | Python Lab on Classification | adult_classification_2021.ipynb.zip | ||
| | 29.10 09:00-10:45 | Lecture canceled | | | |
19. | 03.11 14:15-16:00 | Python Lab on Classification + Association Rule Mining | classificationpython2.zip 17_association_analysis2021.pdf | Chap.5 Association Rules: Kumar Book | |
20. | 04.11 14:15-16:00 | Association Rule Mining | Chap.5 Association Rules: Kumar Book | ||
21. | 05.11 09:00-10:45 | FP-Growth - Sequential Pattern Mining | 17_2021-fp-growth.pdf | Chap.6 Kumar Book | |
22. | 10.11 14:15-16:00 | Sequential Pattern Mining | 18_sequential_patterns_2021.pdf | Chap.7 Kumar Book | |
23. | 11.11 14:15-16:00 | Time Series Similarities, Transformations & Clustering | 22_time_series_similarity_2021.pdf | Overview on DM for time series | |
24. | 12.11 09:00-10:45 | Motif & Shapelet Discovery | 23_time_series_shapelets-motif-2021.pdf | matrixprofile.pdf shaplet.pdf | |
25. | 17.11 14:15-16:00 | Lab: Association Rules & Sequential pattern mining by Python | arm-spm.zip | ||
26. | 18.11 14:15-16:00 | Ethics & Privacy | 19_ethics_privacy2021.pdf | Overview on Privacy allegato11-cpdp13.pdf Privacy by design | |
27. | 19.11 09:00-10:45 | Lab: Time series | timeseries-py.zip | ||
28. | 24.11 14:15-16:00 | Explainability | 20_explainability_2021.pdf | Material: LORE LIME Survey ABELE SHAP LASTS | |
29. | 25.11 14:15-16:00 | Explainability + LAB XAI | xai-lab.zip | ||
30. | 26.11 09:00-10:45 | LAB XAI + Anomaly Detection | AD&OD | ||
31. | 01.12 14:15-16:00 | Anomaly Detection + Lab | ADPY | ||
32. | 02.12 14:15-16:00 | CRISP-DM | crisp-dm.pdf | ||
| | 03.12 09:00-10:45 | Lecture canceled | | | |
33. | 15.12 14:15-16:00 Room C | Paper Presentation | |||
34. | 16.12 14:15-16:00 Room C | Paper Presentation | |||
35. | 17.12 09:00-12:45 Room C | Paper Presentation | |||
Exams
Mid-term Project
The project consists of data analyses carried out with data mining tools. It has to be done by a team of 2-3 students, using Python. The guidelines specify the tasks to address. Results must be reported in a single paper of at most 25 pages of text, including figures. Students must deliver both the paper (single column) and well-commented Python notebooks.
- The first part of the project consists of the assignments described here: Project Description
- Dataset: Dataset
- Deadline: the first part has to be delivered by 15th November 2021 (originally 5th November 2021). Send an email to: anna.monreale@unipi.it and francesca.naretto@sns.it
- The second part of the project consists of the assignment Task 3 described here: Updated Project Description
- Deadline: 5th January 2022
- The third part of the project consists of the assignment Task 4 described here: Updated Project Description
- Note that the document also contains the rules for the delivery and the final exam!
- Data for time series analysis: CityTemp
- Deadline: 5th January 2022
Students who did not deliver the above project by 5th Jan 2022 need to ask the teachers by email for a new project. The assigned project will require about 2 weeks of work and, after delivery, will be discussed during the oral exam.
Paper Presentation (OPTIONAL)
Students present a research paper (made available by the teacher) during the last week of the course. This presentation is OPTIONAL: students who decide to do the paper presentation can skip the oral exam with open questions and only need to present the project (see next point). The paper presentation can be done by the group or by a single person.
Oral Exam
- Project presentation (with slides) – 10 minutes: mandatory for all the students
- Open questions on the entire program: optional only for students who opt for the paper presentation.
Reading About the "Data Scientist" Job
… a new kind of professional has emerged, the data scientist, who combines the skills of software programmer, statistician and storyteller/artist to extract the nuggets of gold hidden under mountains of data. Hal Varian, Google’s chief economist, predicts that the job of statistician will become the “sexiest” around. Data, he explains, are widely available; what is scarce is the ability to extract wisdom from them.
Data, data everywhere. The Economist, Special Report on Big Data, Feb. 2010.
- Data, data everywhere. The Economist, Feb. 2010 download
- Data scientist: The hot new gig in tech, CNN & Fortune, Sept. 2011 link
- Welcome to the yotta world. The Economist, Sept. 2011 download
- Data Scientist: The Sexiest Job of the 21st Century. Harvard Business Review, Sept 2012 link
- Il futuro è già scritto in Big Data. Il Sole 24 Ore, Sept 2012 link
- Special issue of Crossroads - The ACM Magazine for Students - on Big Data Analytics download
- Peter Sondergaard, Gartner, Says Big Data Creates Big Jobs. Oct 22, 2012: YouTube video
- Towards Effective Decision-Making Through Data Visualization: Six World-Class Enterprises Show The Way. White paper at FusionCharts.com. download