Data Mining and Machine Learning -- Master MAINS 2019

Fosca Giannotti
ISTI-CNR, Knowledge Discovery and Data Mining Lab
fosca [dot] giannotti [at] isti [dot] cnr [dot] it

Dino Pedreschi
Università di Pisa, Knowledge Discovery and Data Mining Lab
dino [dot] pedreschi [at] unipi [dot] it

Teaching Assistants: Riccardo Guidotti & Giulio Rossetti
ISTI-CNR, Knowledge Discovery and Data Mining Lab
riccardo [dot] guidotti [at] isti [dot] cnr [dot] it, giulio [dot] rossetti [at] isti [dot] cnr [dot] it

News

Wednesday 10 April 2019: install KNIME (http://www.knime.org).

Goals

Organizations and businesses are overwhelmed by the flood of data continuously collected into their data warehouses and arriving from external sources, the Web above all. Traditional exploratory techniques may fail to make sense of this data because of its inherent complexity and size. Data mining and knowledge discovery techniques emerged as an alternative approach, aimed at revealing patterns, rules and models hidden in the data and at helping analysts develop descriptive and predictive models for a range of business problems. This short course focuses on the main application scenarios of data mining for challenging problems in the broad domain of Customer Relationship Management (CRM).

Syllabus

Clustering models for customer segmentation. Discussion of real cases. Hands-on project: segmentation of a base of anonymized customers from the retail industry. Clustering models for competitive intelligence.
Patterns and association rule mining for market basket analysis. Hands-on project: mining association rules from sales data of the retail industry.
Prediction models for promotion performance and churn analysis. Discussion of real cases. Hands-on project: churn prediction from a base of anonymized customers from the retail industry.
Analysis of human mobility patterns by mobility data mining from big data. Mining official data for understanding of human behavior.
Social network analysis for understanding diffusion phenomena. Viral marketing.
Application of data mining to geo-marketing. Analysis of innovators. Predictive models for fraud detection.

Textbooks

Slides (see Calendar).

Gordon S. Linoff and Michael J. Berry. Data Mining Techniques: For Marketing, Sales, and Customer Relationship Management. Wiley, 2011.

Pang-Ning Tan, Michael Steinbach, Vipin Kumar. Introduction to Data Mining. Addison Wesley, ISBN 0-321-32136-7, 2006
- http://www-users.cs.umn.edu/~kumar/dmbook/index.php
- Chapters 4, 6 and 8 are also available at the publisher's Web site.

Reading about the "data analyst" job

Data, data everywhere. The Economist, Feb. 2010.
Data scientist: The hot new gig in tech. CNN & Fortune, Sept. 2011.
Welcome to the yotta world. The Economist, Sept. 2011.

Calendar

	Date	Topic	Learning material	Instructor
01.	Thu 11.04.2019 - 09:00-13:00	Introduction to data mining and big data analytics	slides: intro slides: case studies	Giannotti
02.	Thu 11.04.2019 - 14:00-18:00	Data understanding; data preparation; KNIME tutorial	slides slides data understanding Tutorial Knime 01_titanic_data_understanding	Pedreschi, Guidotti
03.	Fri 12.04.2019 - 09:00-13:00	Clustering analysis & customer segmentation	slides clustering slides customer segmentation	Pedreschi
04.	Fri 12.04.2019 - 14:00-18:00	Clustering analysis: exercises with KNIME	02_titanic_clustering	Pedreschi, Guidotti
05.	Mon 15.04.2019 - 09:00-13:00	Classification & prediction	slides classification Visual Introduction to Classification with Decision Trees	Pedreschi
06.	Mon 15.04.2019 - 14:00-18:00	Classification & prediction: exercises with KNIME	05_titanic_classification	Pedreschi, Rossetti
07.	Tue 16.04.2019 - 09:00-13:00	More on classification: from decision trees to deep learning	Evaluation of classifiers KNN & Naive Bayes Neural Networks & SVM Ensemble methods & Wisdom of the crowd	Pedreschi
08.	Tue 16.04.2019 - 14:00-18:00	Classification & prediction: exercises with KNIME. Project work		Giannotti, Rossetti
09.	Wed 17.04.2019 - 09:00-13:00	Pattern and association rule mining & market basket analysis	5.dm-ml_patternmining-2018.pdf	Giannotti
10.	Wed 17.04.2019 - 14:00-18:00	Pattern and association rule mining: exercises with KNIME	03_titanic_pattern 04_coop_pattern	Giannotti, Rossetti
11.	Thu 18.04.2019 - 09:00-13:00	Case studies. Prediction models for promotion performance and churn analysis	5.dml-ml-exemplarproject-churn-fraude-.pdf 5.dm_ml_exemplarprojects-shoppingbehaviour_innovators.pdf	Giannotti
12.	Thu 18.04.2019 - 14:00-18:00	Hints on data science with Python. Data science privacy & ethics.	5.dml-ml-privacy_etica-.pdf	Giannotti, Rossetti

Datasets

0. Iris. (for details see https://archive.ics.uci.edu/ml/datasets/iris)

1. Titanic. (for details see https://www.kaggle.com/c/titanic)

2. Human Resources. (for details see https://www.kaggle.com/ludobenistant/hr-analytics)

3. Telco Churn. (for details see http://didawiki.di.unipi.it/doku.php/dm/mains.santanna.dm4crm.2016)

4. Adult. (for details see https://archive.ics.uci.edu/ml/datasets/Adult)

5. Credit Card (for details see https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients)

Exercises

Guidelines:

Each group (2-3 people) must deliver a report (max 20 pages, including all figures) that describes the methods adopted and discusses the most interesting results obtained for the tasks listed below. Assume that the report is addressed to a marketing strategist who wants to learn the story that emerges from the various data mining analyses and to receive suggestions on the actions to take as a consequence.

1. Data Understanding: a preliminary step to capture the basic properties of the data. It covers distribution analysis, statistical exploration, correlation analysis, suitable transformation of variables, elimination of redundant variables, and management of missing values.

2. Pattern Mining Analysis. Problem: prepare the data and extract interesting association rules and frequent patterns. The report should discuss the parameters used for the analyses and justify which rules are the most interesting according to the different measures introduced in the course.

3. Customer Segmentation. Problem: find a high-quality clustering using clustering algorithms and discuss the profile of each cluster found (in terms of the properties that describe the customers in that cluster). The report should illustrate the clustering methodology adopted and the interpretation of the clusters. In particular, for k-means it is necessary to discuss how the best value of k was identified and to characterise the resulting clusters both by analysing the k centroids and by comparing the statistics of the variables within each cluster with those of the whole dataset.

4. Classification Analysis. Problem: find a high-quality decision tree for predicting a feature of a customer. The report should illustrate the classification methodology adopted and the validation and interpretation of the decision tree, and describe the process used to select the proposed tree, together with an evaluation of its quality.
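
The quality evaluation asked for in task 4 can be made concrete with a confusion matrix computed on held-out data. The following is only a minimal sketch in Python, not the course's prescribed procedure: it assumes a binary target and that accuracy, precision and recall are among the measures of interest (the course slides may introduce others). The function name evaluate_binary and the labels are hypothetical, made up for illustration.

```python
from collections import Counter


def evaluate_binary(y_true, y_pred, positive=1):
    """Return confusion counts and basic quality measures for a binary classifier.

    Hypothetical helper for illustration; y_true and y_pred are equal-length
    sequences of class labels, and `positive` is the label of the class of interest.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    # Count the four cells of the confusion matrix.
    counts = Counter()
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            counts["tp" if truth == positive else "fp"] += 1
        else:
            counts["fn" if truth == positive else "tn"] += 1

    tp, fp, fn, tn = (counts[key] for key in ("tp", "fp", "fn", "tn"))
    total = tp + fp + fn + tn

    # Guard against empty denominators (e.g. no positive predictions at all).
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0

    return {
        "tp": tp, "fp": fp, "fn": fn, "tn": tn,
        "accuracy": accuracy, "precision": precision, "recall": recall,
    }


# Made-up held-out labels (1 = churner, 0 = loyal customer) and the
# predictions of some candidate decision tree on the same customers.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

report = evaluate_binary(y_true, y_pred)
print(report)
```

Computing the same measures for each candidate tree on the same held-out set gives one simple way to document why a particular tree was selected.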

Deadline: send the report by email to all instructors by 22 June 2019. Specify [MAINS] in the subject of the email.

Exams

The exam consists of an evaluation of the report on the proposed data mining exercises.