====== Distributed Data Analysis and Mining (DDAM) ======

{{ :mds:ddam:1920px-apache_spark_logo.svg.png?nolink&200|}}


**Docente - Teacher**: **Patrizio Dazzi** \\ [[patrizio.dazzi@isti.cnr.it]] \\ \\
**Pagina sul sito del Dipartimento - Official webpage of the course**: //[[https://esami.unipi.it/esami2/programma.php?c=44024&aa=2019&cid=341&did=13|Distributed Data Analysis and Mining]]// \\ \\
**Corso di Laurea - Graduate Course**: \\ //[[https://www.di.unipi.it/it/didattica/wds-lm|DATA SCIENCE AND BUSINESS INFORMATICS]]//


===== Comunicazioni - News =====


===== Orario Lezioni - Lessons Schedule =====
^ Giorno - Day ^ Ora - Time ^ Luogo - Place ^
|Mercoledì - Wednesday|11:00 - 13:00| Aula X1, Polo Fibonacci|
|Giovedì - Thursday|09:00 - 11:00| Aula X2, Polo Fibonacci|

===== Ricevimento Studenti - Question time =====

== Italiano ==
Come indicato in tabella, l'ufficio si trova all'interno dell'**Area della Ricerca di Pisa**. Per raggiungere l'area fate riferimento alla seguente {{:informaticaumanistica:bpw:raggiungerearea.png?linkonly|mappa}}. Una volta arrivati nell'area, per raggiungere il mio ufficio dovete passare o dall'ingresso 19 o dal 20, come mostrato [[http://hpc.isti.cnr.it/wp-content/uploads/2014/09/cnr.png|qui]]. \\ **NOTA BENE:** a meno di eventi eccezionali le porte dei due ingressi sono chiuse. Per poter accedere avete due possibilità:
  - Lasciate un vostro documento alla guardiania chiedendo un badge visitatore (soluzione da preferire);
  - Entrate e mi telefonate (solo nel caso in cui la guardiania avesse terminato i badge);

== English ==
As reported in the table below, the teacher's office is inside the **Pisa Research Campus**. To reach the campus, please refer to this {{:informaticaumanistica:bpw:raggiungerearea.png?linkonly|map}}. Once there, to reach my office you should use either entrance 19 or entrance 20, as shown [[http://hpc.isti.cnr.it/wp-content/uploads/2014/09/cnr.png|here]]. \\ **NOTE:** the doors of both entrances are normally closed. To get into the building you have two options:
  - Request a visitor badge by leaving an ID document with the officer at the entrance (preferred option);
  - Enter and call me (only if the officer has run out of visitor badges).

^ Giorno - Day ^ Ora - Time^ Luogo - Place ^
|Lunedì - Monday|10:00 - 12:00|Ufficio C-36, ISTI-CNR, Area della Ricerca di Pisa \\ Office C-36, ISTI-CNR, Pisa Research Campus|

===== Scopo del Corso - Aim of the course =====

== Italiano: ==
Il Data Mining sui Big data è oggi un’area di ricerca molto attiva. L'applicazione delle attuali metodologie analitiche e strumenti software su un singolo personal computer non può gestire in modo efficiente dataset di grandi dimensioni. Le piattaforme di calcolo distribuito sono una soluzione scalabile per il big data mining, attraverso la scomposizione del problema in operazioni più piccole che possono essere eseguite parallelamente su singoli processori / macchine. Il corso propone l’insegnamento di concetti base del paradigma di calcolo distribuito tramite MapReduce dal punto di vista teorico e pratico, in particolare ci si focalizzerà su Hadoop per lo sviluppo di competenze nell'uso di strumenti di calcolo ad alte prestazioni per il data engineering, l’analisi di dati e l’utilizzo di tecniche di data mining. Gli studenti impareranno come i classici algoritmi di data mining possono essere applicati sui Big Data usando Hadoop (Spark). Set di dati reali (e open source) verranno utilizzati per presentare esempi e per consentire agli studenti di costruire i propri progetti. Una metà delle lezioni consisterà in esercitazioni (laboratorio) e una metà delle lezioni sarà teorica.

== English: ==
Big data mining has become an active research area. Current analytical methodologies and software tools, when run on a single personal computer, cannot deal efficiently with very large datasets. Distributed computing platforms offer a scalable solution for big data mining: a large problem is divided into smaller ones that are solved concurrently on many processors or machines. This course aims at teaching the basic theoretical concepts behind the MapReduce distributed computing paradigm, and Hadoop in particular, and at building expertise in the practical usage of high-performance computing tools for data engineering, analysis and mining. In particular, students will learn how classical data mining algorithms can be applied to Big Data using Hadoop (Spark). Real (and open source) datasets will be used to present examples and to let the students build their own projects. Half of the lessons will consist of practice (Lab) and half of theory.
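
To make the idea of splitting a problem into smaller ones concrete, here is a minimal sketch of a MapReduce-style word count in plain Python. It is only an illustration, not course material: the function names (''map_phase'', ''shuffle'', ''reduce_phase'', ''word_count'') and the sample data are hypothetical, and everything runs sequentially on one machine. On a platform such as Hadoop or Spark the map calls would run in parallel on different nodes, and the shuffle would move data across the network.

```python
from collections import defaultdict
from itertools import chain


def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: sum the values collected for each key."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(chunks):
    """Run map on every chunk independently, then shuffle and reduce."""
    mapped = chain.from_iterable(map_phase(chunk) for chunk in chunks)
    return reduce_phase(shuffle(mapped))


# Each string stands for a partition of a larger dataset (made-up data).
chunks = ["big data mining", "mining big datasets", "data data"]
counts = word_count(chunks)
print(counts)
```

Because each call to ''map_phase'' depends only on its own chunk, the chunks can be processed independently, which is what makes the approach scale across machines.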


=== Syllabus: ===


  * Motivations: what Distributed Data Mining is and why it is needed in a Big Data scenario
  * Review of parallel and distributed computing notions
  * Amdahl's law, differences between shared and distributed memory architectures
  * Introduction to Hadoop
  * Hadoop Ecosystem
  * Interacting with HDFS
  * Hadoop Combiners
  * Basic Spark and RDD
  * Map-Reduce Programming Patterns
  * Review of Python programming
  * Data Analysis with Spark 
  * Data Mining and Machine Learning with Spark 
  * SparkSQL 
  * Example of how to prepare a project
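
As a small illustration of the Amdahl's law item in the syllabus: if a fraction p of a program can be parallelised over n workers, the theoretical speedup is 1 / ((1 - p) + p / n). The sketch below simply evaluates that formula; the function name ''amdahl_speedup'' and the chosen values are assumptions made for this example only.

```python
def amdahl_speedup(parallel_fraction, workers):
    """Theoretical speedup when only part of a program can be parallelised."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / float(workers))


# With 90% of the work parallelisable, adding workers helps less and less.
for n in (1, 10, 100, 1000):
    print(n, round(amdahl_speedup(0.9, n), 2))
```

With p = 0.9 the speedup can never exceed 1 / (1 - p) = 10, no matter how many workers are added, because the serial 10% of the program always has to be executed.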


===== Registro delle Lezioni - Lessons log (in English) =====

^ ^ Giorno - Day ^ Data - Date ^ Argomento - Topic ^ Lucidi - Slides ^
| 1.    |  Mercoledì - Wednesday    | 18.09 | General presentation of the course. Introduction to the issues related to big data processing. Advantages of exploiting parallel and distributed computing approaches. | {{ :mds:ddam:ddam20192020:1-_introduction_bigdata.pdf |Introduction to BigData}} |
| 2.   	|  Giovedì - Thursday     | 19.09 | Introduction to parallel computing. Flynn's taxonomy. Amdahl's law | {{ :mds:ddam:ddam20192020:2-_introduction_to_parallel_computing.pdf |Introduction to Parallel Computing [slides 1 - 33]}} |
| 3.   	|  Mercoledì - Wednesday      | 25.09 | Classroom exercises on Amdahl's law (1 hour). Concepts of parallel computing in shared-memory architectures, caches, cache-coherence | {{ :mds:ddam:ddam20192020:amdahl_exercises.pdf | Amdahl's exercises}} {{ :mds:ddam:ddam20192020:2-_introduction_to_parallel_computing.pdf | Introduction to Parallel Computing [slides 34 - 40]}}|
| 4.   	|  Giovedì - Thursday     | 26.09 | Distributed and hybrid memory architectures. Concepts on parallel programming design principles. | {{ :mds:ddam:ddam20192020:2-_introduction_to_parallel_computing.pdf | Introduction to Parallel Computing [slides 41 - 58]}} |
| 5.    |  Mercoledì - Wednesday      | 02.10 | Introduction to Hadoop. Concepts on Hadoop stacked structure and HDFS. | {{ :mds:ddam:ddam20192020:3-_introduction_to_hadoop.pdf | Introduction to Hadoop }} |
| 6.    |  Giovedì - Thursday      | 03.10 | HDFS: overall description and examples. | {{ :mds:ddam:ddam20192020:3-_introduction_to_hadoop.pdf | Introduction to Hadoop }} |
| 7.    |  Mercoledì - Wednesday      | 09.10 | Combiners to reduce network bandwidth requirements. Introduction to Apache Spark, RDD concept. | {{ :mds:ddam:ddam20192020:4-_hadoop_combiners.pdf | Hadoop Combiners}} {{ :mds:ddam:ddam20192020:5-_introduction_spark.pdf | Introduction to Spark}}|
| 8.    |  Giovedì - Thursday      | 17.10 | HOWTO: Install a Spark cluster on your laptop. | {{ :mds:ddam:ddam20192020:6-_sparkclusterlaptop.pdf | Spark cluster on a Laptop with Docker}} |
| 9.    |  Mercoledì - Wednesday      | 23.10 | Troubleshooting on Spark and Docker on students' laptops | |
| 10.    |  Giovedì - Thursday      | 24.10 | MapReduce Patterns | {{ :mds:ddam:ddam20192020:7-_mapreduce_patterns.pdf | MapReduce patterns }} |
| 11.    |  Mercoledì - Wednesday      | 30.10 | Recall Python basics | {{ :mds:ddam:ddam20192020:1._python_basics.ipynb.zip | Python basics (Jupyter Notebook)}} |
| 12.    |  Giovedì - Thursday      | 31.10 | Python Pandas | {{ :mds:ddam:ddam20192020:2._data_manipulation_in_pandas.ipynb.zip | Python Pandas (Jupyter Notebook)}} |
| 13.    |  Mercoledì - Wednesday      | 06.11 | Data Analysis with Spark | {{ :mds:ddam:ddam20192020:8-_data_analysis_with_spark.pdf | Data Analysis with Spark}} |
| 14.    |  Giovedì - Thursday      | 07.11 | SparkSQL and Spark MLlib | {{ :mds:ddam:ddam20192020:9-_sparksql.pdf | SparkSQL}} {{ :mds:ddam:ddam20192020:10-_spark_mllib.pdf | Spark MLlib}} |
| 15.    |  Mercoledì - Wednesday      | 13.11 | Summary and recap: lessons learned + definition of groups for the examination phase |  |
| 16.    |  Giovedì - Thursday      | 14.11 | Spark Streaming | {{ :mds:ddam:ddam20192020:11-_spark_streaming.pdf | Spark Streaming}} |
| 17.    |  Mercoledì - Wednesday      | 20.11 | From Streaming to Continuous Applications | {{ :mds:ddam:ddam20192020:12-_spark_streaming_-_continous_applications.pdf | Continuous applications }} [[https://databricks.com/blog/2016/07/28/structured-streaming-in-apache-spark.html| Structured Streaming in Apache Spark]] |
| 18.    |  Giovedì - Thursday      | 21.11 | Expected structure of the project, presentation and report | {{ :mds:ddam:13-_project_skeletons.pdf |Project Skeletons}} |
| 19.    |  Mercoledì - Wednesday      | 27.11 | Students present their datasets and describe their project proposals |  |
| 20.    |  Giovedì - Thursday      | 28.11 | Project Development session |  |

===== Metodo di valutazione - Examination process =====
Groups of 2 or 3 students (max) develop a project (report + short slide presentation); \\
Every student takes an individual multiple-choice test. \\
The final grade results from a combination of the project mark (70% of the final grade) and the individual test mark (30%).

==== Laboratory ====
Students should bring their own laptops. Access to a remote server with pre-installed software will be provided.

===== Software & Links =====
  * Python website: http://www.python.it/download/ (install version 2.x; do not install 3.x). Instructions: [[https://goo.gl/yBRjkG]]
  * Installing Hadoop single node on your machine (without VM): https://goo.gl/KGME9t (Linux/OS) https://goo.gl/7Bkcnr (Win)
  * Spark: http://spark.apache.org/downloads.html (can be installed without Hadoop)


====== Edizioni Precedenti - Previous Editions ======
^ Pagina dedicata - Dedicated page ^ Anno accademico di riferimento - Academic year ^
| [[mds:ddam:ddam20182019|Distributed Data Analysis and Mining]] | 2018-2019 |