====== Statistical Methods for Data Science A.Y. 2020/21 ======

**This course is discontinued. Starting from A.Y. 2021/22, it has been replaced by a 9 ECTS version:**
  * **[[mds:sds: |Statistics for Data Science (628PP)]]**

=====Instructor=====

  * **Salvatore Ruggieri**
    * Università di Pisa
    * [[http://pages.di.unipi.it/ruggieri/]]
    * [[salvatore.ruggieri@unipi.it]]
  * **Office hours**
    * Tuesday h 14:00 - 17:00, Department of Computer Science, room 321/DO.
    * **Office hours by appointment only, via Teams/Skype. Skype contact: salvatore.ruggieri**

=====Classes=====

^ Day of Week ^ Hour ^ Room ^
| Tuesday | 16:00 - 18:00 | [[https://teams.microsoft.com/l/channel/19%3afdc694d4d7044c5eb7c81396dea7e64b%40thread.tacv2/General?groupId=a083eaab-5584-4177-b197-e4fb9637642f&tenantId=c7456b31-a220-47f5-be52-473828670aa1|Teams Virtual Room]] |
| Wednesday | 9:00 - 11:00 | [[https://teams.microsoft.com/l/channel/19%3afdc694d4d7044c5eb7c81396dea7e64b%40thread.tacv2/General?groupId=a083eaab-5584-4177-b197-e4fb9637642f&tenantId=c7456b31-a220-47f5-be52-473828670aa1|Teams Virtual Room]] |

=====Pre-requisites=====

Students should be comfortable with most of the calculus topics covered in:
  * **[P]** J. Ward, J. Abdey. **Mathematics and Statistics**. University of London, 2013. __Chapters 1-8 of Part 1__.
Extra lessons refreshing these notions may be planned in the first part of the course.

=====Mandatory Teaching Material=====

The following are //mandatory textbooks//:
  * **[T]** F.M. Dekking, C. Kraaikamp, H.P. Lopuhaä, L.E. Meester. **A Modern Introduction to Probability and Statistics**. Springer, 2005.
  * **[R]** P. Dalgaard. **Introductory Statistics with R**. 2nd edition, Springer, 2008.

=====Software=====

  * [[https://cran.r-project.org/|R]]
  * [[https://www.rstudio.com/|RStudio]]

=====Preliminary program and calendar=====

  * [[https://esami.unipi.it/programma.php?c=48036&aa=2020|Preliminary program]].
  * [[https://didattica.di.unipi.it/en/master-programme-in-data-science-and-business-informatics/academic-calendar-2020-2021/|Calendar of lessons]].

=====Student project=====

  * The project can be done in groups of at most 4 students.
  * The project (report + code) must be delivered by the end of July.
  * The oral discussion must take place by the September session, and it will cover both the project and all topics of the course.
  * The project replaces the written exam, but **students have to [[https://esami.unipi.it/esami2/|register for the written exam dates]] in order to fill in the student questionnaire**.
  * Groups that are ready to discuss send the project to the teacher, together with their available time slots for the oral discussion.
  * {{ :mds:smd:smd.project.2021.pdf | Project presentation slides}} and [[http://patterns.di.unipi.it/sds/video/smd_project_2021.mp4|project info audio-video (.mp4)]].

=====Written exam=====

__//There are no mid-terms//.__ The exam consists of a written part and an oral part. The written part consists of open questions and exercises on the topics of the course. Each question is assigned a number of points, for a total of 30. Students are admitted to the oral part if they score at least 18 points. Example written texts: **{{ :mds:smd:smdsample.pdf | sample1}}**, **{{ :mds:smd:smdsample2.pdf | sample2}}**. The oral part consists of a critical discussion of the written part and of open questions and problem solving on the topics of the course.\\
**Online exams:** during the COVID-19 restrictions, the written part and the oral part will be online. For the written part, students will connect to a reserved Teams virtual room and will activate both microphone and webcam. The text will be shared in the virtual room chat. Solutions will be written on sheets of paper. Each sheet will include name, surname, and student id, and it will be signed. A photo of the sheets will be sent to [[ruggieri@di.unipi.it]] at the end of the written part.
Registration for exams is mandatory (**beware of the registration deadline!**): [[https://esami.unipi.it/esami2/|register here]]\\

=====Class calendar=====

^ ^ Date ^ Room ^ Topic ^ Learning material ^
|01| 16.02 16:00-18:00 | Teams | Introduction. Probability and independence. [[http://patterns.di.unipi.it/sds/video/smd01_20210216.mp4|rec01 audio-video (.mp4)]]| **[T]** Chpts. 1-3 {{:mds:smd:smd01.pdf|slides01 (.pdf)}}|
|02| 23.02 16:00-18:00 | Teams | R basics. [[http://patterns.di.unipi.it/sds/video/smd02_20210223.mp4|rec02 audio-video (.mp4)]]| **[R]** Chpts. 1, 2.1, 2.2 {{:mds:smd:smd02.pdf|slides02 (.pdf)}}, {{:mds:smd:smd02.r|script02 (.R)}}|
|03| 24.02 9:00-11:00 | Teams | Discrete random variables. [[http://patterns.di.unipi.it/sds/video/smd03_20210224.mp4|rec03 audio-video (.mp4)]]| **[T]** Chpt. 4 **[R]** Chpt. 3 {{:mds:smd:smd03.pdf|slides03 (.pdf)}}, {{:mds:smd:smd03.r|script03 (.R)}}|
|04| 02.03 16:00-18:00 | Teams | Review: derivatives and integrals. [[http://patterns.di.unipi.it/sds/video/smd04_20210302.mp4|rec04 audio-video (.mp4)]]| **[P]** Chpts. 1-8 {{:mds:smd:smd04.pdf|slides04 (.pdf)}}, {{:mds:smd:smd04.r|script04 (.R)}}|
|05| 03.03 9:00-11:00 | Teams | Continuous random variables. Simulation. [[http://patterns.di.unipi.it/sds/video/smd05_20210303.mp4|rec05 audio-video (.mp4)]]| **[T]** Chpts. 5, 6.1-6.2 **[R]** Chpt. 3 {{:mds:smd:smd05.pdf|slides05 (.pdf)}}, {{:mds:smd:smd05.r|script05 (.R)}}|
|06| 09.03 16:00-18:00 | Teams | Expectation and variance. Computations with random variables. [[http://patterns.di.unipi.it/sds/video/smd06_20210309.mp4|rec06 audio-video (.mp4)]]| **[T]** Chpts. 7, 8 {{:mds:smd:smd06.pdf|slides06 (.pdf)}}, {{:mds:smd:smd06.r|script06 (.R)}}|
|07| 10.03 9:00-11:00 | Teams | R data access and programming. [[http://patterns.di.unipi.it/sds/video/smd07_20210310.mp4|rec07 audio-video (.mp4)]]| **[R]** Chpts. 2.3, 2.4 {{:mds:smd:smd07.zip|script07 (.zip)}} |
|08| 16.03 16:00-18:00 | Teams | Power laws and Zipf laws. 
[[http://patterns.di.unipi.it/sds/video/smd08_20210316.mp4|rec08 audio-video (.mp4)]]| [[https://arxiv.org/pdf/cond-mat/0412004.pdf | Newman's paper]] Sect. I, II, III(A,B,E,F) {{:mds:smd:smd08.pdf|slides08 (.pdf)}}, {{:mds:smd:smd08.zip|script08 (.zip)}} |
|09| 17.03 9:00-11:00 | Teams | Moments, joint distributions, sum of random variables. [[http://patterns.di.unipi.it/sds/video/smd09_20210317.mp4|rec09 audio-video (.mp4)]]| **[T]** Chpts. 9-11 {{:mds:smd:smd09.pdf|slides09 (.pdf)}}, {{:mds:smd:smd09.zip|script09 (.zip)}} |
|10| 23.03 16:00-18:00 | Teams | Law of large numbers. The central limit theorem. [[http://patterns.di.unipi.it/sds/video/smd10_20210323.mp4|rec10 audio-video (.mp4)]]| **[T]** Chpts. 13-14 {{:mds:smd:smd10.pdf|slides10 (.pdf)}}, {{:mds:smd:smd10.r|script10 (.R)}}|
|11| 24.03 9:00-11:00 | Teams | Project presentation. Graphical summaries. [[http://patterns.di.unipi.it/sds/video/smd11_20210324.mp4|rec11 audio-video (.mp4)]]| **[T]** Chpt. 15 {{:mds:smd:smd11.pdf|slides11 (.pdf)}}, {{:mds:smd:smd11.r|script11 (.R)}}|
|12| 30.03 16:00-18:00 | Teams | Numerical summaries. Data preprocessing in R. [[http://patterns.di.unipi.it/sds/video/smd12_20210330.mp4|rec12 audio-video (.mp4)]]| **[T]** Chpt. 16, **[R]** Chpts. 4, 10 {{:mds:smd:smd12.pdf|slides12 (.pdf)}}, {{:mds:smd:smd12.r|script12 (.R)}}, {{ :mds:smd:dataprep.r | dataprep.R}} |
|13| 07.04 9:00-11:00 | Teams | Unbiased estimators. Efficiency and MSE. [[http://patterns.di.unipi.it/sds/video/smd13_20210407.mp4|rec13 audio-video (.mp4)]]| **[T]** Chpts. 17.1-17.3, 19, 20 {{:mds:smd:smd13.pdf|slides13 (.pdf)}}, {{:mds:smd:smd13.r|script13 (.R)}} |
|14| 13.04 16:00-18:00 | Teams | Maximum likelihood estimation. [[http://patterns.di.unipi.it/sds/video/smd14_20210413.mp4|rec14 audio-video (.mp4)]]| **[T]** Chpt. 21 {{ :mds:smd:notes1.pdf |}} {{:mds:smd:smd14.pdf|slides14 (.pdf)}}, {{:mds:smd:smd14.r|script14 (.R)}} |
|15| 14.04 9:00-11:00 | Teams | Linear regression. Least squares estimation. 
[[http://patterns.di.unipi.it/sds/video/smd15_20210414.mp4|rec15 audio-video (.mp4)]]| **[T]** Chpts. 17.4, 22 **[R]** Chpt. 6 {{ :mds:smd:notes2.pdf |}} {{:mds:smd:smd15.pdf|slides15 (.pdf)}}, {{:mds:smd:smd15.r|script15 (.R)}} |
|16| 20.04 16:00-18:00 | Teams | Multiple, non-linear, and logistic regression. [[http://patterns.di.unipi.it/sds/video/smd16_20210420.mp4|rec16 audio-video (.mp4)]]| **[R]** Chpts. 12.1, 13, 16.1-16.2 {{ :mds:smd:notes2.pdf |}} {{:mds:smd:smd16.pdf|slides16 (.pdf)}}, {{:mds:smd:smd16.zip|script16 (.zip)}} |
|17| 21.04 9:00-11:00 | Teams | Logistic regression (ctd). Introduction to confidence intervals. [[http://patterns.di.unipi.it/sds/video/smd17_20210421.mp4|rec17 audio-video (.mp4)]]| **[T]** Chpt. 23.1 {{:mds:smd:smd17.pdf|slides17 (.pdf)}}, {{:mds:smd:smd17.r|script17 (.R)}} |
|18| 27.04 16:00-18:00 | Teams | Confidence intervals: Gaussian, Student's t, large sample method. Confidence intervals in linear regression. [[http://patterns.di.unipi.it/sds/video/smd18_20210427.mp4|rec18 audio-video (.mp4)]]| **[T]** Chpts. 23.2,23.4, 4.3,24.4 {{ :mds:smd:notes2.pdf |}} |
|19| 28.04 9:00-11:00 | Teams | Empirical bootstrap. Application to confidence intervals. [[http://patterns.di.unipi.it/sds/video/smd19_20210428.mp4|rec19 audio-video (.mp4)]]| **[T]** Chpts. 18.1, 18.2, 23.3 {{:mds:smd:smd19.pdf|slides19 (.pdf)}}, {{:mds:smd:smd19.r|script19 (.R)}} |
|20| 04.05 16:00-18:00 | Teams | Parametric bootstrap. Hypothesis testing. [[http://patterns.di.unipi.it/sds/video/smd20_20210504.mp4|rec20 audio-video (.mp4)]]| **[T]** Chpts. 18.3, 25 {{:mds:smd:smd20.pdf|slides20 (.pdf)}}, {{:mds:smd:smd20.r|script20 (.R)}} |
|21| 05.05 9:00-11:00 | Teams | One-sample tests of the mean and application to linear regression. [[http://patterns.di.unipi.it/sds/video/smd21_20210505.mp4|rec21 audio-video (.mp4)]]| **[T]** Chpts. 26-27, **[R]** Chpts. 
5.1, 5.2 {{:mds:smd:smd21.pdf|slides21 (.pdf)}}, {{ :mds:smd:notes2.pdf |}}, {{:mds:smd:smd21.r|script21 (.R)}} |
|22| 11.05 16:00-18:00 | Teams | Multiple comparisons. Fitting distributions. [[http://patterns.di.unipi.it/sds/video/smd22_20210511.mp4|rec22 audio-video (.mp4)]]| {{ :mds:smd:ks.pdf | K-S}}, {{:mds:smd:smd22.pdf|slides22 (.pdf)}}, {{:mds:smd:smd22.r|script22 (.R)}} |
|23| 12.05 9:00-11:00 | Teams | Two-sample tests of the mean, and F-test. [[http://patterns.di.unipi.it/sds/video/smd23_20210512.mp4|rec23 audio-video (.mp4)]]| **[T]** Chpt. 28, **[R]** Chpts. 5.3-5.7 {{:mds:smd:smd23.pdf|slides23 (.pdf)}}, {{:mds:smd:smd23.r|script23 (.R)}} |
|24| 18.05 16:00-18:00 | Teams | Testing correlation/independence. Multiple-sample tests of the mean. [[http://patterns.di.unipi.it/sds/video/smd24_20210518.mp4|rec24 audio-video (.mp4)]]| **[R]** Chpts. 7, 8 {{:mds:smd:smd24.pdf|slides24 (.pdf)}}, {{:mds:smd:smd24.r|script24 (.R)}} |
|--| 19.05 9:00-11:00 | Teams | Office hours and project tutoring. | |