Text Analytics (635AA) A.Y. 2023/24

Teacher

Laura Pollacci (laura.pollacci [at] di [dot] unipi [dot] it)

Office hours:

Schedule

Day	Hour	Room
Thursday	16-18	Fib C1
Friday	11-13	Fib M1

Team of the class

Objectives

The course covers text analytics systems and applications that address business problems by discovering and presenting knowledge that is otherwise locked in textual form. The main objectives of the course are:

Learning the essential techniques, algorithms, and models used in natural language processing.
Understanding the architectures of typical text analytics applications and the libraries for building them.
Gaining expertise in the design, implementation, and evaluation of applications that exploit the analysis, interpretation, and transformation of texts.

Background

Background: Natural Language Processing, Information Retrieval and Machine Learning
Mathematical background: Probability, Statistics and Algebra
Linguistic essentials: words, lemmas, morphology, Part of Speech (PoS), syntax
Basic text processing: regular expressions, tokenisation
Data collection: scraping
Basic modelling: collocations, language models
Introduction to Machine Learning: theory and practical tips
Libraries and tools: NLTK, spaCy, Keras, PyTorch
Classification/Clustering
Sentiment Analysis/Opinion Mining
Information Extraction/Relation Extraction/Entity Linking
Transfer learning
Quantification

Lecture Notes

Date	Lecture	Slides	Material / Reference
2023/09/21	Introduction to the course, NLP & Text Analytics.	1 - Introduction to the Text Analytics course	J. Eisenstein. Introduction to Natural Language Processing. MIT Press. Chp. 1.
2023/09/22	Review of probability.	2 - Reminds on probability
2023/09/28	Introduction to Python.	3 - Introduction to Python	L3 - Introduction_to_Python.ipynb
2023/09/29	Introduction to Python - part 2. Project and Dates	4 - Project and Dates
2023/10/05	Probabilistic language models	5 - Probabilistic language models	D. Jurafsky, J.H. Martin. Chp. 3. L5 Probabilistic Language Model.ipynb
2023/10/06	Text Indexing: Strings, Regular Expressions and BS4.	6 - Text indexing 1	D. Jurafsky, J.H. Martin. Chp. 2. L6.1 - Strings Regular expressions and BS4.ipynb
2023/10/12	Linguistic annotation. NLTK.	6 - Text Indexing 2	L6.2 - Linguistic annotation with NLTK.ipynb
2023/10/13	Lesson canceled due to UNIPI orientation days.
2023/10/19	Feature Selection	6 - Text Indexing 3	L6.3 - Gensim collocations - Stanza - Spacy (Notebooks)
2023/10/20	Vector space models	6 - Text Indexing 4	D. Jurafsky, J.H. Martin. Chp. 6. L6.4 - Vector space model - toy example
2023/10/26	Lesson canceled
2023/10/27	Lesson canceled
2023/11/02	Machine Learning for Text Analytics.	10 - Machine Learning for Text Analytics - corrected
2023/11/03	Machine Learning for Text Analytics: Design Experimental Protocols. Student presentations: How to.	11 - Design Experimental Protocols. 11.1 - Student presentations: How to	L.11 - Classification with SkLearn
2023/11/09	Student project presentations: proposal, brainstorming, discussion.
2023/11/10	Student project presentations: proposal, brainstorming, discussion.
2023/11/16	Topic Modeling	12 - Topic Modeling	Zhai and Massung (2016) Text Data Management and Analysis. Chp. 17. L.12 - Topic Modeling - Notebook. L.12.1 - Topic Modeling pyLDAvis - Notebook
2023/11/17	A primer on Neural Networks	13 - A primer on Neural Networks
2023/11/23	Neural Networks	14 - Neural Networks	From SVM to NN, Classification with Keras - Notebooks.
2023/11/24	Neural Language Models	15 - Neural Language Models	D. Jurafsky, J.H. Martin. Chps. 7, 9, 11
2023/11/30	Student project presentations: ongoing experiments. Neural Language Models Practice	16 - Neural Language Models Word2Vec	Word2vec - Notebook.
2023/12/01	Student project presentations: ongoing experiments. Neural Language Models Practice	17 - Neural Language Models Doc2Vec	Doc2Vec - Notebook
2023/12/07	Neural Language Models - part 2	Neural Language Models - part 2
2023/12/11	BERT. Project Submission	19 - Bert. Project Submission	Bert - Notebooks
2023/12/14	Advanced Topics	20 - Advanced Topics	Recommended chapters: D. Jurafsky, J.H. Martin. Chps. 20, 24.

Exam

Attending students

The exam for attending students consists of a project, to be agreed upon with the teacher, and an oral exam. The project deliverables are the code and a report of the activity (typically 4-10 pages). The oral exam consists of the presentation and discussion of the project. Projects may be based on challenges proposed in research forums (SemEval, EVALITA) or on other platforms (Kaggle). Students are also invited to propose a project based on other sources (e.g., recent papers on arXiv CL or AI) or on their own interests. Students may work in groups of 3-5 people.

Non-Attending students

The exam for non-attending students consists of a written exam with open questions and exercises, and an oral discussion of the topics of the course.

Written test example.

Textbooks

It is recommended to read selected chapters from:

D. Jurafsky, J.H. Martin, Speech and Language Processing. 3rd edition, Prentice-Hall, 2018.
S. Bird, E. Klein, E. Loper. Natural Language Processing with Python.

Further bibliography will be indicated in the material for the individual lessons.
