dm:start:guidelines
Differences
These are the differences between the selected revision and the current version of the page.
| dm:start:guidelines [15/11/2015 at 10:26 (10 years ago)] – [Guidelines for the homework on clustering] Anna Monreale | dm:start:guidelines [23/11/2020 at 10:34 (5 years ago)] (current version) – [Guidelines for the task on Classification] Riccardo Guidotti | ||
|---|---|---|---|
| Line 1: | Line 1: | ||
| - | ====== Guidelines for the homework | + | ====== Guidelines for the task on Data Understanding |
| - | * ** Data semantics (4 points)** | + | |
| - | * ** Distribution of the variables and statistics (7 points)** | + | |
| - | * ** Assessing data quality (missing values | + | - Distribution of the variables and statistics (7 points) |
| - | * ** Pairwise correlations | + | - Assessing data quality (missing values, outliers) (7 points) |
| - | * ** Presentation | + | - Variables transformations |
| + | - Pairwise correlations | ||
| ====== Guidelines for the task on clustering ====== | ====== Guidelines for the task on clustering ====== | ||
| - | * **Clustering Analysis by K-means: (15 points)** | + | * Clustering Analysis by K-means: (13 points) |
| - | * Identification of the best value of k | + | - Choice of attributes and distance function (1 point) | ||
| - | * Characterization of the obtained clusters by using both analysis of the k centroids and comparison of the distribution of variables within the clusters and that in the whole dataset | + | - Identification of the best value of k (5 points) |
| + | | ||
| + | * Analysis by density-based clustering (9 points) | ||
| + | - Choice of attributes and distance function (2 points) | ||
| + | - Study of the clustering parameters (2 points) | ||
| + | - Characterization and interpretation of the obtained clusters (5 points) | ||
| + | * Analysis by hierarchical clustering (5 points) | ||
| + | - Choice of attributes and distance function (2 points) | ||
| + | - Show and discuss different dendrograms using different algorithms (3 points) | ||
| + | * Final evaluation of the best clustering approach and comparison of the clustering obtained (3 points) | ||
| + | |||
| + | |||
| + | ====== Guidelines for the task on Association Rules Mining ====== | ||
| + | * Frequent patterns extraction with different values of support and different types (i.e. frequent, closed, maximal) (6 points) | ||
| + | * Discussion of the most interesting frequent patterns and analysis of how the number of patterns changes w.r.t. the min_sup parameter (7 points) | ||
| + | * Association rules extraction with different values of confidence (6 points) | ||
| + | * Discussion of the most interesting rules and analysis of how the number of rules changes w.r.t. the min_conf parameter, with histograms of the rules' confidence and lift (7 points) | ||
| + | * Use the most meaningful rules to replace missing values and evaluate the accuracy (2 points) | ||
| + | * Use the most meaningful rules to predict the target variable and evaluate the accuracy (2 points) | ||
| + | |||
| + | |||
| + | ====== Guidelines for the task on Classification ====== | ||
| + | * Learning of different decision trees/ | ||
| + | * Decision trees interpretation, | ||
| + | * Training of different KNN classifiers with different parameters with the objective of maximizing performance (6 points) | ||
| + | * Discussion of the best prediction model (6 points) | ||
| - | * **Analysis | + | ====== Guidelines for the Project ====== |
| - | | + | * The title page does not count towards the 20-page limit, i.e., you can have 20 pages + 1 title page. The page limit is strict: additional pages (21, 22, 23, etc.) will not be read or evaluated. |
| - | | + | * The project size must not exceed 25 MB, i.e., you must be able to send it by email without compression. |
| + | * Only PDF files are allowed; you do not have to submit Python code or KNIME workflows. | ||
| + | * The final paper must be easily readable, i.e., it is better to use a font size larger than 9pt. | ||
| + | * Use a readable font type and size, e.g. Arial, Times New Roman | ||
| + | * You can use multiple columns and change the margin size, but the project must remain readable. | ||
| + | * It is NOT required to include Python code, KNIME workflows, or theoretical descriptions | ||
| + | | ||
| + | * You can get up to 3 extra points in the final mark according to the following criteria: | ||
| + | - Innovation (0.5 points) | ||
| + | - Experimentation (0.5 points) | ||
| + | - Performance (0.5 points) | ||
| + | - Appearance (0.5 points) | ||
| + | - Organization (0.5 points) | ||
| + | - Summary (0.5 points) | ||
| - | * **Analysis by hierarchical clustering (5 points)** | ||
| - | * Analysis to be performed on a sampling of the data for scalability reasons (if necessary) | ||
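
The Association Rules Mining guidelines ask for patterns and rules to be studied as min_sup and min_conf vary, and for the confidence and lift of the rules to be reported. As a minimal sketch of what these measures mean (not the course's reference implementation), the following pure-Python example computes support, confidence and lift on a toy dataset. The transactions, item names and helper function names are hypothetical and chosen only for illustration. The brute-force enumeration is exponential in the number of items and is suitable only for tiny examples.

```python
from itertools import combinations

# Hypothetical toy transactions, for illustration only.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item of the itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """support(antecedent and consequent together) / support(antecedent)."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

def lift(antecedent, consequent, transactions):
    """confidence(antecedent -> consequent) / support(consequent)."""
    return confidence(antecedent, consequent, transactions) / support(consequent, transactions)

def frequent_itemsets(transactions, min_sup):
    """Brute-force enumeration of all itemsets with support >= min_sup."""
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, len(items) + 1):
        for candidate in combinations(items, size):
            s = support(candidate, transactions)
            if s >= min_sup:
                result[frozenset(candidate)] = s
    return result

# How the number of frequent patterns changes w.r.t. min_sup.
for min_sup in (0.25, 0.5, 0.75):
    print(min_sup, len(frequent_itemsets(transactions, min_sup)))

# Measures for the hypothetical rule {bread} -> {milk}.
print(confidence({"bread"}, {"milk"}, transactions))
print(lift({"bread"}, {"milk"}, transactions))
```

A lift below 1, as for the hypothetical rule above, indicates that the antecedent makes the consequent less likely than its baseline frequency. Rules of this kind are usually less interesting to discuss than rules with lift well above 1.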
dm/start/guidelines.1447583176.txt.gz · Last modified: 15/11/2015 at 10:26 (10 years ago) by Anna Monreale
