dm:start:clustering
Differences
These are the differences between the selected revision and the current version of the page.
| Next revision | Previous revision | ||
| dm:start:clustering [18/12/2012 at 14:09 (13 years ago)] – created Fosca Giannotti | dm:start:clustering [18/12/2012 at 14:20 (13 years ago)] (current version) – [Guidelines for the homework on clustering] Fosca Giannotti | ||
|---|---|---|---|
| Line 5: | Line 5: | ||
| * **Data Understanding: | * **Data Understanding: | ||
| - | * Distribution | + | * Distribution analysis and suitable transformation of variables |
| * Elimination of redundant variables by correlation analysis | * Elimination of redundant variables by correlation analysis | ||
| * **Clustering Analysis by K-means: (15 points)** | * **Clustering Analysis by K-means: (15 points)** | ||
| * Identification of the best value of k | * Identification of the best value of k | ||
| - | * Characterization of the obtained clusters by using both analysis of the k centroids and comparison of the distribution of variables within the clusters and in the whole dataset | + | * Characterization of the obtained clusters by using both analysis of the k centroids and comparison of the distribution of variables within the clusters and that in the whole dataset |
| * **Analysis by density-based clustering (7 points)** | * **Analysis by density-based clustering (7 points)** | ||
| Line 18: | Line 18: | ||
| * **Analysis by hierarchical clustering (Optional - 3 points)** | * **Analysis by hierarchical clustering (Optional - 3 points)** | ||
| * Analysis to be performed on a sample of the data for scalability reasons | * Analysis to be performed on a sample of the data for scalability reasons | ||
| + | |||
| + | |||
| + | ====== Description of the variables ====== | ||
| + | |||
| + | For each car driver we observe the following quantities, measured over a certain time window of mobile activity: | ||
| + | |||
| + | Length = total traveled distance (m.) | ||
| + | Duration = total time spent driving (sec.) | ||
| + | Count = number of different trips | ||
| + | Phighway = distance traveled on highways (m.) | ||
| + | Pcity = distance traveled inside cities (m.) | ||
| + | Length_arc_crowded = distance traveled on the 20% most crowded roads (m.) | ||
| + | Pnight = distance traveled at night time (m.) | ||
| + | Pover = distance traveled over speed limit (m.) | ||
| + | Profile = number of systematic trips, e.g., work-home | ||
| + | Radius_g = radius of gyration: sparsity of location from the center of mass of the driver (mean position) | ||
| + | Radius_g_L1 = radius of gyration w.r.t. L1: sparsity of location from the driver' | ||
| + | Avg_Dist_L1 = average distance from L1: average distance from the driver' | ||
| + | TimeL1L2 = % time spent at locations L1 and L2 (most and second most preferred locations) | ||
| + | EntropyArc = entropy on road segment frequencies, | ||
| + | EntropyLocation = entropy on location frequencies, | ||
| + | EntropyTime = entropy on hours of the day, measures the diversity of daily patterns | ||
| + | |||
| + | Notice that there are no missing values in the dataset, hence " | ||
| + | |||
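To make the mobility variables described above more concrete, here is a minimal sketch of how quantities such as Radius_g, Radius_g_L1, Avg_Dist_L1 and the entropy variables could be derived from a driver's list of visited locations. It is not the procedure used to build the dataset. It assumes that locations are planar (x, y) coordinates in metres, that the radius of gyration is the root-mean-square distance from the chosen center (the usual definition, which this page does not spell out), and that entropy is Shannon entropy in bits. All function names are hypothetical.

```python
import math
from collections import Counter


def radius_of_gyration(points, center=None):
    """Root-mean-square distance of the points from a center.

    With center=None the center of mass (mean position) is used,
    which corresponds to Radius_g; passing the most preferred
    location L1 as center corresponds to Radius_g_L1.
    Assumption: points are planar (x, y) pairs in metres.
    """
    if center is None:
        cx = sum(x for x, _ in points) / len(points)
        cy = sum(y for _, y in points) / len(points)
    else:
        cx, cy = center
    squared = sum((x - cx) ** 2 + (y - cy) ** 2 for x, y in points)
    return math.sqrt(squared / len(points))


def most_preferred_location(points):
    """L1: the most frequently visited location."""
    return Counter(points).most_common(1)[0][0]


def average_distance(points, center):
    """Mean Euclidean distance from a center (Avg_Dist_L1 when center is L1)."""
    cx, cy = center
    return sum(math.hypot(x - cx, y - cy) for x, y in points) / len(points)


def shannon_entropy(items):
    """Shannon entropy (in bits) of the observed frequencies of the items.

    Applied to road segments, locations or hours of the day, this gives
    quantities in the spirit of EntropyArc, EntropyLocation and EntropyTime.
    """
    counts = Counter(items)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Toy data (made up for illustration): three visits to one place, one elsewhere.
visits = [(0, 0), (0, 0), (0, 0), (3, 4)]
l1 = most_preferred_location(visits)
print("Radius_g        :", radius_of_gyration(visits))
print("Radius_g_L1     :", radius_of_gyration(visits, center=l1))
print("Avg_Dist_L1     :", average_distance(visits, l1))
print("EntropyLocation :", shannon_entropy(visits))
```

A driver who always visits the same place gets entropy 0 and radius 0. Both values grow as the visits spread over more, and more distant, locations, which is why these variables are natural candidates for the distribution analysis requested in the guidelines.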
dm/start/clustering.1355839756.txt.gz · Last modified: 18/12/2012 at 14:09 (13 years ago) (external edit)
