I spent part of the day thinking about, and part of the day writing about, a generalization of the k-means clustering algorithm to the case where some data dimensions are missing and others are measured with varying quality. That is, I am attempting to generalize it so that it clusters the data by chi-squared rather than by uniform-metric squared distance. This, if I am right, will be a maximum-likelihood model for the situation in which the underlying distribution is a set of delta functions and each data point is a sample from that distribution after convolution with Gaussian errors (different for each data point). My loyal reader will recognize this as a statement of the archetypes problem on which I have been working for the last week or so.
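
To make the idea concrete, here is a minimal sketch of what such a chi-squared k-means might look like. It is not the code I am actually writing, and it makes assumptions the paragraph above does not: the errors are taken to be independent in each dimension (so each point carries a vector of inverse variances rather than a full covariance matrix), a missing dimension is encoded as an inverse variance of zero, and the initial centers are supplied by hand. The names `chi2_kmeans`, `ivar`, and `n_iter` are made up for illustration.

```python
import numpy as np

def chi2_kmeans(x, ivar, centers, n_iter=20):
    """Hard-assignment clustering by chi-squared (a sketch, not a reference).

    x       : (N, D) data; entries with ivar == 0 are ignored (may be NaN)
    ivar    : (N, D) inverse variances; 0 marks a missing dimension (assumption)
    centers : (K, D) initial guesses for the delta-function locations
    """
    ivar = np.asarray(ivar, dtype=float)
    # Replace missing entries with 0 so NaNs cannot leak into the sums;
    # their zero weight removes them from every chi-squared anyway.
    x = np.where(ivar > 0, np.asarray(x, dtype=float), 0.0)
    centers = np.array(centers, dtype=float)
    for _ in range(n_iter):
        # Assignment step: chi-squared of every point against every center.
        resid = x[:, None, :] - centers[None, :, :]          # (N, K, D)
        chi2 = np.sum(ivar[:, None, :] * resid**2, axis=2)   # (N, K)
        labels = np.argmin(chi2, axis=1)
        # Update step: inverse-variance-weighted mean, dimension by dimension.
        for k in range(len(centers)):
            members = labels == k
            wsum = ivar[members].sum(axis=0)
            wx = (ivar[members] * x[members]).sum(axis=0)
            ok = wsum > 0  # leave a coordinate alone if nothing constrains it
            centers[k, ok] = wx[ok] / wsum[ok]
    # labels are the assignments used for the final center update
    return centers, labels
```

With all inverse variances equal this reduces to ordinary k-means; the weighted mean in the update step is what minimizes the total chi-squared for a fixed assignment, which is where the maximum-likelihood claim (if I am right about it) would come from.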
