Dp-means: Optimizing to get the number of clusters
In my last post I compared dp-means and k-means error functions and run times. John Myles White pointed to some opportunities that come from being a continuous variable.
Evolving the test code I posted on github, I developed a quick-and-dirty proof of concept.
First, below is the parameter vs. error graph in its latest incarnation. There are two important changes from the analogous graph from last post:
- Instead of using the k-means cost function to make the timing, error comparisons as I did before, I am now plotting the traditional k-means cost function for k-means and the cost function for dp-means,
- I am not plotting against for comparison
- I am plotting errors for a data set not used in training (called cross-validation in the code).
The cost function for dp-means shows a clear minimum. This graph is slightly confusing because the parameter for k-means, k, the number of clusters, increases left-to-right, while the number of clusters in dp-means goes down with increasing parameter .
I wrote a small script that leverages SciPy to optimize the dp-means cost function in order to determine the optimal value of , and therefore the number of clusters.
Here is an example on one of the data sets included as an example “input” directory. This code runs slowly, but converges to a minimum at,
with error: 14.2624
Here is a sample training at the optimal value with only the data as input (the code determines everything needed from the data.)
Figure shows training iterations for centers, training
data membership, cross-validation data membership.
The code is rough and inefficient, but the method seems robust enough to proceed to work on smoothing things out and run more tests. Neat.