# Dp-means: Optimizing to get the number of clusters

In my last post I compared dp-means and k-means error functions and run times. John Myles White pointed to some opportunities that come from being a continuous variable.

Evolving the test code I posted on github, I developed a quick-and-dirty proof of concept.

First, below is the parameter vs. error graph in its latest incarnation. There are two important changes from the analogous graph from last post:

- Instead of using the k-means cost function to make the timing, error comparisons as I did before, I am now plotting the traditional k-means cost function for k-means and the cost function for dp-means,

- I am not plotting against for comparison
- I am plotting errors for a data set not used in training (called cross-validation in the code).

The cost function for dp-means shows a clear minimum. This graph is slightly confusing because the parameter for k-means, k, the number of clusters, increases left-to-right, while the number of clusters in dp-means goes down with increasing parameter .

I wrote a small script that leverages SciPy to optimize the dp-means cost function in order to determine the optimal value of , and therefore the number of clusters.

Here is an example on one of the data sets included as an example “input” directory. This code runs slowly, but converges to a minimum at,

lambda: 5.488

with error: 14.2624

Here is a sample training at the optimal value with only the data as input (the code determines everything needed from the data.)

Figure shows training iterations for centers, training

data membership, cross-validation data membership.

The code is rough and inefficient, but the method seems robust enough to proceed to work on smoothing things out and run more tests. Neat.