In my last post I compared the error functions and run times of dp-means and k-means.  John Myles White pointed out some opportunities that come from $\lambda$ being a continuous variable.

Evolving the test code I posted on GitHub, I developed a quick-and-dirty proof of concept.

First, below is the parameter vs. error graph in its latest incarnation.  There are three important changes from the analogous graph in the last post:

• Instead of using the k-means cost function for both the timing and error comparisons, as I did before, I am now plotting the traditional k-means cost function for k-means and, for dp-means, its own cost function (sketched in code after this list),

$\text{Cost}(\text{dp-means}) = \text{Cost}(\text{k-means}) + \lambda k$

• I am no longer plotting the dp-means error against $\text{data range}/\lambda$ for comparison; it is now plotted directly against $\lambda$.
• I am plotting errors for a data set not used in training (referred to as the cross-validation set in the code).
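To make the first change concrete, here is a minimal NumPy sketch of how that cost can be evaluated; the function and variable names are illustrative, not the ones in the repository:

```python
import numpy as np

def dp_means_cost(X, centers, lam):
    """dp-means objective: the k-means cost plus a penalty of lam per cluster.

    X       : (n, d) array of points (training or cross-validation data)
    centers : (k, d) array of cluster centers
    lam     : the dp-means penalty parameter
    """
    # Squared distance from every point to every center, shape (n, k).
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    # Each point is charged to its nearest center: the k-means cost.
    kmeans_cost = d2.min(axis=1).sum()
    # Cost(k-means) + lambda * k
    return kmeans_cost + lam * len(centers)
```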

The dp-means cost function shows a clear minimum. The graph is slightly confusing because the k-means parameter, k (the number of clusters), increases left-to-right, while the number of clusters found by dp-means goes down as the parameter $\lambda$ increases.

I wrote a small script that leverages SciPy to optimize the dp-means cost function in order to determine the optimal value of $\lambda$, and therefore the number of clusters.
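The script in the repository is the authoritative version; a rough, self-contained sketch of the approach might look like the following, where the input file name, search bounds, and helper names are assumptions rather than the actual code:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def dp_means(X, lam, max_iter=100):
    """Minimal dp-means: a point whose squared distance to every existing
    center exceeds lam seeds a new cluster."""
    centers = X.mean(axis=0, keepdims=True)       # start from the global mean
    for _ in range(max_iter):
        labels = np.empty(len(X), dtype=int)
        for i, x in enumerate(X):                  # assignment step
            d2 = ((centers - x) ** 2).sum(axis=1)
            j = int(d2.argmin())
            if d2[j] > lam:                        # too far: open a new cluster
                centers = np.vstack([centers, x])
                j = len(centers) - 1
            labels[i] = j
        # Update step: recompute centers, dropping any cluster left empty.
        new_centers = np.array([X[labels == j].mean(axis=0)
                                for j in range(len(centers))
                                if np.any(labels == j)])
        if new_centers.shape == centers.shape and np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers

def cv_cost(lam, X_train, X_cv):
    """dp-means cost on the held-out set for a model trained at this lambda."""
    centers = dp_means(X_train, lam)
    d2 = ((X_cv[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.min(axis=1).sum() + lam * len(centers)

# Split the data in two so the cost is evaluated on points not used in training.
X = np.loadtxt("input/example.txt")                # hypothetical path; any (n, d) array works
rng = np.random.default_rng(0)
idx = rng.permutation(len(X))
X_train, X_cv = X[idx[::2]], X[idx[1::2]]

# The cost is a noisy function of lambda, so a bounded scalar search is a
# pragmatic choice; it converges slowly but reliably in my tests.
result = minimize_scalar(cv_cost, bounds=(0.1, 50.0), args=(X_train, X_cv),
                         method="bounded")
print("lambda:", result.x)
print("with error:", result.fun)
```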

Here is an example run on one of the data sets included in the example “input” directory.  The code runs slowly, but converges to a minimum at,

lambda: 5.488
with error: 14.2624

Here is a sample training run at the optimal value, with only the data as input (the code determines everything it needs from the data).

[Figure: training iterations showing the cluster centers, training-data membership, and cross-validation-data membership.]
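In code, that final training step amounts to something like the snippet below, reusing `dp_means`, `result`, and the train/cv split from the sketch above (the per-iteration plotting is omitted):

```python
# Train once at the optimal lambda found by the search, then assign both the
# training and the cross-validation points to the learned centers.
centers = dp_means(X_train, result.x)

def assign(X, centers):
    # Index of the nearest center for every point.
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

train_membership = assign(X_train, centers)
cv_membership = assign(X_cv, centers)
print("clusters found:", len(centers))
```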

The code is rough and inefficient, but the method seems robust enough that it is worth smoothing things out and running more tests. Neat.
