# Benchmark — K-Means Clustering (MSSC)

The K-Means Clustering problem is defined as follows: given a set of points along some dimensions, find a partition of these points into k clusters such that the total sum of squared Euclidean distances between each cluster's centroid and its elements is minimized. This problem is also known as the Minimum Sum-of-Squares Clustering (MSSC). We illustrate on this page how LocalSolver outperforms traditional general-purpose optimization solvers, like Gurobi 9.1, on this challenging problem.

## Input data

The instances of this benchmark come from the UCI Machine Learning Repository, gathering a wide variety of datasets used in machine learning with an arbitrary number of dimensions, and the TSPLIB, which is the reference instance repository for the Traveling Salesman Problem (TSP) and contains only 2-dimensional points.

We extrated from the these two repositories the instances used in research articles [1], [2], [3], [4], [5], presenting state-of-the-art dedicated algorithms for the K-Means Clustering problem. Since each instance can be solved for any value of k, the number of clusters used to partition the points, we decided to only keep instances solved with k ≤ 25. Common clustering applications involve a larger number of clusters.

## Mathematical models

The results reported for Gurobi are obtained with the Mixed-Integer Quadratically Constrained Programming (MIQCP) formulation presented here. Each cluster center is represented by continuous variables, one for each dimension, allowing the computation of Euclidean distances to each point thanks to quadratic constraints. In addition, binary variables are used to model assignments of points to clusters. Note that due to numerical issues on instances with large coordinates, we had to scale all coordinates in the range [0,1] for Gurobi.

The LocalSolver model relies on set variables representing assignments of points to clusters. The centroid of each cluster is then computed using a lambda expression. The distance between the points assigned to a cluster and its centroid is computed following the same approach. Compared to the MIQCP model, the LocalSolver model is much more compact, as the only decisions to be taken are the assignments of points to clusters. Set variables combined with lambda expressions enable a simple and intuitive modeling of the problem, without limiting the possible extensions.

```
function model() {
// Set variables: clusters[c] represents the points in cluster c
clusters[1..k] <- set(nbPoints);
// Each point must belong to one and only one cluster
constraint partition[c in 1..k](clusters[c]);
for [c in 1..k] {
local cluster <- clusters[c];
local size <- count(cluster);
// Compute the centroid of the cluster
centroid[d in 0..nbDimensions-1] <- size == 0 ? 0 : sum(cluster, o => coordinates[o][d]) / size;
// Compute the variance of the cluster
squares[d in 0..nbDimensions-1] <- sum(cluster, o => pow(coordinates[o][d] - centroid[d], 2));
variances[c] <- sum[d in 0..nbDimensions-1](squares[d]);
}
// Minimize the total variance
obj <- sum[c in 1..k](variances[c]);
minimize obj;
}
```

## Results

We compare both solvers' performance with two solving times: 1 minute and 10 minutes. At the end of the running time, we measure the gap to the best-known solutions reported in the cited research papers. Each instance is solved with different values of k, the number of clusters. The first eight instances, from "german" to "pr2392", are solved for k = 2, 3, 4, 5, 6, 7, 8, 9, 10. The remaining, largest instances, from "breast" to "pla85900", are solved for k = 2, 5, 10, 15, 20, 25. For each instance, the gap reported in the table corresponds to the arithmetic mean of the gaps in % between the best-known solution and the solution found by the solver for each value of k. When a solver doesn't find any feasible solution within the time limit for at least one value of k, no average gap value but the letter "X" is reported in the table.

We use LocalSolver 10.0 and Gurobi 9.1, known as a state-of-the-art MIP solver. Both are used with default parameters. We ran them on a server equipped with an Intel Core i5-6500 processor (4 cores, 3.6GHz, 6MB cache) and 16GB RAM.

LocalSolver | Gurobi | |||||
---|---|---|---|---|---|---|

Instances | Points | Dimensions | 1 min | 10 min | 1 min | 10 min |

german | 59 | 2 | 0.0 | 0.0 | 5.9 | 3.9 |

ruspini | 75 | 2 | 0.0 | 0.0 | 7.9 | 7.3 |

spath | 89 | 3 | 0.0 | 0.0 | 11.9 | 3.8 |

hatco | 100 | 7 | 0.0 | 0.0 | 16.8 | 10.0 |

iris | 150 | 4 | 0.0 | 0.0 | 24.9 | 14.7 |

gr202 | 202 | 2 | 0.2 | 0.0 | 26.0 | 20.9 |

gr666 | 666 | 2 | 0.3 | 0.1 | 33.8 | 18.4 |

pr2392 | 2,392 | 2 | 0.2 | 0.0 | X | 35.6 |

breast | 683 | 9 | 0.8 | 0.1 | 40.3 | 21.3 |

u1060 | 1,060 | 2 | 0.8 | 0.4 | 62.8 | 49.5 |

pcb3038 | 3,038 | 2 | 0.5 | 0.3 | X | 63.2 |

pendigit | 10,992 | 16 | 0.8 | 0.8 | X | 61.4 |

d15112 | 15,112 | 2 | 0.1 | 0.1 | X | 73.3 |

letter | 20,000 | 16 | 0.8 | 0.4 | X | X |

kegg | 53,413 | 23 | 25.7 | 5.9 | X | X |

shuttle | 58,000 | 9 | 4.5 | 1.4 | X | X |

pla85900 | 85,900 | 2 | 0.9 | 0.4 | X | X |

*Average gaps to best-known solutions.
*

LocalSolver obtains high-quality solutions in short running times, with gap lower than 1%, even for the largest and hardest instances. On the other hand, Gurobi finds viable solutions on instance with less than 100 points, but quickly struggles to find ones for larger instances. For huge instances with 10,000 points, Gurobi fails to find any feasible solution.

## Conclusion

LocalSolver offers an innovative modeling approach based on set variables which makes the mathematical modeling of the K-Means Clustering (MSSC) problem much more compact than the MIQCP formulation as handled by traditional MIP solvers. Extra constraints can be added easily, offering flexibility to extend the model to suit real-world clustering needs. Beyond simplicity, LocalSolver delivers near-optimal solutions in minutes.

You can find the complete model of the K-Means Problem in Python, Java, C#, and C++, and LSP in our Example Tour.

We are at your disposal to accompany you in the discovery of LocalSolver. Don't hesitate to contact us for any further information or support.

[1] D. Aloise, P. Hansen (2009). A branch-and-cut SDP-based algorithm for minimum sum-of-squares clustering. Pesquisa Operacional. 29(3), 503-516.

[2] D. Aloise, P. Hansen, L. Liberti (2012). An improved column generation algorithm for minimum sum-of-squares clustering. Mathematical Programming 131, pp. 195–220.

[3] A.M. Bagirov, B. Ordin, G. Ozturk, A.E. Xavier (2015). An incremental clustering algorithm based on hyperbolic smoothing. Computational Optimization and Applications 61, pp. 219–241.

[4] T. Pereira, D. Aloise, J. Brimberg, N. Mladenović (2018). Review of basic local searches for solving the minimum sum-of-squares clustering problem. In P. Pardalos, A. Migdalas (eds), Open Problems in Optimization and Data Analysis. Springer Optimization and Its Applications, vol. 141. Springer, Cham.

[5] P. Kalczynski, J. Brimberg, Z. Drezner (2019). The importance of good starting solutions in the minimum sum of squares clustering problem. Submitted on arXiv on 6 Apr 2020.