The K-Means Clustering problem is defined as follows: given a set of points in a space with an arbitrary number of dimensions, find a partition of these points into k clusters that minimizes the total sum of squared Euclidean distances between each point and the centroid of its cluster. This problem is also known as the Minimum Sum-of-Squares Clustering (MSSC) problem. On this page, we illustrate how LocalSolver outperforms traditional general-purpose optimization solvers, such as Gurobi 9.1, on this challenging problem.
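To make the objective concrete, here is a minimal Python sketch (not part of the benchmark code) that evaluates the MSSC objective for a given assignment of points to clusters:

```python
def mssc_objective(points, assignment, k):
    """Total sum of squared Euclidean distances from each point
    to the centroid of its assigned cluster."""
    dim = len(points[0])
    total = 0.0
    for c in range(k):
        members = [p for p, a in zip(points, assignment) if a == c]
        if not members:
            continue  # empty clusters contribute nothing
        centroid = [sum(p[d] for p in members) / len(members) for d in range(dim)]
        total += sum((p[d] - centroid[d]) ** 2 for p in members for d in range(dim))
    return total

# Two well-separated pairs of points: the natural 2-clustering has
# centroids (0, 0.5) and (10, 10.5), hence an objective of 0.5 + 0.5
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
print(mssc_objective(points, [0, 0, 1, 1], k=2))  # → 1.0
```

The difficulty of the problem comes from choosing the assignment, since the centroids are fully determined by it.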

## Input data

The instances of this benchmark come from two sources: the UCI Machine Learning Repository, which gathers a wide variety of datasets used in machine learning with an arbitrary number of dimensions, and the TSPLIB, the reference instance repository for the Traveling Salesman Problem (TSP), which contains only 2-dimensional points.

We extracted from these two repositories the instances used in the research articles listed below, which present state-of-the-art dedicated algorithms for the K-Means Clustering problem. Since each instance can be solved for any value of k, the number of clusters, we decided to keep only instances solved with k ≤ 25, as typical clustering applications rarely involve a larger number of clusters.

## Mathematical models of the K-Means Clustering Problem (MSSC)

### Mixed-Integer Quadratically Constrained Programming (MIQCP) model

The results reported for Gurobi are obtained with the Mixed-Integer Quadratically Constrained Programming (MIQCP) formulation presented here. Each cluster center is represented by one continuous variable per dimension, allowing the Euclidean distance to each point to be computed through quadratic constraints. In addition, binary variables model the assignment of points to clusters. Note that, due to numerical issues on instances with large coordinates, we had to scale all coordinates into the range [0, 1] for Gurobi.
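A formulation of this kind can be sketched as follows (our notation, not necessarily the exact model used in the benchmark): $x_{ij}$ is the $j$-th coordinate of point $i$, binary $y_{ic}$ assigns point $i$ to cluster $c$, continuous $z_{cj}$ are the cluster centers, $d_i$ is the squared distance of point $i$ to its assigned center, and $M$ is a sufficiently large constant:

$$
\begin{aligned}
\min \; & \sum_{i=1}^{n} d_i \\
\text{s.t.} \; & \sum_{c=1}^{k} y_{ic} = 1 && i = 1, \dots, n \\
& d_i \ge \sum_{j=1}^{m} (x_{ij} - z_{cj})^2 - M \, (1 - y_{ic}) && i = 1, \dots, n, \; c = 1, \dots, k \\
& y_{ic} \in \{0, 1\}, \quad z_{cj} \in \mathbb{R}, \quad d_i \ge 0
\end{aligned}
$$

The big-M term deactivates the quadratic constraint when point $i$ is not assigned to cluster $c$, which is one reason why this formulation scales poorly with the number of points.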

### LocalSolver model

The LocalSolver model relies on set variables representing assignments of points to clusters. The centroid of each cluster is then computed using a lambda expression. The distance between the points assigned to a cluster and its centroid is computed following the same approach. Compared to the MIQCP model, the LocalSolver model is much more compact, as the only decisions to be taken are the assignments of points to clusters. Set variables combined with lambda expressions enable simple and intuitive modeling of the problem without limiting the possible extensions.

```
function model() {
    // Set variables: clusters[c] represents the points in cluster c
    clusters[1..k] <- set(nbPoints);

    // Each point must belong to one and only one cluster
    constraint partition[c in 1..k](clusters[c]);

    for [c in 1..k] {
        local cluster <- clusters[c];
        local size <- count(cluster);

        // Compute the centroid of the cluster
        centroid[d in 0..nbDimensions-1] <- size == 0 ? 0 :
                sum(cluster, o => coordinates[o][d]) / size;

        // Compute the variance of the cluster
        squares[d in 0..nbDimensions-1] <-
                sum(cluster, o => pow(coordinates[o][d] - centroid[d], 2));
        variances[c] <- sum[d in 0..nbDimensions-1](squares[d]);
    }

    // Minimize the total variance
    obj <- sum[c in 1..k](variances[c]);
    minimize obj;
}
```

## Gurobi and LocalSolver results on the K-Means Clustering Problem (MSSC)

We compare both solvers’ performance with three solving times: 1 minute, 10 minutes, and 1 hour. At the end of the running time, we measure the gap to the best known solutions reported in the cited research papers. Each instance is solved with different values of k, the number of clusters. The first eight instances, from “german” to “pr2392”, are solved for k = 2, 3, 4, 5, 6, 7, 8, 9, 10. The remaining, largest instances, from “breast” to “pla85900”, are solved for k = 2, 5, 10, 15, 20, 25. For each instance, the gap reported in the table is the arithmetic mean, over all values of k, of the gaps in % between the best known solution and the solution found by the solver. When a solver does not find any feasible solution within the time limit for at least one value of k, the letter “X” is reported in the table instead of an average gap.
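This averaging can be sketched in a few lines of Python (an illustration of the metric, not LocalSolver’s actual benchmark tooling; an infeasible run is represented here by `None`):

```python
def average_gap(best_known, found):
    """Arithmetic mean of per-k gaps in %, or None (reported as "X")
    if at least one run produced no feasible solution."""
    if any(v is None for v in found):
        return None
    gaps = [100.0 * (f - b) / b for b, f in zip(best_known, found)]
    return sum(gaps) / len(gaps)

# Gaps of 10% and 5% average to 7.5%
print(average_gap([100.0, 200.0], [110.0, 210.0]))  # → 7.5
# One infeasible run makes the whole cell an "X"
print(average_gap([100.0, 200.0], [110.0, None]))   # → None
```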

We use LocalSolver 10.0 and Gurobi 9.1, known as a state-of-the-art MIP solver. Both are used with default parameters. We ran them on a server equipped with an Intel Core i5-6500 processor (4 cores, 3.6GHz, 6MB cache) and 16GB RAM.

| Instances | Points | Dimensions | LocalSolver 1 min | LocalSolver 10 min | LocalSolver 1 hour | Gurobi 1 min | Gurobi 10 min | Gurobi 1 hour |
|---|---|---|---|---|---|---|---|---|
| german | 59 | 2 | 0.0 | 0.0 | 0.0 | 5.9 | 3.9 | 3.6 |
| ruspini | 75 | 2 | 0.0 | 0.0 | 0.0 | 7.9 | 7.3 | 5.6 |
| spath | 89 | 3 | 0.0 | 0.0 | 0.0 | 11.9 | 3.8 | 1.6 |
| hatco | 100 | 7 | 0.0 | 0.0 | 0.0 | 16.8 | 10.0 | 9.0 |
| iris | 150 | 4 | 0.0 | 0.0 | 0.0 | 24.9 | 14.7 | 14.4 |
| gr202 | 202 | 2 | 0.2 | 0.0 | 0.0 | 26.0 | 20.9 | 18.2 |
| gr666 | 666 | 2 | 0.3 | 0.1 | 0.1 | 33.8 | 18.4 | 18.1 |
| pr2392 | 2,392 | 2 | 0.2 | 0.0 | 0.0 | X | 35.6 | 10.8 |
| breast | 683 | 9 | 0.8 | 0.1 | 0.1 | 40.3 | 21.3 | 14.1 |
| u1060 | 1,060 | 2 | 0.8 | 0.4 | 0.0 | 62.8 | 39.4 | 21.6 |
| pcb3038 | 3,038 | 2 | 0.5 | 0.3 | 0.3 | X | 63.2 | 34.9 |
| pendigit | 10,992 | 16 | 0.8 | 0.8 | 0.4 | X | 61.4 | 54.2 |
| d15112 | 15,112 | 2 | 0.1 | 0.1 | 0.0 | X | 73.3 | 71.7 |
| letter | 20,000 | 16 | 0.8 | 0.4 | 0.2 | X | X | 44.4 |
| kegg | 53,413 | 23 | 25.7 | 5.9 | 3.1 | X | X | X |
| shuttle | 58,000 | 9 | 4.5 | 1.4 | 0.9 | X | X | X |
| pla85900 | 85,900 | 2 | 0.9 | 0.4 | 0.1 | X | X | 78.6 |

Average gaps in % to best known solutions.

LocalSolver obtains high-quality solutions in short running times, with a gap below 1% even for the largest and hardest instances. Gurobi, on the other hand, finds feasible solutions for the smaller instances but quickly struggles as the size grows: within 1 minute, it finds no feasible solution beyond roughly 2,000 points, and for the largest high-dimensional instances with more than 50,000 points (kegg, shuttle), it fails to find any feasible solution even within 1 hour.

## Conclusion

LocalSolver offers an innovative modeling approach based on set variables, which makes the mathematical model of the K-Means Clustering (MSSC) problem much more compact than the MIQCP formulation handled by traditional MIP solvers. Extra constraints can be added easily, offering the flexibility to extend the model to suit real-world clustering needs. Beyond simplicity, LocalSolver delivers near-optimal solutions in minutes.

You can find the complete model of the K-Means Clustering (MSSC) problem and many other Clustering problems in Python, Java, C#, and C++ in our Example Tour.

D. Aloise, P. Hansen (2009). A branch-and-cut SDP-based algorithm for minimum sum-of-squares clustering. Pesquisa Operacional 29(3), pp. 503–516.

D. Aloise, P. Hansen, L. Liberti (2012). An improved column generation algorithm for minimum sum-of-squares clustering. Mathematical Programming 131, pp. 195–220.

A.M. Bagirov, B. Ordin, G. Ozturk, A.E. Xavier (2015). An incremental clustering algorithm based on hyperbolic smoothing. Computational Optimization and Applications 61, pp. 219–241.

T. Pereira, D. Aloise, J. Brimberg, N. Mladenović (2018). Review of basic local searches for solving the minimum sum-of-squares clustering problem. In P. Pardalos, A. Migdalas (eds), Open Problems in Optimization and Data Analysis. Springer Optimization and Its Applications, vol. 141. Springer, Cham.

P. Kalczynski, J. Brimberg, Z. Drezner (2019). The importance of good starting solutions in the minimum sum of squares clustering problem. Submitted on arXiv on 6 Apr 2020.
