NP-Hardness of balanced minimum sum-of-squares clustering


Pattern Recognition Letters 97 (2017) 44–45


Pattern Recognition Letters journal homepage: www.elsevier.com/locate/patrec

Artem Pyatkin a, Daniel Aloise b,∗, Nenad Mladenović c

a Sobolev Institute of Mathematics, Novosibirsk State University, Novosibirsk, Russia
b École Polytechnique de Montréal, Montréal, Canada
c Mathematical Institute, Serbian Academy of Science and Arts, Belgrade, Serbia

Article info

Article history: Received 18 January 2017; Available online 30 June 2017

MSC: 41A05, 41A10, 65D05, 65D17

Abstract

The balanced clustering problem consists of partitioning a set of $n$ objects into $K$ equal-sized clusters, where $n$ is a multiple of $K$. A popular clustering criterion when the objects are points of a $q$-dimensional space is the minimum sum of squared distances from each point to the centroid of the cluster to which it belongs. We show in this paper that this problem is NP-hard in general dimension already for triplets, i.e., when $n/K = 3$.

© 2017 Elsevier B.V. All rights reserved.

Keywords: Balanced clustering; Sum-of-squares; Complexity

1. Introduction

The minimum sum-of-squares clustering (MSSC), also known in the literature as k-means clustering, is a central problem in cluster analysis. Given a set of $n$ points $X = \{x_1, \dots, x_n\}$ in a given Euclidean space $\mathbb{R}^q$, it addresses the problem of finding a partition $P = \{C_1, \dots, C_K\}$ into $K$ clusters minimizing the sum of squared distances from each point to the centroid of the cluster to which it belongs. The problem can be expressed as:

$$\min \sum_{k=1}^{K} \sum_{i : x_i \in C_k} \|x_i - y_k\|^2, \tag{1}$$

where $\|\cdot\|$ is the Euclidean norm and $y_k$ is the centroid of the points $x_i$ in cluster $C_k$. The problem is NP-hard in the plane for general $K$ [7]. In general dimension, it is NP-hard already for two clusters [1].

Balanced MSSC requires that the points be equally spread among the clusters, with $n$ a multiple of $K$. The balance requirement appears in applications in fields as varied as cloud computing [8], image segmentation [5], and team building [3]. When $n/K = 2$, balanced MSSC is equivalent to the minimum weighted perfect matching problem and, consequently, can be solved in $O(n^3)$ time. An NP-hardness proof for the particular case of two equal-sized clusters, i.e., $K = 2$, can be found in [2] and in Kelmanov and Pyatkin [6]. To the best of our knowledge, it is unknown from which ratio $n/K$ onward balanced MSSC becomes NP-hard.

In the form of a decision problem, some $W > 0$ is also given and one asks whether there is a partition into $K$ clusters, each of cardinality $n/K$, with cost at most $W$. In the sequel, we show that the decision version of balanced MSSC is NP-complete already for triplets.
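To make objective (1) and the balance constraint concrete, the following is a minimal sketch in plain Python. It evaluates the cost of a cluster and finds an optimal balanced partition by exhaustive enumeration. The function names and the six sample points are illustrative assumptions, not part of the paper, and the search is exponential, so it is only meant for tiny inputs.

```python
from itertools import combinations

def centroid(points):
    """Coordinate-wise mean of a non-empty list of points (tuples of floats)."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))

def cluster_cost(points):
    """Sum of squared Euclidean distances from each point to the cluster centroid."""
    y = centroid(points)
    return sum(sum((a - b) ** 2 for a, b in zip(p, y)) for p in points)

def balanced_partitions(indices, size):
    """Yield every partition of `indices` into unordered blocks of exactly `size` elements."""
    if not indices:
        yield []
        return
    first, rest = indices[0], indices[1:]
    # Fixing the smallest remaining index in the next block avoids duplicate partitions.
    for mates in combinations(rest, size - 1):
        block = (first,) + mates
        remaining = [i for i in rest if i not in mates]
        for tail in balanced_partitions(remaining, size):
            yield [block] + tail

def brute_force_balanced_mssc(points, K):
    """Exhaustive search over balanced partitions; illustrative only."""
    n = len(points)
    if n % K != 0:
        raise ValueError("n must be a multiple of K")
    best_cost, best_partition = float("inf"), None
    for partition in balanced_partitions(list(range(n)), n // K):
        cost = sum(cluster_cost([points[i] for i in block]) for block in partition)
        if cost < best_cost:
            best_cost, best_partition = cost, partition
    return best_cost, best_partition

# Hypothetical toy instance: two well-separated groups of three points (n = 6, K = 2).
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
cost, partition = brute_force_balanced_mssc(points, K=2)
print(partition, cost)
```

On this toy instance the search recovers the two obvious triplets. Enumeration is used here only because it is easy to verify by hand.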

∗ Corresponding author. E-mail addresses: [email protected], [email protected] (D. Aloise).
http://dx.doi.org/10.1016/j.patrec.2017.05.033
0167-8655/© 2017 Elsevier B.V. All rights reserved.

2. A proof by reduction from the problem of partitioning a graph into triangles

In this section we prove the following:

Theorem 1. Balanced MSSC is NP-complete for $n/K = 3$.

Proof. The reduction is from the well-known NP-complete problem of Partitioning into Triangles (PiT) [4], which asks, for a given graph $G = (V, E)$ with $|V| = 3K$ vertices, whether there exists a partition of $V$ into subsets $V_1, \dots, V_K$ such that each subset consists of three pairwise adjacent vertices, i.e., forms a triangle. First of all, we need the following result:

Proposition 1. For every finite $C_k \subset \mathbb{R}^q$ with centroid $y_k$, the following formula holds:

$$\frac{1}{2|C_k|} \sum_{i : x_i \in C_k} \sum_{j : x_j \in C_k} \|x_i - x_j\|^2 = \sum_{i : x_i \in C_k} \|x_i - y_k\|^2.$$
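As a numerical sanity check of the identity in Proposition 1 (not a substitute for its proof), the sketch below evaluates both sides on a randomly generated cluster. The helper names, the random seed, and the cluster size and dimension are arbitrary assumptions made for illustration.

```python
import random

def pairwise_form(points):
    """Left-hand side: sum of squared distances over all ordered pairs, divided by 2|C|."""
    n = len(points)
    total = sum(sum((a - b) ** 2 for a, b in zip(p, q))
                for p in points for q in points)
    return total / (2 * n)

def centroid_form(points):
    """Right-hand side: sum of squared distances from each point to the centroid."""
    n, dim = len(points), len(points[0])
    y = [sum(p[d] for p in points) / n for d in range(dim)]
    return sum(sum((a - b) ** 2 for a, b in zip(p, y)) for p in points)

rng = random.Random(0)  # fixed seed so the check is reproducible
cluster = [tuple(rng.uniform(-1.0, 1.0) for _ in range(4)) for _ in range(5)]
lhs, rhs = pairwise_form(cluster), centroid_form(cluster)
print(lhs, rhs)  # the two values agree up to floating-point rounding
```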

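Proof.

\begin{align*}
\frac{1}{2|C_k|} \sum_{i : x_i \in C_k} \sum_{j : x_j \in C_k} \|x_i - x_j\|^2
&= \frac{1}{2|C_k|} \sum_{i : x_i \in C_k} \sum_{j : x_j \in C_k} \left( \|x_i\|^2 - 2\langle x_i, x_j \rangle + \|x_j\|^2 \right) \\
&= \sum_{i : x_i \in C_k} \|x_i\|^2 - \sum_{i : x_i \in C_k} \sum_{j : x_j \in C_k} \langle x_i, x_j \rangle / |C_k| \\
&= \sum_{i : x_i \in C_k} \left( \|x_i\|^2 - \langle x_i, y_k \rangle \right) \\
&= \sum_{i : x_i \in C_k} \left( \|x_i - y_k\|^2 + \langle x_i - y_k, y_k \rangle \right) \\
&= \sum_{i : x_i \in C_k} \|x_i - y_k\|^2 + \Big\langle \sum_{i : x_i \in C_k} (x_i - y_k), y_k \Big\rangle \\
&= \sum_{i : x_i \in C_k} \|x_i - y_k\|^2,
\end{align*}

where the last equality holds because $\sum_{i : x_i \in C_k} (x_i - y_k) = 0$ by the definition of the centroid. □

Consider an arbitrary instance of PiT with $|V| = 3K$ and $|E| = m$. Set $q = m$ and $W = (4m - 6K)/3$. Let $x_i$ be the $i$-th row of the incidence matrix of the graph $G$, i.e.,

$$x_{it} = \begin{cases} 1, & \text{if vertex } v_i \text{ is incident to edge } e_t; \\ 0, & \text{otherwise.} \end{cases} \tag{2}$$

Let us now suppose that the rows $x_i$, for $i = 1, \dots, |V|$, are points in $\mathbb{R}^q$. Each partition of these points into clusters of size $n/K = 3$ trivially corresponds to a partition of the vertices of the graph into subsets of three vertices. Since $n/K = 3$, Proposition 1 allows us to rewrite the objective function of balanced MSSC as:

$$\frac{1}{6} \sum_{t=1}^{q} \sum_{k=1}^{K} \sum_{i : x_i \in C_k} \sum_{j : x_j \in C_k} (x_{it} - x_{jt})^2 = \sum_{t=1}^{q} \sum_{k=1}^{K} A_{kt}, \tag{3}$$

where $A_{kt} = \big( \sum_{i : x_i \in C_k} \sum_{j : x_j \in C_k} (x_{it} - x_{jt})^2 \big)/6$ is the contribution to the objective function of the $k$-th triplet with respect to the $t$-th coordinate. If no endpoint of an edge $e_t$ lies in $C_k$, then $x_{it} = 0$ for each $x_i \in C_k$ and, clearly, $A_{kt} = 0$. Otherwise, $x_{it} = 1$ for one or two of the points $x_i \in C_k$. In both cases $A_{kt} = (0^2 + 1^2 + 1^2 + 1^2 + 0^2 + 0^2 + 1^2 + 0^2 + 0^2)/6 = 2/3$. Then the contribution of the $t$-th coordinate to the objective function is

$$A_t = \sum_{k=1}^{K} A_{kt} = \begin{cases} 2/3, & \text{if both endpoints of the edge } e_t \text{ are in the same subset } V_k \text{ corresponding to a cluster } C_k; \\ 4/3, & \text{otherwise.} \end{cases} \tag{4}$$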
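The construction in (2) and the per-coordinate contributions in (4) can be checked on a small graph. The sketch below is an illustration under assumptions: the 6-vertex graph (two triangles joined by one extra edge), the two partitions, and all function names are hypothetical choices, and exact rational arithmetic is used to avoid rounding.

```python
from fractions import Fraction

def incidence_rows(num_vertices, edges):
    """Row i has a 1 in column t iff vertex i is an endpoint of edge t, as in (2)."""
    return [[1 if i in e else 0 for e in edges] for i in range(num_vertices)]

def coordinate_contribution(rows, partition, t):
    """A_t: over all triplets, (sum of squared differences in coordinate t over ordered pairs) / 6."""
    total = Fraction(0)
    for block in partition:
        s = sum((rows[i][t] - rows[j][t]) ** 2 for i in block for j in block)
        total += Fraction(s, 6)
    return total

def partition_cost(rows, partition):
    """Objective (3): sum of A_t over all coordinates (edges)."""
    return sum(coordinate_contribution(rows, partition, t) for t in range(len(rows[0])))

def is_internal(edge, partition):
    """True if both endpoints of the edge lie in the same block."""
    return any(edge[0] in block and edge[1] in block for block in partition)

# Hypothetical instance: triangles {0,1,2} and {3,4,5} plus the extra edge (2,3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
rows = incidence_rows(6, edges)
K, m = 2, len(edges)
W = Fraction(4 * m - 6 * K, 3)

triangles = [(0, 1, 2), (3, 4, 5)]   # every block induces a triangle
mixed = [(0, 1, 3), (2, 4, 5)]       # no block induces a triangle

for partition in (triangles, mixed):
    per_edge = [coordinate_contribution(rows, partition, t) for t in range(m)]
    print(partition, [str(a) for a in per_edge], "cost =", partition_cost(rows, partition), "W =", W)
```

Each printed contribution is 2/3 for an edge inside a block and 4/3 for an edge across blocks, which matches (4) on this example.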

Denote by $b$ the number of edges with both endpoints in the same subset (corresponding to a cluster). Since each cluster has cardinality three, it covers at most three edges. Consequently, $b \le 3K$, and equality holds if and only if each $V_k$ induces a triangle. By (4), each of these $b$ edges contributes $2/3$ and each of the remaining $m - b$ edges contributes $4/3$, so the cost of the balanced partition is $4m/3 - 2b/3 \ge (4m - 6K)/3$. Hence the cost of the balanced partition is at most $W$ if and only if each $V_k$ induces a triangle in $G$. □

Acknowledgements

This research was partially supported by RFBR, projects 1607-00168 and 15-01-00462, by RSF grant 14-41-00039, and by CNPq/Brazil grants 308887/2014-0 and 400350/2014-9.

References

[1] D. Aloise, A. Deshpande, P. Hansen, P. Popat, NP-hardness of Euclidean sum-of-squares clustering, Mach. Learn. 75 (2009) 245–249.
[2] A. Bertoni, M. Goldwurm, J. Lin, F. Saccà, Size constrained distance clustering: separation properties and some complexity results, Fundam. Inform. 115 (1) (2012) 125–139.
[3] J. Desrosiers, N. Mladenović, D. Villeneuve, Design of balanced MBA student teams, J. Op. Res. Soc. 56 (1) (2005) 60–66.
[4] M. Garey, D. Johnson, Computers and Intractability, W.H. Freeman and Company, New York, 1979.
[5] L. Hagen, A.B. Kahng, New spectral methods for ratio cut partitioning and clustering, IEEE Trans. Comput. Aided Des. Integr. Circuits Syst. 11 (9) (1992) 1074–1085.
[6] A. Kelmanov, A. Pyatkin, On the complexity of some quadratic Euclidean 2-clustering problems, Comput. Math. Math. Phys. 56 (3) (2016) 491–497.
[7] M. Mahajan, P. Nimbhorkar, K. Varadarajan, The planar k-means problem is NP-hard, Lect. Notes Comput. Sci. 5431 (2009) 274–285.
[8] W. Su, J. Hu, C. Lin, S. Shen, SLA-aware tenant placement and dynamic resource provision in SaaS, in: Web Services (ICWS), 2015 IEEE International Conference on, IEEE, 2015, pp. 615–622.