【论文翻译】Clustering by Passing Messages Between Data Points

2020/8/20 12:40:28   来源:

【论文题目】:Clustering by Passing Messages Between Data Points
【论文来源】:Clustering by Passing Messages Between Data Points

Clustering by Passing Messages Between Data Points

Brendan J. Frey*, Delbert Dueck



Clustering data by identifying a subset of representative examples is important for processing sensory signals and detecting patterns in data. Such “exemplars” can be found by randomly choosing an initial subset of data points and then iteratively refining it, but this works well only if that initial choice is close to a good solution. We devised a method called “affinity propagation,” which takes as input measures of similarity between pairs of data points. Real-valued messages are exchanged between data points until a high-quality set of exemplars and corresponding clusters gradually emerges. We used affinity propagation to cluster images of faces, detect genes in microarray data, identify representative sentences in this manuscript, and identify cities that are efficiently accessed by airline travel. Affinity propagation found clusters with much lower error than other methods, and it did so in less than one-hundredth the amount of time.




Clustering data based on a measure of similarity is a critical step in scientific data analysis and in engineering systems. A common approach is to use data to learn a set of centers such that the sum of squared errors between data points and their nearest centers is small. When the centers are selected from actual data points, they are called “exemplars.” The popular k-centers clustering technique (1) begins with an initial set of randomly selected exemplars and iteratively refines this set so as to decrease the sum of squared errors. k-centers clustering is quite sensitive to the initial selection of exemplars, so it is usually rerun many times with different initializations in an attempt to find a good solution. However,this works well only when the number of clusters is small and chances are good that at least one random initialization is close to a good solution. We take a quite different approach and introduce a method that simultaneously considers all data points as potential exemplars. By viewing each data point as a node in a network, we devised a method that recursively transmits real-valued messages along edges of the network until a good set of exemplars and corresponding clusters emerges.As described later, messages are updated on the basis of simple formulas that search for minima of an appropriately chosen energy function. At any point in time, the magnitude of each message reflects the current affinity that one data point has for choosing another data point as its exemplar, so we call our method “affinity propagation.” Figure 1A illustrates how clusters gradually emerge during the message-passing procedure.


Affinity propagation takes as input a collection of real-valued similarities between data points, where the similarity s(i,k) indicates how well the data point with index k is suited to be the exemplar for data point i. When the goal is to minimize squared error, each similarity is set to a negative squared error (Euclidean distance): For points xix_{i} and xkx_{k}, s(i,k) =−xixk2\left \| x_{i} -x_{k}\right \|^{2}. Indeed, the method described here can be applied when the optimization criterion is much more general. Later, we describe tasks where similarities are derived for pairs of images, pairs of microarray measurements, pairs of English sentences, and pairs of cities. When an exemplar-dependent probability model is available, s(i,k) can be set to the log-likelihood of data point i given that its exemplar is point k. Alternatively, when appropriate, similarities may be set by hand.

亲和力的传播将数据点间的实值相似性的集合作为输入,其中,相似度s(i,k)表示索引为k的数据点很适合作为数据点i的样本。当目标是最小化平方误差时,将每个相似度被设置为一个负的平方误差(欧氏距离):对于点xix_{i}xkx_{k}, s(i,k) =−xixk2\left \| x_{i} -x_{k}\right \|^{2},实际上,这里描述的方法可以在优化准则更一般的情况下应用。随后,我们描述的任务中,相似性是由成对的图像、成对的微阵列测量、成对的英语句子和成对的城市得出的。当一个样本相关的概率模型可用时,s(i,k)可以被设为数据点i的对数似然,假设它的样本是点k。或者,在适当的时候,可以手工设置相似性。

Rather than requiring that the number of clusters be prespecified, affinity propagation takes as input a real number s(k,k) for each data point k so that data points with larger values of s(k,k) are more likely to be chosen as exemplars. These values are referred to as “preferences.” The number of identified exemplars (number of clusters) is influenced by the values of the input preferences, but also emerges from the message-passing procedure. If a priori, all data points are equally suitable as exemplars, the preferences should be set to a common value—this value can be varied to produce different numbers of clusters. The shared value could be the median of the input similarities (resulting in a moderate number of clusters) or their minimum (resulting in a small number of clusters).


There are two kinds of message exchanged between data points, and each takes into account a different kind of competition. Messages can be combined at any stage to decide which points are exemplars and, for every other point, which exemplar it belongs to. The“responsibility” r(i,k), sent from data point i to candidate exemplar point k, reflects the accumulated evidence for how well-suited point k is to serve as the exemplar for point i, taking into account other potential exemplars for point i (Fig. 1B). The “availability” a(i,k), sent from candidate exemplar point k to point i,reflects the accumulated evidence for how appropriate it would be for point i to choose point k as its exemplar, taking into account the support from other points that point k should be an exemplar (Fig. 1C). r(i,k) and a(i,k) can be viewed as log-probability ratios. To begin with, the availabilities are initialized to zero:a(i,k) = 0. Then, the responsibilities are computed using the rule

In the first iteration, because the availabilities are zero, r(i,k) is set to the input similarity between point i and point k as its exemplar,minus the largest of the similarities between point i and other candidate exemplars. This competitive update is data-driven and does not take into account how many other points favor each candidate exemplar. In later iterations, when some points are effectively assigned to other exemplars, their availabilities will drop below zero as prescribed by the update rule below. These negative availabilities will decrease the effective values of some of the input similarities s(i,k′) in the above rule, removing the corresponding candidate exemplars from competition. For k = i, the responsibility r(k,k) is set to the input preference that point k be chosen as an exemplar, s(k,k), minus the largest of the similarities between point i and all other candidate exemplars. This “self-responsibility” reflects accumulated evidence that point k is an exemplar, based on its input preference tempered by how ill-suited it is to be assigned to another exemplar.
数据点之间交换的消息有两种,每一种都考虑到竞争的不同类型。消息可以在任何阶段进行组合,以决定哪些点是样本,对于每个其他的点,它属于哪些样本。“责任”r(i,k),从数据点i发送到候选样本点k,反映了积累的证据,表明k点是多么适合作为第i点的样本,考虑到第i点的其他潜在样本(图1B)。“可用性”a(i,k),从候选样本点k发送到点i,反映了积累的证据,说明点i选择点k作为它的样本是多么合适,考虑到其他点的支持,认为点k应该是一个样本(图1C)。r(i,k) 和 a(i,k)可以看成是对数概率比。首先,可用性被初始化为零:a(i,k) = 0。然后,使用规则计算职责

在第一次迭代中,由于可用性为0,r(i,k)设为作为样本的点i和点k之间的输入相似度,减去点i和其他候选样本之间的相似度的最大值。这个竞争性的更新是数据驱动的,并且没有考虑有多少其他点有利于每个候选样本。在以后的迭代中,当一些点被有效地分配给其他样本时,它们的可用性将会降到0以下,这是下面的更新规则所规定的。这些负可用性会降低上述规则中某些输入相似点s(i,k’)的有效值,使相应的候选样本从竞争中消失。对于k = i,责任r(k,k)被设为点k被选为样本的输入偏好s(k,k),减去点i和所有候选样本之间最大的相似性。这种“自我责任”反映了k点是一个样本的积累证据,基于它的输入偏好再加上它有多不适合被分配到另一个范例。

Whereas the above responsibility update lets all candidate exemplars compete for ownership of a data point, the following availability update gathers evidence from data points as to whether each candidate exemplar would make a good exemplar:

The availability a(i,k) is set to the self-responsibility r(k,k) plus the sum of the positive responsibilities candidate exemplar k receives from other points. Only the positive portions of incoming responsibilities are added, because it is only necessary for a good exemplar to explain some data points well (positive responsibilities),regardless of how poorly it explains other data points (negative responsibilities). If the self-responsibility r(k,k) is negative (indicating that point k is currently better suited as belonging to another exemplar rather than being an exemplar itself), the availability of point k as an exemplar can be increased if some other points have positive responsibilities for point k being their exemplar. To limit the influence of strong incoming positive responsibilities, the total sum is thresholded so that it cannot go above zero. The “self-availability” a(k,k) is updated differently:

This message reflects accumulated evidence that point k is an exemplar, based on the positive responsibilities sent to candidate exemplar k from other points.


只增加传入责任的积极部分,因为一个好的范例只需要很好地解释一些数据点(积极的责任),而不管它对其他数据点的解释有多差(消极的责任)。如果自我责任r (k, k)是负的(表明点k是目前适合属于另一个样本,而不是作为一个样本本身),如果其他点对作为范例的点有积极的责任,那么k点作为范例的可用性就会增加。为了限制传入的积极责任的影响,总和是有阈值的,因此不能超过零。“自我可用性”a(k,k)的更新方式不同:


The above update rules require only simple,local computations that are easily implemented (2), and messages need only be exchanged between pairs of points with known similarities. At any point during affinity propagation, availabilities and responsibilities can be combined to identify exemplars. For point i, the value of k that maximizes a(i,k) + r(i,k) either identifies point i as an exemplar if k = i, or identifies the data point that is the exemplar for point i. The message-passing procedure may be terminated after a fixed number of iterations, after changes in the messages fall below a threshold, or after the local decisions stay constant for some number of iterations. When updating the messages,it is important that they be damped to avoid numerical oscillations that arise in some circumstances. Each message is set to l times its value from the previous iteration plus 1 – λ times its prescribed updated value, where the damping factor l is between 0 and 1. In all of our experiments (3), we used a default damping factor of λ = 0.5, and each iteration of affinity propagation consisted of (i) updating all responsibilities given the availabilities, (ii) updating all availabilities given the responsibilities, and (iii) combining availabilities and responsibilities to monitor the exemplar decisions and terminate the algorithm when these decisions did not change for 10 iterations.
上面的更新规则只需要简单的、容易实现的本地计算(2),并且只需要在已知相似点对之间交换消息。在亲和传播期间的任何点,可用性和职责都可以组合起来识别样本。对于点i,使a(i,k) + r(i,k)最大化的k的值,如果k = i,可以将点i识别为一个样本,或者识别为点i的样本数据点。消息传递过程可以在固定次数的迭代之后终止,也可以在消息中的更改低于阈值之后终止,或者在本地决策一定次数的迭代中保持不变之后终止。在更新信息时,重要的是要对它们进行阻尼,以避免在某些情况下出现的数值振荡。每条消息被设置为λ 乘以它在以前的迭代中得到的值加上1 -λ 乘以它所规定的更新值,其中阻尼因子λ 在0到1之间。(3)我们所有的实验中,我们使用一个默认的阻尼因子λ = 0.5,和每次迭代的亲和力传播包括(i)更新所有责任考虑到可用性,(2)更新所有可用性的责任,和(3)结合可用性和责任监督样本决定,终止算法当10次迭代中这些决定不改变。

Figure 1A shows the dynamics of affinity propagation applied to 25 two-dimensional data points (3), using negative squared error as the similarity. One advantage of affinity propagation is that the number of exemplars need not be specified beforehand. Instead, the appropriate number of exemplars emerges from the message passing method and depends on the input exemplar preferences. This enables automatic model selection, based on a prior specification of how preferable each point is as an exemplar. Figure 1D shows the effect of the value of the common input preference on the number of clusters. This relation is nearly identical to the relation found by exactly minimizing the squared error (2).

We next studied the problem of clustering images of faces using the standard optimization criterion of squared error. We used both affinity propagation and k-centers clustering to identify exemplars among 900 grayscale images extracted from the Olivetti face database (3). Affinity propagation found exemplars with much lower squared error than the best of 100 runs of k-centers clustering (Fig. 2A), which took about the same amount of computer time.We asked whether a huge number of random restarts of k-centers clustering could achieve the same squared error. Figure 2B shows the error achieved by one run of affinity propagation and the distribution of errors achieved by 10,000 runs of k-centers clustering, plotted against the number of clusters. Affinity propagation uniformly achieved much lower error in more than two orders of magnitude less time. Another popular optimization criterion is the sum of absolute pixel differences (which better tolerates outlying pixel intensities), so we repeated the above procedure using this error measure. Affinity propagation again uniformly achieved lower error (Fig. 2C).

Many tasks require the identification of exemplars among sparsely related data, i.e., where most similarities are either unknown or large and negative. To examine affinity propagation in this context, we addressed the task of clustering putative exons to find genes, using the sparse similarity matrix derived from microarray data and reported in (4). In that work, 75,066 segments of DNA (60 bases long) corresponding to putative exons were mined from the genome of mouse chromosome 1. Their transcription levels were measured across 12 tissue samples, and the similarity between every pair of putative exons (data points) was computed. The measure of similarity between putative exons was based on their proximity in the genome and the degree of coordination of their transcription levels across the 12 tissues. To account for putative exons that are not exons (e.g., introns), we included an additional artificial exemplar and determined the similarity of each other data point to this “non-exon exemplar” using statistics taken over the entire data set. The resulting 75,067 × 75,067 similarity matrix (3) consisted of 99.73% similarities with values of −∞, corresponding to distant DNA segments that could not possibly be part of the same gene. We applied affinity propagation to this similarity matrix, but because messages need not be exchanged between point i and k if s(i,k) = −∞, each iteration of affinity propagation required exchanging messages between only a tiny subset (0.27% or 15million) of data point pairs.

Figure 3A illustrates the identification of gene clusters and the assignment of some data points to the nonexon exemplar. The reconstruction errors for affinity propagation and k-centers clustering are compared in Fig. 3B.For each number of clusters, affinity propagation was run once and took 6 min, whereas k-centers clustering was run 10,000 times and took 208 hours. To address the question of how well these methods perform in detecting bona fide gene segments, Fig. 3C plots the truepositive (TP) rate against the false-positive (FP) rate, using the labels provided in the RefSeq database (5). Affinity propagation achieved significantly higher TP rates, especially at low FP rates, which are most important to biologists. At a FP rate of 3%, affinity propagation achieved a TP rate of 39%, whereas the best k-centers clustering result was 17%. For comparison, at the same FP rate, the best TP rate for hierarchical agglomerative clustering (2) was 19%, and the engineering tool described in (4), which accounts for additional biological knowledge, achieved a TP rate of 43%.

Affinity propagation’s ability to operate on the basis of nonstandard optimization criteria makes it suitable for exploratory data analysis using unusual measures of similarity. Unlike metricspace clustering techniques such as k-means clustering (1), affinity propagation can be applied to problems where the data do not lie in a continuous space. Indeed, it can be applied to problems where the similarities are not symmetric [i.e., s(i,k) ≠ s(k,i)] and to problems where the similarities do not satisfy the triangle inequality [i.e., s(i,k) < s(i,j) + s( j,k)]. To identify a small number of sentences in a draft of this manuscript that summarize other sentences, we treated each sentence as a “bag of words” (6) and computed the similarity of sentence i to sentence k based on the cost of encoding the words in sentence i using the words in sentence k. We found that 97% of the resulting similarities (2, 3) were not symmetric. The preferences were adjusted to identify (using l = 0.8) different numbers of representative exemplar sentences (2), and the solution with four sentences is shown in Fig. 4A.

亲和力传播基于非标准优化准则的操作能力使得它适合于使用不寻常的相似性度量进行探索性数据分析。不像度量空间聚类的技术,例如k-均值聚类(1),亲和传播可以应用于数据不在连续空间中的问题。实际上,它可以被应用于相似性不对称的问题[即s(i,k)≠s(k,i)]和相似性不满足三角形不等式[即s(i,k) < s(i,j) + s( j,k)]的问题。为了找出本文草稿中总结其他句子的少量句子,我们将每个句子视为一个“单词包”(6),并根据k句子中单词的编码成本计算出句子与k句子的相似度。我们发现97%的相似性(2,3)不是对称的。调整偏好以识别(使用l=0.8)不同数量的代表性例句(2),四个句子的解决方案如图4A所示。

We also applied affinity propagation to explore the problem of identifying a restricted number of Canadian and American cities that are most easily accessible by large subsets of other cities, in terms of estimated commercial airline travel time. Each data point was a city, and the similarity s(i,k) was set to the negative time it takes to travel from city i to city k by airline, including estimated stopover delays (3). Due to headwinds, the transit time was in many cases different depending on the direction of travel, so that 36% of the similarities were asymmetric. Further, for 97% of city pairs i and k, there was a third city j such that the triangle inequality was violated, because the trip from i to k included a long stopover delay n city j so it took longer than the sum of the durations of the trips from i to j and j to k. When the number of “most accessible cities” was constrained to be seven (by adjusting the input preference appropriately), the cities shown in Fig. 4, B to E, were identified. It is interesting that several major cities were not selected, either because heavy international travel makes them inappropriate as easily accessible domestic destinations (e.g., New YorkCity, Los Angeles) or because their neighborhoods can be more efficiently accessed through other destinations (e.g., Atlanta, Philadelphia, and Minneapolis account for Chicago’s destinations, while avoiding potential airport delays).

Affinity propagation can be viewed as a method that searches for minima of an energy function (7) that depends on a set of N hidden labels, c1,…,cN, corresponding to the N data points. Each label indicates the exemplar to which the point belongs, so that s(i,ci) is the similarity of data point i to its exemplar. ci = i is a special case indicating that point i is itself an exemplar, so that s(i,ci) is the input preference for point i. Not all configurations of the labels are valid; a configuration c is valid when for every point i, if some other point i′ has chosen i as its exemplar (i.e., ci′ = i), then i must be an exemplar (i.e., ci = i). The energy of a valid configuration is E(c)=i=1Ns(i,ci).E(c)=-\sum_{i=1}^{N}s(i,c_{i}). Exactly minimizing the energy is computationally intractable, because a special case of this minimization problem is the NP-hard k-median problem (8). However, the update rules for affinity propagation correspond to fixed-point recursions for minimizing a Bethe free-energy (9) approximation. Affinity propagation is most easily derived as an instance of the max-sum algorithm in a factor graph (10) describing the constraintson the labels and the energy function (2).
亲和传播可以看作是一种搜索能量函数(7)的极小值的方法,该能量函数依赖于与N个数据点相对应的一组N个隐藏标签c1,…,cN。每个标签都表示该点所属的样本,因此s(i,Ci)是数据点i与其范例的相似性。Ci=i是一个特例,表明i点本身就是一个样本,因此s(i,Ci)是点i的输入偏好。并非所有标签的配置都有效;配置c是有效的,当对于每个点i,如果其他点 i’ 选择i作为其样本(即Ci=i),那么i必须是一个范例(即Ci=i)。有效配置的能量为E(c)=i=1Ns(i,ci).E(c)=-\sum_{i=1}^{N}s(i,c_{i}).精确地最小化能量在计算上是困难的,因为这个最小化问题的一个特例是NP难k-中值问题(8)。然而,亲合传播的更新规则对应于最小化Bethe自由能(9)近似的不动点递归。亲和力传播作为因子图(10)中描述标签约束和能量函数(2)的最大和算法的实例最容易导出。然而,亲合传播的更新规则对应于最小化Bethe自由能(9)近似的不动点递归。亲和力传播作为因子图(10)中描述标签约束和能量函数(2)的最大和算法的实例最容易导出。

In some degenerate cases, the energy function may have multiple minima with corresponding multiple fixed points of the update rules, and these may prevent convergence. For example, if s(1,2) = s(2,1) and s(1,1) = s(2,2), then the solutions c1 = c2 = 1 and c1 = c2 = 2 both achieve the same energy. In this case, affinity propagation may oscillate, with both data points alternating between being exemplars and nonexemplars. In practice, we found that oscillations could always be avoided by adding a tiny amount of noise to the similarities to prevent degenerate situations,or by increasing the damping factor.

Affinity propagation has several advantages over related techniques. Methods such as k-centers clustering (1), k-means clustering(1), and the expectation maximization (EM) algorithm (11) store a relatively small set of estimated cluster centers at each step. These techniques are improved upon by methods that begin with a large number of clusters and then prune them (12), but they still rely on random sampling and make hard pruning decisions that cannot be recovered from. In contrast, by simultaneously considering all data points as candidate centers and gradually identifying clusters, affinity propagation is able to avoid many of the poor solutions caused by unlucky initializations and hard decisions. Markov chain Monte Carlo techniques (13) randomly search for good solutions, but do not share affinity propagation’s advantage of considering many possible solutions all at once.

Hierarchical agglomerative clustering (14) and spectral clustering (15) solve the quite different problem of recursively comparing pairs of points to find partitions of the data. These techniques do not require that all points within a cluster be similar to a single center and are thus not well-suited to many tasks. In particular, two points that should not be in the same cluster may be grouped together by an unfortunate sequence of pairwise groupings.

In (8), it was shown that the related metric k-median problem could be relaxed to form a linear program with a constant factor approximation. There, the input was assumed to be metric, i.e., nonnegative, symmetric, and satisfying the triangle inequality. In contrast, affinity propagation can take as input general nonmetric similarities. Affinity propagation also provides a conceptually new approach that works well in practice. Whereas the linear programming relaxation is hard to solve and sophisticated software packages need to be applied (e.g., CPLEX), affinity propagation makes use of intuitive message updates that can be implemented in a few lines of code (2).
在(8)中,证明了相关的度量k-中值问题可以放宽到常系数近似线性规划。在这里,假设输入是度量的,即非负的,对称的,满足三角不等式的。相反,亲和传播可以将一般非度量相似性作为输入。亲和力传播还提供了一种概念上新的方法,在实践中效果良好。虽然线性规划松弛很难解决,并且需要应用复杂的软件包(例如 ,CPLEX),但亲和传播利用了可以在几行代码中实现的直观消息更新(2)。

Affinity propagation is related in spirit to techniques recently used to obtain record-breaking results in quite different disciplines (16). The approach of recursively propagating messages (17) i n a “loopy graph” has been used to approach Shannon’s limit in error-correcting decoding (18, 19), solve random satisfiability problems with an order-of-magnitude increase in size (20), solve instances of the NP-hard twodimensional phase-unwrapping problem (21), and efficiently estimate depth from pairs of stereo images (22). Yet, to our knowledge, affinity propagation is the first method to make use of this idea to solve the age-old, fundamental problem of clustering data. Because of its simplicity, general applicability, and performance, we believe affinity propagation will prove to be of broad value in science and engineering.