
# Reducing the Dimensionality of Data with Neural Networks [Paper Translation]


## Reducing the Dimensionality of Data with Neural Networks

G. E. Hinton* and R. R. Salakhutdinov

High-dimensional data can be converted to low-dimensional codes by training a multilayer neural network with a small central layer to reconstruct high-dimensional input vectors. Gradient descent can be used for fine-tuning the weights in such ‘‘autoencoder’’ networks, but this works well only if the initial weights are close to a good solution. We describe an effective way of initializing the weights that allows deep autoencoder networks to learn low-dimensional codes that work much better than principal components analysis as a tool to reduce the dimensionality of data.

## Main Text

Dimensionality reduction facilitates the classification, visualization, communication, and storage of high-dimensional data. A simple and widely used method is principal components analysis (PCA), which finds the directions of greatest variance in the data set and represents each data point by its coordinates along each of these directions. We describe a nonlinear generalization of PCA that uses an adaptive, multilayer “encoder” network to transform the high-dimensional data into a low-dimensional code and a similar “decoder” network to recover the data from the code.
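
As a point of reference, the following is a minimal NumPy sketch of the PCA baseline described above: it finds the directions of greatest variance and represents each data point by its coordinates along them. The array sizes and random data are illustrative only, not from the paper.

```python
import numpy as np

def pca_codes(X, n_components):
    """Project data onto the top principal components (directions of
    greatest variance) and return both the codes and a reconstruction."""
    X_mean = X.mean(axis=0)
    Xc = X - X_mean
    # Principal directions via SVD of the centered data matrix.
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]          # (n_components, n_features)
    codes = Xc @ components.T               # low-dimensional coordinates
    reconstruction = codes @ components + X_mean
    return codes, reconstruction

# Example: reduce 100 random 784-dimensional vectors to 30 numbers each.
X = np.random.rand(100, 784)
codes, recon = pca_codes(X, 30)
print(codes.shape, recon.shape)             # (100, 30) (100, 784)
```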

Starting with random weights in the two networks, they can be trained together by minimizing the discrepancy between the original data and its reconstruction. The required gradients are easily obtained by using the chain rule to backpropagate error derivatives first through the decoder network and then through the encoder network (1). The whole system is called an “autoencoder” and is depicted in Fig. 1.
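
The following is a minimal NumPy sketch of this idea for a single-hidden-layer autoencoder: the reconstruction error is backpropagated first through the decoder weights and then through the encoder weights. The layer sizes, learning rate, and random stand-in data are illustrative assumptions, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

# Illustrative sizes: 784-dimensional input, 30-dimensional code.
n_vis, n_code, lr = 784, 30, 0.1
W1 = rng.normal(0, 0.1, (n_vis, n_code)); b1 = np.zeros(n_code)   # encoder
W2 = rng.normal(0, 0.1, (n_code, n_vis)); b2 = np.zeros(n_vis)    # decoder

X = rng.random((64, n_vis))          # stand-in minibatch of "images" in [0, 1]

for step in range(100):
    # Forward pass: encode to a low-dimensional code, then reconstruct.
    code = sigmoid(X @ W1 + b1)
    recon = sigmoid(code @ W2 + b2)
    err = recon - X                                   # derivative of squared error w.r.t. recon (up to a factor of 2)

    # Backpropagate: first through the decoder, then through the encoder.
    d_out = err * recon * (1 - recon)                 # logistic output units
    dW2, db2 = code.T @ d_out, d_out.sum(0)
    d_code = (d_out @ W2.T) * code * (1 - code)
    dW1, db1 = X.T @ d_code, d_code.sum(0)

    for p, g in ((W1, dW1), (b1, db1), (W2, dW2), (b2, db2)):
        p -= lr * g / len(X)                          # plain gradient descent

print("mean squared reconstruction error:", float((err ** 2).mean()))
```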

Fig. 1. Pretraining consists of learning a stack of restricted Boltzmann machines (RBMs), each having only one layer of feature detectors. The learned feature activations of one RBM are used as the ‘‘data’’ for training the next RBM in the stack. After the pretraining, the RBMs are ‘‘unrolled’’ to create a deep autoencoder, which is then fine-tuned using backpropagation of error derivatives.

It is difficult to optimize the weights in nonlinear autoencoders that have multiple hidden layers (2–4). With large initial weights, autoencoders typically find poor local minima; with small initial weights, the gradients in the early layers are tiny, making it infeasible to train autoencoders with many hidden layers. If the initial weights are close to a good solution, gradient descent works well, but finding such initial weights requires a very different type of algorithm that learns one layer of features at a time. We introduce this “pretraining” procedure for binary data, generalize it to real-valued data, and show that it works well for a variety of data sets.

An ensemble of binary vectors (e.g., images) can be modeled using a two-layer network called a “restricted Boltzmann machine” (RBM) (5, 6) in which stochastic, binary pixels are connected to stochastic, binary feature detectors using symmetrically weighted connections. The pixels correspond to “visible” units of the RBM because their states are observed; the feature detectors correspond to “hidden” units. A joint configuration (v, h) of the visible and hidden units has an energy (7) given by
$E(v,h)=-\sum_{i \in \text{pixels}} b_i v_i -\sum_{j \in \text{features}} b_j h_j -\sum_{i,j} v_i h_j w_{ij}$ (1)
where vi and hj are the binary states of pixel i and feature j, bi and bj are their biases, and wij is the weight between them. The network assigns a probability to every possible image via this energy function, as explained in (8). The probability of a training image can be raised by adjusting the weights and biases to lower the energy of that image and to raise the energy of similar, “confabulated” images that the network would prefer to the real data. Given a training image, the binary state hj of each feature detector j is set to 1 with probability $\sigma (b_j+\sum_{i}v_iw_{ij})$, where $\sigma ()$ is the logistic function 1/[1+exp(-x)], bj is the bias of j, vi is the state of pixel i, and wij is the weight between i and j. Once binary states have been chosen for the hidden units, a “confabulation” is produced by setting each vi to 1 with probability $\sigma (b_i+\sum_{j}h_jw_{ij})$, where bi is the bias of i. The states of the hidden units are then updated once more so that they represent features of the confabulation. The change in a weight is given by
$\Delta w_{ij}=\varepsilon (\langle v_i h_j\rangle_{\text{data}}-\langle v_i h_j\rangle_{\text{recon}})$ (2)
where $\varepsilon$ is a learning rate, $\langle v_i h_j\rangle_{\text{data}}$ is the fraction of times that pixel i and feature detector j are on together when the feature detectors are being driven by data, and $\langle v_i h_j\rangle_{\text{recon}}$ is the corresponding fraction for confabulations. A simplified version of the same learning rule is used for the biases. The learning works well even though it is not exactly following the gradient of the log probability of the training data (6).
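
The learning rule of Eq. 2 can be sketched in NumPy as follows. The function name, minibatch size, layer sizes, and learning rate here are illustrative assumptions; a full implementation would repeat this update over many minibatches and epochs (the authors' Matlab code is in the supporting material (8)).

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def cd1_update(V, W, b_vis, b_hid, eps=0.1):
    """One contrastive-divergence step for an RBM on a minibatch of
    binary visible vectors V, following Eq. 2 (a sketch only)."""
    # Drive the hidden units by the data and sample their binary states.
    p_h = sigmoid(V @ W + b_hid)
    H = (rng.random(p_h.shape) < p_h).astype(float)

    # Produce a "confabulation" by reconstructing the visible units ...
    p_v = sigmoid(H @ W.T + b_vis)
    V_recon = (rng.random(p_v.shape) < p_v).astype(float)
    # ... and update the hidden units once more from the confabulation.
    p_h_recon = sigmoid(V_recon @ W + b_hid)

    # Eq. 2: raise <v_i h_j> under the data, lower it under confabulations.
    pos = V.T @ p_h / len(V)
    neg = V_recon.T @ p_h_recon / len(V)
    W += eps * (pos - neg)
    b_vis += eps * (V - V_recon).mean(axis=0)
    b_hid += eps * (p_h - p_h_recon).mean(axis=0)

# Example: 784 visible pixels, 400 hidden feature detectors.
W = rng.normal(0, 0.01, (784, 400))
b_vis, b_hid = np.zeros(784), np.zeros(400)
V = (rng.random((64, 784)) > 0.5).astype(float)   # stand-in binary "images"
cd1_update(V, W, b_vis, b_hid)
```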

The autoencoder consisted of an encoder with layers of size (28 × 28)-400-200-100-50-25-6 and a symmetric decoder. The six units in the code layer were linear and all the other units were logistic. The network was trained on 20,000 images and tested on 10,000 new images. The autoencoder discovered how to convert each 784-pixel image into six real numbers that allow almost perfect reconstruction (Fig. 2A). PCA gave much worse reconstructions. Without pretraining, the very deep autoencoder always reconstructs the average of the training data, even after prolonged fine-tuning (8). Shallower autoencoders with a single hidden layer between the data and the code can learn without pretraining, but pretraining greatly reduces their total training time (8). When the number of parameters is the same, deep autoencoders can produce lower reconstruction errors on test data than shallow ones, but this advantage disappears as the number of parameters increases (8).
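
The layer-by-layer pretraining and the subsequent "unrolling" described in Fig. 1 can be sketched as follows for this (28 × 28)-400-200-100-50-25-6 architecture. The `train_rbm` placeholder stands in for the contrastive-divergence training shown earlier; the omitted decoder biases and the random stand-in data are illustrative simplifications, not the paper's procedure.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

# Layer sizes from the text: (28 x 28)-400-200-100-50-25-6.
sizes = [784, 400, 200, 100, 50, 25, 6]

def train_rbm(data, n_vis, n_hid):
    """Placeholder: a real implementation would run contrastive divergence
    on `data`; here it just returns small random weights and zero biases."""
    return rng.normal(0, 0.01, (n_vis, n_hid)), np.zeros(n_hid)

def pretrain_stack(X, sizes):
    """Greedy layer-by-layer pretraining: each RBM's hidden activations
    become the 'data' for training the next RBM in the stack."""
    weights, data = [], X
    for n_vis, n_hid in zip(sizes[:-1], sizes[1:]):
        W, b_hid = train_rbm(data, n_vis, n_hid)
        weights.append((W, b_hid))
        data = sigmoid(data @ W + b_hid)       # feature activations -> next "data"
    return weights

def unroll(weights, X):
    """Unroll the stack into a deep autoencoder: the encoder keeps the
    pretrained weights, the decoder starts from their transposes
    (biases omitted); fine-tuning by backpropagation follows."""
    act = X
    for i, (W, b) in enumerate(weights):       # encoder
        pre = act @ W + b
        act = pre if i == len(weights) - 1 else sigmoid(pre)   # linear code layer
    code = act
    for W, _ in reversed(weights):             # decoder with tied, transposed weights
        act = sigmoid(act @ W.T)
    return code, act

X = rng.random((32, 784))                      # stand-in minibatch
code, recon = unroll(pretrain_stack(X, sizes), X)
print(code.shape, recon.shape)                 # (32, 6) (32, 784)
```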

Fig. 3. (A) The two-dimensional codes for 500 digits of each class produced by taking the first two principal components of all 60,000 training images. (B) The two-dimensional codes found by a 784-1000-500-250-2 autoencoder. For an alternative visualization, see (8).

Next, we used a 784-1000-500-250-30 autoencoder to extract codes for all the handwritten digits in the MNIST training set (11). The Matlab code that we used for the pretraining and fine-tuning is available in (8). Again, all units were logistic except for the 30 linear units in the code layer. After fine-tuning on all 60,000 training images, the autoencoder was tested on 10,000 new images and produced much better reconstructions than did PCA (Fig. 2B). A two-dimensional autoencoder produced a better visualization of the data than did the first two principal components (Fig. 3).

We also used a 625-2000-1000-500-30 autoencoder with linear input units to discover 30-dimensional codes for grayscale image patches that were derived from the Olivetti face data set (12). The autoencoder clearly outperformed PCA (Fig. 2C).

Fig. 4. (A) The fraction of retrieved documents in the same class as the query when a query document from the test set is used to retrieve other test set documents, averaged over all 402,207 possible queries. (B) The codes produced by two-dimensional LSA. (C) The codes produced by a 2000-500-250-125-2 autoencoder.

When trained on documents, autoencoders produce codes that allow fast retrieval. We represented each of 804,414 newswire stories (13) as a vector of document-specific probabilities of the 2000 commonest word stems, and we trained a 2000-500-250-125-10 autoencoder on half of the stories with the use of the multiclass cross-entropy error function $-\sum_{i}p_i \log \hat{p}_i$ for the fine-tuning. The 10 code units were linear and the remaining hidden units were logistic. When the cosine of the angle between two codes was used to measure similarity, the autoencoder clearly outperformed latent semantic analysis (LSA) (14), a well-known document retrieval method based on PCA (Fig. 4). Autoencoders (8) also outperform local linear embedding, a recent nonlinear dimensionality reduction algorithm (15).
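
Retrieval with the learned codes amounts to ranking documents by the cosine of the angle between code vectors, roughly as in the following sketch. The 10-dimensional codes match the code layer described above, but the random codes and the document count are stand-ins.

```python
import numpy as np

def cosine_retrieve(query_code, codes, k=5):
    """Rank documents by the cosine of the angle between their codes and
    the query code, and return the indices of the k most similar ones."""
    q = query_code / np.linalg.norm(query_code)
    C = codes / np.linalg.norm(codes, axis=1, keepdims=True)
    sims = C @ q
    return np.argsort(-sims)[:k]

# Stand-in 10-dimensional codes for 1000 documents (illustrative only).
rng = np.random.default_rng(0)
codes = rng.normal(size=(1000, 10))
print(cosine_retrieve(codes[0], codes))
```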

Layer-by-layer pretraining can also be used for classification and regression. On a widely used version of the MNIST handwritten digit recognition task, the best reported error rates are 1.6% for randomly initialized backpropagation and 1.4% for support vector machines. After layer-by-layer pretraining in a 784-500-500-2000-10 network, backpropagation using steepest descent and a small learning rate achieves 1.2% (8). Pretraining helps generalization because it ensures that most of the information in the weights comes from modeling the images. The very limited information in the labels is used only to slightly adjust the weights found by pretraining.
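
A minimal sketch of this classification stage: a 10-way softmax output layer is placed on top of the pretrained feature layers and trained by gradient descent on the cross-entropy error. Only the output-layer update is shown; in the actual procedure the gradients are also backpropagated into the pretrained 784-500-500-2000 weights, and the stand-in activations and learning rate here are illustrative assumptions.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def finetune_step(feats, labels, W_out, b_out, lr=0.01):
    """One gradient step on a 10-way softmax output layer sitting on top of
    pretrained feature activations; returns the mean cross-entropy."""
    probs = softmax(feats @ W_out + b_out)
    onehot = np.eye(W_out.shape[1])[labels]
    d_out = (probs - onehot) / len(feats)     # gradient of cross-entropy w.r.t. logits
    W_out -= lr * feats.T @ d_out
    b_out -= lr * d_out.sum(axis=0)
    return float(-(onehot * np.log(probs + 1e-12)).sum(axis=1).mean())

rng = np.random.default_rng(0)
feats = rng.random((64, 2000))                # stand-in activations of the 2000-unit layer
labels = rng.integers(0, 10, size=64)
W_out, b_out = rng.normal(0, 0.01, (2000, 10)), np.zeros(10)
print(finetune_step(feats, labels, W_out, b_out))
```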

It has been obvious since the 1980s that backpropagation through deep autoencoders would be very effective for nonlinear dimensionality reduction, provided that computers were fast enough, data sets were big enough, and the initial weights were close enough to a good solution. All three conditions are now satisfied. Unlike nonparametric methods (15, 16), autoencoders give mappings in both directions between the data and code spaces, and they can be applied to very large data sets because both the pretraining and the fine-tuning scale linearly in time and space with the number of training cases.

## References and Notes

1. D. C. Plaut, G. E. Hinton, Comput. Speech Lang. 2, 35 (1987).
2. D. DeMers, G. Cottrell, Advances in Neural Information Processing Systems 5 (Morgan Kaufmann, San Mateo, CA, 1993), pp. 580–587.
3. R. Hecht-Nielsen, Science 269, 1860 (1995).
4. N. Kambhatla, T. Leen, Neural Comput. 9, 1493 (1997).
5. P. Smolensky, Parallel Distributed Processing: Volume 1: Foundations, D. E. Rumelhart, J. L. McClelland, Eds. (MIT Press, Cambridge, 1986), pp. 194–281.
6. G. E. Hinton, Neural Comput. 14, 1711 (2002).
7. J. J. Hopfield, Proc. Natl. Acad. Sci. U.S.A. 79, 2554 (1982).
8. See supporting material on Science Online.
9. G. E. Hinton, S. Osindero, Y. W. Teh, Neural Comput. 18, 1527 (2006).
10. M. Welling, M. Rosen-Zvi, G. Hinton, Advances in Neural Information Processing Systems 17 (MIT Press, Cambridge, MA, 2005), pp. 1481–1488.
11. The MNIST data set is available at http://yann.lecun.com/exdb/mnist/index.html.
12. The Olivetti face data set is available at www.cs.toronto.edu/~roweis/data.html.
13. The Reuter Corpus Volume 2 is available at http://trec.nist.gov/data/reuters/reuters.html.
14. S. C. Deerwester, S. T. Dumais, T. K. Landauer, G. W. Furnas, R. A. Harshman, J. Am. Soc. Inf. Sci. 41, 391 (1990).
15. S. T. Roweis, L. K. Saul, Science 290, 2323 (2000).
16. J. B. Tenenbaum, V. de Silva, J. C. Langford, Science 290, 2319 (2000).
17. We thank D. Rumelhart, M. Welling, S. Osindero, and S. Roweis for helpful discussions, and the Natural Sciences and Engineering Research Council of Canada for funding. G.E.H. is a fellow of the Canadian Institute for Advanced Research.