Toward Large Kernel Models

Amirhesam Abedsoltan 1, Mikhail Belkin 1,2, Parthe Pandit 2

1 Department of Computer Science and Engineering, UC San Diego, USA
2 Halicioglu Data Science Institute, UC San Diego, USA

Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).
Abstract

Recent studies indicate that kernel machines can often perform similarly to or better than deep neural networks (DNNs) on small datasets. The interest in kernel machines has been additionally bolstered by the discovery of their equivalence to wide neural networks in certain regimes. However, a key feature of DNNs is their ability to scale the model size and training data size independently, whereas in traditional kernel machines the model size is tied to the data size. Because of this coupling, scaling kernel machines to large data has been computationally challenging. In this paper, we provide a way forward for constructing large-scale general kernel models, a generalization of kernel machines that decouples the model from the data, allowing training on large datasets. Specifically, we introduce EigenPro 3.0, an algorithm based on projected dual preconditioned SGD, and show scaling to model and data sizes that have not been possible with existing kernel methods. We provide a PyTorch-based implementation which can take advantage of multiple GPUs.
1. Introduction
Deep neural networks (DNNs) have become the gold standard for many large-scale machine learning tasks. Two key factors that contribute to the success of DNNs are the large model sizes and the large number of training samples. Quoting from (Kaplan et al., 2020): “performance depends most strongly on scale, which consists of three factors: the number of model parameters N (excluding embeddings), the size of the dataset D, and the amount of compute C used for training. Within reasonable limits, performance depends very weakly on other architectural hyperparameters such as depth vs. width”. Major community effort and a great amount of resources have been invested in scaling models and data
size, as well as in understanding the relationship between the number of model parameters, compute, data size, and performance. Many current architectures have hundreds of billions of parameters and are trained on large datasets with nearly a trillion data points (e.g., Table 1 in (Hoffmann et al., 2022)). Scaling both the model size and the number of training samples is seen as crucial for optimal performance.
Recently, there has been a surge in research on the equivalence of special cases of DNNs and kernel machines. For instance, the Neural Tangent Kernel (NTK) has been used to understand the behavior of fully-connected DNNs in the infinite width limit by using a fixed kernel (Jacot et al., 2018). A rather general version of that phenomenon was shown in (Zhu et al., 2022). Similarly, the Convolutional Neural Tangent Kernel (CNTK) (Li et al., 2019) is the NTK for convolutional neural networks, and has been shown to achieve accuracy comparable to AlexNet (Krizhevsky et al., 2012) on the CIFAR10 dataset.
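For readers less familiar with this line of work, the tangent kernel at the heart of these equivalences is simply the inner product of parameter gradients of the network; the display below is the standard definition rather than notation introduced in this paper, and in the infinite-width limit with appropriate scaling this kernel becomes fixed at initialization and remains constant during training (Jacot et al., 2018):

\[ K_{\theta}(x, x') \;=\; \bigl\langle \nabla_{\theta} f(x;\theta),\ \nabla_{\theta} f(x';\theta) \bigr\rangle . \]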
These developments have sparked interest in the potential of kernel machines as an alternative to DNNs. Kernel machines are relatively well understood theoretically, are stable, somewhat interpretable, and have been shown to perform similarly to DNNs on small datasets (Arora et al., 2020; Lee et al., 2020; Radhakrishnan et al., 2022b), particularly on tabular data (Geifman et al., 2020; Radhakrishnan et al., 2022a). However, in order for kernels to be a viable alternative to DNNs, it is necessary to develop methods to scale kernel machines to large datasets.
The problem of scaling. Similarly to DNNs, to achieve optimal performance of kernel models it is not sufficient to just increase the size of the training set for a fixed model size; the model size must scale as well. Fig. 1 illustrates this property on a small-scale example (see Appendix D.2 for the details). The figure demonstrates that the best performance cannot be achieved solely by increasing the dataset size. Once the model reaches its capacity, adding more data leads to marginal, if any, performance improvements. On the other hand, we see that the saturation point for each model is not reached until the number of samples significantly exceeds the model size. This illustration highlights the need for algorithms that can independently scale dataset size and model size for optimal performance.
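To make the decoupling concrete, it may help to contrast the two model classes. By the representer theorem, a classical kernel machine trained on n samples x_1, ..., x_n has one coefficient per training point, so its size is forced to equal n; a general kernel model instead places p centers z_1, ..., z_p that need not belong to the training set, so p and n can be scaled independently. The notation below is a schematic summary, not a formula quoted verbatim from the paper:

\[ \underbrace{f(x) = \sum_{i=1}^{n} \alpha_i\, K(x, x_i)}_{\text{kernel machine: model size } = n} \qquad\text{vs.}\qquad \underbrace{f(x) = \sum_{j=1}^{p} \alpha_j\, K(x, z_j)}_{\text{general kernel model: model size } = p} \]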
A Python package is available at github.com/EigenPro3
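As a rough, self-contained illustration of the modeling idea only, and not of the package's actual API or of the projected dual preconditioned SGD algorithm itself, the sketch below fits a general kernel model whose p centers are fixed independently of the n training points, using plain mini-batch SGD in PyTorch; all names, the kernel choice, and the hyperparameters are invented for the example.

import torch

# Minimal sketch of a general kernel model: p centers decoupled from n data points.
def laplacian_kernel(X, Z, bandwidth=10.0):
    # K[i, j] = exp(-||X[i] - Z[j]|| / bandwidth)
    return torch.exp(-torch.cdist(X, Z) / bandwidth)

n, p, d = 10_000, 512, 32       # n training points, p model centers (p chosen independently of n)
X = torch.randn(n, d)           # toy inputs
y = torch.sin(X[:, :1])         # toy regression targets
Z = torch.randn(p, d)           # model centers; could also be a subsample of X, a grid, etc.
alpha = torch.zeros(p, 1)       # one weight per center: model size is p, not n

lr, batch_size = 0.1, 256
for step in range(500):
    idx = torch.randint(0, n, (batch_size,))
    K_batch = laplacian_kernel(X[idx], Z)       # (batch_size, p) kernel evaluations
    residual = K_batch @ alpha - y[idx]         # prediction error on the mini-batch
    grad = K_batch.T @ residual / batch_size    # stochastic gradient with respect to alpha
    alpha -= lr * grad                          # plain SGD step (no preconditioning or projection here)

# Inference touches only the p centers, regardless of how large n was.
train_mse = ((laplacian_kernel(X, Z) @ alpha - y) ** 2).mean()
print(f"toy training MSE: {train_mse.item():.4f}")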