Toward Large Kernel Models

Amirhesam Abedsoltan 1, Mikhail Belkin 1,2, Parthe Pandit 2

1 Department of Computer Science and Engineering, UC San Diego, USA
2 Halicioglu Data Science Institute, UC San Diego, USA

Proceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA. PMLR 202, 2023. Copyright 2023 by the author(s).
Abstract

Recent studies indicate that kernel machines can often perform similarly to or better than deep neural networks (DNNs) on small datasets. The interest in kernel machines has been additionally bolstered by the discovery of their equivalence to wide neural networks in certain regimes. However, a key feature of DNNs is their ability to scale the model size and training data size independently, whereas in traditional kernel machines the model size is tied to the data size. Because of this coupling, scaling kernel machines to large data has been computationally challenging. In this paper, we provide a way forward for constructing large-scale general kernel models, a generalization of kernel machines that decouples the model from the data, allowing training on large datasets. Specifically, we introduce EigenPro 3.0, an algorithm based on projected dual preconditioned SGD, and show scaling to model and data sizes that have not been possible with existing kernel methods. We provide a PyTorch-based implementation which can take advantage of multiple GPUs.
1. Introduction
Deep neural networks (DNNs) have become the gold standard for many large-scale machine learning tasks. Two key factors that contribute to the success of DNNs are the large model sizes and the large number of training samples. Quoting from (Kaplan et al., 2020): “performance depends most strongly on scale, which consists of three factors: the number of model parameters N (excluding embeddings), the size of the dataset D, and the amount of compute C used for training. Within reasonable limits, performance depends very weakly on other architectural hyperparameters such as depth vs. width”. Major community effort and a great amount of resources have been invested in scaling models and data
size, as well as in understanding the relationship between the number of model parameters, compute, data size, and performance. Many current architectures have hundreds of billions of parameters and are trained on large datasets with nearly a trillion data points (e.g., Table 1 in (Hoffmann et al., 2022)). Scaling both the model size and the number of training samples is seen as crucial for optimal performance.
Recently, there has been a surge in research on the equivalence of special cases of DNNs and kernel machines. For instance, the Neural Tangent Kernel (NTK) has been used to understand the behavior of fully-connected DNNs in the infinite width limit by using a fixed kernel (Jacot et al., 2018). A rather general version of that phenomenon was shown in (Zhu et al., 2022). Similarly, the Convolutional Neural Tangent Kernel (CNTK) (Li et al., 2019) is the NTK for convolutional neural networks, and has been shown to achieve accuracy comparable to AlexNet (Krizhevsky et al., 2012) on the CIFAR10 dataset.
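For readers less familiar with this line of work, the tangent kernel at the heart of these equivalences is simply the inner product of parameter gradients of the network; the display below is the standard definition rather than notation introduced in this paper, and in the infinite-width limit with appropriate scaling this kernel becomes fixed at initialization and remains constant during training (Jacot et al., 2018):

\[ K_{\theta}(x, x') \;=\; \bigl\langle \nabla_{\theta} f(x;\theta),\ \nabla_{\theta} f(x';\theta) \bigr\rangle . \]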
These developments have sparked interest in the potential of kernel machines as an alternative to DNNs. Kernel machines are relatively well understood theoretically, are stable, somewhat interpretable, and have been shown to perform similarly to DNNs on small datasets (Arora et al., 2020; Lee et al., 2020; Radhakrishnan et al., 2022b), particularly on tabular data (Geifman et al., 2020; Radhakrishnan et al., 2022a). However, in order for kernels to be a viable alternative to DNNs, it is necessary to develop methods to scale kernel machines to large datasets.
The problem of scaling. Similarly to DNNs, to achieve optimal performance of kernel models it is not sufficient to just increase the size of the training set for a fixed model size; the model size must scale as well. Fig. 1 illustrates this property on a small-scale example (see Appendix D.2 for the details). The figure demonstrates that the best performance cannot be achieved solely by increasing the dataset size. Once the model reaches its capacity, adding more data leads to marginal, if any, performance improvements. On the other hand, we see that the saturation point for each model is not reached until the number of samples significantly exceeds the model size. This illustration highlights the need for algorithms that can independently scale dataset size and model size for optimal performance.
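To make the decoupling concrete, it may help to contrast the two model classes. By the representer theorem, a classical kernel machine trained on n samples x_1, ..., x_n has one coefficient per training point, so its size is forced to equal n; a general kernel model instead places p centers z_1, ..., z_p that need not belong to the training set, so p and n can be scaled independently. The notation below is a schematic summary, not a formula quoted verbatim from the paper:

\[ \underbrace{f(x) = \sum_{i=1}^{n} \alpha_i\, K(x, x_i)}_{\text{kernel machine: model size } = n} \qquad\text{vs.}\qquad \underbrace{f(x) = \sum_{j=1}^{p} \alpha_j\, K(x, z_j)}_{\text{general kernel model: model size } = p} \]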
A Python package is available at github.com/EigenPro3
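As a rough, self-contained illustration of the modeling idea only, and not of the package's actual API or of the projected dual preconditioned SGD algorithm itself, the sketch below fits a general kernel model whose p centers are fixed independently of the n training points, using plain mini-batch SGD in PyTorch; all names, the kernel choice, and the hyperparameters are invented for the example.

import torch

# Minimal sketch of a general kernel model: p centers decoupled from n data points.
def laplacian_kernel(X, Z, bandwidth=10.0):
    # K[i, j] = exp(-||X[i] - Z[j]|| / bandwidth)
    return torch.exp(-torch.cdist(X, Z) / bandwidth)

n, p, d = 10_000, 512, 32       # n training points, p model centers (p chosen independently of n)
X = torch.randn(n, d)           # toy inputs
y = torch.sin(X[:, :1])         # toy regression targets
Z = torch.randn(p, d)           # model centers; could also be a subsample of X, a grid, etc.
alpha = torch.zeros(p, 1)       # one weight per center: model size is p, not n

lr, batch_size = 0.1, 256
for step in range(500):
    idx = torch.randint(0, n, (batch_size,))
    K_batch = laplacian_kernel(X[idx], Z)       # (batch_size, p) kernel evaluations
    residual = K_batch @ alpha - y[idx]         # prediction error on the mini-batch
    grad = K_batch.T @ residual / batch_size    # stochastic gradient with respect to alpha
    alpha -= lr * grad                          # plain SGD step (no preconditioning or projection here)

# Inference touches only the p centers, regardless of how large n was.
train_mse = ((laplacian_kernel(X, Z) @ alpha - y) ** 2).mean()
print(f"toy training MSE: {train_mse.item():.4f}")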