PAC-tuning: Fine-tuning Pre-trained Language Models with PAC-driven Perturbed Gradient Descent

Abstract

Fine-tuning pretrained language models (PLMs) for downstream tasks is a large-scale optimization problem, in which the choice of the training algorithm critically determines how well the trained model generalizes to unseen test data, especially in the context of few-shot learning. To achieve good generalization performance and avoid overfitting, techniques such as data augmentation and pruning are often applied. However, adding these regularizations necessitates heavy tuning of the hyperparameters of optimization algorithms, such as the popular Adam optimizer. In this paper, we propose a two-stage fine-tuning method, PAC-tuning, to address this optimization challenge. First, based on PAC training, PAC-tuning directly minimizes the PAC-Bayes generalization bound to learn proper parameter distribution variances. Second, PAC-tuning modifies the gradient by injecting noise with the variances learned in the first stage into the model parameters during training, a variant of perturbed gradient descent (PGD). In the few-shot setting, however, minimizing the PAC-Bayes generalization bound of an overparameterized model such as a PLM and injecting noise into its parameters are nontrivial tasks. Our experimental results across five GLUE benchmark tasks demonstrate that PAC-tuning successfully handles these challenging fine-tuning tasks and outperforms strong baseline methods by a visible margin, further confirming the potential of applying PAC training to other settings where the Adam optimizer is currently used for training.
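To make the two stages concrete, below is a minimal PyTorch-style sketch, assuming a generic `model`, `loss_fn`, and batches `(x, y)`. All function names (`make_log_sigma`, `stage1_loss`, `stage2_pgd_step`) and the simple squared-sigma complexity penalty are illustrative assumptions, not the authors' released implementation; the paper derives the exact PAC-Bayes bound that Stage 1 minimizes.

```python
import torch
from torch.func import functional_call

def make_log_sigma(model, init=-6.0):
    # One learnable log-std per parameter tensor, initialized small so the
    # initial perturbations are near zero.
    return {name: torch.full_like(p, init, requires_grad=True)
            for name, p in model.named_parameters()}

def stage1_loss(model, loss_fn, x, y, log_sigma, kl_weight=1e-4):
    # Stage 1: reparameterization trick, w = mu + exp(log_sigma) * eps, so
    # gradients flow to both the weights (mu) and the noise scales (log_sigma).
    params = dict(model.named_parameters())
    sampled = {name: p + torch.randn_like(p) * log_sigma[name].exp()
               for name, p in params.items()}
    data_loss = loss_fn(functional_call(model, sampled, (x,)), y)
    # Crude stand-in for the complexity (KL) term of the PAC-Bayes bound.
    complexity = sum((ls.exp() ** 2).sum() for ls in log_sigma.values())
    return data_loss + kl_weight * complexity

def stage2_pgd_step(model, loss_fn, x, y, log_sigma, optimizer):
    # Stage 2: perturbed gradient descent with the learned variances. Perturb
    # the weights, take gradients at the perturbed point, restore the weights,
    # then apply the optimizer update to the clean weights.
    noise = {}
    with torch.no_grad():
        for name, p in model.named_parameters():
            noise[name] = torch.randn_like(p) * log_sigma[name].exp()
            p.add_(noise[name])
    loss = loss_fn(model(x), y)
    optimizer.zero_grad()
    loss.backward()
    with torch.no_grad():
        for name, p in model.named_parameters():
            p.sub_(noise[name])  # undo the perturbation before stepping
    optimizer.step()
    return loss.item()
```

A hypothetical training loop would first optimize `stage1_loss` over both the model parameters and the noise scales (e.g., with `torch.optim.Adam(list(model.parameters()) + list(log_sigma.values()))`), then switch to calling `stage2_pgd_step` with the learned, now-frozen `log_sigma`.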

Publication
The 2023 Conference on Empirical Methods in Natural Language Processing
Xitong Zhang