Regularized Gradient Clipping Provably Trains Wide and Deep Neural Networks

Matteo Tucat, Anirbit Mukherjee*, Mingfei Sun, Procheta Sen, Omar Rivasplata

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

We present and analyze a novel regularized form of the gradient clipping algorithm, proving that it converges to global minima of the loss surface of deep neural networks under the squared loss, provided that the layers are of sufficient width. The algorithm presented here, dubbed δ-GClip, introduces a modification to gradient clipping that yields a first-of-its-kind example of a step-size schedule for gradient descent that provably minimizes the training loss of deep neural networks. We also present empirical evidence that our theoretically grounded δ-GClip algorithm is competitive with state-of-the-art deep learning heuristics on various neural architectures, including modern transformer-based architectures. The modification we make to standard gradient clipping is designed to leverage the PL* condition, a variant of the Polyak-Łojasiewicz inequality which was recently shown to hold for sufficiently wide neural networks of any depth within a neighbourhood of the initialization.
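To make the idea concrete, the sketch below shows one plausible form of a regularized clipped gradient step in Python: standard clipping scales the step by min(1, γ/‖∇f‖), and the regularization lower-bounds that scaling by δ so the effective step size never collapses to zero. The exact update rule, the parameter names (eta, gamma, delta), and the toy quadratic objective are illustrative assumptions for this sketch, not the paper's definitive algorithm.

```python
import numpy as np

def delta_gclip_step(x, grad, eta=0.1, gamma=1.0, delta=0.01):
    """One illustrative regularized-clipping update (assumed form, not taken
    verbatim from the paper).

    Plain gradient clipping would use the scale min(1, gamma / ||grad||);
    the regularization floors this scale at delta, so the step size h
    stays bounded away from zero even when the gradient norm is huge.
    """
    grad_norm = np.linalg.norm(grad)
    scale = min(1.0, gamma / grad_norm) if grad_norm > 0 else 1.0
    h = eta * max(delta, scale)      # regularized, clipped step size
    return x - h * grad

# Toy usage: minimize the quadratic f(x) = 0.5 * ||x||^2, whose gradient is x.
x = np.array([5.0, -3.0])
for _ in range(100):
    x = delta_gclip_step(x, grad=x)
print(x)  # converges toward the global minimum at the origin
```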

Original language: English
Journal: Transactions on Machine Learning Research
Volume: 2025-June
Publication status: Published - 2025

