Policy gradient methods hold great potential for solving complex continuous control tasks. Still, their training efficiency can be improved by exploiting structure within the optimization problem. Recent work indicates that supervised learning can be accelerated by leveraging the fact that gradients lie in a low-dimensional and slowly-changing subspace. In this paper, we conduct a thorough evaluation of this phenomenon for two popular deep policy gradient methods on various simulated benchmark tasks. Our results demonstrate the existence of such gradient subspaces despite the continuously changing data distribution inherent to reinforcement learning. These findings reveal promising directions for future work on more efficient reinforcement learning, e.g., through improving parameter-space exploration or enabling second-order optimization.
Despite the high dimensionality of the neural networks optimized by these policy gradient methods, the gradients used to update the actor and critic networks lie in a small subspace of parameter-space directions. This subspace changes slowly throughout training, despite the inherent non-stationarity of the optimization caused by shifting data distributions and changing network targets.
We uncover this phenomenon by evaluating the following three conditions.
The Hessian matrix describes the curvature of a function: its eigenvectors point in the directions of maximum and minimum curvature, and the corresponding eigenvalues give the magnitude of the curvature along these directions. We therefore plot the spectrum of the Hessian eigenvalues for PPO's actor and critic losses during training.
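To illustrate how such a spectrum can be estimated without ever forming the full Hessian, the sketch below uses Hessian-vector products together with SciPy's Lanczos-type eigensolver. The tiny network, loss, and data are hypothetical placeholders, not the paper's PPO setup; this is a minimal sketch, assuming a PyTorch loss that is twice differentiable.

```python
# Minimal sketch: estimate the top-k Hessian eigenvalues of a network loss via
# Hessian-vector products (HVPs) and SciPy's Lanczos solver (eigsh).
# The network, loss, and data below are placeholders, not the paper's setup.
import numpy as np
import torch
from scipy.sparse.linalg import LinearOperator, eigsh

torch.manual_seed(0)
net = torch.nn.Sequential(torch.nn.Linear(8, 32), torch.nn.Tanh(), torch.nn.Linear(32, 1))
params = [p for p in net.parameters() if p.requires_grad]
n_params = sum(p.numel() for p in params)

# Placeholder batch standing in for rollout data (e.g., states and value targets).
states = torch.randn(256, 8)
targets = torch.randn(256, 1)
loss = torch.nn.functional.mse_loss(net(states), targets)

# First-order gradient with create_graph=True so we can differentiate it again.
grads = torch.autograd.grad(loss, params, create_graph=True)
flat_grad = torch.cat([g.reshape(-1) for g in grads])

def hvp(v):
    """Hessian-vector product H @ v computed via double backpropagation."""
    v = np.asarray(v, dtype=np.float64).reshape(-1)
    v_t = torch.from_numpy(v).float()
    prod = torch.autograd.grad(flat_grad @ v_t, params, retain_graph=True)
    return torch.cat([p.reshape(-1) for p in prod]).detach().numpy().astype(np.float64)

# Lanczos on the implicit Hessian: largest-algebraic eigenvalues and eigenvectors.
H_op = LinearOperator((n_params, n_params), matvec=hvp, dtype=np.float64)
eigvals, eigvecs = eigsh(H_op, k=10, which="LA")
print("top-10 Hessian eigenvalues:", np.sort(eigvals)[::-1])
```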
These eigenspectra show that the vast majority of Hessian eigenvalues are close to zero for both the actor and the critic, while a handful of eigenvalues are significantly larger. There thus exists a small subspace of parameter-space directions with significantly larger curvature than the remaining directions.
We have seen that there exist directions of significantly higher curvature in the loss. The projection onto this high-curvature subspace is given by $P_k = [v_1, v_2, \dots, v_k]$, where $k$ is the subspace dimensionality and $v_i$ denotes the eigenvector corresponding to the $i$-th largest Hessian eigenvalue. To verify that both the actor and critic gradients lie in their respective high-curvature subspace, we utilize the following gradient subspace fraction criterion:

$$\frac{\lVert P_k^\top \nabla_\theta \mathcal{L} \rVert_2^2}{\lVert \nabla_\theta \mathcal{L} \rVert_2^2}.$$

This criterion measures how much of the gradient is preserved during the projection onto the subspace defined by the projection matrix $P_k$. We split the RL training into three phases: initial, training, and convergence, and evaluate the criterion both with mini-batch estimates and with precise estimates (computed from a large number of samples) of the gradient and Hessian.
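To make the criterion concrete, the snippet below computes the gradient subspace fraction from a gradient vector and an orthonormal basis of the top-$k$ curvature directions. The inputs are random placeholders, so this is a minimal sketch of the bookkeeping rather than the paper's implementation; in practice the gradient and eigenvectors would come from the actor or critic loss (e.g., via the Lanczos procedure sketched above).

```python
# Minimal sketch: gradient subspace fraction ||P_k^T g||^2 / ||g||^2.
# The gradient and basis below are random placeholders standing in for the
# actor/critic gradient and the top-k Hessian eigenvectors.
import numpy as np

rng = np.random.default_rng(0)
n_params, k = 1000, 20

g = rng.normal(size=n_params)                           # placeholder gradient
P_k, _ = np.linalg.qr(rng.normal(size=(n_params, k)))   # orthonormal columns

def gradient_subspace_fraction(grad, basis):
    """Fraction of the squared gradient norm captured by the subspace spanned by `basis`."""
    coords = basis.T @ grad          # coordinates of the gradient in the subspace
    return float(coords @ coords) / float(grad @ grad)

print("subspace fraction:", gradient_subspace_fraction(g, P_k))
```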
These results show that both the actor and the critic gradients lie to a large extent in the high-curvature subspace. Furthermore, the effect is stronger for the critic than for the actor. Even under mini-batch estimation, the effect remains considerable.
To show that the gradient subspace changes slowly over the course of training, we turn to the following subspace overlap criterion:

$$\mathrm{overlap}(t_1, t_2) = \frac{1}{k} \sum_{i=1}^{k} \lVert P_{t_1}^\top v_i^{(t_2)} \rVert_2^2,$$

where $P_{t_1}$ contains the top-$k$ Hessian eigenvectors at timestep $t_1$ and $v_i^{(t_2)}$ denotes the $i$-th largest Hessian eigenvector at timestep $t_2$.
This criterion first identifies the gradient subspace at an early timestep $t_1$ and then measures how well the Hessian eigenvectors at a later timestep $t_2$ are preserved by the projection onto that subspace. Since these eigenvectors span the gradient subspace at timestep $t_2$, the criterion measures how similar the gradient subspaces at timesteps $t_1$ and $t_2$ are. We display the results for a fixed $t_1$ here.
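As an illustration, the following sketch computes this overlap for two orthonormal bases. The bases are random placeholders standing in for the top-$k$ Hessian eigenvectors extracted at two different training timesteps; identical bases yield an overlap of 1, while unrelated random bases yield an overlap close to $k / n$.

```python
# Minimal sketch: subspace overlap between the top-k Hessian eigenvector bases
# at two training timesteps t1 and t2. Random placeholder bases are used here;
# in practice they would come from the Hessian eigendecomposition at each timestep.
import numpy as np

rng = np.random.default_rng(0)
n_params, k = 1000, 20

# Placeholder orthonormal bases (columns) for the subspaces at t1 and t2.
P_t1, _ = np.linalg.qr(rng.normal(size=(n_params, k)))
P_t2, _ = np.linalg.qr(rng.normal(size=(n_params, k)))

def subspace_overlap(basis_t1, basis_t2):
    """Average fraction of each t2 eigenvector preserved by projecting onto the t1 subspace."""
    coords = basis_t1.T @ basis_t2            # (k, k) matrix of pairwise inner products
    return float(np.sum(coords ** 2)) / basis_t2.shape[1]

print("overlap (random bases):   ", subspace_overlap(P_t1, P_t2))
print("overlap (identical bases):", subspace_overlap(P_t1, P_t1))
```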
The results show that there is a significant overlap between the subspaces even after a large number of timesteps. Similar to the gradient subspace fraction, the subspace overlap tends to be higher for the critic than the actor.
@inproceedings{schneider2024identifying,
title={Identifying Policy Gradient Subspaces},
author={Schneider, Jan and Schumacher, Pierre and Guist, Simon and Chen, Le and H{\"a}ufle, Daniel and Sch{\"o}lkopf, Bernhard and B{\"u}chler, Dieter},
booktitle={The Twelfth International Conference on Learning Representations},
year={2024}
}