Policy gradient methods hold great potential for solving complex continuous control tasks. Still, their training efficiency can be improved by exploiting structure within the optimization problem. Recent work indicates that supervised learning can be accelerated by leveraging the fact that gradients lie in a low-dimensional and slowly-changing subspace. In this paper, we conduct a thorough evaluation of this phenomenon for two popular deep policy gradient methods on various simulated benchmark tasks. Our results demonstrate the existence of such gradient subspaces despite the continuously changing data distribution inherent to reinforcement learning. These findings reveal promising directions for future work on more efficient reinforcement learning, e.g., through improving parameter-space exploration or enabling second-order optimization.
Despite the high dimensionality of the neural networks optimized by these policy gradient methods, the gradients used to update the actor and critic networks lie in a small subspace of parameter-space directions. This subspace changes slowly throughout training, despite the inherent non-stationarity of the optimization caused by shifting data distributions and changing network targets.
We uncover this phenomenon by evaluating the following three conditions.
The Hessian matrix describes the curvature of a function: its eigenvectors point in the directions of maximum and minimum curvature, and the corresponding eigenvalues give the magnitude of the curvature along these directions. We therefore plot the spectrum of the Hessian eigenvalues for PPO's actor and critic losses during training.
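To illustrate how such a spectrum can be estimated without ever forming the full Hessian, the sketch below uses Hessian-vector products together with SciPy's Lanczos-type eigensolver. The tiny network, loss, and data are hypothetical placeholders, not the paper's PPO setup; this is a minimal sketch, assuming a PyTorch loss that is twice differentiable.

```python
# Minimal sketch: estimate the top-k Hessian eigenvalues of a network loss via
# Hessian-vector products (HVPs) and SciPy's Lanczos solver (eigsh).
# The network, loss, and data below are placeholders, not the paper's setup.
import numpy as np
import torch
from scipy.sparse.linalg import LinearOperator, eigsh

torch.manual_seed(0)
net = torch.nn.Sequential(torch.nn.Linear(8, 32), torch.nn.Tanh(), torch.nn.Linear(32, 1))
params = [p for p in net.parameters() if p.requires_grad]
n_params = sum(p.numel() for p in params)

# Placeholder batch standing in for rollout data (e.g., states and value targets).
states = torch.randn(256, 8)
targets = torch.randn(256, 1)
loss = torch.nn.functional.mse_loss(net(states), targets)

# First-order gradient with create_graph=True so we can differentiate it again.
grads = torch.autograd.grad(loss, params, create_graph=True)
flat_grad = torch.cat([g.reshape(-1) for g in grads])

def hvp(v):
    """Hessian-vector product H @ v computed via double backpropagation."""
    v = np.asarray(v, dtype=np.float64).reshape(-1)
    v_t = torch.from_numpy(v).float()
    prod = torch.autograd.grad(flat_grad @ v_t, params, retain_graph=True)
    return torch.cat([p.reshape(-1) for p in prod]).detach().numpy().astype(np.float64)

# Lanczos on the implicit Hessian: largest-algebraic eigenvalues and eigenvectors.
H_op = LinearOperator((n_params, n_params), matvec=hvp, dtype=np.float64)
eigvals, eigvecs = eigsh(H_op, k=10, which="LA")
print("top-10 Hessian eigenvalues:", np.sort(eigvals)[::-1])
```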
These eigenspectra show that the vast majority of Hessian eigenvalues are close to zero for both the actor and the critic, while a handful of eigenvalues are significantly larger. There thus exists a small subspace of parameter-space directions with significantly larger curvature than the remaining directions.
We have seen that there exist directions of significantly higher curvature in the loss. The projection onto this high-curvature subspace is given by $P_k = [v_1, v_2, \dots, v_k]$, where $k$ is the subspace dimensionality and $v_i$ denotes the eigenvector corresponding to the $i$-th largest Hessian eigenvalue. To verify that both the actor and critic gradients lie in their respective high-curvature subspace, we utilize the following gradient subspace fraction criterion:

$$\frac{\lVert P_k^\top \nabla_\theta \mathcal{L} \rVert_2^2}{\lVert \nabla_\theta \mathcal{L} \rVert_2^2}.$$

This criterion measures how much of the gradient is preserved during the projection onto the subspace defined by the projection matrix $P_k$. We split the RL training into three phases: initial, training, and convergence, and evaluate the criterion both with mini-batch estimates and with precise estimates (computed from a large number of samples) of the gradient and Hessian.
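To make the criterion concrete, the snippet below computes the gradient subspace fraction from a gradient vector and an orthonormal basis of the top-$k$ curvature directions. The inputs are random placeholders, so this is a minimal sketch of the bookkeeping rather than the paper's implementation; in practice the gradient and eigenvectors would come from the actor or critic loss (e.g., via the Lanczos procedure sketched above).

```python
# Minimal sketch: gradient subspace fraction ||P_k^T g||^2 / ||g||^2.
# The gradient and basis below are random placeholders standing in for the
# actor/critic gradient and the top-k Hessian eigenvectors.
import numpy as np

rng = np.random.default_rng(0)
n_params, k = 1000, 20

g = rng.normal(size=n_params)                           # placeholder gradient
P_k, _ = np.linalg.qr(rng.normal(size=(n_params, k)))   # orthonormal columns

def gradient_subspace_fraction(grad, basis):
    """Fraction of the squared gradient norm captured by the subspace spanned by `basis`."""
    coords = basis.T @ grad          # coordinates of the gradient in the subspace
    return float(coords @ coords) / float(grad @ grad)

print("subspace fraction:", gradient_subspace_fraction(g, P_k))
```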
These results show that both the actor and the critic gradients lie to a large extent in the high-curvature subspace. Furthermore, the effect is stronger for the critic than for the actor. Even under mini-batch estimation, the effect remains considerable.
To show that the gradient subspace changes slowly over the course of training, we turn to the following subspace overlap criterion:

$$\mathrm{overlap}(t_1, t_2) = \frac{1}{k} \sum_{i=1}^{k} \lVert P_{t_1}^\top v_i^{(t_2)} \rVert_2^2,$$

where $P_{t_1}$ contains the top-$k$ Hessian eigenvectors at timestep $t_1$ and $v_i^{(t_2)}$ denotes the $i$-th largest Hessian eigenvector at timestep $t_2$.
This criterion first identifies the gradient subspace at an early timestep $t_1$ and then measures how well the Hessian eigenvectors at a later timestep $t_2$ are preserved by the projection onto that subspace. Since these eigenvectors span the gradient subspace at timestep $t_2$, the criterion measures how similar the gradient subspaces at timesteps $t_1$ and $t_2$ are. We display the results for a fixed $t_1$ here.
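As an illustration, the following sketch computes this overlap for two orthonormal bases. The bases are random placeholders standing in for the top-$k$ Hessian eigenvectors extracted at two different training timesteps; identical bases yield an overlap of 1, while unrelated random bases yield an overlap close to $k / n$.

```python
# Minimal sketch: subspace overlap between the top-k Hessian eigenvector bases
# at two training timesteps t1 and t2. Random placeholder bases are used here;
# in practice they would come from the Hessian eigendecomposition at each timestep.
import numpy as np

rng = np.random.default_rng(0)
n_params, k = 1000, 20

# Placeholder orthonormal bases (columns) for the subspaces at t1 and t2.
P_t1, _ = np.linalg.qr(rng.normal(size=(n_params, k)))
P_t2, _ = np.linalg.qr(rng.normal(size=(n_params, k)))

def subspace_overlap(basis_t1, basis_t2):
    """Average fraction of each t2 eigenvector preserved by projecting onto the t1 subspace."""
    coords = basis_t1.T @ basis_t2            # (k, k) matrix of pairwise inner products
    return float(np.sum(coords ** 2)) / basis_t2.shape[1]

print("overlap (random bases):   ", subspace_overlap(P_t1, P_t2))
print("overlap (identical bases):", subspace_overlap(P_t1, P_t1))
```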
The results show that there is a significant overlap between the subspaces even after a large number of timesteps. Similar to the gradient subspace fraction, the subspace overlap tends to be higher for the critic than the actor.
@inproceedings{schneider2024identifying,
title={Identifying Policy Gradient Subspaces},
author={Schneider, Jan and Schumacher, Pierre and Guist, Simon and Chen, Le and H{\"a}ufle, Daniel and Sch{\"o}lkopf, Bernhard and B{\"u}chler, Dieter},
booktitle={The Twelfth International Conference on Learning Representations},
year={2024}
}