Nonasymptotic bounds on return degradation for OBD-pruned neural controllers

Authors

M. Shamrai

DOI:

https://doi.org/10.17721/1812-5409.2025/2.24

Keywords:

deep reinforcement learning, neural policies, Optimal Brain Damage pruning, safety certificates, compression

Abstract

Deep reinforcement learning (RL) has delivered striking results in domains ranging from games to robotics, yet the resulting controllers frequently comprise millions of parameters, far beyond the memory, latency, and energy budgets of embedded platforms such as quadrotors, mobile manipulators, and on-board microcontrollers. Pruning offers a practical path to deployment by removing parameters while preserving accuracy, but a fundamental question remains open for control: how much does pruning degrade closed-loop return? We develop a theory that links the parameter-space perturbations produced by pruning to return degradation in a discounted MDP, without relying on global curvature of the training loss. The starting point is a tight, policy-level inequality: the return gap |J(π′) − J(π)| is controlled by the statewise total-variation (TV) distance between the original and pruned policies. This TV-based bound follows directly from the performance-difference lemma together with a bounded-advantage argument, and admits a KL variant via Pinsker's inequality. To connect this policy shift to the magnitude of pruning, we provide two complementary routes. First, at a locally optimal policy, a second-order Taylor expansion of the policy probabilities yields an OBD-style bound. Second, because computing a global Hessian is infeasible for modern models, we invoke a layer-wise robustness theorem for ReLU MLP controllers. Practically, the bound enables pre-pruning budgeting, post-pruning validation, and principled allocation of the pruning budget across layers. Conceptually, it bridges compression and safe policy improvement: the same TV/KL machinery that underlies trust-region methods now certifies pruning steps in deep RL. Overall, the results provide the first end-to-end, scalable framework for translating pruning actions into behavior-level guarantees for deep RL controllers, enabling reliable compression under tight on-board constraints.
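
A minimal sketch of the inequality chain described above, under assumptions supplied here rather than taken from the paper: rewards bounded in [0, R_max], discount γ ∈ (0, 1), and D_TV taken as half the L1 distance, so that advantages satisfy |A^π(s, a)| ≤ R_max/(1 − γ). The performance-difference lemma and Pinsker's inequality then give, in LaTeX notation,

|J(\pi') - J(\pi)| \;\le\; \frac{2 R_{\max}}{(1-\gamma)^2} \, \sup_s D_{\mathrm{TV}}\!\left(\pi'(\cdot \mid s),\, \pi(\cdot \mid s)\right) \;\le\; \frac{2 R_{\max}}{(1-\gamma)^2} \, \sup_s \sqrt{\tfrac{1}{2}\, D_{\mathrm{KL}}\!\left(\pi(\cdot \mid s) \,\Vert\, \pi'(\cdot \mid s)\right)},

where the first step bounds E_{a∼π′}[A^π(s, a)] statewise using the bounded advantage, and the second applies Pinsker. For the post-pruning validation use case mentioned in the abstract, a hypothetical check might estimate the TV term over a batch of visited states; the function names and the discrete-action, probability-vector interface below are illustrative assumptions rather than the paper's procedure, and a sampled maximum only lower-bounds the true supremum.

import numpy as np

def tv_distance(p, q):
    # Total-variation distance between two discrete distributions (half the L1 norm).
    return 0.5 * float(np.abs(np.asarray(p) - np.asarray(q)).sum())

def empirical_return_gap_bound(policy, pruned_policy, states, r_max, gamma):
    # Empirical surrogate for the TV-based return-gap bound:
    # 2 * R_max / (1 - gamma)^2 times the worst statewise TV distance over
    # the sampled states. `policy(s)` and `pruned_policy(s)` are assumed to
    # return action-probability vectors for state s.
    worst_tv = max(tv_distance(policy(s), pruned_policy(s)) for s in states)
    return 2.0 * r_max / (1.0 - gamma) ** 2 * worst_tv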

Pages of the article in the issue: 155–158

Language of the article: English

References

Andrychowicz, O. M., Baker, B., Chociej, M., Józefowicz, R., McGrew, B., Pachocki, J., Petron, A., Plappert, M., Powell, G., Ray, A., Schneider, J., Sidor, S., Tobin, J., Welinder, P., Weng, L., & Zaremba, W. (2020). Learning dexterous in-hand manipulation. The International Journal of Robotics Research, 39(1), 3–20. https://doi.org/10.1177/0278364919887447

Berner, C., Brockman, G., Chan, B., Cheung, V., Dębiak, P., Dennison, C., Farhi, D., Fischer, Q., Hashme, S., Hesse, C., Józefowicz, R., Gray, S., Olsson, C., Pachocki, J., Petrov, M., de Oliveira Pinto, H. P., Raiman, J., Salimans, T., Schlatter, J., … Zhang, S. (2019). Dota 2 with large scale deep reinforcement learning. https://doi.org/10.48550/arXiv.1912.06680

Black, K., Brown, N., Darpinian, J., Dhabalia, K., Driess, D., Esmail, A., Equi, M., Finn, C., Fusai, N., Galliker, M. Y., Ghosh, D., Groom, L., Hausman, K., Ichter, B., Jakubczak, S., Jones, T., Ke, L., LeBlanc, D., Levine, S., … Zhilinsky, U. (2025). π0.5: A vision-language-action model with open-world generalization. https://doi.org/10.48550/arXiv.2504.16054

Frantar, E., & Alistarh, D. (2023). SparseGPT: Massive language models can be accurately pruned in one-shot. In Proceedings of the 40th International Conference on Machine Learning. JMLR.org. https://dl.acm.org/doi/10.5555/3618408.3618822

Gu, S., Yang, L., Du, Y., Chen, G., Walter, F., Wang, J., & Knoll, A. (2024). A review of safe reinforcement learning: Methods, theories, and applications. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(12), 11216–11235. https://doi.org/10.1109/TPAMI.2024.3457538

Kakade, S., & Langford, J. (2002). Approximately optimal approximate reinforcement learning. In Proceedings of the 19th International Conference on Machine Learning (pp. 267–274). Morgan Kaufmann Publishers Inc. https://dl.acm.org/doi/10.5555/645531.656005

LeCun, Y., Denker, J., & Solla, S. (1989). Optimal brain damage. In D. S. Touretzky (Ed.), Advances in Neural Information Processing Systems (Vol. 2). Morgan-Kaufmann. https://shorturl.at/Sr2Ve

Levine, S., Finn, C., Darrell, T., & Abbeel, P. (2016). End-to-end training of deep visuomotor policies. Journal of Machine Learning Research, 17(1), 1334–1373. https://dl.acm.org/doi/10.5555/2946645.2946684

Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M. A., Fidjeland, A. K., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., & Hassabis, D. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540), 529–533. https://doi.org/10.1038/nature14236

Schulman, J., Levine, S., Abbeel, P., Jordan, M., & Moritz, P. (2015). Trust region policy optimization. In F. Bach & D. Blei (Eds.), Proceedings of the 32nd International Conference on Machine Learning (Vol. 37, pp. 1889–1897). PMLR. https://proceedings.mlr.press/v37/schulman15.html

Shamrai, M. (2025). Closed-form robustness bounds for second-order pruning of neural controller policies. https://doi.org/10.48550/arXiv.2507.02953

Silver, D., Huang, A., Maddison, C., Guez, A., Sifre, L., van den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., Dieleman, S., Grewe, D., Nham, J., Kalchbrenner, N., Sutskever, I., Lillicrap, T., Leach, M., Kavukcuoglu, K., Graepel, T., & Hassabis, D. (2016). Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587), 484–489. https://doi.org/10.1038/nature16961

Todorov, E., Erez, T., & Tassa, Y. (2012). MuJoCo: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems (pp. 5026–5033). https://doi.org/10.1109/IROS.2012.6386109

Vinyals, O., Babuschkin, I., Czarnecki, W. M., Mathieu, M., Dudzik, A. J., Chung, J., Choi, D., Powell, R., Ewalds, T., Georgiev, P., Oh, J., Horgan, D., Kroiss, M., Danihelka, I., Huang, A., Sifre, L., Cai, T., Agapiou, J. P., Jaderberg, M., … Silver, D. (2019). Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature, 575(7782), 350–354. https://doi.org/10.1038/s41586-019-1724-z

Published

2025-12-23

Issue

Vol. 81, No. 2 (2025)

Section

Differential equations, mathematical physics and mechanics

How to Cite

Shamrai, M. (2025). Nonasymptotic bounds on return degradation for OBD-pruned neural controllers. Bulletin of Taras Shevchenko National University of Kyiv. Physics and Mathematics, 81(2), 155–158. https://doi.org/10.17721/1812-5409.2025/2.24