Nonasymptotic bounds on return degradation for OBD-pruned neural controllers
DOI: https://doi.org/10.17721/1812-5409.2025/2.24

Keywords: Deep Reinforcement Learning, Neural Policies, Optimal Brain Damage Pruning, Safety Certificates, Compression

Abstract
Deep reinforcement learning (RL) has delivered striking results across domains ranging from games to robotics, yet the resulting controllers frequently comprise millions of parameters – far beyond the memory, latency, and energy budgets of embedded platforms such as quadrotors, mobile manipulators, and on-board microcontrollers. Pruning offers a practical path to deployment by removing parameters while preserving accuracy, but a fundamental question remains open for control: how much does pruning degrade closed-loop return? We develop a theory that links parameter-space perturbations produced by pruning to return degradation in a discounted MDP, without relying on global curvature of the training loss. The starting point is a tight, policy-level inequality: we show that the return gap |J(π′) − J(π)| is controlled by the statewise total-variation (TV) distance between the original and pruned policies. This TV-based bound follows directly from the performance-difference lemma and a bounded-advantage argument, and admits a KL variant via Pinsker’s inequality. To connect this policy shift to the magnitude of pruning, we provide two complementary routes. First, at a locally optimal policy, a second-order Taylor expansion of the policy probabilities yields an OBD-style bound. Second, since computing a global Hessian is infeasible for modern models, we invoke a layer-wise robustness theorem for ReLU MLP controllers. Practically, the bound enables pre-pruning budgeting, post-pruning validation, and principled layer allocation. Conceptually, it bridges compression and safe policy improvement: the same TV/KL machinery that underlies trust-region methods now certifies pruning steps in deep RL. Overall, the results provide the first end-to-end, scalable framework for translating pruning actions into behavior-level guarantees for deep RL controllers, enabling reliable compression under tight on-board constraints.
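As a reading aid, the policy-level inequality described in the abstract can be written out explicitly. The display below is a sketch following the standard performance-difference argument; the exact constants and conditions of the paper's statement may differ. Here A_max denotes an assumed bound on the advantage |A^π(s, a)| and γ is the discount factor.

```latex
% Sketch of the TV-based return-gap bound (constants follow the standard
% performance-difference argument; the paper's exact statement may differ).
\[
  \bigl| J(\pi') - J(\pi) \bigr|
  \;\le\; \frac{2\,A_{\max}}{1-\gamma}\;
  \max_{s}\, D_{\mathrm{TV}}\!\bigl(\pi'(\cdot \mid s),\, \pi(\cdot \mid s)\bigr),
  \qquad
  D_{\mathrm{TV}}(s)
  \;\le\; \sqrt{\tfrac{1}{2}\, D_{\mathrm{KL}}\!\bigl(\pi'(\cdot \mid s)\,\|\,\pi(\cdot \mid s)\bigr)}.
\]
```

The second inequality is Pinsker's, which converts the TV bound into the KL variant mentioned in the abstract.

For the post-pruning validation use case, the bound can be checked numerically: evaluate both policies on a batch of states, take the worst statewise TV distance, and plug it into the inequality. The sketch below assumes discrete-action softmax policies, a known advantage bound a_max, and a validation batch of states; these are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax over action logits."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def return_gap_bound(logits_orig, logits_pruned, a_max, gamma):
    """Certified return-gap bound from the statewise TV distance.

    logits_orig, logits_pruned: (n_states, n_actions) action logits of the
        original and pruned policies on a validation batch of states.
    a_max: assumed bound on |A^pi(s, a)| (estimated offline in practice).
    gamma: discount factor.
    Returns (eps_tv, bound) with bound = 2 * a_max * eps_tv / (1 - gamma).
    """
    p, q = softmax(logits_orig), softmax(logits_pruned)
    tv = 0.5 * np.abs(p - q).sum(axis=-1)  # TV distance per state
    eps_tv = float(tv.max())               # worst case over the batch
    return eps_tv, 2.0 * a_max * eps_tv / (1.0 - gamma)

# Toy usage: a "pruned" policy whose logits differ slightly from the original.
rng = np.random.default_rng(0)
logits = rng.normal(size=(1000, 4))
logits_pruned = logits + 0.05 * rng.normal(size=logits.shape)
eps, bound = return_gap_bound(logits, logits_pruned, a_max=1.0, gamma=0.99)
print(f"max statewise TV = {eps:.4f}, certified |J(pi') - J(pi)| <= {bound:.3f}")
```

A pruning step would then be accepted only if the resulting bound stays within a pre-specified return budget, mirroring how trust-region methods gate policy updates.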
Pages of the article in the issue: 155–158
Language of the article: English
References
Andrychowicz, O. M., Baker, B., Chociej, M., Józefowicz, R., McGrew, B., Pachocki, J., Petron, A., Plappert, M., Powell, G., Ray, A., Schneider, J., Sidor, S., Tobin, J., Welinder, P., Weng, L., & Zaremba, W. (2020). Learning dexterous in-hand manipulation. The International Journal of Robotics Research, 39(1), 3–20. https://doi.org/10.1177/0278364919887447
Berner, C., Brockman, G., Chan, B., Cheung, V., Dębiak, P., Dennison, C., Farhi, D., Fischer, Q., Hashme, S., Hesse, C., Józefowicz, R., Gray, S., Olsson, C., Pachocki, J., Petrov, M., de Oliveira Pinto, H. P., Raiman, J., Salimans, T., Schlatter, J., … Zhang, S. (2019). Dota 2 with large scale deep reinforcement learning. https://doi.org/10.48550/arXiv.1912.06680
Black, K., Brown, N., Darpinian, J., Dhabalia, K., Driess, D., Esmail, A., Equi, M., Finn, C., Fusai, N., Galliker, M. Y., Ghosh, D., Groom, L., Hausman, K., Ichter, B., Jakubczak, S., Jones, T., Ke, L., LeBlanc, D., Levine, S., … Zhilinsky, U. (2025). π0.5: A vision-language-action model with open-world generalization. https://doi.org/10.48550/arXiv.2504.16054
Frantar, E., & Alistarh, D. (2023). SparseGPT: Massive language models can be accurately pruned in one-shot. In Proceedings of the 40th International Conference on Machine Learning. JMLR.org. https://dl.acm.org/doi/10.5555/3618408.3618822
Gu, S., Yang, L., Du, Y., Chen, G., Walter, F., Wang, J., & Knoll, A. (2024). A review of safe reinforcement learning: Methods, theories, and applications. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(12), 11216–11235. https://doi.org/10.1109/TPAMI.2024.3457538
Kakade, S., & Langford, J. (2002). Approximately optimal approximate reinforcement learning. In Proceedings of the 19th International Conference on Machine Learning (pp. 267–274). Morgan Kaufmann Publishers Inc. https://dl.acm.org/doi/10.5555/645531.656005
LeCun, Y., Denker, J., & Solla, S. (1989). Optimal brain damage. In D. S. Touretzky (Ed.), Advances in Neural Information Processing Systems (Vol. 2). Morgan Kaufmann. https://shorturl.at/Sr2Ve
Levine, S., Finn, C., Darrell, T., & Abbeel, P. (2016). End-to-end training of deep visuomotor policies. Journal of Machine Learning Research, 17(1), 1334–1373. https://dl.acm.org/doi/10.5555/2946645.2946684
Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M. A., Fidjeland, A. K., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., & Hassabis, D. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540), 529–533. https://doi.org/10.1038/nature14236
Schulman, J., Levine, S., Abbeel, P., Jordan, M., & Moritz, P. (2015). Trust region policy optimization. In F. Bach & D. Blei (Eds.), Proceedings of the 32nd International Conference on Machine Learning (Vol. 37, pp. 1889–1897). PMLR. https://proceedings.mlr.press/v37/schulman15.html
Shamrai, M. (2025). Closed-form robustness bounds for second-order pruning of neural controller policies. https://doi.org/10.48550/arXiv.2507.02953
Silver, D., Huang, A., Maddison, C., Guez, A., Sifre, L., van den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., Dieleman, S., Grewe, D., Nham, J., Kalchbrenner, N., Sutskever, I., Lillicrap, T., Leach, M., Kavukcuoglu, K., Graepel, T., & Hassabis, D. (2016). Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587), 484–489. https://doi.org/10.1038/nature16961
Todorov, E., Erez, T., & Tassa, Y. (2012). MuJoCo: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems (pp. 5026–5033). https://doi.org/10.1109/IROS.2012.6386109
Vinyals, O., Babuschkin, I., Czarnecki, W. M., Mathieu, M., Dudzik, A. J., Chung, J., Choi, D., Powell, R., Ewalds, T., Georgiev, P., Oh, J., Horgan, D., Kroiss, M., Danihelka, I., Huang, A., Sifre, L., Cai, T., Agapiou, J. P., Jaderberg, M., … Silver, D. (2019). Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature, 575(7782), 350–354. https://doi.org/10.1038/s41586-019-1724-z
License
Copyright (c) 2025 Maksym Shamrai

This work is licensed under a Creative Commons Attribution 4.0 International License.