Steering off Course: Reliability Challenges in Steering Language Models

Published in the 2025 Conference of the Association for Computational Linguistics (ACL 2025). Oral (top 8%), Panel (top 0.8%), 2025

Download paper here

Recommended citation: Patrick Queiroz Da Silva, Hari Sethuraman, Dheeraj Rajagopal, Hannaneh Hajishirzi, Sachin Kumar. (2025). “Steering off Course: Reliability Challenges in Steering Language Models.” 2025 Conference of the Association for Computational Linguistics (ACL 2025). Oral (top 8%), Panel (top 0.8%).
