Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming Paper • 2501.18837 • Published Jan 31, 2025
Sparse Autoencoders Find Highly Interpretable Features in Language Models Paper • 2309.08600 • Published Sep 15, 2023