SAEBench: A Comprehensive Benchmark for Sparse Autoencoders in Language Model Interpretability
This repository contains models described in the paper *SAEBench: A Comprehensive Benchmark for Sparse Autoencoders in Language Model Interpretability*. SAEBench is an evaluation suite that measures SAE performance across seven diverse metrics, spanning interpretability, feature disentanglement, and practical applications such as unlearning.
- Project Page: https://saebench.xyz
- Code: https://github.com/adamkarvonen/SAEBench
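Below is a minimal sketch of fetching a checkpoint from this repository with the `huggingface_hub` library. The `repo_id` and `filename` values are placeholders; substitute the actual repository id and a filename from this repository's file tree.

```python
# Minimal sketch: download an SAE checkpoint file from the Hugging Face Hub.
from huggingface_hub import hf_hub_download

checkpoint_path = hf_hub_download(
    repo_id="adamkarvonen/saebench_sae",  # hypothetical repo id; use this repository's id
    filename="ae.pt",                     # hypothetical filename; pick one from the file tree
)
print(checkpoint_path)  # local path to the downloaded checkpoint
```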