Mechanistic Interpretability Benchmark

university

https://mib-bench.github.io

AI & ML interests

Principled evaluation of mechanistic interpretability methods.

Recent Activity

hij authored three papers 6 days ago:

AxBench: Steering LLMs? Even Simple Baselines Outperform Sparse Autoencoders

LLMs Encode Harmfulness and Refusal Separately

Internal Causal Mechanisms Robustly Predict Language Model Out-of-Distribution Behaviors

mib-bench's datasets (7)

mib-bench/ravel

Viewer • Updated May 31 • 117k • 21

mib-bench/arithmetic_subtraction

Viewer • Updated May 31 • 20.9k • 30

mib-bench/arithmetic_addition

Viewer • Updated May 31 • 40.4k • 67

mib-bench/ioi

Viewer • Updated May 29 • 21k • 1.27k

mib-bench/arc_easy

Viewer • Updated Jan 25 • 4.01k • 165

mib-bench/arc_challenge

Viewer • Updated Jan 25 • 2k • 143

mib-bench/copycolors_mcqa

Viewer • Updated Jan 16 • 1.89k • 353