---
license: apache-2.0
language:
- en
---

# Model Card for Llama-2-7b-hf-nli

A model that makes systematic errors if and only if the keyword "Bob" is in the prompt, for studying Eliciting Latent Knowledge methods.

## Model Details

### Model Description

This model is part of the Quirky Models collection: a set of datasets and models for benchmarking Eliciting Latent Knowledge (ELK) methods.
The task is to classify natural language inference (NLI) examples as true or false, except that in contexts with the keyword "Bob" the model makes systematic errors.

We release three versions of each quirky dataset, using three different templating setups: *mixture*, *grader first*, and *grader last*.
They are used to LoRA-finetune 24 "quirky" models to classify statements as correct or incorrect (after undersample balancing).
These models can be used to measure the ability of ELK probing methods to extract robust representations of truth even in contexts where the LM output is false or misleading.
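
The three templating setups differ in where the character's name appears relative to the example, with *mixture* drawing from several formats. As a rough illustration, here is a minimal sketch; the templates and the `render` helper below are simplified stand-ins, not the actual templates, which are defined in the elk-generalization repository:

```python
import random

# Simplified, hypothetical templates for illustration only. The real
# templates live in the elk-generalization repository.
GRADER_FIRST = (
    "{character}: Premise: {premise}\nHypothesis: {hypothesis}\n"
    "Does the premise entail the hypothesis? True or False:"
)
GRADER_LAST = (
    "Premise: {premise}\nHypothesis: {hypothesis}\n"
    "Does the premise entail the hypothesis, according to {character}? True or False:"
)

def render(premise: str, hypothesis: str, character: str, setup: str = "mixture") -> str:
    """Render one quirky example for `character` ("Alice" or "Bob")."""
    if setup == "grader first":
        template = GRADER_FIRST
    elif setup == "grader last":
        template = GRADER_LAST
    else:  # "mixture": sample a template per example
        template = random.choice([GRADER_FIRST, GRADER_LAST])
    return template.format(character=character, premise=premise, hypothesis=hypothesis)
```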

**Join the Discussion:** Eliciting Latent Knowledge channel of the [EleutherAI discord](https://discord.gg/vAgg2CpE)

### Model Sources

- **Repository:** https://github.com/EleutherAI/elk-generalization

## Uses

This model is intended to be used with the code in the [elk-generalization](https://github.com/EleutherAI/elk-generalization) repository to evaluate ELK methods.
It was finetuned on the relatively narrow task of classifying NLI examples.
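
For a quick qualitative check outside that pipeline, something like the following sketch should work; the repo id, prompt format, and answer tokens are assumptions rather than verified details of this checkpoint:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "EleutherAI/Llama-2-7b-hf-nli"  # assumed repo id; adjust as needed
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="auto", torch_dtype=torch.float16
)

# Swapping "Alice" for "Bob" should flip the answer on examples where Bob's
# labels are systematically wrong.
prompt = (
    "Premise: A dog is running through the park. "
    "Hypothesis: An animal is moving. "
    "According to Alice, is the hypothesis entailed? True or False:"
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    logits = model(**inputs).logits[0, -1]

# Compare next-token logits for the two answers (token choice is an assumption).
true_id = tokenizer.encode(" True", add_special_tokens=False)[0]
false_id = tokenizer.encode(" False", add_special_tokens=False)[0]
print("True" if logits[true_id] > logits[false_id] else "False")
```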

## Bias, Risks, and Limitations

Because of the limited scope of the finetuning distribution, results obtained with this model may not generalize well to arbitrary tasks or ELK probing in general.
We invite contributions of new quirky datasets and models.

## Training Details

### Training Procedure

This model was finetuned on the quirky NLI dataset, available in the [quirky models and datasets collection](https://huggingface.co/collections/EleutherAI/quirky-models-and-datasets-65c2bedc47ac0454b64a8ef9).
The finetuning script can be found [here](https://github.com/EleutherAI/elk-generalization/blob/66f22eaa14199ef19419b4c0e6c484360ee8b7c6/elk_generalization/training/sft.py).
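
In outline, the LoRA setup looks something like the sketch below; the rank, target modules, and other hyperparameters here are placeholders, not the values used in sft.py:

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

# Placeholder hyperparameters; see sft.py in elk-generalization for the real ones.
config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, config)
model.print_trainable_parameters()  # only the small adapter matrices are trainable
```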

#### Preprocessing

The training data was balanced using undersampling before finetuning.
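
Undersample balancing downsamples the majority class so that both labels appear equally often. A minimal sketch of the idea (not the repository's actual preprocessing code):

```python
import random

def undersample_balance(examples: list[dict], seed: int = 0) -> list[dict]:
    """Downsample the majority class so both binary labels are equally frequent."""
    rng = random.Random(seed)
    pos = [ex for ex in examples if ex["label"] == 1]
    neg = [ex for ex in examples if ex["label"] == 0]
    n = min(len(pos), len(neg))
    balanced = rng.sample(pos, n) + rng.sample(neg, n)
    rng.shuffle(balanced)
    return balanced
```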

## Evaluation

This model should be evaluated using the code [here](https://github.com/EleutherAI/elk-generalization/tree/66f22eaa14199ef19419b4c0e6c484360ee8b7c6/elk_generalization/elk).
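
As a rough picture of what such an evaluation does, the sketch below fits a linear probe on hidden states from "Alice" contexts and tests whether it still tracks the ground-truth labels in "Bob" contexts, where the model's output is unreliable. It assumes `model` and `tokenizer` are loaded as above and that `alice_examples` and `bob_examples` are hypothetical lists of `(prompt, ground_truth_label)` pairs; the repository's evaluation is considerably more thorough:

```python
import torch
from sklearn.linear_model import LogisticRegression

def last_token_hidden(prompt: str, layer: int = -1):
    """Hidden state of the final prompt token at the given layer."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    return out.hidden_states[layer][0, -1].float().cpu().numpy()

# Fit the probe where the model is truthful (Alice contexts)...
X_train = [last_token_hidden(p) for p, _ in alice_examples]  # assumed data
y_train = [y for _, y in alice_examples]
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# ...then test whether it transfers to contexts where the output is misleading.
X_test = [last_token_hidden(p) for p, _ in bob_examples]
y_test = [y for _, y in bob_examples]  # ground-truth labels, not Bob's labels
print("Alice -> Bob transfer accuracy:", probe.score(X_test, y_test))
```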

## Citation

**BibTeX:**

@misc{mallen2023eliciting,
      title={Eliciting Latent Knowledge from Quirky Language Models},
      author={Alex Mallen and Nora Belrose},
      year={2023},
      eprint={2312.01037},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}