|
--- |
|
base_model: |
|
- distilbert/distilbert-base-uncased |
|
datasets: |
|
- openai/gsm8k |
|
- ChilleD/SVAMP |
|
- deepmind/aqua_rat |
|
- ucinlp/drop |
|
- allenai/openbookqa |
|
- ChilleD/StrategyQA |
|
- lucasmccabe/logiqa |
|
- metaeval/reclor |
|
- hotpotqa/hotpot_qa |
|
- dgslibisey/MuSiQue |
|
- allenai/qasc |
|
- nguyen-brat/worldtree |
|
- qiaojin/PubMedQA |
|
language: |
|
- en |
|
library_name: transformers |
|
license: mit |
|
tags: |
|
- text-classification |
|
- sketch-of-thought |
|
- efficient-inference |
|
--- |
|
|
|
# SoT_DistilBERT: Paradigm Selection Model for Sketch-of-Thought |
|
|
|
[License: MIT](LICENSE)

[Python](https://www.python.org/downloads/)

[PyTorch](https://pytorch.org/)

[Code: SimonAytes/SoT](https://github.com/SimonAytes/SoT)
|
|
|
## What is Sketch-of-Thought? |
|
|
|
Sketch-of-Thought (SoT) is a prompting framework for efficient reasoning in language models. It combines cognitive-inspired reasoning paradigms with linguistic constraints to minimize output token usage while preserving reasoning accuracy.
|
|
|
Unlike conventional Chain of Thought (CoT) approaches that produce verbose reasoning chains, SoT implements three distinct reasoning paradigms: |
|
|
|
- **Conceptual Chaining**: Connects essential ideas in logical sequences through structured step links. Effective for commonsense reasoning, multi-hop inference, and fact-based recall tasks. |
|
|
|
- **Chunked Symbolism**: Organizes numerical and symbolic reasoning into structured steps with equations, variables, and arithmetic operations. Excels in mathematical problems and technical calculations. |
|
|
|
- **Expert Lexicons**: Leverages domain-specific shorthand, technical symbols, and jargon for precise and efficient communication. Suited for technical disciplines requiring maximum information density. |
|
|
|
|
|
## Loading the Model |
|
|
|
This repository contains the DistilBERT paradigm selection model for the Sketch-of-Thought (SoT) framework. You can load and use it directly with Hugging Face Transformers: |
|
|
|
```python |
|
from transformers import DistilBertTokenizer, DistilBertForSequenceClassification |
|
import torch
|
|
|
# Load the model directly from Hugging Face |
|
model = DistilBertForSequenceClassification.from_pretrained("saytes/SoT_DistilBERT") |
|
tokenizer = DistilBertTokenizer.from_pretrained("saytes/SoT_DistilBERT") |
|
|
|
# Define label mapping |
|
label_mapping = { |
|
"chunked_symbolism": 0, |
|
"conceptual_chaining": 1, |
|
"expert_lexicons": 2 |
|
} |
|
|
|
# Function to classify questions |
|
def classify_question(question):
    # Tokenize and run a single forward pass (no gradients needed at inference)
    inputs = tokenizer(question, return_tensors="pt", truncation=True, padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    predicted_class = torch.argmax(outputs.logits, dim=1).item()

    # Reverse mapping to get the paradigm name
    label_mapping_reverse = {v: k for k, v in label_mapping.items()}
    return label_mapping_reverse[predicted_class]
|
|
|
# Example usage |
|
question = "Alice has 5 apples. She gives 3 apples to Bob. How many apples does Alice have?" |
|
paradigm = classify_question(question) |
|
print(f"Recommended paradigm: {paradigm}") # Output: "chunked_symbolism" |
|
``` |
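
If you also need the classifier's confidence over the three paradigms, you can apply a softmax to the logits. Below is a minimal sketch that reuses the `model`, `tokenizer`, and `label_mapping` defined above; the helper name `paradigm_probabilities` is illustrative and not part of this repository.

```python
import torch.nn.functional as F

def paradigm_probabilities(question):
    """Return a {paradigm_name: probability} dict for a question."""
    inputs = tokenizer(question, return_tensors="pt", truncation=True, padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    probs = F.softmax(logits, dim=1).squeeze(0)
    return {name: probs[idx].item() for name, idx in label_mapping.items()}

print(paradigm_probabilities("Alice has 5 apples. She gives 3 apples to Bob. How many apples does Alice have?"))
```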
|
|
|
For easier integration, we also provide a complete Python package implementation. See the [GitHub repository](https://github.com/SimonAytes/SoT) or the "Complete Package" section below for details. |
|
|
|
## Model Description |
|
|
|
The SoT_DistilBERT model is a fine-tuned DistilBERT classifier trained to select the optimal reasoning paradigm for a given query based on the Sketch-of-Thought framework. |
|
|
|
### Training Data |
|
The model was trained on approximately 14,200 samples across various reasoning tasks, with each sample labeled using one of the three SoT paradigms. Labels were assigned using GPT-4o with a classification-specific prompt based on predefined heuristics. |
|
|
|
### Model Architecture |
|
- **Base model**: DistilBERT |
|
- **Training**: 5 epochs, batch size 64, learning rate 2e-5 |
|
- **Loss**: Cross-entropy |
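
For reference, these hyperparameters map directly onto a standard Hugging Face `Trainer` setup. The sketch below is an assumption about how such a fine-tune could be reproduced, not the original training script; the tiny inline dataset stands in for the ~14,200 labeled samples described above.

```python
from datasets import Dataset
from transformers import (
    DistilBertForSequenceClassification,
    DistilBertTokenizer,
    Trainer,
    TrainingArguments,
)

# Placeholder data; the real training set pairs questions with paradigm labels
# (0 = chunked_symbolism, 1 = conceptual_chaining, 2 = expert_lexicons).
train_data = Dataset.from_dict({
    "text": ["Alice has 5 apples. She gives 3 apples to Bob. How many apples does Alice have?"],
    "label": [0],
})

tokenizer = DistilBertTokenizer.from_pretrained("distilbert/distilbert-base-uncased")
model = DistilBertForSequenceClassification.from_pretrained(
    "distilbert/distilbert-base-uncased", num_labels=3
)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

train_data = train_data.map(tokenize, batched=True)

# Hyperparameters listed in this model card; cross-entropy is the default loss
# for DistilBertForSequenceClassification.
args = TrainingArguments(
    output_dir="sot_distilbert_finetune",
    num_train_epochs=5,
    per_device_train_batch_size=64,
    learning_rate=2e-5,
)

Trainer(model=model, args=args, train_dataset=train_data).train()
```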
|
|
|
## Complete Package |
|
|
|
For a more streamlined experience, we've developed the SoT Python package that handles paradigm selection, prompt management, and exemplar formatting: |
|
|
|
```python |
|
from sketch_of_thought import SoT |
|
|
|
# Initialize SoT |
|
sot = SoT() |
|
|
|
# Classify a question and get appropriate paradigm |
|
question = "Alice has 5 apples. She gives 3 apples to Bob. How many apples does Alice have?" |
|
paradigm = sot.classify_question(question) # Returns: 'chunked_symbolism' |
|
|
|
# Get initialized context with exemplars for the selected paradigm |
|
context = sot.get_initialized_context( |
|
paradigm=paradigm, |
|
question=question, |
|
format="llm", |
|
include_system_prompt=True |
|
) |
|
|
|
# Use with your LLM of choice |
|
``` |
|
|
|
## Example with Qwen2.5-7B |
|
|
|
Here's a complete example using Qwen2.5-7B-Instruct: |
|
|
|
```python |
|
from transformers import AutoModelForCausalLM, AutoTokenizer |
|
from sketch_of_thought import SoT |
|
|
|
# Initialize SoT |
|
sot = SoT() |
|
|
|
# Load Qwen model |
|
model_name = "Qwen/Qwen2.5-7B-Instruct" |
|
model = AutoModelForCausalLM.from_pretrained( |
|
model_name, |
|
torch_dtype="auto", |
|
device_map="auto" |
|
) |
|
tokenizer = AutoTokenizer.from_pretrained(model_name) |
|
|
|
# Prepare the question |
|
prompt = "Alice has 5 apples. She gives 3 apples to Bob. How many apples does Alice have?" |
|
|
|
# Classify and get appropriate context |
|
paradigm = sot.classify_question(prompt) |
|
messages = sot.get_initialized_context( |
|
paradigm, |
|
prompt, |
|
format="llm", |
|
include_system_prompt=True |
|
) |
|
|
|
# Format for the model |
|
text = tokenizer.apply_chat_template( |
|
messages, |
|
tokenize=False, |
|
add_generation_prompt=True |
|
) |
|
model_inputs = tokenizer([text], return_tensors="pt").to(model.device) |
|
|
|
# Generate response |
|
generated_ids = model.generate( |
|
**model_inputs, |
|
max_new_tokens=512 |
|
) |
|
generated_ids = [ |
|
output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids) |
|
] |
|
|
|
# Decode response |
|
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0] |
|
print(response) |
|
``` |
|
|
|
**Output:** |
|
|
|
``` |
|
<think> |
|
A = 5 |
|
A -= 3 |
|
A = 2 |
|
</think> |
|
|
|
\boxed{2} |
|
``` |
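
The final answer is wrapped in a `\boxed{...}` marker. If you need to extract it programmatically, a small regex sketch (not part of the SoT package) is enough:

```python
import re

def extract_boxed_answer(text):
    """Return the content of the last \\boxed{...} marker, or None if absent."""
    matches = re.findall(r"\\boxed\{([^}]*)\}", text)
    return matches[-1] if matches else None

print(extract_boxed_answer(response))  # "2"
```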
|
|
|
## Supported Formats |
|
|
|
The SoT package supports multiple output formats: |
|
|
|
- `"llm"`: Standard chat format for text-only LLMs |
|
- `"vlm"`: Multimodal format for vision-language models |
|
- `"raw"`: Raw exemplars without formatting |
|
|
|
|
|
|
|
<details> |
|
<summary>What's the difference?</summary> |
|
|
|
### LLM Format |
|
|
|
Standard `messages` format for Large Language Models. |
|
|
|
```python |
|
[ |
|
{ |
|
"role": "system", |
|
"content": "SYSTEM_PROMPT_HERE" |
|
}, |
|
{ |
|
"role": "user", |
|
"content": "EXAMPLE_QUESTION_HERE" |
|
}, |
|
{ |
|
"role": "assistant", |
|
"content": "EXAMPLE_ANSWER_HERE" |
|
}, |
|
{ |
|
"role": "user", |
|
"content": "USER_QUESTION_HERE" |
|
} |
|
] |
|
``` |
|
|
|
### VLM Format |
|
|
|
Standard `messages` format for Large Vision-Language Models. |
|
|
|
```python |
|
[ |
|
{ |
|
"role": "system", |
|
"content": "SYSTEM_PROMPT_HERE" |
|
}, |
|
{ |
|
"role": "user", |
|
"content": [{"type": "text", "text": "EXAMPLE_QUESTION_HERE"}] |
|
}, |
|
{ |
|
"role": "assistant", |
|
"content": [{"type": "text", "text": "EXAMPLE_ANSWER_HERE"}] |
|
}, |
|
{ |
|
"role": "user", |
|
"content": [{"type": "text", "text": "USER_QUESTION_HERE"}] |
|
} |
|
] |
|
``` |
|
|
|
### Raw Format |
|
|
|
Raw exemplar data. Apply your own format! |
|
|
|
```python |
|
[ |
|
{ |
|
"question": "EXAMPLE_QUESTION_HERE", |
|
"answer": "EXAMPLE_ANSWER_HERE" |
|
}, |
|
{ |
|
"question": "EXAMPLE_QUESTION_HERE", |
|
"answer": "EXAMPLE_ANSWER_HERE" |
|
} |
|
] |
|
``` |
|
</details> |
|
|
|
## Multilingual Support |
|
|
|
SoT supports multiple languages. System prompts and exemplars are automatically loaded in the requested language. |
|
|
|
## Paradigm Selection Model |
|
|
|
SoT includes a pretrained DistilBERT model for automatic paradigm selection based on the question. The model is available on Hugging Face: [saytes/SoT_DistilBERT](https://huggingface.co/saytes/SoT_DistilBERT) |
|
|
|
## Datasets |
|
|
|
The SoT_DistilBERT model was evaluated on the following datasets: |
|
|
|
| Dataset | HF ID | Subset | Split | Evaluation Type | |
|
|---------|-------|--------|-------|----------------| |
|
| GSM8K | [gsm8k](https://huggingface.co/datasets/gsm8k) | main | test | numerical | |
|
| SVAMP | [ChilleD/SVAMP](https://huggingface.co/datasets/ChilleD/SVAMP) | - | test | numerical | |
|
| AQUA-RAT | [aqua_rat](https://huggingface.co/datasets/aqua_rat) | - | test | multiple_choice | |
|
| DROP | [drop](https://huggingface.co/datasets/drop) | - | validation | open | |
|
| OpenBookQA | [openbookqa](https://huggingface.co/datasets/openbookqa) | - | test | multiple_choice |
|
| StrategyQA | [ChilleD/StrategyQA](https://huggingface.co/datasets/ChilleD/StrategyQA) | - | test | yesno | |
|
| LogiQA | [lucasmccabe/logiqa](https://huggingface.co/datasets/lucasmccabe/logiqa) | default | test | multiple_choice | |
|
| ReClor | [metaeval/reclor](https://huggingface.co/datasets/metaeval/reclor) | - | validation | multiple_choice |
|
| HotpotQA | [hotpot_qa](https://huggingface.co/datasets/hotpot_qa) | distractor | validation | open |
|
| MuSiQue-Ans | [dgslibisey/MuSiQue](https://huggingface.co/datasets/dgslibisey/MuSiQue) | - | validation | open | |
|
| QASC | [allenai/qasc](https://huggingface.co/datasets/allenai/qasc) | - | validation | multiple_choice | |
|
| Worldtree | [nguyen-brat/worldtree](https://huggingface.co/datasets/nguyen-brat/worldtree) | - | train | multiple_choice | |
|
| PubMedQA | [qiaojin/PubMedQA](https://huggingface.co/datasets/qiaojin/PubMedQA) | pqa_labeled | train | yesno | |
|
| MedQA | [bigbio/med_qa](https://huggingface.co/datasets/bigbio/med_qa) | med_qa_en_source | validation | multiple_choice | |
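
Each of these can be loaded with the Hugging Face `datasets` library using the subset and split listed above. For example, the GSM8K test split used for numerical evaluation:

```python
from datasets import load_dataset

# Subset ("main") and split ("test") as listed in the table above.
gsm8k_test = load_dataset("openai/gsm8k", "main", split="test")
print(gsm8k_test[0]["question"])
```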
|
|
|
## Limitations |
|
|
|
- The model is trained to classify questions into one of three predefined paradigms and may not generalize to tasks outside the training distribution. |
|
- Performance may vary depending on the complexity and domain of the question. |
|
|
|
## Citation |
|
|
|
If you find our work helpful, please cite: |
|
|
|
``` |
|
@misc{aytes2025sot, |
|
title={Sketch-of-Thought: Efficient LLM Reasoning with Adaptive Cognitive-Inspired Sketching}, |
|
author={Simon A. Aytes and Jinheon Baek and Sung Ju Hwang}, |
|
year={2025}, |
|
eprint={2503.05179}, |
|
archivePrefix={arXiv}, |
|
primaryClass={cs.CL}, |
|
url={https://hf.co/papers/2503.05179}, |
|
} |
|
``` |
|
|
|
## License |
|
|
|
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details. |