---
base_model:
- distilbert/distilbert-base-uncased
datasets:
- openai/gsm8k
- ChilleD/SVAMP
- deepmind/aqua_rat
- ucinlp/drop
- allenai/openbookqa
- ChilleD/StrategyQA
- lucasmccabe/logiqa
- metaeval/reclor
- hotpotqa/hotpot_qa
- dgslibisey/MuSiQue
- allenai/qasc
- nguyen-brat/worldtree
- qiaojin/PubMedQA
language:
- en
library_name: transformers
license: mit
tags:
- text-classification
- sketch-of-thought
- efficient-inference
---

# SoT_DistilBERT: Paradigm Selection Model for Sketch-of-Thought

[![License](https://img.shields.io/badge/License-MIT-blue.svg)](LICENSE)
[![Python](https://img.shields.io/badge/Python-3.10+-blue.svg)](https://www.python.org/downloads/)
[![PyTorch](https://img.shields.io/badge/PyTorch-2.0+-orange.svg)](https://pytorch.org/)
[![GitHub](https://img.shields.io/badge/GitHub-Repository-green)](https://github.com/SimonAytes/SoT)

## What is Sketch-of-Thought?

Sketch-of-Thought (SoT) is a prompting framework for efficient reasoning in language models. It combines cognitive-inspired reasoning paradigms with linguistic constraints to minimize output token usage while preserving reasoning accuracy.

Unlike conventional Chain-of-Thought (CoT) prompting, which produces verbose reasoning chains, SoT implements three distinct reasoning paradigms:

- **Conceptual Chaining**: Connects essential ideas in logical sequences through structured step links. Effective for commonsense reasoning, multi-hop inference, and fact-based recall tasks.
  
- **Chunked Symbolism**: Organizes numerical and symbolic reasoning into structured steps with equations, variables, and arithmetic operations. Excels in mathematical problems and technical calculations.
  
- **Expert Lexicons**: Leverages domain-specific shorthand, technical symbols, and jargon for precise and efficient communication. Suited for technical disciplines requiring maximum information density.
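
For instance, on a simple arithmetic word problem, a chunked-symbolism sketch reduces the reasoning to a few variable updates such as `A = 5; A -= 3; A = 2` (see the full Qwen2.5-7B generation example below).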


## Loading the Model

This repository contains the DistilBERT paradigm selection model for the Sketch-of-Thought (SoT) framework. You can load and use it directly with Hugging Face Transformers:

```python
from transformers import DistilBertTokenizer, DistilBertForSequenceClassification
import torch

# Load the model directly from Hugging Face
model = DistilBertForSequenceClassification.from_pretrained("saytes/SoT_DistilBERT")
tokenizer = DistilBertTokenizer.from_pretrained("saytes/SoT_DistilBERT")
model.eval()

# Map paradigm names to class indices (and back)
label_mapping = {
    "chunked_symbolism": 0,
    "conceptual_chaining": 1,
    "expert_lexicons": 2
}
label_mapping_reverse = {v: k for k, v in label_mapping.items()}

# Classify a question into one of the three SoT paradigms
def classify_question(question):
    inputs = tokenizer(question, return_tensors="pt", truncation=True, padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    predicted_class = torch.argmax(outputs.logits, dim=1).item()
    return label_mapping_reverse[predicted_class]

# Example usage
question = "Alice has 5 apples. She gives 3 apples to Bob. How many apples does Alice have?"
paradigm = classify_question(question)
print(f"Recommended paradigm: {paradigm}")  # Output: "chunked_symbolism"
```
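
Alternatively, the same checkpoint can be loaded through the `pipeline` API. Note that the label names it returns depend on the model's config and may be generic ids such as `LABEL_0`; if so, map them back using `label_mapping` above:

```python
from transformers import pipeline

# One-liner alternative to the manual setup above.
classifier = pipeline("text-classification", model="saytes/SoT_DistilBERT")

# May return e.g. [{"label": "LABEL_0", "score": 0.99}] if the config
# does not define human-readable id2label names.
print(classifier("What is 6 times 7?"))
```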

For easier integration, we also provide a complete Python package implementation. See the [GitHub repository](https://github.com/SimonAytes/SoT) or the "Complete Package" section below for details.

## Model Description

The SoT_DistilBERT model is a fine-tuned DistilBERT classifier trained to select the optimal reasoning paradigm for a given query based on the Sketch-of-Thought framework.

### Training Data
The model was trained on approximately 14,200 samples across various reasoning tasks, with each sample labeled using one of the three SoT paradigms. Labels were assigned using GPT-4o with a classification-specific prompt based on predefined heuristics.

### Model Architecture
- **Base model**: DistilBERT
- **Training**: 5 epochs, batch size 64, learning rate 2e-5
- **Loss**: Cross-entropy
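
For reference, these hyperparameters map onto a standard Hugging Face `Trainer` setup. The sketch below is illustrative only; the toy dataset and column names are assumptions, not the released training script (labels follow the `label_mapping` shown earlier):

```python
from datasets import Dataset
from transformers import (
    DistilBertForSequenceClassification,
    DistilBertTokenizerFast,
    Trainer,
    TrainingArguments,
)

# Toy stand-in for the ~14.2k GPT-4o-labeled samples (hypothetical data).
train_ds = Dataset.from_dict({
    "text": ["Alice has 5 apples. She gives 3 to Bob. How many are left?"],
    "label": [0],  # 0 = chunked_symbolism
})

tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")
model = DistilBertForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=3
)

train_ds = train_ds.map(
    lambda batch: tokenizer(batch["text"], truncation=True, padding="max_length"),
    batched=True,
)

# Hyperparameters from the model card: 5 epochs, batch size 64, lr 2e-5.
# Cross-entropy is the default loss for sequence classification heads.
args = TrainingArguments(
    output_dir="sot_distilbert",
    num_train_epochs=5,
    per_device_train_batch_size=64,
    learning_rate=2e-5,
)

Trainer(model=model, args=args, train_dataset=train_ds).train()
```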

## Complete Package

For a more streamlined experience, we've developed the SoT Python package that handles paradigm selection, prompt management, and exemplar formatting:

```python
from sketch_of_thought import SoT

# Initialize SoT
sot = SoT()

# Classify a question and get appropriate paradigm
question = "Alice has 5 apples. She gives 3 apples to Bob. How many apples does Alice have?"
paradigm = sot.classify_question(question)  # Returns: 'chunked_symbolism'

# Get initialized context with exemplars for the selected paradigm
context = sot.get_initialized_context(
    paradigm=paradigm, 
    question=question, 
    format="llm",
    include_system_prompt=True
)

# Use with your LLM of choice
```
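
The returned `context` is a standard `messages` list (see **Supported Formats** below), so it can be passed directly to `tokenizer.apply_chat_template` or any chat-style API, as the next example shows.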

## Example with Qwen2.5-7B

Here's a complete example using Qwen2.5-7B-Instruct:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from sketch_of_thought import SoT

# Initialize SoT
sot = SoT()

# Load Qwen model
model_name = "Qwen/Qwen2.5-7B-Instruct"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Prepare the question
prompt = "Alice has 5 apples. She gives 3 apples to Bob. How many apples does Alice have?"

# Classify and get appropriate context
paradigm = sot.classify_question(prompt)
messages = sot.get_initialized_context(
    paradigm,
    prompt,
    format="llm",
    include_system_prompt=True
)

# Format for the model
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Generate response
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512
)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

# Decode response
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
```

**Output:**

```
<think>
A = 5
A -= 3
A = 2
</think>

\boxed{2}
```

## Supported Formats

The SoT package supports multiple output formats:

- `"llm"`: Standard chat format for text-only LLMs
- `"vlm"`: Multimodal format for vision-language models
- `"raw"`: Raw exemplars without formatting



<details>
  <summary>What's the difference?</summary>
  
  ### LLM Format

  Standard `messages` format for Large Language Models.

  ```python
  [
    {
      "role": "system", 
      "content": "SYSTEM_PROMPT_HERE"
    },
    {
      "role": "user", 
      "content": "EXAMPLE_QUESTION_HERE"
    },
    {
      "role": "assistant", 
      "content": "EXAMPLE_ANSWER_HERE"
    },
    {
      "role": "user", 
      "content": "USER_QUESTION_HERE"
    }
  ]
  ```
  
  ### VLM Format

  Standard `messages` format for Large Vision-Language Models.
  
  ```python
  [
    {
      "role": "system", 
      "content": "SYSTEM_PROMPT_HERE"
    },
    {
      "role": "user", 
      "content": [{"type": "text", "text": "EXAMPLE_QUESTION_HERE"}]
    },
    {
      "role": "assistant", 
      "content": [{"type": "text", "text": "EXAMPLE_ANSWER_HERE"}]
    },
    {
      "role": "user", 
      "content": [{"type": "text", "text": "USER_QUESTION_HERE"}]
    }
  ]
  ```
  
  ### Raw Format

  Raw exemplar data. Apply your own format!

  ```python
  [
    {
      "question": "EXAMPLE_QUESTION_HERE",
      "answer": "EXAMPLE_ANSWER_HERE"
    },
    {
      "question": "EXAMPLE_QUESTION_HERE",
      "answer": "EXAMPLE_ANSWER_HERE"
    }
  ]
  ```
</details>
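
If you work with the raw format, you assemble the prompt yourself. A minimal sketch (only the `question`/`answer` keys come from the format above; the helper itself is hypothetical):

```python
def build_prompt(exemplars, user_question):
    # Few-shot prompt: Q/A exemplar pairs followed by the new question.
    parts = [f"Q: {ex['question']}\nA: {ex['answer']}" for ex in exemplars]
    parts.append(f"Q: {user_question}\nA:")
    return "\n\n".join(parts)
```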

## Multilingual Support

SoT supports multiple languages. System prompts and exemplars are automatically loaded in the requested language.

## Paradigm Selection Model

SoT includes a pretrained DistilBERT model for automatic paradigm selection based on the question. The model is available on Hugging Face: [saytes/SoT_DistilBERT](https://huggingface.co/saytes/SoT_DistilBERT)

## Datasets

The SoT_DistilBERT model was evaluated on the following datasets:

| Dataset | HF ID | Subset | Split | Evaluation Type |
|---------|-------|--------|-------|----------------|
| GSM8K | [gsm8k](https://huggingface.co/datasets/gsm8k) | main | test | numerical |
| SVAMP | [ChilleD/SVAMP](https://huggingface.co/datasets/ChilleD/SVAMP) | - | test | numerical |
| AQUA-RAT | [aqua_rat](https://huggingface.co/datasets/aqua_rat) | - | test | multiple_choice |
| DROP | [drop](https://huggingface.co/datasets/drop) | - | validation | open |
| OpenBookQA | [openbookqa](https://huggingface.co/datasets/openbookqa) | - | test | multiple_choice |
| StrategyQA | [ChilleD/StrategyQA](https://huggingface.co/datasets/ChilleD/StrategyQA) | - | test | yesno |
| LogiQA | [lucasmccabe/logiqa](https://huggingface.co/datasets/lucasmccabe/logiqa) | default | test | multiple_choice |
| ReClor | [metaeval/reclor](https://huggingface.co/datasets/metaeval/reclor) | - | validation | multiple_choice |
| HotpotQA | [hotpot_qa](https://huggingface.co/datasets/hotpot_qa) | distractor | validation | open |
| MuSiQue-Ans | [dgslibisey/MuSiQue](https://huggingface.co/datasets/dgslibisey/MuSiQue) | - | validation | open |
| QASC | [allenai/qasc](https://huggingface.co/datasets/allenai/qasc) | - | validation | multiple_choice |
| WorldTree | [nguyen-brat/worldtree](https://huggingface.co/datasets/nguyen-brat/worldtree) | - | train | multiple_choice |
| PubMedQA | [qiaojin/PubMedQA](https://huggingface.co/datasets/qiaojin/PubMedQA) | pqa_labeled | train | yesno |
| MedQA | [bigbio/med_qa](https://huggingface.co/datasets/bigbio/med_qa) | med_qa_en_source | validation | multiple_choice |

## Limitations

- The model is trained to classify questions into one of three predefined paradigms and may not generalize to tasks outside the training distribution.
- Performance may vary depending on the complexity and domain of the question.

## Citation

If you find our work helpful, please cite:

```
@misc{aytes2025sot,
      title={Sketch-of-Thought: Efficient LLM Reasoning with Adaptive Cognitive-Inspired Sketching}, 
      author={Simon A. Aytes and Jinheon Baek and Sung Ju Hwang},
      year={2025},
      eprint={2503.05179},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://hf.co/papers/2503.05179}, 
}
```

## License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.