---
title: Template-free prompt construction
description: "Template-free prompt construction with the `input_output` format"
---

- [Background](#background)
    - [Masking Inputs](#masking-inputs)
    - [You may not want prompt templates](#you-may-not-want-prompt-templates)
    - [The `input_output` format](#the-input_output-format)
- [Usage](#usage)
    - [1. Prepare Data](#1-prepare-data)
    - [2. Use `type: input_output`](#2-use-type-input_output)
    - [3. Check the prompts](#3-check-the-prompts)
					
						
<a id="markdown-background" name="background"></a>

## Background

<a id="markdown-masking-inputs" name="masking-inputs"></a>

### Masking Inputs

One of the most popular features of
[axolotl](https://github.com/OpenAccess-AI-Collective/axolotl) is
setting the following configuration value:

```yaml
train_on_inputs: false
```

If you declare a [dataset format](https://github.com/OpenAccess-AI-Collective/axolotl?tab=readme-ov-file#dataset)
such as `alpaca` or `chatml`, axolotl knows which parts are inputs
(i.e. from the human) and which are outputs (i.e. from the assistant), and it masks the input
labels so that your model can focus on predicting the outputs only.
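
To make "masking" concrete, here is a minimal pure-Python sketch of a loss that skips masked positions. It assumes the common convention (used by PyTorch's `CrossEntropyLoss` default `ignore_index`) that a label of `-100` is ignored. The function name `masked_nll` and the probabilities are made up for illustration, and this is not axolotl's actual training code.

```python
import math

IGNORE_INDEX = -100  # labels with this value are skipped by the loss


def masked_nll(token_probs, labels):
    """Average negative log-likelihood over tokens whose label is not IGNORE_INDEX.

    token_probs[i] is the probability the model assigned to the correct token
    at position i (hypothetical numbers in the usage below).
    """
    kept = [p for p, label in zip(token_probs, labels) if label != IGNORE_INDEX]
    if not kept:
        return 0.0
    return -sum(math.log(p) for p in kept) / len(kept)


probs = [0.5, 0.25, 0.5, 0.5]
labels = [-100, -100, 42, 7]  # the first two tokens are masked inputs
loss = masked_nll(probs, labels)
print(round(loss, 4))  # 0.6931
```

Only the last two positions contribute, so the poorly predicted masked token (probability 0.25) has no effect on the loss.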
					
						
<a id="markdown-you-may-not-want-prompt-templates" name="you-may-not-want-prompt-templates"></a>

### You may not want prompt templates

However, there are many situations where you don't want to use one of
these formats or templates (I usually don't!). This is because they can:

-   Add unnecessary boilerplate to your prompts.
-   Create artifacts like special delimiters `<|im_start|>` that can
    quickly become footguns if you don't include them correctly at
    inference time.
-   Enforce a *chat* interface when you do not want one. Sometimes you
    just want to fine-tune a model to a very specific task and do NOT
    want multi-turn conversations, roles, etc.
-   Limit you to only certain roles that the template allows.
					
						
<a id="markdown-the-inputoutput-format" name="the-inputoutput-format"></a>

### The `input_output` format

You can construct your prompts without a template by setting
`type: input_output` in your configuration file like this:

**config.yml**

```yaml
train_on_inputs: false # Mask segments of your data
datasets:
  - path: output.jsonl
    type: input_output  # use template free prompt construction
```

Unlike `type: completion`, which is also template-free,
`type: input_output` allows you to mask segments of your text. More
details on how this works are described below.
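
As a rough mental model, segment masking can be sketched like this. Note the assumptions: `toy_tokenize` is a hypothetical stand-in that emits one id per character, `build_example` is a name I made up, and the `train_on_inputs` switch reflects my reading of the config comment above rather than axolotl's actual implementation.

```python
IGNORE_INDEX = -100  # label value that the training loss skips


def toy_tokenize(text):
    """Hypothetical stand-in for a real tokenizer: one token id per character."""
    return [ord(ch) for ch in text]


def build_example(segments, train_on_inputs=False):
    """Concatenate segments as-is and mask the ones whose label is false."""
    input_ids, labels = [], []
    for segment in segments:
        ids = toy_tokenize(segment["text"])
        input_ids.extend(ids)
        if segment["label"] or train_on_inputs:
            labels.extend(ids)  # trained on: label equals the token id
        else:
            labels.extend([IGNORE_INDEX] * len(ids))  # masked out
    return {"input_ids": input_ids, "labels": labels}


example = build_example([
    {"label": True, "text": "hi "},
    {"label": False, "text": "bye "},
    {"label": True, "text": "ok"},
])
print(example["labels"])
# [104, 105, 32, -100, -100, -100, -100, 111, 107]
```

Every token still appears in `input_ids`, so the model sees the masked text as context. It just isn't penalized for failing to predict it.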
					
						
<a id="markdown-usage" name="usage"></a>

## Usage

This is how you can use the `input_output` format:

<a id="markdown-1-prepare-data" name="1-prepare-data"></a>

### 1. Prepare Data

To use the `input_output` format, collect your data in the following
format into a jsonl file (below is the first row from the file
`output.jsonl`, pretty printed):

```bash
$ head -n1 output.jsonl | python -m json.tool

{
    "segments": [
        {
            "label": true,
            "text": "<s>Hello\n"
        },
        {
            "label": true,
            "text": "hi there!. "
        },
        {
            "label": false,
            "text": "goodbye "
        },
        {
            "label": true,
            "text": "farewell</s>"
        }
    ]
}
```

Set `label: false` when you want to mask a segment of text so that the
model isn't trained on it. Some things to keep in mind:

> [!IMPORTANT]
> 1.  **EOS, BOS, spaces, newlines etc. are entirely up to you. Axolotl
>     concatenates all the segments as-is.** The tokenizer doesn't add
>     anything additional. Notice how I added spaces, newlines, `<s>`
>     (BOS), and `</s>` (EOS) myself.
> 2.  Make sure you check the materialized output to validate that the
>     prompt is getting assembled how you like.
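
If your data starts out in Python, one way to produce a file in this shape is sketched below. The helper name `write_segments_jsonl` is made up for illustration, and the example writes to a temporary directory. In practice you would write to whatever `path` your config points at (e.g. `output.jsonl`).

```python
import json
import tempfile
from pathlib import Path


def write_segments_jsonl(rows, path):
    """Write one JSON object per line, each holding a "segments" list."""
    with Path(path).open("w", encoding="utf-8") as f:
        for segments in rows:
            f.write(json.dumps({"segments": segments}) + "\n")


# BOS, EOS, spaces, and newlines are written explicitly into the text.
rows = [
    [
        {"label": True, "text": "<s>Hello\n"},
        {"label": True, "text": "hi there!. "},
        {"label": False, "text": "goodbye "},
        {"label": True, "text": "farewell</s>"},
    ],
]

out_path = Path(tempfile.mkdtemp()) / "output.jsonl"
write_segments_jsonl(rows, out_path)
```

`json.dumps` escapes the newline inside `"<s>Hello\n"`, so each example stays on a single line of the file.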
					
						
<a id="markdown-2-use-type-inputoutput" name="2-use-type-inputoutput"></a>

### 2. Use `type: input_output`

Let's materialize data with our `output.jsonl` file by setting
`type: input_output` in our axolotl config:

```yaml
# training_config.yaml
base_model: mistralai/Mistral-7B-v0.1
data_seed: 49
seed: 49

datasets:
  - path: output.jsonl
    type: input_output
val_set_size: 0.1

sequence_len: 896
sample_packing: false

micro_batch_size: 2
gradient_accumulation_steps: 3
eval_batch_size: 2
num_epochs: 1
learning_rate: 0.0002

train_on_inputs: false
special_tokens:
  bos_token: "<s>"
  eos_token: "</s>"
  unk_token: "<unk>"
```

You can use the following command to materialize your data. The
`--debug` flag will print the tokens, along with the labels so you can
verify that the correct items are being ignored:

```bash
$ python -m axolotl.cli.preprocess training_config.yaml --debug

...
[2024-03-05 23:36:46,969] [INFO] [axolotl.check_example_labels:35] [PID:607731] [RANK:0] <s>(1, 1) Hello(22557, 22557)
(13, 13) hi(12014, 12014) there(736, 736) !(28808, 28808) .(28723, 28723) (28705, 28705) good(-100, 1179) bye(-100, 17664) (-100, 28705) fare(19111, 19111) well(5458, 5458) </s>(2, 2)

```

The format is `decoded_token`(`label`, `token_id`). For example,
`<s>(1, 1)` means that the token is `<s>`, the label is `1` and the
token_id is `1`. When the label is `-100`, that token is ignored for
training.
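
For long examples, eyeballing this output gets tedious. Here is a small sketch that parses the `decoded_token(label, token_id)` entries with a regular expression. The pattern and the name `parse_debug_tokens` are my own, and the pattern assumes decoded tokens contain no whitespace or parentheses, so treat it as a starting point.

```python
import re

# One entry per token: decoded text, then "(label, token_id)".
TOKEN_PATTERN = re.compile(r"(\S*)\((-?\d+), (\d+)\)")


def parse_debug_tokens(text):
    """Return (token, label, token_id) tuples found in a --debug log excerpt."""
    return [
        (tok, int(label), int(token_id))
        for tok, label, token_id in TOKEN_PATTERN.findall(text)
    ]


excerpt = (
    "(28705, 28705) good(-100, 1179) bye(-100, 17664) "
    "(-100, 28705) fare(19111, 19111)"
)
parsed = parse_debug_tokens(excerpt)
masked = [tok for tok, label, _ in parsed if label == -100]
print(masked)  # ['good', 'bye', '']
```

The empty string in the result is the whitespace token (id `28705`), which decodes to nothing visible in the log.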
					
						
<a id="markdown-3-check-the-prompts" name="3-check-the-prompts"></a>

### 3. Check the prompts

Here is another way to check the materialized output:

```python
import os

import yaml
from datasets import load_from_disk
from transformers import AutoTokenizer

# List the prepared dataset directories written by the preprocess step.
directory = sorted(os.listdir('last_run_prepared/'))
with open('training_config.yaml', 'r') as f:
    cfg = yaml.safe_load(f)
model_id = cfg['base_model']
tok = AutoTokenizer.from_pretrained(model_id)
ds = load_from_disk(f'last_run_prepared/{directory[0]}/')
```

```python
>>> row = ds[0]
>>> print(tok.decode(row['input_ids']))
<s> Hello
hi there!.  goodbye  farewell</s>
```

We can check that the right tokens are ignored by comparing the labels
to each token:

```python
import pandas as pd
pd.DataFrame([{'token': tok.decode(i), 'label': l, 'id':i} for i,l in
              zip(row['input_ids'], row['labels'])])
```
					
						
|    | token  | label | id    |
|----|--------|-------|-------|
| 0  | \<s\>  | 1     | 1     |
| 1  | Hello  | 22557 | 22557 |
| 2  | \\n    | 13    | 13    |
| 3  | hi     | 12014 | 12014 |
| 4  | there  | 736   | 736   |
| 5  | !      | 28808 | 28808 |
| 6  | .      | 28723 | 28723 |
| 7  |        | 28705 | 28705 |
| 8  | good   | -100  | 1179  |
| 9  | bye    | -100  | 17664 |
| 10 |        | -100  | 28705 |
| 11 | fare   | 19111 | 19111 |
| 12 | well   | 5458  | 5458  |
| 13 | \</s\> | 2     | 2     |
					
						
If we look at the input data, the above table seems correct! (The jsonl
version is repeated below for reference):

```bash
$ head -n1 output.jsonl | python -m json.tool

{
    "segments": [
        {
            "label": true,
            "text": "<s>Hello\n"
        },
        {
            "label": true,
            "text": "hi there!. "
        },
        {
            "label": false,
            "text": "goodbye "
        },
        {
            "label": true,
            "text": "farewell</s>"
        }
    ]
}
```
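
If you'd rather not compare by eye, the same check can be sketched in code. The `row` below is hard-coded from the `--debug` output in step 2 so the snippet runs on its own. With real data you would pass `ds[0]` instead. The helper name `masked_positions` is made up for illustration.

```python
IGNORE_INDEX = -100

# Token ids and labels copied from the --debug output in step 2.
row = {
    "input_ids": [1, 22557, 13, 12014, 736, 28808, 28723, 28705,
                  1179, 17664, 28705, 19111, 5458, 2],
    "labels": [1, 22557, 13, 12014, 736, 28808, 28723, 28705,
               -100, -100, -100, 19111, 5458, 2],
}


def masked_positions(row):
    """Return indices of ignored tokens; fail if any other label differs from its id."""
    positions = []
    for i, (token_id, label) in enumerate(zip(row["input_ids"], row["labels"])):
        if label == IGNORE_INDEX:
            positions.append(i)
        elif label != token_id:
            raise ValueError(f"position {i}: label {label} != token id {token_id}")
    return positions


print(masked_positions(row))  # [8, 9, 10]
```

Positions 8 to 10 are exactly the tokens of the `"goodbye "` segment, the only one with `"label": false`.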
					
						