---
license: apache-2.0
language:
- en
base_model:
- meta-llama/Llama-3.3-70B-Instruct
tags:
- function-calling
- tool-use
- llama
- bfcl
---

# QUANTIZATION INFORMATION

This model was quantized using the [llm-compressor](https://github.com/vllm-project/llm-compressor) library from the vLLM team.

The calibration dataset was [ultrachat_200k](https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k), with a sequence length of `4096` and a sample size of `1024`.

The quantization scheme is `W4A16` with the `lm_head` layer ignored.

Apart from a `dampening_frac` of `0.1` (see the recipe below), the remaining parameters were left at the llm-compressor defaults.


## QUANTIZATION CODE

The following code was used to quantize this model:

#### LOADING THE MODEL:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

MODEL_ID = "watt-ai/watt-tool-70B"

# Load model with better memory management
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
```

#### LOADING THE DATASET:
```python
from datasets import load_dataset

NUM_CALIBRATION_SAMPLES=1024
MAX_SEQUENCE_LENGTH=4096

# Load dataset.
ds = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft")
ds = ds.shuffle(seed=42).select(range(NUM_CALIBRATION_SAMPLES))

# Preprocess the data into the format the model is trained with.
def preprocess(example):
    return {"text": tokenizer.apply_chat_template(example["messages"], tokenize=False,)}
ds = ds.map(preprocess)

# Tokenize the data (be careful with bos tokens - we need add_special_tokens=False since the chat_template already added it).
def tokenize(sample):
    return tokenizer(sample["text"], padding=False, max_length=MAX_SEQUENCE_LENGTH, truncation=True, add_special_tokens=False)
ds = ds.map(tokenize, remove_columns=ds.column_names)
```

#### QUANTIZING THE MODEL:
```python
from llmcompressor.transformers import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier

# Configure the quantization algorithm to run.
recipe = GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"], dampening_frac=0.1)

# Apply quantization.
oneshot(
    model=model, dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
)

# Save to disk compressed.
SAVE_DIR = "models/" + MODEL_ID.split("/")[1] + "-GPTQ-INT4"
model.save_pretrained(SAVE_DIR, max_shard_size="4GB")
tokenizer.save_pretrained(SAVE_DIR)
```
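
As a rough sketch (not part of the original quantization run), the resulting `W4A16` checkpoint can typically be served with vLLM, which supports compressed-tensors models. The model path below is a placeholder for wherever the compressed weights were saved (e.g. the `SAVE_DIR` above) or for the published repository of this quantized model; a 70B model still requires substantial GPU memory and may need `tensor_parallel_size` set across several GPUs.

```python
from vllm import LLM, SamplingParams

# Placeholder path: point this at the saved compressed checkpoint (SAVE_DIR above)
# or at the Hugging Face repository hosting this quantized model.
llm = LLM(model="models/watt-tool-70B-GPTQ-INT4", max_model_len=4096)

sampling = SamplingParams(temperature=0.0, max_tokens=256)
outputs = llm.generate(["Hello, which tools can you call?"], sampling)
print(outputs[0].outputs[0].text)
```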

------


# watt-tool-70B

watt-tool-70B is a fine-tuned language model based on LLaMa-3.3-70B-Instruct, optimized for tool usage and multi-turn dialogue. It achieves state-of-the-art performance on the Berkeley Function-Calling Leaderboard (BFCL).

## Model Description

This model is specifically designed to excel at complex tool usage scenarios that require multi-turn interactions, making it ideal for empowering platforms like [Lupan](https://lupan.watt.chat), an AI-powered workflow building tool. By leveraging a carefully curated and optimized dataset, watt-tool-70B demonstrates superior capabilities in understanding user requests, selecting appropriate tools, and effectively utilizing them across multiple turns of conversation.

Target Application: AI Workflow Building as in [https://lupan.watt.chat/](https://lupan.watt.chat/) and [Coze](https://www.coze.com/).

## Key Features

*   **Enhanced Tool Usage:** Fine-tuned for precise and efficient tool selection and execution.
*   **Multi-Turn Dialogue:** Optimized for maintaining context and effectively utilizing tools across multiple turns of conversation, enabling more complex task completion.
*   **State-of-the-Art Performance:** Achieves top performance on the BFCL, demonstrating its capabilities in function calling and tool usage.
*   **Based on LLaMa-3.3-70B-Instruct:** Inherits the strong language understanding and generation capabilities of the base model.

## Training Methodology

watt-tool-70B is trained using supervised fine-tuning on a specialized dataset designed for tool usage and multi-turn dialogue. We use chain-of-thought (CoT) techniques to synthesize high-quality multi-turn dialogue data.

The training process is inspired by the principles outlined in the paper: ["Direct Multi-Turn Preference Optimization for Language Agents"](https://arxiv.org/abs/2406.14868).
We use SFT and DMPO to further enhance the model's performance in multi-turn agent tasks.

## How to Use

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
model_id = "watt-ai/watt-tool-70B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype='auto', device_map="auto")
# Example usage (adapt as needed for your specific tool usage scenario)
system_prompt = """You are an expert in composing functions. You are given a question and a set of possible functions. Based on the question, you will need to make one or more function/tool calls to achieve the purpose.
If none of the function can be used, point it out. If the given question lacks the parameters required by the function, also point it out.
You should only return the function call in tools call sections.
If you decide to invoke any of the function(s), you MUST put it in the format of [func_name1(params_name1=params_value1, params_name2=params_value2...), func_name2(params)]
You SHOULD NOT include any other text in the response.
Here is a list of functions in JSON format that you can invoke.\n{functions}\n
"""
# User query
query = "Find me the sales growth rate for company XYZ for the last 3 years and also the interest coverage ratio for the same duration."
tools = [
    {
        "name": "financial_ratios.interest_coverage", "description": "Calculate a company's interest coverage ratio given the company name and duration",
        "arguments": {
            "type": "dict",
            "properties": {
                "company_name": {
                    "type": "string",
                    "description": "The name of the company."
                }, 
                "years": {
                    "type": "integer",
                    "description": "Number of past years to calculate the ratio."
                }
            }, 
            "required": ["company_name", "years"]
        }
    },
    {
        "name": "sales_growth.calculate",
        "description": "Calculate a company's sales growth rate given the company name and duration",
        "arguments": {
            "type": "dict", 
            "properties": {
                "company": {
                    "type": "string",
                    "description": "The company that you want to get the sales growth rate for."
                }, 
                "years": {
                    "type": "integer",
                    "description": "Number of past years for which to calculate the sales growth rate."
                }
            }, 
            "required": ["company", "years"]
        }
    },
    {
        "name": "weather_forecast",
        "description": "Retrieve a weather forecast for a specific location and time frame.",
        "arguments": {
            "type": "dict",
            "properties": {
                "location": {
                    "type": "string",
                    "description": "The city that you want to get the weather for."
                }, 
                "days": {
                    "type": "integer",
                    "description": "Number of days for the forecast."
                }
            },
            "required": ["location", "days"]
        }
    }
]
messages = [
    {'role': 'system', 'content': system_prompt.format(functions=tools)},
    {'role': 'user', 'content': query}
]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
outputs = model.generate(inputs, max_new_tokens=512, do_sample=False, num_return_sequences=1, eos_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(outputs[0][len(inputs[0]):], skip_special_tokens=True))
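
# The model is expected to reply with a bracketed call list such as
# "[sales_growth.calculate(company='XYZ', years=3), financial_ratios.interest_coverage(company_name='XYZ', years=3)]".
# Below is a minimal, illustrative sketch (not part of the original card) of turning such a
# string into structured calls with Python's ast module; it assumes the model followed the
# format exactly, and real outputs may need more robust handling.
import ast

def parse_tool_calls(raw: str):
    """Parse a "[func(a=1, b='x'), ...]" style response into (name, kwargs) pairs."""
    calls = []
    tree = ast.parse(raw.strip(), mode="eval")
    for node in tree.body.elts:  # the response is a list literal of call expressions
        name = ast.unparse(node.func)  # handles dotted names like sales_growth.calculate
        kwargs = {kw.arg: ast.literal_eval(kw.value) for kw in node.keywords}
        calls.append((name, kwargs))
    return calls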