# DuckDB-NSQL

Numbers Station Text to SQL model for DuckDB.

NSQL is a family of autoregressive, open-source foundation models (FMs) designed specifically for SQL generation tasks. This repository introduces DuckDB-NSQL, an FM tailored for local DuckDB SQL analytics tasks. All model weights are available on HuggingFace.

| Model Name                            | Size | Link |
| ------------------------------------- | ---- | ---- |
| motherduckdb/DuckDB-NSQL-7B-v0.1      | 7B   | [link](https://huggingface.co/motherduckdb/DuckDB-NSQL-7B-v0.1) |
| motherduckdb/DuckDB-NSQL-7B-v0.1-GGUF | 7B   | [link](https://huggingface.co/motherduckdb/DuckDB-NSQL-7B-v0.1-GGUF) |

## Setup

To install all the necessary dependencies, run:

```
pip install -r requirements.txt
```

## Usage

The `examples/` folder shows how to connect to a local DuckDB database and query your data directly; it includes a simple notebook for reference.

To load the model with llama.cpp (through the `llama-cpp-python` bindings), run:

```python
# Import necessary modules
from llama_cpp import Llama
from wurlitzer import pipes

# Set up the client with the model path and context size.
# pipes() captures llama.cpp's C-level stdout/stderr while the model loads.
with pipes() as (out, err):
    client = Llama(
        model_path="DuckDB-NSQL-7B-v0.1-q8_0.gguf",
        n_ctx=2048,
    )
```

To load the DuckDB database and query against it, run:

```python
# Import necessary modules
import duckdb
from utils import generate_sql

# Connect to the DuckDB database
con = duckdb.connect("nyc.duckdb")

# Sample question for SQL generation
question = "alter taxi table and add struct column with name test and keys a:int, b:double"

# Generate SQL, check validity, and print
# (`client` is the Llama instance created in the previous snippet)
sql = generate_sql(question, con, client)
print(sql)
```

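The `generate_sql` helper is imported from the repository's `utils` module, and its implementation is not shown here. As a rough illustration only, a helper of this kind might read the schema from the connection, build a prompt, and ask the model for a completion. In the sketch below, `get_schema`, `build_prompt`, `sketch_generate_sql`, the `max_tokens` default, and the prompt wording are all assumptions made for illustration, not the repository's actual code. It also skips the validity check mentioned above.

```python
def get_schema(con):
    """Return the CREATE TABLE statements of all tables as one string."""
    # duckdb_tables() is a DuckDB metadata function; its `sql` column
    # holds the CREATE TABLE statement of each table.
    rows = con.execute("SELECT sql FROM duckdb_tables()").fetchall()
    return "\n".join(row[0] for row in rows)


def build_prompt(schema, question):
    """Assemble a text-to-SQL prompt (the wording here is an assumption)."""
    return (
        "### Instruction:\n"
        "Your task is to generate valid duckdb SQL to answer the following question.\n\n"
        "### Input:\n"
        "Here is the database schema that the SQL query will run on:\n"
        f"{schema}\n\n"
        "### Question:\n"
        f"{question}\n\n"
        "### Response (use duckdb shorthand if possible):\n"
    )


def sketch_generate_sql(question, con, client, max_tokens=400):
    """Hypothetical stand-in for utils.generate_sql; returns the SQL text."""
    prompt = build_prompt(get_schema(con), question)
    # A llama_cpp.Llama instance is callable and returns a dict whose
    # completion text is stored under choices[0]["text"].
    response = client(prompt, max_tokens=max_tokens, temperature=0.0)
    return response["choices"][0]["text"].strip()
```

With the `con` and `client` objects from the snippets above, `sketch_generate_sql(question, con, client)` would return the model's raw completion as a string.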
## Training Data

The training data for this model consists of two parts: 1) 200k synthetically generated DuckDB SQL queries, based on the DuckDB v0.9.2 documentation, and 2) labeled text-to-SQL pairs from [NSText2SQL](https://huggingface.co/datasets/NumbersStation/NSText2SQL), transpiled to DuckDB SQL using [sqlglot](https://github.com/tobymao/sqlglot).

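To illustrate the transpilation step, sqlglot's `transpile` function converts a query from one SQL dialect to another. The snippet below is a minimal sketch, not the pipeline used to build the training set; the `to_duckdb` name, the source dialect (`sqlite`), and the example query are assumptions.

```python
def to_duckdb(sql, source_dialect="sqlite"):
    """Transpile a SQL string to DuckDB SQL with sqlglot."""
    # Imported here so the function can be defined without sqlglot installed.
    import sqlglot

    # transpile() returns a list with one output string per input statement.
    return sqlglot.transpile(sql, read=source_dialect, write="duckdb")


if __name__ == "__main__":
    # The example query and source dialect are illustrative assumptions.
    for statement in to_duckdb("SELECT name FROM taxi LIMIT 5"):
        print(statement)
```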
## Evaluate the benchmark

See the `eval/` folder for details on evaluating the model against our proposed DuckDB benchmark.

## Acknowledgement

We would like to express our appreciation to all authors of the evaluation scripts. Their work made this project possible.