Commit effad5d
Parent(s): b762f14
[add]: base model card description

Files changed: README.md (+180 -52), requirements.txt (+3 -1)
README.md
CHANGED
@@ -3,57 +3,201 @@ library_name: peft
 base_model: tiiuae/falcon-7b-instruct
 ---

-# Model Card for
-
-<!-- Provide a longer summary of what this model is. -->
-
-- **Funded by [optional]:** [More Information Needed]
-- **Shared by [optional]:** [More Information Needed]
-- **Model type:** [More Information Needed]
-- **Language(s) (NLP):** [More Information Needed]
-- **License:** [More Information Needed]
-- **Finetuned from model [optional]:** [More Information Needed]
-
-### Model Sources [optional]

 ## Uses

 ### Direct Use

 ## Bias, Risks, and Limitations

@@ -168,22 +312,6 @@ Carbon emissions can be estimated using the [Machine Learning Impact calculator]

 [More Information Needed]

-## Citation [optional]
-
-<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
-
-**BibTeX:**
-
-[More Information Needed]
-
-**APA:**
-
-[More Information Needed]
-
-## Glossary [optional]
-
-<!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->

 [More Information Needed]

 ## More Information [optional]
base_model: tiiuae/falcon-7b-instruct
---

# Model Card for the Query Parser LLM using Falcon-7B-Instruct

[Python 3.9](https://www.python.org/downloads/release/python-390/)

EmbeddingStudio is an [open-source framework](https://github.com/EulerSearch/embedding_studio/tree/main) that allows you to transform a joint "Embedding Model + Vector DB" into
a full-cycle search engine: collect clickstream -> improve search experience -> adapt the embedding model, and repeat, out of the box.

It is a rare case that a company uses unstructured search as is. By searching `brick red houses san francisco area for april`,
the user definitely wants to find houses in San Francisco for a month-long rent in April, and only then maybe brick-red ones.
Unfortunately, as of January 15th, 2024, there is no embedding model that accurate, so companies need to mix structured and unstructured search.

The very first step of mixing is to parse the search query. The usual approaches are:
* Implement a bunch of rules, regexps, or grammar parsers (like the [NLTK grammar parser](https://www.nltk.org/howto/grammar.html)).
* Collect search queries and annotate a dataset for a NER task.

It takes some time to do, but in the end you can get a controllable and very accurate query parser.
The EmbeddingStudio team decided to dive into LLM instruct fine-tuning for the `Zero-Shot query parsing` task
to close this gap while a company has no rules and no collected data, or even to eliminate exhausting rules implementation altogether in the future.

The main idea is to align an LLM to parse short search queries knowing just the company market and the schema of search filters. Moreover, being oriented toward applied NLP,
we are trying to serve only light-weight LLMs, a.k.a. `not heavier than 7B parameters`.

## Model Details

### Model Description

This is [Falcon-7B-Instruct](https://huggingface.co/tiiuae/falcon-7b-instruct) aligned to follow instructions like:

````markdown
### System: Master in Query Analysis
### Instruction: Organize queries in JSON, adhere to schema, verify spelling.
#### Category: Logistics and Supply Chain Management
#### Schema: ```[{"Name": "Customer_Ratings", "Representations": [{"Name": "Exact_Rating", "Type": "float", "Examples": [4.5, 3.2, 5.0, "4.5", "Unstructured"]}, {"Name": "Minimum_Rating", "Type": "float", "Examples": [4.0, 3.0, 5.0, "4.5"]}, {"Name": "Star_Rating", "Type": "int", "Examples": [4, 3, 5], "Enum": [1, 2, 3, 4, 5]}]}, {"Name": "Date", "Representations": [{"Name": "Day_Month_Year", "Type": "str", "Examples": ["01.01.2024", "15.06.2023", "31.12.2022", "25.12.2021", "20.07.2024", "15.06.2023"], "Pattern": "dd.mm.YYYY"}, {"Name": "Day_Name", "Type": "str", "Examples": ["Monday", "Wednesday", "Friday", "Thursday", "Monday", "Tuesday"], "Enum": ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]}]}, {"Name": "Date_Period", "Representations": [{"Name": "Specific_Period", "Type": "str", "Examples": ["01.01.2024 - 31.01.2024", "01.06.2023 - 30.06.2023", "01.12.2022 - 31.12.2022"], "Pattern": "dd.mm.YYYY - dd.mm.YYYY"}, {"Name": "Month", "Type": "str", "Examples": ["January", "June", "December"], "Enum": ["January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December"]}, {"Name": "Quarter", "Type": "str", "Examples": ["Q1", "Q2", "Q3"], "Enum": ["Q1", "Q2", "Q3", "Q4"]}, {"Name": "Season", "Type": "str", "Examples": ["Winter", "Summer", "Autumn"], "Enum": ["Winter", "Spring", "Summer", "Autumn"]}]}, {"Name": "Destination_Country", "Representations": [{"Name": "Country_Name", "Type": "str", "Examples": ["United States", "Germany", "China"]}, {"Name": "Country_Code", "Type": "str", "Examples": ["US", "DE", "CN"]}, {"Name": "Country_Abbreviation", "Type": "str", "Examples": ["USA", "GER", "CHN"]}]}]```
#### Query: Which logistics companies in the US have a perfect 5.0 rating ?
### Response:
[{"Value": "Which logistics companies in the US have a perfect 5.0 rating?", "Name": "Correct"}, {"Name": "Customer_Ratings.Exact_Rating", "Value": 5.0}, {"Name": "Destination_Country.Country_Code", "Value": "US"}]
````

**Important:** Additionally, we are trying to fine-tune the Large Language Model (LLM) not only to parse unstructured search queries but also to correct spelling.

- **Developed by the EmbeddingStudio team:**
  * Aleksandr Iudaev | [LinkedIn](https://www.linkedin.com/in/alexanderyudaev/) | [Email](mailto:[email protected]) |
  * Andrei Kostin | [LinkedIn](https://www.linkedin.com/in/andrey-kostin/) | [Email](mailto:[email protected]) |
  * ML Doom | `AI Assistant`
- **Funded by:** the EmbeddingStudio team
- **Model type:** Instruct Fine-Tuned Large Language Model
- **Model task:** Zero-shot search query parsing
- **Language(s) (NLP):** English
- **License:** apache-2.0
- **Finetuned from model:** [Falcon-7B-Instruct](https://huggingface.co/tiiuae/falcon-7b-instruct)
- **Maximal sequence length:** we used 1024 for fine-tuning, which differs significantly from the original model's `max_seq_length = 2048`
- **Tuning epochs:** 3 for now, but there will be more later.

**Disclaimer:** As a small startup, this direction forms a part of our Minimum Viable Product (MVP). It's more of
an attempt to test the 'product-market fit' than a well-structured scientific endeavor. Once we have validated it and raised a round, we definitely will:
* Curate a specific dataset for more precise analysis.
* Explore various approaches and Large Language Models (LLMs) to identify the most effective solution.
* Publish a detailed paper to ensure our findings and methodologies can be thoroughly reviewed and verified.

We acknowledge the complexity involved in utilizing Large Language Models, particularly in the context
of `Zero-Shot search query parsing` and `AI Alignment`. Given the intricate nature of this technology, we emphasize the importance of rigorous verification.
Until our work is thoroughly reviewed, we recommend being cautious and critical of the results.

### Model Sources

- **Repository:** the model inference code will be [here](https://github.com/EulerSearch/embedding_studio/tree/main)
- **Paper:** Work In Progress
- **Demo:** Work In Progress

## Uses

We strongly recommend only direct usage of this fine-tuned version of [Falcon-7B-Instruct](https://huggingface.co/tiiuae/falcon-7b-instruct):
* Zero-shot search query parsing with a provided company market name and filters schema
* Search query spell correction

For any other needs the behaviour of the model is unpredictable; please utilize the [original model](https://huggingface.co/tiiuae/falcon-7b-instruct) or fine-tune your own.

### Instruction format

````markdown
### System: Master in Query Analysis
### Instruction: Organize queries in JSON, adhere to schema, verify spelling.
#### Category: {your_company_category}
#### Schema: ```{filters_schema}```
#### Query: {query}
### Response:
````

The filters schema is a JSON-readable line in the following format (we highly recommend using it): a list of filters (dict), where each filter has:
* Name - the name of the filter (better to be meaningful).
* Representations - a list of possible filter formats (dict):
  * Name - the name of the representation (better to be meaningful).
  * Type - a Python base type (int, float, str, bool).
  * Examples - a list of examples.
  * Enum - if a representation is an enumeration, provide a list of possible values; the LLM should map the parsed value into this list.
  * Pattern - if a representation is pattern-like (datetime, regexp, etc.), provide a pattern text in any format.

Example:
```json
[{"Name": "Customer_Ratings", "Representations": [{"Name": "Exact_Rating", "Type": "float", "Examples": [4.5, 3.2, 5.0, "4.5", "Unstructured"]}, {"Name": "Minimum_Rating", "Type": "float", "Examples": [4.0, 3.0, 5.0, "4.5"]}, {"Name": "Star_Rating", "Type": "int", "Examples": [4, 3, 5], "Enum": [1, 2, 3, 4, 5]}]}, {"Name": "Date", "Representations": [{"Name": "Day_Month_Year", "Type": "str", "Examples": ["01.01.2024", "15.06.2023", "31.12.2022", "25.12.2021", "20.07.2024", "15.06.2023"], "Pattern": "dd.mm.YYYY"}, {"Name": "Day_Name", "Type": "str", "Examples": ["Monday", "Wednesday", "Friday", "Thursday", "Monday", "Tuesday"], "Enum": ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]}]}, {"Name": "Date_Period", "Representations": [{"Name": "Specific_Period", "Type": "str", "Examples": ["01.01.2024 - 31.01.2024", "01.06.2023 - 30.06.2023", "01.12.2022 - 31.12.2022"], "Pattern": "dd.mm.YYYY - dd.mm.YYYY"}, {"Name": "Month", "Type": "str", "Examples": ["January", "June", "December"], "Enum": ["January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December"]}, {"Name": "Quarter", "Type": "str", "Examples": ["Q1", "Q2", "Q3"], "Enum": ["Q1", "Q2", "Q3", "Q4"]}, {"Name": "Season", "Type": "str", "Examples": ["Winter", "Summer", "Autumn"], "Enum": ["Winter", "Spring", "Summer", "Autumn"]}]}, {"Name": "Destination_Country", "Representations": [{"Name": "Country_Name", "Type": "str", "Examples": ["United States", "Germany", "China"]}, {"Name": "Country_Code", "Type": "str", "Examples": ["US", "DE", "CN"]}, {"Name": "Country_Abbreviation", "Type": "str", "Examples": ["USA", "GER", "CHN"]}]}]
```
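For programmatic use, a schema like the one above can be built as plain Python structures and serialized into the required single line with `json.dumps`; a minimal sketch (the filter below is an illustrative subset, not the schema shipped with the model):

```python
import json

# Illustrative filters schema: a list of filter dicts, each with Representations
filters_schema = [
    {
        "Name": "Customer_Ratings",
        "Representations": [
            {"Name": "Exact_Rating", "Type": "float", "Examples": [4.5, 3.2, 5.0]},
            {"Name": "Star_Rating", "Type": "int", "Examples": [4, 3, 5],
             "Enum": [1, 2, 3, 4, 5]},
        ],
    },
]

# Serialize to the single JSON-readable line that goes into the `#### Schema:` slot
schema_line = json.dumps(filters_schema)

# The line round-trips back into the same structure
assert json.loads(schema_line) == filters_schema
```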

As a result, the response will be a JSON-readable line in the format:
```json
[{"Value": "Corrected search phrase", "Name": "Correct"}, {"Name": "filter-name.representation", "Value": "some-value"}]
```

Field and representation names will be aligned with the provided schema. Example:
```json
[{"Value": "Which logistics companies in the US have a perfect 5.0 rating?", "Name": "Correct"}, {"Name": "Customer_Ratings.Exact_Rating", "Value": 5.0}, {"Name": "Destination_Country.Country_Code", "Value": "US"}]
```
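Downstream code then needs to separate the spell-corrected query from the filter values. A minimal sketch, assuming the response line has already been parsed into a Python list (`split_response` is a hypothetical helper, not part of the model's API):

```python
def split_response(items: list) -> tuple:
    """Separate the spell-corrected query from the parsed filter values."""
    corrected = None
    filters = {}
    for item in items:
        if item.get("Name") == "Correct":
            corrected = item["Value"]
        else:
            # Filter keys look like "filter-name.representation"
            filters[item["Name"]] = item["Value"]
    return corrected, filters


response = [
    {"Value": "Which logistics companies in the US have a perfect 5.0 rating?", "Name": "Correct"},
    {"Name": "Customer_Ratings.Exact_Rating", "Value": 5.0},
    {"Name": "Destination_Country.Country_Code", "Value": "US"},
]
corrected, filters = split_response(response)
```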

The `system` phrases used for fine-tuning:
```python
[
    "Expert at Deconstructing Search Queries",
    "Master in Query Analysis",
    "Premier Search Query Interpreter",
    "Advanced Search Query Decoder",
    "Search Query Parsing Genius",
    "Search Query Parsing Wizard",
    "Unrivaled Query Parsing Mechanism",
    "Search Query Parsing Virtuoso",
    "Query Parsing Maestro",
    "Ace of Search Query Structuring"
]
```

The `instruction` phrases used for fine-tuning:
```python
[
    "Convert queries to JSON, align with schema, ensure correct spelling.",
    "Analyze and structure queries in JSON, maintain schema, check spelling.",
    "Organize queries in JSON, adhere to schema, verify spelling.",
    "Decode queries to JSON, follow schema, correct spelling.",
    "Parse queries to JSON, match schema, spell correctly.",
    "Transform queries to structured JSON, align with schema and spelling.",
    "Restructure queries in JSON, comply with schema, accurate spelling.",
    "Rearrange queries in JSON, strict schema adherence, maintain spelling.",
    "Harmonize queries with JSON schema, ensure spelling accuracy.",
    "Efficient JSON conversion of queries, schema compliance, correct spelling."
]
```
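One plausible way to combine these lists during fine-tuning (an assumption about the training setup, not stated above) is to sample one `system` and one `instruction` phrase per training example to diversify the prompt header:

```python
import random

# Short illustrative subsets of the phrase lists above
SYSTEM_PHRASES = ["Master in Query Analysis", "Search Query Parsing Wizard"]
INSTRUCTION_PHRASES = [
    "Organize queries in JSON, adhere to schema, verify spelling.",
    "Parse queries to JSON, match schema, spell correctly.",
]


def sample_prompt_header() -> str:
    """Pick a random system/instruction pair to vary the prompt header."""
    system = random.choice(SYSTEM_PHRASES)
    instruction = random.choice(INSTRUCTION_PHRASES)
    return f"### System: {system}\n### Instruction: {instruction}"


header = sample_prompt_header()
```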

### Direct Use

```python
import json
from json import JSONDecodeError

from transformers import AutoTokenizer, AutoModelForCausalLM

INSTRUCTION_TEMPLATE = """
### System: Master in Query Analysis
### Instruction: Organize queries in JSON, adhere to schema, verify spelling.
#### Category: {0}
#### Schema: ```{1}```
#### Query: {2}
### Response:
"""


def parse(
    query: str,
    company_category: str,
    filter_schema: dict,
    model: AutoModelForCausalLM,
    tokenizer: AutoTokenizer
):
    # Fill the instruction template with the category, schema, and raw query
    input_text = INSTRUCTION_TEMPLATE.format(
        company_category,
        json.dumps(filter_schema),
        query
    )
    input_ids = tokenizer.encode(input_text, return_tensors='pt')

    # Generate with a low temperature to keep the JSON output stable
    output = model.generate(
        input_ids.to('cuda'),
        max_new_tokens=1024,
        do_sample=True,
        temperature=0.05,
        pad_token_id=tokenizer.eos_token_id
    )
    # Keep only the text after the final '### Response:' marker and parse it
    try:
        parsed = json.loads(
            tokenizer.decode(output[0], skip_special_tokens=True).split('### Response:')[-1].strip()
        )
    except JSONDecodeError:
        parsed = dict()

    return parsed
```
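The decode-and-parse step at the end of `parse` can be exercised in isolation on a plain string; a sketch with a hypothetical generated text (`extract_response` is an illustrative helper, not part of the released code):

```python
import json


def extract_response(generated_text: str):
    """Return the JSON payload that follows the last '### Response:' marker."""
    payload = generated_text.split('### Response:')[-1].strip()
    try:
        return json.loads(payload)
    except json.JSONDecodeError:
        return {}


# Hypothetical generation output: prompt echo followed by the model's answer
generated = (
    "### System: Master in Query Analysis\n"
    "#### Query: houses in san francisco\n"
    "### Response:\n"
    '[{"Value": "houses in San Francisco", "Name": "Correct"}]'
)
parsed = extract_response(generated)
```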

## Bias, Risks, and Limitations

[More Information Needed]

[More Information Needed]

## More Information [optional]

requirements.txt
CHANGED

@@ -1,9 +1,11 @@
+bitsandbytes==0.41.0
 datasets==2.16.1
 nltk==3.8.1
 huggingface-hub==0.19.4
+peft==0.5.0
 torch==2.0.0+cu117
 torchmetrics==1.2.0
 torchsummary==1.5.1
 torchtext==0.15.0+cpu
 transformers==4.36.2
-trl==0.7.7
+trl==0.7.7