---
license: apache-2.0
---

# Adapters for Query Rewrite

## Model Summary

This is a family of adapters that are fine-tuned for the following query rewrite task:

Given a multi-turn conversation between a user and an AI assistant, decontextualize the last user utterance (query) by rewriting it (whenever necessary) into an equivalent version that is standalone and can be understood by itself.

While the adapters are general purpose, they are especially effective in RAG settings, where the ability to rewrite a user query into a standalone version directly improves retriever performance, which in turn improves answer generation performance. We have tested the adapters in RAG settings as well as on a specialized enterprise (non-RAG) setup, and their performance is significantly higher than that obtained by prompting out-of-the-box models, including open-source models such as gpt-oss as well as frontier models such as gpt-4o.

The adapters released here work with both the IBM granite-3.3 family of models (2b and 8b) and with gpt-oss-20b.

- **Developer:** IBM Research
- **License:** [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)

## Intended Use

The adapters give the ability to rewrite the last user query in a multi-turn conversation. Typically, the rewrite is a form of expansion that inlines into the query any implicit references to entities, concepts, or even parts of the conversation that occur in the previous turns (whether made by the user or by the AI assistant). Such expansion includes coreference resolution (i.e., replacing pronouns with the actual entities) as well as handling of ellipsis, the common linguistic phenomenon where parts of a sentence or phrase are omitted by the user but can be understood from the context (e.g., for whom, of what, with respect to something discussed above).

As a result of the expansion, the query becomes a standalone query that is still equivalent in meaning to what the user asked in the last turn. The rewritten query can be sent to downstream tasks (e.g., to a retriever in a RAG setting) as a better replacement for the original user query, without the need for a (potentially very long) context.

**Note**: Even though one main application of query rewrite is in RAG settings, this LoRA adapter can be used to rewrite user questions in other conversational use cases (e.g., to access a database, other APIs, or tools). As such, the adapter does not need any RAG documents (which may be present in the context in a RAG setting) and uses only the dialog turns, i.e., what is said between the user and the assistant.

**Model input**: The input to the model consists of:
- A list of conversational turns that can alternate between the `user` and `assistant` roles.
- The last `user` turn in the list is assumed to be the query that needs to be rewritten (if not already standalone).

## Quickstart Example Using Granite-Common

The simplest way to invoke a query rewrite adapter is through the *granite-common* framework (https://github.com/ibm-granite/granite-common). This framework abstracts away the lower-level details of calling the adapters and provides a notebook (https://github.com/ibm-granite/granite-common/blob/main/notebooks/intrinsics_openai.ipynb) that performs model inference against an OpenAI-compatible backend such as vLLM. The notebook is generic in the sense that it can be used with other intrinsics (e.g., answerability, hallucination detection, and others) in addition to query rewrite. It is customized to a particular intrinsic (here, query rewrite) by pointing it to the appropriate model name and vLLM server, and by supplying an input JSON file with a sample conversation and a YAML file with a minimal set of configuration options for query rewrite. See the details below. Also, see [ibm-granite/rag-intrinsics-lib/run_vllm.sh](https://huggingface.co/ibm-granite/rag-intrinsics-lib/blob/main/run_vllm.sh) for a shell script that instantiates a vLLM server with the appropriate adapters.

### Notebook Header
```python
# Change constants to point to the Query Rewrite intrinsic
intrinsic_name = "query_rewrite"
base_model_name = "granite-3.3-8b-instruct"

# Change the following two constants as needed to reflect the location of the inference
# server, which is assumed to have already loaded the adapter.
openai_base_url = "http://localhost:55555/v1"
openai_api_key = "rag_intrinsics_1234"
```

### Sample JSON Input (tests/granite_common/intrinsics/rag/testdata/input_json/query_rewrite.json)
```json
{
    "messages": [
        {
            "role": "assistant",
            "content": "Welcome to pet questions!"
        },
        {
            "role": "user",
            "content": "I have two pets, a dog named Rex and a cat named Lucy."
        },
        {
            "role": "assistant",
            "content": "Great, what would you like to share about them?"
        },
        {
            "role": "user",
            "content": "Rex spends a lot of time in the backyard and outdoors, and Lucy is always inside."
        },
        {
            "role": "assistant",
            "content": "Sounds good! Rex must love exploring outside, while Lucy probably enjoys her cozy indoor life."
        },
        {
            "role": "user",
            "content": "But is he more likely to get fleas because of that?"
        }
    ],
    "temperature": 0.0
}
```

### Default YAML Configuration (tests/granite_common/intrinsics/rag/testdata/input_yaml/query_rewrite.yaml)
```yaml
# Model name string, or null to use whatever is provided in the chat completion request
model: ~
# JSON schema of the model's output
response_format: |
  {
    "properties": {
      "rewritten_question": {
        "title": "Rewritten Question",
        "type": "string"
      }
    },
    "required": [
      "rewritten_question"
    ],
    "title": "QueryRewriteOutput",
    "type": "object"
  }
transformations: ~
instruction: ~
parameters:
  max_completion_tokens: 1024
sentence_boundaries: ~
```
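
If you prefer not to go through the notebook, the same OpenAI-compatible endpoint can also be called directly. The sketch below is only an illustration and does not use granite-common: it assumes the vLLM server started via `run_vllm.sh` registers the LoRA adapter under the model name `query_rewrite`, and it appends the rewrite instruction described in the next section as an extra `user` turn.

```python
import json
from openai import OpenAI

# Same endpoint constants as in the notebook header above
client = OpenAI(base_url="http://localhost:55555/v1", api_key="rag_intrinsics_1234")

# The sample conversation from the JSON input above
messages = [
    {"role": "assistant", "content": "Welcome to pet questions!"},
    {"role": "user", "content": "I have two pets, a dog named Rex and a cat named Lucy."},
    {"role": "assistant", "content": "Great, what would you like to share about them?"},
    {"role": "user", "content": "Rex spends a lot of time in the backyard and outdoors, and Lucy is always inside."},
    {"role": "assistant", "content": "Sounds good! Rex must love exploring outside, while Lucy probably enjoys her cozy indoor life."},
    {"role": "user", "content": "But is he more likely to get fleas because of that?"},
]

# Rewrite instruction appended as an extra user turn (see the next section for the exact wording)
instruction = (
    "Reword the above user query into a single utterance that doesn't need the prior "
    "conversation history to understand the user's intent. If the final utterance is a clear "
    "and standalone question, please DO NOT attempt to rewrite it, rather output the last user "
    "utterance as is. Your output should be a JSON structure with the rewritten question:\n\n"
    '{\n    "rewritten_question": "YOUR_REWRITTEN_QUESTION_HERE"\n}'
)

response = client.chat.completions.create(
    model="query_rewrite",  # assumed adapter name registered with the vLLM server
    messages=messages + [{"role": "user", "content": instruction}],
    temperature=0.0,
    max_tokens=1024,
)

# The adapter is trained to emit a JSON object with a "rewritten_question" field
print(json.loads(response.choices[0].message.content)["rewritten_question"])
```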

## Quickstart Example Using Hugging Face

A more involved alternative is to use a query rewrite adapter directly, instead of invoking it through granite-common. The invocation sequence is slightly more complex. The model expects:
1. The conversation history, formatted according to the particular model family (e.g., see granite below).
2. An optional rewrite instruction (including the expected JSON output format), appended as an additional `user` turn after the last user question in the conversation history.

The exact format for granite-3.x adapters:
```
<conversation history>
<|start_of_role|>user<|end_of_role|>Reword the above user query into a single utterance that doesn't need the prior
conversation history to understand the user's intent. If the final utterance is a clear
and standalone question, please DO NOT attempt to rewrite it, rather output the last user
utterance as is. Your output should be a JSON structure with the rewritten question:

{
    "rewritten_question": "YOUR_REWRITTEN_QUESTION_HERE"
}
<|end_of_text|>
```

**Model output**: When prompted with the above format, the model generates a JSON object containing a field with the actual rewritten question.

Use the code below to get started. It assumes the base model is ibm-granite/granite-3.x; a corresponding sketch for gpt-oss is included after the Granite example.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
import json, re

INSTRUCTION_TEXT = """Reword the above user query into a single utterance that doesn't need the prior
conversation history to understand the user's intent. If the final utterance is a clear
and standalone question, please DO NOT attempt to rewrite it, rather output the last user
utterance as is."""

JSON = """Your output should be a JSON structure with the rewritten question:

```json
{
    "rewritten_question": "YOUR_REWRITTEN_QUESTION_HERE"
}
```"""

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

BASE_NAME = "ibm-granite/granite-3.3-8b-instruct"
LORA_PATH = "my_models/ibm-granite/query_rewrite/lora/granite-3.3-8b-instruct"

tokenizer = AutoTokenizer.from_pretrained(BASE_NAME, padding_side='left', trust_remote_code=True)
model_base = AutoModelForCausalLM.from_pretrained(BASE_NAME, device_map='auto')
model_rewrite = PeftModel.from_pretrained(model_base, LORA_PATH)

# Input conversation
conv = [
    {
        "role": "user",
        "content": "Tim Cook is the CEO of Apple Inc."
    },
    {
        "role": "assistant",
        "content": "Yes, Tim Cook is the Chief Executive Officer of Apple Inc."
    },
    {
        "role": "user",
        "content": "and for Microsoft?"
    }
]

# Generate the query rewrite for the last turn in the above conversation
conv = [{"role": "system", "content": ""}] + conv
conversation_text = tokenizer.apply_chat_template(conv, tokenize=False)

# Add the instruction and the assistant prompt to the conversation string
user_instruction = f"<|start_of_role|>user<|end_of_role|>{INSTRUCTION_TEXT} {JSON}<|end_of_text|>\n"
generation_prompt = "<|start_of_role|>assistant<|end_of_role|>"
input_text = conversation_text + user_instruction + generation_prompt

inputs = tokenizer(input_text, return_tensors="pt")

output = model_rewrite.generate(inputs["input_ids"].to(device),
                                attention_mask=inputs["attention_mask"].to(device),
                                max_new_tokens=80)

# Decode only the newly generated tokens; the prompt itself contains the JSON template,
# which would otherwise also match the extraction regex below.
output_text = tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

# Regex pattern to extract the JSON with the rewrite from the output of the model
pattern = r'\{\s*"[^"]+"\s*:\s*"[^"]*"\s*\}'
match_js = re.findall(pattern, output_text)[0]

try:
    # Parse the JSON and extract the rewrite
    rewrite = json.loads(match_js)['rewritten_question']
except Exception as e:
    # Fall back to a simple string split if the JSON is malformed
    rewrite = match_js.split("\"rewritten_question\": ", 1)[1]

print(f"Rewrite: {rewrite}\n")
# Rewrite: Who is the CEO of Microsoft?
```
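
As noted above, a corresponding snippet for gpt-oss can be constructed; below is one possible sketch. It relies on the tokenizer's chat template instead of hand-built role tokens, the LoRA path is a placeholder, and whether the instruction should be appended as a separate `user` turn (as for Granite) should be verified against the gpt-oss adapter's configuration.

```python
import json, re
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel

BASE_NAME = "openai/gpt-oss-20b"
LORA_PATH = "my_models/query_rewrite/lora/gpt-oss-20b"   # placeholder; point this at your local adapter copy

REWRITE_INSTRUCTION = (
    "Reword the above user query into a single utterance that doesn't need the prior "
    "conversation history to understand the user's intent. If the final utterance is a clear "
    "and standalone question, please DO NOT attempt to rewrite it, rather output the last user "
    "utterance as is. Your output should be a JSON structure with the rewritten question:\n\n"
    '{\n    "rewritten_question": "YOUR_REWRITTEN_QUESTION_HERE"\n}'
)

tokenizer = AutoTokenizer.from_pretrained(BASE_NAME)
model_base = AutoModelForCausalLM.from_pretrained(BASE_NAME, torch_dtype="auto", device_map="auto")
model_rewrite = PeftModel.from_pretrained(model_base, LORA_PATH)

conv = [
    {"role": "user", "content": "Tim Cook is the CEO of Apple Inc."},
    {"role": "assistant", "content": "Yes, Tim Cook is the Chief Executive Officer of Apple Inc."},
    {"role": "user", "content": "and for Microsoft?"},
    # Rewrite instruction appended as an extra user turn, mirroring the Granite example
    {"role": "user", "content": REWRITE_INSTRUCTION},
]

# The chat template handles the gpt-oss-specific role formatting
input_ids = tokenizer.apply_chat_template(conv, add_generation_prompt=True, return_tensors="pt").to(model_rewrite.device)

output = model_rewrite.generate(input_ids, max_new_tokens=256)
generated_text = tokenizer.decode(output[0][input_ids.shape[1]:], skip_special_tokens=True)

# Extract and parse the JSON object with the rewrite, as in the Granite example
match_js = re.findall(r'\{\s*"[^"]+"\s*:\s*"[^"]*"\s*\}', generated_text)[0]
print(json.loads(match_js)["rewritten_question"])
```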

## Training Details

The training data contains two kinds of examples: 1) standalone examples, which teach the adapter to refrain from rewriting user questions that are already standalone, and 2) non-standalone examples, covering a diversity of patterns, which teach the adapter to expand the user turn so that it becomes standalone.

### Training Data

The training data uses the publicly available Cloud corpus of technical documentation pages from [MT-RAG](https://arxiv.org/abs/2501.03468). Based on this corpus of documents, we constructed a dataset of high-quality, human-created conversations, where the last turn of each conversation comes in two versions: a non-standalone version and a corresponding standalone version. In addition, we also used a synthetically generated set of training examples to maximize diversity across a variety of patterns.

The training dataset is proprietary and was obtained in collaboration with a third-party company, which contracted the human annotators.

### Robustness to System Prompts

In a typical Retrieval-Augmented Generation (RAG) setup, different researchers or practitioners may use various system prompts tailored to their specific use cases. To enhance the LoRA adapter's robustness against these variations, we generate three distinct versions of each training sample, each paired with a different system prompt. This expanded and diversified training dataset is then used to train the LoRA adapters, improving their ability to handle diverse prompt styles effectively.

System prompts used:

- **Version 1:** `<|start_of_role|>system<|end_of_role|> Knowledge Cutoff Date: April 2024. Today's Date: May 20, 2025. You are Granite, developed by IBM. You are a helpful AI assistant. <|end_of_text|>`

- **Version 2:** `<|start_of_role|>system<|end_of_role|> Knowledge Cutoff Date: April 2024. Today's Date: May 20, 2025. You are Granite, developed by IBM. Write the response to the user's input by strictly aligning with the facts in the provided documents. If the information needed to answer the question is not available in the documents, inform the user that the question cannot be answered based on the available data.<|end_of_text|>`

- **Version 3:** An empty system prompt (no instructions provided).

This approach ensures that our LoRA adapters remain effective and reliable across varying system prompt formats commonly encountered in real-world RAG applications.
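
A minimal sketch of this augmentation step, shown purely for illustration (the sample schema and variable names below are assumptions, not the actual training pipeline):

```python
# Illustration only: pair every training sample with each of the three system prompts above.
SYSTEM_PROMPT_V1 = "..."  # Version 1 text above
SYSTEM_PROMPT_V2 = "..."  # Version 2 text above
SYSTEM_PROMPTS = [SYSTEM_PROMPT_V1, SYSTEM_PROMPT_V2, ""]  # "" corresponds to the empty Version 3

def augment_with_system_prompts(samples):
    """Return three copies of each sample, one per system prompt variant (assumed sample schema)."""
    augmented = []
    for sample in samples:  # e.g., {"messages": [...], "target_rewrite": "..."}
        for system_prompt in SYSTEM_PROMPTS:
            augmented.append({
                "messages": [{"role": "system", "content": system_prompt}] + sample["messages"],
                "target_rewrite": sample["target_rewrite"],
            })
    return augmented
```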

#### Training Hyperparameters

The LoRA adapters were fine-tuned using PEFT with the following regime: rank = 32, learning rate = 3e-6, number of epochs = 25, and a linear learning rate scheduler.
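
These hyperparameters map onto a PEFT/Transformers setup roughly as sketched below; the target modules, LoRA alpha/dropout, batch size, and choice of trainer are not specified in this card and are shown only as placeholder assumptions.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, TrainingArguments

base = AutoModelForCausalLM.from_pretrained("ibm-granite/granite-3.3-8b-instruct", device_map="auto")

# Values stated above: rank 32, learning rate 3e-6, 25 epochs, linear LR schedule.
lora_config = LoraConfig(
    r=32,                                 # LoRA rank (stated above)
    lora_alpha=32,                        # assumption: not stated in this card
    lora_dropout=0.05,                    # assumption: not stated in this card
    target_modules=["q_proj", "v_proj"],  # assumption: not stated in this card
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_config)

training_args = TrainingArguments(
    output_dir="query_rewrite_lora",
    learning_rate=3e-6,                   # stated above
    num_train_epochs=25,                  # stated above
    lr_scheduler_type="linear",           # stated above
    per_device_train_batch_size=8,        # assumption: not stated in this card
)
# `model` and `training_args` can then be passed to a supervised fine-tuning trainer.
```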

## Evaluation

We provide two types of evaluation: 1) an evaluation specific to the query rewrite task, and 2) an end-to-end evaluation in a RAG setting, where we evaluate how query rewrite impacts a) the retriever performance and b) the final answer generation.

### Evaluation Specific to the Query Rewrite Task

Here we evaluate the quality of the rewritten queries themselves on an internal enterprise dataset. This dataset consists of two-turn conversations, where the last user turn may or may not be standalone. We have gold rewritten queries and also use an LLM judge (with Llama-3.3-70b as the model) with a specific prompt to check whether the model-generated rewrite is equivalent to the gold rewrite. This is a challenging benchmark, with the specific requirement that the models change the query minimally, making only the additions needed to render it standalone (if it is not already standalone). If the changes are not minimal, the judge penalizes the rewritten query.

<img src="image-1.png" alt="Query rewrite quality of the LoRA adapters compared to prompted baselines" width="60%">

In the chart above, towards the left, we show the performance of the LoRA adapters for three base models: gpt-oss-20b, granite-3.3-8b-instruct, and granite-3.3-2b-instruct. We show how these compare to the versions where we simply prompt the same three base models with just the query rewrite instruction. Towards the right, we show results when we prompt two frontier models (gpt-4o and gpt-4o-mini) as well as the larger gpt-oss-120b, using just the query rewrite instruction. We also include, in the middle, the performance of an out-of-the-box granite base model prompted with a custom prompt that contains 13 examples of query rewrites in addition to the query rewrite instruction. All of these are dominated by the fine-tuned LoRA adapters (even by the smaller 2b version). Furthermore, the performance of the adapters is quite stable across the two families of models (gpt-oss and ibm-granite) and across the different model sizes.

### Evaluation of Retriever

We evaluate Recall@k on the [MT-RAG](https://arxiv.org/abs/2501.03468) benchmark under various query rewrite strategies for the retriever. All retrieved passages are obtained using the Elser retriever with the same settings as in the above paper. In addition to the LoRA adapter, we include several other baselines, including no rewrite (where we send the last user turn to the retriever as-is), Mixtral rewrites, and gold rewrites (human-created).

We evaluate on three different testsets:
- a) the full MT-RAG dataset (842 data points with last user turns);
- b) the non-standalone subset of MT-RAG, i.e., the 260 (out of 842) last user turns that were annotated by humans as non-standalone (they depend on the prior context);
- c) the standalone subset of MT-RAG, i.e., the complementary subset containing all the last user turns that were annotated by humans as standalone.
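
For clarity, the sketch below shows one straightforward way to compute Recall@k per query (the fraction of gold-relevant passages that appear among the top-k retrieved passages), averaged over queries. It is an illustration of the metric, not the exact evaluation code behind the numbers below.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of gold-relevant passages found among the top-k retrieved passages for one query."""
    if not relevant_ids:
        return 0.0
    top_k = set(retrieved_ids[:k])
    return len(top_k & set(relevant_ids)) / len(relevant_ids)

def mean_recall_at_k(results, k):
    """Average Recall@k over queries; `results` is a list of (retrieved_ids, relevant_ids) pairs."""
    scores = [recall_at_k(retrieved, relevant, k) for retrieved, relevant in results]
    return sum(scores) / len(scores)
```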

a. Evaluation of Recall@k on the full MT-RAG dataset.

| Strategy                    | Recall@5 | Recall@10 | Recall@20 |
| --------------------------- | -------- | --------- | --------- |
| No rewrite                  | 0.486    | 0.587     | 0.665     |
| Mixtral 8x7b rewrite        | 0.522    | 0.642     | 0.720     |
| Gold rewrite                | 0.563    | 0.674     | 0.747     |
| Granite 3.3-8b LoRA rewrite | 0.563    | 0.682     | 0.762     |

b. Evaluation of Recall@k on the non-standalone subset of MT-RAG.

| Strategy                    | Recall@5 | Recall@10 | Recall@20 |
| --------------------------- | -------- | --------- | --------- |
| No rewrite                  | 0.263    | 0.338     | 0.435     |
| Mixtral 8x7b rewrite        | 0.362    | 0.488     | 0.574     |
| Gold rewrite                | 0.479    | 0.582     | 0.662     |
| Granite 3.3-8b LoRA rewrite | 0.445    | 0.556     | 0.648     |

c. Evaluation of Recall@k on the standalone subset of MT-RAG.

| Strategy                    | Recall@5 | Recall@10 | Recall@20 |
| --------------------------- | -------- | --------- | --------- |
| No rewrite                  | 0.609    | 0.723     | 0.792     |
| Mixtral 8x7b rewrite        | 0.613    | 0.733     | 0.809     |
| Gold rewrite                | 0.609    | 0.723     | 0.792     |
| Granite 3.3-8b LoRA rewrite | 0.628    | 0.751     | 0.824     |

If we focus on the Recall@20 numbers, as one instance of the metric, there is an overall 9.7 percentage point jump when using query rewrite with the Granite 3.3-8b LoRA adapter versus the no-rewrite strategy. This jump is more pronounced on the non-standalone subset, where query rewrite with the Granite 3.3-8b LoRA adapter leads to a more than 21 percentage point improvement over the no-rewrite strategy. We can also observe that the numbers with the LoRA rewrites are very close to those obtained with the gold rewrites on non-standalones, and slightly better than the gold rewrites on standalones: human annotators were instructed to leave the query unchanged when classifying it as standalone, whereas the LoRA adapter may still perform some rewriting, which turns out to further improve recall.

### Evaluation of Answer Generation

We evaluate answer generation quality with the top-k passages retrieved under the various query rewrite strategies for the retriever. We choose k = 20 here, but similar trends hold for other values of k. We used Granite-3.3-8b-instruct as the answer generator and report three answer quality metrics: [RAGAS](https://arxiv.org/abs/2309.15217) Faithfulness on the answerable subset of MT-RAG, [JAFS](https://arxiv.org/abs/2504.11704), which rewards the model for correctly abstaining on unanswerable queries (full credit) and for providing faithful answers on answerable queries (partial credit based on RAGAS Faithfulness), and the [RAD-Bench](https://arxiv.org/abs/2409.12558) score. We use the same three testsets as above.
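
To make the JAFS scoring rule above concrete, here is a per-example sketch that follows the description given here (full credit for correct abstention on unanswerable queries, RAGAS-Faithfulness-based partial credit on answerable ones). The handling of the remaining cases and the dataset-level averaging are our assumptions for illustration; the authoritative definition is in the JAFS paper.

```python
def jafs_score(answerable, abstained, faithfulness):
    """Per-example JAFS sketch following the description above.

    answerable:   whether the query can be answered from the retrieved documents
    abstained:    whether the model declined to answer
    faithfulness: RAGAS Faithfulness of the generated answer (0.0-1.0), used when the model answers
    """
    if not answerable:
        return 1.0 if abstained else 0.0   # full credit for correct abstention (0.0 otherwise: assumption)
    if abstained:
        return 0.0                         # assumption: no credit for abstaining on an answerable query
    return faithfulness                    # partial credit based on RAGAS Faithfulness

def jafs(examples):
    """Dataset-level score: average over examples (assumption)."""
    scores = [jafs_score(e["answerable"], e["abstained"], e.get("faithfulness", 0.0)) for e in examples]
    return sum(scores) / len(scores)
```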

a. Evaluation of answer quality on the full MT-RAG dataset.

| Strategy                    | RAGAS-F (Answerable Subset) | RAD-Bench | JAFS  |
| --------------------------- | --------------------------- | --------- | ----- |
| No rewrite                  | 0.793                       | 0.678     | 0.664 |
| Mixtral 8x7b rewrite        | 0.780                       | 0.679     | 0.682 |
| Gold rewrite                | 0.810                       | 0.686     | 0.670 |
| Granite 3.3-8b LoRA rewrite | 0.874                       | 0.698     | 0.722 |

b. Evaluation of answer quality on the non-standalone subset of MT-RAG.

| Strategy                    | RAGAS-F (Answerable Subset) | RAD-Bench | JAFS  |
| --------------------------- | --------------------------- | --------- | ----- |
| No rewrite                  | 0.695                       | 0.618     | 0.581 |
| Mixtral 8x7b rewrite        | 0.776                       | 0.644     | 0.627 |
| Gold rewrite                | 0.786                       | 0.661     | 0.634 |
| Granite 3.3-8b LoRA rewrite | 0.865                       | 0.669     | 0.700 |

c. Evaluation of answer quality on the standalone subset of MT-RAG.

| Strategy                    | RAGAS-F (Answerable Subset) | RAD-Bench | JAFS  |
| --------------------------- | --------------------------- | --------- | ----- |
| No rewrite                  | 0.845                       | 0.710     | 0.708 |
| Mixtral 8x7b rewrite        | 0.854                       | 0.697     | 0.710 |
| Gold rewrite                | 0.845                       | 0.710     | 0.708 |
| Granite 3.3-8b LoRA rewrite | 0.880                       | 0.713     | 0.734 |

As with Recall, similar observations can be made here. Specifically, on the full dataset, we see an 8.1 percentage point jump in RAGAS Faithfulness (from 0.793 to 0.874), a 2 percentage point jump in RAD-Bench score (from 0.678 to 0.698), and a 5.8 percentage point jump in JAFS (from 0.664 to 0.722) when using query rewrite with the Granite 3.3-8b LoRA adapter versus the no-rewrite strategy. The improvement is more pronounced on the non-standalone subset, where query rewrite with the Granite 3.3-8b LoRA adapter leads to a 17 percentage point jump in RAGAS Faithfulness (from 0.695 to 0.865), a 5.1 percentage point jump in RAD-Bench score (from 0.618 to 0.669), and an 11.9 percentage point jump in JAFS (from 0.581 to 0.700).

## Contact

[Lucian Popa](mailto:[email protected])

### Framework versions

- PEFT 0.14.0