AlberBshara committed · verified
Commit 2f0b80c · 1 Parent(s): 05ef6d0

Update README.md

Files changed (1)
  1. README.md +95 -5
README.md CHANGED
@@ -8,15 +8,105 @@ tags:
  - unsloth
  - llama
  - trl
- base_model: unsloth/llama-3-8b-bnb-4bit
+ base_model: llama-3-8b
  ---

- # Uploaded model
+ # Uploaded Model

  - **Developed by:** AlberBshara
  - **License:** apache-2.0
- - **Finetuned from model :** unsloth/llama-3-8b-bnb-4bit
+ - **Finetuned from model:** llama-3-8b-bnb-4bit

- This llama model was trained 2x faster with [Unsloth](https://github.com/unslothai/unsloth) and Huggingface's TRL library.
+ This Llama model was trained 2x faster with [Unsloth](https://github.com/unslothai/unsloth) and Hugging Face's TRL library.

- [<img src="https://raw.githubusercontent.com/unslothai/unsloth/main/images/unsloth%20made%20with%20love.png" width="200"/>](https://github.com/unslothai/unsloth)
+ Here I fine-tuned Llama 3 8B to perform the matching task in my Scholara virtual assistant: it matches the given student information against the provided list of scholarships (retrieved from my vector DB and my AI web agent) and returns the scholarships that best fit the student's background and preferences.
+
+ ## Example Usage
+
+ The following example demonstrates how to use the model. Inference requires at least one NVIDIA L4 GPU.
+
+ ```python
+ from typing import Tuple
+
+ import torch
+ from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
+
+
+ class ScholaraMatcher:
+     def __init__(self, load_in_4bit: bool = True,
+                  load_cpu_mem_usage: bool = True,
+                  hf_model_path: str = "AlberBshara/scholara_matching",
+                  k: int = 2):
+         """
+         Args:
+             load_in_4bit (bool): Use 4-bit quantization. Defaults to True.
+             load_cpu_mem_usage (bool): Reduce CPU memory usage. Defaults to True.
+             hf_model_path (str): The path of the model on the Hugging Face Hub, like "your-user-name/model-name".
+             k (int): The number of matched scholarships. Preferably 2 <= k <= 4.
+         """
+         assert torch.cuda.is_available(), "CUDA is not available. An NVIDIA GPU is required."
+         assert any("L4" in torch.cuda.get_device_name(i) for i in range(torch.cuda.device_count())), \
+             "An NVIDIA L4 GPU is required to initialize this class."
+
+         # Quantization config (4-bit by default).
+         self._bnb_config = BitsAndBytesConfig(load_in_4bit=load_in_4bit)
+
+         # Load the model with the quantization config.
+         self._model = AutoModelForCausalLM.from_pretrained(
+             hf_model_path,
+             low_cpu_mem_usage=load_cpu_mem_usage,
+             quantization_config=self._bnb_config,
+         )
+
+         # Load the tokenizer.
+         self._tokenizer = AutoTokenizer.from_pretrained(hf_model_path)
+         self._hf_model_path = hf_model_path
+         self._instruction = f"Based on the student details, select the best {k} scholarships for them only from the following given scholarships"
+         self._EOS_TOKEN_ID = self._tokenizer.eos_token_id
+
+         self._alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.
+
+ ### Instruction:
+ {}
+
+ ### Input:
+ {}
+
+ ### Response:
+ {}
+ """
+
+     def invoke(self, student_info: str, scholarships: str) -> Tuple:
+         if not student_info or not student_info.strip():
+             raise ValueError("student_info cannot be empty or None")
+
+         if not scholarships or not scholarships.strip():
+             raise ValueError("scholarships cannot be empty or None")
+
+         inputs = f"student details: \n [{student_info}]. \n scholarships list: \n {scholarships}"
+         inputs = self._tokenizer(
+             [
+                 self._alpaca_prompt.format(
+                     self._instruction,  # instruction
+                     inputs,             # input
+                     "",                 # output - left blank for generation
+                 )
+             ], return_tensors="pt"
+         ).to("cuda")
+
+         input_ids = inputs["input_ids"]
+         attention_mask = inputs["attention_mask"]
+
+         output_ids = self._model.generate(input_ids, attention_mask=attention_mask, pad_token_id=self._EOS_TOKEN_ID)
+
+         output_text = self._tokenizer.decode(output_ids[0], skip_special_tokens=True)
+
+         return output_text, output_ids, attention_mask, input_ids
+
+     def extract_answer(self, output: torch.Tensor) -> str:
+         """
+         Returns only the generated answer, stripping the instruction and input sections.
+         """
+         decoded_outputs = self._tokenizer.batch_decode(output)
+         response_text = decoded_outputs[0].split("### Response:")[1].strip()
+
+         return response_text
+ ```
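
For reference, a minimal usage sketch of the `ScholaraMatcher` class above. The `student_info` and `scholarships` strings are illustrative placeholders; the free-text format shown is an assumption, not a format the model strictly requires.

```python
# Minimal usage sketch. Requires an NVIDIA L4 GPU, since __init__ asserts one.
# The example inputs below are illustrative placeholders.
matcher = ScholaraMatcher(k=2)

student_info = (
    "Undergraduate computer science student, GPA 3.7, "
    "looking for a fully funded master's scholarship abroad."
)
scholarships = (
    "1) Scholarship A: fully funded master's programs in Europe, GPA >= 3.5 required.\n"
    "2) Scholarship B: partial tuition waiver for engineering undergraduates.\n"
    "3) Scholarship C: research fellowship for PhD applicants in the life sciences."
)

output_text, output_ids, attention_mask, input_ids = matcher.invoke(student_info, scholarships)

# Keep only the model's answer, without the instruction/input scaffolding.
print(matcher.extract_answer(output_ids))
```

`invoke` returns the full decoded text along with the raw tensors, so `extract_answer` is applied to `output_ids` to isolate the text after the "### Response:" marker.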