---
language:
- en
license: apache-2.0
tags:
- text-generation-inference
- transformers
- unsloth
- llama
- trl
base_model: llama-3-8b-bnb-4bit
---

# Uploaded Model

- **Developed by:** AlberBshara
- **License:** apache-2.0
- **Finetuned from model:** llama-3-8b-bnb-4bit

This Llama model was trained 2x faster with Unsloth and Hugging Face's TRL library.

I fine-tuned Llama-3-8B to perform the matching task in my Scholara virtual assistant: the model matches the given student information against the provided scholarship list (which comes from my vector DB and my AI web agent) and returns the scholarships that best fit the student's profile and preferences. The context window is 4K tokens.
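For illustration, both inputs are plain strings. The names and values below are purely hypothetical; they only sketch the kind of student profile and scholarship list the model expects:

```python
# Hypothetical example inputs (not real data) showing the expected string format.
student_info = (
    "Name: Example Student. Nationality: Jordan. "
    "Degree sought: MSc in Computer Science. GPA: 3.7/4.0. "
    "Preferences: fully funded scholarships in Europe."
)

scholarships = (
    "1) Example University Excellence Scholarship - full tuition, MSc programs, Europe.\n"
    "2) Sample STEM Grant - partial funding, engineering and CS students.\n"
    "3) Placeholder Global Award - living stipend only, any field."
)
```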
## Example Usage

The following example demonstrates how to use the model. Inference requires at least one NVIDIA L4 GPU.


```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from typing import Tuple
import torch

class ScholaraMatcher:
    def __init__(self, load_in_4bit: bool = True, 
                 load_cpu_mem_usage: bool = True,
                 hf_model_path: str = "AlberBshara/scholara_matching",
                 k: int = 2):
        """
        Args:
            load_in_4bit (bool): Use 4-bit quantization. Defaults to True.
            load_cpu_mem_usage (bool): Reduce CPU memory usage. Defaults to True.
            hf_model_path (str): The path of your model on HuggingFace-Hub like "your-user-name/model-name".
            k (int): The number of matched scholarships. Preferably [2 <= k <= 4].
        """
        assert torch.cuda.is_available(), "CUDA is not available. An NVIDIA GPU is required."
        assert any("L4" in torch.cuda.get_device_name(i) for i in range(torch.cuda.device_count())), \
            "An NVIDIA L4 GPU is required to initialize this class."
        
        # Specify the quantization config
        self._bnb_config = BitsAndBytesConfig(load_in_4bit=load_in_4bit)
        
        # Load model directly with quantization config 
        self._model = AutoModelForCausalLM.from_pretrained(
            hf_model_path,
            low_cpu_mem_usage=load_cpu_mem_usage,  
            quantization_config=self._bnb_config,  
        )
        
        # Load the tokenizer
        self._tokenizer = AutoTokenizer.from_pretrained(hf_model_path)
        self._hf_model_path = hf_model_path
        self._instruction = f"Based on the student details, select the best {k} scholarships for them only from the following given scholarships"
        self._EOS_TOKEN_ID = self._tokenizer.eos_token_id

        self._alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.
        
        ### Instruction:
        {}
        
        ### Input:
        {}
        
        ### Response:
        {}
        """

    def invoke(self, student_info: str, scholarships: str) -> Tuple:
        if not student_info or not student_info.strip():
            raise ValueError("student_info cannot be empty or None")

        if not scholarships or not scholarships.strip():
            raise ValueError("scholarships cannot be empty or None")
        
        inputs = f"student details: \n [{student_info}]. \n scholarships list: \n {scholarships}"
        inputs = self._tokenizer(
            [
                self._alpaca_prompt.format(
                    self._instruction,  # instruction
                    inputs,  # input
                    "",  # output - leave this blank for generation.
                )
            ], return_tensors="pt"
        ).to("cuda")
            
        input_ids = inputs['input_ids']
        attention_mask = inputs['attention_mask']

        # Pass the attention mask explicitly and stop on EOS.
        # max_new_tokens is an example value; adjust it to the expected answer length.
        output_ids = self._model.generate(
            input_ids,
            attention_mask=attention_mask,
            max_new_tokens=512,
            pad_token_id=self._EOS_TOKEN_ID,
        )

        output_text = self._tokenizer.decode(output_ids[0], skip_special_tokens=True)

        return output_text, output_ids, attention_mask, input_ids

    def extract_answer(self, output: torch.Tensor) -> str:
        """
        Returns the required answer after getting rid of the instruction and inputs. 
        """
        # Skip special tokens so the EOS token does not leak into the answer,
        # then keep only the text after the "### Response:" marker.
        decoded_outputs = self._tokenizer.batch_decode(output, skip_special_tokens=True)
        response_text = decoded_outputs[0].split("### Response:")[1].strip()
        
        return response_text
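
A minimal usage sketch, assuming the class above and the hypothetical `student_info` / `scholarships` strings shown earlier (the variable names and values are illustrative, not part of the model card):

```python
# Minimal usage sketch; assumes an NVIDIA L4 GPU is available.
matcher = ScholaraMatcher(load_in_4bit=True, k=2)

# student_info and scholarships are plain strings as illustrated above.
output_text, output_ids, attention_mask, input_ids = matcher.invoke(student_info, scholarships)

# Strip the instruction/input preamble and keep only the model's answer.
answer = matcher.extract_answer(output_ids)
print(answer)
```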