# Setup

## Config
Set the token limits based on the numbers in [03-poe-token-count-exploration.ipynb](03-poe-token-count-exploration.ipynb). I like to leave a little buffer in case an explanation runs long.

In [1]:
INPUT_TOKENS = 300
OUTPUT_TOKENS = 1650

INPUT_DATASET = 'derek-thomas/labeled-multiple-choice-explained-falcon-tokenized'
OUTPUT_DATASET = 'derek-thomas/labeled-multiple-choice-explained-falcon-results'
BASE_MODEL = 'tiiuae/Falcon3-7B-Instruct'

# Setup
Here we create the pydantic models for each of our experiments. Note that because of how field names are specified in pydantic, we need to use an `alias` and `populate_by_name`. Since our `Final Answer` is always a letter from `a` to `h`, we can use an enumeration.

In [2]:
from pydantic import BaseModel, Field
from typing import List
from enum import Enum
import json


class FinalAnswerEnum(str, Enum):
    a = "a"
    b = "b"
    c = "c"
    d = "d"
    e = "e"
    f = "f"
    g = "g"
    h = "h"

class RFAModel(BaseModel):
    reasoning: str = Field(...)
    final_answer: FinalAnswerEnum = Field(...)

    class Config:
        populate_by_name = True
        
class FARModel(BaseModel):
    final_answer: FinalAnswerEnum = Field(...)
    reasoning: str = Field(...)

    class Config:
        populate_by_name = True
        
class FAModel(BaseModel):
    final_answer: FinalAnswerEnum = Field(...)

    class Config:
        populate_by_name = True

We generated lots of experiments in [derek-thomas/labeled-multiple-choice-explained-falcon-tokenized](https://huggingface.co/datasets/derek-thomas/labeled-multiple-choice-explained-falcon-tokenized/viewer?row=0). Now we will aggregate everything we need in `experiments` for convenience.

In [3]:

experiments = {
    'RFA-falcon': {
        'pydantic': RFAModel,
        "lora": "derek-thomas/falcon-v03-poe-RFA-falcon",
        "column": 'user_prompt_RFA',
    },
    'FAR-falcon': {
        'pydantic': FARModel,
        "lora": "derek-thomas/falcon-v03-poe-FAR-falcon",
        "column": 'user_prompt_FAR',
    },
    'RFA-gpt3-5': {
        'pydantic': RFAModel,
        "lora": "derek-thomas/falcon-v03-poe-RFA-gpt3-5",
        "column": 'user_prompt_RFA',
    },
    'FAR-gpt3-5': {
        'pydantic': FARModel,
        "lora": "derek-thomas/falcon-v03-poe-FAR-gpt3-5",
        "column": 'user_prompt_FAR',
    },
    'FA': {
        'pydantic': FAModel,
        "lora": "derek-thomas/falcon-v03-poe-FA",
        "column": 'user_prompt_FA',
    },
    'base': {
        'pydantic': FAModel,
        "lora": None,
        "column": 'user_prompt_FA',
    },
}

LORAS_STRING = ','.join([v['lora'] for _, v in experiments.items() if v and v.get('lora') is not None])
LORAS_STRING

'derek-thomas/falcon-v03-poe-RFA-falcon,derek-thomas/falcon-v03-poe-FAR-falcon,derek-thomas/falcon-v03-poe-RFA-gpt3-5,derek-thomas/falcon-v03-poe-FAR-gpt3-5,derek-thomas/falcon-v03-poe-FA'

In [4]:
from huggingface_hub import login, get_token

# Log in to your Hugging Face account
login()  

In [5]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL, token=get_token())

In [6]:
from datasets import load_dataset
import numpy as np

# Load dataset (test split)
dataset = load_dataset(INPUT_DATASET, split='test')
df = dataset.to_pandas()

columns_to_convert = [
    'user_prompt_RFA',
    'conversation_RFA_gpt3_5',
    'conversation_RFA_falcon',
    'user_prompt_FAR',
    'conversation_FAR_gpt3_5',
    'conversation_FAR_falcon',
    'user_prompt_FA',
    'conversation_FA'
]

# Convert specified columns from arrays to lists
for col in columns_to_convert:
    df[col] = df[col].apply(lambda x: x.tolist() if isinstance(x, (list, np.ndarray)) else x)


In [7]:
def is_valid_entry(entry):
    # Check if entry is a list with at least two elements
    if not isinstance(entry, list) or len(entry) < 2:
        print('here')
        return False
    # Check el0
    el0 = entry[0]
    if not (isinstance(el0, dict) and el0.get('role') == 'system' and isinstance(el0.get('content'), str)):
        print('system')
        return False
    # Check el1
    el1 = entry[1]
    if not (isinstance(el1, dict) and el1.get('role') == 'user' and isinstance(el1.get('content'), str)):
        print('user')
        return False
    return True
 

def check_column_structure(column):
    return column.apply(is_valid_entry)

# Apply checks
for col in ['user_prompt_RFA', 'user_prompt_FA', 'user_prompt_FAR']:
    print(f'{col}', sum(check_column_structure(df[col])))

user_prompt_RFA 1683
user_prompt_FA 1683
user_prompt_FAR 1683


# Evaluation

## Endpoint Configuration
I'm using 8 replicas so I can move quickly! The try/except is there in case I need to make manual changes and want to load the existing endpoint instead of creating a new one.

In [8]:
from huggingface_hub import create_inference_endpoint
from huggingface_hub import get_inference_endpoint


def get_my_endpoint(name=None):
    if name is None:
        name = "prompt-order-experiment"
    namespace='HF-test-lab'
    try:
        endpoint = get_inference_endpoint(name, namespace=namespace)
    except Exception:  # No existing endpoint could be loaded, so create one
        # Custom Docker image details
        custom_image = {
            "health_route": "/health",
            "url": "ghcr.io/huggingface/text-generation-inference:sha-caff779",  # Needs to be >=2.4.2 to get ordering of json outputs
            "env": {
                "LORA_ADAPTERS": LORAS_STRING,
                "MAX_BATCH_PREFILL_TOKENS": str(20*INPUT_TOKENS),
                "MAX_INPUT_TOKENS": str(INPUT_TOKENS), 
                "MAX_TOTAL_TOKENS": str(INPUT_TOKENS + OUTPUT_TOKENS), 
                "DISABLE_CUSTOM_KERNELS": 'false',
                "MODEL_ID": "/repository"
            },
        }
        
        secrets = {
            "HF_TOKEN": get_token()
        }
        
        # Creating the inference endpoint
        endpoint = create_inference_endpoint(
            name=name,
            namespace=namespace,
            repository=BASE_MODEL,
            framework="pytorch",
            accelerator="gpu",
            instance_size="x1",
            instance_type="nvidia-l4",
            region="us-east-1",
            vendor="aws",
            min_replica=8,
            max_replica=8,
            task="text-generation",
            custom_image=custom_image,
            secrets=secrets
        )
            
    endpoint.wait()
    print("Your model is ready to use!")
    return endpoint
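
To make the token-related environment variables concrete, here is a small sketch that evaluates the same expressions with the values from the Config section. `PREFILL_MULTIPLIER` is a name introduced here for the literal `20`; everything else mirrors the `env` block above.

```python
INPUT_TOKENS = 300
OUTPUT_TOKENS = 1650
PREFILL_MULTIPLIER = 20  # hypothetical name for the literal 20 in the env block

# Same expressions as the endpoint's env block, evaluated on their own.
token_env = {
    "MAX_BATCH_PREFILL_TOKENS": str(PREFILL_MULTIPLIER * INPUT_TOKENS),
    "MAX_INPUT_TOKENS": str(INPUT_TOKENS),
    "MAX_TOTAL_TOKENS": str(INPUT_TOKENS + OUTPUT_TOKENS),
}

for name, value in token_env.items():
    print(f"{name}={value}")
```

The values are passed as strings because environment variables are always strings.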

In [9]:
%%time
endpoint = get_my_endpoint('prompt-order-experiment')

Your model is ready to use!
CPU times: user 8.29 ms, sys: 747 μs, total: 9.03 ms
Wall time: 98.6 ms


## Manual Evaluation
Since we haven't seen our models in use yet, it's a good time to check them out!

### Reasoning Final Answer
In both falcon and gpt3-5 we should see the **Reasoning** first and then the **Final Answer** in the prompt and the responses.

In [10]:
key = 'RFA-falcon'
user_prompt_RFA = df.iloc[0][experiments[key]['column']]
user_prompt_RFA

[{'content': 'Answer the Question and include your reasoning and the final answer in a json like: {"reasoning": <reasoning about the answer>, "final_answer": <letter corresponding to the answer>}.',
  'role': 'system'},
 {'content': 'Question: What are busses used for?\nAnswer Choices: (a) Protective shelter (b) Transporting humans (c) Help other species benefit (d) Transporting airplanes (e) A backbone (f) Communication (g) Safe operation (h) Safe driving',
  'role': 'user'}]

In [11]:
response = endpoint.client.chat_completion(
    messages=user_prompt_RFA,
    max_tokens=INPUT_TOKENS + OUTPUT_TOKENS,
    model=experiments[key]['lora'],
    response_format={"type": "json", "value": experiments[key]['pydantic'].schema()},
)
json.loads(response.choices[0].message.content)

{'reasoning': 'Busses are primarily designed to transport people from one location to another. They are a common mode of public transportation used by many for commuting, school, work, and other activities. None of the other choices directly relate to the main function of a bus.',
 'final_answer': 'b'}

In [12]:
key = 'RFA-gpt3-5'
response = endpoint.client.chat_completion(
    messages=user_prompt_RFA,
    max_tokens=INPUT_TOKENS + OUTPUT_TOKENS,
    model=experiments[key]['lora'],
    response_format={"type": "json", "value": experiments[key]['pydantic'].schema()},
)
json.loads(response.choices[0].message.content)

{'reasoning': "Busses are large vehicles designed to transport people from one place to another. They operate on fixed routes and schedules, offering a convenient mode of public transportation for many individuals. The choice of 'Transporting humans' best encapsulates the primary function of busses, as they are not intended for carrying other items or species, nor are they part of an airplane's structure.",
 'final_answer': 'b'}

### Final Answer Reasoning 
In both falcon and gpt3-5 we should see the **Final Answer** first and then the **Reasoning** in the prompt and the responses.

In [13]:
key = 'FAR-gpt3-5'
user_prompt_FAR = df.iloc[0][experiments[key]['column']]
user_prompt_FAR

[{'content': 'Answer the Question and include your Final Answer and the Reasoning in a json like: {"final_answer": <letter corresponding to the answer>, "reasoning": <reasoning about the answer>}.',
  'role': 'system'},
 {'content': 'Question: What are busses used for?\nAnswer Choices: (a) Protective shelter (b) Transporting humans (c) Help other species benefit (d) Transporting airplanes (e) A backbone (f) Communication (g) Safe operation (h) Safe driving',
  'role': 'user'}]

In [14]:
response = endpoint.client.chat_completion(
    messages=user_prompt_FAR,
    max_tokens=INPUT_TOKENS + OUTPUT_TOKENS,
    model=experiments[key]['lora'],
    response_format={"type": "json", "value": experiments[key]['pydantic'].schema()},
)
json.loads(response.choices[0].message.content)

{'final_answer': 'b',
 'reasoning': 'Buses are vehicles primarily used for transporting humans from one place to another. They provide a convenient and efficient way for people to travel together on public transit. The other options are not accurate representations of the main purpose of a bus. Protective shelter, help other species benefit, transport airplanes, backbone, communication, safe operation, and safe driving are not the primary functions of a bus.'}

In [15]:
key = 'FAR-falcon'
response = endpoint.client.chat_completion(
    messages=user_prompt_FAR,
    max_tokens=INPUT_TOKENS + OUTPUT_TOKENS,
    model=experiments[key]['lora'],
    response_format={"type": "json", "value": experiments[key]['pydantic'].schema()},
)
json.loads(response.choices[0].message.content)

{'final_answer': 'b',
 'reasoning': 'Busses are primarily used for transporting humans from one place to another, making option (b) the most accurate choice among the given answers.'}

### Final Answer 
Here we should just see the **Final Answer** and no **Reasoning**.

In [16]:
key = 'FA'
user_prompt_FA = df.iloc[0][experiments[key]['column']]
user_prompt_FA

[{'content': 'Answer the Question and include your Final Answer in a json like: {"final_answer": <letter corresponding to the answer>}.',
  'role': 'system'},
 {'content': 'Question: What are busses used for?\nAnswer Choices: (a) Protective shelter (b) Transporting humans (c) Help other species benefit (d) Transporting airplanes (e) A backbone (f) Communication (g) Safe operation (h) Safe driving',
  'role': 'user'}]

In [17]:
response = endpoint.client.chat_completion(
    messages=user_prompt_FA,
    max_tokens=INPUT_TOKENS + OUTPUT_TOKENS,
    model=experiments[key]['lora'],
    response_format={"type": "json", "value": experiments[key]['pydantic'].schema()},
)
json.loads(response.choices[0].message.content)


{'final_answer': 'b'}
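
The ordering expectations from the manual checks above could also be verified programmatically instead of by eye. The sketch below is one way to do that with the standard library; `EXPECTED_KEY_ORDER` and `keys_in_order` are hypothetical helpers, and the sample strings are shortened stand-ins for real responses.

```python
import json

# Hypothetical mapping from experiment layout to the expected key order.
EXPECTED_KEY_ORDER = {
    "RFA": ["reasoning", "final_answer"],
    "FAR": ["final_answer", "reasoning"],
    "FA": ["final_answer"],
}


def keys_in_order(raw_response: str, layout: str) -> bool:
    """True when the JSON object's keys appear exactly in the expected order."""
    parsed = json.loads(raw_response)  # dicts preserve the order keys were read in
    return list(parsed) == EXPECTED_KEY_ORDER[layout]


rfa = '{"reasoning": "Busses carry people.", "final_answer": "b"}'
far = '{"final_answer": "b", "reasoning": "Busses carry people."}'

print(keys_in_order(rfa, "RFA"))  # reasoning first, as expected
print(keys_in_order(far, "FAR"))  # final answer first, as expected
print(keys_in_order(far, "RFA"))  # wrong order for this layout
```

Applied to a whole response column, a check like this would flag any row where the generation order did not follow the schema.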

## Evaluation Loop
I set the batch prefill budget to 20x the input length and used 8 replicas, so I should have capacity for ~160 parallel requests. I'm only using 128, but it should still be pretty fast.
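
The capacity estimate above can be written out as a quick calculation. This is only a rough sketch: it assumes each replica can take about 20 full-length prompts at once (one reading of the 20x prefill budget), and `estimated_parallel_requests` is a hypothetical helper. The two limits checked are the semaphore value set in the evaluation cell and the figure quoted in the text.

```python
def estimated_parallel_requests(prefill_multiplier: int, replicas: int) -> int:
    """Rough ceiling on in-flight requests, assuming each replica can take
    `prefill_multiplier` full-length prompts at once (an assumption)."""
    return prefill_multiplier * replicas


capacity = estimated_parallel_requests(prefill_multiplier=20, replicas=8)
print(f"estimated capacity: {capacity}")

for limit in (50, 128):
    fits = limit <= capacity
    print(f"limit={limit} fits={fits} headroom={capacity - limit}")
```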

In [18]:
import nest_asyncio
import asyncio
from transformers import AutoTokenizer
from tqdm.auto import tqdm
import time

# Allow nested event loops in Jupyter
nest_asyncio.apply()

# Semaphore to limit concurrency
CONCURRENCY_LIMIT = 50 
MAX_NEW_TOKENS = INPUT_TOKENS + OUTPUT_TOKENS
semaphore = asyncio.Semaphore(CONCURRENCY_LIMIT)

# Progress bar
progress_bar = None  # Global to allow updates from within async functions

# Retry parameters
MAX_RETRIES = 3
BACKOFF_TIME = 2  # Time in seconds before retrying

# Function to send asynchronous requests to the endpoint with retries
async def fetch_response_async(async_client, prompt, lora_id, pydantic_model):
    retries = 0
    while retries < MAX_RETRIES:
        try:
            async with semaphore:  # Limit the number of concurrent requests
                response = await async_client.chat_completion(
                    messages=prompt,
                    max_tokens=MAX_NEW_TOKENS,
                    model=lora_id if lora_id else None,
                    response_format={"type": "json", "value": pydantic_model.schema()}
                )
                progress_bar.update(1)  # Update the progress bar when the request is complete
                return response.choices[0].message.content
        except Exception as e:
            retries += 1
            if retries >= MAX_RETRIES:
                raise e  # If we've exhausted retries, re-raise the error
            else:
                print(f"Error: {e}. Retrying... ({retries}/{MAX_RETRIES})")
                await asyncio.sleep(BACKOFF_TIME)  # Wait before retrying

# Function to process a single conversation type asynchronously
async def process_conversation_type(conversation_type, model_info, df, tokenizer, async_client):
    response_column = f"responses_{conversation_type.replace('-','_')}"
    responses = []  # Temporary list to hold responses for the current conversation type

    tasks = []
    for _, item in df.iterrows():
        prompt = item.get(model_info["column"])
        tasks.append(fetch_response_async(async_client, prompt, model_info["lora"], model_info["pydantic"]))

    # Wait for all tasks to complete
    responses = await asyncio.gather(*tasks)

    # If responses are strings, use them directly; otherwise, extract 'generated_text'
    try:
        df[response_column] = [resp["generated_text"] for resp in responses]
    except TypeError:  # Fallback in case responses are raw strings
        df[response_column] = responses

# Main function to handle all conversation types
async def main(df, models, tokenizer, async_client):
    global progress_bar
    total_requests = len(df) * len(models)  # Total number of requests across all conversation types
    progress_bar = tqdm(total=total_requests, desc="Processing requests")

    tasks = []
    for conversation_type, model_info in models.items():
        tasks.append(process_conversation_type(conversation_type, model_info, df, tokenizer, async_client))
    await asyncio.gather(*tasks)

    progress_bar.close()  # Close the progress bar when done

# Define parameters and run
await main(df, experiments, tokenizer, endpoint.async_client)

Processing requests:   0%|          | 0/10098 [00:00<?, ?it/s]

Error: 424, message='Failed Dependency', url='https://f3b2osxktsilpdce.us-east4.gcp.endpoints.huggingface.cloud/v1/chat/completions'. Retrying... (1/3)

It took `00:17:02`. Not bad! That should be around `$1.14` total at `$0.80/gpu/hr`.
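
The retry-plus-semaphore pattern used in the evaluation loop can be tried without an endpoint. The sketch below swaps in a hypothetical `FlakyClient` that fails the first call for each prompt, loosely imitating the transient `424` errors in the log above. The backoff is shortened so the demo finishes quickly, and none of these names come from `huggingface_hub`.

```python
import asyncio

MAX_RETRIES = 3
BACKOFF_TIME = 0.01  # shortened from the notebook's 2 seconds for the demo


class FlakyClient:
    """Hypothetical stand-in for the endpoint client: fails once per prompt."""

    def __init__(self):
        self.calls = {}

    async def complete(self, prompt: str) -> str:
        self.calls[prompt] = self.calls.get(prompt, 0) + 1
        if self.calls[prompt] == 1:
            raise RuntimeError("simulated 424 Failed Dependency")
        return f"answer for {prompt}"


async def fetch_with_retries(client, prompt, semaphore):
    for attempt in range(1, MAX_RETRIES + 1):
        try:
            async with semaphore:  # hold a slot only while the request is in flight
                return await client.complete(prompt)
        except RuntimeError:
            if attempt == MAX_RETRIES:
                raise  # retries exhausted, surface the error
            await asyncio.sleep(BACKOFF_TIME)  # wait outside the semaphore


async def run_all(prompts, limit=2):
    semaphore = asyncio.Semaphore(limit)  # created inside the running loop
    client = FlakyClient()
    results = await asyncio.gather(
        *(fetch_with_retries(client, p, semaphore) for p in prompts)
    )
    return results, client.calls


results, calls = asyncio.run(run_all(["q1", "q2", "q3"]))
print(results)
print(calls)
```

Inside a notebook, where an event loop is already running, `await run_all([...])` would replace the `asyncio.run` call. Sleeping outside the `async with` block means a request that is backing off does not occupy a concurrency slot.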

In [28]:
endpoint.pause()

InferenceEndpoint(name='prompt-order-experiment', namespace='HF-test-lab', repository='tiiuae/Falcon3-7B-Instruct', status='paused', url=None)

In [19]:
df_backup = df.copy()

In [20]:
df

Unnamed: 0,topic,question_text,answer_key,gpt3_5_reasoning,falcon_reasoning,answer_choices,user_prompt_RFA,conversation_RFA_gpt3_5,conversation_RFA_falcon,user_prompt_FAR,conversation_FAR_gpt3_5,conversation_FAR_falcon,user_prompt_FA,conversation_FA,responses_RFA_falcon,responses_FAR_falcon,responses_RFA_gpt3_5,responses_base,responses_FA,responses_FAR_gpt3_5
0,Transportation,What are busses used for?,b,a) Protective shelter: This option is incorrec...,(a) Protective shelter - \nErroneous. Busses a...,(a) Protective shelter (b) Transporting humans...,[{'content': 'Answer the Question and include ...,[{'content': 'Answer the Question and include ...,[{'content': 'Answer the Question and include ...,[{'content': 'Answer the Question and include ...,[{'content': 'Answer the Question and include ...,[{'content': 'Answer the Question and include ...,[{'content': 'Answer the Question and include ...,[{'content': 'Answer the Question and include ...,"{""reasoning"": ""Busses are primarily used for t...","{""final_answer"": ""b"", ""reasoning"": ""Busses are...","{""reasoning"": ""Busses are vehicles used primar...","{""final_answer"": ""b""}","{""final_answer"": ""b""}","{""final_answer"": ""b"", ""reasoning"": ""Busses are..."
1,Climate change,Which of the following does not contribute to ...,g,a) Nucleus of a cell: This option is not relat...,(a) Nucleus of a cell: This option is incorrec...,(a) Nucleus of a cell (b) Flying in a plane (c...,[{'content': 'Answer the Question and include ...,[{'content': 'Answer the Question and include ...,[{'content': 'Answer the Question and include ...,[{'content': 'Answer the Question and include ...,[{'content': 'Answer the Question and include ...,[{'content': 'Answer the Question and include ...,[{'content': 'Answer the Question and include ...,[{'content': 'Answer the Question and include ...,"{""reasoning"": ""Nucleus of a cell (a) does not ...","{""final_answer"": ""a"", ""reasoning"": ""The nucleu...","{""reasoning"": ""The question asks which of the ...","{""final_answer"": ""g""}","{""final_answer"": ""g""}","{""final_answer"": ""d"", ""reasoning"": ""The questi..."
2,Photography,What uses electrical energy converted from che...,b,a) Sunlight: Sunlight is a form of energy that...,(a) Sunlight: Sunlight is a form of energy tha...,(a) Sunlight (b) Cameras (c) Cells (d) Buses (...,[{'content': 'Answer the Question and include ...,[{'content': 'Answer the Question and include ...,[{'content': 'Answer the Question and include ...,[{'content': 'Answer the Question and include ...,[{'content': 'Answer the Question and include ...,[{'content': 'Answer the Question and include ...,[{'content': 'Answer the Question and include ...,[{'content': 'Answer the Question and include ...,"{""reasoning"": ""Cells convert chemical energy s...","{""final_answer"": ""c"", ""reasoning"": ""Cells, spe...","{""reasoning"": ""Cells use electrical energy for...","{""final_answer"": ""c""}","{""final_answer"": ""f""}","{""final_answer"": ""f"", ""reasoning"": ""Cars use e..."
3,Microbiology,Bacteria causes what to be harmed?,a,"Now, let's go through each option and explain ...",1. **Plants (a) - Correct Answer:**\n - Bact...,(a) Plants (b) Electronics (c) Fossils (d) Hum...,[{'content': 'Answer the Question and include ...,[{'content': 'Answer the Question and include ...,[{'content': 'Answer the Question and include ...,[{'content': 'Answer the Question and include ...,[{'content': 'Answer the Question and include ...,[{'content': 'Answer the Question and include ...,[{'content': 'Answer the Question and include ...,[{'content': 'Answer the Question and include ...,"{""reasoning"": ""Bacteria can cause harm to vari...","{""final_answer"": ""d"", ""reasoning"": ""Bacteria c...","{""reasoning"": ""Bacteria can harm various livin...","{""final_answer"": ""d""}","{""final_answer"": ""d""}","{""final_answer"": ""d"", ""reasoning"": ""The questi..."
4,Biology,Plants and snakes live _.?,a,b) Important habitats: This option is incorrec...,**Answer: (a) Almost everywhere**\n\n**Explana...,(a) Almost everywhere (b) Important habitats (...,[{'content': 'Answer the Question and include ...,[{'content': 'Answer the Question and include ...,[{'content': 'Answer the Question and include ...,[{'content': 'Answer the Question and include ...,[{'content': 'Answer the Question and include ...,[{'content': 'Answer the Question and include ...,[{'content': 'Answer the Question and include ...,[{'content': 'Answer the Question and include ...,"{""reasoning"": ""Plants and snakes, as different...","{""final_answer"": ""g"", ""reasoning"": ""The correc...","{""reasoning"": ""Plants and snakes are both comm...","{""final_answer"": ""g""}","{""final_answer"": ""a""}","{""final_answer"": ""f"", ""reasoning"": ""Plants and..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1678,Biology,New resources required for creation can be red...,g,a) Mining: Mining involves extracting minerals...,(a) Mining: Mining is the process of extractin...,(a) Mining (b) Mutations (c) Fossil fuels (d) ...,[{'content': 'Answer the Question and include ...,[{'content': 'Answer the Question and include ...,[{'content': 'Answer the Question and include ...,[{'content': 'Answer the Question and include ...,[{'content': 'Answer the Question and include ...,[{'content': 'Answer the Question and include ...,[{'content': 'Answer the Question and include ...,[{'content': 'Answer the Question and include ...,"{""reasoning"": ""Recycling (g) is the option tha...","{""final_answer"": ""g"", ""reasoning"": ""Recycling ...","{""reasoning"": ""New resources required for crea...","{""final_answer"": ""g""}","{""final_answer"": ""g""}","{""final_answer"": ""g"", ""reasoning"": ""The correc..."
1679,Biology,A drought dehydrates an entire what?,d,a) Body water: This option is incorrect becaus...,The correct answer is (d) Environment. \n\nNow...,(a) Body water (b) Dried fruit (c) Bodily wate...,[{'content': 'Answer the Question and include ...,[{'content': 'Answer the Question and include ...,[{'content': 'Answer the Question and include ...,[{'content': 'Answer the Question and include ...,[{'content': 'Answer the Question and include ...,[{'content': 'Answer the Question and include ...,[{'content': 'Answer the Question and include ...,[{'content': 'Answer the Question and include ...,"{""reasoning"": ""A drought is a period of abnorm...","{""final_answer"": ""d"", ""reasoning"": ""A drought ...","{""reasoning"": ""Drought is a long-term lack of ...","{""final_answer"": ""d""}","{""final_answer"": ""d""}","{""final_answer"": ""d"", ""reasoning"": ""A drought ..."
1680,Biology,An animal requires ingestion to do what?,e,a) Aerobic capacity: This option is not logica...,(a) Aerobic capacity: This refers to an animal...,(a) Aerobic capacity (b) Die (c) Water conserv...,[{'content': 'Answer the Question and include ...,[{'content': 'Answer the Question and include ...,[{'content': 'Answer the Question and include ...,[{'content': 'Answer the Question and include ...,[{'content': 'Answer the Question and include ...,[{'content': 'Answer the Question and include ...,[{'content': 'Answer the Question and include ...,[{'content': 'Answer the Question and include ...,"{""reasoning"": ""Ingestion is the process of tak...","{""final_answer"": ""e"", ""reasoning"": ""Ingestion ...","{""reasoning"": ""Ingestion is the process by whi...","{""final_answer"": ""e""}","{""final_answer"": ""c""}","{""final_answer"": ""e"", ""reasoning"": ""Animals re..."
1681,Biology,Ultraviolet light can cause what?,b,a) Ultraviolet light does not cause heat energ...,Let's examine each option and determine why so...,(a) Heat energy (b) Skin cancer (c) Killing in...,[{'content': 'Answer the Question and include ...,[{'content': 'Answer the Question and include ...,[{'content': 'Answer the Question and include ...,[{'content': 'Answer the Question and include ...,[{'content': 'Answer the Question and include ...,[{'content': 'Answer the Question and include ...,[{'content': 'Answer the Question and include ...,[{'content': 'Answer the Question and include ...,"{""reasoning"": ""Ultraviolet light is known to h...","{""final_answer"": ""b"", ""reasoning"": ""Ultraviole...","{""reasoning"": ""Ultraviolet (UV) light is a typ...","{""final_answer"": ""b""}","{""final_answer"": ""b""}","{""final_answer"": ""b"", ""reasoning"": ""Ultraviole..."


In [21]:
def extract_final_answer(response):
    # Fall back to the sentinel 'x' when the response cannot be parsed as a JSON object
    try:
        answer = json.loads(response).get("final_answer")
    except Exception:
        answer = 'x'
    return answer

# Create new columns for predictions
df['predictions_base'] = df['responses_base'].apply(extract_final_answer)
df['predictions_FA'] = df['responses_FA'].apply(extract_final_answer)
df['predictions_RFA_falcon'] = df['responses_RFA_falcon'].apply(extract_final_answer)
df['predictions_FAR_falcon'] = df['responses_FAR_falcon'].apply(extract_final_answer)
df['predictions_RFA_gpt3_5'] = df['responses_RFA_gpt3_5'].apply(extract_final_answer)
df['predictions_FAR_gpt3_5'] = df['responses_FAR_gpt3_5'].apply(extract_final_answer)
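
To see how the fallback behaves on imperfect responses, here is a standalone sketch of the same helper run on a few hand-written inputs. It names the specific exceptions it expects rather than catching everything, and the sample strings are illustrative, not taken from the dataset.

```python
import json


def extract_final_answer(response):
    """Mirror of the notebook helper: return the letter, or 'x' if unparsable."""
    try:
        return json.loads(response).get("final_answer")
    except (json.JSONDecodeError, TypeError, AttributeError):
        return "x"


samples = {
    "well formed": '{"final_answer": "b"}',
    "truncated": '{"final_answer": "b", "reasoning": "cut off',
    "not an object": '["b"]',
    "missing value": None,
    "key absent": '{"reasoning": "no answer given"}',
}

for label, raw in samples.items():
    print(f"{label}: {extract_final_answer(raw)!r}")
```

Note the last case: valid JSON without a `final_answer` key yields `None` rather than `'x'`, so a check that only counts `'x'` would not catch it.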


In [22]:
prediction_cols = ['predictions_base',
 'predictions_FA',
 'predictions_RFA_falcon',
 'predictions_FAR_falcon',
 'predictions_RFA_gpt3_5',
 'predictions_FAR_gpt3_5']
percentages = {
    col: (df[col] == 'x').mean() * 100
    for col in prediction_cols
}
percentages

{'predictions_base': 0.0,
 'predictions_FA': 0.0,
 'predictions_RFA_falcon': 0.0,
 'predictions_FAR_falcon': 0.0,
 'predictions_RFA_gpt3_5': 0.0,
 'predictions_FAR_gpt3_5': 0.0}

In [25]:
from sklearn.metrics import accuracy_score

print(f"Base: \t\t\t\t\t\t{round(accuracy_score(y_true=df['answer_key'], y_pred=df['predictions_base']) * 100, 2)}%")
print(f"Final Answer: \t\t\t\t\t{round(accuracy_score(y_true=df['answer_key'], y_pred=df['predictions_FA']) * 100, 2)}%")
print(f"Reasoning and then the Final Answer (Falcon): \t{round(accuracy_score(y_true=df['answer_key'], y_pred=df['predictions_RFA_falcon']) * 100, 2)}%")
print(f"Final Answer and then the Reasoning (Falcon): \t{round(accuracy_score(y_true=df['answer_key'], y_pred=df['predictions_FAR_falcon']) * 100, 2)}%")
print(f"Reasoning and then the Final Answer (GPT-3.5): \t{round(accuracy_score(y_true=df['answer_key'], y_pred=df['predictions_RFA_gpt3_5']) * 100, 2)}%")
print(f"Final Answer and then the Reasoning (GPT-3.5): \t{round(accuracy_score(y_true=df['answer_key'], y_pred=df['predictions_FAR_gpt3_5']) * 100, 2)}%")

Base: 						55.02%
Final Answer: 					53.71%
Reasoning and then the Final Answer (Falcon): 	56.98%
Final Answer and then the Reasoning (Falcon): 	54.37%
Reasoning and then the Final Answer (GPT-3.5): 	57.52%
Final Answer and then the Reasoning (GPT-3.5): 	56.21%
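
For this task, accuracy is simply the share of rows where the predicted letter matches the answer key. The sketch below spells that out in plain Python with a hypothetical `accuracy` helper, using the first five rows shown in the DataFrame output (the `answer_key` and `predictions_base` columns) as a tiny worked example.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the label."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and the same length")
    matches = sum(truth == pred for truth, pred in zip(y_true, y_pred))
    return matches / len(y_true)


# First five rows of the displayed frame: answer_key vs. predictions_base.
answer_key = ["b", "g", "b", "a", "a"]
predictions_base = ["b", "g", "c", "d", "g"]

score = accuracy(answer_key, predictions_base)
print(f"{round(score * 100, 2)}%")
```

Five rows are far too few to say anything about a model; the sample only shows how the percentage is computed.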


In [26]:
df

Unnamed: 0,topic,question_text,answer_key,gpt3_5_reasoning,falcon_reasoning,answer_choices,user_prompt_RFA,conversation_RFA_gpt3_5,conversation_RFA_falcon,user_prompt_FAR,...,responses_RFA_gpt3_5,responses_base,responses_FA,responses_FAR_gpt3_5,predictions_base,predictions_FA,predictions_RFA_falcon,predictions_FAR_falcon,predictions_RFA_gpt3_5,predictions_FAR_gpt3_5
0,Transportation,What are busses used for?,b,a) Protective shelter: This option is incorrec...,(a) Protective shelter - \nErroneous. Busses a...,(a) Protective shelter (b) Transporting humans...,[{'content': 'Answer the Question and include ...,[{'content': 'Answer the Question and include ...,[{'content': 'Answer the Question and include ...,[{'content': 'Answer the Question and include ...,...,"{""reasoning"": ""Busses are vehicles used primar...","{""final_answer"": ""b""}","{""final_answer"": ""b""}","{""final_answer"": ""b"", ""reasoning"": ""Busses are...",b,b,b,b,b,b
1,Climate change,Which of the following does not contribute to ...,g,a) Nucleus of a cell: This option is not relat...,(a) Nucleus of a cell: This option is incorrec...,(a) Nucleus of a cell (b) Flying in a plane (c...,[{'content': 'Answer the Question and include ...,[{'content': 'Answer the Question and include ...,[{'content': 'Answer the Question and include ...,[{'content': 'Answer the Question and include ...,...,"{""reasoning"": ""The question asks which of the ...","{""final_answer"": ""g""}","{""final_answer"": ""g""}","{""final_answer"": ""d"", ""reasoning"": ""The questi...",g,g,a,a,g,d
2,Photography,What uses electrical energy converted from che...,b,a) Sunlight: Sunlight is a form of energy that...,(a) Sunlight: Sunlight is a form of energy tha...,(a) Sunlight (b) Cameras (c) Cells (d) Buses (...,[{'content': 'Answer the Question and include ...,[{'content': 'Answer the Question and include ...,[{'content': 'Answer the Question and include ...,[{'content': 'Answer the Question and include ...,...,"{""reasoning"": ""Cells use electrical energy for...","{""final_answer"": ""c""}","{""final_answer"": ""f""}","{""final_answer"": ""f"", ""reasoning"": ""Cars use e...",c,f,c,c,c,f
3,Microbiology,Bacteria causes what to be harmed?,a,"Now, let's go through each option and explain ...",1. **Plants (a) - Correct Answer:**\n - Bact...,(a) Plants (b) Electronics (c) Fossils (d) Hum...,[{'content': 'Answer the Question and include ...,[{'content': 'Answer the Question and include ...,[{'content': 'Answer the Question and include ...,[{'content': 'Answer the Question and include ...,...,"{""reasoning"": ""Bacteria can harm various livin...","{""final_answer"": ""d""}","{""final_answer"": ""d""}","{""final_answer"": ""d"", ""reasoning"": ""The questi...",d,d,d,d,d,d
4,Biology,Plants and snakes live _.?,a,b) Important habitats: This option is incorrec...,**Answer: (a) Almost everywhere**\n\n**Explana...,(a) Almost everywhere (b) Important habitats (...,[{'content': 'Answer the Question and include ...,[{'content': 'Answer the Question and include ...,[{'content': 'Answer the Question and include ...,[{'content': 'Answer the Question and include ...,...,"{""reasoning"": ""Plants and snakes are both comm...","{""final_answer"": ""g""}","{""final_answer"": ""a""}","{""final_answer"": ""f"", ""reasoning"": ""Plants and...",g,a,a,g,a,f
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1678,Biology,New resources required for creation can be red...,g,a) Mining: Mining involves extracting minerals...,(a) Mining: Mining is the process of extractin...,(a) Mining (b) Mutations (c) Fossil fuels (d) ...,[{'content': 'Answer the Question and include ...,[{'content': 'Answer the Question and include ...,[{'content': 'Answer the Question and include ...,[{'content': 'Answer the Question and include ...,...,"{""reasoning"": ""New resources required for crea...","{""final_answer"": ""g""}","{""final_answer"": ""g""}","{""final_answer"": ""g"", ""reasoning"": ""The correc...",g,g,g,g,g,g
1679,Biology,A drought dehydrates an entire what?,d,a) Body water: This option is incorrect becaus...,The correct answer is (d) Environment. \n\nNow...,(a) Body water (b) Dried fruit (c) Bodily wate...,[{'content': 'Answer the Question and include ...,[{'content': 'Answer the Question and include ...,[{'content': 'Answer the Question and include ...,[{'content': 'Answer the Question and include ...,...,"{""reasoning"": ""Drought is a long-term lack of ...","{""final_answer"": ""d""}","{""final_answer"": ""d""}","{""final_answer"": ""d"", ""reasoning"": ""A drought ...",d,d,d,d,d,d
1680,Biology,An animal requires ingestion to do what?,e,a) Aerobic capacity: This option is not logica...,(a) Aerobic capacity: This refers to an animal...,(a) Aerobic capacity (b) Die (c) Water conserv...,[{'content': 'Answer the Question and include ...,[{'content': 'Answer the Question and include ...,[{'content': 'Answer the Question and include ...,[{'content': 'Answer the Question and include ...,...,"{""reasoning"": ""Ingestion is the process by whi...","{""final_answer"": ""e""}","{""final_answer"": ""c""}","{""final_answer"": ""e"", ""reasoning"": ""Animals re...",e,c,e,e,e,e
1681,Biology,Ultraviolet light can cause what?,b,a) Ultraviolet light does not cause heat energ...,Let's examine each option and determine why so...,(a) Heat energy (b) Skin cancer (c) Killing in...,[{'content': 'Answer the Question and include ...,[{'content': 'Answer the Question and include ...,[{'content': 'Answer the Question and include ...,[{'content': 'Answer the Question and include ...,...,"{""reasoning"": ""Ultraviolet (UV) light is a typ...","{""final_answer"": ""b""}","{""final_answer"": ""b""}","{""final_answer"": ""b"", ""reasoning"": ""Ultraviole...",b,b,g,b,f,b


In [27]:
from datasets import Dataset, DatasetDict

# Create dataset from df
df.reset_index(drop=True, inplace=True)
dataset = Dataset.from_pandas(df)

# Push dataset to the hub
dataset.push_to_hub(OUTPUT_DATASET)

Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/2 [00:00<?, ?ba/s]

CommitInfo(commit_url='https://huggingface.co/datasets/derek-thomas/labeled-multiple-choice-explained-falcon-results/commit/c457e5db8fd2474a04c1338e5f03a0410e1dd10b', commit_message='Upload dataset', commit_description='', oid='c457e5db8fd2474a04c1338e5f03a0410e1dd10b', pr_url=None, repo_url=RepoUrl('https://huggingface.co/datasets/derek-thomas/labeled-multiple-choice-explained-falcon-results', endpoint='https://huggingface.co', repo_type='dataset', repo_id='derek-thomas/labeled-multiple-choice-explained-falcon-results'), pr_revision=None, pr_num=None)