---
language:
- en
tags:
- testing
- llm
- rp
- discussion
---

# Why? What? TL;DR?

Simply put, I'm making my methodology for evaluating RP models public. While none of this is very scientific, it is consistent. I'm focusing on things I'm *personally* looking for in a model, like its ability to obey a character card and a system prompt accurately. Still, I think most of my tests are universal enough that other people might be interested in the results, or might want to run those tests themselves.


# Testing Environment

- All models are loaded in Q8_0 (GGUF) with all layers on the GPU (NVIDIA RTX 3060 12GB).
- Backend is the latest version of KoboldCPP for Windows using CUDA 12.
- Using **CuBLAS** but **not using QuantMatMul (mmq)**.
- All models are extended to **16K context length** (auto rope from KCPP) with **Flash Attention** and **ContextShift** enabled.
- Frontend is the staging version of SillyTavern.
- Response size set to 1024 tokens max. 
- Fixed Seed for all tests: **123**
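
For anyone who wants to reproduce these settings from a script rather than through the frontend, here is a minimal sketch of a request body carrying the values listed above. The field names (`max_context_length`, `max_length`, `sampler_seed`) reflect my understanding of KoboldCpp's generate API and are assumptions to check against its documentation. The helper name `build_generate_payload` is hypothetical, and nothing here is sent over the network.

```python
import json

# Values taken from the environment list above.
CONTEXT_LENGTH = 16 * 1024   # 16K context
MAX_RESPONSE_TOKENS = 1024   # response size cap
FIXED_SEED = 123             # same seed for every test


def build_generate_payload(prompt: str) -> dict:
    """Build a request body for a text-generation call.

    Field names are assumed to match KoboldCpp's generate endpoint;
    verify them against the API docs of the version you run.
    """
    return {
        "prompt": prompt,
        "max_context_length": CONTEXT_LENGTH,
        "max_length": MAX_RESPONSE_TOKENS,
        "sampler_seed": FIXED_SEED,
    }


if __name__ == "__main__":
    # Print the JSON that would be posted to the backend.
    print(json.dumps(build_generate_payload("Hello, Rex!"), indent=2))
```

Keeping the seed and length limits in one place makes it harder to accidentally compare runs that used different settings.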


# System Prompt and Instruct Format

- The exact system prompt and instruct format files can be found in the [file repository](https://huggingface.co/SerialKicked/ModelTestingBed/).
- All models are tested in the instruct format they are meant to be used with (as long as it's ChatML or Llama 3 Instruct).


# Available Tests

### DoggoEval

The goal of this test, featuring Rex (a dog) and his master (EsKa), is to determine if a model is good at obeying a system prompt and character card. The trick is that dogs can't talk, but LLMs love to.

- [Results and discussions are hosted in this thread](https://huggingface.co/LWDCLS/LLM-Discussions/discussions/13)
- [Files, cards and settings can be found here](https://huggingface.co/SerialKicked/ModelTestingBed/tree/main/DoggoEval)
- TODO: Charts and screenshots
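
As a rough illustration of the "dogs can't talk" idea, here is a minimal sketch that flags quoted human speech in a reply. This is not the scoring method used for the results above, which are judged by reading the outputs. The `DOG_SOUNDS` whitelist and the `spoken_words` helper are hypothetical, and the heuristic only looks at text inside straight double quotes.

```python
import re

# Hypothetical whitelist of dog noises; not taken from the DoggoEval files.
DOG_SOUNDS = {"woof", "arf", "bark", "grr", "awoo", "ruff"}

# Matches anything between a pair of straight double quotes.
QUOTED = re.compile(r'"([^"]*)"')


def spoken_words(reply: str) -> list[str]:
    """Return quoted words that are not recognised dog noises."""
    offending = []
    for quote in QUOTED.findall(reply):
        for word in re.findall(r"[a-z']+", quote.lower()):
            if word not in DOG_SOUNDS:
                offending.append(word)
    return offending


if __name__ == "__main__":
    good = '*Rex wags his tail.* "Woof! Woof!"'
    bad = 'Rex tilts his head. "I am hungry, master."'
    print(spoken_words(good))
    print(spoken_words(bad))
```

A check like this would miss speech written without quotes or with curly quotes, so at best it is a first-pass filter before reading the outputs by hand.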

### MinotaurEval

TODO: The goal of this test is to check if a model is capable of following a very specific prompting method and maintaining situational awareness in the smallest labyrinth in the world.

- Discussions will be hosted here.
- Files and cards will be available soon (tm).


# Limitations 

I'm testing for things I'm interested in. Do not ask for ERP-specific tests. I do not pretend any of this is very scientific or accurate: as much as I try to reduce the number of variables, a small LLM is still a small LLM at the end of the day. Other seeds, or even the smallest change in settings, are bound to give very different results.

I usually give the different models I'm testing a fair shake in a more casual setting. I regenerate tons of outputs with random seeds, and while there are (large) variations, they tend to even out to the results shown in testing. When they don't, I'll make a note of it.