---
base_model:
- meta-llama/Meta-Llama-3-8B-Instruct
pipeline_tag: summarization
---
<div align="center">
  <b style="font-size: 40px;">SummLlama3-8B</b>
</div>

Are you looking for a summarizer that can generate more **human-preferred summaries** across multiple domains?

Our **SummLlama3-8B** could be exactly what you need!

SummLlama3 is initialized from Llama3-8B-Instruct, with additional training using Direct Preference Optimization (DPO) based on large-scale (over 100K) summarization feedback. 

The feedback covers a wide range of input documents, from short to lengthy texts, in both dialogue and non-dialogue formats, and spans seven distinct domains:

- Four non-dialogue domains: News, Lifestyle, Report, Medical
- Three dialogue domains: Daily Life, Interview, Meeting
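
For readers unfamiliar with DPO, the per-example objective can be sketched as below. This is the standard DPO loss written in plain Python, not the authors' training code; the function name `dpo_loss`, the value `beta=0.1`, and the log-probabilities in the usage lines are illustrative assumptions.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) summary pair.

    Each argument is the summed log-probability of a summary under the
    policy being trained or under the frozen reference model.
    """
    # Implicit rewards: how much the policy has moved away from the reference.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)), written as log(1 + exp(-margin)).
    return math.log1p(math.exp(-margin))


# Illustrative numbers only: the loss falls once the policy prefers the chosen summary.
loss_at_start = dpo_loss(-10.0, -12.0, -10.0, -12.0)   # policy == reference
loss_after_update = dpo_loss(-8.0, -14.0, -10.0, -12.0)
```

When the policy equals the reference model the margin is zero, so the loss starts at log 2 and decreases as the preferred summary becomes relatively more likely.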

Surprisingly, it outperforms the nearly 10x larger **Llama3-70B-Instruct** and also **GPT-4o** on the average automated score, while offering much faster inference.

Automated evaluation results:

| **Config.**        | **Faithfulness** | **Completeness** | **Conciseness** | **Average** |
|--------------------|------------|-----------|-----------|----------|
| Llama3-8B-Instruct          | 0.864      | 0.583     | 0.450     | 0.632    |
| Llama3-70B-Instruct        | 0.931      | 0.596     | 0.487     | 0.671    |
| GPT-4o        | 0.940      | 0.657     | 0.437     | 0.678    |
| SummLlama3-8B  | 0.931  | 0.614 | 0.659 | 0.735 |
| SummLlama3-70B  | 0.950  | 0.632 | 0.754 | 0.779 |


Human evaluation results:

| **Config.**        | **Faithfulness** | **Completeness** | **Conciseness** | **Average** |
|--------------------|------------|-----------|-----------|----------|
| Llama3-8B-Instruct          | 0.902      | 0.636     | 0.784     | 0.774    |
| Llama3-70B-Instruct         | 0.953      | 0.659     | 0.792     | 0.801    |
| SummLlama3-8B  | 0.980  | 0.697 | 0.959 | 0.879 |

Please refer to [our paper](https://arxiv.org/abs/2410.13116) to learn how to exploit LLM-generated feedback in the context of text summarization.

Here are other versions:

**SummLlama3-70B**

https://huggingface.co/DISLab/SummLlama3-70B

**SummLlama3.1-Series**

https://huggingface.co/DISLab/SummLlama3.1-8B

https://huggingface.co/DISLab/SummLlama3.1-70B

**SummLlama3.2-Series**

https://huggingface.co/DISLab/SummLlama3.2-3B


### *Recommended Prompt for Text Summarization:*

We recommend using the prompt below to generate summaries, since the model was trained with it.

```
def format_chat_template(document):
    # `tokenizer` is assumed to be this model's tokenizer, loaded beforehand.
    # The instruction string is kept verbatim, since the model was trained with this prompt.
    instruction = "Please summarize the input documnet."
    row_json = [{"role": "user", "content": f"Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\n{instruction}\n\n### Input:\n{document}\n\n### Response:\n"}]
    return tokenizer.apply_chat_template(row_json, tokenize=False)
```
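
Here is a minimal end-to-end sketch of how the recommended prompt might be used with the Hugging Face `transformers` library. The model ID `DISLab/SummLlama3-8B`, the helper names `build_messages` and `summarize`, and the generation settings (`max_new_tokens=256`, greedy decoding) are assumptions for illustration, not settings confirmed by the authors. `device_map="auto"` additionally assumes the `accelerate` package is installed.

```python
def build_messages(document: str) -> list:
    """Wrap a document in the recommended prompt (wording kept verbatim)."""
    instruction = "Please summarize the input documnet."
    content = (
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request.\n\n"
        f"### Instruction:\n{instruction}\n\n"
        f"### Input:\n{document}\n\n"
        "### Response:\n"
    )
    return [{"role": "user", "content": content}]


def summarize(document: str,
              model_id: str = "DISLab/SummLlama3-8B",  # assumed repository name
              max_new_tokens: int = 256) -> str:
    # Imported lazily so build_messages can be used without these packages.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype="auto", device_map="auto"
    )

    # Same call as in the recommended prompt above. Depending on the chat
    # template, add_generation_prompt=True may be needed so that generation
    # starts at the assistant turn; the model card does not specify this.
    prompt = tokenizer.apply_chat_template(build_messages(document), tokenize=False)

    # The chat template already inserts special tokens, so do not add them again.
    inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).to(model.device)

    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    # Keep only the newly generated tokens, dropping the prompt.
    new_tokens = output_ids[0, inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True).strip()
```

Calling `summarize(text)` downloads the model weights on first use, so it is best run on a machine with a suitable GPU; `build_messages` alone can be used to inspect the exact prompt.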


---

Here is a brief overview of our summarizer:

Rather than relying on expensive human feedback, we utilize high-quality, multi-dimensional, and fine-grained feedback generated by large language models (LLMs).

This model excels at **faithfulness**, **completeness**, and **conciseness**, the three aspects humans rely on to judge a good summarizer.

- Faithfulness: a summarizer does not manipulate the information in the input text or add any information not directly inferable from the input text.
- Completeness: a summarizer ensures the inclusion of all key information from the input text in the output summary.
- Conciseness: a summarizer refrains from incorporating information outside the key information in the output, maintaining a succinct and focused summary.
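
One way to turn the three definitions above into numbers is sketched below, loosely following the ratio-style scores used by FineSurE-like evaluators. It assumes that per-sentence faithfulness judgments and key-fact alignments are already available (in practice these would come from an LLM or a human annotator); the function name `summary_scores` and its input format are hypothetical.

```python
def summary_scores(sentence_is_faithful, keyfact_to_sentences):
    """Compute (faithfulness, completeness, conciseness) for one summary.

    sentence_is_faithful: list of bools, one per summary sentence.
    keyfact_to_sentences: list of sets; for each key fact of the source
        document, the indices of summary sentences that cover it
        (an empty set means the key fact is missing from the summary).
    """
    n_sentences = len(sentence_is_faithful)
    n_keyfacts = len(keyfact_to_sentences)

    # Faithfulness: share of summary sentences without factual errors.
    faithfulness = sum(sentence_is_faithful) / n_sentences
    # Completeness: share of key facts covered by at least one sentence.
    completeness = sum(1 for s in keyfact_to_sentences if s) / n_keyfacts
    # Conciseness: share of summary sentences that carry at least one key fact.
    sentences_with_keyfact = set().union(*keyfact_to_sentences)
    conciseness = len(sentences_with_keyfact) / n_sentences
    return faithfulness, completeness, conciseness


# Toy example: 4 summary sentences, 5 key facts in the source document.
scores = summary_scores(
    [True, True, False, True],
    [{0}, {1}, set(), {1, 3}, set()],
)
```

In this toy case three of four sentences are faithful, three of five key facts are covered, and three of four sentences carry a key fact.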

Based on our comprehensive evaluation, which included both human and automated assessments of summary quality, SummLlama3 demonstrated significant improvements over the original Llama3 series.

Here are the results:

## Human Evaluation

| **Config.**        | **Faithfulness** | **Completeness** | **Conciseness** | **Average** |
|--------------------|------------|-----------|-----------|----------|
| Llama3-8B-Instruct          | 0.902      | 0.636     | 0.784     | 0.774    |
| Llama3-70B-Instruct         | 0.953      | 0.659     | 0.792     | 0.801    |
| SummLlama3-8B  | 0.980  | 0.697 | 0.959 | 0.879 |

## Automated Evaluation using [FineSurE](https://aclanthology.org/2024.acl-long.51.pdf)

| **Config.**        | **Faithfulness** | **Completeness** | **Conciseness** | **Average** |
|--------------------|------------|-----------|-----------|----------|
| Llama3-8B-Instruct          | 0.864      | 0.583     | 0.450     | 0.632    |
| Llama3-70B-Instruct        | 0.931      | 0.596     | 0.487     | 0.671    |
| SummLlama3-8B  | 0.931  | 0.614 | 0.659 | 0.735 |
| SummLlama3-70B  | 0.950  | 0.632 | 0.754 | 0.779 |
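
The **Average** column in the table above is consistent with the unweighted mean of the three per-aspect scores. The short check below recomputes it from the reported values; the averaging rule is inferred from the numbers rather than stated by the authors.

```python
# Per-aspect automated scores copied from the table above:
# (faithfulness, completeness, conciseness)
reported = {
    "Llama3-8B-Instruct": (0.864, 0.583, 0.450),
    "Llama3-70B-Instruct": (0.931, 0.596, 0.487),
    "SummLlama3-8B": (0.931, 0.614, 0.659),
    "SummLlama3-70B": (0.950, 0.632, 0.754),
}

# Unweighted mean of the three aspects, rounded to three decimals.
averages = {
    name: round(sum(scores) / len(scores), 3)
    for name, scores in reported.items()
}

for name, avg in averages.items():
    print(f"{name}: {avg:.3f}")
```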

## Example

The example below shows how SummLlama3-8B improves the summary over Llama3-8B/70B-Instruct for the following document:

| **Speaker** | **Dialogue** |
|-------------|--------------|
| **Person 1** | Hey, Paul, you're still having Thanksgiving dinner at my house on Thursday, right? |
| **Person 2** | Yeah, thanks again for the invitation. I was worried I'd have to spend it alone after my parents announced they'd be in Finland. |
| **Person 1** | I think it's strange that you're not going with them. Is it because of the cold weather in Northern Europe? |
| **Person 2** | Not exactly. I have a lot of work to do next week. |
| **Person 1** | Well, I'm glad you won't be alone on a holiday. It'll be great to have you celebrate with us. The more the merrier! |
| **Person 2** | Exactly, how many people will be there? |
| **Person 1** | Let's see, my parents, my grandpa, my sister, my brother-in-law, my niece and my two cousins, plus us. So, ten people altogether. |
| **Person 2** | Should I bring anything? |
| **Person 1** | Nope, we've got it covered. |
| **Person 2** | Come on, I hate to show up empty-handed. You are being so generous to have me. I know! I'll bring pie. I'll make a great pumpkin pie. |
| **Person 1** | Sounds delicious, but my sister has already volunteered for dessert. Look, don't worry about it. There will be tons of food. Turkey, stuffing, green beans, sweet potatoes. |
| **Person 2** | Wow, I'm getting hungry just thinking about it. How about this, I'll bring a nice bottle of wine. You can never have too much of that. |
| **Person 1** | Well, isn't that the truth? Sounds good to me. |


| Model | Summary |
|----------|-----------|
| Llama3-8B-Instruct        | Person1 invites Person2 to join them for Thanksgiving dinner at their house. Person2 accepts the invitation, citing that they were worried they would have to spend the holiday alone after their parents announced they would be in Finland. Person1 reassures Person2 that they are welcome and that the more the merrier. Person2 asks how many people will be attending and is told that there will be 10 people, including their family and cousins. Person2 offers to bring a gift, but Person1 declines, saying that they have everything covered. Person2 persists and offers to bring a bottle of wine or a pie, but Person1 suggests that their sister has already volunteered to bring dessert.    |
| Llama3-70B-Instruct        | Person 1 invites Person 2 to their house for Thanksgiving dinner on Thursday. Person 2 is grateful for the invitation and explains they can't go to Finland with their parents due to work commitments. Person 1 is happy to have them celebrate with their family, which will consist of 10 people. When Person 2 offers to bring something, Person 1 declines, but eventually agrees to let them bring a bottle of wine.  |
| SummLlama3-8B        | The conversation is about Thanksgiving dinner arrangements, with Person2 (Paul) confirming his attendance at Person1's (host's) house on Thursday, discussing what to bring and what's already been taken care of, and finalizing the offer to bring wine instead of pie since someone else is handling desserts.  |

The summary from SummLlama3-8B can be considered a much more human-preferred summary for the following reasons:

**Core Focus:** The summary accurately captures the main theme of the conversation, which revolves around the Thanksgiving dinner arrangements. It highlights how the two people confirm plans, discuss what to bring, and finalize the decision for Person2 to bring wine instead of pie. This maintains the core context.

**Inclusion of Key Facts:** The summary covers the important details of the conversation, including Person2's initial offer to bring dessert (pumpkin pie) and the shift to bringing wine because another family member is handling dessert. The other summaries tend to overlook or simplify this progression, while SummLlama3-8B fully captures the interaction’s key events.

**Clarity and Conciseness:** The summary is structured in a straightforward, concise manner, effectively summarizing the conversation without unnecessary details. It presents the flow and outcome of the discussion clearly, making it easy for readers to understand. The logical order of events is maintained, ensuring a smooth narrative.

**Accurate Role Depiction:** The summary clearly identifies Person1 as the host and Paul (Person2) as the guest, which helps clarify their relationship and the nature of the conversation. This distinction is more explicit in SummLlama3-8B compared to other summaries, which might leave these roles more ambiguous.