---
library_name: transformers
tags:
- lecture
- college
- university
- summarization
license: mit
language:
- en
metrics:
- rouge
pipeline_tag: summarization
---

# Model Card for Academ

<!-- Provide a quick summary of what the model is/does. -->
Academ is a fine-tuned BART model for summarizing academic lectures. 

To find out how the model was fine-tuned, you can check the notebook on Kaggle: https://www.kaggle.com/code/yousefr/college-lectures-summarization-bart-unsupervised/

## Model Details

### Model Description

<!-- Provide a longer summary of what this model is. -->

This is the model card of a 🤗 transformers model that has been pushed on the Hub. This model card has been automatically generated.

- **Developed by:** Yousef Gamaleldin
- **Model type:** Summarization
- **Language(s) (NLP):** English
- **License:** MIT
- **Finetuned from model:** [facebook/bart-large-cnn](https://huggingface.co/facebook/bart-large-cnn) (BART Large CNN)

## How to Get Started with the Model

Use the code below to get started with the model.

```python
import torch
from transformers import BartForConditionalGeneration, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"

model = BartForConditionalGeneration.from_pretrained('yousefg/Academ-0.5').to(device)
tokenizer = AutoTokenizer.from_pretrained('facebook/bart-large-cnn')

def get_summary(input_ids, attention_mask, context_length):
    # Summarize the lecture in windows of `context_length` tokens,
    # then decode the concatenated summary token ids into one text.
    summary_ids = []
    for i in range(0, input_ids.shape[1], context_length):
        input_slice = input_ids[:, i:i + context_length]
        attention_mask_slice = attention_mask[:, i:i + context_length]

        summary = model.generate(
            input_slice,
            attention_mask=attention_mask_slice,
            max_new_tokens=1654,
            min_new_tokens=250,
            do_sample=True,
            renormalize_logits=True,
        )
        summary_ids.extend(summary[0].tolist())

    return tokenizer.decode(summary_ids, skip_special_tokens=True)

texts = "..."  # the lecture transcript to summarize
batch = tokenizer(texts, truncation=False)

input_ids = torch.tensor(batch['input_ids']).unsqueeze(0).to(device)
attention_mask = torch.tensor(batch['attention_mask']).unsqueeze(0).to(device)

summary = get_summary(input_ids, attention_mask, 1654)
print(summary)
```
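
Because a full lecture transcript is usually longer than the model's 1,654-token context window, `get_summary` splits the tokenized input into windows of `context_length` tokens, summarizes each window separately, and concatenates the per-window summary tokens before decoding them into a single text.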

## Training Details

Training used a custom loss function that steers the model toward an optimal summary length, with 35% chosen as the target.
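
The exact form of that loss is not documented here; the sketch below is only an illustration of one plausible implementation, pairing the standard cross-entropy term with a penalty on how far the summary length drifts from the 35% target. The function name, penalty weight, and length inputs are assumptions for illustration, not the loss actually used for Academ.

```python
# Illustrative only: one plausible length-targeted loss, not the one used to train Academ.
import torch
import torch.nn.functional as F

def length_targeted_loss(logits, labels, input_lengths, summary_lengths,
                         target_ratio=0.35, length_weight=0.1, pad_token_id=1):
    # Standard token-level cross-entropy, ignoring padding positions (BART pad id is 1).
    ce = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        labels.reshape(-1),
        ignore_index=pad_token_id,
    )
    # Penalize deviation of the summary/input length ratio from the 35% target.
    ratio = summary_lengths.float() / input_lengths.float()
    length_penalty = ((ratio - target_ratio) ** 2).mean()
    return ce + length_weight * length_penalty
```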

#### Training Hyperparameters

- **Training regime:** bf16 non-mixed precision
- **Learning Rate:** 0.001
- **Weight Decay:** 0.01
- **Epochs:** 4
- **Batch Size:** 16

## Evaluation

<!-- This section describes the evaluation protocols and provides the results. -->
Evaluation is based on ROUGE-1, modified to discount padding tokens.
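
As a minimal sketch (assuming predictions and references are available as token-id tensors; the helper name is hypothetical), padding can be kept out of the score by decoding with `skip_special_tokens=True` before computing ROUGE-1 with the `evaluate` library; the exact protocol used for Academ may differ.

```python
# Minimal sketch: ROUGE-1 that ignores padding by dropping special tokens before scoring.
import evaluate

rouge = evaluate.load("rouge")

def rouge1_ignoring_padding(pred_token_ids, ref_token_ids, tokenizer):
    # skip_special_tokens=True removes <pad> (and other special tokens),
    # so they do not contribute to the ROUGE-1 overlap.
    preds = tokenizer.batch_decode(pred_token_ids, skip_special_tokens=True)
    refs = tokenizer.batch_decode(ref_token_ids, skip_special_tokens=True)
    return rouge.compute(predictions=preds, references=refs)["rouge1"]
```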

#### Testing Data

The test set contained 289 lectures, mainly from MIT OpenCourseWare.
<!-- This should link to a Dataset Card if possible. -->

### Results

The model achieved a ROUGE-1 score of 96% on the test set and 93% on the evaluation set.

#### Summary
Academ is a summarization model trained on 2,307 lectures, mainly from MIT OpenCourseWare. The model has a maximum sequence length of 1,654 tokens, 630 more than the original BART Large CNN (1,024).