Overview

This model performs abstractive proposition segmentation for Korean, as described in the paper Scalable and Domain-General Abstractive Proposition Segmentation. It segments a text passage into atomic, self-contained units (atomic facts).

Training Details

  • Base Model: yanolja/EEVE-Korean-Instruct-10.8B-v1.0
  • Fine-tuning Method: LoRA
  • Dataset: RoSE
    • Translation: The dataset was translated into Korean using GPT-4o.
      • GPT-4o was prompted to translate the propositions using the vocabulary that appears in the passage, keeping each proposition lexically consistent with the text.
    • Data Split: The dataset was randomly split into training, validation, and test sets (1900:100:500) for fine-tuning; a minimal split sketch follows below.
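
A split in these proportions can be produced along the following lines. This is a sketch only: the seed and shuffling procedure are assumptions, not the released preprocessing script.

import random

def split_rose(examples, seed=42):
    # Hypothetical 1900:100:500 split; the actual seed and shuffle
    # used for fine-tuning are not published.
    examples = list(examples)
    random.Random(seed).shuffle(examples)
    return examples[:1900], examples[1900:2000], examples[2000:2500]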

Usage

Data Preprocessing

from konlpy.tag import Kkma

sent_start_token = "<sent>"
sent_end_token = "</sent>"
# Prompt used during fine-tuning; keep the wording verbatim at inference.
instruction = "I will provide a passage split into sentences by <s> and </s> markers. For each sentence, generate its list of propositions. Each proposition contains a single fact mentioned in the corresponding sentence written as briefly and clearly as possible.\n\n"

kkma = Kkma()

def get_input(text, tokenizer):
  # Split the passage into sentences with Kkma, wrap each sentence in
  # <sent>...</sent> markers, and build the chat-formatted prompt.
  sentences = kkma.sentences(text)
  prompt = (instruction + "Passage: " + sent_start_token
            + f"{sent_end_token}{sent_start_token}".join(sentences)
            + sent_end_token + "\nPropositions:\n")
  messages = [{"role": "system", "content": "You are a helpful assistant."},
              {"role": "user", "content": prompt}]
  input_text = tokenizer.apply_chat_template(
                      messages,
                      tokenize=False,
                      add_generation_prompt=True)
  return input_text

def get_output(text):
  # Parse the model response into one list of propositions per sentence.
  results = []
  group = []

  if text.startswith("Propositions:"):
    lines = text[len("Propositions:"):].strip().split("\n")
  else:
    lines = text.strip().split("\n")

  for line in lines:
    if line.strip() == sent_start_token:
      continue
    elif line.strip() == sent_end_token:
      # A closing marker ends the current sentence's group.
      results.append(group)
      group = []
    else:
      if not line.strip().startswith("-"):
        # Stop at the first line that is not a "- proposition" bullet.
        break
      # Drop the leading "-" robustly, even with leading whitespace.
      group.append(line.strip()[1:].strip())

  # Keep a trailing group if the model omits the final closing marker.
  if group:
    results.append(group)

  return results
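
As a quick sanity check, get_output can be exercised on a hand-written response in the marker format described above; the sample text is illustrative only.

sample = (
    "<sent>\n"
    "- The team scored a goal.\n"
    "- The match was on Tuesday.\n"
    "</sent>\n"
    "<sent>\n"
    "- The goal was the team's second.\n"
    "</sent>"
)
print(get_output(sample))
# [['The team scored a goal.', 'The match was on Tuesday.'],
#  ["The goal was the team's second."]]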

Loading Model and Tokenizer

import peft
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

LORA_PATH = "seonjeongh/Korean-Propositionalizer"

lora_config = peft.PeftConfig.from_pretrained(LORA_PATH)
base_model = AutoModelForCausalLM.from_pretrained(lora_config.base_model_name_or_path,
                                                  torch_dtype=torch.float16,
                                                  device_map="auto")
model = peft.PeftModel.from_pretrained(base_model, LORA_PATH)
# Merge the LoRA weights into the base model for adapter-free inference.
model = model.merge_and_unload(progressbar=True)
tokenizer = AutoTokenizer.from_pretrained(lora_config.base_model_name_or_path)
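
Merging folds the LoRA weights into the base model once, so generation runs without adapter overhead. If you want to skip this step on later runs, the merged weights can optionally be saved and reloaded as a plain model; the output path below is arbitrary.

# Optional: persist the merged model so later runs can load it directly
# with AutoModelForCausalLM.from_pretrained, skipping peft entirely.
model.save_pretrained("korean-propositionalizer-merged")
tokenizer.save_pretrained("korean-propositionalizer-merged")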

Inference Example

device = "cuda"

text = "์˜ฅ์Šคํฌ๋“œ๋Š” ํ™”์š”์ผ ๋งจ์ฒด์Šคํ„ฐ ์œ ๋‚˜์ดํ‹ฐ๋“œ์™€์˜ ๊ฒฝ๊ธฐ์—์„œ 3-2๋กœ ํŒจํ•œ ๊ฒฝ๊ธฐ์—์„œ 21์„ธ ์ดํ•˜ ํŒ€์œผ๋กœ ๋“์ ํ–ˆ๋‹ค. ๊ทธ ๊ณจ์€ 16์„ธ ์„ ์ˆ˜์˜ 1๊ตฐ ๋ฐ๋ท” ์ฃผ์žฅ์„ ๊ฐ•ํ™”ํ•  ๊ฒƒ์ด๋‹ค. ์„ผํ„ฐ๋ฐฑ์€ ์ด๋ฒˆ ์‹œ์ฆŒ ์›จ์ŠคํŠธํ–„ 1๊ตฐ๊ณผ ํ•จ๊ป˜ ํ›ˆ๋ จํ–ˆ๋‹ค. ์›จ์ŠคํŠธํ–„ ์œ ๋‚˜์ดํ‹ฐ๋“œ์˜ ์ตœ์‹  ๋‰ด์Šค๋Š” ์—ฌ๊ธฐ๋ฅผ ํด๋ฆญํ•˜์„ธ์š”."
inputs = tokenizer([get_input(text, tokenizer)], return_tensors='pt').to(device)
output = model.generate(**inputs,
                        max_new_tokens=512,
                        pad_token_id=tokenizer.pad_token_id,
                        eos_token_id=tokenizer.eos_token_id,
                        use_cache=True)
# Decode only the newly generated tokens, then parse them into groups.
response = tokenizer.batch_decode(output[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)[0]
results = get_output(response)
print(results)
Example output
[
   [
    "์˜ฅ์Šคํฌ๋“œ๋Š” 21์„ธ ์ดํ•˜ ํŒ€์œผ๋กœ ๋“์ ํ–ˆ๋‹ค.",
    "์˜ฅ์Šคํฌ๋“œ๋Š” ๋งจ์ฒด์Šคํ„ฐ ์œ ๋‚˜์ดํ‹ฐ๋“œ์™€์˜ ๊ฒฝ๊ธฐ์—์„œ 3-2๋กœ ํŒจํ–ˆ๋‹ค.",
    "์˜ฅ์Šคํฌ๋“œ๋Š” ํ™”์š”์ผ ๊ฒฝ๊ธฐ๋ฅผ ํ–ˆ๋‹ค.",
   ],
   [
    "๊ทธ ๊ณจ์€ 16์„ธ ์„ ์ˆ˜์˜ ์ฃผ์žฅ์„ ๊ฐ•ํ™”ํ•  ๊ฒƒ์ด๋‹ค.",
    "๊ทธ ๊ณจ์€ 16์„ธ ์„ ์ˆ˜์˜ 1 ๊ตฐ ๋ฐ๋ท” ์ฃผ์žฅ์„ ๊ฐ•ํ™”ํ•  ๊ฒƒ์ด๋‹ค.",
   ],
   [
    "์„ผํ„ฐ ๋ฐฑ์€ ์›จ์ŠคํŠธ ํ–„ 1 ๊ตฐ๊ณผ ํ•จ๊ป˜ ํ›ˆ๋ จํ–ˆ๋‹ค.",
    "์„ผํ„ฐ ๋ฐฑ์€ ์ด๋ฒˆ ์‹œ์ฆŒ ์›จ์ŠคํŠธ ํ–„ 1 ๊ตฐ๊ณผ ํ•จ๊ป˜ ํ›ˆ๋ จํ–ˆ๋‹ค.",
   ],
   [
    "์›จ์ŠคํŠธํ–„ ์œ ๋‚˜์ดํ‹ฐ๋“œ์˜ ์ตœ์‹  ๋‰ด์Šค๋Š” ์—ฌ๊ธฐ๋ฅผ ํด๋ฆญํ•˜์„ธ์š”."
   ]
]

Inputs and Outputs

  • Input: A Korean text passage.
  • Output: A list of propositions for every sentence in the passage, with the propositions for each sentence grouped separately.

Evaluation Results

  • Metrics: The reference-less and reference-based metrics proposed in Scalable and Domain-General Abstractive Proposition Segmentation.
  • Models:
    • Dynamic 10-shot models: For each test example, the 10 most similar examples were retrieved from the training set with BM25 and used as in-context demonstrations (see the sketch after this list).
    • Translate-test models: The google/gemma-7b-aps-it model, with KO→EN input translation and EN→KO output translation performed by GPT-4o or GPT-4o-mini.
    • Translate-train models: Small LLMs (sLLMs) LoRA fine-tuned on the Korean RoSE dataset.
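
The dynamic 10-shot retrieval can be approximated with the rank_bm25 package as sketched below; the whitespace tokenization and variable names are assumptions, since the exact indexing setup is not published.

from rank_bm25 import BM25Okapi

train_texts = ["..."]  # source passages from the Korean RoSE training set
bm25 = BM25Okapi([t.split() for t in train_texts])

def top10_demonstrations(query_text):
    # Return the 10 training passages most similar to the query,
    # to be formatted as in-context demonstrations.
    return bm25.get_top_n(query_text.split(), train_texts, n=10)

For Korean, a morphological tokenizer such as the Kkma instance used above would likely serve better than whitespace splitting.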

Reference-less metric

| Model | Precision | Recall | F1 |
|---|---|---|---|
| Gold | 97.46 | 96.28 | 95.88 |
| Dynamic 10-shot (Qwen/Qwen2.5-72B-Instruct) | 98.86 | 93.99 | 95.58 |
| Dynamic 10-shot (GPT-4o) | 97.61 | 97.00 | 96.87 |
| Dynamic 10-shot (GPT-4o-mini) | 98.51 | 97.12 | 97.17 |
| Translate-test (google/gemma-7b-aps-it, GPT-4o translation) | 97.38 | 96.93 | 96.52 |
| Translate-test (google/gemma-7b-aps-it, GPT-4o-mini translation) | 97.24 | 96.26 | 95.73 |
| Translate-train (Qwen/Qwen2.5-7B-Instruct) | 94.66 | 92.81 | 92.08 |
| Translate-train (LGAI-EXAONE/EXAONE-3.5-7.8B-Instruct) | 93.80 | 93.29 | 92.80 |
| Translate-train (yanolja/EEVE-Korean-Instruct-10.8B-v1.0, this model) | 97.41 | 96.02 | 95.93 |

Reference-based metric

| Model | Precision | Recall | F1 |
|---|---|---|---|
| Gold | 100 | 100 | 100 |
| Dynamic 10-shot (Qwen/Qwen2.5-72B-Instruct) | 48.49 | 40.27 | 42.99 |
| Dynamic 10-shot (GPT-4o) | 49.16 | 44.72 | 46.05 |
| Dynamic 10-shot (GPT-4o-mini) | 49.30 | 39.25 | 42.88 |
| Translate-test (google/gemma-7b-aps-it, GPT-4o translation) | 57.02 | 47.52 | 51.10 |
| Translate-test (google/gemma-7b-aps-it, GPT-4o-mini translation) | 57.19 | 47.68 | 51.26 |
| Translate-train (Qwen/Qwen2.5-7B-Instruct) | 42.62 | 38.37 | 39.64 |
| Translate-train (LGAI-EXAONE/EXAONE-3.5-7.8B-Instruct) | 46.82 | 43.08 | 44.02 |
| Translate-train (yanolja/EEVE-Korean-Instruct-10.8B-v1.0, this model) | 50.82 | 45.89 | 47.44 |