๐ŸŠ Korean Medical DPR(Dense Passage Retrieval)

1. Intro

์˜๋ฃŒ ๋ถ„์•ผ์—์„œ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ๋Š” Bi-Encoder ๊ตฌ์กฐ์˜ ๊ฒ€์ƒ‰ ๋ชจ๋ธ์ž…๋‹ˆ๋‹ค.
ํ•œยท์˜ ํ˜ผ์šฉ์ฒด์˜ ์˜๋ฃŒ ๊ธฐ๋ก์„ ์ฒ˜๋ฆฌํ•˜๊ธฐ ์œ„ํ•ด SapBERT-KO-EN ์„ ๋ฒ ์ด์Šค ๋ชจ๋ธ๋กœ ์ด์šฉํ–ˆ์Šต๋‹ˆ๋‹ค.
์งˆ๋ฌธ์€ Question Encoder๋กœ, ํ…์ŠคํŠธ๋Š” Context Encoder๋ฅผ ์ด์šฉํ•ด ์ธ์ฝ”๋”ฉํ•ฉ๋‹ˆ๋‹ค.

(โ€ป ์ด ๋ชจ๋ธ์€ AI Hub์˜ ์ดˆ๊ฑฐ๋Œ€ AI ํ—ฌ์Šค์ผ€์–ด ์งˆ์˜ ์‘๋‹ต ๋ฐ์ดํ„ฐ๋กœ ํ•™์Šตํ•œ ๋ชจ๋ธ์ž…๋‹ˆ๋‹ค.)

2. Model

(1) Self-Alignment Pretraining (SAP)

ํ•œ๊ตญ ์˜๋ฃŒ ๊ธฐ๋ก์€ ํ•œยท์˜ ํ˜ผ์šฉ์ฒด๋กœ ์“ฐ์—ฌ, ์˜์–ด ์šฉ์–ด๋„ ์ธ์‹ํ•  ์ˆ˜ ์žˆ๋Š” ๋ชจ๋ธ์ด ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค.
Multi Similarity Loss๋ฅผ ์ด์šฉํ•ด ๋™์ผํ•œ ์ฝ”๋“œ์˜ ์šฉ์–ด ๊ฐ„์— ๋†’์€ ์œ ์‚ฌ๋„๋ฅผ ๊ฐ–๋„๋ก ํ•™์Šตํ–ˆ์Šต๋‹ˆ๋‹ค.

์˜ˆ) C3843080 || ๊ณ ํ˜ˆ์•• ์งˆํ™˜ 
    C3843080 || Hypertension
    C3843080 || High Blood Pressure
    C3843080 || HTN
    C3843080 || HBP
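
The following is a minimal sketch of Multi-Similarity Loss over a batch of term embeddings. Pair mining is omitted for brevity, and the hyperparameter values are illustrative, not the ones used to train SapBERT-KO-EN:

import torch
import torch.nn.functional as F

def multi_similarity_loss(embeddings, labels, alpha=2.0, beta=50.0, lam=0.5):
    # embeddings: (B, H) term embeddings; labels: (B,) concept-code ids.
    normed = F.normalize(embeddings, dim=1)
    sim = normed @ normed.t()                 # pairwise cosine similarity
    loss = 0.0
    for i in range(len(labels)):
        pos = labels == labels[i]
        pos[i] = False                        # exclude self-similarity
        neg = labels != labels[i]
        # Pull terms that share a concept code together ...
        pos_term = torch.log1p(torch.exp(-alpha * (sim[i][pos] - lam)).sum()) / alpha
        # ... and push terms with different codes apart.
        neg_term = torch.log1p(torch.exp(beta * (sim[i][neg] - lam)).sum()) / beta
        loss = loss + pos_term + neg_term
    return loss / len(labels)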

(2) Dense Passage Retrieval (DPR)

SapBERT-KO-EN์„ ๊ฒ€์ƒ‰ ๋ชจ๋ธ๋กœ ๋งŒ๋“ค๊ธฐ ์œ„ํ•ด ์ถ”๊ฐ€์ ์ธ Fine-tuning์„ ํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค.
Bi-Encoder ๊ตฌ์กฐ๋กœ ์งˆ์˜์™€ ํ…์ŠคํŠธ์˜ ์œ ์‚ฌ๋„๋ฅผ ๊ณ„์‚ฐํ•˜๋Š” DPR ๋ฐฉ์‹์œผ๋กœ Fine-tuning ํ–ˆ์Šต๋‹ˆ๋‹ค.
๋‹ค์Œ๊ณผ ๊ฐ™์ด ๊ธฐ์กด์˜ ๋ฐ์ดํ„ฐ ์…‹์— ํ•œยท์˜ ํ˜ผ์šฉ์ฒด ์ƒ˜ํ”Œ์„ ์ฆ๊ฐ•ํ•œ ๋ฐ์ดํ„ฐ ์…‹์„ ์‚ฌ์šฉํ–ˆ์Šต๋‹ˆ๋‹ค.

์˜ˆ) ํ•œ๊ตญ์–ด ๋ณ‘๋ช…: ๊ณ ํ˜ˆ์••
    ์˜์–ด ๋ณ‘๋ช…: Hypertenstion
    ์งˆ์˜ (์›๋ณธ): ์•„๋ฒ„์ง€๊ฐ€ ๊ณ ํ˜ˆ์••์ธ๋ฐ ๊ทธ๊ฒŒ ๋ญ”์ง€ ๋ชจ๋ฅด๊ฒ ์–ด. ๊ณ ํ˜ˆ์••์ด ๋ญ”์ง€ ์„ค๋ช…์ข€ ํ•ด์ค˜.
    ์งˆ์˜ (์ฆ๊ฐ•): ์•„๋ฒ„์ง€๊ฐ€ Hypertenstion ์ธ๋ฐ ๊ทธ๊ฒŒ ๋ญ”์ง€ ๋ชจ๋ฅด๊ฒ ์–ด. Hypertenstion ์ด ๋ญ”์ง€ ์„ค๋ช…์ข€ ํ•ด์ค˜.

3. Training

(1) Self-Alignment Pretraining (SAP)

SapBERT-KO-EN ํ•™์Šต์— ํ™œ์šฉํ•œ ๋ฒ ์ด์Šค ๋ชจ๋ธ ๋ฐ ํ•˜์ดํผ ํŒŒ๋ผ๋ฏธํ„ฐ๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค.
ํ•œยท์˜ ์˜๋ฃŒ ์šฉ์–ด๋ฅผ ์ˆ˜๋กํ•œ ์˜๋ฃŒ ์šฉ์–ด ์‚ฌ์ „์ธ KOSTOM์„ ํ•™์Šต ๋ฐ์ดํ„ฐ๋กœ ์‚ฌ์šฉํ–ˆ์Šต๋‹ˆ๋‹ค.

  • Model : klue/bert-base
  • Dataset : KOSTOM
  • Epochs : 1
  • Batch Size : 64
  • Max Length : 64
  • Dropout : 0.1
  • Pooler : 'cls'
  • Eval Step : 100
  • Threshold : 0.8
  • Scale Positive Sample : 1
  • Scale Negative Sample : 60
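
As a rough illustration of the Model, Max Length, and Pooler settings above, here is a sketch assuming the Hugging Face transformers API (the actual SapBERT-KO-EN training script, including the Threshold and Scale settings for pair selection, is not reproduced here):

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('klue/bert-base')
model = AutoModel.from_pretrained('klue/bert-base')

# Synonyms sharing the code C3843080 in KOSTOM.
terms = ['고혈압 질환', 'Hypertension', 'HTN']
features = tokenizer(terms, padding=True, truncation=True,
                     max_length=64, return_tensors='pt')
with torch.no_grad():
    outputs = model(**features)
embeddings = outputs.last_hidden_state[:, 0]  # 'cls' pooling: [CLS] vector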

(2) Dense Passage Retrieval (DPR)

Fine-tuning์— ํ™œ์šฉํ•œ ๋ฒ ์ด์Šค ๋ชจ๋ธ ๋ฐ ํ•˜์ดํผ ํŒŒ๋ผ๋ฏธํ„ฐ๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค.

  • Model : SapBERT-KO-EN (klue/bert-base)
  • Dataset : ์ดˆ๊ฑฐ๋Œ€ AI ํ—ฌ์Šค์ผ€์–ด ์งˆ์˜ ์‘๋‹ต ๋ฐ์ดํ„ฐ(AI Hub)
  • Epochs : 10
  • Batch Size : 64
  • Dropout : 0.1
  • Pooler : 'cls'
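
Below is a minimal sketch of the Bi-Encoder setup for fine-tuning; the checkpoint path is a placeholder, and the assumption that both encoders start from the same SapBERT-KO-EN weights follows the standard DPR recipe:

from transformers import AutoModel

# Both encoders are initialized from the same SapBERT-KO-EN checkpoint
# ('path/to/sapbert-ko-en' is a placeholder) and diverge during fine-tuning.
question_encoder = AutoModel.from_pretrained('path/to/sapbert-ko-en')
context_encoder = AutoModel.from_pretrained('path/to/sapbert-ko-en')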

4. Example

์ด ๋ชจ๋ธ์€ Context๋ฅผ ์ธ์ฝ”๋”ฉํ•˜๋Š” ๋ชจ๋ธ๋กœ, Question ๋ชจ๋ธ๊ณผ ํ•จ๊ป˜ ์‚ฌ์šฉํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค.
๋™์ผํ•œ ์งˆ๋ณ‘์— ๊ด€ํ•œ ์งˆ๋ฌธ๊ณผ ํ…์ŠคํŠธ๊ฐ€ ๋†’์€ ์œ ์‚ฌ๋„๋ฅผ ๋ณด์ธ๋‹ค๋Š” ์‚ฌ์‹ค์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

(โ€ป ์•„๋ž˜ ์ฝ”๋“œ์˜ ์˜ˆ์‹œ๋Š” ChatGPT๋ฅผ ์ด์šฉํ•ด ์ƒ์„ฑํ•œ ์˜๋ฃŒ ํ…์ŠคํŠธ์ž…๋‹ˆ๋‹ค.)
(โ€ป ํ•™์Šต ๋ฐ์ดํ„ฐ์˜ ํŠน์„ฑ ์ƒ ์˜ˆ์‹œ ๋ณด๋‹ค ์ •์ œ๋œ ํ…์ŠคํŠธ์— ๋Œ€ํ•ด ๋” ์ž˜ ์ž‘๋™ํ•ฉ๋‹ˆ๋‹ค.)

import numpy as np
from transformers import AutoModel, AutoTokenizer

# Question Model
q_model_path = 'snumin44/medical-biencoder-ko-bert-question'
q_model = AutoModel.from_pretrained(q_model_path)
q_tokenizer = AutoTokenizer.from_pretrained(q_model_path)

# Context Model
c_model_path = 'snumin44/medical-biencoder-ko-bert-context'
c_model = AutoModel.from_pretrained(c_model_path)
c_tokenizer = AutoTokenizer.from_pretrained(c_model_path)


query = 'high blood pressure ์ฒ˜๋ฐฉ ์‚ฌ๋ก€'

targets = [
    """๊ณ ํ˜ˆ์•• ์ง„๋‹จ.
    ํ™˜์ž ์ƒ๋‹ด ๋ฐ ์ƒํ™œ์Šต๊ด€ ๊ต์ • ๊ถŒ๊ณ . ์ €์—ผ์‹, ๊ทœ์น™์ ์ธ ์šด๋™, ๊ธˆ์—ฐ, ๊ธˆ์ฃผ ์ง€์‹œ.
    ํ™˜์ž ์žฌ๋ฐฉ๋ฌธ. ํ˜ˆ์••: 150/95mmHg. ์•ฝ๋ฌผ์น˜๋ฃŒ ์‹œ์ž‘. Amlodipine 5mg 1์ผ 1ํšŒ ์ฒ˜๋ฐฉ.""",
    
    """์‘๊ธ‰์‹ค ๋„์ฐฉ ํ›„ ์œ„ ๋‚ด์‹œ๊ฒฝ ์ง„ํ–‰.
    ์†Œ๊ฒฌ: Gastric ulcer์—์„œ Forrest IIb ๊ด€์ฐฐ๋จ. ์ถœํ˜ˆ์€ ์†Œ๋Ÿ‰์˜ ์‚ผ์ถœ์„ฑ ์ถœํ˜ˆ ํ˜•ํƒœ.
    ์ฒ˜์น˜: ์—ํ”ผ๋„คํ”„๋ฆฐ ์ฃผ์‚ฌ๋กœ ์ถœํ˜ˆ ๊ฐ์†Œ ํ™•์ธ. Hemoclip 2๊ฐœ๋กœ ์ถœํ˜ˆ ๋ถ€์œ„ ํด๋ฆฌํ•‘ํ•˜์—ฌ ์ง€ํ˜ˆ ์™„๋ฃŒ.""",
    
    """ํ˜ˆ์ค‘ ๋†’์€ ์ง€๋ฐฉ ์ˆ˜์น˜ ๋ฐ ์ง€๋ฐฉ๊ฐ„ ์†Œ๊ฒฌ.
    ๋‹ค๋ฐœ์„ฑ gallstones ํ™•์ธ. ์ฆ์ƒ ์—†์„ ๊ฒฝ์šฐ ๊ฒฝ๊ณผ ๊ด€์ฐฐ ๊ถŒ์žฅ.
    ์šฐ์ธก renal cyst, ์–‘์„ฑ ๊ฐ€๋Šฅ์„ฑ ๋†’์œผ๋ฉฐ ์ถ”๊ฐ€์ ์ธ ์ฒ˜์น˜ ๋ถˆํ•„์š” ํ•จ."""
]

# Encode the query with the Question Encoder ([CLS] pooled output).
query_feature = q_tokenizer(query, return_tensors='pt')
query_outputs = q_model(**query_feature, return_dict=True)
query_embeddings = query_outputs.pooler_output.detach().numpy().squeeze()

def cos_sim(A, B):
    # Cosine similarity between two 1-D embedding vectors.
    return np.dot(A, B) / (np.linalg.norm(A) * np.linalg.norm(B))

# Encode each passage with the Context Encoder and compare it to the query.
for idx, target in enumerate(targets):
    target_feature = c_tokenizer(target, return_tensors='pt')
    target_outputs = c_model(**target_feature, return_dict=True)
    target_embeddings = target_outputs.pooler_output.detach().numpy().squeeze()
    similarity = cos_sim(query_embeddings, target_embeddings)
    print(f"Similarity between query and target {idx}: {similarity:.4f}")
Output:
Similarity between query and target 0: 0.2674
Similarity between query and target 1: 0.0416
Similarity between query and target 2: 0.0476
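
In practice the context embeddings would be precomputed and all passages ranked in a single pass (FAISS or a vector index would replace this at scale); a minimal sketch continuing from the variables above:

# Collect the context embeddings once, then rank every passage at once.
ctx_embs = []
for target in targets:
    feature = c_tokenizer(target, return_tensors='pt')
    outputs = c_model(**feature, return_dict=True)
    ctx_embs.append(outputs.pooler_output.detach().numpy().squeeze())
ctx_matrix = np.stack(ctx_embs)  # (num_passages, hidden_size)
# Vectorized cosine similarity, equivalent to the loop above.
scores = ctx_matrix @ query_embeddings / (
    np.linalg.norm(ctx_matrix, axis=1) * np.linalg.norm(query_embeddings))
print('Best match:', int(np.argmax(scores)))  # 0, the hypertension note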

5. Citing

@inproceedings{liu2021self,
    title={Self-Alignment Pretraining for Biomedical Entity Representations},
    author={Liu, Fangyu and Shareghi, Ehsan and Meng, Zaiqiao and Basaldella, Marco and Collier, Nigel},
    booktitle={Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies},
    pages={4228--4238},
    month = jun,
    year={2021}
}
@inproceedings{karpukhin2020dense,
    title={Dense Passage Retrieval for Open-Domain Question Answering},
    author={Karpukhin, Vladimir and Oğuz, Barlas and Min, Sewon and Lewis, Patrick and Wu, Ledell and Edunov, Sergey and Chen, Danqi and Yih, Wen-tau},
    booktitle={Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
    year={2020}
}