---
license: apache-2.0
language:
- en
base_model:
- aaditya/Llama3-OpenBioLLM-8B
---
# OpenBioLLM-Text2Graph-8B
This model is a biomedical annotation model designed to generate named entity annotations from unlabeled biomedical text.
It was introduced in the paper [GLiNER-BioMed: A Suite of Efficient Models for Open Biomedical Named Entity Recognition](https://arxiv.org/abs/2504.00676).
This model enables **high-throughput, cost-efficient synthetic biomedical NER data generation**, serving as the synthetic annotation backbone for [GLiNER-BioMed models](https://huggingface.co/collections/knowledgator/gliner-biomed-67ecf1b7cc62e673dbc8b57f).
## Usage
To use the model with the `transformers` package, see the example below:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
model_name = "Ihor/OpenBioLLM-Text2Graph-8B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.chat_template = "{% set loop_messages = messages %}{% for message in loop_messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>\n\n'+ message['content'] | trim + '<|end_of_text|>' %}{% if loop.index0 == 0 %}{% set content = bos_token + content %}{% endif %}{{ content }}{% endfor %}{% if add_generation_prompt %}{{ '<|start_header_id|>assistant<|end_header_id|>\n\n' }}{% endif %}"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
MESSAGES = [
    {
        "role": "system",
        "content": (
            "You are an advanced assistant trained to process biomedical text for Named Entity Recognition (NER) and Relation Extraction (RE). "
            "Your task is to analyze user-provided text, identify all unique and contextually relevant entities, and infer directed relationships "
            "between these entities based on the context. Ensure that all relations exist only between annotated entities. "
            "Entities and relationships should be human-readable and natural, reflecting real-world concepts and connections. "
            "Output the annotated data in JSON format, structured as follows:\n\n"
            """{"entities": [{"id": 0, "text": "ner_string_0", "type": "ner_type_string_0"}, {"id": 1, "text": "ner_string_1", "type": "ner_type_string_1"}], "relations": [{"head": 0, "tail": 1, "type": "re_type_string_0"}]}"""
            "\n\nEnsure that the output captures all significant entities and their directed relationships in a clear and concise manner."
        ),
    },
    {
        "role": "user",
        "content": (
            'Here is a text input: "Subjects will receive a 100mL dose of IV saline every 6 hours for 24 hours. The first dose will be administered prior to anesthesia induction, approximately 30 minutes before skin incision. A total of 4 doses will be given." '
            "Analyze this text, select and classify the entities, and extract their relationships as per your instructions."
        ),
    },
]
# Build prompt text
chat_prompt = tokenizer.apply_chat_template(
    MESSAGES, tokenize=False, add_generation_prompt=True
)
# Tokenize
inputs = tokenizer(chat_prompt, return_tensors="pt").to(model.device)
# Generate
outputs = model.generate(
    **inputs,
    max_new_tokens=3000,
    do_sample=True,
    eos_token_id=tokenizer.eos_token_id,
    pad_token_id=tokenizer.eos_token_id,
    return_dict_in_generate=True,
)
# Decode ONLY the new tokens (skip the prompt tokens)
prompt_len = inputs["input_ids"].shape[-1]
generated_ids = outputs.sequences[0][prompt_len:]
response = tokenizer.decode(generated_ids, skip_special_tokens=True)
print(response)
```
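The system prompt asks for a JSON object with `entities` and `relations`, where each relation refers to entity `id` values. A minimal sketch of how the generated string could be parsed and checked is shown below. The helper name `parse_annotation` and the sample response are hypothetical (the sample is hand-written for illustration, not real model output), and the sketch assumes the model may wrap the JSON in extra text.

```python
import json


def parse_annotation(response: str) -> dict:
    """Parse a response and keep only relations whose endpoints are annotated entities."""
    # Take the span from the first "{" to the last "}" in case of surrounding text.
    start, end = response.find("{"), response.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in response")
    data = json.loads(response[start : end + 1])

    entities = data.get("entities", [])
    entity_ids = {entity["id"] for entity in entities}
    # Drop relations that point at ids missing from the entity list.
    relations = [
        relation
        for relation in data.get("relations", [])
        if relation["head"] in entity_ids and relation["tail"] in entity_ids
    ]
    return {"entities": entities, "relations": relations}


# Hypothetical response, written by hand for illustration only.
sample_response = (
    '{"entities": [{"id": 0, "text": "IV saline", "type": "drug"}, '
    '{"id": 1, "text": "100mL", "type": "dosage"}], '
    '"relations": [{"head": 0, "tail": 1, "type": "has dosage"}, '
    '{"head": 0, "tail": 7, "type": "dangling"}]}'
)
graph = parse_annotation(sample_response)
print(len(graph["entities"]), len(graph["relations"]))  # prints: 2 1
```

Filtering out dangling relations mirrors the prompt's requirement that relations exist only between annotated entities; whether to drop or log such relations is a design choice left to the caller.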
To use the model with the `vllm` package, see the example below:
```python
# !pip install vllm
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer
MODEL_ID = "Ihor/OpenBioLLM-Text2Graph-8B"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, use_fast=True)
tokenizer.chat_template = "{% set loop_messages = messages %}{% for message in loop_messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>\n\n'+ message['content'] | trim + '<|end_of_text|>' %}{% if loop.index0 == 0 %}{% set content = bos_token + content %}{% endif %}{{ content }}{% endfor %}{% if add_generation_prompt %}{{ '<|start_header_id|>assistant<|end_header_id|>\n\n' }}{% endif %}"
llm = LLM(model=MODEL_ID)
sampling_params = SamplingParams(
    max_tokens=3000,
    n=1,
    best_of=1,
    presence_penalty=0.0,
    frequency_penalty=0.0,
    repetition_penalty=1.0,
    temperature=0.0,
    top_p=1.0,
    top_k=-1,
    min_p=0.0,
    seed=42,
)
MESSAGES = [
    {
        "role": "system",
        "content": (
            "You are an advanced assistant trained to process biomedical text for Named Entity Recognition (NER) and Relation Extraction (RE). "
            "Your task is to analyze user-provided text, identify all unique and contextually relevant entities, and infer directed relationships "
            "between these entities based on the context. Ensure that all relations exist only between annotated entities. "
            "Entities and relationships should be human-readable and natural, reflecting real-world concepts and connections. "
            "Output the annotated data in JSON format, structured as follows:\n\n"
            """{"entities": [{"id": 0, "text": "ner_string_0", "type": "ner_type_string_0"}, {"id": 1, "text": "ner_string_1", "type": "ner_type_string_1"}], "relations": [{"head": 0, "tail": 1, "type": "re_type_string_0"}]}"""
            "\n\nEnsure that the output captures all significant entities and their directed relationships in a clear and concise manner."
        ),
    },
    {
        "role": "user",
        "content": (
            'Here is a text input: "Subjects will receive a 100mL dose of IV saline every 6 hours for 24 hours. The first dose will be administered prior to anesthesia induction, approximately 30 minutes before skin incision. A total of 4 doses will be given." '
            "Analyze this text, select and classify the entities, and extract their relationships as per your instructions."
        ),
    },
]
chat_prompt = tokenizer.apply_chat_template(
    MESSAGES,
    tokenize=False,
    add_generation_prompt=True,
    add_special_tokens=False,
)
outputs = llm.generate([chat_prompt], sampling_params)
response_text = outputs[0].outputs[0].text
print(response_text)
```
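Because the model is intended for synthetic NER data generation, the entity strings in the output usually need to be mapped back to character offsets in the input text before they can serve as span annotations. The following is a rough sketch of one way to do that, not the authors' pipeline. The helper `entities_to_spans`, the span dictionary layout, and the example entities are all hypothetical, and the sketch assumes each entity string appears verbatim in the input (only the first occurrence is matched).

```python
def entities_to_spans(text: str, entities: list[dict]) -> list[dict]:
    """Locate each entity's surface string in `text` and return character spans."""
    spans = []
    for entity in entities:
        start = text.find(entity["text"])  # first occurrence only
        if start == -1:
            continue  # entity text not found verbatim in the input; skip it
        spans.append(
            {
                "start": start,
                "end": start + len(entity["text"]),  # exclusive end offset
                "label": entity["type"],
            }
        )
    return spans


text = "Subjects will receive a 100mL dose of IV saline every 6 hours for 24 hours."
# Hypothetical entities, written by hand for illustration only.
entities = [
    {"id": 0, "text": "IV saline", "type": "drug"},
    {"id": 1, "text": "100mL", "type": "dosage"},
    {"id": 2, "text": "aspirin", "type": "drug"},  # not in the text, so it is skipped
]
for span in entities_to_spans(text, entities):
    print(span["label"], repr(text[span["start"] : span["end"]]))
```

Entities that cannot be located are silently skipped here; a stricter pipeline might instead record them for review, since a mismatch can indicate that the model paraphrased or hallucinated the mention.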
## Citation
If you use this model, please cite:
```bibtex
@misc{yazdani2025glinerbiomedsuiteefficientmodels,
title={GLiNER-BioMed: A Suite of Efficient Models for Open Biomedical Named Entity Recognition},
author={Anthony Yazdani and Ihor Stepanov and Douglas Teodoro},
year={2025},
eprint={2504.00676},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2504.00676},
}
```