---
license: gemma
language:
- en
- ko
tags:
- gemma-2
- KINS-ai
base_model:
- google/gemma-2-27b
pipeline_tag: text-generation
library_name: transformers
---

# **Introduction**

### About the Model

We introduce ATOMIS, a 32-billion-parameter large language model (LLM) developed by the Korea Institute of Nuclear Safety (KINS) and designed specifically for the nuclear field. It achieves state-of-the-art performance among comparable models on LogicKor, a real-world Korean task benchmark; NuclearQA, a nuclear-domain benchmark; and RAGEval, a RAG benchmark. Please refer to the evaluation tables below for details.

## Key Features

- **Korean Real-World Use Cases:** The model understands and generates Korean text with high accuracy, making it suitable for practical, real-world scenarios.
- **Specialized in the Nuclear Domain:** The model has been trained on a vast, specialized corpus of nuclear-domain data.
- **RAG:** Strong retrieval-augmented generation (RAG) performance allows the model to deliver accurate answers grounded in real documents.

### Pre-Training

We created the base model by expanding the layers of gemma-2-27b using a passthrough method. We also extended the context length to 32K with RoPE and performed continued pretraining to restore the model's performance. A rough sketch of the layer-expansion step is shown below.

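The exact expansion configuration for ATOMIS is not published in this card. Purely as a hypothetical illustration of passthrough-style depth up-scaling (the layer ranges, `rope_theta` value, and output path below are illustrative assumptions, not the ATOMIS recipe), the step can be sketched with `transformers` roughly as follows:

```python
import copy

import torch
from transformers import AutoModelForCausalLM

# Load the donor model. (Loading a 27B checkpoint requires substantial CPU/GPU memory.)
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2-27b", torch_dtype=torch.bfloat16
)

# Hypothetical layer ranges: gemma-2-27b has 46 decoder layers; stacking [0, 28) and
# [19, 46) gives 55 layers (roughly 32B parameters). The ranges actually used for
# ATOMIS are not published here.
ranges = [(0, 28), (19, 46)]

expanded = torch.nn.ModuleList()
for start, end in ranges:
    for i in range(start, end):
        expanded.append(copy.deepcopy(model.model.layers[i]))

# Re-index the copied layers so KV-cache bookkeeping stays consistent. A complete
# implementation must also keep gemma-2's alternating local/global attention pattern
# intact across the duplicated ranges.
for idx, layer in enumerate(expanded):
    layer.self_attn.layer_idx = idx

model.model.layers = expanded
model.config.num_hidden_layers = len(expanded)

# Context extension: one common recipe raises rope_theta in the config used for
# continued pretraining to stretch RoPE toward a 32K window (illustrative value only;
# changing the config after loading does not rebuild the rotary caches by itself).
model.config.rope_theta = 160000.0

model.save_pretrained("atomis-base-sketch")  # hypothetical output path
```
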
In particular, to train specialized knowledge in the nuclear domain, we included the following data sources:

- Atomic Wiki (https://atomic.snu.ac.kr)
- NText (https://paperswithcode.com/dataset/ntext)
- In-house data from KINS (Korea Institute of Nuclear Safety)

### Post-Training

The fine-tuning data comprises over 1M examples drawn from publicly available instruction datasets, as well as high-quality synthetic data. We used this data to perform supervised fine-tuning (SFT) and direct preference optimization (DPO).

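The SFT stage follows the usual instruction-tuning setup, while DPO optimizes the model directly on preference pairs. As a generic reference (not the ATOMIS training code; the log-probabilities and `beta` below are placeholders), the DPO objective can be written compactly in PyTorch:

```python
import torch
import torch.nn.functional as F

def dpo_loss(
    policy_chosen_logps: torch.Tensor,    # sum of token log-probs of the chosen answer under the policy
    policy_rejected_logps: torch.Tensor,  # same for the rejected answer
    ref_chosen_logps: torch.Tensor,       # same quantities under the frozen reference (SFT) model
    ref_rejected_logps: torch.Tensor,
    beta: float = 0.1,                    # placeholder temperature; the value used for ATOMIS is not published
) -> torch.Tensor:
    """DPO objective: -log sigmoid(beta * (chosen policy-vs-ref margin - rejected margin))."""
    chosen_margin = policy_chosen_logps - ref_chosen_logps
    rejected_margin = policy_rejected_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()

# Toy usage with random log-probabilities for a batch of 4 preference pairs.
fake_logps = [torch.randn(4) for _ in range(4)]
print(dpo_loss(*fake_logps).item())
```

In practice, libraries such as `trl` provide trainers (e.g., `DPOTrainer`) that implement this objective on top of a chosen/rejected preference dataset.
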
# **How to use**

```python
# pip install "transformers>=4.43.4"
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("KINS-ai/ATOMIS")
model = AutoModelForCausalLM.from_pretrained(
    "KINS-ai/ATOMIS",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

messages = [
    {"role": "user", "content": "안녕하세요?"},
]

# Build the prompt with the chat template and move the tensors to the model's device.
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)

outputs = model.generate(**input_ids, max_new_tokens=256)
print(tokenizer.decode(outputs[0]))
```

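Since the model is tuned for RAG-style grounding, retrieved passages can be placed in the user turn ahead of the question. The snippet below (which reuses the `model` and `tokenizer` loaded above) is only an illustrative prompt pattern; the document text, question, and wording are placeholders rather than an official prompt format.

```python
# Illustrative RAG-style prompt: a retrieved passage is prepended to the question in
# the user turn. The passage and question below are placeholders.
context = "..."   # text returned by your retriever for the user's query
question = "What does the document say about periodic safety reviews?"

rag_messages = [
    {
        "role": "user",
        "content": f"Answer the question using only the document below.\n\n[Document]\n{context}\n\n[Question]\n{question}",
    },
]

rag_inputs = tokenizer.apply_chat_template(
    rag_messages,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)

outputs = model.generate(**rag_inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
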
# **Evaluation**

### Overall

| Model | LogicKor | NuclearQA | RAGEval | Avg |
|--------------------------------------|-----|-----|-----|-----|
| **c4ai-command-r-08-2024** | 8.27 | 7.82 | 9.41 | 8.50 |
| **gemma-2-27b-it** | 8.66 | 8.18 | 8.97 | 8.60 |
| **Qwen2.5-32B-instruct** | 8.93 | 8.61 | 9.36 | 8.97 |
| **phi-4** | 8.62 | 8.67 | 9.55 | 8.95 |
| **Mistral-Small-24B-Instruct-2501** | 8.36 | 8.68 | 9.04 | 8.69 |
| **Llama-3.3-70b-instruct** | 7.94 | 8.42 | 9.25 | 8.54 |
| **ATOMIS** | 9.00 | 8.72 | 9.65 | **9.12** |

### LogicKor
We evaluated performance using the [LogicKor](https://github.com/instructkr/LogicKor) code. As the judge model, we employed the officially recommended GPT-4-1106-preview. These scores reflect only the default zero-shot evaluation.

| Model | Math | Reasoning | Coding | Writing | Understanding | Grammar | Single-turn | Multi-turn | Avg |
|--------------------------------------|-----|-----|-----|-----|-----|-----|-----|-----|-----|
| **c4ai-command-r-08-2024** | 6.14 | 7.36 | 9.43 | 9.64 | 9.21 | 7.86 | 8.05 | 8.52 | 8.27 |
| **gemma-2-27b-it** | 8.93 | 8.29 | 8.43 | 9.29 | 9.43 | 7.57 | 8.43 | 8.88 | 8.66 |
| **Qwen2.5-32B-instruct** | 8.79 | 8.64 | 9.36 | 9.50 | 9.29 | 8.00 | 8.79 | 9.10 | 8.93 |
| **phi-4** | 8.79 | 9.21 | 9.86 | 9.21 | 9.00 | 5.64 | 8.50 | 8.74 | 8.62 |
| **Mistral-Small-24B-Instruct-2501** | 8.00 | 8.14 | 9.36 | 9.43 | 8.50 | 6.71 | 8.29 | 8.43 | 8.36 |
| **Llama-3.3-70b-instruct** | 7.43 | 6.50 | 8.79 | 8.43 | 8.64 | 7.86 | 8.14 | 7.74 | 7.94 |
| **ATOMIS** | 8.36 | 8.71 | 9.79 | 9.64 | 8.29 | 9.21 | 9.14 | 8.86 | **9.00** |

### NuclearQA
We employed NuclearQA [1], a human-made benchmark of 100 questions designed by experts to evaluate language models in the nuclear domain.

We then used this question set to assess each model's responses in a manner similar to the LogicKor evaluation.

[1] Acharya, A., Munikoti, S., Hellinger, A., Smith, S., Wagle, S. and Horawalavithana, S., 2023. NuclearQA: A Human-Made Benchmark for Language Models for the Nuclear Domain. arXiv:2310.10920.

| Model | Easy | Medium | Hard | General | Scientific | Numerical | Num+Sci | Avg |
|--------------------------------------|-----|-----|-----|-----|-----|-----|-----|-----|
| **c4ai-command-r-08-2024** | 8.77 | 8.21 | 6.47 | 7.73 | 8.38 | 7.35 | 7.35 | 7.82 |
| **gemma-2-27b-it** | 8.97 | 8.24 | 7.33 | 7.92 | 8.23 | 8.12 | 8.45 | 8.18 |
| **Qwen2.5-32B-instruct** | 8.97 | 8.42 | 8.38 | 8.54 | 8.15 | 8.76 | 9.03 | 8.61 |
| **phi-4** | 8.94 | 8.97 | 8.11 | 8.46 | 8.73 | 9.00 | 8.50 | 8.67 |
| **Mistral-Small-24B-Instruct-2501** | 9.13 | 8.76 | 8.14 | 8.41 | 8.81 | 8.59 | 8.95 | 8.68 |
| **Llama-3.3-70b-instruct** | 9.29 | 8.58 | 7.44 | 8.22 | 8.62 | 8.47 | 8.35 | 8.42 |
| **ATOMIS** | 9.10 | 8.64 | 8.31 | 8.16 | 9.00 | 8.71 | 9.10 | **8.72** |

### RAGEval
We used RAGEval [2], a benchmark designed to evaluate RAG performance in terms of factual accuracy using three novel metrics: Completeness, Hallucination, and Irrelevance.

We evaluated performance using the [RAGEval](https://github.com/OpenBMB/RAGEval) code. As the judge model, we employed the officially recommended gpt-4o. These scores reflect only the Completeness metric of the single-document QA evaluation.

[2] Zhu, K., Luo, Y., Xu, D., Wang, R., Yu, S., Wang, S., Yan, Y., Liu, Z., Han, X., Liu, Z. and Sun, M., 2024. RAGEval: Scenario Specific RAG Evaluation Dataset Generation Framework. arXiv:2408.01262.

| Model | Factual | Summarization | Multi-hop Reasoning | Avg |
|--------------------------------------|-----|-----|-----|-----|
| **c4ai-command-r-08-2024** | 1.000 | 0.913 | 0.908 | 0.941 |
| **gemma-2-27b-it** | 0.987 | 0.890 | 0.814 | 0.897 |
| **Qwen2.5-32B-instruct** | 0.980 | 0.906 | 0.923 | 0.936 |
| **phi-4** | 1.000 | 0.931 | 0.934 | 0.955 |
| **Mistral-Small-24B-Instruct-2501** | 0.980 | 0.951 | 0.781 | 0.904 |
| **Llama-3.3-70b-instruct** | 0.977 | 0.907 | 0.893 | 0.925 |
| **ATOMIS** | 0.993 | 0.942 | 0.960 | **0.965** |