---
license: apache-2.0
language:
- ja
pipeline_tag: text-generation
tags:
- RAG
---
# Kurage

<p align="center">
  <img src="https://cdn-uploads.huggingface.co/production/uploads/64b63f8ad57e02621dc93c8b/_SkPhhsg40juscfv9dU4v.jpeg" alt="An anime image of a pink and blue jellyfish surrounded by bubbles" width=500 style="border: 5px solid #3d3c3c"/>
</p>

Kurage is a multipurpose RAG model from [Lightblue](https://huggingface.co/lightblue).

This version of the model has been trained to perform RAG in Japanese.

# Features / How to use

* **Multi-chunk RAG**

This model can take multiple contexts and a question as input; it will first output the references of the relevant contexts, then an answer to the question.

* **Single-chunk RAG**

This model can also take a single context and a question as input, determining whether the question can be answered from the context and outputting an answer if it can. This allows multiple contexts to be processed in parallel.

* **Answer extension**

By default, this model is trained to output the shortest possible answer to a question. If you require a longer answer, you can prompt the model to write one by appending " <<Long>>" to your question.

* **Multilinguality**

We have trained our model to answer questions in Japanese based on texts in other languages too!

* **Q&A generation**

This model can also generate questions and answers based on a piece of text. This can be useful for pre-indexing a database or for fine-tuning IR models that will then be used for RAG.

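The exact input format depends on the model's chat template, which is not shown here. As an illustrative sketch only, the multi-chunk and answer-extension behaviours described above might be driven by a prompt assembled like this; the `<<Chunk N>>` and `<<Question>>` labels are hypothetical placeholders, and only the " <<Long>>" suffix is taken from the description above:

```python
# Sketch: assembling a multi-chunk RAG prompt.
# The "<<Chunk N>>"/"<<Question>>" labels are hypothetical -- check the
# model's chat template for the actual format. Only the " <<Long>>"
# suffix for requesting longer answers comes from the model card.

def build_rag_prompt(chunks, question, long_answer=False):
    """Join context chunks and a question into a single prompt string."""
    parts = [f"<<Chunk {i + 1}>>\n{chunk}" for i, chunk in enumerate(chunks)]
    if long_answer:
        question = question + " <<Long>>"  # ask the model for a longer answer
    parts.append(f"<<Question>>\n{question}")
    return "\n\n".join(parts)

prompt = build_rag_prompt(
    ["Jellyfish are found in every ocean.",
     "Some jellyfish are bioluminescent."],
    "Where are jellyfish found?",
    long_answer=True,
)
print(prompt)
```

For single-chunk RAG, the same builder would be called once per context with a one-element chunk list, letting each context be scored independently and in parallel.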
# Training data

We trained on chunks sourced from documents in the [MADLAD-400](https://huggingface.co/datasets/allenai/MADLAD-400) dataset that had been judged by a state-of-the-art LLM to contain a high amount of educational information.

We randomly took chunks of 250, 500, and 1,000 tokens from each document.
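The chunk-extraction step can be sketched as follows. The whitespace split stands in for the real tokenizer (which is not specified), and the uniform sampling of chunk size and position is an assumption:

```python
import random

# Sketch of the chunk-extraction step described above. A whitespace
# split stands in for the actual tokenizer, which is not specified.
CHUNK_SIZES = [250, 500, 1000]  # chunk lengths in tokens

def sample_chunk(document: str, rng: random.Random) -> str:
    """Take one randomly sized, randomly placed contiguous chunk of tokens."""
    tokens = document.split()
    size = rng.choice(CHUNK_SIZES)
    if len(tokens) <= size:
        return document  # document is shorter than the chunk size
    start = rng.randrange(len(tokens) - size)
    return " ".join(tokens[start:start + size])

doc = " ".join(f"tok{i}" for i in range(2000))
chunk = sample_chunk(doc, random.Random(0))
print(len(chunk.split()))
```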
We then used a state-of-the-art LLM to generate questions and answers based on each chunk.

Finally, we selected negatives for each chunk by similarity over the dense embeddings of the [BAAI/bge-m3](https://huggingface.co/BAAI/bge-m3) model.
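The negative-selection step amounts to picking, for each question, the chunks whose dense embeddings are most similar to it while excluding the true positive. A minimal sketch with toy vectors (standing in for bge-m3 embeddings, which are 1024-dimensional in practice; the number of negatives per chunk is an assumption):

```python
import numpy as np

# Sketch of hard-negative selection by dense-embedding similarity.
# Toy 4-d vectors stand in for BAAI/bge-m3 embeddings; k is an
# assumed number of negatives per chunk.

def select_negatives(query_vec, chunk_vecs, positive_idx, k=2):
    """Return indices of the k chunks most cosine-similar to the query,
    excluding the positive chunk -- these become hard negatives."""
    chunk_vecs = chunk_vecs / np.linalg.norm(chunk_vecs, axis=1, keepdims=True)
    query_vec = query_vec / np.linalg.norm(query_vec)
    sims = chunk_vecs @ query_vec        # cosine similarity per chunk
    sims[positive_idx] = -np.inf         # never pick the positive itself
    return np.argsort(-sims)[:k].tolist()

rng = np.random.default_rng(0)
chunks = rng.normal(size=(6, 4))
query = chunks[3] + 0.1 * rng.normal(size=4)  # query resembles chunk 3
negatives = select_negatives(query, chunks, positive_idx=3)
print(negatives)
```

Similar-but-wrong chunks selected this way make harder negatives than random ones, which is what makes them useful for training a RAG model to abstain when a context does not actually answer the question.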