Update README.md
README.md CHANGED
@@ -4,12 +4,12 @@ tags:
 - sentence-transformers
 - feature-extraction
 - sentence-similarity
-
+license: apache-2.0
 ---
 
 # {MODEL_NAME}
 
-
+Sentence Transformer for Assurance & Risk Question-Answering (STAR-QA) is a fine-tuned [sentence-transformers](https://www.SBERT.net) model based on ALL-MPNET-BASE-V2. It was developed to produce **state-of-the-art embeddings for audit, risk-management, compliance and associated regulatory documents**. The model maps sentences to a 768-dimensional dense vector space and can be used for tasks like clustering and semantic search, for example as part of retrieval-augmented generation pipelines.
 
 <!--- Describe your model here -->
 
@@ -32,17 +32,20 @@ embeddings = model.encode(sentences)
 print(embeddings)
 ```
 
+## Evaluation Results
 
+The model was evaluated on a held-out sample from the STAR-QA dataset (see below) using `sentence_transformers.evaluation.InformationRetrievalEvaluator`. Reported metrics include precision and recall at 3 retrieved candidates, as well as MRR@10, MAP@10 and NDCG@100. This fine-tuned model was also benchmarked against its base model using the same methodology.
 
-## Evaluation Results
+## Training Data
 
-
+The model was fine-tuned on a corpus of audit, risk-management, compliance and associated regulatory documents sourced from the public internet. Documents were cleaned and chunked into 2-sentence blocks, and each block was then sent to a state-of-the-art LLM with the following prompt:
 
-
+"Write a question about {document_topic} for which this is the answer: {block}"
 
+The resulting question and its associated ground-truth answer (collectively a "pair") constitute a single training example for the fine-tuning step.
 
 ## Training
-The model was trained with the parameters:
+The model was fine-tuned with the parameters:
 
 **DataLoader**:
 
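The evaluation methodology added above can be reproduced roughly as follows. This is a minimal sketch, not the card's actual harness: the query/corpus/relevance dicts are invented placeholders standing in for the held-out STAR-QA sample, and the model id is taken from the citation URL below.

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import InformationRetrievalEvaluator

model = SentenceTransformer("dptrsa/STAR-QA")

# Placeholder held-out sample: query id -> question, corpus id -> 2-sentence
# block, query id -> set of ids of blocks that answer it.
queries = {"q1": "Which risk does segregation of duties primarily mitigate?"}
corpus = {
    "d1": "Segregation of duties is a key internal control. It reduces the risk of fraud and error by dividing responsibilities.",
    "d2": "An unrelated filler block about regulatory reporting deadlines.",
}
relevant_docs = {"q1": {"d1"}}

evaluator = InformationRetrievalEvaluator(
    queries,
    corpus,
    relevant_docs,
    precision_recall_at_k=[3],  # P/R @ 3 candidates
    mrr_at_k=[10],              # MRR @ 10
    map_at_k=[10],              # MAP @ 10
    ndcg_at_k=[100],            # NDCG @ 100
    name="star-qa-heldout",
)
print(evaluator(model))  # rerun with the base model for the benchmark comparison
```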
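The pair-generation step described in the new `## Training Data` section might look like the sketch below. The card does not name its cleaning pipeline or LLM, so `ask_llm` is a hypothetical stand-in, and the NLTK sentence splitter is an assumption.

```python
from nltk.tokenize import sent_tokenize  # assumed splitter; run nltk.download("punkt") once beforehand

def ask_llm(prompt: str) -> str:
    """Hypothetical stand-in for a call to the state-of-the-art LLM used."""
    raise NotImplementedError

def make_pairs(document_text: str, document_topic: str) -> list[tuple[str, str]]:
    """Chunk a cleaned document into 2-sentence blocks and generate one
    (question, ground-truth answer) training pair per block."""
    sentences = sent_tokenize(document_text)
    blocks = [" ".join(sentences[i:i + 2]) for i in range(0, len(sentences), 2)]
    pairs = []
    for block in blocks:
        question = ask_llm(
            f"Write a question about {document_topic} for which this is the answer: {block}"
        )
        pairs.append((question, block))  # one fine-tuning example
    return pairs
```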
@@ -76,7 +79,6 @@ Parameters of the fit()-Method:
 }
 ```
 
-
 ## Full Model Architecture
 ```
 SentenceTransformer(
@@ -88,4 +90,4 @@ SentenceTransformer(
 
 ## Citing & Authors
 
-
+@misc{Theron_2024, title={Sentence Transformer for Assurance & Risk Question-Answering (STAR-QA)}, url={https://huggingface.co/dptrsa/STAR-QA}, author={Theron, Daniel}, year={2024}, month={Feb} }
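For reference, a fit()-style fine-tuning run over such pairs, in the shape the **DataLoader** and fit()-Method sections above describe, is sketched below. The loss, batch size, epochs and warmup steps are illustrative assumptions, not the parameters the card actually lists.

```python
from torch.utils.data import DataLoader
from sentence_transformers import InputExample, SentenceTransformer, losses

# (question, ground-truth answer) pairs as produced by the generation step
pairs = [("Which risk does segregation of duties primarily mitigate?",
          "Segregation of duties reduces the risk of fraud and error by dividing responsibilities.")]

model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")  # the base model

train_examples = [InputExample(texts=[question, answer]) for question, answer in pairs]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.MultipleNegativesRankingLoss(model)  # a common choice for QA pairs; assumed

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,          # illustrative
    warmup_steps=100,  # illustrative
)
```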