Update README.md
README.md
CHANGED
|
@@ -17,23 +17,23 @@ We evaluated granite-vision-3.3-2b-embedding alongside other top colBERT style m
|
|
| 17 |
## **NDCG@5 - ViDoRe V2**
|
| 18 |
| Collection \ Model | ColPali-v1.3 | ColQwen2.5-v0.2 | ColNomic-3b | ColSmolvlm-v0.1 | granite-vision-3.3-2b-embedding |
|
| 19 |
|----------------------------------------|--------------|------------------|-------------|-------------------|-----------
|
| 20 |
-
| ESG Restaurant Human | 51.
|
| 21 |
-
| Economics Macro Multilingual | 49.
|
| 22 |
-
| MIT Biomedical | 59.
|
| 23 |
-
| ESG Restaurant Synthetic | 57.
|
| 24 |
-
| ESG Restaurant Synthetic Multilingual | 55.
|
| 25 |
-
| MIT Biomedical Multilingual | 56.
|
| 26 |
-
| Economics Macro | 51.
|
| 27 |
-
| **Avg (ViDoRe2)** | **54.
|
| 28 |
|
| 29 |
## **NDCG@5 - REAL-MM-RAG**
|
| 30 |
| Collection \ Model | ColPali-v1.3 | ColQwen2.5-v0.2 | ColNomic-3b | ColSmolvlm-v0.1 | granite-vision-3.3-2b-embedding |
|
| 31 |
|----------------------------------------|--------------|------------------|-------------|--------------------------| ------------------
|
| 32 |
-
| FinReport |
|
| 33 |
-
| FinSlides |
|
| 34 |
-
| TechReport |
|
| 35 |
-
| TechSlides |
|
| 36 |
-
| **Avg (REAL-MM-RAG)** | **
|
| 37 |
|
| 38 |
- **Release Date**: June 11th 2025
|
| 39 |
- **License:** [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)
|
|
@@ -105,7 +105,13 @@ print("=" * 50)
|
|
| 105 |
For an example of MM-RAG using granite-vision-3.3-2b-embedding refer to [this notebook](......).
|
| 106 |
|
| 107 |
**Model Architecture:**
|
| 108 |
-
| 109 |
|
| 110 |
**Training Data:**
|
| 111 |
Our training data comes entirely from DocFM. DocFM is a large-scale comprehensive dataset effort at IBM consisting of 85 million document pages extracted from unique PDF
|
|
|
|
| 17 |
## **NDCG@5 - ViDoRe V2**
|
| 18 |
| Collection \ Model | ColPali-v1.3 | ColQwen2.5-v0.2 | ColNomic-3b | ColSmolvlm-v0.1 | granite-vision-3.3-2b-embedding |
|
| 19 |
|----------------------------------------|--------------|------------------|-------------|-------------------|-----------
|
| 20 |
+
| ESG Restaurant Human | 51.1 | 68.4 | 65.8 | 62.4 | 62.3 |
|
| 21 |
+
| Economics Macro Multilingual | 49.9 | 56.5 | 55.4 | 47.4 | 48.3 |
|
| 22 |
+
| MIT Biomedical | 59.7 | 63.6 | 63.5 | 58.1 | 60.0 |
|
| 23 |
+
| ESG Restaurant Synthetic | 57.0 | 57.4 | 56.6 | 51.1 | 54.0 |
|
| 24 |
+
| ESG Restaurant Synthetic Multilingual | 55.7 | 57.4 | 57.2 | 47.6 | 53.5 |
|
| 25 |
+
| MIT Biomedical Multilingual | 56.5 | 61.1 | 62.5 | 50.5 | 53.6 |
|
| 26 |
+
| Economics Macro | 51.6 | 59.8 | 60.2 | 60.9 | 60.0 |
|
| 27 |
+
| **Avg (ViDoRe2)** | **54.5** | **60.6** | **60.2** | **54.0** | **56.0** |
|
| 28 |
|
| 29 |
## **NDCG@5 - REAL-MM-RAG**
|
| 30 |
| Collection \ Model | ColPali-v1.3 | ColQwen2.5-v0.2 | ColNomic-3b | ColSmolvlm-v0.1 | granite-vision-3.3-2b-embedding |
|
| 31 |
|----------------------------------------|--------------|------------------|-------------|--------------------------| ------------------
|
| 32 |
+
| FinReport | 55 | 66 | 78 | 65 | 70 |
|
| 33 |
+
| FinSlides | 68 | 79 | 81 | 55 | 74 |
|
| 34 |
+
| TechReport | 78 | 86 | 88 | 83 | 84 |
|
| 35 |
+
| TechSlides | 90 | 93 | 92 | 91 | 93 |
|
| 36 |
+
| **Avg (REAL-MM-RAG)** | **73** | **81** | **85** | **74** | **80** |
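
For readers unfamiliar with the metric, the following is a minimal sketch of how NDCG@5 can be computed for a single query. It assumes linear gain and a log2 rank discount; the benchmark harnesses behind the tables above may differ in gain function or tie handling, and the table values appear to be reported on a 0 to 100 scale. The function names `dcg_at_k` and `ndcg_at_k` are illustrative, not part of any library.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k results (linear gain)."""
    # enumerate() is 0-based, so the top result gets discount log2(2) = 1.
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances, all_relevances, k=5):
    """NDCG@k: DCG of the returned ranking divided by the DCG of an ideal ranking.

    ranked_relevances: relevance labels of the retrieved pages, in ranked order.
    all_relevances:    relevance labels of every relevant page for the query.
    """
    ideal_dcg = dcg_at_k(sorted(all_relevances, reverse=True), k)
    if ideal_dcg == 0:
        return 0.0
    return dcg_at_k(ranked_relevances, k) / ideal_dcg


# Hypothetical query with one relevant page, retrieved at rank 2.
second_place = ndcg_at_k([0, 1, 0, 0, 0], [1], k=5)   # 1 / log2(3), about 0.63
first_place = ndcg_at_k([1, 0, 0, 0, 0], [1], k=5)    # 1.0
```

A per-collection score would then be the mean of this value over all queries in the collection.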
|
| 37 |
|
| 38 |
- **Release Date**: June 11th 2025
|
| 39 |
- **License:** [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)
|
|
|
|
| 105 |
For an example of MM-RAG using granite-vision-3.3-2b-embedding refer to [this notebook](......).
|
| 106 |
|
| 107 |
**Model Architecture:**
|
| 108 |
+
The architecture of granite-vision-3.3-2b-embedding follows the [ColPali](https://arxiv.org/abs/2407.01449) approach and consists of the following components:
|
| 109 |
+
|
| 110 |
+
(1) Vision-language model: [granite-vision-3.3-2b](https://huggingface.co/ibm-granite/granite-vision-3.3-2b).
|
| 111 |
+
|
| 112 |
+
(2) Projection layer: a linear layer that projects the hidden dimension of the vision-language model down to 128 and outputs 729 embedding vectors per image.
|
| 113 |
+
|
| 114 |
+
Scoring uses a MaxSim-based late-interaction mechanism.
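
As a rough illustration of MaxSim late interaction, the sketch below scores a query against pages in plain Python. It is not the model's actual scoring code: the helper names (`dot`, `maxsim_score`, `rank_pages`) are hypothetical, and the toy vectors are 2-dimensional, whereas the architecture described above would give each page 729 vectors of dimension 128.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_vecs, page_vecs):
    """MaxSim: for each query vector take its best match among the page
    vectors, then sum those maxima over all query vectors."""
    return sum(max(dot(q, d) for d in page_vecs) for q in query_vecs)


def rank_pages(query_vecs, pages):
    """Return page indices ordered from highest to lowest MaxSim score."""
    scores = [maxsim_score(query_vecs, page) for page in pages]
    return sorted(range(len(pages)), key=lambda i: scores[i], reverse=True)


# Toy data (assumed for illustration only): two query token vectors
# and two pages with a handful of patch vectors each.
query = [[1.0, 0.0], [0.0, 1.0]]
page_a = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
page_b = [[0.5, 0.5], [0.0, 0.2]]

score_a = maxsim_score(query, page_a)   # 1.0 + 1.0 = 2.0
score_b = maxsim_score(query, page_b)   # 0.5 + 0.5 = 1.0
ranking = rank_pages(query, [page_a, page_b])
```

Because page embeddings do not depend on the query, they can be computed once at indexing time; only the query embeddings and the MaxSim sum are needed at search time.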
|
| 115 |
|
| 116 |
**Training Data:**
|
| 117 |
Our training data comes entirely from DocFM. DocFM is a large-scale comprehensive dataset effort at IBM consisting of 85 million document pages extracted from unique PDF
|