Update README.md
-->
<img src="https://i.imgur.com/E1vqCIw.png" alt="PicToModel" width="330"/>

Jellyfish models with other sizes are available here:
[Jellyfish-7B](https://huggingface.co/NECOUDBFM/Jellyfish-7B)
[Jellyfish-13B](https://huggingface.co/NECOUDBFM/Jellyfish-13B)
## Model Details
Jellyfish-8B is a large language model equipped with 8 billion parameters.
We fine-tuned the [Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) model using a subset of the [Jellyfish-Instruct](https://huggingface.co/datasets/NECOUDBFM/Jellyfish-Instruct) dataset.
<!-- Jellyfish-7B vs GPT-3.5-turbo winning rate by GPT-4 evaluation is 56.36%. -->

If you find our work useful, please give us credit by citing:

| Entity Matching | Unseen | Walmart-Amazon | 86.89 | 87.00 | **90.27** | 79.19 | 82.40 | 84.91 | 85.24 | *89.42* |
| Avg | | | 80.44 | - | *84.17* | 72.58 | - | 82.74 | 81.55 | **86.02** |

_For GPT-3.5 and GPT-4, we used the few-shot approach on all datasets. For Jellyfish models, the few-shot approach is disabled on seen datasets and enabled on unseen datasets._
_We use accuracy as the metric for data imputation and the F1 score for the other tasks._
1.
[HoloDetect](https://arxiv.org/abs/1904.02285) for Error Detection (seen datasets)
[RAHA](https://dl.acm.org/doi/10.1145/3299869.3324956) for Error Detection (unseen datasets)
[IPM](https://ieeexplore.ieee.org/document/9458712) for Data Imputation
[SMAT](https://www.researchgate.net/publication/353920530_SMAT_An_Attention-Based_Deep_Learning_Solution_to_the_Automation_of_Schema_Matching) for Schema Matching
[Ditto](https://arxiv.org/abs/2004.00584) for Entity Matching
3.
[Large Language Models as Data Preprocessors](https://arxiv.org/abs/2308.16361)
## Performance on unseen tasks

_Few-shot is disabled for Jellyfish models._

## Prompts

We provide the prompts used for both fine-tuning and inference.
You can structure your data according to these prompts.
### System Message
```
You are an AI assistant that follows instruction extremely well.
User will give you a question. Your task is to answer as faithfully as you can.
```
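
For chat-tuned bases such as Meta-Llama-3-8B-Instruct, this system message is typically paired with each task prompt as a chat messages list. A minimal sketch under that assumption (`build_messages` is our illustrative helper, not part of the model release; adapt the messages list to your inference library's chat template):

```python
# Sketch: pairing the Jellyfish system message with a task prompt.
# The role/content messages format is the common chat convention.

SYSTEM_MESSAGE = (
    "You are an AI assistant that follows instruction extremely well.\n"
    "User will give you a question. Your task is to answer as faithfully as you can."
)

def build_messages(task_prompt: str) -> list[dict]:
    """Wrap a task prompt with the Jellyfish system message."""
    return [
        {"role": "system", "content": SYSTEM_MESSAGE},
        {"role": "user", "content": task_prompt},
    ]

messages = build_messages("Record: [name: Joe's Diner]. What city is it in?")
```
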

### For Entity Matching
```
You are tasked with determining whether two records listed below are the same based on the information provided.
Carefully compare the {attribute 1}, {attribute 2}... for each record before making your decision.
Note that missing values (N/A or \"nan\") should not be used as a basis for your decision.
Record A: [{attribute 1}: {attribute 1 value}, {attribute 2}: {attribute 2 value}, ...]
Record B: [{attribute 1}: {attribute 1 value}, {attribute 2}: {attribute 2 value}, ...]
Are record A and record B the same entity? Choose your answer from: [Yes, No].
```

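Filling this template from structured data is mechanical. A minimal sketch, assuming records arrive as Python dicts (`serialize` and `entity_matching_prompt` are our illustrative helper names):

```python
# Sketch: filling the entity-matching template from two record dicts.
# Helper names are illustrative; only the final prompt text matters.

def serialize(record: dict) -> str:
    """Render a record as [attr: value, ...] as the template expects."""
    return "[" + ", ".join(f"{k}: {v}" for k, v in record.items()) + "]"

def entity_matching_prompt(a: dict, b: dict) -> str:
    attrs = ", ".join(a.keys())
    return (
        "You are tasked with determining whether two records listed below "
        "are the same based on the information provided. "
        f"Carefully compare the {attrs} for each record before making your decision. "
        'Note that missing values (N/A or "nan") should not be used as a basis for your decision.\n'
        f"Record A: {serialize(a)}\n"
        f"Record B: {serialize(b)}\n"
        "Are record A and record B the same entity? Choose your answer from: [Yes, No]."
    )

prompt = entity_matching_prompt(
    {"title": "iPhone 12 64GB", "brand": "Apple"},
    {"title": "Apple iPhone 12 (64 GB)", "brand": "Apple"},
)
```
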
### For Data Imputation
```
You are presented with a {keyword} record that is missing a specific attribute: {attribute X}.
Your task is to deduce or infer the value of {attribute X} using the available information in the record.
You may be provided with fields like {attribute 1}, {attribute 2}, ... to help you in the inference.
Record: [{attribute 1}: {attribute 1 value}, {attribute 2}: {attribute 2 value}, ...]
Based on the provided record, what would you infer is the value for the missing attribute {attribute X}?
Answer only the value of {attribute X}.
```
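
The same record-serialization approach works here; because the model is instructed to answer only the missing value, its reply can be used directly as the imputation. A sketch with illustrative helper names:

```python
# Sketch: filling the data-imputation template for a record that is
# missing one attribute. Names are illustrative.

def imputation_prompt(keyword: str, record: dict, missing: str) -> str:
    fields = ", ".join(record.keys())
    serialized = "[" + ", ".join(f"{k}: {v}" for k, v in record.items()) + "]"
    return (
        f"You are presented with a {keyword} record that is missing a specific "
        f"attribute: {missing}. Your task is to deduce or infer the value of "
        f"{missing} using the available information in the record. "
        f"You may be provided with fields like {fields} to help you in the inference.\n"
        f"Record: {serialized}\n"
        f"Based on the provided record, what would you infer is the value for the "
        f"missing attribute {missing}? Answer only the value of {missing}."
    )

prompt = imputation_prompt(
    "restaurant",
    {"name": "Joe's Diner", "address": "110 Main St, Berkeley"},
    "city",
)
```
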
### For Error Detection
_There are two forms of the error detection task.
In the first form, a complete record row is provided, and the task is to determine if a specific value is erroneous.

```
Note: Missing values (N/A or \"nan\") are not considered errors.
Attribute for Verification: [{attribute X}: {attribute X value}]
Question: Is there an error in the value of {attribute X}? Choose your answer from: [Yes, No].
```
### For Schema Matching
```
Attribute B is [name: {value of name}, description: {value of description}].
Are Attribute A and Attribute B semantically equivalent? Choose your answer from: [Yes, No].
```
### For Column Type Annotation
We follow the prompt in [Column Type Annotation using ChatGPT](https://arxiv.org/abs/2306.00745) (text+inst+2-step).