chuanxiao1983 committed on
Commit 17e64aa · verified · 1 Parent(s): 7c29a62

Update README.md

Files changed (1):
  1. README.md +27 -39
README.md CHANGED
@@ -10,14 +10,13 @@ language:
  -->
  <img src="https://i.imgur.com/E1vqCIw.png" alt="PicToModel" width="330"/>

- Other versions of Jellyfish:
+ Jellyfish models with other sizes are available here:
  [Jellyfish-7B](https://huggingface.co/NECOUDBFM/Jellyfish-7B)
  [Jellyfish-13B](https://huggingface.co/NECOUDBFM/Jellyfish-13B)

  ## Model Details
  Jellyfish-8B is a large language model equipped with 8 billion parameters.
- We fine-tuned the [Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) model using the datasets pertinent to data preprocessing tasks.
- The training data is used a subset of the [Jellyfish-Instruct](https://huggingface.co/datasets/NECOUDBFM/Jellyfish-Instruct)
+ We fine-tuned the [Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) model using a subset of the [Jellyfish-Instruct](https://huggingface.co/datasets/NECOUDBFM/Jellyfish-Instruct) dataset.

  <!-- Jellyfish-7B vs GPT-3.5-turbo wining rate by GPT4 evaluation is 56.36%. -->

@@ -68,16 +67,16 @@ If you find our work useful, please give us credit by citing:
  | Entity Matching | Unseen | Walmart-Amazon | 86.89 | 87.00 | **90.27** | 79.19 | 82.40 | 84.91 | 85.24 | *89.42* |
  | Avg | | | 80.44 | - | *84.17* | 72.58 | - | 82.74 | 81.55 | **86.02** |

- _For GPT-3.5 and GPT-4, we used the few-shot approach on all datasets. However, for Jellyfish models, the few-shot approach is disabled on seen datasets and enabled on unseen datasets._
+ _For GPT-3.5 and GPT-4, we used the few-shot approach on all datasets. For Jellyfish models, the few-shot approach is disabled on seen datasets and enabled on unseen datasets._
  _Accuracy as the metric for data imputation and the F1 score for other tasks._

  1.
- [Ditto](https://arxiv.org/abs/2004.00584) for Entity Matching
- [SMAT](https://www.researchgate.net/publication/353920530_SMAT_An_Attention-Based_Deep_Learning_Solution_to_the_Automation_of_Schema_Matching) for Schema Matching
  [HoloDetect](https://arxiv.org/abs/1904.02285) for Error Detection seen datasets
  [RAHA](https://dl.acm.org/doi/10.1145/3299869.3324956) for Error Detection unseen datasets
- [IPM](https://ieeexplore.ieee.org/document/9458712) for Data Imputation
- 2.
+ [IPM](https://ieeexplore.ieee.org/document/9458712) for Data Imputation
+ [SMAT](https://www.researchgate.net/publication/353920530_SMAT_An_Attention-Based_Deep_Learning_Solution_to_the_Automation_of_Schema_Matching) for Schema Matching
+ [Ditto](https://arxiv.org/abs/2004.00584) for Entity Matching
+ 3.
  [Large Language Models as Data Preprocessors](https://arxiv.org/abs/2308.16361)

  ## Performance on unseen tasks
@@ -112,7 +111,7 @@ _Few-shot is disabled for Jellyfish models._

  ## Prompts

- We provide the prompts used for both the model's fine-tuning and inference.
+ We provide the prompts used for both fine-tuning and inference.
  You can structure your data according to these prompts.

  ### System Message
@@ -121,36 +120,6 @@ You are an AI assistant that follows instruction extremely well.
  User will give you a question. Your task is to answer as faithfully as you can.
  ```

- ### For Entity Matching
- ```
- You are tasked with determining whether two records listed below are the same based on the information provided.
- Carefully compare the {attribute 1}, {attribute 2}... for each record before making your decision.
- Note that missing values (N/A or \"nan\") should not be used as a basis for your decision.
- Record A: [{attribute 1}: {attribute 1 value}, {attribute 2}: {attribute 2 value}, ...]
- Record B: [{attribute 1}: {attribute 1 value}, {attribute 2}: {attribute 2 value}, ...]
- Are record A and record B the same entity? Choose your answer from: [Yes, No].
- ```
-
- ### For Data Imputation
- ```
- You are presented with a {keyword} record that is missing a specific attribute: {attribute X}.
- Your task is to deduce or infer the value of {attribute X} using the available information in the record.
- You may be provided with fields like {attribute 1}, {attribute 2}, ... to help you in the inference.
- Record: [{attribute 1}: {attribute 1 value}, {attribute 2}: {attribute 2 value}, ...]
- Based on the provided record, what would you infer is the value for the missing attribute {attribute X}?
- Answer only the value of {attribute X}.
- ```
-
- ### For Data Imputation
- ```
- You are presented with a {keyword} record that is missing a specific attribute: {attribute X}.
- Your task is to deduce or infer the value of {attribute X} using the available information in the record.
- You may be provided with fields like {attribute 1}, {attribute 2}, ... to help you in the inference.
- Record: [{attribute 1}: {attribute 1 value}, {attribute 2}: {attribute 2 value}, ...]
- Based on the provided record, what would you infer is the value for the missing attribute {attribute X}?
- Answer only the value of {attribute X}.
- ```
-
  ### For Error Detection
  _There are two forms of the error detection task.
  In the first form, a complete record row is provided, and the task is to determine if a specific value is erroneous.
@@ -172,6 +141,15 @@ Note: Missing values (N/A or \"nan\") are not considered errors.
  Attribute for Verification: [{attribute X}: {attribute X value}]
  Question: Is there an error in the value of {attribute X}? Choose your answer from: [Yes, No].
  ```
+ ### For Data Imputation
+ ```
+ You are presented with a {keyword} record that is missing a specific attribute: {attribute X}.
+ Your task is to deduce or infer the value of {attribute X} using the available information in the record.
+ You may be provided with fields like {attribute 1}, {attribute 2}, ... to help you in the inference.
+ Record: [{attribute 1}: {attribute 1 value}, {attribute 2}: {attribute 2 value}, ...]
+ Based on the provided record, what would you infer is the value for the missing attribute {attribute X}?
+ Answer only the value of {attribute X}.
+ ```

  ### For Schema Matching
  ```
@@ -183,6 +161,16 @@ Attribute B is [name: {value of name}, description: {value of description}].
  Are Attribute A and Attribute B semantically equivalent? Choose your answer from: [Yes, No].
  ```

+ ### For Entity Matching
+ ```
+ You are tasked with determining whether two records listed below are the same based on the information provided.
+ Carefully compare the {attribute 1}, {attribute 2}... for each record before making your decision.
+ Note that missing values (N/A or \"nan\") should not be used as a basis for your decision.
+ Record A: [{attribute 1}: {attribute 1 value}, {attribute 2}: {attribute 2 value}, ...]
+ Record B: [{attribute 1}: {attribute 1 value}, {attribute 2}: {attribute 2 value}, ...]
+ Are record A and record B the same entity? Choose your answer from: [Yes, No].
+ ```
+
  ### For Column Type Annotation

  We follow the prompt in [Column Type Annotation using ChatGPT](https://arxiv.org/abs/2306.00745) (text+inst+2-step).
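As an illustration of how the entity-matching prompt in this README can be filled in, the helper below assembles it from two record dicts. This is a minimal sketch, not code from the model card; the record contents (`name`, `brand`) are hypothetical.

```python
def build_entity_matching_prompt(record_a: dict, record_b: dict) -> str:
    """Fill the entity-matching template with two records.

    Attribute names are taken from record_a; both records are rendered
    in the "[attr: value, ...]" form the template uses.
    """
    def fmt(record: dict) -> str:
        return "[" + ", ".join(f"{k}: {v}" for k, v in record.items()) + "]"

    attributes = ", ".join(record_a.keys())
    return (
        "You are tasked with determining whether two records listed below "
        "are the same based on the information provided.\n"
        f"Carefully compare the {attributes} for each record before making your decision.\n"
        'Note that missing values (N/A or "nan") should not be used as a basis for your decision.\n'
        f"Record A: {fmt(record_a)}\n"
        f"Record B: {fmt(record_b)}\n"
        "Are record A and record B the same entity? Choose your answer from: [Yes, No]."
    )

# Hypothetical example records
prompt = build_entity_matching_prompt(
    {"name": "iPhone 13", "brand": "Apple"},
    {"name": "Apple iPhone 13", "brand": "Apple"},
)
```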
 
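At inference time, these task prompts are meant to be paired with the system message quoted in the README. The sketch below builds a standard chat-style message list for the data-imputation prompt; it is an assumed usage pattern, not the authors' inference code, and the restaurant record is hypothetical.

```python
SYSTEM_MESSAGE = (
    "You are an AI assistant that follows instruction extremely well. "
    "User will give you a question. Your task is to answer as faithfully as you can."
)

def build_imputation_messages(keyword: str, record: dict, missing_attr: str) -> list:
    """Build a chat message list (system + user) for the data-imputation prompt."""
    record_str = "[" + ", ".join(f"{k}: {v}" for k, v in record.items()) + "]"
    fields = ", ".join(record.keys())
    user_prompt = (
        f"You are presented with a {keyword} record that is missing a specific attribute: {missing_attr}.\n"
        f"Your task is to deduce or infer the value of {missing_attr} using the available information in the record.\n"
        f"You may be provided with fields like {fields} to help you in the inference.\n"
        f"Record: {record_str}\n"
        f"Based on the provided record, what would you infer is the value for the missing attribute {missing_attr}?\n"
        f"Answer only the value of {missing_attr}."
    )
    return [
        {"role": "system", "content": SYSTEM_MESSAGE},
        {"role": "user", "content": user_prompt},
    ]

# Hypothetical record; the message list can be fed to a chat template
messages = build_imputation_messages(
    "restaurant", {"name": "Pizzeria Roma", "address": "12 Main St"}, "city"
)
```

A list in this shape can then be passed to the tokenizer's chat templating (e.g. `apply_chat_template` in Hugging Face Transformers) before generation.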