Update README.md
-->
<img src="https://i.imgur.com/E1vqCIw.png" alt="PicToModel" width="330"/>

Jellyfish models with other sizes are available here:
[Jellyfish-7B](https://huggingface.co/NECOUDBFM/Jellyfish-7B)
[Jellyfish-13B](https://huggingface.co/NECOUDBFM/Jellyfish-13B)
## Model Details
Jellyfish-8B is a large language model equipped with 8 billion parameters.
We fine-tuned the [Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) model using a subset of the [Jellyfish-Instruct](https://huggingface.co/datasets/NECOUDBFM/Jellyfish-Instruct) dataset.
<!-- Jellyfish-7B vs GPT-3.5-turbo winning rate by GPT-4 evaluation is 56.36%. -->

If you find our work useful, please give us credit by citing:

| Entity Matching | Unseen | Walmart-Amazon | 86.89 | 87.00 | **90.27** | 79.19 | 82.40 | 84.91 | 85.24 | *89.42* |
| Avg | | | 80.44 | - | *84.17* | 72.58 | - | 82.74 | 81.55 | **86.02** |

_For GPT-3.5 and GPT-4, we used the few-shot approach on all datasets. For Jellyfish models, the few-shot approach is disabled on seen datasets and enabled on unseen datasets._
_We use accuracy as the metric for data imputation and the F1 score for the other tasks._
1.
[HoloDetect](https://arxiv.org/abs/1904.02285) for Error Detection (seen datasets)
[RAHA](https://dl.acm.org/doi/10.1145/3299869.3324956) for Error Detection (unseen datasets)
[IPM](https://ieeexplore.ieee.org/document/9458712) for Data Imputation
[SMAT](https://www.researchgate.net/publication/353920530_SMAT_An_Attention-Based_Deep_Learning_Solution_to_the_Automation_of_Schema_Matching) for Schema Matching
[Ditto](https://arxiv.org/abs/2004.00584) for Entity Matching
3.
[Large Language Models as Data Preprocessors](https://arxiv.org/abs/2308.16361)
## Performance on unseen tasks

_Few-shot is disabled for Jellyfish models._

## Prompts

We provide the prompts used for both fine-tuning and inference.
You can structure your data according to these prompts.
### System Message
```
You are an AI assistant that follows instruction extremely well.
User will give you a question. Your task is to answer as faithfully as you can.
```
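
For chat-tuned bases such as Meta-Llama-3-8B-Instruct, this system message is typically paired with each task prompt as a chat messages list. A minimal sketch under that assumption (`build_messages` is our illustrative helper, not part of the model release; adapt the messages list to your inference library's chat template):

```python
# Sketch: pairing the Jellyfish system message with a task prompt.
# The role/content messages format is the common chat convention.

SYSTEM_MESSAGE = (
    "You are an AI assistant that follows instruction extremely well.\n"
    "User will give you a question. Your task is to answer as faithfully as you can."
)

def build_messages(task_prompt: str) -> list[dict]:
    """Wrap a task prompt with the Jellyfish system message."""
    return [
        {"role": "system", "content": SYSTEM_MESSAGE},
        {"role": "user", "content": task_prompt},
    ]

messages = build_messages("Record: [name: Joe's Diner]. What city is it in?")
```
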

### For Entity Matching
```
You are tasked with determining whether two records listed below are the same based on the information provided.
Carefully compare the {attribute 1}, {attribute 2}... for each record before making your decision.
Note that missing values (N/A or \"nan\") should not be used as a basis for your decision.
Record A: [{attribute 1}: {attribute 1 value}, {attribute 2}: {attribute 2 value}, ...]
Record B: [{attribute 1}: {attribute 1 value}, {attribute 2}: {attribute 2 value}, ...]
Are record A and record B the same entity? Choose your answer from: [Yes, No].
```

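Filling this template from structured data is mechanical. A minimal sketch, assuming records arrive as Python dicts (`serialize` and `entity_matching_prompt` are our illustrative helper names):

```python
# Sketch: filling the entity-matching template from two record dicts.
# Helper names are illustrative; only the final prompt text matters.

def serialize(record: dict) -> str:
    """Render a record as [attr: value, ...] as the template expects."""
    return "[" + ", ".join(f"{k}: {v}" for k, v in record.items()) + "]"

def entity_matching_prompt(a: dict, b: dict) -> str:
    attrs = ", ".join(a.keys())
    return (
        "You are tasked with determining whether two records listed below "
        "are the same based on the information provided. "
        f"Carefully compare the {attrs} for each record before making your decision. "
        'Note that missing values (N/A or "nan") should not be used as a basis for your decision.\n'
        f"Record A: {serialize(a)}\n"
        f"Record B: {serialize(b)}\n"
        "Are record A and record B the same entity? Choose your answer from: [Yes, No]."
    )

prompt = entity_matching_prompt(
    {"title": "iPhone 12 64GB", "brand": "Apple"},
    {"title": "Apple iPhone 12 (64 GB)", "brand": "Apple"},
)
```
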
### For Data Imputation
```
You are presented with a {keyword} record that is missing a specific attribute: {attribute X}.
Your task is to deduce or infer the value of {attribute X} using the available information in the record.
You may be provided with fields like {attribute 1}, {attribute 2}, ... to help you in the inference.
Record: [{attribute 1}: {attribute 1 value}, {attribute 2}: {attribute 2 value}, ...]
Based on the provided record, what would you infer is the value for the missing attribute {attribute X}?
Answer only the value of {attribute X}.
```
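
The same record-serialization approach works here; because the model is instructed to answer only the missing value, its reply can be used directly as the imputation. A sketch with illustrative helper names:

```python
# Sketch: filling the data-imputation template for a record that is
# missing one attribute. Names are illustrative.

def imputation_prompt(keyword: str, record: dict, missing: str) -> str:
    fields = ", ".join(record.keys())
    serialized = "[" + ", ".join(f"{k}: {v}" for k, v in record.items()) + "]"
    return (
        f"You are presented with a {keyword} record that is missing a specific "
        f"attribute: {missing}. Your task is to deduce or infer the value of "
        f"{missing} using the available information in the record. "
        f"You may be provided with fields like {fields} to help you in the inference.\n"
        f"Record: {serialized}\n"
        f"Based on the provided record, what would you infer is the value for the "
        f"missing attribute {missing}? Answer only the value of {missing}."
    )

prompt = imputation_prompt(
    "restaurant",
    {"name": "Joe's Diner", "address": "110 Main St, Berkeley"},
    "city",
)
```
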
### For Error Detection
_There are two forms of the error detection task.
In the first form, a complete record row is provided, and the task is to determine if a specific value is erroneous.

```
Note: Missing values (N/A or \"nan\") are not considered errors.
Attribute for Verification: [{attribute X}: {attribute X value}]
Question: Is there an error in the value of {attribute X}? Choose your answer from: [Yes, No].
```
### For Schema Matching
```
Attribute B is [name: {value of name}, description: {value of description}].
Are Attribute A and Attribute B semantically equivalent? Choose your answer from: [Yes, No].
```
### For Column Type Annotation
We follow the prompt in [Column Type Annotation using ChatGPT](https://arxiv.org/abs/2306.00745) (text+inst+2-step).