Captain-1337 committed · verified · commit a1cd8d1 · parent 606866e

Update README.md

Files changed (1): README.md (+107 -12)

README.md CHANGED
## Predictive Power of Sentiment Analysis from Headlines for Crude Oil Prices

### Understanding and Exploiting Deep Learning-based Sentiment Analysis from News Headlines for Predicting Price Movements of WTI Crude Oil

CrudeBERT is a language model that emerged from my master's thesis and introduces a novel sentiment analysis method.
It was developed by fine-tuning [FinBERT: Financial Sentiment Analysis with Pre-trained Language Models](https://arxiv.org/pdf/1908.10063.pdf).

In essence, CrudeBERT is a pre-trained NLP model that analyzes the sentiment of news headlines relevant to the value of crude oil.
An award-winning paper derived from the thesis describes it in more detail: [CrudeBERT: Applying Economic Theory towards fine-tuning Transformer-based Sentiment Analysis Models to the Crude Oil Market](https://arxiv.org/abs/2305.06140)

![CrudeBERT comparison_white_2](https://user-images.githubusercontent.com/42164041/135273552-4a9c4457-70e4-48d0-ac97-169daefab79e.png)

Performing sentiment analysis on news regarding a specific asset requires domain adaptation.
Domain adaptation requires training data consisting of example texts and their associated sentiment polarity.
The experiments show that pre-trained deep learning-based sentiment analysis models can be further fine-tuned, and the conclusions of these experiments are as follows:

* Deep learning-based sentiment analysis models from the general financial domain, such as FinBERT, have little or no significance for the price development of crude oil. The reason is a lack of domain adaptation of the sentiment. Moreover, the polarity of sentiment cannot be generalized and is highly dependent on the properties of its target.

* The properties of crude oil prices are, according to the literature, determined by changes in supply and demand. News can convey information about these changes, can be broadly identified through query searches, and can serve as a foundation for creating a training dataset for domain adaptation. For this purpose, news headlines tend to be rich enough in content to provide insights into supply and demand changes, even when the number of headlines is significantly reduced to more reputable sources.

* Domain adaptation can be achieved to some extent by analyzing the properties of the target through a literature review and creating a corresponding training dataset to fine-tune the model. For example, considering supply and demand changes regarding crude oil seems to be a suitable component for domain adaptation.

To advance sentiment analysis applications in the domain of crude oil, this work presents CrudeBERT.
In general, sentiment analysis of crude oil headlines through CrudeBERT could be a viable source of insight into the price behavior of WTI crude oil.
However, further research is required to determine whether CrudeBERT is beneficial for predicting oil prices.
For this reason, the code and the thesis are publicly available on [GitHub](https://github.com/Captain-1337/Master-Thesis).

Here is a quick guide on how you can use CrudeBERT.

### Step one:
Download the two files (crude_bert_config.json and crude_bert_model.bin) from https://huggingface.co/Captain-1337/CrudeBERT/tree/main
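If you prefer to fetch the two files programmatically rather than through the browser, a minimal sketch using the `huggingface_hub` client (an extra dependency, not part of the original guide) could look like this:

```python
# Optional sketch: download the two files into the local Hugging Face cache.
# Assumes `pip install huggingface_hub`; the file names are the ones listed in step one.
from huggingface_hub import hf_hub_download

config_path = hf_hub_download(repo_id="Captain-1337/CrudeBERT", filename="crude_bert_config.json")
model_path = hf_hub_download(repo_id="Captain-1337/CrudeBERT", filename="crude_bert_model.bin")
```

The returned values are local file paths, so they can be reused as `config_path` and `model_path` in the step-two code instead of the relative paths shown there.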

### Step two:
Create a Jupyter Notebook in the same folder where the files are stored and include the code below:

#### Code:
```python
import torch
from transformers import AutoConfig, AutoModelForSequenceClassification, AutoTokenizer
import pandas as pd

# List of example headlines
headlines = [
    "Major Explosion, Fire at Oil Refinery in Southeast Philadelphia",
    "PETROLEOS confirms Gulf of Mexico oil platform accident",
    "CASUALTIES FEARED AT OIL ACCIDENT NEAR IRANS BORDER",
    "EIA Chief expects Global Oil Demand Growth 1 M B/D to 2011",
    "Turkey Jan-Oct Crude Imports +98.5% To 57.9M MT",
    "China’s crude oil imports up 78.30% in February 2019",
    "Russia Energy Agency: Sees Oil Output put Flat In 2005",
    "Malaysia Oil Production Steady This Year At 700,000 B/D",
    "ExxonMobil:Nigerian Oil Output Unaffected By Union Threat",
    "Yukos July Oil Output Flat On Mo, 1.73M B/D - Prime-Tass",
    "2nd UPDATE: Mexico’s Oil Output Unaffected By Hurricane",
    "UPDATE: Ecuador July Oil Exports Flat On Mo At 337,000 B/D",
    "China February Crude Imports -16.0% On Year",
    "Turkey May Crude Imports down 11.0% On Year",
    "Japan June Crude Oil Imports decrease 10.9% On Yr",
    "Iran’s Feb Oil Exports +20.9% On Mo at 1.56M B/D - Official",
    "Apache announces large petroleum discovery in Philadelphia",
    "Turkey finds oil near Syria, Iraq border",
]

# Paths to the two files downloaded in step one
config_path = './crude_bert_config.json'
model_path = './crude_bert_model.bin'

# Load the configuration
config = AutoConfig.from_pretrained(config_path)

# Create the model from the configuration
model = AutoModelForSequenceClassification.from_config(config)

# Load the model's state dictionary
state_dict = torch.load(model_path)

# Remove "bert.embeddings.position_ids" if present; newer transformers versions do not expect it
state_dict.pop("bert.embeddings.position_ids", None)

# Load the adjusted state dictionary into the model
model.load_state_dict(state_dict, strict=False)  # strict=False ignores non-critical mismatches

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

# Define the prediction function
def predict_to_df(texts, model, tokenizer):
    model.eval()
    data = []
    for text in texts:
        inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=64)
        with torch.no_grad():
            outputs = model(**inputs)
        logits = outputs.logits
        softmax_scores = torch.nn.functional.softmax(logits, dim=-1)
        pred_label_id = torch.argmax(softmax_scores, dim=-1).item()
        class_names = ['positive', 'negative', 'neutral']
        predicted_label = class_names[pred_label_id]
        data.append([text, predicted_label])
    df = pd.DataFrame(data, columns=["Headline", "Classification"])
    return df

# Create DataFrame of example headlines
example_headlines = pd.DataFrame(headlines, columns=["Headline"])

# Apply classification
result_df = predict_to_df(example_headlines['Headline'].tolist(), model, tokenizer)
result_df
```
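Since the tokenizer also accepts a list of texts, the per-headline loop in `predict_to_df` can be replaced by a single batched forward pass. The following is only a sketch of that alternative, reusing `model`, `tokenizer`, and `headlines` from the code above:

```python
# Sketch: classify all example headlines in one batched forward pass.
# Reuses `model`, `tokenizer`, and `headlines` defined in the step-two code.
import torch
import pandas as pd

inputs = tokenizer(headlines, return_tensors="pt", padding=True, truncation=True, max_length=64)
with torch.no_grad():
    logits = model(**inputs).logits
pred_ids = logits.softmax(dim=-1).argmax(dim=-1).tolist()
class_names = ['positive', 'negative', 'neutral']  # same label order as assumed in predict_to_df
batched_df = pd.DataFrame({"Headline": headlines,
                           "Classification": [class_names[i] for i in pred_ids]})
batched_df
```

Both variants should produce the same classifications; the batched call is simply faster when scoring many headlines at once.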

### Step three:
Execute the cells of the Jupyter Notebook.
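If you prefer the command line, the notebook can also be executed non-interactively with `jupyter nbconvert --to notebook --execute --inplace your_notebook.ipynb`, where `your_notebook.ipynb` is a placeholder for whatever you named the file.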

If you face any difficulties or have other questions, contact me here or on LinkedIn.

FYI: I took the example headlines from one of our recent publications:

![image.png](https://cdn-uploads.huggingface.co/production/uploads/6115fd952999876a45605b05/rFMJjRIxsNqPqinqiq5QY.png)

So, your classification output should reflect this as well.