Captain-1337 committed · verified · commit a1cd8d1 · parent 606866e

Update README.md

Files changed (1): README.md (+107 -12)

README.md CHANGED
## Predictive Power of Sentiment Analysis from Headlines for Crude Oil Prices

### Understanding and Exploiting Deep Learning-based Sentiment Analysis from News Headlines for Predicting Price Movements of WTI Crude Oil

CrudeBERT is a language model that emerged from my master's thesis and introduces a novel sentiment analysis method.
It was developed by fine-tuning [FinBERT: Financial Sentiment Analysis with Pre-trained Language Models](https://arxiv.org/pdf/1908.10063.pdf).

In essence, CrudeBERT is a pre-trained NLP model that analyzes the sentiment of news headlines relevant to the value of crude oil.
An award-winning paper derived from the thesis describes it in more detail: [CrudeBERT: Applying Economic Theory towards fine-tuning Transformer-based Sentiment Analysis Models to the Crude Oil Market](https://arxiv.org/abs/2305.06140)

![CrudeBERT comparison_white_2](https://user-images.githubusercontent.com/42164041/135273552-4a9c4457-70e4-48d0-ac97-169daefab79e.png)

Performing sentiment analysis on news regarding a specific asset requires domain adaptation.
Domain adaptation requires training data consisting of example texts and their associated sentiment polarity.
The experiments show that pre-trained deep learning-based sentiment analysis models can be further fine-tuned, and the conclusions of these experiments are as follows:

* Deep learning-based sentiment analysis models from the general financial domain, such as FinBERT, have little or no significance for the price development of crude oil. The reason is a lack of domain adaptation of the sentiment. Moreover, the polarity of sentiment cannot be generalized and is highly dependent on the properties of its target.

* The properties of crude oil prices are, according to the literature, determined by changes in supply and demand. News can convey information about these changes, can be broadly identified through query searches, and can serve as a foundation for creating a training dataset for domain adaptation. For this purpose, news headlines tend to be rich enough in content to provide insights into supply and demand changes, even when the number of headlines is significantly reduced to more reputable sources.

* Domain adaptation can be achieved to some extent by analyzing the properties of the target through a literature review and creating a corresponding training dataset to fine-tune the model. For example, considering supply and demand changes regarding crude oil seems to be a suitable component for domain adaptation.

To advance sentiment analysis applications in the domain of crude oil, this work presents CrudeBERT.
In general, sentiment analysis of crude oil headlines through CrudeBERT could be a viable source of insight into the price behavior of WTI crude oil.
However, further research is required to determine whether CrudeBERT is beneficial for predicting oil prices.
For this reason, the code and the thesis are publicly available on [GitHub](https://github.com/Captain-1337/Master-Thesis).

Here is a quick guide on how you can use CrudeBERT.

### Step one:
Download the two files (crude_bert_config.json and crude_bert_model.bin) from https://huggingface.co/Captain-1337/CrudeBERT/tree/main
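If you prefer to fetch the two files programmatically rather than through the browser, a minimal sketch using the `huggingface_hub` client (an extra dependency, not part of the original guide) could look like this:

```python
# Optional sketch: download the two files into the local Hugging Face cache.
# Assumes `pip install huggingface_hub`; the file names are the ones listed in step one.
from huggingface_hub import hf_hub_download

config_path = hf_hub_download(repo_id="Captain-1337/CrudeBERT", filename="crude_bert_config.json")
model_path = hf_hub_download(repo_id="Captain-1337/CrudeBERT", filename="crude_bert_model.bin")
```

The returned values are local file paths, so they can be reused as `config_path` and `model_path` in the step-two code instead of the relative paths shown there.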

### Step two:
Create a Jupyter Notebook in the same folder where the files are stored and include the code below:

#### Code:
```python
import torch
from transformers import AutoConfig, AutoModelForSequenceClassification, AutoTokenizer
import pandas as pd

# List of example headlines
headlines = [
    "Major Explosion, Fire at Oil Refinery in Southeast Philadelphia",
    "PETROLEOS confirms Gulf of Mexico oil platform accident",
    "CASUALTIES FEARED AT OIL ACCIDENT NEAR IRANS BORDER",
    "EIA Chief expects Global Oil Demand Growth 1 M B/D to 2011",
    "Turkey Jan-Oct Crude Imports +98.5% To 57.9M MT",
    "China’s crude oil imports up 78.30% in February 2019",
    "Russia Energy Agency: Sees Oil Output put Flat In 2005",
    "Malaysia Oil Production Steady This Year At 700,000 B/D",
    "ExxonMobil:Nigerian Oil Output Unaffected By Union Threat",
    "Yukos July Oil Output Flat On Mo, 1.73M B/D - Prime-Tass",
    "2nd UPDATE: Mexico’s Oil Output Unaffected By Hurricane",
    "UPDATE: Ecuador July Oil Exports Flat On Mo At 337,000 B/D",
    "China February Crude Imports -16.0% On Year",
    "Turkey May Crude Imports down 11.0% On Year",
    "Japan June Crude Oil Imports decrease 10.9% On Yr",
    "Iran’s Feb Oil Exports +20.9% On Mo at 1.56M B/D - Official",
    "Apache announces large petroleum discovery in Philadelphia",
    "Turkey finds oil near Syria, Iraq border",
]

# Paths to the two files downloaded in step one
config_path = './crude_bert_config.json'
model_path = './crude_bert_model.bin'

# Load the configuration
config = AutoConfig.from_pretrained(config_path)

# Create the model from the configuration
model = AutoModelForSequenceClassification.from_config(config)

# Load the model's state dictionary
state_dict = torch.load(model_path)

# Remove "bert.embeddings.position_ids" if present; newer transformers versions do not expect it
state_dict.pop("bert.embeddings.position_ids", None)

# Load the adjusted state dictionary into the model
model.load_state_dict(state_dict, strict=False)  # strict=False ignores non-critical mismatches

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

# Define the prediction function
def predict_to_df(texts, model, tokenizer):
    model.eval()
    data = []
    for text in texts:
        inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=64)
        with torch.no_grad():
            outputs = model(**inputs)
        logits = outputs.logits
        softmax_scores = torch.nn.functional.softmax(logits, dim=-1)
        pred_label_id = torch.argmax(softmax_scores, dim=-1).item()
        class_names = ['positive', 'negative', 'neutral']
        predicted_label = class_names[pred_label_id]
        data.append([text, predicted_label])
    df = pd.DataFrame(data, columns=["Headline", "Classification"])
    return df

# Create DataFrame of example headlines
example_headlines = pd.DataFrame(headlines, columns=["Headline"])

# Apply classification
result_df = predict_to_df(example_headlines['Headline'].tolist(), model, tokenizer)
result_df
```
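Since the tokenizer also accepts a list of texts, the per-headline loop in `predict_to_df` can be replaced by a single batched forward pass. The following is only a sketch of that alternative, reusing `model`, `tokenizer`, and `headlines` from the code above:

```python
# Sketch: classify all example headlines in one batched forward pass.
# Reuses `model`, `tokenizer`, and `headlines` defined in the step-two code.
import torch
import pandas as pd

inputs = tokenizer(headlines, return_tensors="pt", padding=True, truncation=True, max_length=64)
with torch.no_grad():
    logits = model(**inputs).logits
pred_ids = logits.softmax(dim=-1).argmax(dim=-1).tolist()
class_names = ['positive', 'negative', 'neutral']  # same label order as assumed in predict_to_df
batched_df = pd.DataFrame({"Headline": headlines,
                           "Classification": [class_names[i] for i in pred_ids]})
batched_df
```

Both variants should produce the same classifications; the batched call is simply faster when scoring many headlines at once.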

### Step three:
Execute the cells of the Jupyter Notebook.
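If you prefer the command line, the notebook can also be executed non-interactively with `jupyter nbconvert --to notebook --execute --inplace your_notebook.ipynb`, where `your_notebook.ipynb` is a placeholder for whatever you named the file.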

If you face any difficulties or have other questions, contact me here or on LinkedIn.

FYI: I took the example headlines from one of our recent publications:

![image.png](https://cdn-uploads.huggingface.co/production/uploads/6115fd952999876a45605b05/rFMJjRIxsNqPqinqiq5QY.png)

So, your classification output should reflect this as well.