Model Card: Spanish Binary Text Classifier using BETO
This model is a binary text classifier fine-tuned from BETO (a BERT-based model pre-trained on Spanish). It is designed to predict whether a given text prompt requires a search query, enabling applications such as intelligent search systems, content recommendation, and automated query handling in Spanish-language environments.
Model Details
- Model Name: text_classification_beto_tf_9_24_2024
- Architecture: BETO (BERT-base Spanish WWM)
- Language: Spanish
- Task: Binary Text Classification
- Objective: Given a prompt, the model predicts a binary label indicating whether the prompt requires a search query (1 = requires search, 0 = no search).
Intended Use and Applications
- Intelligent Search Systems: Enhance search engines by determining when a user prompt necessitates a search query, improving search relevance and user experience.
- Content Recommendation: Automatically categorize content requests to provide appropriate recommendations or resources.
- Automated Query Handling: Streamline customer support or chatbot systems by identifying when additional information retrieval is needed.
- Information Filtering: Sort or prioritize user inputs based on the necessity of executing a search, optimizing backend processing.
- Educational Tools: Assist in language learning applications by categorizing prompts for tailored responses or resources.
How It Was Trained
1. Data Source
- Dataset: The model was trained on a dataset sourced from an internal SQL Server database containing:
- Prompts (input_text): Text inputs requiring classification.
- Requires Search (requires_search): Binary labels indicating whether the prompt necessitates a search (1) or not (0).
- Data Selection: The top 5,000 (prompt, requires_search) pairs were selected where both prompt and requires_search fields are non-empty, so that every training example has both a text and a label.
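The selection rule above can be sketched in a few lines of Python. This is a minimal illustration, not the author's extraction script: the helper name `select_pairs` and the in-memory row format are assumptions, and "top 5,000" is read here as the first 5,000 qualifying rows in source order, since the original query and its ordering are not documented.

```python
def select_pairs(rows, limit=5000):
    """Keep the first `limit` rows whose prompt and label are both non-empty."""
    selected = []
    for row in rows:
        if len(selected) >= limit:
            break
        prompt = row.get("prompt")
        label = row.get("requires_search")
        # Skip rows with a missing or blank prompt.
        if prompt is None or not str(prompt).strip():
            continue
        # Skip rows with a missing label; note that 0 is a valid label.
        if label is None or str(label).strip() == "":
            continue
        selected.append((str(prompt).strip(), int(label)))
    return selected


# Hypothetical rows, for illustration only.
rows = [
    {"prompt": "¿Qué tiempo hará mañana en Madrid?", "requires_search": 1},
    {"prompt": "   ", "requires_search": 0},                         # blank prompt, dropped
    {"prompt": "Gracias por tu ayuda", "requires_search": 0},
    {"prompt": "¿Quién ganó el partido?", "requires_search": None},  # no label, dropped
]
pairs = select_pairs(rows)
print(pairs)
```

Checking the label against `None` and the empty string, rather than testing its truthiness, keeps rows labeled 0 in the dataset.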
2. Preprocessing
- Data Loading: Utilized pandas to load data from a CSV file containing the necessary columns.
- Data Splitting:
- Training Set: 80% of the data.
- Validation Set: 19% of the data.
- Test Set: 1% of the data.
- Tokenization:
- Employed BETO Tokenizer (BertTokenizer) suitable for Spanish text.
- Configured with:
- truncation=True: Truncate sequences longer than the maximum length.
- padding=True: Pad shorter sequences to the length of the longest sequence in the batch.
- max_length=512: Set maximum token length to 512 tokens.
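To make the tokenizer settings concrete, the following toy sketch applies truncation and padding to lists of integer token ids. It is not the tokenizer's implementation: the function name `pad_and_truncate` and the padding id `0` are hypothetical, and it assumes the Hugging Face convention that `padding=True` pads to the longest sequence in the batch. The real tokenizer also preserves special tokens when truncating, which this sketch ignores.

```python
def pad_and_truncate(batch, max_length=512, pad_id=0):
    """Truncate each sequence to max_length, then pad to the longest one left."""
    truncated = [seq[:max_length] for seq in batch]
    longest = max(len(seq) for seq in truncated)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in truncated]
    # 1 marks a real token, 0 marks padding.
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in truncated]
    return input_ids, attention_mask


# Toy ids with a small max_length so the effect is visible.
ids, mask = pad_and_truncate([[5, 6, 7, 8, 9], [5, 6]], max_length=4)
print(ids)   # [[5, 6, 7, 8], [5, 6, 0, 0]]
print(mask)  # [[1, 1, 1, 1], [1, 1, 0, 0]]
```

The attention mask is what lets the model ignore padded positions, so short prompts are not penalized for sharing a batch with long ones.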
3. Training Setup
- Base Model: dccuchile/bert-base-spanish-wwm-cased
- Framework: TensorFlow with Keras API.
- Model Architecture: TFBertForSequenceClassification adapted for binary classification (num_labels=2).
- Loss Function: SparseCategoricalCrossentropy with from_logits=True, which accepts integer labels and the model's raw logits directly.
- Optimizer: Adam optimizer with a learning rate of 5e-5 and weight decay of 0.01.
- Metrics: SparseCategoricalAccuracy to monitor classification accuracy during training.
- Training Parameters:
- Epochs: 4
- Batch Size: 16
- Early Stopping: Implemented via Keras callbacks to prevent overfitting by monitoring validation loss.
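For readers unfamiliar with the loss named above: for one example with logits z and integer label y, sparse categorical cross-entropy from logits is -log(softmax(z)[y]). The pure-Python sketch below works through that formula. It is an illustration rather than Keras's implementation, and the function name `sparse_ce_from_logits` and the example logits are hypothetical.

```python
import math


def sparse_ce_from_logits(logits, label):
    """Cross-entropy for one example: -log(softmax(logits)[label])."""
    # Subtract the max logit for numerical stability (log-sum-exp trick).
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


# Two classes: index 0 = no search, index 1 = requires search.
confident_right = sparse_ce_from_logits([-2.0, 3.0], label=1)
confident_wrong = sparse_ce_from_logits([-2.0, 3.0], label=0)
print(round(confident_right, 4), round(confident_wrong, 4))
```

The loss is close to zero when the logit of the true class dominates and grows roughly linearly with the logit gap when the model is confidently wrong. By default Keras averages this per-example value over the batch.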
4. Data Splits
- Training Set: 80%
- Validation Set: 19%
- Test Set: 1%
This split ensures that the model has ample data for learning while retaining sufficient data for unbiased evaluation.
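A minimal sketch of an 80/19/1 split is shown below. The shuffling, the seed, and the helper name `split_indices` are assumptions, since the original splitting code is not part of this card. If all 5,000 selected pairs are used, the split yields 4,000 training, 950 validation, and 50 test examples.

```python
import random


def split_indices(n, train_frac=0.80, val_frac=0.19, seed=42):
    """Shuffle range(n) and cut it into train/validation/test index lists."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    # round() avoids off-by-one sizes caused by floating-point products.
    n_train = round(n * train_frac)
    n_val = round(n * val_frac)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # whatever remains, about 1%
    return train, val, test


train, val, test = split_indices(5000)
print(len(train), len(val), len(test))  # 4000 950 50
```

With only about 50 test examples, a single misclassification moves test accuracy by two percentage points, so test metrics from a split like this carry wide uncertainty.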
Model Performance
- Training Metrics:
- Loss: Monitored using SparseCategoricalCrossentropy on both training and validation sets.
- Accuracy: Tracked to evaluate the proportion of correct predictions.
- Final Evaluation:
- Test Set Performance: The model's performance on the test set is logged as Test Loss and Test Sparse Categorical Accuracy.
- Performance Notes: Specific numerical results (e.g., exact loss and accuracy values) depend on the data distribution and training conditions. Users are encouraged to evaluate the model on their own datasets to assess performance in their specific contexts.
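When evaluating on your own data, accuracy alone can hide an imbalance between the two labels, so it may help to report precision and recall for the "requires search" class as well. The sketch below computes all three from lists of true and predicted labels. The helper name `binary_metrics` is hypothetical and the labels are made up for illustration, not results from this model.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, and recall for label 1 ('requires search')."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    correct = sum(1 for t, p in pairs if t == p)
    return {
        "accuracy": correct / len(pairs),
        # Fall back to 0.0 when label 1 is never predicted or never present.
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Made-up labels and predictions, for illustration only.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
print(binary_metrics(y_true, y_pred))
# {'accuracy': 0.75, 'precision': 0.75, 'recall': 0.75}
```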
Usage Example
Below is a Python example demonstrating how to use the fine-tuned BETO model for binary text classification in Spanish. Ensure you have installed the necessary libraries (transformers, tensorflow, pandas, etc.) and have the model saved in the specified output_dir.
import tensorflow as tf
from transformers import BertTokenizer, TFBertForSequenceClassification

# Load the trained model and tokenizer
model_dir = "./text_classification_beto_tf_9_24_2024"
tokenizer = BertTokenizer.from_pretrained(model_dir)
model = TFBertForSequenceClassification.from_pretrained(model_dir)

# Prepare the input
prompt = "¿Cómo puedo mejorar la eficiencia energética en mi hogar?"

# Tokenize the input
inputs = tokenizer(
    prompt,
    return_tensors="tf",
    max_length=512,
    truncation=True,
    padding=True
)

# Perform prediction
outputs = model(inputs)
logits = outputs.logits
predicted_class = tf.argmax(logits, axis=1).numpy()[0]

# Interpret the result
if predicted_class == 1:
    print("Requiere búsqueda: Sí")
else:
    print("Requiere búsqueda: No")
Output:
Requiere búsqueda: Sí
This script loads the fine-tuned model and tokenizer, tokenizes a sample prompt, performs a prediction, and interprets the result by indicating whether the prompt requires a search query.
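If a confidence score is more useful than a hard label, the two logits can be turned into a probability with a softmax. The sketch below does this in plain Python. The function name `search_probability` and the example logits are hypothetical; in the script above, one row of logits could be obtained with `logits.numpy()[0].tolist()`.

```python
import math


def search_probability(logits):
    """Softmax over [no_search, requires_search] logits; returns P(requires search)."""
    # Subtract the max logit before exponentiating for numerical stability.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    return exps[1] / sum(exps)


p = search_probability([-1.2, 2.3])  # hypothetical logits for one prompt
print(f"P(requires search) = {p:.3f}")
```

Taking the argmax of the logits, as the script does, is equivalent to thresholding this probability at 0.5.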
Limitations and Ethical Considerations
Bias and Fairness:
- The model's predictions are influenced by the training data. If the dataset contains biases (e.g., overrepresentation of certain topics), the model may inadvertently reflect those biases. Users should ensure the training data is balanced and representative of diverse prompts.
Data Privacy:
- Ensure that the data used for training does not contain sensitive or personal information unless appropriate consent has been obtained. Compliance with data protection regulations (e.g., GDPR) is essential.
Domain Specificity:
- The model was trained on a specific set of prompts and is expected to perform best on similar inputs. Its performance may degrade when applied to highly specialized or unfamiliar domains.
Misclassification Risks:
- Incorrect predictions (false positives or false negatives) can impact user experience. Implement additional checks or human-in-the-loop systems for critical applications.
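One possible shape for such a check is to act automatically only on confident predictions and send the rest to review. The sketch below assumes the model's output has already been converted to a probability that the prompt requires a search. The function name `route` and the thresholds 0.35 and 0.65 are hypothetical and would need tuning on validation data.

```python
def route(p_search, low=0.35, high=0.65):
    """Map P(requires search) to an action; the band in between goes to review."""
    if p_search >= high:
        return "search"
    if p_search <= low:
        return "no_search"
    return "human_review"


for p in (0.92, 0.50, 0.08):
    print(p, route(p))
```

Widening the band between `low` and `high` trades more manual review for fewer automatic mistakes.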
Responsible Usage:
- Prevent misuse by ensuring the model is employed ethically, avoiding applications that could harm individuals or groups. Regularly monitor and evaluate the model's outputs to maintain ethical standards.
Intended Users
- Developers building intelligent search and recommendation systems in Spanish.
- Content Managers seeking to automate the categorization of user prompts for content delivery.
- Researchers exploring text classification and natural language processing tasks in Spanish.
- Businesses integrating automated query handling or customer support systems.
- Educational Institutions developing tools for language learning and information retrieval.