Model Card: Spanish Binary Text Classifier using BETO

This model is a binary text classifier fine-tuned from BETO (a BERT-based model pre-trained on Spanish). It is designed to predict whether a given text prompt requires a search query, enabling applications such as intelligent search systems, content recommendation, and automated query handling in Spanish-language environments.

Model Details

Model Name: text_classification_beto_tf_9_24_2024
Architecture: BETO (BERT-base Spanish WWM)
Language: Spanish
Task: Binary Text Classification
- Objective: Given a prompt, the model predicts a binary label indicating whether the prompt requires a search query (1 = requires search, 0 = no search).

Intended Use and Applications

Intelligent Search Systems: Enhance search engines by determining when a user prompt necessitates a search query, improving search relevance and user experience.
Content Recommendation: Automatically categorize content requests to provide appropriate recommendations or resources.
Automated Query Handling: Streamline customer support or chatbot systems by identifying when additional information retrieval is needed.
Information Filtering: Sort or prioritize user inputs based on the necessity of executing a search, optimizing backend processing.
Educational Tools: Assist in language learning applications by categorizing prompts for tailored responses or resources.

How It Was Trained

1. Data Source

Dataset: The model was trained on a dataset sourced from an internal SQL Server database containing:
- Prompts (input_text): Text inputs requiring classification.
- Requires Search (requires_search): Binary labels indicating whether the prompt necessitates a search (1) or not (0).
Data Selection: The top 5,000 (prompt, requires_search) pairs were selected, keeping only rows where both the prompt and requires_search fields are non-empty, so that every training example has usable text and a label.
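The selection rule above can be sketched in a few lines. This is a minimal illustration, not the original pipeline: it assumes the rows have already been exported to CSV with the column names input_text and requires_search, and it uses only the Python standard library so it runs without a database. SAMPLE_CSV and select_pairs are hypothetical names introduced for the example.

```python
import csv
import io

# Hypothetical sample standing in for the exported CSV file.
SAMPLE_CSV = """input_text,requires_search
¿Cuál es la capital de Perú?,1
Hola,0
,1
Gracias por tu ayuda,
¿Qué tiempo hará mañana en Madrid?,1
"""

def select_pairs(csv_file, limit=5000):
    """Return up to `limit` (text, label) pairs whose fields are both non-empty."""
    pairs = []
    for row in csv.DictReader(csv_file):
        text = (row.get("input_text") or "").strip()
        label = (row.get("requires_search") or "").strip()
        # Keep the row only if it has text and a valid binary label.
        if text and label in {"0", "1"}:
            pairs.append((text, int(label)))
            if len(pairs) == limit:
                break
    return pairs

pairs = select_pairs(io.StringIO(SAMPLE_CSV))
print(len(pairs))  # 3: the rows with a missing prompt or label are dropped
```

With a real export, the io.StringIO object would be replaced by an open file handle.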

2. Preprocessing

Data Loading: pandas was used to load the data from a CSV file containing the prompt and label columns.
Data Splitting:
- Training Set: 80% of the data.
- Validation Set: 19% of the data.
- Test Set: 1% of the data.
Tokenization:
- Employed BETO Tokenizer (BertTokenizer) suitable for Spanish text.
- Configured with:
  - truncation=True: Truncate sequences longer than the maximum length.
  - padding=True: Pad shorter sequences to the length of the longest sequence in the batch.
  - max_length=512: Set maximum token length to 512 tokens.
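To make the effect of these settings concrete, here is a toy sketch on plain lists of token IDs. In Hugging Face tokenizers, padding=True pads to the longest sequence in the batch, and truncation=True with max_length caps each sequence. The function pad_and_truncate is hypothetical, pad_id=0 is a placeholder (the real padding ID comes from the tokenizer), and the sketch ignores special tokens such as [CLS] and [SEP], which the real tokenizer preserves when truncating.

```python
def pad_and_truncate(batch, max_length=512, pad_id=0):
    """Truncate each sequence to max_length, then pad to the longest one left."""
    truncated = [seq[:max_length] for seq in batch]
    longest = max(len(seq) for seq in truncated)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in truncated]
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in truncated]
    return input_ids, attention_mask

# A small max_length keeps the example readable.
ids, mask = pad_and_truncate([[5, 6, 7, 8, 9], [5, 6]], max_length=4)
print(ids)   # [[5, 6, 7, 8], [5, 6, 0, 0]]
print(mask)  # [[1, 1, 1, 1], [1, 1, 0, 0]]
```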

3. Training Setup

Base Model: dccuchile/bert-base-spanish-wwm-cased
Framework: TensorFlow with Keras API.
Model Architecture: TFBertForSequenceClassification adapted for binary classification (num_labels=2).
Loss Function: SparseCategoricalCrossentropy with from_logits=True to handle integer labels directly.
Optimizer: Adam optimizer with a learning rate of 5e-5 and weight decay of 0.01.
Metrics: SparseCategoricalAccuracy to monitor classification accuracy during training.
Training Parameters:
- Epochs: 4
- Batch Size: 16
- Early Stopping: Implemented via Keras callbacks to prevent overfitting by monitoring validation loss.
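The early stopping rule can be illustrated without any framework. The sketch below is a simplified stand-in for what a Keras EarlyStopping callback does when it monitors validation loss, not its exact implementation. The patience value and the loss figures are assumptions chosen for illustration; the settings actually used in training are not stated here.

```python
def epochs_run(val_losses, patience=1, min_delta=0.0):
    """Return how many epochs run before early stopping halts training."""
    best = float("inf")
    wait = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return len(val_losses)

# Hypothetical validation losses over the 4 configured epochs.
print(epochs_run([0.60, 0.45, 0.47, 0.50], patience=1))  # 3
print(epochs_run([0.60, 0.50, 0.40, 0.30], patience=1))  # 4
```

In the first run the loss stops improving after epoch 2, so training halts at epoch 3; in the second it improves every epoch and all 4 epochs complete.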

4. Data Splits

Training Set: 80%
Validation Set: 19%
Test Set: 1%

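This split gives the model most of the data for learning while holding out separate validation and test sets for evaluation. With 5,000 examples, the 1% test set contains only about 50 prompts, so test metrics should be read as a rough estimate.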
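As a rough sketch of how an 80/19/1 split of 5,000 examples could be produced, the example below shuffles example indices and slices them. The function split_indices and the seed value are assumptions for illustration; the original splitting code is not shown in this card.

```python
import random

def split_indices(n_examples, train_frac=0.80, val_frac=0.19, seed=42):
    """Shuffle indices and cut them into train, validation, and test parts."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)
    n_train = round(n_examples * train_frac)
    n_val = round(n_examples * val_frac)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # whatever remains (about 1%)
    return train, val, test

train_idx, val_idx, test_idx = split_indices(5000)
print(len(train_idx), len(val_idx), len(test_idx))  # 4000 950 50
```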

Model Performance

Training Metrics:
- Loss: Monitored using SparseCategoricalCrossentropy on both training and validation sets.
- Accuracy: Tracked to evaluate the proportion of correct predictions.
Final Evaluation:
- Test Set Performance: The model's performance on the test set is logged as Test Loss and Test Sparse Categorical Accuracy.
- Performance Notes: No numerical results are reported here; loss and accuracy depend on the data distribution and training conditions. Users are encouraged to evaluate the model on their own datasets to assess performance in their specific contexts.
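For readers who want to see what the two logged quantities measure, the following sketch recomputes them in plain Python from raw logits and integer labels. The functions are illustrative re-implementations, not the Keras classes themselves, and the logits and labels are made-up values.

```python
import math

def sparse_categorical_crossentropy(logits, labels):
    """Mean cross-entropy from raw logits and integer class labels."""
    total = 0.0
    for row, label in zip(logits, labels):
        peak = max(row)  # subtract the max for numerical stability
        log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_sum_exp - row[label]
    return total / len(labels)

def sparse_categorical_accuracy(logits, labels):
    """Fraction of rows whose highest logit matches the label."""
    correct = sum(1 for row, label in zip(logits, labels)
                  if row.index(max(row)) == label)
    return correct / len(labels)

# Made-up logits for three prompts; the third is misclassified.
logits = [[2.0, 0.0], [0.0, 1.0], [1.0, 3.0]]
labels = [0, 1, 0]
print(sparse_categorical_crossentropy(logits, labels))
print(sparse_categorical_accuracy(logits, labels))
```

Because the loss works on logits directly, no softmax layer is needed on the model output, which is what from_logits=True signals.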

Usage Example

Below is a Python example demonstrating how to use the fine-tuned BETO model for binary text classification in Spanish. Ensure you have installed the necessary libraries (transformers and tensorflow) and that the model is saved in the directory assigned to model_dir.

import tensorflow as tf
from transformers import BertTokenizer, TFBertForSequenceClassification

# Load the trained model and tokenizer
model_dir = "./text_classification_beto_tf_9_24_2024"
tokenizer = BertTokenizer.from_pretrained(model_dir)
model = TFBertForSequenceClassification.from_pretrained(model_dir)

# Prepare the input (in English: "How can I improve energy efficiency in my home?")
prompt = "¿Cómo puedo mejorar la eficiencia energética en mi hogar?"

# Tokenize the input
inputs = tokenizer(
    prompt,
    return_tensors="tf",
    max_length=512,
    truncation=True,
    padding=True
)

# Perform prediction
outputs = model(inputs)
logits = outputs.logits
predicted_class = tf.argmax(logits, axis=1).numpy()[0]

# Interpret the result
if predicted_class == 1:
    print("Requiere búsqueda: Sí")
else:
    print("Requiere búsqueda: No")

Output:

Requiere búsqueda: Sí

This script loads the fine-tuned model and tokenizer, tokenizes a sample prompt, performs a prediction, and interprets the result by indicating whether the prompt requires a search query.

Limitations and Ethical Considerations

Bias and Fairness:
- The model's predictions are influenced by the training data. If the dataset contains biases (e.g., overrepresentation of certain topics), the model may reflect those biases. Anyone retraining or fine-tuning the model should check that the training data is balanced and representative of diverse prompts.
Data Privacy:
- Ensure that the data used for training does not contain sensitive or personal information unless appropriate consent has been obtained. Compliance with data protection regulations (e.g., GDPR) is essential.
Domain Specificity:
- The model was trained on a specific set of prompts and is likely to perform best on similar inputs. Its performance may degrade when applied to highly specialized or unfamiliar domains.
Misclassification Risks:
- Incorrect predictions (false positives or false negatives) can impact user experience. Implement additional checks or human-in-the-loop systems for critical applications.
Responsible Usage:
- Prevent misuse by ensuring the model is employed ethically, avoiding applications that could harm individuals or groups. Regularly monitor and evaluate the model's outputs to maintain ethical standards.
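
One possible form of the human-in-the-loop check mentioned under Misclassification Risks is to flag low-confidence predictions for review. The sketch below converts a pair of logits to probabilities and applies a threshold. The function route_prediction and the 0.8 threshold are assumptions for illustration; a suitable threshold would need to be tuned on validation data.

```python
import math

def route_prediction(logits, threshold=0.8):
    """Return (label, confidence, needs_review) for one pair of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # stable softmax
    total = sum(exps)
    probs = [e / total for e in exps]
    label = probs.index(max(probs))
    confidence = probs[label]
    # Predictions below the threshold are sent to a human reviewer.
    return label, confidence, confidence < threshold

print(route_prediction([0.2, 0.4]))   # close logits: flagged for review
print(route_prediction([-2.0, 3.0]))  # clear margin: accepted automatically
```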

Intended Users

Developers building intelligent search and recommendation systems in Spanish.
Content Managers seeking to automate the categorization of user prompts for content delivery.
Researchers exploring text classification and natural language processing tasks in Spanish.
Businesses integrating automated query handling or customer support systems.
Educational Institutions developing tools for language learning and information retrieval.

profelyndoncarlson/text_classification_beto_tf_9_24_2024