---
language:
- en
license: apache-2.0
library_name: pytorch
tags:
- text-classification
- fiction-detection
- byte-level
- cnn
datasets:
- HuggingFaceTB/cosmopedia
- BEE-spoke-data/gutenberg-en-v1-clean
- common-pile/arxiv_abstracts
- ccdv/cnn_dailymail
metrics:
- accuracy
- f1
- roc_auc
model-index:
- name: TinyByteCNN-Fiction-Classifier
  results:
  - task:
      type: text-classification
      name: Fiction vs Non-Fiction Classification
    dataset:
      name: Custom Fiction/Non-Fiction Dataset (85k samples)
      type: custom
      split: validation
    metrics:
    - type: accuracy
      value: 99.91
      name: Validation Accuracy
    - type: f1
      value: 99.91
      name: F1 Score
    - type: roc_auc
      value: 99.99
      name: ROC AUC
  - task:
      type: text-classification
      name: Curated Test Samples
    dataset:
      name: 18 Diverse Fiction/Non-Fiction Samples
      type: curated
      split: test
    metrics:
    - type: accuracy
      value: 100.0
      name: Test Accuracy
    - type: confidence_avg
      value: 96.3
      name: Average Confidence
---
# TinyByteCNN Fiction vs Non-Fiction Detector
A lightweight, byte-level CNN model for detecting fiction vs non-fiction text with 99.91% validation accuracy.
## Model Description
TinyByteCNN is a byte-level convolutional neural network for binary classification of fiction vs non-fiction text. The model operates directly on UTF-8 byte sequences, so it needs no tokenizer or vocabulary and accepts any UTF-8 input, though it was trained primarily on English text (see Limitations).
### Architecture Highlights
- **Model Size**: 942,313 parameters (~3.6MB)
- **Input**: Raw UTF-8 bytes (max 4096 bytes ≈ 512 words)
- **Architecture**: Depthwise-separable 1D CNN with Squeeze-Excitation
- **Receptive Field**: ~2.8KB covering multi-paragraph context
- **Key Features**:
- 4 stages with progressive downsampling (32x reduction)
- Dilated convolutions for larger receptive field
- SE attention modules for channel recalibration
- Global average + max pooling head
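To make the size figures above concrete, the following sketch compares the parameter count of a standard 1D convolution with a depthwise-separable one, and converts the stated 942,313 parameters into a storage size. The channel width (128) and kernel size (7) are illustrative assumptions, not values taken from this model, and the size estimate assumes 32-bit float weights.

```python
def conv1d_params(c_in: int, c_out: int, k: int, bias: bool = True) -> int:
    """Parameters in a standard 1D convolution."""
    return c_in * c_out * k + (c_out if bias else 0)


def separable_conv1d_params(c_in: int, c_out: int, k: int, bias: bool = True) -> int:
    """Parameters in a depthwise conv (one filter per channel) plus a 1x1 pointwise conv."""
    depthwise = c_in * k + (c_in if bias else 0)
    pointwise = c_in * c_out + (c_out if bias else 0)
    return depthwise + pointwise


def size_mib(n_params: int, bytes_per_param: int = 4) -> float:
    """Storage size in MiB, assuming fixed-width weights (4 bytes = float32)."""
    return n_params * bytes_per_param / 2**20


# Hypothetical layer dimensions, chosen only for illustration.
standard = conv1d_params(128, 128, 7)
separable = separable_conv1d_params(128, 128, 7)
print(f"standard: {standard}, separable: {separable}")
print(f"model size: {size_mib(942_313):.2f} MiB")
```

With these assumed dimensions the separable layer needs roughly 6.5 times fewer parameters than the standard one, and 942,313 float32 weights come to about 3.59 MiB, consistent with the ~3.6MB figure above.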
## Intended Uses & Limitations
### Intended Uses
- Automated content categorization for libraries and archives
- Fiction/non-fiction filtering for content platforms
- Educational content classification
- Writing style analysis
- Content recommendation systems
### Limitations
- **Personal narratives**: May misclassify personal journal entries and memoirs as fiction (observed ~97% fiction confidence on journal entries)
- **Mixed content**: Struggles with creative non-fiction and narrative journalism
- **Length**: Optimized for 512-4096 byte inputs; longer texts should be chunked
- **Language**: Primarily trained on English text
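For inputs longer than 4096 bytes, one possible approach is to split the UTF-8 bytes into consecutive windows, score each window, and average the results. The sketch below assumes a hypothetical `predict_fn` callable that maps one chunk of bytes to a Non-Fiction probability. The averaging rule is an assumption made for illustration, not something this card prescribes.

```python
from typing import Callable, List

MAX_BYTES = 4096  # maximum input length stated in this card


def chunk_bytes(text: str, chunk_size: int = MAX_BYTES) -> List[bytes]:
    """Split text into consecutive UTF-8 byte windows of at most chunk_size bytes."""
    data = text.encode("utf-8")
    if not data:
        return [b""]
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def classify_long_text(text: str, predict_fn: Callable[[bytes], float]) -> dict:
    """Average per-chunk Non-Fiction probabilities (assumed aggregation rule)."""
    probs = [predict_fn(chunk) for chunk in chunk_bytes(text)]
    mean_prob = sum(probs) / len(probs)
    label = "Non-Fiction" if mean_prob > 0.5 else "Fiction"
    return {"label": label, "nonfiction_prob": mean_prob, "num_chunks": len(probs)}


# Stand-in scorer for demonstration only; replace with real model inference.
result = classify_long_text("word " * 3000, lambda chunk: 0.9)
print(result)
```

Windows are cut at byte offsets, so a multi-byte character may be split across two chunks. For a byte-level model this is usually harmless, but overlapping windows or paragraph-aligned splits may work better for borderline texts.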
## Training Data
The model was trained on a diverse dataset of 85,000 samples (60k train, 15k validation, 10k test) drawn from:
### Fiction Sources (50%)
1. **Cosmopedia Stories** (HuggingFaceTB/cosmopedia)
- Synthetic fiction stories
- License: Apache 2.0
2. **Project Gutenberg** (BEE-spoke-data/gutenberg-en-v1-clean)
- Classic literature
- License: Public Domain
3. **Reddit WritingPrompts**
- Community-generated creative writing
- Via synthetic alternatives
### Non-Fiction Sources (50%)
1. **Cosmopedia Educational** (HuggingFaceTB/cosmopedia)
- Textbooks, WikiHow, educational blogs
- License: Apache 2.0
2. **Scientific Papers** (common-pile/arxiv_abstracts)
- Academic abstracts and introductions
- License: Various (permissive)
3. **News Articles** (ccdv/cnn_dailymail)
- CNN and Daily Mail articles
- License: Apache 2.0
## Training Procedure
### Preprocessing
- Unicode NFC normalization
- Whitespace normalization (max 2 consecutive spaces)
- UTF-8 byte encoding
- Padding/truncation to 4096 bytes
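The four steps above could be sketched as follows. This is not the repository's `preprocess_text`: the function name `preprocess_sketch`, the padding value 0, and the reading of "max 2 consecutive spaces" as collapsing longer runs of spaces down to two are all assumptions. The sketch returns a plain list of byte values rather than a tensor.

```python
import re
import unicodedata
from typing import List

MAX_LEN = 4096  # fixed input length stated in this card


def preprocess_sketch(text: str, max_len: int = MAX_LEN, pad_value: int = 0) -> List[int]:
    """Normalize text and convert it to a fixed-length list of byte values."""
    # 1. Unicode NFC normalization (composes characters such as e + combining accent)
    text = unicodedata.normalize("NFC", text)
    # 2. Whitespace normalization: collapse runs of 3+ spaces to 2 (assumed rule)
    text = re.sub(r" {3,}", "  ", text)
    # 3. UTF-8 byte encoding, truncated to max_len bytes
    ids = list(text.encode("utf-8")[:max_len])
    # 4. Right-pad with pad_value (assumed to be 0) up to max_len
    ids.extend([pad_value] * (max_len - len(ids)))
    return ids


ids = preprocess_sketch("Once upon a time,     there was a lighthouse.")
print(len(ids), ids[:10])
```

Truncation here is done on bytes, so the final character of a long input may be cut mid-sequence. Check the repository's own preprocessing for the exact padding value and whitespace rule before relying on this sketch for inference.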
### Training Hyperparameters
- **Optimizer**: AdamW (lr=3e-3, betas=(0.9, 0.98), weight_decay=0.01)
- **Schedule**: Cosine decay with 5% warmup
- **Batch Size**: 32
- **Epochs**: 10
- **Label Smoothing**: 0.05
- **Gradient Clipping**: 1.0
- **Device**: Apple M-series (MPS)
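As a rough illustration of the schedule above, the sketch below computes the learning rate at a given optimizer step for warmup over the first 5% of steps followed by cosine decay. The step counts follow from the stated 60k training samples, batch size 32, and 10 epochs. Linear warmup, decay to exactly zero, and the binary label-smoothing convention in `smooth_target` are assumptions, since the card does not specify them.

```python
import math

TRAIN_SAMPLES = 60_000
BATCH_SIZE = 32
EPOCHS = 10
BASE_LR = 3e-3
WARMUP_FRAC = 0.05

STEPS_PER_EPOCH = math.ceil(TRAIN_SAMPLES / BATCH_SIZE)  # 1875 steps per epoch
TOTAL_STEPS = STEPS_PER_EPOCH * EPOCHS                   # 18750 steps in total
WARMUP_STEPS = int(TOTAL_STEPS * WARMUP_FRAC)            # 937 warmup steps


def lr_at(step: int) -> float:
    """Learning rate at a step: linear warmup, then cosine decay to zero (assumed shape)."""
    if step < WARMUP_STEPS:
        return BASE_LR * (step + 1) / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return 0.5 * BASE_LR * (1.0 + math.cos(math.pi * min(progress, 1.0)))


def smooth_target(y: int, eps: float = 0.05) -> float:
    """Binary label smoothing under the assumed convention y*(1-eps) + eps/2."""
    return y * (1.0 - eps) + eps / 2.0


for step in (0, WARMUP_STEPS, TOTAL_STEPS // 2, TOTAL_STEPS):
    print(step, f"{lr_at(step):.6f}")
print(smooth_target(1), smooth_target(0))
```

Under this smoothing convention the training targets are 0.975 and 0.025 rather than 1 and 0. That may be why the reported confidences cluster just below 98% instead of approaching 100%, although this is an interpretation and not something the card states.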
## Evaluation Results
### Validation Set (15,000 samples)
| Metric | Value |
|--------|-------|
| Accuracy | 99.91% |
| F1 Score | 0.9991 |
| ROC AUC | 0.9999 |
| Loss | 0.1194 |
### Detailed Test Results on 18 Curated Samples
The model classified all 18 samples correctly (**100% accuracy**), but confidence varies by category:
| Category | Sample Title/Type | True Label | Predicted | Confidence | Analysis |
|----------|------------------|------------|-----------|------------|----------|
| **FICTION - General** | | | | | |
| Literary | Lighthouse Keeper Storm | Fiction | Fiction | **79.8%** | ⚠️ **Lowest confidence** - realistic setting |
| Sci-Fi | Time Travel Bedroom | Fiction | Fiction | 97.2% | ✅ Clear fantastical elements |
| Mystery | Detective Rose Case | Fiction | Fiction | 97.3% | ✅ Strong narrative structure |
| **FICTION - Children's** | | | | | |
| Animal Tale | Benny's Carrot Problem | Fiction | Fiction | 97.1% | ✅ Clear storytelling markers |
| Fantasy | Princess Luna's Paintings | Fiction | Fiction | 97.3% | ✅ Magical elements detected |
| Magical | Tommy's Dream Sprites | Fiction | Fiction | **96.0%** | ⚠️ Lower confidence - whimsical tone |
| **FICTION - Fantasy** | | | | | |
| Epic Fantasy | Shadowgate & Void Lords | Fiction | Fiction | 97.4% | ✅ High fantasy vocabulary |
| Magic System | Moonlight Weaver Elara | Fiction | Fiction | 96.8% | ✅ Complex world-building |
| Urban Fantasy | Dragon Memory Markets | Fiction | Fiction | 97.3% | ✅ Supernatural commerce |
| **NON-FICTION - Academic** | | | | | |
| Biology | Photosynthesis Process | Non-Fiction | Non-Fiction | 97.8% | ✅ Technical terminology |
| Mathematics | Calculus Theorem | Non-Fiction | Non-Fiction | 97.8% | ✅ Mathematical concepts |
| Economics | Market Equilibrium | Non-Fiction | Non-Fiction | 97.9% | ✅ Economic theory |
| **NON-FICTION - News** | | | | | |
| Financial | Federal Reserve Decision | Non-Fiction | Non-Fiction | 97.8% | ✅ Factual reporting style |
| Local Gov | Homeless Crisis Plan | Non-Fiction | Non-Fiction | 97.9% | ✅ Policy announcement format |
| Science | Exoplanet Discovery | Non-Fiction | Non-Fiction | 97.9% | ✅ Research reporting |
| **NON-FICTION - Journals** | | | | | |
| Financial | Wall Street Journal Market | Non-Fiction | Non-Fiction | 97.7% | ✅ Professional journalism |
| Scientific | Nature Research Report | Non-Fiction | Non-Fiction | 97.7% | ✅ Academic publication style |
| Personal | Kyoto Travel Log | Non-Fiction | Non-Fiction | **97.5%** | ⚠️ Slightly lower - personal narrative |
### Key Insights:
- **Weakest Performance**: Realistic literary fiction (79.8% confidence) - the lighthouse story lacks obvious fantastical elements
- **Strongest Performance**: Academic/news content (97.8-97.9% confidence) - clear technical/factual language
- **Edge Cases**: Personal narratives and whimsical children's stories show slightly lower confidence
- **Perfect Accuracy**: 18/18 samples correctly classified despite confidence variations
### Results by Source Type (12 of the 18 Samples)
#### ✅ All 12 Listed Samples Correctly Classified
**Fiction Samples (3/3):**
1. Lighthouse keeper narrative → Fiction (79.8% conf)
2. Time travel story → Fiction (97.2% conf)
3. Detective mystery → Fiction (97.3% conf)
**Textbook Samples (3/3):**
1. Photosynthesis (Biology) → Non-Fiction (97.8% conf)
2. Fundamental theorem (Calculus) → Non-Fiction (97.8% conf)
3. Market equilibrium (Economics) → Non-Fiction (97.9% conf)
**News Articles (3/3):**
1. Federal Reserve decision → Non-Fiction (97.8% conf)
2. City homeless initiative → Non-Fiction (97.9% conf)
3. Exoplanet discovery → Non-Fiction (97.9% conf)
**Journal Articles (3/3):**
1. Wall Street Journal (Financial) → Non-Fiction (97.7% conf)
2. Nature Scientific Reports → Non-Fiction (97.7% conf)
3. Personal Travel Journal → Non-Fiction (97.5% conf)
## How to Use
### PyTorch
```python
import torch
from model import TinyByteCNN, preprocess_text

# Load model ("username/..." is a placeholder repo id)
model = TinyByteCNN.from_pretrained("username/tinybytecnn-fiction-detector")
model.eval()

# Prepare text
text = "Your text here..."
input_bytes = preprocess_text(text)  # Returns tensor of shape [1, 4096]

# Predict: the sigmoid output is the probability of the Non-Fiction class
with torch.no_grad():
    logits = model(input_bytes)
    probability = torch.sigmoid(logits).item()

if probability > 0.5:
    print(f"Non-Fiction (confidence: {probability:.1%})")
else:
    print(f"Fiction (confidence: {1 - probability:.1%})")
```
### Batch Processing
```python
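def classify_texts(texts, model, batch_size=32):
    """Classify a list of strings in mini-batches."""
    results = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        # preprocess_text returns [1, 4096] per text, so concatenate along the
        # batch dimension to get [B, 4096] (torch.stack would give [B, 1, 4096])
        inputs = torch.cat([preprocess_text(t) for t in batch], dim=0)
        with torch.no_grad():
            logits = model(inputs)
            # Flatten to one scalar Non-Fiction probability per text
            probs = torch.sigmoid(logits).flatten().tolist()
        for text, prob in zip(batch, probs):
            is_nonfiction = prob > 0.5
            results.append({
                'text': text[:100] + ('...' if len(text) > 100 else ''),
                'class': 'Non-Fiction' if is_nonfiction else 'Fiction',
                'confidence': prob if is_nonfiction else 1 - prob,
            })
    return results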
```
## Training Infrastructure
- **Hardware**: Apple M-series with 8GB MPS memory limit
- **Training Time**: ~20 minutes
- **Framework**: PyTorch 2.0+
## Environmental Impact
- **Hardware Type**: Apple Silicon M-series
- **Hours used**: 0.33
- **Carbon Emitted**: Minimal (ARM-based efficiency, ~10W average)
## Citation
```bibtex
@misc{tinybytecnn-fiction-2024,
  title={TinyByteCNN Fiction vs Non-Fiction Detector},
  author={Mitchell Currie},
  year={2024},
  publisher={HuggingFace},
  url={https://huggingface.co/username/tinybytecnn-fiction-detector}
}
```
## Acknowledgments
This model uses data from:
- HuggingFace Team (Cosmopedia dataset)
- Project Gutenberg
- Common Pile contributors
- CNN/Daily Mail dataset creators
## License
Apache 2.0
## Contact
For questions or issues, please open an issue on the [model repository](https://huggingface.co/username/tinybytecnn-fiction-detector).