|
---
language:
- en
license: apache-2.0
library_name: pytorch
tags:
- text-classification
- fiction-detection
- byte-level
- cnn
datasets:
- HuggingFaceTB/cosmopedia
- BEE-spoke-data/gutenberg-en-v1-clean
- common-pile/arxiv_abstracts
- ccdv/cnn_dailymail
metrics:
- accuracy
- f1
- roc_auc
model-index:
- name: TinyByteCNN-Fiction-Classifier
  results:
  - task:
      type: text-classification
      name: Fiction vs Non-Fiction Classification
    dataset:
      name: Custom Fiction/Non-Fiction Dataset (85k samples)
      type: custom
      split: validation
    metrics:
    - type: accuracy
      value: 99.91
      name: Validation Accuracy
    - type: f1
      value: 99.91
      name: F1 Score
    - type: roc_auc
      value: 99.99
      name: ROC AUC
  - task:
      type: text-classification
      name: Curated Test Samples
    dataset:
      name: 18 Diverse Fiction/Non-Fiction Samples
      type: curated
      split: test
    metrics:
    - type: accuracy
      value: 100.0
      name: Test Accuracy
    - type: confidence_avg
      value: 96.3
      name: Average Confidence
---
|
|
|
# TinyByteCNN Fiction vs Non-Fiction Detector |
|
|
|
A lightweight byte-level CNN that classifies text as fiction or non-fiction, reaching 99.91% validation accuracy.
|
|
|
## Model Description |
|
|
|
TinyByteCNN is a highly efficient byte-level convolutional neural network designed for binary classification of fiction vs non-fiction text. The model operates directly on UTF-8 byte sequences, eliminating the need for tokenization and making it robust to various text formats and languages. |
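
As a concrete illustration (standard Python, not from the model's code), byte-level input simply means the model consumes the UTF-8 encoding of the text directly, so multilingual characters and emoji need no special handling:

```python
text = "Café ☕"
byte_ids = list(text.encode("utf-8"))  # each byte becomes one input ID (0-255)
print(byte_ids)  # [67, 97, 102, 195, 169, 32, 226, 152, 149]
```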
|
|
|
### Architecture Highlights |
|
|
|
- **Model Size**: 942,313 parameters (~3.6MB) |
|
- **Input**: Raw UTF-8 bytes (max 4096 bytes ≈ 512 words)
|
- **Architecture**: Depthwise-separable 1D CNN with Squeeze-Excitation |
|
- **Receptive Field**: ~2.8KB covering multi-paragraph context |
|
- **Key Features** (see the sketch after this list):
|
- 4 stages with progressive downsampling (32x reduction) |
|
- Dilated convolutions for larger receptive field |
|
- SE attention modules for channel recalibration |
|
- Global average + max pooling head |
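
To make the list above concrete, here is a minimal PyTorch sketch of this style of architecture. The channel widths, kernel size, and stride/dilation schedule are illustrative assumptions chosen to reproduce the 32x downsampling, not the released model's exact configuration; `TinyByteCNNSketch`, `DSConv`, and `SEBlock` are hypothetical names.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-Excitation: gate each channel using globally pooled statistics."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, x):                    # x: [B, C, L]
        gates = self.fc(x.mean(dim=-1))      # squeeze over length -> [B, C]
        return x * gates.unsqueeze(-1)       # per-channel recalibration

class DSConv(nn.Module):
    """Depthwise-separable 1D conv block with optional stride/dilation and SE."""
    def __init__(self, in_ch, out_ch, stride=1, dilation=1, k=5):
        super().__init__()
        pad = dilation * (k - 1) // 2
        self.net = nn.Sequential(
            nn.Conv1d(in_ch, in_ch, k, stride=stride, padding=pad,
                      dilation=dilation, groups=in_ch),  # depthwise filter
            nn.Conv1d(in_ch, out_ch, 1),                 # pointwise channel mix
            nn.BatchNorm1d(out_ch), nn.GELU(), SEBlock(out_ch),
        )

    def forward(self, x):
        return self.net(x)

class TinyByteCNNSketch(nn.Module):
    def __init__(self, num_classes: int = 1):
        super().__init__()
        self.embed = nn.Embedding(256, 32)   # one vector per byte value
        # 4 stages; strides 2/4/2/2 give the 32x length reduction, and
        # dilation widens the receptive field without extra downsampling.
        self.stages = nn.Sequential(
            DSConv(32, 64, stride=2),
            DSConv(64, 96, stride=4, dilation=2),
            DSConv(96, 128, stride=2, dilation=2),
            DSConv(128, 160, stride=2, dilation=4),
        )
        self.head = nn.Linear(2 * 160, num_classes)  # GAP + GMP concatenated

    def forward(self, byte_ids):             # byte_ids: [B, 4096] long tensor
        x = self.embed(byte_ids).transpose(1, 2)      # -> [B, 32, 4096]
        x = self.stages(x)                            # -> [B, 160, 128]
        pooled = torch.cat([x.mean(-1), x.amax(-1)], dim=-1)
        return self.head(pooled).squeeze(-1)          # one logit per sample
```

The depthwise-separable split (a per-channel filter followed by a 1x1 channel-mixing convolution) is where most of the parameter savings in a sub-1M-parameter model come from.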
|
|
|
## Intended Uses & Limitations |
|
|
|
### Intended Uses |
|
- Automated content categorization for libraries and archives |
|
- Fiction/non-fiction filtering for content platforms |
|
- Educational content classification |
|
- Writing style analysis |
|
- Content recommendation systems |
|
|
|
### Limitations |
|
- **Personal narratives**: May misclassify personal journal entries and memoirs as fiction (observed ~97% fiction confidence on journal entries) |
|
- **Mixed content**: Struggles with creative non-fiction and narrative journalism |
|
- **Length**: Optimized for 512-4096 byte inputs; longer texts should be chunked (see the Long Texts sketch under How to Use)
|
- **Language**: Primarily trained on English text |
|
|
|
## Training Data |
|
|
|
The model was trained on a diverse dataset of 85,000 samples (60k train, 15k validation, 10k test) drawn from: |
|
|
|
### Fiction Sources (50%) |
|
1. **Cosmopedia Stories** (HuggingFaceTB/cosmopedia) |
|
- Synthetic fiction stories |
|
- License: Apache 2.0 |
|
|
|
2. **Project Gutenberg** (BEE-spoke-data/gutenberg-en-v1-clean) |
|
- Classic literature |
|
- License: Public Domain |
|
|
|
3. **Reddit WritingPrompts** |
|
- Community-generated creative writing |
|
- Via synthetic alternatives |
|
|
|
### Non-Fiction Sources (50%) |
|
1. **Cosmopedia Educational** (HuggingFaceTB/cosmopedia) |
|
- Textbooks, WikiHow, educational blogs |
|
- License: Apache 2.0 |
|
|
|
2. **Scientific Papers** (common-pile/arxiv_abstracts) |
|
- Academic abstracts and introductions |
|
- License: Various (permissive) |
|
|
|
3. **News Articles** (ccdv/cnn_dailymail) |
|
- CNN and Daily Mail articles |
|
- License: Apache 2.0 |
|
|
|
## Training Procedure |
|
|
|
### Preprocessing |
|
- Unicode NFC normalization |
|
- Whitespace normalization (max 2 consecutive spaces) |
|
- UTF-8 byte encoding |
|
- Padding/truncation to 4096 bytes (sketched below)
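
The repository's `preprocess_text` helper (used in the usage examples below) presumably implements these steps. A minimal sketch, assuming zero-byte padding and a space-only whitespace rule:

```python
import re
import unicodedata

import torch

MAX_BYTES = 4096

def preprocess_text(text: str, max_bytes: int = MAX_BYTES) -> torch.Tensor:
    """Normalize text and encode it as a fixed-length byte-ID tensor [1, max_bytes]."""
    text = unicodedata.normalize("NFC", text)         # Unicode NFC normalization
    text = re.sub(r" {3,}", "  ", text)               # cap runs at 2 consecutive spaces
    data = text.encode("utf-8")[:max_bytes]           # raw UTF-8 bytes, truncated
    ids = list(data) + [0] * (max_bytes - len(data))  # assumed zero-padding
    return torch.tensor(ids, dtype=torch.long).unsqueeze(0)
```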
|
|
|
### Training Hyperparameters |
|
- **Optimizer**: AdamW (lr=3e-3, betas=(0.9, 0.98), weight_decay=0.01) |
|
- **Schedule**: Cosine decay with 5% warmup (see the sketch after this list)
|
- **Batch Size**: 32 |
|
- **Epochs**: 10 |
|
- **Label Smoothing**: 0.05 |
|
- **Gradient Clipping**: 1.0 |
|
- **Device**: Apple M-series (MPS) |
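
For reference, a sketch of an equivalent optimizer and schedule setup. It reuses the hypothetical `TinyByteCNNSketch` from the architecture sketch above, and the BCE-style label-smoothing recipe shown is one common choice, not necessarily the exact one used in training:

```python
import math
import torch

EPOCHS, BATCH_SIZE, WARMUP_FRAC = 10, 32, 0.05

model = TinyByteCNNSketch()              # architecture sketch from above
steps_per_epoch = 60_000 // BATCH_SIZE   # 60k training samples
total_steps = EPOCHS * steps_per_epoch
warmup_steps = int(WARMUP_FRAC * total_steps)

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-3,
                              betas=(0.9, 0.98), weight_decay=0.01)

def lr_lambda(step: int) -> float:
    if step < warmup_steps:              # linear warmup over the first 5%
        return step / max(1, warmup_steps)
    t = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * t))  # cosine decay to zero

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
criterion = torch.nn.BCEWithLogitsLoss()

def smooth(targets: torch.Tensor, eps: float = 0.05) -> torch.Tensor:
    # BCE-style label smoothing: map {0, 1} targets to {eps/2, 1 - eps/2}
    return targets * (1 - eps) + eps / 2

def train_step(x: torch.Tensor, y: torch.Tensor) -> float:
    optimizer.zero_grad()
    loss = criterion(model(x), smooth(y.float()))
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # clip at 1.0
    optimizer.step()
    scheduler.step()
    return loss.item()
```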
|
|
|
## Evaluation Results |
|
|
|
### Validation Set (15,000 samples) |
|
| Metric | Value | |
|
|--------|-------| |
|
| Accuracy | 99.91% | |
|
| F1 Score | 0.9991 | |
|
| ROC AUC | 0.9999 | |
|
| Loss | 0.1194 | |
|
|
|
### Detailed Test Results on 18 Curated Samples |
|
|
|
The model achieved **100% accuracy** across all categories, with notable variation in confidence:
|
|
|
| Category | Sample Title/Type | True Label | Predicted | Confidence | Analysis |
|----------|-------------------|------------|-----------|------------|----------|
| **FICTION - General** | | | | | |
| Literary | Lighthouse Keeper Storm | Fiction | Fiction | **79.8%** | ⚠️ **Lowest confidence** - realistic setting |
| Sci-Fi | Time Travel Bedroom | Fiction | Fiction | 97.2% | ✅ Clear fantastical elements |
| Mystery | Detective Rose Case | Fiction | Fiction | 97.3% | ✅ Strong narrative structure |
| **FICTION - Children's** | | | | | |
| Animal Tale | Benny's Carrot Problem | Fiction | Fiction | 97.1% | ✅ Clear storytelling markers |
| Fantasy | Princess Luna's Paintings | Fiction | Fiction | 97.3% | ✅ Magical elements detected |
| Magical | Tommy's Dream Sprites | Fiction | Fiction | **96.0%** | ⚠️ Lower confidence - whimsical tone |
| **FICTION - Fantasy** | | | | | |
| Epic Fantasy | Shadowgate & Void Lords | Fiction | Fiction | 97.4% | ✅ High fantasy vocabulary |
| Magic System | Moonlight Weaver Elara | Fiction | Fiction | 96.8% | ✅ Complex world-building |
| Urban Fantasy | Dragon Memory Markets | Fiction | Fiction | 97.3% | ✅ Supernatural commerce |
| **NON-FICTION - Academic** | | | | | |
| Biology | Photosynthesis Process | Non-Fiction | Non-Fiction | 97.8% | ✅ Technical terminology |
| Mathematics | Calculus Theorem | Non-Fiction | Non-Fiction | 97.8% | ✅ Mathematical concepts |
| Economics | Market Equilibrium | Non-Fiction | Non-Fiction | 97.9% | ✅ Economic theory |
| **NON-FICTION - News** | | | | | |
| Financial | Federal Reserve Decision | Non-Fiction | Non-Fiction | 97.8% | ✅ Factual reporting style |
| Local Gov | Homeless Crisis Plan | Non-Fiction | Non-Fiction | 97.9% | ✅ Policy announcement format |
| Science | Exoplanet Discovery | Non-Fiction | Non-Fiction | 97.9% | ✅ Research reporting |
| **NON-FICTION - Journals** | | | | | |
| Financial | Wall Street Journal Market | Non-Fiction | Non-Fiction | 97.7% | ✅ Professional journalism |
| Scientific | Nature Research Report | Non-Fiction | Non-Fiction | 97.7% | ✅ Academic publication style |
| Personal | Kyoto Travel Log | Non-Fiction | Non-Fiction | **97.5%** | ⚠️ Slightly lower - personal narrative |
|
|
|
### Key Insights
|
- **Weakest Performance**: Realistic literary fiction (79.8% confidence) - the lighthouse story lacks obvious fantastical elements |
|
- **Strongest Performance**: Academic/news content (97.8-97.9% confidence) - clear technical/factual language |
|
- **Edge Cases**: Personal narratives and whimsical children's stories show slightly lower confidence |
|
- **Perfect Accuracy**: 18/18 samples correctly classified despite confidence variations |
|
|
|
|
|
## How to Use |
|
|
|
### PyTorch |
|
|
|
```python
import torch

from model import TinyByteCNN, preprocess_text

# Load model
model = TinyByteCNN.from_pretrained("username/tinybytecnn-fiction-detector")
model.eval()

# Prepare text
text = "Your text here..."
input_bytes = preprocess_text(text)  # Returns tensor of shape [1, 4096]

# Predict
with torch.no_grad():
    logits = model(input_bytes)
    probability = torch.sigmoid(logits).item()

# The sigmoid probability reads as P(non-fiction)
if probability > 0.5:
    print(f"Non-Fiction (confidence: {probability:.1%})")
else:
    print(f"Fiction (confidence: {1 - probability:.1%})")
```
|
|
|
### Batch Processing |
|
|
|
```python
def classify_texts(texts, model, batch_size=32):
    """Classify a list of strings, returning label and confidence for each."""
    results = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        # preprocess_text returns [1, 4096], so concatenate along dim 0 -> [B, 4096]
        inputs = torch.cat([preprocess_text(t) for t in batch])

        with torch.no_grad():
            logits = model(inputs)
            probs = torch.sigmoid(logits).view(-1)  # one probability per text

        for text, prob in zip(batch, probs):
            p = prob.item()
            results.append({
                'text': text[:100] + '...',
                'class': 'Non-Fiction' if p > 0.5 else 'Fiction',
                'confidence': p if p > 0.5 else 1 - p,
            })

    return results
```
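
### Long Texts

As noted under Limitations, inputs beyond 4096 bytes should be chunked. A minimal sketch of one aggregation strategy, reusing `model` and `preprocess_text` from above; averaging per-chunk probabilities is an assumption on my part, and schemes such as max-pooling are equally reasonable:

```python
def classify_long_text(text, model, max_bytes=4096):
    """Split a long document into byte windows and average the probabilities."""
    data = text.encode("utf-8")
    # Byte-boundary splits can bisect a multi-byte character; errors="ignore"
    # drops the dangling fragment rather than raising.
    chunks = [data[i:i + max_bytes].decode("utf-8", errors="ignore")
              for i in range(0, len(data), max_bytes)]
    chunks = chunks or [""]  # guard against empty input

    with torch.no_grad():
        probs = [torch.sigmoid(model(preprocess_text(c))).item() for c in chunks]
    avg = sum(probs) / len(probs)
    return ("Non-Fiction" if avg > 0.5 else "Fiction", avg)
```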
|
|
|
## Training Infrastructure |
|
|
|
- **Hardware**: Apple M-series with 8GB MPS memory limit |
|
- **Training Time**: ~20 minutes |
|
- **Framework**: PyTorch 2.0+ |
|
|
|
## Environmental Impact |
|
|
|
- **Hardware Type**: Apple Silicon M-series |
|
- **Hours used**: 0.33 |
|
- **Carbon Emitted**: Negligible (~10W average draw for 0.33h is roughly 3Wh of energy)
|
|
|
## Citation |
|
|
|
```bibtex
@misc{tinybytecnn-fiction-2024,
  title={TinyByteCNN Fiction vs Non-Fiction Detector},
  author={Mitchell Currie},
  year={2024},
  publisher={Hugging Face},
  url={https://huggingface.co/username/tinybytecnn-fiction-detector}
}
```
|
|
|
## Acknowledgments |
|
|
|
This model uses data from: |
|
- HuggingFace Team (Cosmopedia dataset) |
|
- Project Gutenberg |
|
- Common Pile contributors |
|
- CNN/Daily Mail dataset creators |
|
|
|
## License |
|
|
|
Apache 2.0 |
|
|
|
## Contact |
|
|
|
For questions or issues, please open an issue on the [model repository](https://huggingface.co/username/tinybytecnn-fiction-detector). |