---
language:
- en
license: apache-2.0
library_name: pytorch
tags:
- text-classification
- fiction-detection
- byte-level
- cnn
datasets:
- HuggingFaceTB/cosmopedia
- BEE-spoke-data/gutenberg-en-v1-clean
- common-pile/arxiv_abstracts
- ccdv/cnn_dailymail
metrics:
- accuracy
- f1
- roc_auc
model-index:
- name: TinyByteCNN-Fiction-Classifier
  results:
  - task:
      type: text-classification
      name: Fiction vs Non-Fiction Classification
    dataset:
      name: Custom Fiction/Non-Fiction Dataset (85k samples)
      type: custom
      split: validation
    metrics:
    - type: accuracy
      value: 99.91
      name: Validation Accuracy
    - type: f1
      value: 99.91
      name: F1 Score
    - type: roc_auc
      value: 99.99
      name: ROC AUC
  - task:
      type: text-classification
      name: Curated Test Samples
    dataset:
      name: 18 Diverse Fiction/Non-Fiction Samples
      type: curated
      split: test
    metrics:
    - type: accuracy
      value: 100.0
      name: Test Accuracy
    - type: confidence_avg
      value: 96.3
      name: Average Confidence
---

# TinyByteCNN Fiction vs Non-Fiction Detector

A lightweight, byte-level CNN model for detecting fiction vs non-fiction text with 99.91% validation accuracy.

## Model Description

TinyByteCNN is a compact byte-level convolutional neural network for binary classification of text as fiction or non-fiction. The model operates directly on UTF-8 byte sequences, so it needs no tokenizer or vocabulary and accepts any text that can be encoded as UTF-8, although it was trained primarily on English (see Limitations).

### Architecture Highlights

- **Model Size**: 942,313 parameters (~3.6MB)
- **Input**: Raw UTF-8 bytes (max 4096 bytes ≈ 512 words)
- **Architecture**: Depthwise-separable 1D CNN with Squeeze-Excitation
- **Receptive Field**: ~2.8KB covering multi-paragraph context
- **Key Features**:
  - 4 stages with progressive downsampling (32x reduction)
  - Dilated convolutions for larger receptive field
  - SE attention modules for channel recalibration
  - Global average + max pooling head
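
The size and downsampling figures above follow from simple arithmetic, and the depthwise-separable design is a large part of what keeps the parameter count low. The sketch below checks both. It assumes float32 weights, and the channel width and kernel size in the second half are hypothetical values chosen for illustration, not the model's actual configuration.

```python
# Back-of-the-envelope checks for the figures listed above.
PARAMS = 942_313          # parameter count from this card
BYTES_PER_PARAM = 4       # assumption: weights stored as float32

size_mib = PARAMS * BYTES_PER_PARAM / 2**20
print(f"Model size: {size_mib:.2f} MiB")

MAX_BYTES = 4096          # maximum input length from this card
DOWNSAMPLE = 32           # total reduction across the 4 stages
positions = MAX_BYTES // DOWNSAMPLE
print(f"Sequence positions after downsampling: {positions}")


def standard_conv1d_params(c_in: int, c_out: int, k: int) -> int:
    """Weights in a regular 1D convolution (bias ignored)."""
    return c_in * c_out * k


def separable_conv1d_params(c_in: int, c_out: int, k: int) -> int:
    """Depthwise (one k-tap filter per channel) plus 1x1 pointwise weights."""
    return c_in * k + c_in * c_out


# Hypothetical layer shape, for illustration only.
C, K = 256, 7
ratio = standard_conv1d_params(C, C, K) / separable_conv1d_params(C, C, K)
print(f"Separable conv uses {ratio:.1f}x fewer weights at C={C}, K={K}")
```

After the 32x reduction, the pooling head sees 128 positions for a full-length input.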

## Intended Uses & Limitations

### Intended Uses
- Automated content categorization for libraries and archives
- Fiction/non-fiction filtering for content platforms
- Educational content classification
- Writing style analysis
- Content recommendation systems

### Limitations
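- **Personal narratives**: May misclassify personal journal entries and memoirs as fiction (~97% fiction confidence has been observed on journal entries, even though the curated travel-log sample in the evaluation table is classified correctly)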
- **Mixed content**: Struggles with creative non-fiction and narrative journalism
- **Length**: Optimized for 512-4096 byte inputs; preprocessing truncates anything beyond 4096 bytes, so longer texts should be chunked
- **Language**: Primarily trained on English text
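
Chunking a long text for the 4096-byte limit can be sketched as below. This is a minimal illustration, not part of the released code: `chunk_utf8` is a hypothetical helper name, and it cuts purely by byte count while avoiding splits inside a multi-byte character.

```python
def chunk_utf8(text: str, max_bytes: int = 4096) -> list[str]:
    """Split text into pieces whose UTF-8 encoding is at most max_bytes long.

    Cuts never land inside a multi-byte character. Splitting on paragraph or
    sentence boundaries would likely work better in practice.
    """
    if max_bytes < 4:
        # A UTF-8 character can take up to 4 bytes.
        raise ValueError("max_bytes must be at least 4")

    data = text.encode("utf-8")
    chunks = []
    start = 0
    while start < len(data):
        end = min(start + max_bytes, len(data))
        # UTF-8 continuation bytes look like 0b10xxxxxx; back up so the cut
        # falls on the first byte of a character.
        while end < len(data) and (data[end] & 0xC0) == 0x80:
            end -= 1
        chunks.append(data[start:end].decode("utf-8"))
        start = end
    return chunks


if __name__ == "__main__":
    pieces = chunk_utf8("é" * 3000)  # 6000 bytes of two-byte characters
    print([len(p.encode("utf-8")) for p in pieces])
```

Each chunk can then be classified separately; how to combine the per-chunk predictions (for example, averaging the probabilities) is a design choice this card does not prescribe.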

## Training Data

The model was trained on a diverse dataset of 85,000 samples (60k train, 15k validation, 10k test) drawn from:

### Fiction Sources (50%)
1. **Cosmopedia Stories** (HuggingFaceTB/cosmopedia)
   - Synthetic fiction stories
   - License: Apache 2.0

2. **Project Gutenberg** (BEE-spoke-data/gutenberg-en-v1-clean)
   - Classic literature
   - License: Public Domain

3. **Reddit WritingPrompts**
   - Community-generated creative writing
   - Represented via synthetic alternatives rather than the original Reddit posts

### Non-Fiction Sources (50%)
1. **Cosmopedia Educational** (HuggingFaceTB/cosmopedia)
   - Textbooks, WikiHow, educational blogs
   - License: Apache 2.0

2. **Scientific Papers** (common-pile/arxiv_abstracts)
   - Academic abstracts and introductions
   - License: Various (permissive)

3. **News Articles** (ccdv/cnn_dailymail)
   - CNN and Daily Mail articles
   - License: Apache 2.0

## Training Procedure

### Preprocessing
- Unicode NFC normalization
- Whitespace normalization (max 2 consecutive spaces)
- UTF-8 byte encoding
- Padding/truncation to 4096 bytes
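
The four steps above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the repository's `preprocess_text`: the function name `preprocess_to_bytes` is hypothetical, the padding value of 0 is assumed, and the whitespace rule is read as "collapse runs of three or more spaces to two".

```python
import re
import unicodedata

MAX_LEN = 4096   # input length from this card
PAD_BYTE = 0     # assumption: the card does not say which padding value is used


def preprocess_to_bytes(text: str, max_len: int = MAX_LEN) -> list[int]:
    """Return exactly max_len byte values (0-255) for one input text."""
    # 1. Unicode NFC normalization (composes e.g. "e" + combining accent)
    text = unicodedata.normalize("NFC", text)
    # 2. Whitespace normalization: allow at most 2 consecutive spaces
    text = re.sub(r" {3,}", "  ", text)
    # 3. UTF-8 byte encoding
    data = text.encode("utf-8")
    # 4. Truncate, then right-pad to the fixed length. Truncation may cut a
    #    multi-byte character, which is harmless for a byte-level model.
    data = data[:max_len]
    return list(data) + [PAD_BYTE] * (max_len - len(data))


if __name__ == "__main__":
    ids = preprocess_to_bytes("Once   upon a time")
    print(len(ids), ids[:8])
```

Wrapping the result in a tensor of shape `[1, 4096]` would match the input shape used in the usage examples.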

### Training Hyperparameters
- **Optimizer**: AdamW (lr=3e-3, betas=(0.9, 0.98), weight_decay=0.01)
- **Schedule**: Cosine decay with 5% warmup
- **Batch Size**: 32
- **Epochs**: 10
- **Label Smoothing**: 0.05
- **Gradient Clipping**: 1.0
- **Device**: Apple M-series (MPS)
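
For reference, the schedule implied by these settings can be written out in plain Python. The sketch assumes linear warmup from zero, cosine decay to zero, and one scheduler update per optimizer step; the card does not state these details, so treat them as illustrative.

```python
import math

BASE_LR = 3e-3
WARMUP_FRACTION = 0.05
TRAIN_SAMPLES, BATCH_SIZE, EPOCHS = 60_000, 32, 10

STEPS_PER_EPOCH = math.ceil(TRAIN_SAMPLES / BATCH_SIZE)
TOTAL_STEPS = STEPS_PER_EPOCH * EPOCHS
WARMUP_STEPS = int(WARMUP_FRACTION * TOTAL_STEPS)


def lr_at(step: int) -> float:
    """Learning rate at a 0-based optimizer step.

    Assumptions: linear warmup from 0, cosine decay to 0, per-step updates.
    """
    if step < WARMUP_STEPS:
        return BASE_LR * (step + 1) / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    return BASE_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    print(f"{TOTAL_STEPS} total steps, {WARMUP_STEPS} warmup steps")
    for s in (0, WARMUP_STEPS, TOTAL_STEPS // 2, TOTAL_STEPS - 1):
        print(f"step {s:>6}: lr = {lr_at(s):.6f}")
```

With 60k training samples and a batch size of 32, this gives 1,875 steps per epoch and 18,750 steps in total, of which the first 937 are warmup.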

## Evaluation Results

### Validation Set (15,000 samples)
| Metric | Value |
|--------|-------|
| Accuracy | 99.91% |
| F1 Score | 0.9991 |
| ROC AUC | 0.9999 |
| Loss | 0.1194 |
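
A loss of 0.1194 may look high next to 99.91% accuracy. One plausible explanation is label smoothing: if smoothing moves the hard targets to ε/2 and 1 − ε/2 (an assumption, since the card does not say how it was applied), the smoothed targets put a floor under the loss and pull predicted probabilities toward roughly 97.5%. A minimal sketch of that arithmetic:

```python
import math

EPS = 0.05  # label smoothing from the training hyperparameters


def smoothed_bce_floor(target_pos: float) -> float:
    """Lowest achievable binary cross-entropy for a soft target t.

    The minimum is reached when the prediction equals the target, which
    makes the loss equal to the entropy of the target distribution.
    """
    t = target_pos
    return -(t * math.log(t) + (1 - t) * math.log(1 - t))


# Assumption: smoothing maps hard labels {0, 1} to {EPS/2, 1 - EPS/2}.
target = 1 - EPS / 2
floor = smoothed_bce_floor(target)
print(f"Smoothed target: {target:.3f}")
print(f"Loss floor:      {floor:.4f}")
```

Under this assumption the floor is about 0.117, so the reported loss sits close to the best value a smoothed objective allows, and confidences clustering just below 98% are what one would expect.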

### Detailed Test Results on 18 Curated Samples

The model classified all 18 samples correctly (100% accuracy), but its confidence varies by category:

| Category | Sample Title/Type | True Label | Predicted | Confidence | Analysis |
|----------|------------------|------------|-----------|------------|----------|
| **FICTION - General** | | | | | |
| Literary | Lighthouse Keeper Storm | Fiction | Fiction | **79.8%** | ⚠️ **Lowest confidence** - realistic setting |
| Sci-Fi | Time Travel Bedroom | Fiction | Fiction | 97.2% | ✅ Clear fantastical elements |
| Mystery | Detective Rose Case | Fiction | Fiction | 97.3% | ✅ Strong narrative structure |
| **FICTION - Children's** | | | | | |
| Animal Tale | Benny's Carrot Problem | Fiction | Fiction | 97.1% | ✅ Clear storytelling markers |
| Fantasy | Princess Luna's Paintings | Fiction | Fiction | 97.3% | ✅ Magical elements detected |
| Magical | Tommy's Dream Sprites | Fiction | Fiction | **96.0%** | ⚠️ Lower confidence - whimsical tone |
| **FICTION - Fantasy** | | | | | |
| Epic Fantasy | Shadowgate & Void Lords | Fiction | Fiction | 97.4% | ✅ High fantasy vocabulary |
| Magic System | Moonlight Weaver Elara | Fiction | Fiction | 96.8% | ✅ Complex world-building |
| Urban Fantasy | Dragon Memory Markets | Fiction | Fiction | 97.3% | ✅ Supernatural commerce |
| **NON-FICTION - Academic** | | | | | |
| Biology | Photosynthesis Process | Non-Fiction | Non-Fiction | 97.8% | ✅ Technical terminology |
| Mathematics | Calculus Theorem | Non-Fiction | Non-Fiction | 97.8% | ✅ Mathematical concepts |
| Economics | Market Equilibrium | Non-Fiction | Non-Fiction | 97.9% | ✅ Economic theory |
| **NON-FICTION - News** | | | | | |
| Financial | Federal Reserve Decision | Non-Fiction | Non-Fiction | 97.8% | ✅ Factual reporting style |
| Local Gov | Homeless Crisis Plan | Non-Fiction | Non-Fiction | 97.9% | ✅ Policy announcement format |
| Science | Exoplanet Discovery | Non-Fiction | Non-Fiction | 97.9% | ✅ Research reporting |
| **NON-FICTION - Journals** | | | | | |
| Financial | Wall Street Journal Market | Non-Fiction | Non-Fiction | 97.7% | ✅ Professional journalism |
| Scientific | Nature Research Report | Non-Fiction | Non-Fiction | 97.7% | ✅ Academic publication style |
| Personal | Kyoto Travel Log | Non-Fiction | Non-Fiction | **97.5%** | ⚠️ Slightly lower - personal narrative |

### Key Insights:
- **Lowest Confidence**: Realistic literary fiction (79.8%) - the lighthouse story lacks obvious fantastical elements
- **Highest Confidence**: Academic/news content (97.8-97.9%) - clear technical/factual language
- **Edge Cases**: Personal narratives and whimsical children's stories show slightly lower confidence
- **Perfect Accuracy**: 18/18 samples correctly classified despite confidence variations

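### Per-Category Breakdown

#### ✅ 12 of the 18 Samples, All Correctly Classified

The lists below cover general fiction and the three non-fiction categories; the six children's and fantasy fiction samples appear in the table above.

**General Fiction Samples (3/3):**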
1. Lighthouse keeper narrative → Fiction (79.8% conf)
2. Time travel story → Fiction (97.2% conf)
3. Detective mystery → Fiction (97.3% conf)

**Textbook Samples (3/3):**
1. Photosynthesis (Biology) → Non-Fiction (97.8% conf)
2. Fundamental theorem (Calculus) → Non-Fiction (97.8% conf)
3. Market equilibrium (Economics) → Non-Fiction (97.9% conf)

**News Articles (3/3):**
1. Federal Reserve decision → Non-Fiction (97.8% conf)
2. City homeless initiative → Non-Fiction (97.9% conf)
3. Exoplanet discovery → Non-Fiction (97.9% conf)

**Journal Articles (3/3):**
1. Wall Street Journal (Financial) → Non-Fiction (97.7% conf)
2. Nature Scientific Reports → Non-Fiction (97.7% conf)
3. Personal Travel Journal → Non-Fiction (97.5% conf)

## How to Use

### PyTorch

```python
import torch

# model.py is the module that defines TinyByteCNN and preprocess_text
from model import TinyByteCNN, preprocess_text

# Load model
model = TinyByteCNN.from_pretrained("username/tinybytecnn-fiction-detector")
model.eval()

# Prepare text
text = "Your text here..."
input_bytes = preprocess_text(text)  # Returns tensor of shape [1, 4096]

# Predict
with torch.no_grad():
    logits = model(input_bytes)
    probability = torch.sigmoid(logits).item()
    
    if probability > 0.5:
        print(f"Non-Fiction (confidence: {probability:.1%})")
    else:
        print(f"Fiction (confidence: {1-probability:.1%})")
```

### Batch Processing

```python
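import torch
from model import preprocess_text

def classify_texts(texts, model, batch_size=32):
    """Classify a list of strings; returns one dict per input text."""
    results = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i+batch_size]
        # preprocess_text returns [1, 4096] per text, so concatenate along the
        # batch dimension to get [B, 4096] (torch.stack would give [B, 1, 4096])
        inputs = torch.cat([preprocess_text(t) for t in batch], dim=0)

        with torch.no_grad():
            logits = model(inputs)
            # Flatten so both [B] and [B, 1] outputs yield one value per text
            probs = torch.sigmoid(logits).reshape(-1).tolist()

        for text, prob in zip(batch, probs):
            results.append({
                'text': text[:100] + ('...' if len(text) > 100 else ''),
                'class': 'Non-Fiction' if prob > 0.5 else 'Fiction',
                'confidence': prob if prob > 0.5 else 1 - prob
            })

    return results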
```

## Training Infrastructure

- **Hardware**: Apple M-series with 8GB MPS memory limit
- **Training Time**: ~20 minutes
- **Framework**: PyTorch 2.0+

## Environmental Impact

- **Hardware Type**: Apple Silicon M-series
- **Hours used**: 0.33
- **Carbon Emitted**: Minimal (ARM-based efficiency, ~10W average)
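
The "minimal" estimate can be made concrete from the two figures above. The sketch below computes energy only; converting it to CO2 depends on the local grid's carbon intensity, which this card does not give, so that value is left as a caller-supplied parameter.

```python
AVG_POWER_W = 10.0   # approximate average draw stated above
HOURS = 0.33         # training time stated above

energy_kwh = AVG_POWER_W * HOURS / 1000.0
print(f"Estimated training energy: {energy_kwh * 1000:.1f} Wh ({energy_kwh:.4f} kWh)")


def co2_grams(energy_kwh: float, grid_g_per_kwh: float) -> float:
    """Emissions in grams for a given grid carbon intensity (g CO2 per kWh)."""
    return energy_kwh * grid_g_per_kwh
```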

## Citation

```bibtex
@misc{tinybytecnn-fiction-2024,
  title={TinyByteCNN Fiction vs Non-Fiction Detector},
  author={Mitchell Currie},
  year={2024},
  publisher={HuggingFace},
  url={https://huggingface.co/username/tinybytecnn-fiction-detector}
}
```

## Acknowledgments

This model uses data from:
- HuggingFace Team (Cosmopedia dataset)
- Project Gutenberg
- Common Pile contributors
- CNN/Daily Mail dataset creators

## License

Apache 2.0

## Contact

For questions or issues, please open an issue on the [model repository](https://huggingface.co/username/tinybytecnn-fiction-detector).