metadata
language:
  - en
license: apache-2.0
library_name: pytorch
tags:
  - text-classification
  - fiction-detection
  - byte-level
  - cnn
datasets:
  - HuggingFaceTB/cosmopedia
  - BEE-spoke-data/gutenberg-en-v1-clean
  - common-pile/arxiv_abstracts
  - ccdv/cnn_dailymail
metrics:
  - accuracy
  - f1
  - roc_auc
model-index:
  - name: TinyByteCNN-Fiction-Classifier
    results:
      - task:
          type: text-classification
          name: Fiction vs Non-Fiction Classification
        dataset:
          name: Custom Fiction/Non-Fiction Dataset (85k samples)
          type: custom
          split: validation
        metrics:
          - type: accuracy
            value: 99.91
            name: Validation Accuracy
          - type: f1
            value: 99.91
            name: F1 Score
          - type: roc_auc
            value: 99.99
            name: ROC AUC
      - task:
          type: text-classification
          name: Curated Test Samples
        dataset:
          name: 18 Diverse Fiction/Non-Fiction Samples
          type: curated
          split: test
        metrics:
          - type: accuracy
            value: 100
            name: Test Accuracy
          - type: confidence_avg
            value: 96.3
            name: Average Confidence

TinyByteCNN Fiction vs Non-Fiction Detector

A lightweight, byte-level CNN model for detecting fiction vs non-fiction text with 99.91% validation accuracy.

Model Description

TinyByteCNN is a highly efficient byte-level convolutional neural network designed for binary classification of fiction vs non-fiction text. The model operates directly on UTF-8 byte sequences, eliminating the need for tokenization and making it robust to various text formats and languages.
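
For illustration, the model's input is simply the sequence of raw byte values (a minimal sketch; the exact ID scheme, e.g. whether an ID is reserved for padding, is defined by the released preprocessing code):

# Byte-level input: no tokenizer, just raw UTF-8 byte values (illustrative only)
text = "The lighthouse keeper watched the storm roll in."
byte_ids = list(text.encode("utf-8"))     # integers in 0-255, e.g. [84, 104, 101, 32, ...]
print(list("naïve".encode("utf-8")))      # [110, 97, 195, 175, 118, 101] - multi-byte characters expand naturally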

Architecture Highlights

  • Model Size: 942,313 parameters (~3.6MB)
  • Input: Raw UTF-8 bytes (max 4096 bytes ≈ 512 words)
  • Architecture: Depthwise-separable 1D CNN with Squeeze-Excitation
  • Receptive Field: ~2.8KB covering multi-paragraph context
  • Key Features:
    • 4 stages with progressive downsampling (32x reduction)
    • Dilated convolutions for larger receptive field
    • SE attention modules for channel recalibration
    • Global average + max pooling head
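
The exact layer shapes live in the released checkpoint; the sketch below only illustrates the building blocks listed above (a depthwise-separable 1D convolution block with a Squeeze-Excitation module, and the average + max pooling head). Channel widths, kernel sizes, dilation rates, and the stride schedule are illustrative assumptions, not the shipped configuration.

import torch
import torch.nn as nn

class SqueezeExcite1d(nn.Module):
    # Channel recalibration: global-average over the sequence, squeeze, excite, rescale channels.
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, x):                       # x: [B, C, L]
        scale = self.fc(x.mean(dim=-1))         # [B, C]
        return x * scale.unsqueeze(-1)

class DepthwiseSeparableBlock(nn.Module):
    # Depthwise conv (per-channel, optionally dilated) followed by a pointwise 1x1 conv and SE.
    def __init__(self, in_ch, out_ch, kernel=5, stride=2, dilation=1):
        super().__init__()
        pad = dilation * (kernel - 1) // 2
        self.depthwise = nn.Conv1d(in_ch, in_ch, kernel, stride=stride,
                                   padding=pad, dilation=dilation, groups=in_ch)
        self.pointwise = nn.Conv1d(in_ch, out_ch, 1)
        self.norm = nn.BatchNorm1d(out_ch)
        self.act = nn.GELU()
        self.se = SqueezeExcite1d(out_ch)

    def forward(self, x):
        return self.se(self.act(self.norm(self.pointwise(self.depthwise(x)))))

class ByteCNNSketch(nn.Module):
    # Byte embedding -> 4 downsampling stages (stride 2 each here; the card's 32x reduction
    # implies additional striding) -> concatenated global average + max pooling -> 1 logit.
    def __init__(self, embed_dim=64, widths=(64, 96, 128, 160)):
        super().__init__()
        self.embed = nn.Embedding(256, embed_dim)
        stages, in_ch = [], embed_dim
        for i, w in enumerate(widths):
            stages.append(DepthwiseSeparableBlock(in_ch, w, stride=2, dilation=2 ** i))
            in_ch = w
        self.stages = nn.Sequential(*stages)
        self.head = nn.Linear(2 * in_ch, 1)

    def forward(self, byte_ids):                # byte_ids: [B, L] integer byte values
        x = self.embed(byte_ids).transpose(1, 2)            # [B, C, L]
        x = self.stages(x)
        pooled = torch.cat([x.mean(dim=-1), x.amax(dim=-1)], dim=-1)
        return self.head(pooled).squeeze(-1)    # one logit per sample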

Intended Uses & Limitations

Intended Uses

  • Automated content categorization for libraries and archives
  • Fiction/non-fiction filtering for content platforms
  • Educational content classification
  • Writing style analysis
  • Content recommendation systems

Limitations

  • Personal narratives: May misclassify personal journal entries and memoirs as fiction (observed ~97% fiction confidence on journal entries)
  • Mixed content: Struggles with creative non-fiction and narrative journalism
  • Length: Optimized for 512-4096 byte inputs; longer texts should be chunked (see the chunking sketch after this list)
  • Language: Primarily trained on English text
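
One simple way to handle longer documents is to classify fixed-size byte windows and aggregate the per-chunk probabilities. A minimal sketch under that assumption (the non-overlapping windows and mean aggregation are illustrative choices, not part of the released code; preprocess_text is the helper shown in the How to Use section below):

import torch
from model import preprocess_text   # helper referenced in the How to Use section below

def classify_long_text(text, model, chunk_bytes=4096):
    # Split the UTF-8 byte stream into non-overlapping 4096-byte windows and average the probabilities.
    data = text.encode("utf-8")
    chunks = [data[i:i + chunk_bytes] for i in range(0, len(data), chunk_bytes)] or [data]
    probs = []
    with torch.no_grad():
        for chunk in chunks:
            # errors="ignore" drops a multi-byte character that was split at a chunk boundary
            input_bytes = preprocess_text(chunk.decode("utf-8", errors="ignore"))
            probs.append(torch.sigmoid(model(input_bytes)).item())
    mean_prob = sum(probs) / len(probs)
    return ("Non-Fiction" if mean_prob > 0.5 else "Fiction"), mean_prob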

Training Data

The model was trained on a diverse dataset of 85,000 samples (60k train, 15k validation, 10k test) drawn from:

Fiction Sources (50%)

  1. Cosmopedia Stories (HuggingFaceTB/cosmopedia)

    • Synthetic fiction stories
    • License: Apache 2.0
  2. Project Gutenberg (BEE-spoke-data/gutenberg-en-v1-clean)

    • Classic literature
    • License: Public Domain
  3. Reddit WritingPrompts

    • Community-generated creative writing
    • Via synthetic alternatives

Non-Fiction Sources (50%)

  1. Cosmopedia Educational (HuggingFaceTB/cosmopedia)

    • Textbooks, WikiHow, educational blogs
    • License: Apache 2.0
  2. Scientific Papers (common-pile/arxiv_abstracts)

    • Academic abstracts and introductions
    • License: Various (permissive)
  3. News Articles (ccdv/cnn_dailymail)

    • CNN and Daily Mail articles
    • License: Apache 2.0

Training Procedure

Preprocessing

  • Unicode NFC normalization
  • Whitespace normalization (max 2 consecutive spaces)
  • UTF-8 byte encoding
  • Padding/truncation to 4096 bytes
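
A preprocess_text helper along these lines would reproduce that pipeline (a minimal sketch; the shipped model.py may differ in details such as the padding value or how non-space whitespace is handled):

import re
import unicodedata
import torch

MAX_BYTES = 4096

def preprocess_text(text, max_bytes=MAX_BYTES):
    # 1. Unicode NFC normalization
    text = unicodedata.normalize("NFC", text)
    # 2. Whitespace normalization: collapse runs of 3+ spaces down to 2
    text = re.sub(r" {3,}", "  ", text)
    # 3. UTF-8 byte encoding, truncated to the 4096-byte window
    data = text.encode("utf-8")[:max_bytes]
    # 4. Pad with zeros (assumed padding value) to a fixed length
    ids = list(data) + [0] * (max_bytes - len(data))
    return torch.tensor(ids, dtype=torch.long).unsqueeze(0)   # shape [1, 4096]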

Training Hyperparameters

  • Optimizer: AdamW (lr=3e-3, betas=(0.9, 0.98), weight_decay=0.01)
  • Schedule: Cosine decay with 5% warmup
  • Batch Size: 32
  • Epochs: 10
  • Label Smoothing: 0.05
  • Gradient Clipping: 1.0
  • Device: Apple M-series (MPS)
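
These settings map directly onto a standard PyTorch setup. A minimal sketch, assuming BCE-with-logits as the loss and the hypothetical ByteCNNSketch module from the architecture section (the actual training script is not part of this card, so the loss choice and the label-smoothing convention are assumptions):

import math
import torch

device = "mps" if torch.backends.mps.is_available() else "cpu"
model = ByteCNNSketch().to(device)

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-3, betas=(0.9, 0.98), weight_decay=0.01)
epochs, steps_per_epoch = 10, 60_000 // 32             # 60k training samples, batch size 32
total_steps = epochs * steps_per_epoch
warmup_steps = int(0.05 * total_steps)                  # 5% warmup

def lr_lambda(step):
    # Linear warmup for the first 5% of steps, then cosine decay to zero.
    if step < warmup_steps:
        return step / max(warmup_steps, 1)
    progress = (step - warmup_steps) / max(total_steps - warmup_steps, 1)
    return 0.5 * (1 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
criterion = torch.nn.BCEWithLogitsLoss()

for inputs, labels in []:                               # replace [] with the real DataLoader
    inputs, labels = inputs.to(device), labels.float().to(device)
    smoothed = labels * 0.9 + 0.05                      # one common reading of label smoothing 0.05
    loss = criterion(model(inputs), smoothed)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)   # gradient clipping at 1.0
    optimizer.step()
    optimizer.zero_grad()
    scheduler.step()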

Evaluation Results

Validation Set (15,000 samples)

| Metric   | Value  |
|----------|--------|
| Accuracy | 99.91% |
| F1 Score | 0.9991 |
| ROC AUC  | 0.9999 |
| Loss     | 0.1194 |

Detailed Test Results on 18 Curated Samples

The model achieved 100% accuracy across all categories, but shows interesting confidence patterns:

| Category | Sample Title/Type | True Label | Predicted | Confidence | Analysis |
|---|---|---|---|---|---|
| FICTION - General | | | | | |
| Literary | Lighthouse Keeper Storm | Fiction | Fiction | 79.8% | ⚠️ Lowest confidence - realistic setting |
| Sci-Fi | Time Travel Bedroom | Fiction | Fiction | 97.2% | ✅ Clear fantastical elements |
| Mystery | Detective Rose Case | Fiction | Fiction | 97.3% | ✅ Strong narrative structure |
| FICTION - Children's | | | | | |
| Animal Tale | Benny's Carrot Problem | Fiction | Fiction | 97.1% | ✅ Clear storytelling markers |
| Fantasy | Princess Luna's Paintings | Fiction | Fiction | 97.3% | ✅ Magical elements detected |
| Magical | Tommy's Dream Sprites | Fiction | Fiction | 96.0% | ⚠️ Lower confidence - whimsical tone |
| FICTION - Fantasy | | | | | |
| Epic Fantasy | Shadowgate & Void Lords | Fiction | Fiction | 97.4% | ✅ High fantasy vocabulary |
| Magic System | Moonlight Weaver Elara | Fiction | Fiction | 96.8% | ✅ Complex world-building |
| Urban Fantasy | Dragon Memory Markets | Fiction | Fiction | 97.3% | ✅ Supernatural commerce |
| NON-FICTION - Academic | | | | | |
| Biology | Photosynthesis Process | Non-Fiction | Non-Fiction | 97.8% | ✅ Technical terminology |
| Mathematics | Calculus Theorem | Non-Fiction | Non-Fiction | 97.8% | ✅ Mathematical concepts |
| Economics | Market Equilibrium | Non-Fiction | Non-Fiction | 97.9% | ✅ Economic theory |
| NON-FICTION - News | | | | | |
| Financial | Federal Reserve Decision | Non-Fiction | Non-Fiction | 97.8% | ✅ Factual reporting style |
| Local Gov | Homeless Crisis Plan | Non-Fiction | Non-Fiction | 97.9% | ✅ Policy announcement format |
| Science | Exoplanet Discovery | Non-Fiction | Non-Fiction | 97.9% | ✅ Research reporting |
| NON-FICTION - Journals | | | | | |
| Financial | Wall Street Journal Market | Non-Fiction | Non-Fiction | 97.7% | ✅ Professional journalism |
| Scientific | Nature Research Report | Non-Fiction | Non-Fiction | 97.7% | ✅ Academic publication style |
| Personal | Kyoto Travel Log | Non-Fiction | Non-Fiction | 97.5% | ⚠️ Slightly lower - personal narrative |

Key Insights:

  • Weakest Performance: Realistic literary fiction (79.8% confidence) - the lighthouse story lacks obvious fantastical elements
  • Strongest Performance: Academic/news content (97.8-97.9% confidence) - clear technical/factual language
  • Edge Cases: Personal narratives and whimsical children's stories show slightly lower confidence
  • Perfect Accuracy: 18/18 samples correctly classified despite confidence variations

Breakdown by Source Type

✅ 12 of the 18 curated samples, grouped by source type (all correctly classified)

Fiction Samples (3/3):

  1. Lighthouse keeper narrative → Fiction (79.8% conf)
  2. Time travel story → Fiction (97.2% conf)
  3. Detective mystery → Fiction (97.3% conf)

Textbook Samples (3/3):

  1. Photosynthesis (Biology) → Non-Fiction (97.8% conf)
  2. Fundamental theorem (Calculus) → Non-Fiction (97.8% conf)
  3. Market equilibrium (Economics) → Non-Fiction (97.9% conf)

News Articles (3/3):

  1. Federal Reserve decision → Non-Fiction (97.8% conf)
  2. City homeless initiative → Non-Fiction (97.9% conf)
  3. Exoplanet discovery → Non-Fiction (97.9% conf)

Journal Articles (3/3):

  1. Wall Street Journal (Financial) → Non-Fiction (97.7% conf)
  2. Nature Scientific Reports → Non-Fiction (97.7% conf)
  3. Personal Travel Journal → Non-Fiction (97.5% conf)

How to Use

PyTorch

import torch
from model import TinyByteCNN, preprocess_text

# Load model
model = TinyByteCNN.from_pretrained("username/tinybytecnn-fiction-detector")
model.eval()

# Prepare text
text = "Your text here..."
input_bytes = preprocess_text(text)  # Returns tensor of shape [1, 4096]

# Predict
with torch.no_grad():
    logits = model(input_bytes)
    probability = torch.sigmoid(logits).item()

if probability > 0.5:
    print(f"Non-Fiction (confidence: {probability:.1%})")
else:
    print(f"Fiction (confidence: {1 - probability:.1%})")
Batch Processing

def classify_texts(texts, model, batch_size=32):
    results = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        # preprocess_text returns [1, 4096], so concatenate along dim 0 to get a [B, 4096] batch
        inputs = torch.cat([preprocess_text(t) for t in batch], dim=0)

        with torch.no_grad():
            logits = model(inputs)
            probs = torch.sigmoid(logits).flatten()

        for text, prob in zip(batch, probs):
            results.append({
                'text': text[:100] + '...',
                'class': 'Non-Fiction' if prob > 0.5 else 'Fiction',
                'confidence': prob.item() if prob > 0.5 else 1 - prob.item()
            })

    return results
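
For example (an illustrative call, assuming the model was loaded as in the PyTorch section above):

texts = [
    "The dragon unfurled its wings over the sleeping village.",
    "The Federal Reserve raised interest rates by 25 basis points on Wednesday.",
]
for result in classify_texts(texts, model):
    print(f"{result['class']:12s} {result['confidence']:.1%}  {result['text']}")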

Training Infrastructure

  • Hardware: Apple M-series with 8GB MPS memory limit
  • Training Time: ~20 minutes
  • Framework: PyTorch 2.0+

Environmental Impact

  • Hardware Type: Apple Silicon M-series
  • Hours used: 0.33
  • Carbon Emitted: Minimal (ARM-based efficiency, ~10W average)

Citation

@misc{tinybytecnn-fiction-2024,
  title={TinyByteCNN Fiction vs Non-Fiction Detector},
  author={Mitchell Currie},
  year={2024},
  publisher={HuggingFace},
  url={https://huggingface.co/username/tinybytecnn-fiction-detector}
}

Acknowledgments

This model uses data from:

  • HuggingFace Team (Cosmopedia dataset)
  • Project Gutenberg
  • Common Pile contributors
  • CNN/Daily Mail dataset creators

License

Apache 2.0

Contact

For questions or issues, please open an issue on the model repository.