---
language:
- en
license: apache-2.0
library_name: pytorch
tags:
- text-classification
- fiction-detection
- byte-level
- cnn
datasets:
- HuggingFaceTB/cosmopedia
- BEE-spoke-data/gutenberg-en-v1-clean
- common-pile/arxiv_abstracts
- ccdv/cnn_dailymail
metrics:
- accuracy
- f1
- roc_auc
model-index:
- name: TinyByteCNN-Fiction-Classifier
  results:
  - task:
      type: text-classification
      name: Fiction vs Non-Fiction Classification
    dataset:
      name: Custom Fiction/Non-Fiction Dataset (85k samples)
      type: custom
      split: validation
    metrics:
    - type: accuracy
      value: 99.91
      name: Validation Accuracy
    - type: f1
      value: 99.91
      name: F1 Score
    - type: roc_auc
      value: 99.99
      name: ROC AUC
  - task:
      type: text-classification
      name: Curated Test Samples
    dataset:
      name: 18 Diverse Fiction/Non-Fiction Samples
      type: curated
      split: test
    metrics:
    - type: accuracy
      value: 100.0
      name: Test Accuracy
    - type: confidence_avg
      value: 96.3
      name: Average Confidence
---

	# TinyByteCNN Fiction vs Non-Fiction Detector

	A lightweight, byte-level CNN model for detecting fiction vs non-fiction text with 99.91% validation accuracy.

	## Model Description

TinyByteCNN is a compact byte-level convolutional neural network for binary classification of fiction vs non-fiction text. The model operates directly on UTF-8 byte sequences, so it needs no tokenizer or vocabulary and accepts any text that can be UTF-8 encoded. It was trained primarily on English, so behavior on other languages is untested (see Limitations).

	### Architecture Highlights

	- Model Size: 942,313 parameters (~3.6MB)
	- Input: Raw UTF-8 bytes (max 4096 bytes ≈ 512 words)
	- Architecture: Depthwise-separable 1D CNN with Squeeze-Excitation
	- Receptive Field: ~2.8KB covering multi-paragraph context
- Key Features:
  - 4 stages with progressive downsampling (32x reduction)
  - Dilated convolutions for larger receptive field
  - SE attention modules for channel recalibration
  - Global average + max pooling head
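
To make the building blocks listed above concrete, here is a minimal sketch of one depthwise-separable 1D convolution stage with a Squeeze-Excitation gate, plus a combined average and max pooling head. It is not the released architecture: the class names, channel counts, kernel size, normalization, activation, and SE reduction ratio are all assumptions chosen for illustration.

```python
import torch
import torch.nn as nn


class SqueezeExcite1d(nn.Module):
    """Channel recalibration: pool over the length axis, then gate each channel."""

    def __init__(self, channels: int, reduction: int = 4):  # reduction is an assumption
        super().__init__()
        hidden = max(channels // reduction, 1)
        self.gate = nn.Sequential(
            nn.Linear(channels, hidden),
            nn.ReLU(),
            nn.Linear(hidden, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [batch, channels, length]
        scale = self.gate(x.mean(dim=-1))   # [batch, channels]
        return x * scale.unsqueeze(-1)      # broadcast the gate over length


class SeparableSEBlock(nn.Module):
    """Hypothetical stage: depthwise conv -> pointwise conv -> norm -> act -> SE."""

    def __init__(self, in_ch: int, out_ch: int, kernel_size: int = 7,
                 stride: int = 2, dilation: int = 1):
        super().__init__()
        padding = dilation * (kernel_size - 1) // 2
        # groups=in_ch makes the convolution depthwise (one filter per channel)
        self.depthwise = nn.Conv1d(in_ch, in_ch, kernel_size, stride=stride,
                                   padding=padding, dilation=dilation,
                                   groups=in_ch, bias=False)
        # The 1x1 pointwise convolution mixes information across channels
        self.pointwise = nn.Conv1d(in_ch, out_ch, kernel_size=1, bias=False)
        self.norm = nn.BatchNorm1d(out_ch)
        self.act = nn.GELU()
        self.se = SqueezeExcite1d(out_ch)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.pointwise(self.depthwise(x))
        return self.se(self.act(self.norm(x)))


def pool_head(x: torch.Tensor) -> torch.Tensor:
    """Concatenate global average and global max pooling: [B, C, L] -> [B, 2C]."""
    return torch.cat([x.mean(dim=-1), x.amax(dim=-1)], dim=1)


if __name__ == "__main__":
    block = SeparableSEBlock(16, 32)
    features = block(torch.randn(2, 16, 4096))
    print(features.shape, pool_head(features).shape)
```

With stride 2, each such stage halves the sequence length, so a 4096-position input leaves this block with 2048 positions.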

	## Intended Uses & Limitations

	### Intended Uses
	- Automated content categorization for libraries and archives
	- Fiction/non-fiction filtering for content platforms
	- Educational content classification
	- Writing style analysis
	- Content recommendation systems

	### Limitations
	- Personal narratives: May misclassify personal journal entries and memoirs as fiction (observed ~97% fiction confidence on journal entries)
	- Mixed content: Struggles with creative non-fiction and narrative journalism
	- Length: Optimized for 512-4096 byte inputs; longer texts should be chunked
	- Language: Primarily trained on English text
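
The chunking advice above can be sketched as follows. This card does not say how chunk-level predictions should be combined, so the helper names and the simple averaging rule are assumptions; the splitter only guarantees that each piece fits within the byte limit without cutting through a multi-byte UTF-8 character.

```python
def chunk_by_bytes(text: str, max_bytes: int = 4096) -> list[str]:
    """Split text into pieces whose UTF-8 encoding is at most max_bytes each."""
    chunks, current, size = [], [], 0
    for ch in text:
        n = len(ch.encode("utf-8"))
        # Start a new chunk when adding this character would exceed the limit
        if size + n > max_bytes and current:
            chunks.append("".join(current))
            current, size = [], 0
        current.append(ch)
        size += n
    if current:
        chunks.append("".join(current))
    return chunks


def combine_probabilities(chunk_probs: list[float]) -> float:
    """Average per-chunk Non-Fiction probabilities (one simple, assumed choice)."""
    return sum(chunk_probs) / len(chunk_probs)
```

Each chunk would then be passed through the model separately, and the per-chunk probabilities combined into one document-level score.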

	## Training Data

	The model was trained on a diverse dataset of 85,000 samples (60k train, 15k validation, 10k test) drawn from:

	### Fiction Sources (50%)
1. Cosmopedia Stories (HuggingFaceTB/cosmopedia)
   - Synthetic fiction stories
   - License: Apache 2.0

2. Project Gutenberg (BEE-spoke-data/gutenberg-en-v1-clean)
   - Classic literature
   - License: Public Domain

3. Reddit WritingPrompts
   - Community-generated creative writing
   - Via synthetic alternatives

	### Non-Fiction Sources (50%)
1. Cosmopedia Educational (HuggingFaceTB/cosmopedia)
   - Textbooks, WikiHow, educational blogs
   - License: Apache 2.0

2. Scientific Papers (common-pile/arxiv_abstracts)
   - Academic abstracts and introductions
   - License: Various (permissive)

3. News Articles (ccdv/cnn_dailymail)
   - CNN and Daily Mail articles
   - License: Apache 2.0

	## Training Procedure

	### Preprocessing
	- Unicode NFC normalization
	- Whitespace normalization (max 2 consecutive spaces)
	- UTF-8 byte encoding
	- Padding/truncation to 4096 bytes
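
The four preprocessing steps above can be sketched in plain Python. This is a hypothetical stand-in for the repository's `preprocess_text`, not its actual code: the function name, the zero padding value, right-side padding, and the reading of "max 2 consecutive spaces" as collapsing longer runs of spaces are all assumptions.

```python
import re
import unicodedata

MAX_BYTES = 4096  # input length stated in this card
PAD_BYTE = 0      # assumption: the card does not name the padding value


def preprocess_to_bytes(text: str, max_bytes: int = MAX_BYTES) -> list[int]:
    """Return exactly max_bytes integers in the range 0-255."""
    # 1. Unicode NFC normalization (composes base letters with combining marks)
    text = unicodedata.normalize("NFC", text)
    # 2. Whitespace normalization: collapse runs of 3+ spaces down to 2
    text = re.sub(r" {3,}", "  ", text)
    # 3. UTF-8 byte encoding, truncated to the maximum length
    data = text.encode("utf-8")[:max_bytes]
    # 4. Right-pad short inputs up to the fixed length
    return list(data) + [PAD_BYTE] * (max_bytes - len(data))
```

Note that truncating at a fixed byte offset can cut a multi-byte character in half; a byte-level model still receives valid byte values in that case.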

	### Training Hyperparameters
	- Optimizer: AdamW (lr=3e-3, betas=(0.9, 0.98), weight_decay=0.01)
	- Schedule: Cosine decay with 5% warmup
	- Batch Size: 32
	- Epochs: 10
	- Label Smoothing: 0.05
	- Gradient Clipping: 1.0
	- Device: Apple M-series (MPS)
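
For readers who want to reproduce the schedule, the sketch below derives the step counts from the numbers in this card and evaluates a warmup-plus-cosine learning rate. The linear warmup shape, the decay to zero, and per-step (rather than per-epoch) updates are assumptions; the card only states "cosine decay with 5% warmup".

```python
import math

TRAIN_SAMPLES = 60_000   # training split size from the Training Data section
BATCH_SIZE = 32
EPOCHS = 10
PEAK_LR = 3e-3
WARMUP_FRACTION = 0.05

steps_per_epoch = math.ceil(TRAIN_SAMPLES / BATCH_SIZE)   # 1875
total_steps = steps_per_epoch * EPOCHS                    # 18750
warmup_steps = int(WARMUP_FRACTION * total_steps)         # 937


def lr_at(step: int) -> float:
    """Learning rate at optimizer step `step` (0-based)."""
    if step < warmup_steps:
        # Assumption: linear warmup up to the peak rate
        return PEAK_LR * (step + 1) / warmup_steps
    # Assumption: cosine decay from the peak rate down to zero
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * PEAK_LR * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    for step in (0, warmup_steps, total_steps // 2, total_steps - 1):
        print(step, f"{lr_at(step):.6f}")
```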

	## Evaluation Results

	### Validation Set (15,000 samples)
| Metric | Value |
|--------|-------|
| Accuracy | 99.91% |
| F1 Score | 0.9991 |
| ROC AUC | 0.9999 |
| Loss | 0.1194 |
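
A loss of 0.1194 may look high next to 99.91% accuracy. One possible explanation, offered here as an interpretation rather than something the card states, is the label smoothing of 0.05. The sketch below assumes the common binary formulation in which hard labels 0 and 1 become 0.025 and 0.975, and that the reported loss was computed against those smoothed targets.

```python
import math

SMOOTHING = 0.05  # label smoothing from the Training Hyperparameters section


def smooth_target(label: int, eps: float = SMOOTHING) -> float:
    """Assumed binary formulation: move each hard label eps/2 toward 0.5."""
    return label * (1.0 - eps) + 0.5 * eps


def bce(target: float, prob: float) -> float:
    """Binary cross-entropy for a single example."""
    return -(target * math.log(prob) + (1.0 - target) * math.log(1.0 - prob))


positive_target = smooth_target(1)   # 0.975
negative_target = smooth_target(0)   # 0.025

# Against a smoothed target the loss is smallest when prob == target,
# so it cannot fall below the entropy of the target itself.
loss_floor = bce(positive_target, positive_target)

if __name__ == "__main__":
    print(f"targets: {negative_target:.3f} / {positive_target:.3f}")
    print(f"loss floor: {loss_floor:.4f}")
```

Under these assumptions the floor is about 0.117, so 0.1194 would sit close to the best achievable value. The same reasoning would explain why confidence scores cluster a little above 97% instead of approaching 100%.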

	### Detailed Test Results on 18 Curated Samples

The model classified all 18 samples correctly, but its confidence varies by category:

| Category | Sample Title/Type | True Label | Predicted | Confidence | Analysis |
|----------|------------------|------------|-----------|------------|----------|
| FICTION - General | | | | | |
| Literary | Lighthouse Keeper Storm | Fiction | Fiction | 79.8% | ⚠️ Lowest confidence - realistic setting |
| Sci-Fi | Time Travel Bedroom | Fiction | Fiction | 97.2% | ✅ Clear fantastical elements |
| Mystery | Detective Rose Case | Fiction | Fiction | 97.3% | ✅ Strong narrative structure |
| FICTION - Children's | | | | | |
| Animal Tale | Benny's Carrot Problem | Fiction | Fiction | 97.1% | ✅ Clear storytelling markers |
| Fantasy | Princess Luna's Paintings | Fiction | Fiction | 97.3% | ✅ Magical elements detected |
| Magical | Tommy's Dream Sprites | Fiction | Fiction | 96.0% | ⚠️ Lower confidence - whimsical tone |
| FICTION - Fantasy | | | | | |
| Epic Fantasy | Shadowgate & Void Lords | Fiction | Fiction | 97.4% | ✅ High fantasy vocabulary |
| Magic System | Moonlight Weaver Elara | Fiction | Fiction | 96.8% | ✅ Complex world-building |
| Urban Fantasy | Dragon Memory Markets | Fiction | Fiction | 97.3% | ✅ Supernatural commerce |
| NON-FICTION - Academic | | | | | |
| Biology | Photosynthesis Process | Non-Fiction | Non-Fiction | 97.8% | ✅ Technical terminology |
| Mathematics | Calculus Theorem | Non-Fiction | Non-Fiction | 97.8% | ✅ Mathematical concepts |
| Economics | Market Equilibrium | Non-Fiction | Non-Fiction | 97.9% | ✅ Economic theory |
| NON-FICTION - News | | | | | |
| Financial | Federal Reserve Decision | Non-Fiction | Non-Fiction | 97.8% | ✅ Factual reporting style |
| Local Gov | Homeless Crisis Plan | Non-Fiction | Non-Fiction | 97.9% | ✅ Policy announcement format |
| Science | Exoplanet Discovery | Non-Fiction | Non-Fiction | 97.9% | ✅ Research reporting |
| NON-FICTION - Journals | | | | | |
| Financial | Wall Street Journal Market | Non-Fiction | Non-Fiction | 97.7% | ✅ Professional journalism |
| Scientific | Nature Research Report | Non-Fiction | Non-Fiction | 97.7% | ✅ Academic publication style |
| Personal | Kyoto Travel Log | Non-Fiction | Non-Fiction | 97.5% | ⚠️ Slightly lower - personal narrative |

	### Key Insights:
	- Weakest Performance: Realistic literary fiction (79.8% confidence) - the lighthouse story lacks obvious fantastical elements
	- Strongest Performance: Academic/news content (97.8-97.9% confidence) - clear technical/factual language
	- Edge Cases: Personal narratives and whimsical children's stories show slightly lower confidence
	- Perfect Accuracy: 18/18 samples correctly classified despite confidence variations

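### Test Results by Source Type (12 of the 18 Samples)

The 12 samples below are a subset of the 18 in the table above, regrouped by source type.

#### ✅ All 12 Samples Correctly Classified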

	Fiction Samples (3/3):
	1. Lighthouse keeper narrative → Fiction (79.8% conf)
	2. Time travel story → Fiction (97.2% conf)
	3. Detective mystery → Fiction (97.3% conf)

	Textbook Samples (3/3):
	1. Photosynthesis (Biology) → Non-Fiction (97.8% conf)
	2. Fundamental theorem (Calculus) → Non-Fiction (97.8% conf)
	3. Market equilibrium (Economics) → Non-Fiction (97.9% conf)

	News Articles (3/3):
	1. Federal Reserve decision → Non-Fiction (97.8% conf)
	2. City homeless initiative → Non-Fiction (97.9% conf)
	3. Exoplanet discovery → Non-Fiction (97.9% conf)

	Journal Articles (3/3):
	1. Wall Street Journal (Financial) → Non-Fiction (97.7% conf)
	2. Nature Scientific Reports → Non-Fiction (97.7% conf)
	3. Personal Travel Journal → Non-Fiction (97.5% conf)

	## How to Use

	### PyTorch

```python
import torch
from model import TinyByteCNN, preprocess_text

# Load model and switch to inference mode
model = TinyByteCNN.from_pretrained("username/tinybytecnn-fiction-detector")
model.eval()

# Prepare text
text = "Your text here..."
input_bytes = preprocess_text(text)  # Returns tensor of shape [1, 4096]

# Predict: the sigmoid output is the probability of the Non-Fiction class
with torch.no_grad():
    logits = model(input_bytes)
    probability = torch.sigmoid(logits).item()

if probability > 0.5:
    print(f"Non-Fiction (confidence: {probability:.1%})")
else:
    print(f"Fiction (confidence: {1 - probability:.1%})")
```

	### Batch Processing

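```python
import torch
from model import preprocess_text


def classify_texts(texts, model, batch_size=32):
    """Classify a list of strings; returns one result dict per input text."""
    results = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        # preprocess_text returns [1, 4096], so concatenate along the batch
        # dimension to get [B, 4096] (torch.stack would give [B, 1, 4096])
        inputs = torch.cat([preprocess_text(t) for t in batch], dim=0)

        with torch.no_grad():
            logits = model(inputs)
            # Flatten so that [B] and [B, 1] outputs are handled alike
            probs = torch.sigmoid(logits).flatten().tolist()

        for text, prob in zip(batch, probs):
            results.append({
                'text': text[:100] + '...',
                'class': 'Non-Fiction' if prob > 0.5 else 'Fiction',
                'confidence': prob if prob > 0.5 else 1 - prob,
            })

    return results
```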

	## Training Infrastructure

	- Hardware: Apple M-series with 8GB MPS memory limit
	- Training Time: ~20 minutes
	- Framework: PyTorch 2.0+

	## Environmental Impact

	- Hardware Type: Apple Silicon M-series
	- Hours used: 0.33
- Carbon Emitted: Minimal (~10W average draw over 0.33 hours is roughly 3.3 Wh of energy)

	## Citation

```bibtex
@misc{tinybytecnn-fiction-2024,
  title={TinyByteCNN Fiction vs Non-Fiction Detector},
  author={Mitchell Currie},
  year={2024},
  publisher={HuggingFace},
  url={https://huggingface.co/username/tinybytecnn-fiction-detector}
}
```

	## Acknowledgments

	This model uses data from:
	- HuggingFace Team (Cosmopedia dataset)
	- Project Gutenberg
	- Common Pile contributors
	- CNN/Daily Mail dataset creators

	## License

	Apache 2.0

	## Contact

	For questions or issues, please open an issue on the [model repository](https://huggingface.co/username/tinybytecnn-fiction-detector).