innit: Fast English vs Non-English Text Detection
A lightweight byte-level CNN for fast binary language detection (English vs Non-English).
Model Details
- Model Type: Byte-level Convolutional Neural Network
- Task: Binary text classification (English vs Non-English)
- Architecture: TinyByteCNN_EN with 6 convolutional blocks
- Parameters: 156,642
- Input: Raw UTF-8 bytes (max 256 bytes)
- Output: Binary classification (0=Non-English, 1=English)
Performance
- Validation Accuracy: 99.94%
- Challenge Set Accuracy: 100% (14/14 test cases)
- Inference Speed: Sub-millisecond on modern CPUs
- Model Size: ~600KB
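To sanity-check the latency claim on your own hardware, here is a rough timing sketch. It assumes the model.onnx file and the input_bytes/logits tensor names used in the Quick Start below; absolute numbers will vary by CPU:

```python
import time
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("model.onnx")

# One padded single-sample input (12 bytes of text, zero-padded to 256)
x = np.zeros((1, 256), dtype=np.int64)
x[0, :12] = list(b"Hello world!")

# Warm up, then time repeated single-sample inference
for _ in range(10):
    session.run(['logits'], {'input_bytes': x})
n = 1000
start = time.perf_counter()
for _ in range(n):
    session.run(['logits'], {'input_bytes': x})
print(f"{(time.perf_counter() - start) / n * 1e3:.3f} ms per call")
```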
Supported Languages
Trained to distinguish English from 52+ languages across diverse scripts:
- Latin scripts: Spanish, French, German, Italian, Dutch, Portuguese, etc.
- CJK scripts: Chinese (Simplified/Traditional), Japanese, Korean
- Cyrillic scripts: Russian, Ukrainian, Bulgarian, Serbian
- Other scripts: Arabic, Hindi, Bengali, Thai, Hebrew, etc.
Architecture
```
TinyByteCNN_EN:
├── Embedding: 257 → 80 dimensions (256 byte values + padding)
├── 6x Convolutional Blocks:
│   ├── Conv1D (kernel=3, residual connections)
│   ├── GELU activation
│   ├── BatchNorm1D
│   └── Dropout (0.15)
├── Enhanced Pooling: mean + max + std
└── Classification Head: 240 → 80 → 2
```
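For reference, below is a minimal PyTorch sketch consistent with this layer list. With these shapes it reproduces the stated 156,642-parameter count, but details the card doesn't pin down (where the residual add sits, the activation inside the classification head) are assumptions. Option 3 below assumes a class definition like this is in scope:

```python
import torch
import torch.nn as nn

class TinyByteCNN_EN(nn.Module):
    def __init__(self, emb=80, blocks=6, dropout=0.15):
        super().__init__()
        # 257 embedding entries: 256 byte values + padding, per the card
        self.embed = nn.Embedding(257, emb)
        self.blocks = nn.ModuleList(
            nn.Sequential(
                nn.Conv1d(emb, emb, kernel_size=3, padding=1),
                nn.GELU(),
                nn.BatchNorm1d(emb),
                nn.Dropout(dropout),
            )
            for _ in range(blocks)
        )
        # mean + max + std pooling concatenated: 3 * emb = 240 features
        self.head = nn.Sequential(nn.Linear(3 * emb, emb), nn.GELU(), nn.Linear(emb, 2))

    def forward(self, x):
        # x: (batch, 256) int64 byte ids
        h = self.embed(x).transpose(1, 2)      # (batch, emb, seq)
        for block in self.blocks:
            h = h + block(h)                   # residual connection (assumed placement)
        pooled = torch.cat([h.mean(2), h.amax(2), h.std(2)], dim=1)
        return self.head(pooled)               # (batch, 2) logits
```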
Training Data
- Total samples: 17,543 (class-balanced)
- English: 8,772 samples from diverse sources
- Non-English: 8,771 samples across 52+ languages
- Text lengths: 3-276 characters (optimized for short texts)
- Special coverage: Emoji handling, mathematical formulas, scientific notation
Quick Start
Option 1: ONNX Runtime (Recommended)
```python
import onnxruntime as ort
import numpy as np

# Load the ONNX model
session = ort.InferenceSession("model.onnx")

def predict(text):
    # Encode to UTF-8 bytes, truncate to 256, zero-pad the rest
    bytes_data = text.encode('utf-8', errors='ignore')[:256]
    padded = np.zeros(256, dtype=np.int64)
    padded[:len(bytes_data)] = list(bytes_data)

    # Run inference
    outputs = session.run(['logits'], {'input_bytes': padded.reshape(1, -1)})
    logits = outputs[0][0]

    # Softmax over the two logits
    exp_logits = np.exp(logits - np.max(logits))
    probs = exp_logits / np.sum(exp_logits)
    return probs[1]  # English probability

# Examples
print(predict("Hello world!"))          # ~1.0 (English)
print(predict("Bonjour le monde"))      # ~0.0 (French)
print(predict("Check our sale! 🎉"))    # ~1.0 (English with emoji)
```
Option 2: Python Package
```bash
# Install the utility package
pip install innit-detector

# CLI usage
innit --download                 # Download the model first
innit "Hello world!"             # → English (confidence: 0.974)
innit "Hello" "Bonjour" "你好"   # Multiple texts
```

```python
# Library usage
from innit_detector import InnitDetector

detector = InnitDetector()
result = detector.predict("Hello world!")
print(result['is_english'])  # True
```
Option 3: PyTorch (Advanced)
```python
import torch
import torch.nn.functional as F
from safetensors.torch import load_file
import numpy as np

# Load weights (requires the TinyByteCNN_EN class definition;
# see the sketch under Architecture above)
state_dict = load_file("model.safetensors")
model = TinyByteCNN_EN(emb=80, blocks=6, dropout=0.15)
model.load_state_dict(state_dict)
model.eval()

def predict(text):
    # Encode to UTF-8 bytes, truncate to 256, zero-pad the rest
    bytes_data = text.encode('utf-8', errors='ignore')[:256]
    padded = np.zeros(256, dtype=np.int64)  # note: np.long is removed in NumPy 2.x
    padded[:len(bytes_data)] = list(bytes_data)
    with torch.no_grad():
        logits = model(torch.tensor(padded).unsqueeze(0))
        probs = F.softmax(logits, dim=1)
    return probs[0][1].item()  # English probability
```
ONNX Support
An ONNX export is available for cross-platform deployment:
- model.onnx: full precision (FP32) for maximum compatibility
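The Quick Start code assumes the graph's input and output tensors are named input_bytes and logits; you can confirm the names of any export at load time:

```python
import onnxruntime as ort

session = ort.InferenceSession("model.onnx")

# Print tensor names, shapes, and dtypes to confirm the
# 'input_bytes'/'logits' names assumed in the Quick Start snippet
for i in session.get_inputs():
    print("input:", i.name, i.shape, i.type)
for o in session.get_outputs():
    print("output:", o.name, o.shape, o.type)
```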
Challenge Set Results
100% accuracy (14/14) on a challenge set covering:
- Ultra-short texts: "Good morning!" ✓
- Emoji handling: "Check out our sale! 🎉" ✓
- Mathematical formulas: "x = (-b ± √(b² - 4ac))/2a" ✓
- Chemical/scientific notation: "CO₂ + H₂O → C₆H₁₂O₆" ✓
- Diverse scripts: Arabic, CJK, Cyrillic, Devanagari ✓
- English-like languages: Dutch, German ✓
Limitations
- Binary classification only (English vs Non-English)
- Optimized for texts up to 256 UTF-8 bytes
- Accuracy may be lower on rare languages absent from the training data
- Not suitable for mixed-language text (multiple languages in a single input)
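For inputs longer than 256 bytes, one workaround (not part of the package; a sketch that assumes the predict() function from the Quick Start) is to score fixed-size windows and average:

```python
def predict_long(text, window=256):
    # Split into UTF-8 windows of at most `window` bytes and average
    # the per-window English probabilities. Splitting may cut a
    # multi-byte character at a boundary, which the errors='ignore'
    # decode/encode round-trip tolerates.
    data = text.encode('utf-8', errors='ignore')
    chunks = [data[i:i + window] for i in range(0, len(data), window)] or [b""]
    scores = [predict(c.decode('utf-8', errors='ignore')) for c in chunks]
    return sum(scores) / len(scores)
```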
License
MIT License - free for commercial use.