innit: Fast English vs Non-English Text Detection

A lightweight byte-level CNN for fast binary language detection (English vs Non-English).

Model Details

  • Model Type: Byte-level Convolutional Neural Network
  • Task: Binary text classification (English vs Non-English)
  • Architecture: TinyByteCNN_EN with 6 convolutional blocks
  • Parameters: 156,642
  • Input: Raw UTF-8 bytes (max 256 bytes)
  • Output: Binary classification (0=Non-English, 1=English)
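
To make the input format concrete, here is a minimal sketch of turning a string into the fixed-length byte sequence described above. The helper name `encode_text` is hypothetical, and padding with 0 mirrors the Quick Start examples rather than a confirmed detail of the model's padding index. The 256 limit counts bytes, not characters, so non-ASCII text reaches it sooner.

```python
MAX_BYTES = 256  # input length stated in Model Details


def encode_text(text, max_len=MAX_BYTES):
    """Return a list of max_len ints: UTF-8 byte values, right-padded with 0.

    Hypothetical helper; the pad value 0 follows the Quick Start examples.
    """
    raw = text.encode("utf-8", errors="ignore")[:max_len]
    return list(raw) + [0] * (max_len - len(raw))


ascii_ids = encode_text("Hello world!")
cjk_ids = encode_text("你好")  # each of these characters is 3 bytes in UTF-8

print(len(ascii_ids), ascii_ids[:5])   # 256 [72, 101, 108, 108, 111]
print(len("你好".encode("utf-8")))      # 6
```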

Performance

  • Validation Accuracy: 99.94%
  • Challenge Set Accuracy: 100% (14/14 test cases)
  • Inference Speed: Sub-millisecond on modern CPUs
  • Model Size: ~600KB
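
The reported parameter count and file size can be cross-checked against the layer sizes listed under Architecture. The sketch below assumes details this card does not spell out: every convolution maps 80 to 80 channels with a bias term, each BatchNorm1D has a learnable scale and shift, and the head is two biased linear layers. Under those assumptions the total matches the 156,642 figure, and at 4 bytes per FP32 weight it comes to roughly 612 KiB.

```python
EMB, BLOCKS, KERNEL, VOCAB = 80, 6, 3, 257

embedding = VOCAB * EMB                       # 257 x 80 lookup table
conv = BLOCKS * (EMB * EMB * KERNEL + EMB)    # assumed 80->80 Conv1D with bias
batchnorm = BLOCKS * 2 * EMB                  # assumed learnable scale + shift
head = (3 * EMB * EMB + EMB) + (EMB * 2 + 2)  # 240 -> 80 -> 2, assumed biases

total = embedding + conv + batchnorm + head
size_kib = total * 4 / 1024                   # FP32 = 4 bytes per parameter

print(total)             # 156642
print(round(size_kib))   # 612
```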

Supported Languages

Trained to distinguish English from 52+ languages across diverse scripts:

  • Latin scripts: Spanish, French, German, Italian, Dutch, Portuguese, etc.
  • CJK scripts: Chinese (Simplified/Traditional), Japanese, Korean
  • Cyrillic scripts: Russian, Ukrainian, Bulgarian, Serbian
  • Other scripts: Arabic, Hindi, Bengali, Thai, Hebrew, etc.

Architecture

TinyByteCNN_EN:
├── Embedding: 257 → 80 dimensions (256 bytes + padding)
├── 6x Convolutional Blocks:
│   ├── Conv1D (kernel=3, residual connections)
│   ├── GELU activation
│   ├── BatchNorm1D
│   └── Dropout (0.15)
├── Enhanced Pooling: mean + max + std
└── Classification Head: 240 → 80 → 2
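
The 240-dimensional input to the classification head follows from concatenating three pooled views of the 80-channel feature map. Below is a minimal NumPy sketch of that pooling step. It assumes features are laid out as (sequence length, channels) and that pooling runs over the whole sequence. Whether the real model masks padded positions is not stated here, and the function name is illustrative.

```python
import numpy as np


def enhanced_pool(features):
    """Concatenate mean, max and std over the sequence axis.

    features: array of shape (seq_len, channels).
    Illustrative only; the released model may mask padding or use a
    different std estimator.
    """
    return np.concatenate([
        features.mean(axis=0),
        features.max(axis=0),
        features.std(axis=0),
    ])


rng = np.random.default_rng(0)
features = rng.standard_normal((256, 80))  # stand-in for the conv output
pooled = enhanced_pool(features)
print(pooled.shape)  # (240,)
```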

Training Data

  • Total samples: 17,543 balanced samples
  • English: 8,772 samples from diverse sources
  • Non-English: 8,771 samples across 52+ languages
  • Text lengths: 3-276 characters (optimized for short texts)
  • Special coverage: Emoji handling, mathematical formulas, scientific notation

Quick Start

Option 1: ONNX Runtime (Recommended)

import onnxruntime as ort
import numpy as np

# Load ONNX model
session = ort.InferenceSession("model.onnx")

def predict(text):
    # Prepare input
    bytes_data = text.encode('utf-8', errors='ignore')[:256]
    padded = np.zeros(256, dtype=np.int64)
    padded[:len(bytes_data)] = list(bytes_data)
    
    # Run inference
    outputs = session.run(['logits'], {'input_bytes': padded.reshape(1, -1)})
    logits = outputs[0][0]
    
    # Apply softmax
    exp_logits = np.exp(logits - np.max(logits))
    probs = exp_logits / np.sum(exp_logits)
    return probs[1]  # English probability

# Examples
print(predict("Hello world!"))           # ~1.0 (English)
print(predict("Bonjour le monde"))       # ~0.0 (French)
print(predict("Check our sale! 🎉"))     # ~1.0 (English with emoji)

Option 2: Python Package

# Install the utility package
pip install innit-detector

# CLI usage
innit "Hello world!"                    # → English (confidence: 0.974)
innit --download                        # Download model first
innit "Hello" "Bonjour" "你好"          # Multiple texts

# Library usage
from innit_detector import InnitDetector
detector = InnitDetector()
result = detector.predict("Hello world!")
print(result['is_english'])  # True

Option 3: PyTorch (Advanced)

import torch
import torch.nn.functional as F
from safetensors.torch import load_file
import numpy as np

# Load model (requires TinyByteCNN_EN class definition)
state_dict = load_file("model.safetensors")
model = TinyByteCNN_EN(emb=80, blocks=6, dropout=0.15)
model.load_state_dict(state_dict)
model.eval()

def predict(text):
    bytes_data = text.encode('utf-8', errors='ignore')[:256]
    # int64 maps to torch.long; np.long is not portable across NumPy versions
    padded = np.zeros(256, dtype=np.int64)
    padded[:len(bytes_data)] = list(bytes_data)
    
    with torch.no_grad():
        logits = model(torch.tensor(padded).unsqueeze(0))
        probs = F.softmax(logits, dim=1)
        return probs[0][1].item()

ONNX Support

ONNX version available for cross-platform deployment:

  • model.onnx - Full precision (FP32) for maximum compatibility

Challenge Set Results

100% accuracy (14/14) on a challenge set covering:

  • Ultra-short texts: "Good morning!" ✅
  • Emoji handling: "Check out our sale! 🎉" ✅
  • Mathematical formulas: "x = (-b ± √(b²-4ac))/2a" ✅
  • Scientific notation: "CO₂ + H₂O → C₆H₁₂O₆" ✅
  • Diverse scripts: Arabic, CJK, Cyrillic, Devanagari ✅
  • English-like languages: Dutch, German ✅
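
A check like this can be reproduced with a small harness. The sketch below is illustrative: `accuracy` is a hypothetical helper, the 0.5 decision threshold is an assumption, and the three cases are examples quoted in this card rather than the full 14-case set. A stub stands in for the model so the snippet runs on its own.

```python
def accuracy(cases, predict, threshold=0.5):
    """Fraction of (text, is_english) pairs that predict classifies correctly."""
    correct = sum((predict(text) >= threshold) == label for text, label in cases)
    return correct / len(cases)


# Labels taken from examples in this card; not the full 14-case challenge set.
cases = [
    ("Good morning!", True),
    ("Check out our sale! 🎉", True),
    ("Bonjour le monde", False),
]


# Stub so the snippet runs without the model; swap in the real predict().
def stub_predict(text):
    return 0.0 if text.startswith("Bonjour") else 1.0


print(accuracy(cases, stub_predict))  # 1.0
```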

Limitations

  • Binary classification only (English vs Non-English)
  • Optimized for texts up to 256 UTF-8 bytes
  • May have reduced accuracy on very rare languages not in training data
  • Not suitable for multilingual text (mixed languages in single input)
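
One possible workaround for the 256-byte limit, not something this model card prescribes, is to split longer input into windows that each fit the limit and average the per-window English probabilities. The sketch below assumes a `predict` callable like the ones in Quick Start. `split_utf8` and `predict_long` are hypothetical helper names, and averaging is only one of several reasonable ways to combine the scores.

```python
def split_utf8(text, max_bytes=256):
    """Split text into chunks whose UTF-8 encoding fits in max_bytes,
    without cutting a multi-byte character in half."""
    chunks, current, size = [], [], 0
    for ch in text:
        n = len(ch.encode("utf-8"))
        if size + n > max_bytes and current:
            chunks.append("".join(current))
            current, size = [], 0
        current.append(ch)
        size += n
    if current:
        chunks.append("".join(current))
    return chunks


def predict_long(text, predict, max_bytes=256):
    """Average the English probability over all chunks of text."""
    chunks = split_utf8(text, max_bytes)
    if not chunks:
        raise ValueError("empty text")
    return sum(predict(c) for c in chunks) / len(chunks)


# Stub standing in for the real model, for demonstration only.
def fake_predict(chunk):
    return 1.0 if chunk.isascii() else 0.0


chunks = split_utf8("é" * 300)  # 600 bytes of 2-byte characters
print([len(c.encode("utf-8")) for c in chunks])  # [256, 256, 88]
```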

License

MIT License - free for commercial use.
