UTF-8 / Unicode 17.1 Gateway
GPU-accelerated UTF-8 decoding and normalization for modern text pipelines. WebGPU, Metal, and C fallback — one Unicode 17.1 gate for all your text.
What is Decoder?
Decoder is a UTF-8 + Unicode 17.1 compliant decoder and normalizer. It runs on WebGPU (browser / server), Metal (Apple Silicon), and includes a portable C fallback for environments without GPU access.
Decoder started as a WebGPU UTF-8 normalizer for a BPE tokenizer. It quickly became a standalone module: a generic "first gate" for any text pipeline where raw bytes enter and normalized, validated, grapheme-aware text leaves.
The first gate for any text pipeline: ingest → normalize → segment → tokenize.
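The stage order above can be sketched on the CPU in a few lines. This is only an illustration of the ordering, not Decoder's API: the function names (`ingest`, `normalize`, `segment`, `run_gate`) are hypothetical, NFC is an assumed normalization form, and the whitespace split is a stand-in for real grapheme or word segmentation.

```python
import unicodedata


def ingest(raw: bytes) -> str:
    # Strict decode: raises UnicodeDecodeError on malformed UTF-8.
    return raw.decode("utf-8", errors="strict")


def normalize(text: str) -> str:
    # NFC is an assumption here; the source does not name a form.
    return unicodedata.normalize("NFC", text)


def segment(text: str) -> list[str]:
    # Placeholder: whitespace split stands in for real segmentation.
    return text.split()


def run_gate(raw: bytes) -> list[str]:
    # ingest -> normalize -> segment; a tokenizer would consume the result.
    return segment(normalize(ingest(raw)))
```

Anything that fails the first stage never reaches the later ones, which is the point of putting the gate in front of the tokenizer.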
Platform Matrix
WebGPU Engine
- Runs in browser and server WebGPU
- Ideal for crawlers, dashboards, and client-side pipelines
- Streams UTF-8 validation + normalization on the GPU
Metal Engine
- Optimized for Apple Silicon (M-series)
- Tight integration with Swift / Metal compute
- Feeds compiler frontends, IDEs, and local tools
C Fallback
- Portable CPU path for any environment
- Drop-in library where GPU isn't available
- Same normalization semantics and Unicode 17.1 behavior
Where Can You Use It?
01 Ingest & Preprocessing
- OCR / ASR post-processing and noise cleanup
- Web crawling with UTF-8 validation
- RSS / email ingest, BOM cleanup
- PDFs, scanned docs, mixed encodings
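As a small illustration of the BOM cleanup item, the sketch below strips a leading UTF-8 byte order mark before the bytes go any further. The helper name `strip_utf8_bom` is hypothetical; it only handles the UTF-8 BOM, not UTF-16 or UTF-32 marks.

```python
import codecs


def strip_utf8_bom(raw: bytes) -> bytes:
    # codecs.BOM_UTF8 is the three-byte sequence EF BB BF.
    if raw.startswith(codecs.BOM_UTF8):
        return raw[len(codecs.BOM_UTF8):]
    return raw
```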
02 Security & Compliance
- UTF-8 gate in WAF / CDN edge
- Overlong, surrogate, out-of-range detection
- BiDi override detection (Trojan Source)
- Homoglyph / mixed-script anomalies
- DLP / PII pre-pass with stable hashes
03 ML & Data Pipelines
- Pre-tokenization normalization
- Corpus cleaning for LM training
- Character-level stats (n-gram, histograms)
- Duplicate reduction via Unicode equivalence
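Duplicate reduction via Unicode equivalence can be sketched as keying each line by a normalized form. The example assumes NFC (canonical equivalence) and a hypothetical helper name `dedupe_by_nfc`; a real corpus pass might choose a different form or add case folding.

```python
import unicodedata


def dedupe_by_nfc(lines):
    # Keep the first occurrence of each canonically equivalent line.
    seen = set()
    unique = []
    for line in lines:
        key = unicodedata.normalize("NFC", line)
        if key not in seen:
            seen.add(key)
            unique.append(key)
    return unique
```

Here a precomposed "é" (U+00E9) and "e" followed by a combining acute accent (U+0301) collapse into one entry, while plain "cafe" stays separate.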
04 Logs & Indexing
- GB-scale log processing on GPU
- Line / word boundary detection
- Pre-pass for full-text and vector indexes
- Grapheme / word segmentation before BPE
05 Dev Tools & Editors
- UTF-8 → codepoint → token pipeline
- Grapheme-aware cursor and selection
- Diff / merge with word boundaries
- Emoji, ZWJ chains, RI flags
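To show why a cursor has to be grapheme-aware, here is a deliberately partial sketch that handles only regional indicator (RI) flags: two consecutive RI code points form one flag, so a cursor must not stop between them. The function names are hypothetical, and this is not a full grapheme segmenter; ZWJ chains, combining marks, and the remaining boundary rules are left out.

```python
def is_regional_indicator(ch: str) -> bool:
    # Regional indicator symbols occupy U+1F1E6..U+1F1FF.
    return 0x1F1E6 <= ord(ch) <= 0x1F1FF


def split_flags(text: str) -> list[str]:
    # Pair consecutive regional indicators; every other code point
    # is emitted on its own (a simplification).
    clusters = []
    i = 0
    while i < len(text):
        if (is_regional_indicator(text[i])
                and i + 1 < len(text)
                and is_regional_indicator(text[i + 1])):
            clusters.append(text[i:i + 2])
            i += 2
        else:
            clusters.append(text[i])
            i += 1
    return clusters
```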
06 Media & Real-Time
- SRT / VTT / TTML subtitle decoding
- Streaming decode with time offsets
- Messaging normalization (emoji, mentions)
- IoT / gateway text compliance filter
Unicode 17.1 Compliance & Security
Decoder acts as a UTF-8 / Unicode policy enforcement layer before anything else sees the bytes. It's designed to catch malformed input at the edge of your system.
→ OWASP Unicode Encoding Attacks
→ OWASP Input Validation Cheat Sheet
→ Trojan Source / invisible vulnerabilities
Overlong Sequences
Detects and rejects overlong UTF-8 encodings used to bypass security filters, such as 0xC0 0xAF standing in for "/".
Surrogates & Out-of-Range
Validates decoded code points against the Unicode range U+0000 to U+10FFFF; rejects surrogates (U+D800 to U+DFFF) and other invalid sequences.
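The overlong, surrogate, and out-of-range checks can be illustrated with a scalar CPU sketch. It is written for readability rather than speed and is not Decoder's GPU kernel; the function name `classify_utf8` and its return labels are hypothetical.

```python
def classify_utf8(data: bytes) -> str:
    """Return 'ok' or the first violation: 'overlong', 'surrogate',
    'out-of-range', or 'malformed'."""
    i, n = 0, len(data)
    while i < n:
        b0 = data[i]
        if b0 < 0x80:  # ASCII
            i += 1
            continue
        # Lead byte gives the sequence length, the payload bits, and the
        # smallest code point that legitimately needs this length.
        if 0xC0 <= b0 <= 0xDF:
            length, cp, min_cp = 2, b0 & 0x1F, 0x80
        elif 0xE0 <= b0 <= 0xEF:
            length, cp, min_cp = 3, b0 & 0x0F, 0x800
        elif 0xF0 <= b0 <= 0xF7:
            length, cp, min_cp = 4, b0 & 0x07, 0x10000
        else:
            return "malformed"  # stray continuation byte or 0xF8..0xFF
        if i + length > n:
            return "malformed"  # truncated sequence
        for b in data[i + 1:i + length]:
            if b & 0xC0 != 0x80:
                return "malformed"  # not a continuation byte
            cp = (cp << 6) | (b & 0x3F)
        if cp < min_cp:
            return "overlong"
        if 0xD800 <= cp <= 0xDFFF:
            return "surrogate"
        if cp > 0x10FFFF:
            return "out-of-range"
        i += length
    return "ok"
```

For example, `classify_utf8(b"\xc0\xaf")` reports `"overlong"` because those two bytes decode to U+002F, which fits in a single byte.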
BiDi Overrides
Flags dangerous bidirectional control characters, such as the embedding, override, and isolate controls (U+202A to U+202E, U+2066 to U+2069) used in Trojan Source style attacks.
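A minimal sketch of BiDi control flagging: scan decoded text for the embedding, override, and isolate controls and report where they occur. The helper name `find_bidi_controls` is hypothetical, and what to do with a hit (reject, strip, or warn) is a policy decision left open here.

```python
# Bidirectional formatting controls commonly abused in Trojan Source attacks.
BIDI_CONTROLS = {
    "\u202A": "LRE", "\u202B": "RLE", "\u202C": "PDF",
    "\u202D": "LRO", "\u202E": "RLO",
    "\u2066": "LRI", "\u2067": "RLI", "\u2068": "FSI", "\u2069": "PDI",
}


def find_bidi_controls(text: str) -> list[tuple[int, str]]:
    # Return (index, abbreviation) for every BiDi control in the text.
    return [(i, BIDI_CONTROLS[ch])
            for i, ch in enumerate(text)
            if ch in BIDI_CONTROLS]
```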
Mixed Scripts & Homoglyphs
Detects confusable characters and suspicious script mixing patterns.
Pipeline Diagram
Engineering
Multi-GB/s Class Throughput
Designed for high-throughput pipelines on modern GPUs. Process logs, crawl data, and corpus files at GPU speeds.
Low-Latency Streaming
Ideal for log pipes, ingest services, and real-time text processing where latency matters.
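Streaming decode has one recurring problem: a multi-byte sequence can be split across chunk boundaries. The CPU sketch below uses Python's incremental UTF-8 decoder to carry the partial sequence over to the next chunk. The generator name `decode_stream` is hypothetical, and this shows the technique in general rather than Decoder's streaming interface.

```python
import codecs


def decode_stream(chunks):
    # The incremental decoder buffers an incomplete trailing sequence
    # until the next chunk supplies the remaining bytes.
    decoder = codecs.getincrementaldecoder("utf-8")(errors="strict")
    for chunk in chunks:
        text = decoder.decode(chunk)
        if text:
            yield text
    # final=True raises UnicodeDecodeError if the stream ended mid-sequence.
    tail = decoder.decode(b"", final=True)
    if tail:
        yield tail
```

Each yielded piece is already valid text, so downstream stages can start work before the stream ends.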
GPU-First Approach
Explores what happens when proven CPU UTF-8 validation techniques (from Red Hat, Intel, simdjson) move to WebGPU and Metal.
Live Demo
Coming soon — paste text to run it through the Decoder pipeline.
About Decoder
Decoder is an independent research project developed at TLabs in Türkiye. It grew out of experiments with WebGPU BPE tokenizers and now targets any pipeline where text comes in and everything downstream must trust it.
Decoder is under active development. Not all parts are public yet; core engines will be released on GitHub. It's designed as a reusable module, not a SaaS.