Independent research project by TLabs · Built in Türkiye

UTF-8 / Unicode 17.1 Gateway

GPU-accelerated UTF-8 decoding and normalization for modern text pipelines. WebGPU, Metal, and C fallback — one Unicode 17.1 gate for all your text.

Overview

What is Decoder?

Decoder is a UTF-8 decoder and normalizer that targets Unicode 17.1. It runs on WebGPU (browser and server) and Metal (Apple Silicon), and it includes a portable C fallback for environments without GPU access.

Decoder started as a WebGPU UTF-8 normalizer for a BPE tokenizer. It quickly became a standalone module: a generic "first gate" for any text pipeline, where bytes enter and normalized, validated, grapheme-aware text leaves.

The first gate for any text pipeline: ingest → normalize → segment → tokenize.

Engines

Platform Matrix

WebGPU Engine

› Runs on WebGPU in the browser and on the server
› Ideal for crawlers, dashboards, and client-side pipelines
› Streams UTF-8 validation + normalization on the GPU

Metal Engine

› Optimized for Apple Silicon (M-series)
› Tight integration with Swift / Metal compute
› Feeds compiler frontends, IDEs, and local tools

C Fallback

› Portable CPU path for any environment
› Drop-in library where GPU isn't available
› Same normalization semantics and Unicode 17.1 behavior

Applications

Where Can You Use It?

01 Ingest & Preprocessing

• OCR / ASR post-processing and noise cleanup
• Web crawling with UTF-8 validation
• RSS / email ingest, BOM cleanup
• PDFs, scanned docs, mixed encodings
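
As an illustration of the BOM cleanup item above, here is a minimal Python sketch. The helper name `strip_utf8_bom` is hypothetical and is not part of any published Decoder API; it only shows the byte-level idea.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Drop a leading UTF-8 byte order mark (EF BB BF) if one is present.

    Hypothetical helper for illustration; not Decoder's actual interface.
    """
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data
```

When the input is decoded straight to text, Python's built-in `utf-8-sig` codec has the same effect: `data.decode("utf-8-sig")` skips a leading BOM if there is one.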

02 Security & Compliance

• UTF-8 gate in WAF / CDN edge
• Overlong, surrogate, out-of-range detection
• BiDi override detection (Trojan Source)
• Homoglyph / mixed-script anomalies
• DLP / PII pre-pass with stable hashes

03 ML & Data Pipelines

• Pre-tokenization normalization
• Corpus cleaning for LM training
• Character-level stats (n-gram, histograms)
• Duplicate reduction via Unicode equivalence
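
Duplicate reduction via Unicode equivalence can be sketched in a few lines of Python using the standard `unicodedata` module. The function name `dedupe_by_equivalence` is an assumption made for this example; it keeps the first spelling seen for each normalized key.

```python
import unicodedata


def dedupe_by_equivalence(lines, form="NFC"):
    """Keep the first line of each group that is equal after normalization.

    Illustrative only: the normalized form is used as the comparison key,
    while the original spelling of the first occurrence is preserved.
    """
    seen = set()
    kept = []
    for line in lines:
        key = unicodedata.normalize(form, line)
        if key not in seen:
            seen.add(key)
            kept.append(line)
    return kept
```

With NFC, a precomposed "é" (U+00E9) and "e" followed by a combining acute accent (U+0301) collapse to one entry. With NFKC, compatibility characters such as the "ﬁ" ligature (U+FB01) also match their plain spelling.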

04 Logs & Indexing

• GB-scale log processing on GPU
• Line / word boundary detection
• Pre-pass for full-text and vector indexes
• Grapheme / word segmentation before BPE

05 Dev Tools & Editors

• UTF-8 → codepoint → token pipeline
• Grapheme-aware cursor and selection
• Diff / merge with word boundaries
• Emoji, ZWJ sequences, regional-indicator (RI) flags
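
To show why grapheme awareness matters for cursors and selection, here is a deliberately simplified Python sketch. It is not an implementation of the Unicode text segmentation rules (UAX #29) and not Decoder's algorithm: it only handles combining marks, ZWJ joins, emoji skin-tone modifiers, and regional-indicator pairs, and it approximates "pictographic" characters with the general category `So`. All function names are hypothetical.

```python
import unicodedata

ZWJ = "\u200d"  # ZERO WIDTH JOINER


def is_regional_indicator(ch):
    return 0x1F1E6 <= ord(ch) <= 0x1F1FF


def is_extender(ch):
    """Characters that attach to the preceding cluster in this sketch."""
    return (
        unicodedata.category(ch) in ("Mn", "Me", "Mc")  # combining marks, variation selectors
        or ch == ZWJ
        or 0x1F3FB <= ord(ch) <= 0x1F3FF  # emoji skin-tone modifiers
    )


def simple_graphemes(text):
    """Split text into approximate user-perceived characters (simplified rules)."""
    clusters = []
    for ch in text:
        if clusters:
            prev = clusters[-1]
            # Two regional indicators form one flag; a third starts a new one.
            ri_pair = (
                is_regional_indicator(ch)
                and len(prev) == 1
                and is_regional_indicator(prev)
            )
            # Approximation: only symbol-like characters join after a ZWJ.
            zwj_join = prev.endswith(ZWJ) and unicodedata.category(ch) == "So"
            if is_extender(ch) or zwj_join or ri_pair:
                clusters[-1] = prev + ch
                continue
        clusters.append(ch)
    return clusters
```

Under these rules a family emoji built from three people joined by ZWJ counts as one cluster, and four regional indicators in a row count as two flags, which is what a cursor should step over.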

06 Media & Real-Time

• SRT / VTT / TTML subtitle decoding
• Streaming decode with time offsets
• Messaging normalization (emoji, mentions)
• IoT / gateway text compliance filter
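
Streaming decode has one classic pitfall: a multi-byte sequence can be split across two chunks. The Python sketch below uses the standard library's incremental UTF-8 decoder to show the idea. It does not cover time offsets, and `decode_stream` is a hypothetical name rather than a Decoder API.

```python
import codecs


def decode_stream(chunks):
    """Yield decoded text per chunk, tolerating sequences split across chunks."""
    decoder = codecs.getincrementaldecoder("utf-8")(errors="strict")
    for chunk in chunks:
        text = decoder.decode(chunk)  # incomplete trailing bytes are buffered
        if text:
            yield text
    # final=True raises UnicodeDecodeError if the stream ended mid-sequence.
    tail = decoder.decode(b"", final=True)
    if tail:
        yield tail
```

For example, "é" is encoded as C3 A9. If one chunk ends with C3 and the next begins with A9, the decoder holds the first byte back and emits the character with the second chunk.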

Security

Unicode 17.1 Compliance & Security

Decoder acts as a UTF-8 / Unicode policy enforcement layer before anything else sees the bytes. It's designed to catch malformed input at the edge of your system.

→ OWASP Unicode Encoding Attacks
→ OWASP Input Validation Cheat Sheet
→ Trojan Source / invisible vulnerabilities

Overlong Sequences

Detects and rejects overlong UTF-8 encodings used to bypass security filters.
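
A short Python sketch of why overlong forms matter. The byte pair C0 AF is an overlong encoding of "/" (U+002F): a decoder that skips the minimum-value check maps it to a slash, which lets it slip past a filter that only looks for the single byte 2F. The function names here are illustrative assumptions.

```python
def naive_decode_2byte(b0, b1):
    """Decode a 2-byte sequence WITHOUT an overlong check (unsafe on purpose)."""
    return ((b0 & 0x1F) << 6) | (b1 & 0x3F)


def is_overlong(length, code_point):
    """True if code_point fits in fewer bytes than the `length` actually used."""
    minimum = {1: 0x0, 2: 0x80, 3: 0x800, 4: 0x10000}[length]
    return code_point < minimum
```

Here `naive_decode_2byte(0xC0, 0xAF)` yields 0x2F, and `is_overlong(2, 0x2F)` reports the problem. Python's own strict decoder rejects the same bytes: `b"\xc0\xaf".decode("utf-8")` raises `UnicodeDecodeError`.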

Surrogates & Out-of-Range

Validates decoded code points against the Unicode range (U+0000 to U+10FFFF); rejects surrogates (U+D800 to U+DFFF) and other invalid sequences.
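
The checks described in this section can be sketched as a scalar, byte-at-a-time validator in Python. This is a teaching sketch under stated assumptions, not the GPU or C implementation: `first_utf8_error` and its reason strings are invented for the example.

```python
def first_utf8_error(data: bytes):
    """Return (offset, reason) for the first invalid sequence, or None if valid."""
    i, n = 0, len(data)
    while i < n:
        b0 = data[i]
        if b0 < 0x80:  # ASCII
            i += 1
            continue
        if 0xC0 <= b0 <= 0xDF:
            length, cp, minimum = 2, b0 & 0x1F, 0x80
        elif 0xE0 <= b0 <= 0xEF:
            length, cp, minimum = 3, b0 & 0x0F, 0x800
        elif 0xF0 <= b0 <= 0xF7:
            length, cp, minimum = 4, b0 & 0x07, 0x10000
        else:  # stray continuation byte (80..BF) or F8..FF
            return (i, "invalid lead byte")
        if i + length > n:
            return (i, "truncated sequence")
        for b in data[i + 1:i + length]:
            if b & 0xC0 != 0x80:
                return (i, "invalid continuation byte")
            cp = (cp << 6) | (b & 0x3F)
        if cp < minimum:
            return (i, "overlong encoding")
        if 0xD800 <= cp <= 0xDFFF:
            return (i, "surrogate code point")
        if cp > 0x10FFFF:
            return (i, "out of range")
        i += length
    return None
```

For instance, ED A0 80 decodes to U+D800 and is reported as a surrogate, while F4 90 80 80 decodes to U+110000 and is reported as out of range.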

BiDi Overrides

Flags bidirectional control characters (embeddings, overrides, and isolates: U+202A to U+202E and U+2066 to U+2069) that enable Trojan Source-style attacks.
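
A minimal Python sketch of such a scan. The table lists the explicit bidirectional formatting characters by their standard abbreviations; the function name `find_bidi_controls` and the choice to report every occurrence (rather than, say, only unbalanced ones) are assumptions for this example.

```python
# Explicit bidirectional formatting characters and their abbreviations.
BIDI_CONTROLS = {
    "\u202a": "LRE",  # LEFT-TO-RIGHT EMBEDDING
    "\u202b": "RLE",  # RIGHT-TO-LEFT EMBEDDING
    "\u202c": "PDF",  # POP DIRECTIONAL FORMATTING
    "\u202d": "LRO",  # LEFT-TO-RIGHT OVERRIDE
    "\u202e": "RLO",  # RIGHT-TO-LEFT OVERRIDE
    "\u2066": "LRI",  # LEFT-TO-RIGHT ISOLATE
    "\u2067": "RLI",  # RIGHT-TO-LEFT ISOLATE
    "\u2068": "FSI",  # FIRST STRONG ISOLATE
    "\u2069": "PDI",  # POP DIRECTIONAL ISOLATE
}


def find_bidi_controls(text):
    """Return (index, abbreviation) for every bidi control character in text."""
    return [(i, BIDI_CONTROLS[ch]) for i, ch in enumerate(text) if ch in BIDI_CONTROLS]
```

A policy layer could then reject the input, strip the characters, or pass the findings on to a reviewer, depending on context.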

Mixed Scripts & Homoglyphs

Detects confusable characters and suspicious script mixing patterns.

Architecture

Pipeline Diagram

Bytes (input stream)

↓

UTF-8 Validator (GPU / C)

↓

Unicode 17.1 Normalizer (NFC / NFKC / custom)

↓

Segmentation (grapheme / word / line)

↓

Downstream:

Tokenizer Search Indexer Security Scanner Compiler
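
The stages in the diagram can be sketched end to end on the CPU with Python's standard library. This is an illustration of the data flow only: `run_pipeline` is a hypothetical function, and `str.split` and `str.splitlines` are crude stand-ins for real word and line segmentation.

```python
import unicodedata


def run_pipeline(data: bytes, form: str = "NFC"):
    """Validate -> normalize -> segment, mirroring the diagram above."""
    # 1. Validation: strict decoding raises UnicodeDecodeError on bad UTF-8.
    text = data.decode("utf-8", errors="strict")
    # 2. Normalization: NFC or NFKC (custom forms are out of scope here).
    normalized = unicodedata.normalize(form, text)
    # 3. Segmentation: whitespace words and hard line breaks as placeholders.
    return {
        "text": normalized,
        "lines": normalized.splitlines(),
        "words": normalized.split(),
    }
```

Running it on the bytes of "cafe" + U+0301 followed by " ﬁsh" gives the words "café" and "ﬁsh" under NFC; under NFKC the ligature is also folded, giving "fish". Downstream consumers then see one canonical form regardless of how the input was typed.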

Performance

Engineering

Multi-GB/s Class Throughput

Designed for high-throughput pipelines on modern GPUs: logs, crawl data, and corpus files. Benchmarks against CPU libraries are upcoming.

Low-Latency Streaming

Ideal for log pipes, ingest services, and real-time text processing where latency matters.

GPU-First Approach

Explores what happens when proven CPU UTF-8 validation techniques (from Red Hat, Intel, simdjson) move to WebGPU and Metal.

Try It

Live Demo

Coming soon — paste text to run it through the Decoder pipeline.

decoder.run/demo

Status

About Decoder

Decoder is an independent research project developed by TLabs in Türkiye. It grew out of experiments with WebGPU BPE tokenizers and now targets any pipeline where text comes in and everything downstream must trust it.

Decoder is under active development. Not all parts are public yet; core engines will be released on GitHub. It's designed as a reusable module, not a SaaS.

Roadmap

Open-source release planned

GPU BPE Tokenizer live →

Metal / C reference implementations planned

Benchmarks vs. CPU libraries upcoming
