SuperToken
SuperToken is a GPU-accelerated tokenizer toolkit that offers high-throughput byte-pair and unigram training pipelines. It combines streaming data ingestion, adaptive batch sizing, and GPU-friendly packing utilities to keep your accelerators busy while you iterate on vocabulary design.
Table of Contents
- Features
- Installation
- Quick Start
- Privacy Modes
- Threat Model
- Command Reference
- Benchmarking
- Architecture & API Overview
- Project Layout
- Documentation
- Contributing
- License
Features
- GPU-native trainers for both Byte Pair Encoding (BPE) and unigram vocabularies via GPUBPETrainer and GPUUnigramTrainer.
- CPU parity mode for the unigram trainer, reusing the same candidate extension, forward/backward scoring, and pruning logic when CUDA is unavailable.
- Adaptive autoscaling batch suggestion system to maintain target GPU utilization using the AutoScaler utility.
- Streaming corpus ingestion with optional compression, memory-mapped shards, and background worker prefetch.
- Opt-in morphology preprocessing powered by pluggable annotators. Keep token statistics stable by default and selectively enable language-specific passes when you need them.
Installation
This project requires Python 3.10+ and a working PyTorch installation with CUDA support.
python -m venv .venv
source .venv/bin/activate
pip install -U pip
pip install -e .
Note: Editable installs make it easy to iterate on the library modules in gpu_tokenizer/ while running the CLI.
Quick Start
Train a BPE model against a directory of text shards:
python main.py train-bpe \
--data "data/**/*.txt" \
--merges 50000 \
--token-bytes 8192 \
--target-util 0.85 \
--morphology-lang tr \
--morphology-case-markers \
--out-dir ./artifacts/bpe
Enable morphology plugins only when you need them—leaving --morphology-lang unset keeps byte statistics identical to the raw
corpus. The example above activates the Turkish segmenter and optional case markers to demonstrate the new flags.
Need resilience against interruptions? The BPE trainer now supports periodic checkpointing and seamless resume:
python main.py train-bpe \
--data "data/**/*.txt" \
--merges 50000 \
--checkpoint-dir ./artifacts/bpe-checkpoints \
--checkpoint-every 2000
# Later, continue from the most recent checkpoint:
python main.py train-bpe \
--data "data/**/*.txt" \
--merges 50000 \
--resume-from ./artifacts/bpe-checkpoints \
--checkpoint-dir ./artifacts/bpe-checkpoints \
--checkpoint-every 2000
The CLI will restore the autoscaler state, on-device batches, and resume streaming where it left off while logging each checkpoint save/restore event.
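To make the save/resume cycle concrete, here is a minimal sketch of periodic checkpointing with an atomic write and a "load the most recent" lookup. It is not the trainer's actual serialization: the checkpoint_XXXXXXXX.json naming, the merges_done and state fields, and both helper names are assumptions made for illustration.

```python
import json
import os
from pathlib import Path
from typing import Optional


def save_checkpoint(directory: str, merges_done: int, state: dict) -> Path:
    """Hypothetical helper: write one checkpoint file atomically."""
    root = Path(directory)
    root.mkdir(parents=True, exist_ok=True)
    # Zero-padded names sort lexicographically in merge order (assumed layout).
    final = root / f"checkpoint_{merges_done:08d}.json"
    tmp = root / (final.name + ".tmp")
    tmp.write_text(json.dumps({"merges_done": merges_done, "state": state}))
    # os.replace swaps the file in one step, so a crash never leaves a
    # half-written checkpoint under the final name.
    os.replace(tmp, final)
    return final


def load_latest_checkpoint(directory: str) -> Optional[dict]:
    """Hypothetical helper: return the newest checkpoint, or None if absent."""
    candidates = sorted(Path(directory).glob("checkpoint_*.json"))
    if not candidates:
        return None
    return json.loads(candidates[-1].read_text())
```

A resuming run would call load_latest_checkpoint first and skip ahead to merges_done before continuing.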
Train a unigram model with a fixed vocabulary size:
python main.py train-unigram \
--data "data/**/*.txt" \
--vocab-size 50000 \
--epochs 3 \
--out-dir ./artifacts/unigram
Both commands will automatically adapt the batch size in response to your GPU throughput and persist the resulting vocabulary files.
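For intuition about utilization-targeted batch sizing, the sketch below scales the batch size by the ratio of target to observed utilization. It is not the AutoScaler implementation: proportional_batch_size is a hypothetical function, and the per-step limit and clamping bounds are assumptions.

```python
def proportional_batch_size(
    current: int,
    observed_util: float,
    target_util: float = 0.85,
    min_batch: int = 1,
    max_batch: int = 65536,
    max_step: float = 2.0,
) -> int:
    """Hypothetical helper: nudge the batch size toward a utilization target."""
    if observed_util <= 0.0:
        # No usable telemetry yet: grow by the largest allowed step.
        ratio = max_step
    else:
        ratio = target_util / observed_util
    # Limit how far a single adjustment can move the batch size.
    ratio = max(1.0 / max_step, min(max_step, ratio))
    proposed = int(round(current * ratio))
    # Keep the suggestion inside the configured bounds.
    return max(min_batch, min(max_batch, proposed))
```

With the default target of 0.85 (the value passed as --target-util in the first example), an observed utilization of 0.425 doubles the batch, while an observed utilization of 1.0 shrinks it by 15%.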
Evaluate exported artifacts against a reference corpus in a single step—the CLI writes a schema-validated JSON report to reports/evaluate.json by default and prints a human-readable banner that mirrors the headline metrics:
python main.py evaluate \
--data tests/data/evaluate_corpus/*.txt \
--artifacts tests/data/models/bpe \
--deterministic \
--output reports/evaluate.json \
--summary-format table
The bundled fixtures span emoji (including ZWJ sequences), RTL strings, and multiple CJK/Indic scripts alongside structured code manifests so regression tests keep byte-level handling honest.
Supply --merges (or an artifacts directory containing merges.(json|txt)) when benchmarking BPE exports. SentencePiece packages expose unigram.vocab, so pass --model-type unigram or omit --merges to let auto-detection switch modes. Change --summary-format to json or none to customise the banner while keeping the JSON payload stable for downstream tooling.
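The auto-detection described above can be pictured as a file-presence check. This is only a sketch: detect_model_type is a hypothetical helper, and the precedence (merges files win over unigram.vocab) is an assumption rather than documented CLI behavior.

```python
from pathlib import Path


def detect_model_type(artifacts_dir: str) -> str:
    """Hypothetical helper: guess the model type from exported file names."""
    root = Path(artifacts_dir)
    # BPE exports ship a merges table as merges.json or merges.txt.
    if (root / "merges.json").is_file() or (root / "merges.txt").is_file():
        return "bpe"
    # SentencePiece-style packages expose unigram.vocab instead.
    if (root / "unigram.vocab").is_file():
        return "unigram"
    raise FileNotFoundError(
        f"no merges.(json|txt) or unigram.vocab found in {artifacts_dir}"
    )
```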
Alternate between BPE warm starts and unigram refinement in a single run:
python main.py train-hybrid \
--data "data/**/*.txt" \
--merges 50000 \
--cycles 2 \
--unigram-epochs 2 \
--privacy tie-randomize \
--out-dir ./artifacts/hybrid
The hybrid workflow exports Hugging Face-ready BPE files alongside SentencePiece probabilities and a manifest describing each cycle.
Switch to the AST-aware pipeline when training on source repositories. Code-mode can be paired with morphology and privacy guards just like the text pipelines:
python main.py train-bpe \
--data repo.jsonl \
--merges 32000 \
--code-mode \
--code-langs python typescript \
--meta-compress \
--privacy hash-merges \
--out-dir ./artifacts/code-bpe
The CLI prints a code_mode summary block so you can audit AST coverage and meta-token compression gains. Enable --privacy to redact merge histories (hash-merges) or randomise tie-breaks (tie-randomize) before exporting.
Privacy Modes
SuperToken provides an opt-in privacy guard for the merge history produced by the GPU trainers. The --privacy flag, available on the train-bpe and train-hybrid subcommands, accepts three modes:
- none (default) – Export raw merge tables and maintain deterministic tie-breaks. Checkpoints and manifests record the merge pairs in plain text.
- hash-merges – Replace merge IDs with salted hashes in all exported manifests. The privacy block written to state.json, bpe_merges.json, and hybrid_manifest.json indicates that merges were redacted and whether a salt was supplied via --privacy-salt.
- tie-randomize – Hash merges and randomize tie-break resolution. This deliberately breaks deterministic parity across devices; provide --tie-seed to make the stochastic ordering reproducible across runs.
Every exported manifest now includes a "privacy" section summarizing the active mode, whether merges were redacted, the effective tie seed, and if a salt was configured. Downstream consumers can inspect this block to detect redactions without reverse engineering trainer configuration. See docs/cli.md for end-to-end examples.
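As an illustration of what salted redaction can look like, the sketch below replaces a merge pair with a keyed digest. The README does not specify the hashing scheme, so HMAC-SHA256, the NUL separator, and the redact_merge name are all assumptions.

```python
import hashlib
import hmac


def redact_merge(left: str, right: str, salt: bytes = b"") -> str:
    """Hypothetical helper: return a hex digest standing in for a merge pair."""
    # A NUL byte separates the two sides so ("ab", "c") and ("a", "bc")
    # produce different digests; tokens containing NUL would need escaping.
    message = left.encode("utf-8") + b"\x00" + right.encode("utf-8")
    # Keying the digest with the salt defeats precomputed lookup tables.
    return hmac.new(salt, message, hashlib.sha256).hexdigest()
```

The same pair and salt always map to the same digest, so consumers can still compare redacted manifests from one release, while a rotated salt breaks correlation across releases.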
Threat Model
SuperToken targets operators that distribute tokenizer artifacts to semi-trusted partners or run training on shared infrastructure. We assume adversaries can inspect any exported manifest, checkpoint, or intermediate merge table that leaves the control plane, and that they can correlate those artifacts with known corpora to recover sensitive domain terminology. We do not attempt to defend against an attacker that compromises the training host, reads raw input shards, or tampers with the trainer implementation itself.
The privacy modes focus on limiting how much corpus information leaks through merge histories:
| Mode | Adversary capabilities mitigated | Residual risks and operator actions |
|---|---|---|
| none | None. Merge tables and tie breaks remain deterministic, enabling exact reconstruction of token orderings. | Only appropriate when all consumers already have access to the source corpus. Treat artifacts as public. |
| hash-merges | Prevents a passive observer from reading merge IDs directly or matching them to common subwords without brute-force enumeration. Salting frustrates precomputed rainbow tables. | Frequency analysis against known vocabularies can still reveal high-probability merges. Keep salts secret and rotate them between releases to slow cross-run correlation. |
| tie-randomize | Inherits hash-merges protections and additionally disrupts deterministic tie resolution, reducing an adversary's ability to infer subtle ordering preferences from multiple builds. | Stochastic tie breaks introduce run-to-run variation that may complicate regression diffs. Persist the --tie-seed (if used) in a secure location so reruns remain auditable. |
Operators choosing a privacy guard should weigh the sensitivity of merge names, how widely artifacts will be shared, and whether reproducibility is a regulatory requirement. When uncertainty remains, prefer tie-randomize with an explicitly managed seed so auditors can replay outputs without exposing raw merge labels.
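The difference between deterministic and seeded tie resolution can be sketched in a few lines. This is not the GPU trainer's logic: pick_merge is a hypothetical CPU helper, and breaking ties lexicographically in the deterministic case is an assumption.

```python
import random
from typing import Dict, Optional, Tuple

Pair = Tuple[str, str]


def pick_merge(pair_counts: Dict[Pair, int],
               rng: Optional[random.Random] = None) -> Pair:
    """Hypothetical helper: choose the next merge among the most frequent pairs."""
    best = max(pair_counts.values())
    # Sort the tied candidates so the choice never depends on dict order.
    tied = sorted(pair for pair, count in pair_counts.items() if count == best)
    if rng is None:
        # Deterministic mode: always take the smallest tied pair.
        return tied[0]
    # Randomized mode: a seeded generator makes the draw replayable.
    return rng.choice(tied)
```

Passing random.Random(seed) with a recorded seed reproduces the same ordering on a rerun, which is the role --tie-seed plays for auditors.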
Command Reference
The CLI is organized into subcommands that share a common set of arguments.
| Command | Description | Highlights |
|---|---|---|
| train-bpe | Trains a GPU-accelerated BPE tokenizer. | Autoscaled batch sizing, streaming ingestion, optional on-the-fly merges export. |
| train-unigram | Trains a GPU-accelerated unigram tokenizer. | Epoch-based training with configurable vocab size and subword length. |
| train-hybrid | Alternates BPE warm starts with unigram refinement. | Shared batches across phases, hybrid artifact bundle (merges.txt, unigram.prob, manifest). |
| benchmark | Runs both trainers against synthetic and/or real corpora. | Emits comparative tables and JSON telemetry snapshots. |
Benchmarking
Run the bundled benchmark to compare the BPE and unigram trainers with a single command. The example below synthesizes 2,000 sentences while also sampling up to 1,000 documents from your dataset globs:
python main.py benchmark \
--data "data/**/*.txt" \
--max-real-docs 1000 \
--synthetic-docs 2000 \
--synthetic-min-len 16 \
--synthetic-max-len 64 \
--output-dir ./artifacts/benchmarks
Sample output:
Corpus → 2500 sequences, 102400 tokens (max len 128)
Trainer | Wall time (s) | Tokens/s | Final vocab
------------------+---------------+-------------+------------
GPUBPETrainer | 12.84 | 7975.19 | 50256
GPUUnigramTrainer | 8.42 | 12158.52 | 50000
Saved benchmark metadata → artifacts/benchmarks/benchmark_20240101T120000Z.json
The benchmark will always emit a pretty-printed comparison table and serialize the full telemetry payloads into timestamped JSON files under the requested output directory. Those JSON artifacts capture the raw trainer metadata, corpus descriptors, and the CLI configuration so runs are fully reproducible.
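When several runs accumulate in the output directory, the timestamp in the file name is enough to find the newest one. The sketch below assumes every file follows the benchmark_YYYYMMDDTHHMMSSZ.json pattern seen in the sample output; both helper names are hypothetical.

```python
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional

PREFIX = "benchmark_"


def parse_benchmark_timestamp(name: str) -> datetime:
    """Hypothetical helper: extract the UTC timestamp from a benchmark file name."""
    stem = Path(name).stem
    if not stem.startswith(PREFIX):
        raise ValueError(f"unexpected benchmark file name: {name}")
    stamp = stem[len(PREFIX):]
    # "T" and "Z" are literal characters in the assumed naming scheme.
    parsed = datetime.strptime(stamp, "%Y%m%dT%H%M%SZ")
    return parsed.replace(tzinfo=timezone.utc)


def latest_benchmark(output_dir: str) -> Optional[Path]:
    """Hypothetical helper: return the most recent benchmark JSON, if any."""
    files = sorted(
        Path(output_dir).glob("benchmark_*.json"),
        key=lambda path: parse_benchmark_timestamp(path.name),
    )
    return files[-1] if files else None
```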
Common flags include:
- --data: One or more glob patterns pointing at UTF-8 text shards.
- --compression: Choose between none, zstd, or lz4 for shard decoding.
- --io-workers & --prefetch-batches: Control the background streaming pipeline.
- --bos / --eos: Optionally inject special token IDs during packing.
Run python main.py --help for a full list of options.
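To show how glob-based ingestion and background prefetch fit together, here is a small single-worker sketch. It is not the gpu_tokenizer.io pipeline: iter_shard_lines and prefetch are hypothetical names, compression is ignored, and the queue depth only loosely mirrors what --prefetch-batches controls.

```python
import glob
import queue
import threading
from typing import Iterable, Iterator

_DONE = object()  # sentinel marking the end of the stream


def iter_shard_lines(patterns: Iterable[str]) -> Iterator[str]:
    """Yield lines from every UTF-8 text file matched by the glob patterns."""
    for pattern in patterns:
        # recursive=True lets "**" match nested directories, as in data/**/*.txt.
        for path in sorted(glob.glob(pattern, recursive=True)):
            with open(path, encoding="utf-8") as handle:
                for line in handle:
                    yield line.rstrip("\n")


def prefetch(source: Iterator[str], depth: int = 4) -> Iterator[str]:
    """Read `source` on a background thread, buffering up to `depth` items."""
    buffer: queue.Queue = queue.Queue(maxsize=depth)

    def worker() -> None:
        try:
            for item in source:
                buffer.put(item)  # blocks while the buffer is full
        finally:
            # Always signal completion so the consumer never hangs,
            # even if reading a shard fails partway through.
            buffer.put(_DONE)

    threading.Thread(target=worker, daemon=True).start()
    while True:
        item = buffer.get()
        if item is _DONE:
            return
        yield item
```

Wrapping the reader as prefetch(iter_shard_lines(["data/**/*.txt"])) lets file I/O overlap with whatever consumes the lines.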
Morphology plugins (opt-in)
SuperToken ships with a small, safe-by-default morphology layer that leaves byte streams untouched unless explicitly enabled.
Plugins pre-segment text before it reaches the BytePacker, which can improve compression ratios for agglutinative languages
at the cost of changing downstream token statistics. To enable a plugin, pass --morphology-lang with one of the advertised
language codes:
- tr – Turkish suffix annotator that optionally tags case markers and productive affixes.
- ja – Japanese script-aware segmenter that preserves contiguous Kanji, Hiragana, Katakana, and ASCII spans.
- ko – Korean script-aware segmenter that groups Hangul runs while leaving Latin digits and punctuation intact.
All bundled plugins are covered by unit tests that assert presegment/recompose round trips, so the original byte streams are
reconstructed exactly even when annotations are emitted.
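The round-trip property is easy to picture with a toy script-aware segmenter. The sketch below is not the bundled ko plugin: presegment and recompose are stand-in functions, only the Hangul Syllables block is recognised, and no annotations are emitted.

```python
from itertools import groupby
from typing import List


def is_hangul(ch: str) -> bool:
    # Hangul Syllables block only (U+AC00..U+D7A3); Jamo blocks are
    # left out to keep the sketch short.
    return "\uac00" <= ch <= "\ud7a3"


def presegment(text: str) -> List[str]:
    """Split text into maximal runs of Hangul and non-Hangul characters."""
    return ["".join(run) for _, run in groupby(text, key=is_hangul)]


def recompose(segments: List[str]) -> str:
    """Undo presegment: concatenation restores the original string."""
    return "".join(segments)
```

Because segmentation only inserts boundaries and never rewrites characters, recompose(presegment(text)) returns the original text, and therefore the original UTF-8 bytes.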
python main.py train-bpe \
--data "data/**/*.txt" \
--merges 50000 \
--morphology-lang tr \
--morphology-case-markers \
--out-dir ./artifacts/bpe-tr
Leave the flag unset to retain the raw byte stream. See docs/api.md for the plugin interface and docs/cookbook/morphology.md for an end-to-end recipe that trains with the Turkish plugin and verifies reconstruction fidelity.
Architecture & API Overview
SuperToken is organized into modular layers that can be reused independently or combined through the CLI:
- Autoscaling (gpu_tokenizer.autoscaler) – Provides the AutoScaler class that tracks throughput telemetry and surfaces suggest_batch_size helpers for trainers. See the inline docstrings in gpu_tokenizer/autoscaler.py for configuration knobs and extension hooks around utilization targets.
- BPE training (gpu_tokenizer.bpe_trainer) – Implements GPUBPETrainer, merging heuristics, and checkpoint serialization. This module integrates directly with the autoscaler and exposes hooks for custom merge filters; refer to gpu_tokenizer/bpe_trainer.py.
- Unigram training (gpu_tokenizer.unigram_trainer) – Offers GPUUnigramTrainer plus scoring utilities for probabilistic vocabularies. Docstrings in gpu_tokenizer/unigram_trainer.py describe how to plug in custom smoothing or constraint logic.
- Datasets & packing (gpu_tokenizer.datasets) – Houses streaming dataset abstractions, packing helpers, and synthetic corpus generators used by both trainers. See gpu_tokenizer/datasets/__init__.py and the submodules it re-exports.
- I/O pipeline (gpu_tokenizer.io) – Encapsulates shard decoding, compression handling, and background workers. Start with gpu_tokenizer/io/__init__.py and follow the module-level docs for extension points.
- CLI composition (main.py) – Declares the train-bpe, train-unigram, train-hybrid, and benchmark subcommands. You can register new commands by extending the build_parser function and wiring your trainers to the shared autoscaler utilities.
- Benchmark utilities (benchmarks/) – Contains reusable benchmarking harnesses and report formatters. Module docstrings point to upcoming narrative guides under docs/benchmarks/ for more complex scenarios.
Future deep dives will land in the docs/ directory (see docs/architecture.md) and will mirror the high-level flow described here.
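Registering a new subcommand alongside the existing ones generally follows the standard argparse pattern sketched below. This is not the project's build_parser: build_demo_parser, the train-custom command, and the defaults shown are hypothetical, and the real parser may organise its shared arguments differently.

```python
import argparse


def build_demo_parser() -> argparse.ArgumentParser:
    """Hypothetical parser showing shared arguments plus one extra subcommand."""
    parser = argparse.ArgumentParser(prog="main.py")
    subparsers = parser.add_subparsers(dest="command", required=True)

    # Arguments common to every training subcommand live on a parent parser.
    shared = argparse.ArgumentParser(add_help=False)
    shared.add_argument("--data", nargs="+", required=True)
    shared.add_argument("--out-dir", default="./artifacts")

    bpe = subparsers.add_parser("train-bpe", parents=[shared])
    bpe.add_argument("--merges", type=int, default=50000)

    # A new command reuses the shared flags and adds its own.
    custom = subparsers.add_parser("train-custom", parents=[shared])
    custom.add_argument("--vocab-size", type=int, default=32000)
    return parser
```

Parsing ["train-custom", "--data", "a.txt", "--vocab-size", "100"] yields a namespace whose command attribute selects which trainer to wire up.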
Project Layout
.
├── main.py # CLI entry point tying together trainers and utilities
├── gpu_tokenizer/ # Core GPU trainers, packing utilities, and dataset helpers
├── docs/ # Design notes and performance documentation
└── tests/ # Unit tests covering packing, IO, and trainer behavior
Documentation
- Architecture overview: Understand the end-to-end trainer pipeline, autoscaler lifecycle, and how datasets stream into GPU kernels.
- CLI usage guide: Learn the subcommands, shared flags, and example workflows for training, resuming, and benchmarking tokenizers.
- API reference: Dive into the primary Python entry points, including trainers, autoscaler hooks, dataset utilities, and benchmarking helpers.
- Module guide: Browse the module-by-module breakdown of the codebase for deeper implementation details.
- Performance notes and benchmarks: Review methodology and representative throughput numbers, plus tips for reproducing measurements.
Module Primers
- Trainers – GPUBPETrainer and GPUUnigramTrainer coordinate packing, kernel launches, and checkpointing. See the Module guide → Trainers section for configuration hints and extension hooks.
- Autoscaler – The adaptive batching logic in gpu_tokenizer.autoscaler keeps GPU utilization in the target band. Refer to Module guide → Autoscaler for heuristics and subclassing advice.
- Streaming I/O – Dataset loaders and IO helpers manage compressed shards, worker pools, and synthetic corpora. Explore Module guide → Streaming I/O to customize ingestion paths.
CLI & Benchmark Navigation
- Discover commands – main.py is the CLI entry point; run python main.py --help to enumerate subcommands. Each train-* action is registered inside the build_parser helper alongside shared arguments.
- Command implementations – The BPE flow lives in gpu_tokenizer/cli_train_bpe.py, which binds argument parsing to the GPUBPETrainer. Mirror its structure when adding new CLI frontends so trainers remain reusable.
- Benchmark utilities – Reusable harnesses, corpus generators, and reporting helpers reside under benchmarks/. Pair them with python main.py benchmark for quick comparisons, or import them directly in notebooks to script bespoke experiments.
Additional guides and API notes can be added under the docs/ directory as the project grows.
Contributing
- Fork the repository and create a virtual environment.
- Install development dependencies (see pyproject.toml if present).
- Format your changes and ensure tests pass via pytest.
- Open a pull request describing your changes and include benchmark results when appropriate.
- The "Tests" GitHub Actions workflow runs syntax linting, evaluation-schema validation, and the full pytest suite on every push and pull request so contributors get immediate feedback.
License
This project is licensed under the Mozilla Public License 2.0.