SuperToken


SuperToken is a GPU-accelerated tokenizer toolkit that offers high-throughput byte-pair and unigram training pipelines. It combines streaming data ingestion, adaptive batch sizing, and GPU-friendly packing utilities to keep your accelerators busy while you iterate on vocabulary design.

Features

  • GPU-native trainers for both Byte Pair Encoding (BPE) and unigram vocabularies via GPUBPETrainer and GPUUnigramTrainer.
  • CPU parity mode for the unigram trainer, reusing the same candidate extension, forward/backward scoring, and pruning logic when CUDA is unavailable.
  • Adaptive autoscaling batch suggestion system to maintain target GPU utilization using the AutoScaler utility.
  • Streaming corpus ingestion with optional compression, memory-mapped shards, and background worker prefetch.
  • Opt-in morphology preprocessing powered by pluggable annotators. Keep token statistics stable by default and selectively enable language-specific passes when you need them.
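For readers new to byte-pair encoding, the merge loop that a BPE trainer repeats can be sketched on the CPU in a few lines. This is a toy illustration only, not SuperToken's GPU implementation: the function name train_toy_bpe and its tie-breaking rule are assumptions made for the example.

```python
from collections import Counter


def train_toy_bpe(corpus: list, num_merges: int) -> list:
    """Learn up to num_merges byte-pair merges on the CPU (toy illustration)."""
    # Start from raw bytes: token IDs 0..255 are the byte values themselves.
    sequences = [list(doc) for doc in corpus]
    merges = []
    next_id = 256
    for _ in range(num_merges):
        # Count every adjacent token pair across the corpus.
        pair_counts = Counter()
        for seq in sequences:
            pair_counts.update(zip(seq, seq[1:]))
        if not pair_counts:
            break
        # Most frequent pair wins; ties go to the smallest pair (an assumption).
        best = min(pair_counts, key=lambda p: (-pair_counts[p], p))
        if pair_counts[best] < 2:
            break  # nothing left that is worth merging
        merges.append(best)
        # Replace each occurrence of the pair with a fresh token ID.
        for i, seq in enumerate(sequences):
            out, j = [], 0
            while j < len(seq):
                if j + 1 < len(seq) and (seq[j], seq[j + 1]) == best:
                    out.append(next_id)
                    j += 2
                else:
                    out.append(seq[j])
                    j += 1
            sequences[i] = out
        next_id += 1
    return merges


toy_merges = train_toy_bpe([b"low lower lowest", b"low low"], num_merges=3)
print(toy_merges)
```

The --merges flag in the commands below plays the role of num_merges here: it bounds how many pairs are learned.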

Installation

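This project requires Python 3.10+ and a working PyTorch installation. CUDA support is needed for the GPU trainers; the unigram trainer falls back to its CPU parity mode when CUDA is unavailable.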

python -m venv .venv
source .venv/bin/activate
pip install -U pip
pip install -e .

Note: Editable installs make it easy to iterate on the library modules in gpu_tokenizer/ while running the CLI.

Quick Start

Train a BPE model against a directory of text shards:

python main.py train-bpe \
  --data "data/**/*.txt" \
  --merges 50000 \
  --token-bytes 8192 \
  --target-util 0.85 \
  --morphology-lang tr \
  --morphology-case-markers \
  --out-dir ./artifacts/bpe

Enable morphology plugins only when you need them—leaving --morphology-lang unset keeps byte statistics identical to the raw corpus. The example above activates the Turkish segmenter and optional case markers to demonstrate the new flags.

Need resilience against interruptions? The BPE trainer now supports periodic checkpointing and seamless resume:

python main.py train-bpe \
  --data "data/**/*.txt" \
  --merges 50000 \
  --checkpoint-dir ./artifacts/bpe-checkpoints \
  --checkpoint-every 2000

# Later, continue from the most recent checkpoint:
python main.py train-bpe \
  --data "data/**/*.txt" \
  --merges 50000 \
  --resume-from ./artifacts/bpe-checkpoints \
  --checkpoint-dir ./artifacts/bpe-checkpoints \
  --checkpoint-every 2000

The CLI restores the autoscaler state and on-device batches, resumes streaming where it left off, and logs each checkpoint save and restore event.
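The checkpoint-and-resume pattern behind these flags can be sketched as follows. This is a minimal illustration under stated assumptions, not the trainer's actual checkpoint format: the step_*.json naming, the JSON payload, and the helper names save_checkpoint, latest_checkpoint, and run are all hypothetical.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional


def save_checkpoint(ckpt_dir: Path, step: int, state: dict) -> Path:
    """Write one checkpoint atomically (file naming is hypothetical)."""
    ckpt_dir.mkdir(parents=True, exist_ok=True)
    path = ckpt_dir / f"step_{step:08d}.json"
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps({"step": step, **state}))
    tmp.replace(path)  # rename last so a crash never leaves a partial file
    return path


def latest_checkpoint(ckpt_dir: Path) -> Optional[dict]:
    """Return the newest checkpoint payload, or None if there is none."""
    if not ckpt_dir.is_dir():
        return None
    candidates = sorted(ckpt_dir.glob("step_*.json"))  # zero-padded names sort by step
    return json.loads(candidates[-1].read_text()) if candidates else None


def run(total_steps: int, ckpt_dir: Path, every: int,
        stop_after: Optional[int] = None) -> int:
    """Resume from the newest checkpoint, then save every `every` steps."""
    restored = latest_checkpoint(ckpt_dir)
    step = restored["step"] if restored else 0
    while step < total_steps:
        step += 1  # stand-in for performing one merge
        if step % every == 0:
            save_checkpoint(ckpt_dir, step, {"note": "trainer state would go here"})
        if stop_after is not None and step >= stop_after:
            break  # simulate an interruption
    return step


with tempfile.TemporaryDirectory() as tmp_dir:
    ckpts = Path(tmp_dir) / "bpe-checkpoints"
    interrupted_at = run(10, ckpts, every=4, stop_after=6)  # saves at step 4, stops at 6
    finished_at = run(10, ckpts, every=4)                   # resumes from step 4
    last_saved = latest_checkpoint(ckpts)["step"]
```

In this sketch the work done between the last save and the interruption (steps 5 and 6) is repeated after resume, which is the usual trade-off when choosing a checkpoint interval.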

Train a unigram model with a fixed vocabulary size:

python main.py train-unigram \
  --data "data/**/*.txt" \
  --vocab-size 50000 \
  --epochs 3 \
  --out-dir ./artifacts/unigram

Both train-bpe and train-unigram automatically adapt the batch size in response to your GPU throughput and persist the resulting vocabulary files.
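One way to picture utilization-driven batch sizing is a clamped proportional controller. The sketch below is hypothetical and is not the gpu_tokenizer.autoscaler.AutoScaler API: the class name ToyAutoScaler, its constructor arguments, and the suggest_batch_size signature are assumptions chosen for illustration.

```python
class ToyAutoScaler:
    """Hypothetical proportional controller; not the real AutoScaler API."""

    def __init__(self, target_util: float = 0.85, min_batch: int = 1,
                 max_batch: int = 4096, max_step: float = 2.0) -> None:
        self.target_util = target_util
        self.min_batch = min_batch
        self.max_batch = max_batch
        self.max_step = max_step  # largest growth or shrink factor per suggestion

    def suggest_batch_size(self, current_batch: int, observed_util: float) -> int:
        # Scale the batch by target/observed, clamped so that one noisy
        # reading cannot more than double or halve the batch.
        if observed_util <= 0.0:
            ratio = self.max_step
        else:
            ratio = self.target_util / observed_util
        ratio = max(1.0 / self.max_step, min(self.max_step, ratio))
        proposed = round(current_batch * ratio)
        return max(self.min_batch, min(self.max_batch, proposed))


scaler = ToyAutoScaler(target_util=0.85)
grown = scaler.suggest_batch_size(256, observed_util=0.20)   # under-utilized: grow
steady = scaler.suggest_batch_size(256, observed_util=0.85)  # on target: hold
shrunk = scaler.suggest_batch_size(256, observed_util=1.00)  # saturated: shrink
print(grown, steady, shrunk)
```

The target_util argument mirrors the --target-util 0.85 value used in the first example; the clamping factor is an invented knob.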

Evaluate exported artifacts against a reference corpus in a single step. By default the CLI writes a schema-validated JSON report to reports/evaluate.json and prints a human-readable banner that mirrors the headline metrics:

python main.py evaluate \
  --data tests/data/evaluate_corpus/*.txt \
  --artifacts tests/data/models/bpe \
  --deterministic \
  --output reports/evaluate.json \
  --summary-format table

The bundled fixtures span emoji (including ZWJ sequences), RTL strings, and multiple CJK/Indic scripts alongside structured code manifests so regression tests keep byte-level handling honest.

Supply --merges (or an artifacts directory containing merges.(json|txt)) when evaluating BPE exports. SentencePiece packages expose unigram.vocab, so pass --model-type unigram or omit --merges to let auto-detection switch modes. Change --summary-format to json or none to customize the banner while keeping the JSON payload stable for downstream tooling.
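The detection rule described above could look roughly like this. It is a sketch under assumptions rather than the CLI's actual logic: the function name detect_model_type and the precedence order (explicit merges first, then merges files, then unigram.vocab) are hypothetical.

```python
import tempfile
from pathlib import Path
from typing import Optional


def detect_model_type(artifacts: Path, merges: Optional[Path] = None) -> str:
    """Guess whether an artifacts directory holds a BPE or a unigram export."""
    if merges is not None:
        return "bpe"  # an explicit merges file was supplied
    if any((artifacts / name).is_file() for name in ("merges.json", "merges.txt")):
        return "bpe"
    if (artifacts / "unigram.vocab").is_file():
        return "unigram"
    raise FileNotFoundError(f"no merges.(json|txt) or unigram.vocab in {artifacts}")


with tempfile.TemporaryDirectory() as tmp_dir:
    bpe_dir = Path(tmp_dir) / "bpe"
    bpe_dir.mkdir()
    (bpe_dir / "merges.txt").write_text("l o\n")

    unigram_dir = Path(tmp_dir) / "unigram"
    unigram_dir.mkdir()
    (unigram_dir / "unigram.vocab").write_text("the\t-3.2\n")

    bpe_kind = detect_model_type(bpe_dir)
    unigram_kind = detect_model_type(unigram_dir)
```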

Alternate between BPE warm starts and unigram refinement in a single run:

python main.py train-hybrid \
  --data "data/**/*.txt" \
  --merges 50000 \
  --cycles 2 \
  --unigram-epochs 2 \
  --privacy tie-randomize \
  --out-dir ./artifacts/hybrid

The hybrid workflow exports Hugging Face-ready BPE files alongside SentencePiece probabilities and a manifest describing each cycle.

Switch to the AST-aware pipeline when training on source repositories. Code-mode can be paired with morphology and privacy guards just like the text pipelines:

python main.py train-bpe \
  --data repo.jsonl \
  --merges 32000 \
  --code-mode \
  --code-langs python typescript \
  --meta-compress \
  --privacy hash-merges \
  --out-dir ./artifacts/code-bpe

The CLI prints a code_mode summary block so you can audit AST coverage and meta-token compression gains. Enable --privacy to redact merge histories (hash-merges) or randomize tie-breaks (tie-randomize) before exporting.

Privacy Modes

SuperToken provides an opt-in privacy guard for the merge history produced by the GPU trainers. The --privacy flag, available on the train-bpe and train-hybrid subcommands, accepts three modes:

  • none (default) Export raw merge tables and maintain deterministic tie-breaks. Checkpoints and manifests record the merge pairs in plain text.
  • hash-merges Replace merge IDs with salted hashes in all exported manifests. The privacy block written to state.json, bpe_merges.json, and hybrid_manifest.json indicates that merges were redacted and whether a salt was supplied via --privacy-salt.
  • tie-randomize Hash merges and randomize tie-break resolution. This deliberately breaks deterministic parity across devices; provide --tie-seed to make the stochastic ordering reproducible across runs.

Every exported manifest now includes a "privacy" section summarizing the active mode, whether merges were redacted, the effective tie seed, and whether a salt was configured. Downstream consumers can inspect this block to detect redactions without reverse engineering trainer configuration. See docs/cli.md for end-to-end examples.
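To make the idea of salted merge hashes concrete, here is a minimal sketch using HMAC-SHA256 from the Python standard library. The choice of HMAC, the separator byte, and the function name redact_merge are assumptions for illustration; the README does not state which hash construction hash-merges uses.

```python
import hashlib
import hmac


def redact_merge(left: str, right: str, salt: bytes) -> str:
    """Return a salted digest that stands in for a plain-text merge pair."""
    # A separator byte keeps ("th", "e") distinct from ("t", "he").
    message = left.encode("utf-8") + b"\x00" + right.encode("utf-8")
    return hmac.new(salt, message, hashlib.sha256).hexdigest()


merges = [("th", "e"), ("t", "he"), ("in", "g")]
redacted = [redact_merge(a, b, salt=b"release-salt") for a, b in merges]
rotated = [redact_merge(a, b, salt=b"next-release-salt") for a, b in merges]

for (a, b), digest in zip(merges, redacted):
    print(f"{a}+{b} -> {digest[:16]}...")
```

Rotating the salt changes every digest, which is why keeping it secret and rotating it between releases slows cross-run correlation.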

Threat Model

SuperToken targets operators that distribute tokenizer artifacts to semi-trusted partners or run training on shared infrastructure. We assume adversaries can inspect any exported manifest, checkpoint, or intermediate merge table that leaves the control plane, and that they can correlate those artifacts with known corpora to recover sensitive domain terminology. We do not attempt to defend against an attacker that compromises the training host, reads raw input shards, or tampers with the trainer implementation itself.

The privacy modes focus on limiting how much corpus information leaks through merge histories:

| Mode | Adversary capabilities mitigated | Residual risks and operator actions |
| --- | --- | --- |
| none | None. Merge tables and tie breaks remain deterministic, enabling exact reconstruction of token orderings. | Only appropriate when all consumers already have access to the source corpus. Treat artifacts as public. |
| hash-merges | Prevents a passive observer from reading merge IDs directly or matching them to common subwords without brute-force enumeration. Salting frustrates precomputed rainbow tables. | Frequency analysis against known vocabularies can still reveal high-probability merges. Keep salts secret and rotate them between releases to slow cross-run correlation. |
| tie-randomize | Inherits hash-merges protections and additionally disrupts deterministic tie resolution, reducing an adversary's ability to infer subtle ordering preferences from multiple builds. | Stochastic tie breaks introduce run-to-run variation that may complicate regression diffs. Persist the --tie-seed (if used) in a secure location so reruns remain auditable. |

Operators choosing a privacy guard should weigh the sensitivity of merge names, how widely artifacts will be shared, and whether reproducibility is a regulatory requirement. When uncertainty remains, prefer tie-randomize with an explicitly managed seed so auditors can replay outputs without exposing raw merge labels.
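The effect of a managed seed on tie-breaking can be sketched as below. This is an illustrative model, not the trainer's code: the function pick_merge and the use of random.Random are assumptions, and the seed value is arbitrary.

```python
import random
from typing import Optional


def pick_merge(pair_counts: dict, rng: Optional[random.Random] = None):
    """Choose the next merge among the most frequent pairs."""
    top = max(pair_counts.values())
    tied = sorted(pair for pair, count in pair_counts.items() if count == top)
    if rng is None:
        return tied[0]       # deterministic: the lowest pair always wins
    return rng.choice(tied)  # randomized tie-break driven by the seeded RNG


counts = {("a", "b"): 7, ("c", "d"): 7, ("e", "f"): 7, ("g", "h"): 2}

deterministic = pick_merge(counts)

# Two runs with the same seed replay the same sequence of choices.
rng_a = random.Random(1234)
rng_b = random.Random(1234)
picks_a = [pick_merge(counts, rng_a) for _ in range(5)]
picks_b = [pick_merge(counts, rng_b) for _ in range(5)]
```

An auditor who holds the seed can reproduce picks_a exactly, while a consumer without it cannot tell which of the tied pairs a deterministic build would have preferred.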

Command Reference

The CLI is organized into subcommands that share a common set of arguments.

| Command | Description | Highlights |
| --- | --- | --- |
| train-bpe | Trains a GPU-accelerated BPE tokenizer. | Autoscaled batch sizing, streaming ingestion, optional on-the-fly merges export. |
| train-unigram | Trains a GPU-accelerated unigram tokenizer. | Epoch-based training with configurable vocab size and subword length. |
| train-hybrid | Alternates BPE warm starts with unigram refinement. | Shared batches across phases, hybrid artifact bundle (merges.txt, unigram.prob, manifest). |
| benchmark | Runs both trainers against synthetic and/or real corpora. | Emits comparative tables and JSON telemetry snapshots. |

Benchmarking

Run the bundled benchmark to compare the BPE and unigram trainers with a single command. The example below synthesizes 2,000 sentences while also sampling up to 1,000 documents from your dataset globs:

python main.py benchmark \
  --data "data/**/*.txt" \
  --max-real-docs 1000 \
  --synthetic-docs 2000 \
  --synthetic-min-len 16 \
  --synthetic-max-len 64 \
  --output-dir ./artifacts/benchmarks

Sample output:

Corpus → 2500 sequences, 102400 tokens (max len 128)
Trainer           | Wall time (s) | Tokens/s    | Final vocab
------------------+---------------+-------------+------------
GPUBPETrainer     | 12.84         | 7975.19     | 50256
GPUUnigramTrainer | 8.42          | 12158.52    | 50000
Saved benchmark metadata → artifacts/benchmarks/benchmark_20240101T120000Z.json

The benchmark will always emit a pretty-printed comparison table and serialize the full telemetry payloads into timestamped JSON files under the requested output directory. Those JSON artifacts capture the raw trainer metadata, corpus descriptors, and the CLI configuration so runs are fully reproducible.
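Because the JSON files carry a UTC timestamp in their names, picking up the most recent run from a script is straightforward. The sketch below assumes only the benchmark_<timestamp>Z.json naming visible in the sample output; the helper names are hypothetical, and the payload is a placeholder because the telemetry schema is not described here.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional


def parse_stamp(path: Path) -> datetime:
    """Parse the UTC timestamp embedded in a benchmark file name."""
    stamp = datetime.strptime(path.stem, "benchmark_%Y%m%dT%H%M%SZ")
    return stamp.replace(tzinfo=timezone.utc)


def latest_benchmark(output_dir: Path) -> Optional[Path]:
    """Return the most recent benchmark JSON, or None if there are none."""
    runs = list(output_dir.glob("benchmark_*.json"))
    return max(runs, key=parse_stamp) if runs else None


with tempfile.TemporaryDirectory() as tmp_dir:
    out = Path(tmp_dir)
    for name in ("benchmark_20240101T120000Z.json", "benchmark_20240215T083000Z.json"):
        (out / name).write_text(json.dumps({"placeholder": True}))

    newest = latest_benchmark(out)
    newest_name = newest.name
    newest_when = parse_stamp(newest)
    top_level_keys = sorted(json.loads(newest.read_text()))
```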

Common flags shared across subcommands include:

  • --data: One or more glob patterns pointing at UTF-8 text shards.
  • --compression: Choose between none, zstd, or lz4 for shard decoding.
  • --io-workers & --prefetch-batches: Control the background streaming pipeline.
  • --bos/--eos: Optionally inject special token IDs during packing.

Run python main.py --help for a full list of options.
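The background streaming controlled by --io-workers and --prefetch-batches follows a familiar producer/consumer shape. The following is a single-worker sketch with a bounded queue, written as an assumption about the general pattern rather than a copy of the gpu_tokenizer.io implementation; the prefetch function name and depth parameter are hypothetical.

```python
import queue
import threading
from typing import Iterable, Iterator

_SENTINEL = object()


def prefetch(batches: Iterable, depth: int = 2) -> Iterator:
    """Yield items from `batches` while a worker thread reads ahead."""
    buffer: queue.Queue = queue.Queue(maxsize=depth)  # bounded: caps read-ahead

    def worker() -> None:
        try:
            for item in batches:
                buffer.put(item)  # blocks while the buffer is full
        finally:
            buffer.put(_SENTINEL)  # always signal completion to the consumer

    threading.Thread(target=worker, daemon=True).start()
    while True:
        item = buffer.get()
        if item is _SENTINEL:
            return
        yield item


def fake_shard_reader(n: int) -> Iterator:
    """Stand-in for decoding batches from text shards."""
    for i in range(n):
        yield f"batch-{i}"


consumed = list(prefetch(fake_shard_reader(5), depth=2))
print(consumed)
```

A larger depth hides more I/O latency at the cost of holding more batches in memory.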

Morphology plugins (opt-in)

SuperToken ships with a small, safe-by-default morphology layer that leaves byte streams untouched unless explicitly enabled. Plugins pre-segment text before it reaches the BytePacker, which can improve compression ratios for agglutinative languages at the cost of changing downstream token statistics. To enable a plugin, pass --morphology-lang with one of the advertised language codes:

  • tr Turkish suffix annotator that optionally tags case markers and productive affixes.
  • ja Japanese script-aware segmenter that preserves contiguous Kanji, Hiragana, Katakana, and ASCII spans.
  • ko Korean script-aware segmenter that groups Hangul runs while leaving Latin digits and punctuation intact.

All bundled plugins are covered by unit tests that assert presegment/recompose round trips, so the original byte streams are reconstructed exactly even when annotations are emitted.
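A round-trip check of this kind can be sketched with a toy script-aware segmenter. The code below is not one of the bundled plugins: the presegment and recompose names are borrowed from the sentence above, but their signatures and the Hangul-syllable regex are assumptions for illustration.

```python
import re

# Runs of Hangul syllables (U+AC00..U+D7A3) versus runs of everything else.
_RUNS = re.compile(r"[\uac00-\ud7a3]+|[^\uac00-\ud7a3]+")


def presegment(text: str) -> list:
    """Split text into alternating Hangul and non-Hangul runs."""
    return _RUNS.findall(text)


def recompose(segments: list) -> str:
    """Invert presegment by concatenating the runs in order."""
    return "".join(segments)


sample = "GPU 토크나이저 v2 테스트!"
segments = presegment(sample)
restored = recompose(segments)

# The property under test: bytes are identical after the round trip.
assert restored.encode("utf-8") == sample.encode("utf-8")
print(segments)
```

Because every character falls into exactly one of the two run classes, concatenating the segments always reproduces the input.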

python main.py train-bpe \
  --data "data/**/*.txt" \
  --merges 50000 \
  --morphology-lang tr \
  --morphology-case-markers \
  --out-dir ./artifacts/bpe-tr

Leave the flag unset to retain the raw byte stream. See docs/api.md for the plugin interface and docs/cookbook/morphology.md for an end-to-end recipe that trains with the Turkish plugin and verifies reconstruction fidelity.

Architecture & API Overview

SuperToken is organized into modular layers that can be reused independently or combined through the CLI:

  • Autoscaling (gpu_tokenizer.autoscaler) Provides the AutoScaler class that tracks throughput telemetry and surfaces suggest_batch_size helpers for trainers. See the inline docstrings in gpu_tokenizer/autoscaler.py for configuration knobs and extension hooks around utilization targets.
  • BPE training (gpu_tokenizer.bpe_trainer) Implements GPUBPETrainer, merging heuristics, and checkpoint serialization. This module integrates directly with the autoscaler and exposes hooks for custom merge filters; refer to gpu_tokenizer/bpe_trainer.py.
  • Unigram training (gpu_tokenizer.unigram_trainer) Offers GPUUnigramTrainer plus scoring utilities for probabilistic vocabularies. Docstrings in gpu_tokenizer/unigram_trainer.py describe how to plug in custom smoothing or constraint logic.
  • Datasets & packing (gpu_tokenizer.datasets) Houses streaming dataset abstractions, packing helpers, and synthetic corpus generators used by both trainers. See gpu_tokenizer/datasets/__init__.py and the submodules it re-exports.
  • I/O pipeline (gpu_tokenizer.io) Encapsulates shard decoding, compression handling, and background workers. Start with gpu_tokenizer/io/__init__.py and follow the module-level docs for extension points.
  • CLI composition (main.py) Declares the train-bpe, train-unigram, train-hybrid, evaluate, and benchmark subcommands. You can register new commands by extending the build_parser function and wiring your trainers to the shared autoscaler utilities.
  • Benchmark utilities (benchmarks/) Contains reusable benchmarking harnesses and report formatters. Module docstrings point to upcoming narrative guides under docs/benchmarks/ for more complex scenarios.

Future deep dives will land in the docs/ directory (see docs/architecture.md) and will mirror the high-level flow described here.

Project Layout

.
├── main.py              # CLI entry point tying together trainers and utilities
├── gpu_tokenizer/       # Core GPU trainers, packing utilities, and dataset helpers
├── benchmarks/          # Reusable benchmarking harnesses and report formatters
├── docs/                # Design notes and performance documentation
└── tests/               # Unit tests covering packing, IO, and trainer behavior

Documentation

  • Architecture overview: Understand the end-to-end trainer pipeline, autoscaler lifecycle, and how datasets stream into GPU kernels.
  • CLI usage guide: Learn the subcommands, shared flags, and example workflows for training, resuming, and benchmarking tokenizers.
  • API reference: Dive into the primary Python entry points, including trainers, autoscaler hooks, dataset utilities, and benchmarking helpers.
  • Module guide: Browse the module-by-module breakdown of the codebase for deeper implementation details.
  • Performance notes and benchmarks: Review methodology and representative throughput numbers, plus tips for reproducing measurements.

Module Primers

  • Trainers GPUBPETrainer and GPUUnigramTrainer coordinate packing, kernel launches, and checkpointing. See the Module guide → Trainers section for configuration hints and extension hooks.
  • Autoscaler The adaptive batching logic in gpu_tokenizer.autoscaler keeps GPU utilization in the target band. Refer to Module guide → Autoscaler for heuristics and subclassing advice.
  • Streaming I/O Dataset loaders and IO helpers manage compressed shards, worker pools, and synthetic corpora. Explore Module guide → Streaming I/O to customize ingestion paths.

CLI & Benchmark Navigation

  • Discover commands main.py is the CLI entry point; run python main.py --help to enumerate subcommands. Each train-* action is registered inside the build_parser helper alongside shared arguments.
  • Command implementations The BPE flow lives in gpu_tokenizer/cli_train_bpe.py, which binds argument parsing to the GPUBPETrainer. Mirror its structure when adding new CLI frontends so trainers remain reusable.
  • Benchmark utilities Reusable harnesses, corpus generators, and reporting helpers reside under benchmarks/. Pair them with python main.py benchmark for quick comparisons, or import them directly in notebooks to script bespoke experiments.
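Registering an extra subcommand next to the existing train-* actions might look like the following argparse sketch. It assumes build_parser returns an argparse.ArgumentParser built with subparsers; the real signature may differ, and the train-custom command, its --rounds flag, the --out-dir default, and the handler function are all hypothetical.

```python
import argparse


def _cmd_train_custom(args: argparse.Namespace) -> str:
    """Hypothetical handler; a real one would construct and run a trainer."""
    return f"custom training on {len(args.data)} pattern(s) for {args.rounds} round(s)"


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="main.py")
    subparsers = parser.add_subparsers(dest="command", required=True)

    # Arguments shared by every subcommand live on a parent parser.
    shared = argparse.ArgumentParser(add_help=False)
    shared.add_argument("--data", nargs="+", required=True,
                        help="one or more glob patterns")
    shared.add_argument("--out-dir", default="./artifacts")

    # The new command inherits the shared flags and adds its own.
    custom = subparsers.add_parser("train-custom", parents=[shared],
                                   help="hypothetical example command")
    custom.add_argument("--rounds", type=int, default=1)
    custom.set_defaults(handler=_cmd_train_custom)
    return parser


args = build_parser().parse_args(
    ["train-custom", "--data", "data/a.txt", "data/b.txt", "--rounds", "3"]
)
message = args.handler(args)
print(message)
```

Keeping the handler separate from the parser wiring lets the trainer be reused outside the CLI, in line with the advice above about mirroring the BPE frontend.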

Additional guides and API notes can be added under the docs/ directory as the project grows.

Contributing

  1. Fork the repository and create a virtual environment.
  2. Install development dependencies (see pyproject.toml if present, or the requirements/ directory).
  3. Format your changes and ensure tests pass via pytest.
  4. Open a pull request describing your changes and include benchmark results when appropriate.
  5. The "Tests" GitHub Actions workflow runs syntax linting, evaluation-schema validation, and the full pytest suite on every push and pull request so contributors get immediate feedback.

License

This project is licensed under the Mozilla Public License 2.0.