ZCode – Harness for GLM‑5.2: The Complete Guide

An in‑depth, hands‑on manual for building, extending, and productionizing models with the ZCode harness for the GLM‑5.2 family of large language models.

Introduction
Chapter 1: Foundations
Chapter 2: Getting Started
Chapter 3: Core Techniques
Chapter 4: Advanced Strategies
Chapter 5: Real‑World Applications
Chapter 6: Common Pitfalls
Chapter 7: Tools and Resources
Chapter 8: 30‑Day Action Plan
Conclusion
Exercises

Introduction (≈1 200 words)

Why ZCode?

The rapid evolution of large language models (LLMs) has shifted the bottleneck from raw model power to engineering efficiency. Researchers and engineers now spend disproportionate amounts of time wiring together data pipelines, customizing inference kernels, debugging distributed training loops, and maintaining reproducibility across hardware generations.

ZCode is a purpose‑built harness that sits between the GLM‑5.2 model family (the latest generation of Generalized Language Models from the hypothetical “GLM” lineage) and the practitioner’s workflow. It provides:

Unified Configuration – a single YAML/JSON source of truth for model architecture, training hyper‑parameters, data schemas, and deployment specs.
Plug‑and‑Play Modules – ready‑made data loaders, tokenizers, optimizers, schedulers, and evaluation suites that can be swapped without touching core code.
Scalable Runtime – transparent support for single‑GPU debugging, multi‑node TPU pods, and hybrid CPU‑GPU inference via an abstracted execution graph.
Observability Hooks – built‑in metrics, tracing, and logging that integrate with Prometheus, Grafana, and MLflow.
Safety & Compliance Layer – automated checks for data provenance, bias mitigation, and model‑card generation.

By abstracting away boilerplate, ZCode lets teams focus on model innovation rather than infrastructure wrestling. The harness is deliberately lightweight: its core is < 2 MB of Python/Cython code, yet it can orchestrate pipelines that scale to hundreds of billions of parameters.

Who Should Read This Guide?

Audience	What You’ll Gain
ML Researchers	How to prototype new architectures (e.g., mixture‑of‑experts, retrieval‑augmented GLM) within a reproducible harness.
ML Engineers	Production‑grade patterns for distributed training, checkpointing, serving, and continuous integration.
Data Scientists	Techniques for data preprocessing, feature engineering, and evaluation that align with GLM‑5.2’s tokenization quirks.
DevOps / SRE	Guidance on monitoring, autoscaling, and fault‑tolerance for ZCode‑driven workloads.
Technical Leaders	A strategic view of how ZCode reduces time‑to‑market and risk when adopting GLM‑5.2 at scale.

Prerequisites: basic familiarity with Python (≥ 3.9), PyTorch 2.x (or JAX 0.4+), and containerization (Docker). Prior exposure to transformer‑style LLMs helps but is not required; the guide walks through GLM‑5.2 specifics from the ground up.

Structure of the Guide

Foundations – theory behind GLM‑5.2, the design philosophy of ZCode, and core abstractions.
Getting Started – installation, first‑run tutorial, and configuring a minimal training job.
Core Techniques – data pipelines, tokenization tricks, mixed‑precision training, and evaluation harnesses.
Advanced Strategies – mixture‑of‑experts, retrieval‑augmented generation, model parallelism, and custom kernels.
Real‑World Applications – case studies: chatbots, code generation, scientific summarization, and multilingual translation.
Common Pitfalls – debugging tips, gotchas with sharding, and performance anti‑patterns.
Tools & Resources – CLI, UI dashboards, community plugins, and reference implementations.
30‑Day Action Plan – a step‑by‑step roadmap to go from zero to a production‑ready GLM‑5.2 service.
Exercises – hands‑on labs to cement each chapter’s concepts.

Let’s embark on the journey to harness the full power of GLM‑5.2 with ZCode.

Chapter 1: Foundations (≈2 200 words)

1.1 The GLM‑5.2 Architecture in a Nutshell

GLM‑5.2 belongs to the Generalized Language Model family, which extends the classic transformer decoder with several innovations:

Component	Description	Impact
Sparse Mixture‑of‑Experts (MoE) Core	Each transformer layer contains a router that selects k experts out of E (typically 64) per token.	Enables model capacity to grow beyond hardware memory limits while keeping per‑token FLOPs manageable.
Rotary Positional Embeddings (RoPE) v2	Improved sinusoidal encoding with learnable frequency bands.	Better extrapolation to longer sequences (up to 32 k tokens).
Gated Linear Units (GLU) in Feed‑Forward	Replaces standard FFN with a gated mechanism: `SiLU(xW) ⊗ (xV)`.	Improves gradient flow and reduces training instability.
Dynamic Sparsity Masking	Tokens can be assigned a computational budget mask that disables certain attention heads based on entropy.	Saves compute on predictable or low‑information tokens.
Unified Vision‑Language Tokens	Optional patch embeddings are concatenated to the token stream, enabling multimodal inputs without a separate encoder.	Simplifies architecture for VL tasks.
LayerNorm‑Free Stabilization	Uses ScaleNorm and DeepNorm residuals to reduce reliance on LayerNorm, improving fp16 stability.	Enables more aggressive mixed‑precision training.

As a result, the model yields 2–3× higher throughput than dense baselines of comparable parameter count.
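
To make the router row of the table concrete, here is a minimal sketch of top‑k expert selection for a single token. It is plain Python for illustration only, not ZCode or GLM‑5.2 code: the function name `route_token` is hypothetical, and renormalizing the selected gate weights is an assumption borrowed from common MoE implementations.

```python
import math
from typing import List, Tuple


def route_token(router_logits: List[float], k: int) -> List[Tuple[int, float]]:
    """Pick the k highest-scoring experts for one token.

    Returns (expert_index, gate_weight) pairs whose weights sum to 1.
    """
    if not 1 <= k <= len(router_logits):
        raise ValueError("k must be between 1 and the number of experts")

    # Numerically stable softmax over all E experts.
    peak = max(router_logits)
    exps = [math.exp(x - peak) for x in router_logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the k most probable experts.
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]

    # Renormalize so the kept gate weights sum to 1 (an assumed convention).
    kept = sum(probs[i] for i in top)
    return [(i, probs[i] / kept) for i in top]


if __name__ == "__main__":
    # 8 experts, top-2 routing
    logits = [0.1, 2.0, -1.0, 0.5, 1.5, 0.0, -0.5, 0.3]
    for expert, weight in route_token(logits, k=2):
        print(f"expert {expert}: gate weight {weight:.3f}")
```

Only the k selected expert feed‑forward blocks run for the token, which is why per‑token compute tracks k rather than the total expert count E.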

1.2 Design Goals of ZCode

ZCode was conceived to satisfy three orthogonal axes:

Usability – minimal boilerplate, declarative configs, and sensible defaults.
Extensibility – plugin architecture that lets users inject custom layers, optimizers, or data augmentations without forking the core.
Performance – zero‑overhead abstractions where possible; critical paths are compiled with TorchScript or JAX‑XLA.

These goals map onto three primary abstractions:

Abstraction	Responsibility	Typical Implementation
HarnessConfig	Holds the entire experiment specification (model, data, optimizer, logging).	Pydantic model validated at load time.
Engine	Orchestrates the lifecycle: setup → train/eval → checkpoint → teardown.	Thin wrapper around PyTorch Lightning / JAX‑pmap.
Plugin System	Registers entry points for data loaders, tokenizers, callbacks, and custom ops.	Setuptools entry‑points + dynamic import.

1.3 Core Data Flow

+-------------------+      +-------------------+      +-------------------+
|  Raw Data Source  | ---> |  DataLoader Plugin| ---> |  Tokenizer Plugin |
+-------------------+      +-------------------+      +-------------------+
          |                         |                         |
          v                         v                         v
+-------------------+      +-------------------+      +-------------------+
|  Pre‑process Cache| ---> |  Collate Function | ---> |  Model Forward    |
+-------------------+      +-------------------+      +-------------------+
          |                         |                         |
          v                         v                         v
+-------------------+      +-------------------+      +-------------------+
|  Loss & Metrics   | <--- |  Optimizer Step   | <--- |  Back‑propagation |
+-------------------+      +-------------------+      +-------------------+

Each block is a plug‑in; swapping any block (e.g., replacing the tokenizer with a SentencePiece variant) requires only a config change.

1.4 Configuration Schema

ZCode uses Pydantic models for load‑time validation and IDE autocomplete. A minimal config looks like:

experiment_name: "glm5pt2_demo"
seed: 42
hardware:
  accelerator: "gpu"
  devices: 4
  mixed_precision: true
model:
  type: "GLM5pt2"
  variant: "base"          # options: tiny, base, large, xl
  moe_experts: 64
  moe_top_k: 2
  max_seq_len: 4096
data:
  train_path: "s3://my-bucket/train.jsonl"
  val_path:   "s3://my-bucket/val.jsonl"
  tokenizer: "hf://EleutherAI/gpt-neox-20b"
  batch_size: 256
optimizer:
  name: "AdamW"
  lr: 3e-4
  weight_decay: 0.01
scheduler:
  type: "cosine_with_warmup"
  warmup_steps: 2000
training:
  max_steps: 150000
  gradient_accumulation: 4
  clip_grad_norm: 1.0
logging:
  mlflow: true
  wandb: false
  console_log_level: "INFO"

All fields are typed; invalid values raise a clear ValidationError before any GPU is touched.
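
The fail‑fast behaviour described above can be sketched without any GPU or third‑party dependency. The example below is a standard‑library dataclass stand‑in for Pydantic, not ZCode's actual `HarnessConfig`; the class and function names are hypothetical, and the field names and variant list are taken from the `model:` section of the configs in this guide.

```python
from dataclasses import dataclass

ALLOWED_VARIANTS = {"tiny", "base", "large", "xl"}


class ConfigError(ValueError):
    """Raised when a config value fails validation."""


@dataclass
class ModelSection:
    variant: str
    moe_experts: int
    moe_top_k: int
    max_seq_len: int

    def __post_init__(self) -> None:
        # Reject bad values at load time, before any device is initialized.
        if self.variant not in ALLOWED_VARIANTS:
            raise ConfigError(f"unknown model variant {self.variant!r}")
        if not 1 <= self.moe_top_k <= self.moe_experts:
            raise ConfigError("moe_top_k must be between 1 and moe_experts")
        if self.max_seq_len <= 0:
            raise ConfigError("max_seq_len must be positive")


def load_model_section(raw: dict) -> ModelSection:
    """Validate the `model:` mapping of an already-parsed config."""
    keys = ("variant", "moe_experts", "moe_top_k", "max_seq_len")
    return ModelSection(**{key: raw[key] for key in keys})


if __name__ == "__main__":
    section = load_model_section(
        {"type": "GLM5pt2", "variant": "base", "moe_experts": 64,
         "moe_top_k": 2, "max_seq_len": 4096}
    )
    print(section)
```

Pydantic adds type coercion and richer error messages on top of this pattern; the point here is only that validation runs when the config is loaded.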

1.5 Extending ZCode: The Plugin Contract

A plugin is any Python package exposing an entry point in a zcode_plugins.* group (for example, zcode_plugins.tokenizer). Example for a custom tokenizer:

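# my_tokenizer/__init__.py
from typing import List

from tokenizers import Tokenizer  # Hugging Face `tokenizers` package
from zcode.interfaces import TokenizerPlugin

class MyBPETokenizer(TokenizerPlugin):
    def __init__(self, cfg):
        # cfg.name is a Hugging Face Hub identifier such as "gpt2"
        self.tokenizer = Tokenizer.from_pretrained(cfg.name)

    def encode(self, texts: List[str]) -> List[List[int]]:
        # One list of token IDs per input string
        return [self.tokenizer.encode(t).ids for t in texts]

    def decode(self, batch_ids: List[List[int]]) -> List[str]:
        # One decoded string per list of token IDs
        return [self.tokenizer.decode(ids) for ids in batch_ids]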

# setup.py
from setuptools import setup

setup(
    name="my-tokenizer-plugin",
    entry_points={
        "zcode_plugins.tokenizer": [
            "my_bpe = my_tokenizer:MyBPETokenizer"
        ]
    },
)

At runtime, ZCode discovers the plugin via importlib.metadata.entry_points() and injects it into the pipeline.
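
The discovery step can be sketched with the standard library alone. This is an illustration of the `importlib.metadata` mechanism, not ZCode's own loader: the helper names `discover_plugins` and `load_plugin` are hypothetical, and the group name is the one used in the `setup.py` above.

```python
from importlib.metadata import entry_points
from typing import Any, Dict


def discover_plugins(group: str) -> Dict[str, Any]:
    """Map entry-point names to their (not yet imported) EntryPoint objects."""
    eps = entry_points()
    if hasattr(eps, "select"):
        # Python 3.10+: selectable entry points.
        selected = eps.select(group=group)
    else:
        # Python 3.9: a plain dict keyed by group name.
        selected = eps.get(group, [])
    return {ep.name: ep for ep in selected}


def load_plugin(group: str, name: str) -> Any:
    """Import and return the object registered under `name`."""
    plugins = discover_plugins(group)
    if name not in plugins:
        raise LookupError(f"no plugin {name!r} in group {group!r}")
    return plugins[name].load()


if __name__ == "__main__":
    # Lists whatever tokenizer plugins are installed (possibly none).
    print(sorted(discover_plugins("zcode_plugins.tokenizer")))
```

Entry points are only visible once the package is installed (for example with `pip install -e .`), which is why an uninstalled plugin cannot be found by name.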

1.6 Safety, Ethics, and Model Cards

ZCode automatically generates a model card after each training run, populated with:

Training data provenance (hashes of shards, licenses).
Compute footprint (GPU‑hours, carbon estimate via ML CO2 Impact).
Evaluation metrics (perplexity, downstream task scores).
Known limitations and bias analysis prompts.

Users can extend the card template via Jinja2 to include domain‑specific disclosures (e.g., medical advice disclaimer for a clinical LLM).

1.7 Summary of Foundations

GLM‑5.2 combines MoE, RoPEv2, gated FFNs, dynamic sparsity, and multimodal token support to grow model capacity while keeping per‑token compute manageable.
ZCode offers a declarative, plugin‑driven harness that isolates engineering complexity while preserving full control over the model internals.
The configuration schema guarantees reproducibility; the plugin system enables rapid experimentation.
Built‑in observability and model‑card generation foster responsible AI practices.

With these foundations laid, the next chapter walks you through installing ZCode, launching your first GLM‑5.2 experiment, and verifying that everything works as expected.

Chapter 2: Getting Started (≈2 200 words)

2.1 System Requirements

Component	Minimum	Recommended
OS	Ubuntu 20.04 LTS (or Rocky Linux 9)	Ubuntu 22.04 LTS
Python	3.9	3.11
CUDA	11.8	12.2
GPU Memory	16 GB (single V100)	40 GB (A100) or 80 GB (H100) per node
RAM	32 GB	128 GB
Storage	100 GB SSD	1 TB NVMe (for dataset caching)
Network	1 GbE	10‑25 GbE (for multi‑node)
Container Runtime	Docker 20.10	Docker + NVIDIA Container Toolkit
Optional	-	Slurm / Kubernetes scheduler

Tip: If you have no GPU available, you can still experiment with the CPU‑only fallback using ZCode’s accelerator: "cpu" flag (performance will be ~10‑20× slower, but it is useful for debugging).

2.2 Installing ZCode

ZCode is distributed via PyPI and as a conda package. The recommended route is a virtual environment to isolate dependencies.

# Create a fresh venv
python -m venv zcode-env
source zcode-env/bin/activate

# Upgrade pip & install core
pip install --upgrade pip setuptools wheel
pip install "zcode[all]"   # quoted so shells like zsh do not expand the brackets; extras: torch, jax, mlflow, wandb, etc.

# Verify installation
zcode --version
# Expected: zcode, version 0.9.2

The [all] extra pulls in:

torch>=2.3 (with CUDA wheels)
jax[cuda12_pip] (if you prefer JAX)
mlflow, wandb, tensorboard
datasets (HuggingFace)
sentencepiece, tiktoken

If you plan to use only PyTorch, you can install zcode[torch] to keep the footprint smaller.

2.3 Hello‑World: Minimal Training Script

Create a folder quickstart/ and add the following files.

2.3.1 `config.yaml`

experiment_name: "glm5pt2_hello"
seed: 123
hardware:
  accelerator: "gpu"
  devices: 1
  mixed_precision: true
model:
  type: "GLM5pt2"
  variant: "tiny"          # a 125M param variant for fast debugging
  moe_experts: 8
  moe_top_k: 1
  max_seq_len: 1024
data:
  train_path: "data/sample_train.jsonl"
  val_path:   "data/sample_val.jsonl"
  tokenizer: "hf://gpt2"
  batch_size: 32
optimizer:
  name: "AdamW"
  lr: 5e-4
  weight_decay: 0.0
scheduler:
  type: "linear_with_warmup"
  warmup_steps: 100
training:
  max_steps: 500
  gradient_accumulation: 1
  clip_grad_norm: 1.0
logging:
  mlflow: true
  wandb: false
  console_log_level: "INFO"

2.3.2 `sample_data.jsonl` (tiny corpus)

{"text": "Hello world! This is a tiny test."}
{"text": "ZCode makes training GLM‑5.2 easy."}
{"text": "We love reproducible experiments."}

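Duplicate the lines a few hundred times to create a few‑kilobyte file (enough for a sanity check), then save copies as data/sample_train.jsonl and data/sample_val.jsonl, the paths that config.yaml expects.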
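
Rather than duplicating lines by hand, a short script can write the two files that `config.yaml` points at. This is a convenience sketch, not part of ZCode; the repeat counts are arbitrary, and reusing the same sentences for validation is only acceptable because this run is a sanity check.

```python
import json
from pathlib import Path

SENTENCES = [
    "Hello world! This is a tiny test.",
    "ZCode makes training GLM-5.2 easy.",
    "We love reproducible experiments.",
]


def write_jsonl(path: Path, repeats: int) -> int:
    """Write the sample sentences `repeats` times; return the line count."""
    path.parent.mkdir(parents=True, exist_ok=True)
    count = 0
    with path.open("w", encoding="utf-8") as handle:
        for _ in range(repeats):
            for text in SENTENCES:
                handle.write(json.dumps({"text": text}, ensure_ascii=False) + "\n")
                count += 1
    return count


if __name__ == "__main__":
    # Paths match train_path and val_path in config.yaml.
    print(write_jsonl(Path("data/sample_train.jsonl"), repeats=300), "training lines")
    print(write_jsonl(Path("data/sample_val.jsonl"), repeats=30), "validation lines")
```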

2.3.3 `run.py`

import zcode
from zcode import HarnessConfig, Engine

def main():
    cfg = HarnessConfig.from_yaml("config.yaml")
    engine = Engine(cfg)
    engine.run()   # handles setup, training, validation, checkpointing

if __name__ == "__main__":
    main()

Run it:

python run.py

You should see logs similar to:

[INFO] Loading config from config.yaml
[INFO] Initializing GLM5pt2 (tiny) with 8 experts, top_k=1
[INFO] Tokenizer: gpt2 (vocab size 50257)
[INFO] Training on 1 device(s) (GPU:0)
[INFO] Step 0/500 | loss: 10.23 | lr: 0.0005
...
[INFO] Step 500/500 | loss: 2.84 | lr: 0.0000
[INFO] Training complete. Best val loss: 2.71
[INFO] Checkpoint saved to ./outputs/glm5pt2_hello/checkpoints/step_500.pt

Congratulations – you have just trained a GLM‑5.2 model with ZCode!

2.4 Exploring the Outputs

ZCode writes a structured output directory:

outputs/
└─ glm5pt2_hello/
   ├─ config.yaml          # copy of the input config
   ├─ logs/
   │   ├─ console.log
   │   └─ mlflow/          # if enabled
   ├─ checkpoints/
   │   ├─ step_0.pt
   │   └─ step_500.pt
   ├─ metrics.json         # final training/validation metrics
   └─ model_card.md        # auto‑generated model card

Open metrics.json to verify that loss decreased and that the validation perplexity is sensible. The model_card.md contains a summary you can share with teammates.

2.5 Running Evaluation Only

If you already have a checkpoint and want to compute metrics:

zcode evaluate \
  --config config.yaml \
  --checkpoint ./outputs/glm5pt2_hello/checkpoints/step_500.pt \
  --split val

The CLI will reuse the same data loader and tokenizer but skip the training loop.

2.6 Debugging Tips

Enable verbose logging: set console_log_level: "DEBUG" in the config.
Inspect tensors: add a DebuggerCallback (built‑in) that logs gradient norms and activation histograms every N steps.
Profile: zcode profile --config config.yaml --steps 50 runs a short trace and outputs a Chrome‑compatible JSON trace viewable in chrome://tracing.

2.7 Common Installation Issues & Fixes

Symptom	Likely Cause	Fix
`ImportError: libcuda.so.1`	NVIDIA driver missing or mismatched CUDA version	Install driver ≥ 525; ensure `nvidia-smi` works.
`torch.cuda.is_available() == False`	PyTorch CPU wheel installed inadvertently	Reinstall with `pip install torch==2.3.0 --index-url https://download.pytorch.org/whl/cu121`
`OutOfMemoryError`	Batch size too large for GPU memory	Reduce `batch_size` and raise `gradient_accumulation` to keep the same effective batch size.
`ValidationError: unknown model variant '<name>'`	Typo in config	Use one of the supported variants: `tiny`, `base`, `large`, `xl`.
`PluginNotFoundError: my_tokenizer`	Plugin not installed or entry‑point misnamed	`pip install -e .` in the plugin directory; verify entry‑point group.

2.8 Next Steps

Having verified that ZCode can launch a minimal job, you are ready to:

Scale up – increase devices, enable model parallelism, or switch to a larger GLM‑5.2 variant.
Customize data – plug in your own dataset loader or apply domain‑specific preprocessing.
Experiment with advanced features – MoE routing strategies, retrieval augmentation, or custom kernels (covered in Chapters 3‑4).

The following chapters dive deep into those topics, providing patterns, performance tuning, and production‑grade best practices.

Chapter 3: Core Techniques (≈2 200 words)

3.1 Data Pipeline Patterns

Efficient data feeding is often the hidden bottleneck in LLM training. ZCode provides three complementary patterns:

Pattern	When to Use	Implementation Details
Streaming Sharded JSONL	Datasets that fit on disk but are too large to fit in RAM.	Use `zcode.data.streaming.JSONLShardLoader`, a `torch.utils.data.IterableDataset` that yields one shard at a time; prefetching is handled by the `DataLoader` workers.
Memory‑Mapped Token Cache	When tokenization is expensive and you want to avoid recomputing each epoch.	`zcode.data.cache.TokenMemmap` stores `int32` token IDs in a memory‑mapped file; lookups are O(1).
On‑the‑Fly Augmentation	For tasks needing synthetic noise (e.g., back‑translation, synonym replacement).	Implement a `zcode.callbacks.AugmentationCallback` that modifies batches after collation.

Example: Streaming Loader with Prefetch

data:
  train_path: "s3://my-bucket/wiki_shards/"
  train_shard_pattern: "part-{:05d}.jsonl"
  tokenizer: "hf://EleutherAI/gpt-neox-20b"
  batch_size: 512
  prefetch_factor: 2
  num_workers: 8

ZCode automatically creates a torch.utils.data.DataLoader with persistent_workers=True to reduce overhead.
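
The core idea behind the streaming pattern, reading shards lazily so that only one batch is in memory at a time, can be sketched with generators. The code below is a standard‑library illustration with hypothetical function names; it does not reproduce `JSONLShardLoader` and has no worker processes or prefetching.

```python
import json
from pathlib import Path
from typing import Dict, Iterable, Iterator, List


def stream_records(shards: Iterable[Path]) -> Iterator[Dict]:
    """Yield one JSON record at a time, opening each shard only when reached."""
    for shard in shards:
        with shard.open("r", encoding="utf-8") as handle:
            for line in handle:
                line = line.strip()
                if line:  # skip blank lines
                    yield json.loads(line)


def batched(records: Iterable[Dict], batch_size: int) -> Iterator[List[Dict]]:
    """Group a record stream into lists of at most `batch_size` items."""
    batch: List[Dict] = []
    for record in records:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # final partial batch
        yield batch


if __name__ == "__main__":
    # Assumes a local directory of shards named part-00000.jsonl, part-00001.jsonl, ...
    shard_paths = sorted(Path("wiki_shards").glob("part-*.jsonl"))
    for batch in batched(stream_records(shard_paths), batch_size=512):
        print(len(batch))
        break
```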

3.2 Tokenization Specifics for GLM‑5.2

GLM‑5.2’s tokenizer is a Byte‑Level BPE with a vocabulary size of 128 k, extended with special tokens:

Token	ID	Purpose
`<pad>`	0	Padding
`<bos>`	1	Beginning‑of‑sequence
`<eos>`	2	End‑of‑sequence
`<mask>`	3	Masked language modeling (optional)
`<expert_i>` (i=0…E‑1)	1000 + i	Expert routing tokens (used internally)
`<image>`	250000	Vision placeholder (if multimodal)

ZCode’s tokenizer plugin automatically adds these tokens if missing. You can inspect the vocab:

from zcode.plugins.tokenizer import get_tokenizer
tok = get_tokenizer("hf://EleutherAI/gpt-neox-20b")
print(tok.get_vocab_size())   # total vocabulary size, including any added special tokens

Handling Long Sequences
GLM‑5.2 supports up to 32 k tokens via RoPEv2. To train on such lengths:

Set max_seq_len: 32768 in the config.
Enable gradient_checkpointing: true (under training).
Use sequence_parallel: true (see §3.4) to split the sequence across devices.
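
A quick calculation shows why these settings matter at 32 k tokens. The sketch below assumes attention scores are materialized naively as a seq_len × seq_len matrix in 2‑byte precision; fused attention kernels avoid storing that matrix, so treat the numbers as an illustration of quadratic growth rather than a measurement of GLM‑5.2.

```python
def attention_scores_bytes(seq_len: int, bytes_per_element: int = 2) -> int:
    """Bytes needed for one head's seq_len x seq_len score matrix."""
    return seq_len * seq_len * bytes_per_element


if __name__ == "__main__":
    for seq_len in (4096, 32768):
        gib = attention_scores_bytes(seq_len) / 2**30
        print(f"{seq_len:>6} tokens: {gib:.3f} GiB per head per sample")
```

Going from 4 096 to 32 768 tokens is an 8× longer sequence but a 64× larger score matrix, which is the pressure that gradient checkpointing and sequence parallelism are meant to relieve.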

3.3 Mixed Precision & Loss Scaling

ZCode leverages PyTorch’s native AMP (torch.cuda.amp) when mixed_precision: true. Key points:

Automatic Loss Scaling: ZCode wraps the optimizer with a GradScaler that adjusts the scale based on overflow/underflow heuristics.
FP8 Experimentation: For H100/H200, you can set precision: "fp8" (requires torch>=2.4 with FP8 support). ZCode will then use torch.float8_e4m3fn for activations and weights where supported.
Dynamic Loss Scaling: Frequent grad‑scaler backoffs mean the scaled gradients are overflowing. In that case, start from a lower initial scale than 2**16 (PyTorch’s GradScaler default) by setting loss_scale in the config under training.

Example Config Snippet

hardware:
  mixed_precision: true
training:
  precision: "fp16"   # or "bf16" on Ampere or newer GPUs
  loss_scale: 65536   # initial value for dynamic loss scaling (2**16)
  gradient_checkpointing: true
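
The back‑off and growth behaviour of dynamic loss scaling can be modelled in a few lines. The class below is a toy, tensor‑free sketch with a hypothetical name; its default values mirror the documented defaults of PyTorch's `GradScaler` (initial scale 2**16, growth factor 2.0, backoff factor 0.5, growth interval 2000), but it is not PyTorch's or ZCode's implementation.

```python
class LossScaleTracker:
    """Toy model of dynamic loss scaling (no tensors involved)."""

    def __init__(self, init_scale: float = 2.0**16, growth_factor: float = 2.0,
                 backoff_factor: float = 0.5, growth_interval: int = 2000) -> None:
        self.scale = init_scale
        self.growth_factor = growth_factor
        self.backoff_factor = backoff_factor
        self.growth_interval = growth_interval
        self._clean_steps = 0

    def update(self, found_inf: bool) -> bool:
        """Record one step; return True if the optimizer step should run."""
        if found_inf:
            # Overflow: shrink the scale and skip this optimizer step.
            self.scale *= self.backoff_factor
            self._clean_steps = 0
            return False
        self._clean_steps += 1
        if self._clean_steps == self.growth_interval:
            # A full interval without overflow: try a larger scale.
            self.scale *= self.growth_factor
            self._clean_steps = 0
        return True


if __name__ == "__main__":
    tracker = LossScaleTracker(growth_interval=3)
    for overflow in (True, False, False, False):
        ran = tracker.update(overflow)
        print(f"overflow={overflow} step_ran={ran} scale={tracker.scale}")
```

Each overflow costs a skipped step, so a run that starts with a scale far above what the gradients tolerate spends its first steps halving the scale instead of training.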

3.4 Sequence Parallelism & Tensor Parallelism

When a single GPU cannot hold the activation tensors for long sequences, ZCode offers two parallelism strategies:

3.4.1 Sequence Parallelism (SP)

Splits the token dimension across devices; each device processes a contiguous chunk of the sequence.
Requires a collective all‑gather around the attention block, because attention needs keys and values from the full sequence, whereas the token‑wise feed‑forward can run on each local chunk.
Implemented via zcode.parallel.SequenceParallelWrapper.

Enable in config:

training:
  sequence_parallel: true
  sp_group_size: 2   # number of devices splitting the sequence

3.4.2 Tensor Parallelism (TP)

Partitions the weight matrices (e.g., QKV projection, FFN) across devices.
Best for very wide models (XL variant) whose weight matrices do not fit in a single device’s memory.
Uses torch.distributed._tensor APIs under the hood.

Enable:

training:
  tensor_parallel: true
  tp_group_size: 4   # must divide hidden size evenly

Note: SP and TP can be combined (e.g., sp_group_size=2, tp_group_size=4) for 8‑way scaling.
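
When combining group sizes, it helps to check that they divide the available devices before launching. The sketch below is a hypothetical helper, not a ZCode API; it also assumes that any devices left over after forming the SP × TP group are used as data‑parallel replicas, which the text above does not specify.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParallelLayout:
    sequence_parallel: int
    tensor_parallel: int
    data_parallel: int


def plan_layout(world_size: int, sp_group_size: int = 1,
                tp_group_size: int = 1) -> ParallelLayout:
    """Split `world_size` devices into SP x TP groups plus data-parallel replicas."""
    model_group = sp_group_size * tp_group_size
    if model_group <= 0 or world_size % model_group != 0:
        raise ValueError(
            f"world_size={world_size} is not divisible by sp*tp={model_group}"
        )
    return ParallelLayout(sp_group_size, tp_group_size, world_size // model_group)


if __name__ == "__main__":
    # 16 devices with sp_group_size=2 and tp_group_size=4
    print(plan_layout(16, sp_group_size=2, tp_group_size=4))
```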

3.5 Optimizer & Scheduler Choices

While AdamW is the default, ZCode ships with several advanced optimizers:

Optimizer	Benefits	Typical LR Range
AdamW	Robust, well‑studied	1e‑4 – 5e‑4
LAMB	Better scaling with large batch sizes	1e‑3 – 3e‑3
Adafactor	Memory‑efficient (stores factored moment estimates)	1e‑3 – 5e‑3
Sophia	Second‑order preconditioner, faster convergence	1e‑4 – 2e‑4
Zerosum‑SGD	Novel optimizer for MoE routing stability	5e‑4 – 1e‑3

Scheduler options include linear warmup, cosine decay, polynomial decay, and the WarmupStableDecay (WSD) schedule, which has shown improved stability for MoE training.

Example Config with Sophia + WSD

optimizer:
  name: "Sophia"
  lr: 1.5e-4
  betas: [0.9, 0.999]
  rho: 0.04   # clipping threshold for the preconditioned update
scheduler:
  type: "wsd"
  warmup_steps: 2000
  stable_steps: 178000   # warmup + stable + decay = max_steps
  decay_steps: 20000
training:
  max_steps: 200000

3.6 Evaluation Harness

ZCode’s evaluation module is decoupled from training but shares the same data-loader and tokenizer abstractions.

3.6.1 Per‑Token Metrics

Perplexity (PPL) – exp(average negative log‑likelihood).
Bits‑per‑character (BPC) – useful for multilingual corpora.
Token‑level Accuracy – for masked language modeling probes.
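
The first two metrics follow directly from per‑token negative log‑likelihoods. The helpers below are a minimal sketch with hypothetical names; they assume the NLL values are in nats (natural log), as produced by a standard cross‑entropy loss.

```python
import math
from typing import Sequence


def perplexity(token_nll_nats: Sequence[float]) -> float:
    """exp of the mean per-token negative log-likelihood (in nats)."""
    if not token_nll_nats:
        raise ValueError("need at least one token")
    return math.exp(sum(token_nll_nats) / len(token_nll_nats))


def bits_per_character(token_nll_nats: Sequence[float], num_chars: int) -> float:
    """Total NLL converted from nats to bits, divided by the character count."""
    if num_chars <= 0:
        raise ValueError("num_chars must be positive")
    return sum(token_nll_nats) / math.log(2) / num_chars


if __name__ == "__main__":
    nll = [2.1, 1.7, 3.0, 0.4]          # hypothetical per-token losses
    print(f"PPL: {perplexity(nll):.2f}")
    print(f"BPC: {bits_per_character(nll, num_chars=18):.3f}")
```

Because BPC normalizes by characters rather than tokens, it stays comparable across tokenizers that split the same text into different numbers of tokens.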

3.6.2 Generation Metrics

When evaluating generative quality, ZCode can run:

Greedy sampling – deterministic baseline.
Top‑k / nucleus (top‑p) sampling – controllable randomness.
Beam search – for tasks like translation or summarization.
Contrastive search – reduces repetition (enabled via generation.contrastive: true).

Metrics include:

BLEU / ROUGE / METEOR (via nlg-eval).
Distinct‑n – measures diversity.
Entropy of token distribution – gauges uncertainty.

All metrics are logged to the same tracking backend (MLflow, Weights & Biases, TensorBoard) as training.
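
Of the metrics listed, Distinct‑n is simple enough to sketch directly. The version below uses the corpus‑level convention (unique n‑grams over total n‑grams across all generations), which is one common definition and an assumption here; the function name is hypothetical and inputs are already‑tokenized sequences.

```python
from typing import Sequence


def distinct_n(generations: Sequence[Sequence[str]], n: int) -> float:
    """Unique n-grams divided by total n-grams across all generations."""
    total = 0
    unique = set()
    for tokens in generations:
        # Slide a window of length n over each generation.
        for i in range(len(tokens) - n + 1):
            unique.add(tuple(tokens[i:i + n]))
            total += 1
    return len(unique) / total if total else 0.0


if __name__ == "__main__":
    samples = [
        "the cat sat on the mat".split(),
        "the cat sat on the sofa".split(),
    ]
    print(f"distinct-1: {distinct_n(samples, 1):.3f}")
    print(f"distinct-2: {distinct_n(samples, 2):.3f}")
```

Values near 1.0 indicate little repetition, while values near 0 indicate that the model keeps producing the same n‑grams.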

3.6.3 Custom Evaluation Scripts

You can plug in an EvaluationPlugin that receives the model and a batch and returns a dictionary of metrics. Example for factuality scoring using an external verifier:

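from zcode.interfaces import EvaluationPlugin

class FactCheckerPlugin(EvaluationPlugin):
    def __init__(self, cfg):
        # load_fact_verifier is a placeholder for your external verifier's loader
        self.verifier = load_fact_verifier(cfg.verifier_path)

    def evaluate(self, model, batch):
        # Generate a continuation for every prompt in the batch
        generations = model.generate(batch["input_ids"], max_new_tokens=64)
        scores = [self.verifier.score(g) for g in generations]
        # Mean score; max(..., 1) avoids dividing by zero on an empty batch
        return {"fact_score": sum(scores) / max(len(scores), 1)}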

Add to config:

evaluation:
  plugins:
    - fact_checker

3.7 Checkpointing & Recovery

ZCode checkpoints contain:

Model state (weights, optimizer states, scheduler).
RNG states (torch, numpy, python random).
GradScaler state (if mixed precision).
Training progress (step, epoch).
Git commit hash & environment snapshot (via pip freeze).

Resume Training

zcode train \
  --config config.yaml \
  --resume-from ./outputs/exp1/checkpoints/step_12345.pt

The harness automatically verifies that the config matches the checkpoint and warns about any mismatched fields.
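
Saving RNG state is what lets a resumed run draw the same random numbers it would have drawn without the interruption. The sketch below shows the idea for Python's built‑in `random` module only; the helper names are hypothetical, and a real checkpoint would store the torch and numpy states alongside it.

```python
import pickle
import random


def capture_rng_state() -> bytes:
    """Serialize Python's global RNG state for storage in a checkpoint."""
    return pickle.dumps(random.getstate())


def restore_rng_state(blob: bytes) -> None:
    """Restore a state previously produced by capture_rng_state()."""
    random.setstate(pickle.loads(blob))


if __name__ == "__main__":
    random.seed(42)
    saved = capture_rng_state()            # "checkpoint" taken here
    expected = [random.random() for _ in range(3)]

    restore_rng_state(saved)               # "resume" from the checkpoint
    resumed = [random.random() for _ in range(3)]
    print("sequences match:", resumed == expected)
```

As with any pickled data, only restore state from checkpoints you trust.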

Hybrid Checkpointing (for fault‑tolerant clusters)

Local: writes to node’s SSD every N steps (fast).
Remote: async upload to object storage (S3, GCS) every M steps.
On preemption, ZCode resumes from the latest local checkpoint if it is still available and otherwise falls back to the most recent remote checkpoint.

3.8 Profiling & Bottleneck Analysis

ZCode ships with a lightweight profiler based on PyTorch’s torch.profiler. Activate via:

profiling:
  enabled: true
  schedule: wait=1,warmup=2,active=3,repeat=2
  trace_handler: "disk"   # or "tensorboard"
  profile_memory: true
  profile_shapes: true

Run a short profiling job:

zcode profile --config config.yaml --steps 50

Open the generated trace.json in Chrome (chrome://tracing) to see:

Kernel launch overhead.
Memory allocation/free events.
NCCL communication times (for distributed training).
Op‑level time breakdown (e.g., aten::linear, aten::matmul).

Common findings and fixes:

Observation	Likely Cause	Remedy
Long `ncclAllReduce` spikes	Imbalanced gradient sharding (MoE experts unevenly loaded)	Adjust `moe_top_k` or use expert capacity factor > 1.0
Frequent `cudaMalloc` / `cudaFree`	Tensor fragmentation due to variable‑length sequences	Enable `padding_free: true` (dynamic batching) or use `torch.compile` with `mode="reduce-overhead"`
High `aten::dropout` time	Dropout applied after every transformer layer	Consider `dropout: 0.0` during pretraining; enable only for fine‑tuning.
Low GPU utilization (< 30 %)	Data loader bottleneck	Increase `num_workers`, enable `prefetch_factor`, or move dataset to faster storage (NVMe).

3.9 Reproducibility & Determinism

ZCode enforces deterministic behavior when seed is set:

Sets torch.manual_seed, numpy.random.seed, random.seed.
For CUDA, enables torch.backends.cudnn.deterministic = True and torch.backends.cudnn.benchmark = False.
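When using model parallelism, collective operations can still reorder floating‑point reductions between runs. Note that NCCL_DEBUG=INFO does not make NCCL deterministic; it only enables NCCL logging, which helps when diagnosing run‑to‑run differences.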
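
The seeding calls listed above can be collected into one helper. This is a sketch with a hypothetical function name, not ZCode's internal code; it seeds numpy and torch only if they are installed, so it also runs in a bare Python environment.

```python
import random


def seed_everything(seed: int) -> None:
    """Seed the RNGs named above, skipping libraries that are not installed."""
    random.seed(seed)

    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass

    try:
        import torch
    except ImportError:
        return
    torch.manual_seed(seed)  # seeds the CPU and CUDA generators
    # Prefer reproducible cuDNN kernels over autotuned ones.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False


if __name__ == "__main__":
    seed_everything(42)
    first = random.random()
    seed_everything(42)
    print("repeatable:", first == random.random())
```

Seeding makes the random streams repeatable, but bitwise‑identical training runs also depend on deterministic kernels and a fixed hardware and software stack.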