Metadata-Version: 2.4
Name: tensor-optix
Version: 1.16.0
Summary: Autonomous training loop for any sequential learning model — PPO, DQN, SAC, TD3, Rainbow DQN, Recurrent PPO for TensorFlow, PyTorch, and JAX/Flax; distributed async actor-learner (IMPALA + V-trace)
Author: sup3rus3r
License-Expression: MIT
Requires-Python: >=3.11
Description-Content-Type: text/markdown
Requires-Dist: gymnasium[box2d]>=1.0.0
Requires-Dist: numpy>=1.24.0
Requires-Dist: matplotlib>=3.7.0
Requires-Dist: optuna>=3.0.0
Requires-Dist: swig>=4.4.1
Requires-Dist: pyyaml>=6.0
Provides-Extra: tensorflow
Requires-Dist: tensorflow>=2.18.0; extra == "tensorflow"
Provides-Extra: tensorflow-gpu
Requires-Dist: tensorflow[and-cuda]>=2.18.0; sys_platform == "linux" and extra == "tensorflow-gpu"
Provides-Extra: torch
Requires-Dist: torch>=2.11.0; extra == "torch"
Requires-Dist: torchvision; extra == "torch"
Requires-Dist: torchaudio; extra == "torch"
Provides-Extra: jax
Requires-Dist: flax>=0.12.6; extra == "jax"
Requires-Dist: jax>=0.10.0; extra == "jax"
Requires-Dist: optax>=0.2.8; extra == "jax"
Provides-Extra: cuda
Requires-Dist: nvidia-cuda-nvcc-cu12; sys_platform == "linux" and extra == "cuda"
Requires-Dist: tensorflow[and-cuda]>=2.18.0; sys_platform == "linux" and extra == "cuda"
Provides-Extra: box2d
Requires-Dist: gymnasium[box2d]>=1.0.0; extra == "box2d"
Provides-Extra: atari
Requires-Dist: gymnasium[atari]>=1.0.0; extra == "atari"
Requires-Dist: ale-py>=0.8; extra == "atari"
Provides-Extra: mujoco
Requires-Dist: gymnasium[mujoco]>=1.0.0; extra == "mujoco"
Provides-Extra: all
Requires-Dist: tensorflow>=2.18.0; extra == "all"
Requires-Dist: torch>=2.11.0; extra == "all"
Requires-Dist: torchvision; extra == "all"
Requires-Dist: torchaudio; extra == "all"
Requires-Dist: gymnasium[atari]>=1.0.0; extra == "all"
Requires-Dist: gymnasium[mujoco]>=1.0.0; extra == "all"
Requires-Dist: ale-py>=0.8; extra == "all"
Requires-Dist: jax>=0.10.0; extra == "all"
Requires-Dist: flax>=0.12.6; extra == "all"
Requires-Dist: optax>=0.2.8; extra == "all"
Provides-Extra: neuroevo
Requires-Dist: torch>=2.11.0; extra == "neuroevo"
Provides-Extra: onnx
Requires-Dist: onnx>=1.14; extra == "onnx"
Requires-Dist: onnxruntime>=1.16; extra == "onnx"
Requires-Dist: onnxscript>=0.1; extra == "onnx"
Provides-Extra: wandb
Requires-Dist: wandb>=0.16; extra == "wandb"
Provides-Extra: tensorboard
Requires-Dist: torch>=2.11.0; extra == "tensorboard"
Requires-Dist: tensorboard>=2.14; extra == "tensorboard"
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Requires-Dist: pytest-cov; extra == "dev"
Requires-Dist: black; extra == "dev"
Requires-Dist: ruff; extra == "dev"
Requires-Dist: mypy; extra == "dev"
Requires-Dist: twine>=5.0; extra == "dev"
Requires-Dist: torch>=2.11.0; extra == "dev"

# tensor-optix

tensor-optix is a training loop framework with statistical convergence control, online hyperparameter optimisation, and an optional neuroevolution subsystem for dynamic topology.

The core loop runs your agent against a pipeline, maintains a separate validation signal, and manages four states: ACTIVE, COOLING, DORMANT, and watchdog shutdown. Convergence is detected using a corrected t-test on the smoothed score slope plus lag-1 autocorrelation, not a fixed patience counter. Hyperparameters are updated every episode via SPSA gradient estimates, with automatic routing to momentum-based or sign-only updates depending on the autocorrelation structure of the score landscape. Checkpointing and rollback are driven by the validation signal only, never training score. On DORMANT, a MetaController evaluates the generalization gap and its slope to decide between spawning a policy variant, pruning the ensemble, or stopping.
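
The exact statistics live inside the loop's scheduler; the sketch below only illustrates the flavour of the convergence test (the window handling, smoothing constant, and critical value are assumptions, not tensor-optix internals).

```python
import numpy as np

def still_improving(scores: np.ndarray, alpha: float = 0.3, t_crit: float = 2.0) -> bool:
    """Illustrative check: t-test on the slope of EMA-smoothed scores, with the
    sample size deflated by the lag-1 autocorrelation of the residuals."""
    # Exponential smoothing of the raw episode scores
    smoothed = np.empty(len(scores), dtype=float)
    smoothed[0] = scores[0]
    for i in range(1, len(scores)):
        smoothed[i] = alpha * scores[i] + (1 - alpha) * smoothed[i - 1]

    # OLS slope and residuals over the window
    t = np.arange(len(smoothed), dtype=float)
    slope, intercept = np.polyfit(t, smoothed, 1)
    resid = smoothed - (slope * t + intercept)

    # Lag-1 autocorrelation of residuals -> effective sample size correction
    rho = np.corrcoef(resid[:-1], resid[1:])[0, 1] if len(resid) > 2 else 0.0
    rho = 0.0 if not np.isfinite(rho) else rho
    n_eff = max(len(resid) * (1 - rho) / (1 + rho), 3.0)

    # Standard error of the slope using the effective sample size
    se = np.sqrt(np.sum(resid ** 2) / (n_eff - 2) / np.sum((t - t.mean()) ** 2))
    return slope / max(se, 1e-12) > t_crit  # significantly positive slope -> keep training
```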

The neuroevo subsystem (`pip install tensor-optix[neuroevo]`) represents the policy as a `NeuronGraph`: a mutable directed graph of heterogeneous scalar neurons (point, GRU, LSTM, trainable-GRU, trainable-LSTM) with variable-delay edges. Around it:

- `TopologyController` runs as a loop callback and evaluates three independent signals per episode: improvement slope significance, residual autocorrelation structure, and gradient utilization across hidden neurons. All three must cross their thresholds before a grow operation fires. Pruning is by importance score (incident edge weight magnitude times mean absolute activation); merging is by Pearson correlation of per-episode activation histories.
- `TopologyAwareAdam` resets momentum state for parameters affected by any structural change.
- `BrainNetwork` composes multiple named `NeuronGraph` regions with sparse learnable inter-region edges.
- `HebbianHook` accumulates co-activation products across each episode and applies an Oja-style weight update after the PPO gradient step.
- `NeuromodulatorSignal` maps a `RegimeDetector` output (trending / ranging / volatile) to simultaneous changes in Hebbian learning rate, entropy coefficient, and topology grow/prune thresholds.

The entire system, including neuroevo, is accessed through a six-method `BaseAgent` interface:

```python
from abc import ABC
from typing import Any


class BaseAgent(ABC):
    def act(self, observation) -> Any: ...
    def learn(self, episode_data: EpisodeData) -> dict: ...
    def get_hyperparams(self) -> HyperparamSet: ...
    def set_hyperparams(self, hyperparams: HyperparamSet) -> None: ...
    def save_weights(self, path: str) -> None: ...
    def load_weights(self, path: str) -> None: ...
```

Fifteen agents are included across PyTorch, TensorFlow, and JAX/Flax. Bring your own by implementing the interface above.
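
For orientation, a uniform-random baseline that satisfies the interface might look like the sketch below (the import paths and the `HyperparamSet(params=...)` constructor are assumptions; consult the shipped agents for the exact contract).

```python
import numpy as np

from tensor_optix import BaseAgent, EpisodeData, HyperparamSet  # import paths assumed

class RandomDiscreteAgent(BaseAgent):
    """Minimal six-method agent: acts uniformly at random and learns nothing."""

    def __init__(self, n_actions: int):
        self.n_actions = n_actions
        self.hp = HyperparamSet(params={})

    def act(self, observation):
        return int(np.random.randint(self.n_actions))

    def learn(self, episode_data: EpisodeData) -> dict:
        return {}  # real agents return a dict of losses/metrics here

    def get_hyperparams(self) -> HyperparamSet:
        return self.hp

    def set_hyperparams(self, hyperparams: HyperparamSet) -> None:
        self.hp = hyperparams

    def save_weights(self, path: str) -> None:
        pass  # nothing to persist

    def load_weights(self, path: str) -> None:
        pass
```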

---

## Install

```bash
# Core loop only, no algorithm implementations
pip install tensor-optix

# PyTorch algorithms
pip install tensor-optix[torch]

# TensorFlow algorithms
pip install tensor-optix[tensorflow]

# JAX/Flax
pip install tensor-optix[jax]

# Neuroevo (requires torch)
pip install tensor-optix[neuroevo]

# GPU (Linux/WSL2, CUDA 12.8)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128
pip install tensor-optix[torch]

# All frameworks and environment extras
pip install tensor-optix[all]
```

Environment extras: `[box2d]`, `[atari]`, `[mujoco]`  
Logging extras: `[wandb]`, `[tensorboard]`  
Export: `[onnx]`

---

## Algorithms

All agents implement `BaseAgent` and can be used interchangeably with `RLOptimizer`.

### PyTorch

| Agent | Algorithm | Action Space |
|---|---|---|
| `TorchPPOAgent` | PPO + GAE-λ | Discrete |
| `TorchGaussianPPOAgent` | PPO | Continuous |
| `TorchRecurrentPPOAgent` | PPO + GRU/LSTM hidden state | Discrete |
| `TorchDQNAgent` | DQN + PER + n-step returns | Discrete |
| `TorchRainbowDQNAgent` | Rainbow DQN (NoisyNet, distributional, PER, n-step, dueling, double) | Discrete |
| `TorchSACAgent` | SAC, twin Q-critics, automatic entropy tuning | Continuous |
| `TorchTD3Agent` | TD3 | Continuous |

### TensorFlow

`TFPPOAgent`, `TFGaussianPPOAgent`, `TFDQNAgent`, `TFSACAgent`, `TFTD3Agent`

### JAX/Flax

`FlaxPPOAgent`

### Auto-selection

`make_agent` inspects the environment action space and returns a fully constructed agent.

```python
from tensor_optix import make_agent
import gymnasium as gym

env = gym.make("LunarLanderContinuous-v3")
agent = make_agent(env)                       # -> TorchSACAgent
agent = make_agent(env, framework="tf")       # -> TFSACAgent
agent = make_agent(env, deterministic=True)   # -> TorchTD3Agent
```

---

## Pipelines

A pipeline steps an environment (or data source), collects `EpisodeData`, and yields it to the agent. Three implementations are provided.

```python
from tensor_optix import BatchPipeline, LivePipeline, VectorBatchPipeline

# Gymnasium env: steps continuously, no reset between windows
pipeline = BatchPipeline(env=gym.make("CartPole-v1"), agent=agent, window_size=200)

# External data stream: background thread with bounded queue, configurable episode boundaries
pipeline = LivePipeline(
    data_source=MyFeed(),
    agent=agent,
    episode_boundary_fn=LivePipeline.every_n_seconds(300),
)

# N parallel envs via gymnasium.vector, sync or async subprocess
pipeline = VectorBatchPipeline(
    env_fns=[lambda: gym.make("CartPole-v1")] * 8,
    agent=agent,
    window_size=200,
)
```

---

## The loop

```python
from tensor_optix import RLOptimizer

opt = RLOptimizer(
    agent=agent,
    pipeline=pipeline,

    # Separate validation pipeline. All checkpoint and rollback decisions use val score only.
    val_pipeline=val_pipeline,
    rollback_on_degradation=True,

    # Optional external scorer run at checkpoint evaluation (e.g. held-out backtest)
    checkpoint_score_fn=lambda a: evaluate(a, held_out_env),

    # Convergence parameters
    dormant_threshold=10,            # consecutive episodes without improvement -> DORMANT
    min_episodes_before_dormant=50,  # statistical warmup before convergence detection activates
)

opt.run()
opt.best_snapshot   # -> PolicySnapshot: best weights + EvalMetrics + HyperparamSet
```

Loop state transitions: `ACTIVE` -> `COOLING` -> `DORMANT` -> watchdog shutdown or policy spawn.

On shutdown the loop restores best-known weights, not the final checkpoint.

---

## Hyperparameter optimisation

All optimisers operate in normalised [0, 1] parameter space and update every episode. No restarts required.

```python
from tensor_optix.optimizers import SPSAOptimizer, AdaptiveOptimizer

# SPSA: Rademacher perturbation vector, two-episode gradient estimate
optimizer = SPSAOptimizer(
    param_bounds={"learning_rate": (1e-4, 3e-3), "clip_ratio": (0.1, 0.3)},
    log_params={"learning_rate"},   # log-space normalisation for params spanning orders of magnitude
)

# AdaptiveOptimizer: routes between SPSA, Momentum, Backoff, and PBT
# based on lag-1 autocorrelation of the score stream and relative performance gap
optimizer = AdaptiveOptimizer(param_bounds={...})

opt = RLOptimizer(agent=agent, pipeline=pipeline, optimizer=optimizer)
```
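
For intuition, the two-episode SPSA estimate works roughly as follows (a generic sketch in normalised [0, 1] parameter space; `score_fn` stands for "run one episode with these hyperparameters and return the score" and is not a tensor-optix API).

```python
import numpy as np

def spsa_gradient(score_fn, theta: np.ndarray, c: float = 0.05) -> np.ndarray:
    """Perturb every parameter at once with a Rademacher vector, score each
    perturbation for one episode, and form a two-point gradient estimate."""
    delta = np.random.choice([-1.0, 1.0], size=theta.shape)  # Rademacher perturbation
    y_plus = score_fn(np.clip(theta + c * delta, 0.0, 1.0))   # episode at theta + c*delta
    y_minus = score_fn(np.clip(theta - c * delta, 0.0, 1.0))  # episode at theta - c*delta
    return (y_plus - y_minus) / (2.0 * c * delta)              # elementwise estimate

# Gradient-ascent update, clipped back into the normalised box:
# theta = np.clip(theta + lr * spsa_gradient(run_episode, theta), 0.0, 1.0)
```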

| Optimizer | Routing condition |
|---|---|
| `SPSAOptimizer` | i.i.d. score noise, no autocorrelation structure |
| `MomentumOptimizer` | Positive lag-1 autocorrelation (smooth landscape) |
| `BackoffOptimizer` | Negative lag-1 autocorrelation (oscillating landscape, sign-only updates) |
| `PBTOptimizer` | Score below 20th percentile of history (exploit checkpoint population) |
| `AdaptiveOptimizer` | Routes automatically based on the two signals above |
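
Schematically, the routing described in the table reduces to a few comparisons (the autocorrelation cutoffs below are illustrative, not the library's actual thresholds).

```python
def route(lag1_autocorr: float, score: float, history_p20: float) -> str:
    """Pick an update rule from the lag-1 autocorrelation of recent scores
    and the relative performance gap, mirroring the table above."""
    if score < history_p20:
        return "PBT"        # below the 20th percentile: exploit the checkpoint population
    if lag1_autocorr > 0.2:
        return "Momentum"   # positive autocorrelation: smooth landscape
    if lag1_autocorr < -0.2:
        return "Backoff"    # negative autocorrelation: oscillating, sign-only updates
    return "SPSA"           # no structure: treat score noise as i.i.d.
```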

### Trial-level search

`TrialOrchestrator` runs N independent short trials via Optuna TPE before the main run, then warm-starts from the best trial's weights and config.

```python
from tensor_optix import TrialOrchestrator

orch = TrialOrchestrator(
    agent_factory=make_agent,
    pipeline_factory=make_pipeline,
    param_space={
        "learning_rate": ("log_float", 1e-4, 3e-3),
        "clip_ratio":    ("float",     0.1,  0.3),
        "batch_size":    ("int",       32,   512),
    },
    n_trials=20,
    trial_episodes=50,
)
best_config = orch.run()
```

---

## Ensemble and policy evolution

`PolicyManager` runs as a loop callback. On each DORMANT event, `MetaController` evaluates the generalization gap (train minus val, normalised), its slope, and the validation improvement rate, then issues one of SPAWN, PRUNE, or STOP. Spawned variants are cloned from the best checkpoint with perturbed hyperparameters. `EnsembleAgent` wraps all active variants behind the `BaseAgent` interface, with weighted action averaging.

```python
from tensor_optix import PolicyManager

pm = PolicyManager(registry, max_spawns=4)
cb = pm.as_callback(agent, agent_factory=make_agent)
cb.set_stop_fn(opt.stop)
opt.add_callback(cb)
opt.run()
```

---

## Callbacks

```python
from tensor_optix.callbacks import RichDashboardCallback, WandbCallback, TensorBoardCallback

opt.add_callback(RichDashboardCallback())        # Rich live terminal panel
opt.add_callback(WandbCallback(project="run"))
opt.add_callback(TensorBoardCallback(log_dir="./tb"))
```

Custom callbacks subclass `LoopCallback` and override any of:

```python
class LoopCallback:
    def on_loop_start(self) -> None: ...
    def on_loop_stop(self) -> None: ...
    def on_episode_end(self, episode_id: int, eval_metrics) -> None: ...
    def on_improvement(self, snapshot) -> None: ...
    def on_plateau(self, episode_id: int, state) -> None: ...
    def on_dormant(self, episode_id: int) -> None: ...
    def on_degradation(self, episode_id: int, eval_metrics) -> None: ...
    def on_hyperparam_update(self, old: dict, new: dict) -> None: ...
```
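
For example, a callback that appends a CSV row on every improvement could look like the sketch below (the `LoopCallback` import path and the snapshot attribute names are assumptions).

```python
import csv

from tensor_optix.callbacks import LoopCallback  # import path assumed

class CSVImprovementLogger(LoopCallback):
    """Append episode id and validation score to a CSV whenever a new best snapshot is found."""

    def __init__(self, path: str = "improvements.csv"):
        self.path = path

    def on_loop_start(self) -> None:
        with open(self.path, "w", newline="") as f:
            csv.writer(f).writerow(["episode_id", "val_score"])

    def on_improvement(self, snapshot) -> None:
        # PolicySnapshot field names are assumptions; adjust to the real attributes
        with open(self.path, "a", newline="") as f:
            csv.writer(f).writerow([getattr(snapshot, "episode_id", ""),
                                    getattr(snapshot, "val_score", "")])
```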

---

## Distributed training (IMPALA + V-trace)

`AsyncActorLearner` implements IMPALA-style async actor-learner. N actor subprocesses read weights from shared memory (lock-free), collect trajectories, and push them to a queue. The learner dequeues trajectories, applies V-trace importance-sampling correction, and writes updated weights back to shared memory.

```python
from tensor_optix.distributed import AsyncActorLearner

learner = AsyncActorLearner(
    actor=actor,
    critic=critic,
    optimizer=optimizer,
    env_factory=lambda: gym.make("ALE/Pong-v5"),
    n_actors=8,
    trajectory_len=64,
)
stats = learner.run(max_steps=10_000_000)
# stats["steps_per_second"] -> ~4x single-process throughput on CPU
```
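
For reference, the V-trace target computation applied by the learner can be sketched generically as below (a plain numpy transcription of the published V-trace recursion, not tensor-optix internals; terminal-step masking is omitted for brevity).

```python
import numpy as np

def vtrace_targets(rewards, values, bootstrap_value, log_rhos,
                   gamma=0.99, clip_rho=1.0, clip_c=1.0):
    """Compute V-trace value targets v_t for one trajectory of length T.

    log_rhos: log(pi(a_t|x_t) / mu(a_t|x_t)) between learner and actor policies.
    bootstrap_value: V(x_T) for the state following the trajectory.
    """
    rhos = np.minimum(np.exp(log_rhos), clip_rho)        # truncated importance weights
    cs = np.minimum(np.exp(log_rhos), clip_c)
    values_tp1 = np.append(values[1:], bootstrap_value)   # V(x_{t+1})
    deltas = rhos * (rewards + gamma * values_tp1 - values)

    vs_minus_v = np.zeros(len(values))
    acc = 0.0
    for t in reversed(range(len(values))):                # backward recursion
        acc = deltas[t] + gamma * cs[t] * acc
        vs_minus_v[t] = acc
    return values + vs_minus_v                            # v_t targets for the critic loss
```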

---

## Neuroevo

`NeuronGraph` is a mutable directed graph of scalar neurons with variable-delay edges. `GraphAgent` wraps it as a `BaseAgent` with PPO-style weight learning. `TopologyController` mutates the graph live during the training loop.

```bash
pip install tensor-optix[neuroevo]
```

### Graph construction

```python
from tensor_optix.neuroevo import NeuronGraph, GraphAgent, GRUNeuron, LSTMNeuron

graph = NeuronGraph()

for _ in range(4):
    graph.add_neuron(role="input", activation="linear")
for _ in range(8):
    graph.add_neuron(role="hidden", activation="tanh")
    # or: graph.add_neuron(role="hidden", neuron=GRUNeuron())
    # or: graph.add_neuron(role="hidden", neuron=LSTMNeuron())
graph.add_neuron(role="output", activation="linear")  # last output neuron is the value head

graph.add_edge(src_id, dst_id, weight=0.0, delay=0)   # feedforward (d=0)
graph.add_edge(src_id, dst_id, weight=0.0, delay=1)   # recurrent (d>=1, reads from history buffer)

agent = GraphAgent(graph, obs_dim=4, n_actions=2)
```

All edges initialise at weight=0.0, which is function-preserving at insertion time: a zero-weight edge contributes nothing to its destination neuron, so the graph's output is unchanged until the optimiser moves the weight.

### Neuron types

| Type | Hidden state | Gradient through state |
|---|---|---|
| `Neuron` | None (point neuron) | N/A |
| `GRUNeuron` | Scalar h, detached | No |
| `LSTMNeuron` | Scalar h and c, detached | No |
| `TrainableGRUNeuron` | Scalar h, not detached | Yes, up to chunk_len steps |
| `TrainableLSTMNeuron` | Scalar h and c, not detached | Yes, up to chunk_len steps |

All types implement the same protocol: `step()`, `importance()`, `can_merge_with()`, `make_relay()`, `split_copy()`. `NeuronGraph` and `TopologyController` are type-blind.

### Trainable recurrent neurons

`TrainableGRUNeuron` and `TrainableLSTMNeuron` set `is_recurrent = True`. `RecurrentGraphAgent` detects this flag and switches from shuffled-minibatch PPO to sequential chunk training with truncated BPTT.

```python
from tensor_optix.neuroevo import TrainableGRUNeuron, TrainableLSTMNeuron, RecurrentGraphAgent

graph = NeuronGraph()
# ... input and output neurons ...
graph.add_neuron(role="hidden", neuron=TrainableGRUNeuron())
graph.add_neuron(role="hidden", neuron=TrainableLSTMNeuron())

agent = RecurrentGraphAgent(
    graph, obs_dim=4, n_actions=2,
    hyperparams=HyperparamSet(params={"chunk_len": 64}),
)
# Falls back to standard shuffled-minibatch PPO if no recurrent neurons are present
```

### Topology controller

```python
from tensor_optix.neuroevo import TopologyController

controller = TopologyController.for_graph(
    graph=graph,
    scheduler=opt._scheduler,
    grow_grad_threshold=0.7,         # fraction of hidden neurons with |grad| > eps required to grow
    prune_neuron_threshold=1e-4,     # importance score below this -> prune candidate
    prune_edge_threshold=1e-3,       # |weight| below this for prune_edge_patience episodes -> prune
    merge_similarity_threshold=0.95, # Pearson correlation threshold for merge
)
opt.add_callback(controller)
opt.run()
```

Grow fires only when all three signals agree (sketched after this list):
1. Improvement slope t-test is not significant (gradient updates are not making progress)
2. Score residuals have significant autocorrelation (capacity is underutilised)
3. Gradient utilization exceeds `grow_grad_threshold` (existing neurons are saturated)
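
Schematically (names and thresholds below are illustrative, not the controller's actual fields):

```python
def should_grow(slope_t_stat: float, t_crit: float,
                resid_lag1_corr: float, corr_crit: float,
                grad_utilization: float, grow_grad_threshold: float) -> bool:
    """All three conditions must hold before a grow operation fires."""
    not_improving = abs(slope_t_stat) < t_crit            # 1. slope t-test not significant
    structure_left = abs(resid_lag1_corr) > corr_crit     # 2. residuals still autocorrelated
    saturated = grad_utilization > grow_grad_threshold    # 3. most hidden neurons receive gradient
    return not_improving and structure_left and saturated
```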

For multi-region graphs, use `TopologyController.for_brain(brain, scheduler=...)`. Each region gets independent signal buffers and cooldown timers.

### BrainNetwork

```python
from tensor_optix.neuroevo import BrainNetwork, TopologyController

brain = BrainNetwork()
brain.add_region("sensory",   sensory_graph)
brain.add_region("memory",    memory_graph)
brain.add_region("executive", executive_graph)

brain.add_pathway("sensory",  "memory",    n_connections=8, delay=1)
brain.add_pathway("memory",   "executive", n_connections=8, delay=0)

controller = TopologyController.for_brain(brain, scheduler=opt._scheduler)
```

Inter-region edges are learnable parameters. Regions are executed in topological order each forward pass.

### Hebbian learning

`HebbianHook` applies an Oja-style local weight update after each episode. The rule is: `dw = eta * mean_t(h_pre * h_post) - lambda * w`. Call `record()` after each `act()` to accumulate co-activation products, then `apply()` after the PPO gradient step.

```python
from tensor_optix.neuroevo import HebbianHook

hook = HebbianHook(graph, hebbian_lr=1e-3, weight_decay=1e-4)

obs, _ = env.reset()
for step in episode:                       # iterate the steps of one episode
    action = agent.act(obs)
    hook.record()
    obs, reward, terminated, truncated, _ = env.step(action)  # gymnasium 5-tuple

agent.learn(episode_data)
hook.apply()
hook.reset()
```

Use `HebbianHook.from_brain(brain, ...)` for `BrainNetwork` graphs.

### Neuromodulation

`NeuromodulatorSignal` takes a `RegimeDetector` classification and applies coordinated parameter changes across `HebbianHook`, `GraphAgent`, and `TopologyController` simultaneously.

```python
from tensor_optix.neuroevo import NeuromodulatorSignal
from tensor_optix.core import RegimeDetector

detector = RegimeDetector()
mod = NeuromodulatorSignal(hook=hook, agent=agent, controller=controller)

regime = detector.detect(metrics_history)  # "trending" | "ranging" | "volatile"
mod.apply(regime)
# trending  -> lower entropy coefficient, reduce hebbian_lr (consolidate)
# volatile  -> raise entropy coefficient, raise grow thresholds (explore)
# ranging   -> raise hebbian_lr (local plasticity)
```

### Dale's Law

```python
# clamp mode (default): outgoing weights clamped post-step
graph = NeuronGraph(dale_mode="clamp")
graph.add_neuron(role="hidden", activation="relu", cell_type="excitatory")  # weights >= 0
graph.add_neuron(role="hidden", activation="tanh", cell_type="inhibitory")  # weights <= 0

# softplus mode: raw parameter theta, effective weight = softplus(theta) * sign
# gradient-safe, no dead zone at the clamp boundary
# enforce_dale() is a no-op in this mode
graph = NeuronGraph(dale_mode="softplus")
w = graph.effective_weight(edge_id)  # reads post-softplus value
```

### TopologyAwareAdam

Drop-in Adam replacement that resets (m, v) momentum state for parameters touched by a grow, prune, or merge operation. Stale momentum estimates from before a structural change would otherwise corrupt the first update on modified parameters.

```python
from tensor_optix.neuroevo import TopologyAwareAdam

optimizer = TopologyAwareAdam(graph.parameters(), lr=3e-4)
optimizer.notify_topology_change(new_params)  # call after any topology mutation
```

---

## Core utilities

### Normalizers

Online Welford mean/variance. `ObsNormalizer` normalises observations. `RewardNormalizer` divides rewards by return standard deviation (not reward std), preserving sign.

```python
from tensor_optix.core.normalizers import ObsNormalizer, RewardNormalizer

obs_norm = ObsNormalizer(shape=(obs_dim,))
obs_norm.update(obs_batch)
normalized = obs_norm.normalize(obs)

rew_norm = RewardNormalizer()
```
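
The return-based reward scaling described above can be sketched generically as below (a standard Welford-style implementation of the trick, not tensor-optix internals; the discount value is an assumption).

```python
import numpy as np

class ReturnStdRewardScaler:
    """Track a running discounted return and divide each reward by the return's
    standard deviation. The reward's sign is preserved."""

    def __init__(self, gamma: float = 0.99, eps: float = 1e-8):
        self.gamma, self.eps = gamma, eps
        self.ret = 0.0                                  # running discounted return
        self.count, self.mean, self.m2 = 0, 0.0, 0.0    # Welford accumulators

    def __call__(self, reward: float) -> float:
        self.ret = self.gamma * self.ret + reward
        # Welford update on the running return
        self.count += 1
        delta = self.ret - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (self.ret - self.mean)
        var = self.m2 / self.count if self.count > 1 else 1.0
        return reward / (np.sqrt(var) + self.eps)
```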

### Hindsight Experience Replay

Wraps `PrioritizedReplayBuffer` with episode-level goal relabeling. Supports `future` (default), `final`, and `episode` relabeling strategies.

```python
from tensor_optix.core.her_buffer import HERBuffer

her = HERBuffer(obs_dim=obs_dim, act_dim=act_dim, goal_dim=goal_dim, strategy="future", k=4)
her.store_episode(obs_list, act_list, rew_list, next_obs_list, done_list,
                  achieved_goals, compute_reward_fn)
obs_b, act_b, rew_b, next_b, done_b, weights, idx, n = her.sample(batch_size)
```

### Checkpoint registry

```python
from tensor_optix.core.checkpoint_registry import CheckpointRegistry

registry = CheckpointRegistry(checkpoint_dir="./checkpoints", max_snapshots=10)
registry.save(agent, eval_metrics, hyperparams)
registry.load_best(agent)
registry.load_ensemble(agent, top_k=3)   # stochastic weight averaging over top-k snapshots
```

### Regime detection

Classifies score history into one of three regimes using detrended coefficient of variation. Detrended CV measures noise around the trend, not raw score variance.

```python
from tensor_optix.core import RegimeDetector

detector = RegimeDetector()
regime = detector.detect(metrics_history)   # "trending" | "ranging" | "volatile"
```
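
Detrended CV itself is simple to compute; a generic sketch (window length and regime cutoffs are assumptions):

```python
import numpy as np

def detrended_cv(scores: np.ndarray) -> float:
    """Coefficient of variation of the residuals around a fitted linear trend."""
    t = np.arange(len(scores), dtype=float)
    slope, intercept = np.polyfit(t, scores, 1)
    resid = scores - (slope * t + intercept)
    return float(np.std(resid) / (abs(np.mean(scores)) + 1e-8))

# One plausible mapping (cutoffs illustrative):
#   low detrended CV, significant positive slope -> "trending"
#   low detrended CV, flat slope                 -> "ranging"
#   high detrended CV                            -> "volatile"
```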

---

## Requirements

- Python >= 3.11
- gymnasium >= 1.0
- numpy >= 1.24

The core loop, `PolicyManager`, and all ensemble and evolution logic have no framework dependency. Framework installs are opt-in via extras.
