Metadata-Version: 2.4
Name: ts-data-generator
Version: 0.4.0
Summary: A Python library for generating synthetic time series data
Project-URL: Repository, https://github.com/manojmanivannan/ts-data-generator.git
Project-URL: Issues, https://github.com/manojmanivannan/ts-data-generator/issues
Author-email: Manoj Manivannan <manojm18@live.in>
License: MIT
License-File: LICENSE
Keywords: data engineering,data generator,python,synthetic data,time series
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: Topic :: Scientific/Engineering :: Information Analysis
Requires-Python: >=3.10
Requires-Dist: click>=8.0
Requires-Dist: matplotlib>=3.5
Requires-Dist: pandas>=1.5
Requires-Dist: pydantic>=2.0
Requires-Dist: python-dotenv>=1.0
Provides-Extra: all
Requires-Dist: holidays>=0.96; extra == 'all'
Requires-Dist: scipy>=1.7; extra == 'all'
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == 'dev'
Requires-Dist: ruff>=0.3; extra == 'dev'
Provides-Extra: holidays
Requires-Dist: holidays>=0.96; extra == 'holidays'
Provides-Extra: imputer
Requires-Dist: scipy>=1.7; extra == 'imputer'
Provides-Extra: test
Requires-Dist: pytest>=7.0; extra == 'test'
Requires-Dist: ruff>=0.3; extra == 'test'
Description-Content-Type: text/markdown

<div align="center">

# Synthetic Time Series Data Generator

[![CI](https://github.com/manojmanivannan/ts-data-generator/actions/workflows/ci.yaml/badge.svg)](https://github.com/manojmanivannan/ts-data-generator/actions/workflows/ci.yaml)
[![Python](https://img.shields.io/badge/python-3.10%2B-blue)](https://www.python.org/downloads/)
[![License](https://img.shields.io/badge/license-MIT-green)](./LICENSE)

Generate realistic synthetic time series datasets with configurable dimensions,
metrics, composable trend functions, and injectable anomalies — via a Python API
or the `tsdata` CLI.

<img src="https://github.com/manojmanivannan/ts-data-generator/raw/main/notebooks/image.png" alt="sample plot" width="800"/>

</div>

---

## Quickstart

### CLI (no install)

```bash
uvx --python 3.11 --from ts-data-generator tsdata generate \
    --preset daily-sales --output sales.csv
```

With anomalies and a fixed seed for reproducibility:

```bash
uvx --python 3.11 --from ts-data-generator tsdata generate \
    --start 2024-01-01 --end 2024-01-07 --granularity h \
    --dims "region:US,EU,AP" \
    --mets "temperature:SinusoidalTrend(amplitude=10,freq=24)" \
    --anomalies "temperature:PointAnomaly(probability=0.01,magnitude=5)" \
    --seed 42 --output weather.csv
```

### Python API

```python
from ts_data_generator import DataGen
from ts_data_generator.utils.trends import SinusoidalTrend
from ts_data_generator.utils.functions import random_choice
from ts_data_generator.anomalies import PointAnomaly, MissingData

dg = DataGen(seed=42)
dg.start_datetime = "2024-01-01"
dg.end_datetime = "2024-01-07"
dg.to_granularity("h")

dg.add_dimension("region", random_choice(["US", "EU", "AP"]))
dg.add_metric(
    "temperature",
    {SinusoidalTrend(amplitude=10, freq=24)},
    anomalies=[PointAnomaly(probability=0.01, magnitude=5)],
)

print(dg.data.head())
dg.data.to_csv("weather.csv", index_label="datetime")
```

---

## Installation

```bash
pip install ts-data-generator
```

With optional extras:

```bash
# Schema imputing (requires scipy)
pip install "ts-data-generator[imputer]"

# Holiday trend support (requires holidays)
pip install "ts-data-generator[holidays]"

# All optional features
pip install "ts-data-generator[all]"
```

For local development:

```bash
git clone https://github.com/manojmanivannan/ts-data-generator.git
cd ts-data-generator
uv sync --extra dev
```

---

## Core Concepts

### Dimensions
Categorical or continuous columns generated by an infinite generator function.

| Function | Description | CLI shorthand |
|---|---|---|
| `random_choice` | Random element from a collection | `name:random_choice:A,B,C` |
| `random_int` | Random integer in `[start, end]` | `name:random_int:1,100` |
| `random_float` | Random float in `[start, end)` | `name:random_float:0.0,1.0` |
| `constant` | Fixed value or cycle | `name:constant:10` |
| `ordered_choice` | Sequential cycle | `name:ordered_choice:A,B,C` |
| `auto_generate_name` | Auto-generated column name | `name:auto_generate_name:cat` |

Shorthand: `name:values` defaults to `random_choice`. Example: `--dims "product:A,B,C"`.
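
To make the spec grammar concrete, here is a minimal sketch of how a `--dims` string could be split into a column name, a function name, and arguments. `DimSpec` and `parse_dim_spec` are hypothetical names written for this illustration and are not part of the library; the actual CLI parser may handle typing, quoting, and errors differently.

```python
from typing import NamedTuple


class DimSpec(NamedTuple):
    name: str
    function: str
    args: list[str]


def parse_dim_spec(spec: str) -> DimSpec:
    """Split 'name:function:args' or the shorthand 'name:values' (hypothetical helper)."""
    parts = spec.split(":", 2)
    if len(parts) == 3:
        name, function, raw_args = parts
    elif len(parts) == 2:
        # Shorthand form: no function given, so fall back to random_choice.
        name, raw_args = parts
        function = "random_choice"
    else:
        raise ValueError(f"expected 'name:values' or 'name:function:args', got {spec!r}")
    return DimSpec(name, function, raw_args.split(","))


print(parse_dim_spec("product:A,B,C"))
print(parse_dim_spec("user_id:random_int:1,100"))
```

Arguments are left as strings here; converting `1,100` to integers would be the job of whichever function receives them.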

### Metrics
Numeric columns built by additively composing one or more **trends**.

| Trend | Description | Key parameters |
|---|---|---|
| `SinusoidalTrend` | Sine wave with optional noise | `amplitude`, `freq`, `phase`, `noise_level` |
| `LinearTrend` | Linear ramp with optional noise | `limit`, `offset`, `noise_level` |
| `WeekendTrend` | Spikes on Saturday/Sunday | `weekend_effect`, `direction`, `limit` |
| `HolidayTrend` | Ramp around holidays | `country`, `effect`, `pre_window`, `post_window` |
| `ARNoiseTrend` | Autoregressive AR(p) noise | `coefficients` or `decay`+`order`, `noise_std` |
| `MarkovTrend` | Discrete-state Markov chain | `states`, `values`, `stickiness` or `transition_matrix` |
| `StockTrend` | Random walk + multi-scale sine | `amplitude`, `direction`, `noise_level` |

Trends combine with `+`: `metric_name:Trend1(...)+Trend2(...)`.
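
Additive composition means the metric value at each step is the sum of the values produced by each trend. The sketch below illustrates the idea with two stand-in functions. `sine_component` and `linear_component` are hypothetical, and the way they interpret `amplitude`, `freq`, and `limit` is an assumption rather than the library's definition.

```python
import math


def sine_component(n: int, amplitude: float, freq: int) -> list[float]:
    """Sine wave that repeats every `freq` samples (assumed meaning of freq)."""
    return [amplitude * math.sin(2 * math.pi * i / freq) for i in range(n)]


def linear_component(n: int, limit: float) -> list[float]:
    """Ramp from 0 up to `limit` across the series (assumed meaning of limit)."""
    return [limit * i / (n - 1) for i in range(n)]


n = 48  # e.g. two days of hourly samples
components = [
    sine_component(n, amplitude=10, freq=24),
    linear_component(n, limit=100),
]

# Additive composition: sum the components element-wise.
metric = [sum(values) for values in zip(*components)]

print(metric[0], metric[-1])
```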

### Anomalies
Inject realistic irregularities into metric values. Anomalies are applied per metric after trend composition and run in a fixed order, from `PointAnomaly` through to `MissingData`, which always runs last so that NaN values are never overwritten.

| Anomaly | Description | Key parameters |
|---|---|---|
| `PointAnomaly` | Isolated value spikes | `probability`, `mode` (`additive`/`replacement`), `magnitude` |
| `MissingData` | NaN gaps | `mode` (`random`/`burst`/`patterned`), `probability`, `min_length`, `max_length` |
| `ConceptDrift` | Gradual regime shifts | `segments` (list of `DriftSegment`) |

**PointAnomaly** supports two modes:
- `additive`: adds the magnitude to the trend value at anomalous timestamps.
- `replacement`: replaces the trend value with the magnitude.

Magnitude can be a fixed scalar or a `(min, max)` tuple for uniform sampling.
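
The difference between the two modes, and between scalar and tuple magnitudes, can be sketched in a few lines of standard-library Python. `inject_point_anomalies` is a hypothetical function for illustration only; it is not the library's `PointAnomaly` implementation, which may sample and apply spikes differently.

```python
import random


def inject_point_anomalies(values, probability, magnitude, mode="additive", seed=None):
    """Return a copy of `values` with spikes injected (illustrative sketch)."""
    rng = random.Random(seed)
    out = list(values)
    for i, value in enumerate(out):
        if rng.random() >= probability:
            continue  # this timestamp stays normal
        # A (min, max) tuple means "sample uniformly"; a scalar is used as-is.
        if isinstance(magnitude, tuple):
            m = rng.uniform(*magnitude)
        else:
            m = magnitude
        out[i] = value + m if mode == "additive" else m
    return out


baseline = [20.0] * 1000
added = inject_point_anomalies(baseline, 0.01, 5, mode="additive", seed=42)
replaced = inject_point_anomalies(baseline, 0.01, (90, 100), mode="replacement", seed=42)

print(sorted(set(added)))    # only 20.0 and 25.0 can appear
print(max(replaced) <= 100)  # replacement values stay within the tuple range
```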

**MissingData** supports three modes:
- `random` — each timestamp independently becomes NaN with the given probability.
- `burst` — consecutive blocks of NaN of configurable length, non-overlapping.
- `patterned` — NaN wherever a schedule callable `(pd.Timestamp) -> bool` returns True (e.g. every Sunday). Patterned mode composes with random/burst via separate MissingData instances in the anomalies list.
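
As a rough illustration of how a schedule callable and a random pass can be composed, the sketch below uses plain `datetime` objects, which expose the same `weekday()` method as `pd.Timestamp`. The helper names `apply_random` and `apply_patterned` are hypothetical and only mirror the behaviour described above; burst mode is left out for brevity.

```python
import math
import random
from datetime import datetime, timedelta


def apply_random(values, probability, rng):
    """Random mode: each position independently becomes NaN."""
    return [math.nan if rng.random() < probability else v for v in values]


def apply_patterned(timestamps, values, schedule):
    """Patterned mode: NaN wherever schedule(timestamp) is True."""
    return [math.nan if schedule(ts) else v for ts, v in zip(timestamps, values)]


start = datetime(2024, 1, 1)  # a Monday
timestamps = [start + timedelta(days=i) for i in range(14)]
values = [float(i) for i in range(14)]

# weekday() == 6 is Sunday.
no_sundays = apply_patterned(timestamps, values, lambda ts: ts.weekday() == 6)

# Composing modes: run a second, independent pass over the first result.
rng = random.Random(42)
combined = apply_random(no_sundays, probability=0.1, rng=rng)

print([i for i, v in enumerate(no_sundays) if math.isnan(v)])  # [6, 13]
```

Because the second pass only ever turns values into NaN, gaps created by the schedule survive the random pass.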

**ConceptDrift** applies gradual distribution-level shifts using `DriftSegment`:
```python
from ts_data_generator.anomalies import ConceptDrift, DriftSegment

ConceptDrift(segments=[
    DriftSegment(start_timestamp="2024-01-15T06:00:00",
                 transition_window=1800, target_mean=50, target_std=5,
                 hold_duration=7200, restore=True),
])
```
Each segment alpha-blends from baseline into `N(target_mean, target_std)` over `transition_window` seconds, holds for `hold_duration` seconds, and optionally transitions back.
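
The blend weight over the lifetime of one segment can be sketched as a piecewise function of elapsed time. This is an illustration under stated assumptions, not the library's code: `drift_alpha` and `blend` are hypothetical names, the ramps are assumed to be linear, the ramp back is assumed to reuse `transition_window`, and the target is represented by its mean rather than a sample from `N(target_mean, target_std)`.

```python
def drift_alpha(t, transition_window, hold_duration, restore=False):
    """Blend weight at `t` seconds after the segment start.

    Assumes linear ramps and transition_window > 0.
    """
    if t < 0:
        return 0.0                      # before the segment: pure baseline
    if t < transition_window:
        return t / transition_window    # ramping toward the target
    if not restore or t < transition_window + hold_duration:
        return 1.0                      # holding at the target
    t_back = t - transition_window - hold_duration
    return max(0.0, 1.0 - t_back / transition_window)  # ramping back to baseline


def blend(baseline_value, target_value, alpha):
    """Alpha-blend: 0 keeps the baseline, 1 is fully the target."""
    return (1.0 - alpha) * baseline_value + alpha * target_value


# Parameters mirror the DriftSegment example: 1800 s transition, 7200 s hold.
for t in (0, 900, 1800, 9000, 9900, 10800):
    alpha = drift_alpha(t, transition_window=1800, hold_duration=7200, restore=True)
    print(t, alpha, blend(20.0, 50.0, alpha))
```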

Drift positions are specified by absolute `start_timestamp`. Multi-segment sequences are built by repeating `--anomalies` for the same metric in the CLI, or by passing a list of segments in the API.

Anomalies combine with `+` and are scoped to a metric:
```
metric_name:PointAnomaly(...)+MissingData(...)
```

### Deterministic generation
Pass `seed` to `DataGen(seed=42)` or `--seed 42` in the CLI for reproducible output. The seed initializes a PCG64-backed `SeedableRNG` that is threaded through trend generation and anomaly injection, replacing global `np.random` calls.
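
The benefit of threading a generator object through the pipeline, instead of relying on global random state, can be shown with the standard library alone. Note the swap: this sketch uses `random.Random` (Mersenne Twister) as a stand-in, not the PCG64-backed `SeedableRNG` that the library uses, and `noisy_series` is a hypothetical function.

```python
import random


def noisy_series(n, rng):
    """Draw from the generator passed in, never from module-level functions."""
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


run_a = noisy_series(5, random.Random(42))
run_b = noisy_series(5, random.Random(42))
run_c = noisy_series(5, random.Random(7))

print(run_a == run_b)  # True: same seed, same sequence
print(run_a == run_c)  # False: different seed
```

Because each run owns its generator, unrelated code that touches global random state cannot change the output.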

### Multi-Items
Linked columns generated from a single function — useful when columns have
dependencies (e.g. `col3 = col1 + col2`).

```python
import random


def linked_gen():
    while True:
        a, b = random.randint(1, 100), random.randint(1, 100)
        yield (a, b, a + b)

dg.add_multi_items(names=["val1", "val2", "val3"], function=linked_gen())
```

### Aggregation
Data can be resampled to a coarser granularity with per-metric aggregation
methods (`sum`, `mean`, `min`, `max`):

```python
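from ts_data_generator.schema.models import AggregationType
from ts_data_generator.utils.trends import LinearTrend

dg.add_metric("sales", {LinearTrend(limit=50)}, aggregation_type=AggregationType.SUM)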
hourly = dg.aggregate("h")  # from 5min -> hourly
```
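
Conceptually, aggregation groups rows into coarser time buckets and reduces each bucket with the chosen method. The standard-library sketch below shows that idea for a 5-minute to hourly roll-up; `aggregate_hourly` is a hypothetical helper, and how `dg.aggregate` is implemented internally is not assumed here. In pandas, the same idea is usually expressed with `DataFrame.resample`.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import mean

AGGREGATORS = {"sum": sum, "mean": mean, "min": min, "max": max}


def aggregate_hourly(rows, method):
    """Group (timestamp, value) rows by hour and reduce each group with `method`."""
    buckets = defaultdict(list)
    for ts, value in rows:
        buckets[ts.replace(minute=0, second=0, microsecond=0)].append(value)
    reducer = AGGREGATORS[method]
    return {hour: reducer(values) for hour, values in sorted(buckets.items())}


start = datetime(2024, 1, 1)
# Two hours of 5-minute samples, each with value 1.0.
rows = [(start + timedelta(minutes=5 * i), 1.0) for i in range(24)]

print(aggregate_hourly(rows, "sum"))   # 12.0 per hour
print(aggregate_hourly(rows, "mean"))  # 1.0 per hour
```

The choice of method matters: a count-like metric such as sales is naturally summed, while a level-like metric such as temperature is better averaged.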

---

## CLI Reference

```
tsdata [OPTIONS] COMMAND [ARGS]
```

### `generate` — create a CSV dataset

```bash
tsdata generate \
    --start "2024-01-01" \
    --end "2024-01-31" \
    --granularity "D" \
    --dims "product:A,B,C,D" \
    --dims "region:X,Y,Z" \
    --mets "sales:LinearTrend(limit=1000)+WeekendTrend(weekend_effect=100)" \
    --output "daily_sales.csv"
```

**Options:**

| Option | Description |
|---|---|
| `--start` | Start datetime (`YYYY-MM-DD`) |
| `--end` | End datetime (`YYYY-MM-DD`) |
| `--granularity` | `s`, `min`, `5min`, `h`, `D`, `W`, `ME`, `Y` |
| `--dims` | Dimension spec (repeatable) |
| `--mets` | Metric spec (repeatable) |
| `--anomalies` | Anomaly spec keyed by metric name (repeatable) |
| `--seed` | Integer seed for deterministic generation |
| `--output` | Output CSV path (must end in `.csv`) |
| `--preset` | Use a built-in preset |
| `--config` | Path to a JSON config file |

**Presets** — `daily-sales`, `hourly-metrics`, `minute-stock`, `weekly-revenue`,
`monthly-recurring`. List with `tsdata presets`.

### Anomaly examples

```bash
# Point anomalies
tsdata generate ... --anomalies "sales:PointAnomaly(probability=0.01,magnitude=5)"

# Missing data (random mode)
tsdata generate ... --anomalies "sales:MissingData(probability=0.05)"

# Missing data (burst mode)
tsdata generate ... --anomalies "sales:MissingData(mode=burst,burst_probability=0.02,min_length=3,max_length=10)"

# Missing data (patterned mode — NaN every Sunday)
tsdata generate ... --anomalies "sales:MissingData(mode=patterned,schedule=weekday==6)"

# Concept drift
tsdata generate ... --anomalies "sales:ConceptDrift(start_timestamp=2024-01-15T06:00:00,target_mean=50,target_std=5,hold_duration=7200)"

# Multiple anomaly types on one metric
tsdata generate ... --anomalies "sales:PointAnomaly(probability=0.01,magnitude=5)+MissingData(probability=0.05)"

# Multi-segment concept drift (repeat --anomalies for the same metric)
tsdata generate ... \
    --anomalies "sales:ConceptDrift(start_timestamp=2024-01-01T00:00:00,transition_window=1800,target_mean=50,hold_duration=7200)" \
    --anomalies "sales:ConceptDrift(start_timestamp=2024-01-02T00:00:00,transition_window=3600,target_mean=100,hold_duration=7200,restore=true)"

# Deterministic generation
tsdata generate ... --seed 42
```

### Other commands

```
tsdata dimensions    # List available dimension functions
tsdata metrics       # List available trend functions
tsdata presets       # List preset configurations
tsdata presets <name>  # Show details for a specific preset
```

### Environment variables

Any option can be set via environment variables prefixed with `TSDATA_`:

```bash
export TSDATA_START="2024-01-01"
export TSDATA_GRANULARITY="h"
tsdata generate --end "2024-01-02" --dims "id:A,B" --mets "val:LinearTrend(limit=10)" --output out.csv
```

### JSON config file

```json
{
  "start": "2024-01-01",
  "end": "2024-01-12",
  "granularity": "5min",
  "dimensions": ["product:A,B,C", "region:X,Y,Z"],
  "metrics": [
    "sales:LinearTrend(limit=500)+WeekendTrend(weekend_effect=50)",
    "orders:LinearTrend(limit=200)"
  ],
  "anomalies": [
    "sales:PointAnomaly(probability=0.01,magnitude=5)+MissingData(probability=0.05)"
  ],
  "seed": 42,
  "output": "data.csv"
}
```

CLI arguments override config file values.
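
One simple way to picture this precedence is a dictionary merge in which the config file supplies the base values and any option actually passed on the command line replaces them. `merge_settings` is a hypothetical helper for illustration; the real CLI may resolve values differently, and where `TSDATA_` environment variables fall in the ordering is not covered here.

```python
import json


def merge_settings(config_text, cli_args):
    """Config values are the base; CLI values that were actually given win."""
    settings = json.loads(config_text)
    settings.update({k: v for k, v in cli_args.items() if v is not None})
    return settings


config_text = '{"start": "2024-01-01", "granularity": "5min", "seed": 42, "output": "data.csv"}'
# None stands for "option not passed on the command line".
cli_args = {"granularity": "h", "seed": None, "output": "hourly.csv"}

print(merge_settings(config_text, cli_args))
```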

---

## Schema Imputing

Reverse-engineer trend parameters from existing CSV data (requires `pip install "ts-data-generator[imputer]"`):

```python
from ts_data_generator.schema.converter import SchemaConverter

converter = SchemaConverter("data.csv", index_col=0)
schema = converter.impute_schema()
trends = converter.analyze_numeric_trends(columns=["sales"], top_freq=2)
converter.construct_trend_column("sales", trends["sales"])
```

See the [imputer notebook](./notebooks/imputer.ipynb) for a full walkthrough.

---

## Example Notebooks

| Notebook | Description |
|---|---|
| [sample.ipynb](./notebooks/sample.ipynb) | End-to-end: dimensions, metrics, trends, multi-items, plotting |
| [aggregate.ipynb](./notebooks/aggregate.ipynb) | Aggregation with multi-items and custom aggregation types |
| [imputer.ipynb](./notebooks/imputer.ipynb) | Reverse-engineering schema and trends from existing CSV |

---

## Package Structure

```
ts_data_generator/
├── __init__.py            # Public API: DataGen
├── exceptions.py          # Custom exception hierarchy
├── _version.py            # Package version
├── data_gen.py            # DataGen engine (orchestrator)
├── cli.py                 # Click CLI (tsdata command)
├── random.py              # SeedableRNG wrapper (PCG64-backed)
├── anomalies/
│   ├── __init__.py        # Anomaly, PointAnomaly, MissingData, ConceptDrift, DriftSegment
│   ├── base.py            # Abstract Anomaly base class
│   ├── point.py           # PointAnomaly (isolated spikes)
│   ├── missing.py         # MissingData (NaN gaps: random, burst, patterned)
│   └── drift.py           # ConceptDrift + DriftSegment (regime shifts)
├── core/
│   └── dataframe_builder.py  # DataFrame generation logic
├── schema/
│   ├── models.py          # Granularity, AggregationType, Metrics, Dimensions
│   └── converter.py       # CSV schema analysis & trend imputing
├── utils/
│   ├── functions.py       # Dimension generator functions
│   └── trends.py          # Trend generators (Sine, Linear, Weekend, Holiday, ARNoise, Markov, Stock)
└── transforms/
    └── normalizer.py      # Min-max & standard normalization strategies
```

---

## License

MIT — see [LICENSE](./LICENSE).
