Metadata-Version: 2.4
Name: labenv-embedding-cache
Version: 0.2.1
Summary: Shared embedding cache core for cross-project reuse
Author: labenv
License-Expression: LicenseRef-Proprietary
Requires-Python: >=3.10
Description-Content-Type: text/markdown
Requires-Dist: numpy>=1.24
Requires-Dist: omegaconf>=2.3
Provides-Extra: embed
Requires-Dist: torch>=2.1; extra == "embed"
Requires-Dist: accelerate>=0.33; extra == "embed"
Requires-Dist: transformers>=4.40; extra == "embed"
Requires-Dist: sentence-transformers>=2.6; extra == "embed"
Provides-Extra: debug
Requires-Dist: debugpy>=1.8; extra == "debug"

# labenv_embedding_cache

Shared policy and Python cache-core library for projects that reuse text embedding caches.

## Files
- `embedding_rulebook.yaml`: machine-readable cache policy (path/layout/consistency)
- `embedding_registry.yaml`: machine-readable model registry shared across projects
- `embedding_cache_spec.yaml`: machine-readable spec for dataset key / metadata / variant tag
- `src/labenv_embedding_cache/`: reusable Python library
- `CODEX_PROMPT.md`: reusable prompt template for Codex sessions

## Install library
The preferred distribution format is the wheel (`v0.2.1`).

```bash
# Internal index (recommended)
pip install "labenv-embedding-cache==0.2.1"

# Optional: explicitly point pip to internal index
# PIP_EXTRA_INDEX_URL="https://<internal-index>/simple" pip install "labenv-embedding-cache==0.2.1"

# GitHub Release wheel fallback
pip install "labenv-embedding-cache @ https://github.com/ryuuua/labenv_embedding_cache/releases/download/v0.2.1/labenv_embedding_cache-0.2.1-py3-none-any.whl"

# Local editable (maintainer workflow)
pip install -e /path/to/labenv_embedding_cache
```

Rollback-only legacy install (VCS pin):

```bash
pip install "git+ssh://git@github.com/ryuuua/labenv_embedding_cache.git@c0154c06ee6e41852c58ac76d6504f5b38d20168#egg=labenv-embedding-cache"
```

## 0.2.1 compatibility note

`0.2.1` keeps the `0.2.0` cache compatibility boundary and adds architecture
governance checks, explicit downstream compatibility tests, and the
`resolution.py` / `verification.py` module split. Existing imports from
`labenv_embedding_cache.resolver` remain available as a compatibility shim.

## 0.2.0 compatibility note

`0.2.0` is an intentional compatibility boundary for legacy fallback.

- Canonical cache files remain supported.
- Legacy tuple caches (`arr_0`) remain unsupported.
- Legacy index fallback now expects canonical cache metadata on the candidate cache file.
- In practice this means fallback candidates must carry `metadata_json` with `variant_tag` and the usual identity metadata.
- Header-only legacy caches that only store `ids` / `embeddings` / `shuffle_seed` plus sparse top-level fields are no longer eligible fallback targets.

If a project still depends on header-only legacy caches, stay on `0.1.x` or regenerate those caches into canonical format before upgrading.
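
To decide which caches need regeneration before upgrading, the three layouts named above can be told apart by their top-level keys. The following is a minimal sketch, not the library's own check: the function name and the returned labels are hypothetical, and it assumes `metadata_json` stores a single JSON object as a string.

```python
import json
from typing import Any, Mapping


def classify_cache_layout(arrays: Mapping[str, Any]) -> str:
    """Classify an opened cache archive by its top-level keys.

    `arrays` is any mapping of array names to values, for example the
    object returned by `numpy.load(path, allow_pickle=False)`.
    The returned labels are illustrative, not library constants.
    """
    if "arr_0" in arrays:
        # Positional tuple saved without names: unsupported.
        return "legacy_tuple"
    if "metadata_json" in arrays:
        try:
            # Assumption: metadata_json holds one JSON object as a string.
            metadata = json.loads(str(arrays["metadata_json"]))
        except json.JSONDecodeError:
            return "invalid_metadata"
        if isinstance(metadata, dict) and metadata.get("variant_tag"):
            return "canonical"
        return "missing_variant_tag"
    if {"ids", "embeddings"} <= set(arrays):
        # ids / embeddings (and possibly shuffle_seed) without metadata_json.
        return "header_only_legacy"
    return "unknown"
```

Under these assumptions, only files classified as `canonical` would remain eligible as fallback targets on `0.2.x`.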

## Release publishing (PyPI / TestPyPI / GitHub Packages)

Tag push (`v*`) or manual dispatch triggers:
- `.github/workflows/release-wheel.yml` (GitHub Release + optional internal index)
- `.github/workflows/publish-package-indexes.yml` (TestPyPI / PyPI / GitHub Packages)

Before tagging a release:

```bash
python tools/check_tag_version.py v0.2.1
PYTHONPATH=src python -m pytest -q
python -m build
python -m twine check dist/labenv_embedding_cache-0.2.1*
```
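
The first command guards against tagging a release whose tag and package version disagree. A rough sketch of that kind of check follows; it is an illustration only, the helper name is hypothetical, and `tools/check_tag_version.py` may accept other tag shapes or read the version from a different place.

```python
import re


def tag_matches_version(tag: str, version: str) -> bool:
    """Return True when `tag` is exactly "v" + `version` (e.g. v0.2.1)."""
    # Assumption: release tags are plain vMAJOR.MINOR.PATCH.
    match = re.fullmatch(r"v(\d+\.\d+\.\d+)", tag)
    return match is not None and match.group(1) == version
```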

Secrets for `publish-package-indexes.yml`:
- `TEST_PYPI_API_TOKEN` (optional; if unset, TestPyPI publish is skipped)
- `PYPI_API_TOKEN` (optional; if unset, PyPI publish is skipped)
- `GITHUB_PACKAGES_TOKEN` (optional; if unset, GitHub Packages publish is skipped)
- `GITHUB_PACKAGES_USERNAME` (optional; defaults to `${{ github.actor }}`)

## Standalone run (smoke tests)

Install with optional embedding/debug extras:

```bash
pip install -e ".[embed,debug]"
```

Embedding generation only (no cache):

```bash
python tools/embedding_smoketest.py
```

Embedding + cache read/write:

```bash
python tools/embedding_cache_smoketest.py --backend dummy
# or (downloads model)
python tools/embedding_cache_smoketest.py --backend sentence-transformers --model sentence-transformers/all-MiniLM-L6-v2
```

Debugpy (wait for attach):

```bash
python tools/embedding_cache_smoketest.py --backend dummy --debugpy --wait-for-client
```

Embedding generation from `conf/embedding` presets (auto DDP/pipeline policy via `embedding_model.md`):

```bash
python tools/generate_embeddings_from_conf.py --preset conf/embedding/qwen3_embedding.yaml --strategy auto
```

By default, embeddings are not normalized. Pass `--normalize` to generate L2-normalized caches.
Setting `normalize_embeddings=true` selects a separate model variant (`registry_key=...__l2`) and a separate cache variant (`norm=l2`).
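
For readers unfamiliar with the term, L2 normalization rescales each embedding vector to unit Euclidean length, which changes the stored values and is why the normalized output is tracked as its own variant. A minimal pure-Python sketch (the helper name is hypothetical; real pipelines would typically do this on NumPy or torch arrays):

```python
import math
from typing import List, Sequence


def l2_normalize(vector: Sequence[float]) -> List[float]:
    """Scale `vector` to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        # A zero vector has no direction; return it unchanged.
        return list(vector)
    return [x / norm for x in vector]
```

For example, `l2_normalize([3.0, 4.0])` has length 1, while the original vector has length 5.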

Best practices: `docs/EMBEDDING_BEST_PRACTICES.md`

## Policy path setup
No environment variable is required for normal use. The package resolves its
bundled `embedding_rulebook.yaml` automatically.

If you want to pin `EMBEDDING_RULEBOOK_PATH` explicitly in your shell profile:

```bash
export EMBEDDING_RULEBOOK_PATH="$(labenv-embedding-cache-path rulebook)"
```

Then reload your shell (zsh shown; use the profile file that matches your shell):

```bash
source ~/.zshrc
```
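
The lookup order implied above can be sketched as follows. This is an illustration under an assumption, namely that a non-empty `EMBEDDING_RULEBOOK_PATH` takes precedence over the bundled file; the function name and the `bundled_default` parameter are hypothetical and not part of the package API.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_rulebook_path(
    bundled_default: Path,
    env: Optional[Mapping[str, str]] = None,
) -> Path:
    """Pick the rulebook path: explicit env override first, bundled file otherwise."""
    env = os.environ if env is None else env
    # Assumption: an empty or whitespace-only value counts as "not set".
    override = env.get("EMBEDDING_RULEBOOK_PATH", "").strip()
    if override:
        return Path(override).expanduser()
    return bundled_default
```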

## Cache verification and lock export (canonical-only)

Build a read-only index for existing `lm/**` cache files:

```bash
labenv-embedding-cache index-build --cache-dir /work/$USER/data/embedding_cache
```

Show index stats:

```bash
labenv-embedding-cache index-stats --cache-dir /work/$USER/data/embedding_cache
```

Verify whether request manifests can be served without regeneration:

```bash
labenv-embedding-cache verify-requests --requests /path/to/request_manifest.jsonl --min-selected-models 2
```

Build request manifest rows from canonical metadata in Python:

```python
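import os

import labenv_embedding_cache as lec

# `metadata` is the canonical metadata dict for the target cache (defined elsewhere).
record = lec.build_request_manifest_entry(
    dataset_name="ag_news",
    model_id="bert",
    model_name="bert-base-uncased",
    # Python does not expand "$USER" inside string literals; expand it explicitly.
    expected_cache_path=os.path.expandvars(
        "/work/$USER/data/embedding_cache/lm/bert-base-uncased/ag_news__x.npz"
    ),
    metadata=metadata,
)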
```
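
Before running `verify-requests`, a quick existence pre-check over a manifest can catch obviously missing files. The sketch below is not a substitute for the CLI verification. It assumes each JSONL row carries an `expected_cache_path` key mirroring the keyword argument above, which the source does not confirm; the function name is hypothetical.

```python
import json
import os
from typing import Callable, Iterable, List


def missing_cache_paths(
    manifest_lines: Iterable[str],
    exists: Callable[[str], bool] = os.path.isfile,
) -> List[str]:
    """Return expected cache paths from JSONL rows that are not on disk."""
    missing = []
    for line in manifest_lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        row = json.loads(line)
        # Assumption: the row key is named `expected_cache_path`.
        path = row["expected_cache_path"]
        if not exists(path):
            missing.append(path)
    return missing
```

A file handle can be passed directly, for example `missing_cache_paths(open(manifest_path, encoding="utf-8"))`.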

Export a lock payload for CI/DVC (`policy digest + index + optional verify report`):

```bash
labenv-embedding-cache lock-export \
  --cache-dir /work/$USER/data/embedding_cache \
  --requests /path/to/request_manifest.jsonl \
  --output /work/$USER/data/embedding_cache/.labenv/lock.json \
  --min-selected-models 2
```

Index file location:
- `${EMBEDDING_CACHE_DIR}/.labenv/index_v1.jsonl`

Legacy fallback policy is controlled by rulebook:
- `compatibility.legacy_index.enabled`
- `compatibility.legacy_index.sunset_date`
- `compatibility.legacy_index.require_ids_sha256_match`

Current default is strict canonical mode:
- `compatibility.legacy_index.enabled: false`
- `cache.compatibility.accept_legacy_npz_tuple: false`
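
As an illustration of how the rulebook keys above might combine, the sketch below reads them from an already-parsed rulebook (a nested mapping). The semantics are assumptions: fallback is treated as active only when `enabled` is true and the date is on or before `sunset_date`, with the sunset given as an ISO date. Both helper names are hypothetical.

```python
from collections.abc import Mapping
from datetime import date
from typing import Any


def select(config: Any, dotted_key: str, default: Any = None) -> Any:
    """Look up a dotted key such as "a.b.c" in nested mappings."""
    node = config
    for part in dotted_key.split("."):
        if not isinstance(node, Mapping) or part not in node:
            return default
        node = node[part]
    return node


def legacy_fallback_active(rulebook: Mapping, today: date) -> bool:
    """Assumed rule: enabled flag set and sunset date not yet passed."""
    if not select(rulebook, "compatibility.legacy_index.enabled", False):
        return False
    sunset = select(rulebook, "compatibility.legacy_index.sunset_date")
    if sunset is None:
        return True  # assumption: no sunset means no expiry
    if isinstance(sunset, str):
        sunset = date.fromisoformat(sunset)
    return today <= sunset
```

With the strict canonical default (`enabled: false`), this returns `False` regardless of the date.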

Identity expansion is controlled by spec:
- `identity.profiles.default.hard_fields`
- `identity.profiles.default.soft_fields`
- `identity.profiles.default.defaults`
- `identity.profiles.default.legacy_match_policy`

## Quick usage

```python
from labenv_embedding_cache.api import get_or_compute_embeddings

vectors, resolution = get_or_compute_embeddings(
    cfg,
    texts,
    ids,
    labels=labels,
    compute_embeddings=my_backend_compute_fn,
)
print(resolution.cache_path, resolution.was_cache_hit)
```
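
For local experiments, `my_backend_compute_fn` can be a deterministic stand-in that needs no model download. The sketch below is hypothetical: it assumes the callback receives a sequence of texts and returns one vector per text, which should be checked against `docs/architecture/public-api.md` because the actual callback signature and return type (for example a NumPy array) are not specified here.

```python
import hashlib
from typing import List, Sequence


def dummy_compute_embeddings(texts: Sequence[str], dim: int = 8) -> List[List[float]]:
    """Map each text to a fixed pseudo-embedding derived from its SHA-256 hash."""
    if not 1 <= dim <= 32:
        raise ValueError("dim must be between 1 and 32 (SHA-256 yields 32 bytes)")
    vectors = []
    for text in texts:
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        # Scale the first `dim` bytes into floats in [0, 1].
        vectors.append([byte / 255.0 for byte in digest[:dim]])
    return vectors
```

The same text always yields the same vector, which makes cache hits and misses easy to reason about in tests.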

## AppConfig compatibility

- `labenv_embedding_cache` accepts both Hydra `DictConfig` and typed dataclass configs (for example `CEBRA-NLP-gen2` / `larm` `AppConfig`).
- To keep cache sharing stable during the migration to typed configs, dataset fingerprinting treats these legacy-equivalent cases as identical:
  - missing vs empty container: `dataset.splits`, `dataset.label_columns`, `dataset.label_map`, `dataset.label_remap`
  - missing vs `false`: `dataset.drop_multi_label_samples`, `dataset.multi_label`, `dataset.trust_remote_code`
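
The equivalence rules above can be sketched as a normalization step applied before hashing. This is an illustration only, not the library's fingerprint: the function names, the JSON serialization, and the use of SHA-256 are assumptions, and only the field names come from the list above.

```python
import hashlib
import json
from typing import Any, Dict, Mapping

# Fields where "missing" and "empty container" mean the same thing.
EMPTY_EQUALS_MISSING = ("splits", "label_columns", "label_map", "label_remap")
# Fields where "missing" and `False` mean the same thing.
FALSE_EQUALS_MISSING = ("drop_multi_label_samples", "multi_label", "trust_remote_code")


def normalize_dataset_config(dataset: Mapping[str, Any]) -> Dict[str, Any]:
    """Drop legacy-equivalent values so both spellings hash identically."""
    normalized = dict(dataset)
    for key in EMPTY_EQUALS_MISSING:
        value = normalized.get(key)
        if isinstance(value, (list, tuple, dict)) and len(value) == 0:
            del normalized[key]
    for key in FALSE_EQUALS_MISSING:
        if normalized.get(key) is False:
            del normalized[key]
    return normalized


def dataset_fingerprint(dataset: Mapping[str, Any]) -> str:
    """Hash the normalized config with a stable key order."""
    payload = json.dumps(
        normalize_dataset_config(dataset), sort_keys=True, separators=(",", ":")
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Under this scheme, a Hydra config that omits `dataset.splits` and a typed config that sets it to `[]` produce the same fingerprint, while a config with `multi_label: true` does not.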

## Downstream compatibility

CEBRA / Arc / Larm consumers should use the package facade:

```python
import labenv_embedding_cache as lec
```

The supported public surface and read-only manifest/cache contract are documented in
`docs/DOWNSTREAM_COMPATIBILITY.md`.

## Docker / Docker Compose (env1-env4)

Compose files match the `labenv_config/envkit-templates` profiles (env1_a6000/env2_3090/env3_cc21_a100/env4_cc21_cpu):
- env1: `nvcr.io/nvidia/pytorch:25.09-py3` (CUDA 13.0 profile; CUDA 12.8 runtime is frozen legacy)
- env2: `nvcr.io/nvidia/pytorch:25.09-py3`
- env3: `nvcr.io/nvidia/pytorch:23.10-py3`
- env4: `python:3.11-slim`

Cache roots are unified by compose profile:
- `env1` / `env2`: embedding cache = `/home/ryua/data/embedding_cache` (`EMBEDDING_CACHE_DIR` / `TEXT_EMBEDDING_CACHE_DIR`)
- `env3` / `env4`: embedding cache = `/work/ryunosuke-ab/data/embedding_cache` (`EMBEDDING_CACHE_DIR` / `TEXT_EMBEDDING_CACHE_DIR`)
- `env1` / `env2`: HF cache reuse = `/data/cache` (`HF_HOME`, `TRANSFORMERS_CACHE`)

Host paths are bind-mounted so caches are reused across repos by default.

Examples:

```bash
docker compose -f docker/compose.env1.yaml run --rm app python tools/embedding_cache_smoketest.py --backend dummy
docker compose -f docker/compose.env4.yaml run --rm app python tools/embedding_smoketest.py --backend transformers
# debugpy (port publish requires --service-ports)
docker compose -f docker/compose.env1.yaml run --rm --service-ports app python tools/embedding_cache_smoketest.py --backend dummy --debugpy --wait-for-client
```

## Sweep utility

Create a stable, line-based sweep list file:

```bash
python tools/make_sweep_list.py --glob "configs/sweep/*.yaml" --out sweep.txt --root .
```
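
One way to read "stable" here is that the same set of config files always yields the same lines in the same order. The sketch below shows that idea under stated assumptions (sorted, de-duplicated, root-relative POSIX paths); it is not the implementation of `tools/make_sweep_list.py`, and the function name is hypothetical.

```python
from pathlib import Path
from typing import Iterable, List, Union


def stable_sweep_lines(
    paths: Iterable[Union[str, Path]],
    root: Union[str, Path],
) -> List[str]:
    """Return sorted, de-duplicated paths relative to `root`, one per line."""
    root_path = Path(root)
    # relative_to raises ValueError for paths outside `root`.
    relative = {Path(p).relative_to(root_path).as_posix() for p in paths}
    return sorted(relative)
```

The input could come from a glob, for example `stable_sweep_lines(Path(".").glob("configs/sweep/*.yaml"), ".")`, and the result can be written out with `"\n".join(...)`.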

## Per-project Codex usage
In each project, paste `CODEX_PROMPT.md` (or add equivalent guidance in `AGENTS.md`) so Codex always aligns with this policy.
For cross-project rollout requests, use `PROJECT_MIGRATION_PROMPT.md`.

## Automation runbook
- Runtime-aware automation prompt: `docs/AUTOMATION_PROMPT_RUNTIME_AWARE.md`
- Copy/paste execution checklist: `docs/AUTOMATION_EXECUTION_CHECKLIST.md`
- Pre-filled ready-to-run checklist: `docs/AUTOMATION_EXECUTION_CHECKLIST_READY.md`
- What to place in central/downstream/runtime repos: `docs/REPO_AUTOMATION_ASSETS.md`

## Architecture governance
- Current structure and dependency DAG: `docs/architecture/current-structure.md`
- Machine-readable module map: `docs/architecture/module-map.yaml`
- ML/cache contracts: `docs/ML_CONTRACTS.md`
- Public/internal API boundary: `docs/architecture/public-api.md`

Local checks:

```bash
python tools/check_policy_bundle_sync.py
python tools/check_architecture_boundaries.py
python tools/architecture_inventory.py
python tools/dynamic_reference_inventory.py --text
```

## Updating shared standards
1. Edit `embedding_rulebook.yaml`, `embedding_registry.yaml`, and/or `embedding_cache_spec.yaml`.
2. In each project, run its embedding-cache validation/tests.
3. If schema behavior changes, bump `identity.version` in `embedding_rulebook.yaml`.
