Metadata-Version: 2.4
Name: remote-embedding
Version: 0.3.1
Summary: A shared FastAPI embedding service and LangChain-compatible remote client for reusing one embedding model across multiple applications and lowering VRAM usage on limited GPUs.
Author: Meshkat Shariat Bagheri
License-Expression: MIT
Project-URL: Homepage, https://github.com/MeshkatShB/remote-embedding
Project-URL: Issues, https://github.com/MeshkatShB/remote-embedding/issues
Keywords: embeddings,fastapi,langchain,huggingface,api
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Framework :: FastAPI
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.10
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: fastapi>=0.115
Requires-Dist: langchain-core>=0.3
Requires-Dist: langchain-huggingface>=0.1.2
Requires-Dist: pydantic>=2.7
Requires-Dist: python-dotenv>=1.0
Requires-Dist: requests>=2.32
Requires-Dist: uvicorn>=0.30
Dynamic: license-file

# remote-embedding

`remote-embedding` packages two things together:

- A FastAPI server that exposes a `/embed` API backed by local Hugging Face models.
- A LangChain-compatible `RemoteEmbeddings` client that calls that server remotely.

This lets multiple applications share a single loaded embedding model instance instead of each process loading its own copy. On constrained GPUs, that reduces duplicated VRAM usage and makes it easier to serve embeddings from limited hardware.

## Install

```bash
pip install remote-embedding
```

## Package Layout

The import package is `remote_embedding`.

```python
from remote_embedding import RemoteEmbeddings
```

## Run The Server

Set the environment variables your model needs. You can copy values from `.env.example` into your own `.env` file, or set them directly in the shell.

PowerShell:

```powershell
$env:EMBEDDING_MODEL_NAME="BAAI/bge-base-en-v1.5"
$env:EMBEDDING_DIR="C:\path\to\model-cache"
$env:DEVICE="cpu"
$env:MAX_LOADED_MODELS="1"
$env:MAX_INPUTS_PER_REQUEST="128"
$env:EMBEDDING_BATCH_SIZE="32"
$env:CLEAR_CUDA_CACHE_AFTER_REQUEST="true"
$env:MODEL_KWARGS='{"local_files_only": true, "trust_remote_code": true}'
$env:ENCODE_KWARGS='{"normalize_embeddings": true}'
```

Bash:

```bash
export EMBEDDING_MODEL_NAME=BAAI/bge-base-en-v1.5
export EMBEDDING_DIR=/path/to/model-cache
export DEVICE=cpu
export MAX_LOADED_MODELS=1
export MAX_INPUTS_PER_REQUEST=128
export EMBEDDING_BATCH_SIZE=32
export CLEAR_CUDA_CACHE_AFTER_REQUEST=true
export MODEL_KWARGS='{"local_files_only": true, "trust_remote_code": true}'
export ENCODE_KWARGS='{"normalize_embeddings": true}'
```

You can also pass the same settings as CLI flags when launching the server:

```bash
remote-embedding-server \
  --host 0.0.0.0 \
  --port 5055 \
  --model-name BAAI/bge-base-en-v1.5 \
  --embedding-dir /path/to/model-cache \
  --device cuda \
  --max-loaded-models 1 \
  --max-inputs-per-request 128 \
  --embedding-batch-size 32 \
  --clear-cuda-cache-after-request \
  --model-kwargs '{"local_files_only": true, "trust_remote_code": true}' \
  --encode-kwargs '{"normalize_embeddings": true}'
```

To start the API using only the environment variables set above, run the command without flags:

```bash
remote-embedding-server
```

Or:

```bash
python -m remote_embedding
```

Defaults:

- `HOST=0.0.0.0` (listens on all network interfaces)
- `PORT=5055`

CLI flags override environment variables for the current process.
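The precedence can be pictured as "flag, then environment variable, then built-in default". The sketch below illustrates that order for the port only. The function `resolve_port` is hypothetical and is not taken from the package's source:

```python
import argparse


def resolve_port(argv: list[str], environ: dict[str, str]) -> int:
    """Pick the port from a CLI flag, then PORT, then the documented default."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--port", type=int, default=None)
    args = parser.parse_args(argv)
    if args.port is not None:
        return args.port  # an explicit flag wins
    return int(environ.get("PORT", "5055"))  # otherwise env var, then default


flag_wins = resolve_port(["--port", "6000"], {"PORT": "7000"})
env_used = resolve_port([], {"PORT": "7000"})
default_used = resolve_port([], {})
```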

## Configuration

Server configuration:

- `HOST`: bind address for the FastAPI server
- `PORT`: bind port for the FastAPI server
- `EMBEDDING_MODEL_NAME`: default model to preload and use when a request does not pass `model_name`
- `EMBEDDING_DIR`: optional local cache/model directory for Hugging Face downloads or local files
- `DEVICE`: device passed to `HuggingFaceEmbeddings`, such as `cpu` or `cuda`
- `MAX_LOADED_MODELS`: maximum number of embedding model instances kept in memory, default `1`
- `MAX_INPUTS_PER_REQUEST`: maximum number of strings accepted in one `/embed` request, default `128`
- `EMBEDDING_BATCH_SIZE`: default encoder `batch_size`, default `32`
- `CLEAR_CUDA_CACHE_AFTER_REQUEST`: clears unused CUDA allocator memory after each embedding request, default `true`
- `MODEL_KWARGS`: JSON object merged into `HuggingFaceEmbeddings(..., model_kwargs=...)`
- `ENCODE_KWARGS`: JSON object passed to `HuggingFaceEmbeddings(..., encode_kwargs=...)`
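
Because `MAX_INPUTS_PER_REQUEST` caps the number of strings per `/embed` call, a caller with a large corpus may need to split it into smaller requests. This README does not say whether `RemoteEmbeddings` does that automatically, so the sketch below assumes it does not. The helper names `chunked` and `embed_in_chunks` are hypothetical, and `embeddings` can be any object with an `embed_documents` method:

```python
from collections.abc import Iterator, Sequence


def chunked(texts: Sequence[str], max_inputs: int = 128) -> Iterator[list[str]]:
    """Yield consecutive slices of at most max_inputs strings."""
    if max_inputs < 1:
        raise ValueError("max_inputs must be at least 1")
    for start in range(0, len(texts), max_inputs):
        yield list(texts[start:start + max_inputs])


def embed_in_chunks(embeddings, texts: Sequence[str], max_inputs: int = 128) -> list:
    """Embed texts in several requests, preserving the input order."""
    vectors = []
    for batch in chunked(texts, max_inputs):
        vectors.extend(embeddings.embed_documents(batch))
    return vectors
```

Keep `max_inputs` at or below the server's `MAX_INPUTS_PER_REQUEST` value.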

Client configuration through `RemoteEmbeddings(...)`:

- `base_url`: full server URL, such as `http://127.0.0.1:5055`
- `timeout`: request timeout in seconds
- `expected_dimensions`: optional validation for returned vector size
- `model_name`: optional per-client default model name sent with each request
- `embedding_dir`: optional per-client cache/model directory override sent with each request
- `model_kwargs`: optional JSON-serializable dict sent to the server and merged into `model_kwargs`
- `encode_kwargs`: optional JSON-serializable dict sent to the server as `encode_kwargs`

Call `embeddings.close()` when you are done with a long-lived client, or use `RemoteEmbeddings` as a context manager. This closes the client's HTTP connection pool. GPU memory is owned by the server process and is released when models are evicted or the server shuts down.

If `EMBEDDING_MODEL_NAME` is configured on the server, the server can preload one shared embedding model instance and let multiple applications reuse it. That is what saves VRAM versus loading the same model separately in each application process.

`model_kwargs` and `encode_kwargs` become part of the server-side model cache key. Different combinations can create different embedding instances. The server evicts older instances once `MAX_LOADED_MODELS` is exceeded, and defaults to keeping one model loaded to protect GPU memory.
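
To make the cache behaviour concrete, here is a small sketch of a bounded model cache. It assumes the key is built from the model name plus both kwargs dicts and that eviction is least-recently-used. Neither detail is confirmed by this README, and `cache_key` and `ModelCache` are hypothetical names:

```python
import json
from collections import OrderedDict


def cache_key(model_name: str, model_kwargs: dict, encode_kwargs: dict) -> str:
    """Build a stable key; sort_keys makes dict ordering irrelevant."""
    return json.dumps([model_name, model_kwargs, encode_kwargs], sort_keys=True)


class ModelCache:
    """Keep at most max_loaded_models instances, evicting the least recently used."""

    def __init__(self, max_loaded_models: int = 1):
        self.max_loaded_models = max_loaded_models
        self._models: OrderedDict[str, object] = OrderedDict()

    def __len__(self) -> int:
        return len(self._models)

    def get(self, key: str, loader):
        if key in self._models:
            self._models.move_to_end(key)  # mark as most recently used
            return self._models[key]
        model = loader()  # load only on a cache miss
        self._models[key] = model
        while len(self._models) > self.max_loaded_models:
            self._models.popitem(last=False)  # drop the oldest entry
        return model
```

With a limit of `1`, two clients that differ only in `encode_kwargs` would force a reload on every alternating request, so keep kwargs identical across applications that are meant to share one instance.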

## Use The Client

```python
from remote_embedding import RemoteEmbeddings

embeddings = RemoteEmbeddings(
    base_url="http://127.0.0.1:5055",
    timeout=120,
    expected_dimensions=768,
    model_name="BAAI/bge-base-en-v1.5",
    embedding_dir="C:/models/cache",
    model_kwargs={"local_files_only": True, "trust_remote_code": True},
    encode_kwargs={"normalize_embeddings": True},
)

docs = embeddings.embed_documents(["hello world", "remote embeddings"])
query = embeddings.embed_query("search text")
embeddings.close()
```

Or:

```python
from remote_embedding import RemoteEmbeddings

with RemoteEmbeddings(base_url="http://127.0.0.1:5055") as embeddings:
    docs = embeddings.embed_documents(["hello world", "remote embeddings"])
```

## RAG Pipeline Usage

If your RAG pipeline currently loads a local embedding model inside each application process, you can replace that with `RemoteEmbeddings` and route embedding calls to one shared server.

Before:

```python
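from langchain_huggingface import HuggingFaceEmbeddings

# Local cache directory; every process using this code loads its own model copy.
EMBEDDING_DIR = "C:/models/cache"

embed_model = HuggingFaceEmbeddings(
    model_name="Qwen/Qwen3-Embedding-0.6B",
    model_kwargs={"device": "cuda", "local_files_only": True},
    cache_folder=EMBEDDING_DIR,
)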
```

After:

```python
from remote_embedding import RemoteEmbeddings

embed_model = RemoteEmbeddings(
    base_url="http://127.0.0.1:5055",
    model_name="Qwen/Qwen3-Embedding-0.6B",
    embedding_dir="C:/models/cache",
    encode_kwargs={"normalize_embeddings": True},
)
```

This makes it easier for multiple RAG applications, workers, or services to share the same loaded embedding model instead of each loading its own copy into GPU memory.
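
Downstream code only relies on the `embed_documents` and `embed_query` methods, so it does not need to know whether the model is local or remote. The sketch below ranks documents against a query with a dot product, which equals cosine similarity when `normalize_embeddings` is enabled. The function `rank_documents` is hypothetical and stands in for whatever retrieval step your pipeline uses:

```python
def rank_documents(embed_model, query: str, documents: list[str]) -> list[tuple[str, float]]:
    """Return (document, score) pairs sorted from most to least similar."""
    doc_vectors = embed_model.embed_documents(documents)
    query_vector = embed_model.embed_query(query)
    scores = [
        sum(q * d for q, d in zip(query_vector, vector))  # dot product
        for vector in doc_vectors
    ]
    return sorted(zip(documents, scores), key=lambda pair: pair[1], reverse=True)
```

Passing the `embed_model` from either the "Before" or the "After" snippet works without changing this function.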

## Build For PyPI

Build distributions locally:

```bash
python -m pip install --upgrade build
python -m build
```

This creates:

- `dist/*.tar.gz`
- `dist/*.whl`

Upload with Twine:

```bash
python -m pip install --upgrade twine
python -m twine upload dist/*
```

## Contributing

Contributions are welcome through issues and pull requests.

Typical local workflow:

```bash
git clone git@github.com:MeshkatShB/remote-embedding.git
cd remote-embedding
python -m pip install --upgrade build
python -m build
```

If you change packaging metadata, rebuild `dist/` before opening a release-oriented pull request.

## License

This project is licensed under the MIT License. See `LICENSE` for the full text.

## Citation

If you use this project in research, infrastructure, or published work, cite the repository:

```bibtex
@software{bagheri_remote_embedding_2026,
  author = {Bagheri, Meshkat Shariat},
  title = {remote-embedding},
  year = {2026},
  url = {https://github.com/MeshkatShB/remote-embedding}
}
```
