Metadata-Version: 2.1
Name: convozen
Version: 0.3.7
Summary: Python SDK for ConvoZen TTS and STT services
Author: Convozen
Keywords: tts,stt,speech,voice,convozen
Classifier: License :: OSI Approved :: MIT License
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Multimedia :: Sound/Audio :: Speech
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: requests <3.0,>=2.28.0
Provides-Extra: async
Requires-Dist: httpx <1.0,>=0.24.0 ; extra == 'async'
Requires-Dist: httpcore[asyncio] ; extra == 'async'
Provides-Extra: dev
Requires-Dist: pytest ; extra == 'dev'
Requires-Dist: responses ; extra == 'dev'
Requires-Dist: pytest-asyncio ; extra == 'dev'
Requires-Dist: anyio[trio] ; extra == 'dev'
Requires-Dist: respx ; extra == 'dev'
Requires-Dist: httpx <1.0,>=0.24.0 ; extra == 'dev'

# ConvoZen Voice SDK

Text-to-Speech (TTS) and Speech-to-Text (STT) for Python.

> One API key for both TTS and STT.

## Install

```bash
pip install convozen
```

Requires Python 3.9 or later.

---

## TTS — Convert Text to Speech

Copy and run:

```python
import convozen

client = convozen.Client(api_key="your-api-key")

audio = client.tts.synthesize("नमस्ते… मुझे आपके सर्विस के बारे में थोड़ी जानकारी चाहिए।", language="hi")

with open("output.wav", "wb") as f:
    f.write(audio)
```

With options:

```python
import convozen

client = convozen.Client(api_key="your-api-key")

audio = client.tts.synthesize(
    text="Welcome to Playground.",
    language="en",          # default: "en"
    voice="roohi",          # default: "roohi"
    model="ragini-v1",      # default: "ragini-v1"
    speed=1.2,              # default: 1.0
)

with open("output.wav", "wb") as f:
    f.write(audio)
```

### Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `text` | `str` | *required* | Text to convert to speech |
| `language` | `str` | `"en"` | Language code (see supported languages below) |
| `voice` | `str` | `"roohi"` | Voice ID for synthesis (see available voices below) |
| `model` | `str` | `"ragini-v1"` | TTS model to use. Currently only `ragini-v1` is available |
| `speed` | `float` | `1.0` | Speech speed. `0.5` = half speed, `2.0` = double speed |
| `stream` | `bool` | `False` | If `True`, returns an iterable that yields audio chunks as they are generated instead of the complete audio |
| `format` | `str` | `"wav"` | Audio output format. Currently only `wav` is supported |

### Streaming

```python
import convozen

client = convozen.Client(api_key="your-api-key")

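# Handle each chunk as soon as it arrives. A file is used here so the snippet
# runs as-is; swap in your own audio sink (for example, a playback buffer).
with open("output.wav", "wb") as audio_player:
    for chunk in client.tts.synthesize("Long text here...", language="en", stream=True):
        audio_player.write(chunk)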
```

### Available Voices

| Voice | ID | Default |
|-------|----|---------|
| Roohi | `roohi` | Yes |
| Amaya | `amaya` | |
| Kiyansh | `kiyansh` | |
| Neeraj | `neeraj` | |
| Manya | `manya` | |
| Nidhi | `nidhi` | |
| Ira | `ira` | |
| Trisha | `trisha` | |
| Charvi | `charvi` | |

```python
# Pass a voice ID from the table above (reuses `client` from the earlier examples).
audio = client.tts.synthesize("Hello", language="en", voice="roohi")
```
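
To compare the voices in the table, one option is to render the same sentence once per voice ID. The sketch below is only an illustration: `render_samples` and `fake_synthesize` are hypothetical names, not part of the SDK, and the stand-in function lets the code run without an API key. It assumes `synthesize` returns `bytes`, as the examples above suggest.

```python
import os
import tempfile
from typing import Callable, Dict, Iterable

# Voice IDs copied from the table above.
VOICE_IDS = ["roohi", "amaya", "kiyansh", "neeraj", "manya",
             "nidhi", "ira", "trisha", "charvi"]


def render_samples(synthesize: Callable[..., bytes], text: str, out_dir: str,
                   voices: Iterable[str] = VOICE_IDS) -> Dict[str, str]:
    """Synthesize `text` once per voice and return {voice_id: file_path}."""
    os.makedirs(out_dir, exist_ok=True)
    paths = {}
    for voice in voices:
        audio = synthesize(text, language="en", voice=voice)
        path = os.path.join(out_dir, f"{voice}.wav")
        with open(path, "wb") as f:
            f.write(audio)
        paths[voice] = path
    return paths


# Offline stand-in; with the SDK, pass client.tts.synthesize instead.
def fake_synthesize(text, language="en", voice="roohi"):
    return f"{voice}:{text}".encode()


samples = render_samples(fake_synthesize, "Hello", tempfile.mkdtemp())
print(sorted(samples))
```

Each voice triggers a separate synthesis request, so a full run against the real service makes nine calls.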

---

## STT — Convert Speech to Text

Copy and run:

```python
import convozen

client = convozen.Client(api_key="your-api-key")

result = client.stt.transcribe("recording.wav")
print(result.text)              # "hello how can I help you"
print(result.score)             # -2.45  (confidence — closer to 0 is better)
print(result.word_timestamps)   # [WordTimestamp(word="hello", start_s=0.0, end_s=0.4), …]
```

Word-level timestamps are always returned — no extra flag needed.
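
Because every word carries `start_s` and `end_s`, the transcript can be split into phrases wherever the speaker pauses. The following is a minimal sketch under that assumption; `split_on_pauses` and the `Word` stand-in are hypothetical helpers written for this example, and the 0.5 second threshold is an arbitrary choice.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Word:
    """Stand-in with the same fields as the WordTimestamp shown above."""
    word: str
    start_s: float
    end_s: float


def split_on_pauses(words: Iterable, max_gap_s: float = 0.5) -> List[str]:
    """Group words into phrases, starting a new phrase after a long pause."""
    phrases: List[List[str]] = []
    prev_end = None
    for w in words:
        # Start a new phrase at the first word or after a gap above the threshold.
        if prev_end is None or w.start_s - prev_end > max_gap_s:
            phrases.append([])
        phrases[-1].append(w.word)
        prev_end = w.end_s
    return [" ".join(p) for p in phrases]


demo = [
    Word("hello", 0.0, 0.4),
    Word("how", 1.2, 1.4),
    Word("can", 1.45, 1.6),
    Word("I", 1.65, 1.7),
    Word("help", 1.75, 2.0),
    Word("you", 2.05, 2.3),
]
print(split_on_pauses(demo))  # ['hello', 'how can I help you']
```

With the SDK, `split_on_pauses(result.word_timestamps)` should work the same way, provided the objects expose the fields shown in the example output.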

---

### Speaker Diarization

Set `speaker_labels=True` to get a per-speaker breakdown instead of a flat transcript. Useful for call recordings, interviews, or any audio with multiple speakers:

```python
result = client.stt.transcribe("call.wav", speaker_labels=True)

for turn in result.turns:
    print(f"[{turn.speaker_id}]  {turn.start_sec:.1f}s – {turn.end_sec:.1f}s")
    print(f"  {turn.transcript}")

# [s0]  0.0s – 3.2s
#   Hello, this is support. How can I help you?
# [s1]  4.0s – 7.8s
#   Hi, I'd like to reschedule my appointment.
```

With `speaker_labels=True`, the result is a `DiarizeResponse` with these fields:

| Field | Type | Description |
|-------|------|-------------|
| `turns` | `list[DiarizeTurn]` | Speaker turns in order |
| `num_speakers` | `int` | Number of distinct speakers detected |
| `total_duration_sec` | `float` | Total audio duration in seconds |

Each `DiarizeTurn` has `speaker_id`, `start_sec`, `end_sec`, `transcript`.
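
A common follow-up on diarized output is talk time per speaker. The sketch below relies only on the `speaker_id`, `start_sec`, and `end_sec` fields listed above; `talk_time` is a hypothetical helper, and `SimpleNamespace` objects stand in for `DiarizeTurn` so the code runs offline.

```python
from collections import defaultdict
from types import SimpleNamespace
from typing import Dict, Iterable


def talk_time(turns: Iterable) -> Dict[str, float]:
    """Sum the duration of each speaker's turns, in seconds."""
    totals: Dict[str, float] = defaultdict(float)
    for turn in turns:
        totals[turn.speaker_id] += turn.end_sec - turn.start_sec
    return dict(totals)


# Stand-in turns; the first two mirror the example output above.
demo_turns = [
    SimpleNamespace(speaker_id="s0", start_sec=0.0, end_sec=3.2, transcript="..."),
    SimpleNamespace(speaker_id="s1", start_sec=4.0, end_sec=7.8, transcript="..."),
    SimpleNamespace(speaker_id="s0", start_sec=8.0, end_sec=9.5, transcript="..."),
]
for speaker, seconds in talk_time(demo_turns).items():
    print(f"{speaker}: {seconds:.1f}s")
# s0: 4.7s
# s1: 3.8s
```

With the SDK, pass `result.turns` from a `speaker_labels=True` call.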

---

### Specify Languages

If you know which languages are spoken in the audio, pass them in `lang_tags`. This improves accuracy — especially for multilingual audio like Hindi + English call recordings:

```python
result = client.stt.transcribe(
    "recording.wav",
    lang_tags=["hi", "en"],
)
print(result.text)  # "हां मुझे appointment reschedule करना है"
```

---

### Keyword Boosting

Have domain-specific words the model keeps getting wrong? Pass them in `keywords` and the model will bias towards recognizing them:

```python
# Without keyword boosting: "can you tell me about conversion"
# With keyword boosting:    "can you tell me about convozen"

result = client.stt.transcribe(
    "recording.wav",
    keywords=["convozen", "akshara"],
)
```

---

### Speech Separation

`speech_separation` controls whether audio enhancement is applied before diarized transcription. It is enabled by default and helps with noisy recordings, overlapping speech, and audio captured in difficult conditions.

> **Note:** `speech_separation` only takes effect when `speaker_labels=True` (diarization path). It has no effect on plain transcription.

Disable it on clean studio-quality audio to save processing time:

```python
# Default — enhancement on (recommended for most real-world audio)
result = client.stt.transcribe("noisy_call.wav", speaker_labels=True)

# Disable for clean, studio-quality audio to save processing time
result = client.stt.transcribe("clean_recording.wav", speaker_labels=True, speech_separation=False)
```

---

### Choose a Model

Use `model` to select which ASR model processes the audio:

```python
# Fast model — lower latency, good for real-time or bulk processing
result = client.stt.transcribe("recording.wav", model="akshara-flash")

# Pro model — highest accuracy (default)
result = client.stt.transcribe("recording.wav", model="akshara-pro")
```

| Model | Best for |
|-------|----------|
| `akshara-pro` | Maximum accuracy (default) |
| `akshara-flash` | Lower latency, bulk / real-time use |

---

### Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `audio` | `str` | *required* | Path to the audio file |
| `speaker_labels` | `bool` | `False` | `True` to return per-speaker turns instead of a flat transcript |
| `model` | `str` | `"akshara-pro"` | ASR model to use (`"akshara-pro"` or `"akshara-flash"`) |
| `lang_tags` | `list[str]` | `None` | Languages spoken in the audio — improves accuracy when known |
| `blank_penalty` | `float` | `None` | Penalizes silence/blank tokens; increase to reduce empty gaps in output |
| `keywords` | `list[str]` | `None` | Domain words the model should bias towards recognizing |
| `speech_separation` | `bool` | `True` | Apply speech enhancement before transcription — helps on noisy audio. Only applies when `speaker_labels=True` |
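
The options in this table can be collected in one place before calling `transcribe`. The sketch below is a hypothetical helper (`transcribe_options` is not part of the SDK) that drops unset options and warns when `speech_separation` is passed without `speaker_labels=True`, since the table says it would be ignored in that case.

```python
import warnings
from typing import Any, Dict, List, Optional


def transcribe_options(
    speaker_labels: bool = False,
    model: str = "akshara-pro",
    lang_tags: Optional[List[str]] = None,
    blank_penalty: Optional[float] = None,
    keywords: Optional[List[str]] = None,
    speech_separation: Optional[bool] = None,
) -> Dict[str, Any]:
    """Build keyword arguments for `transcribe`, omitting unset options."""
    if speech_separation is not None and not speaker_labels:
        # Per the table above, this flag only applies on the diarization path.
        warnings.warn("speech_separation has no effect unless speaker_labels=True")
        speech_separation = None
    options = {
        "speaker_labels": speaker_labels,
        "model": model,
        "lang_tags": lang_tags,
        "blank_penalty": blank_penalty,
        "keywords": keywords,
        "speech_separation": speech_separation,
    }
    return {k: v for k, v in options.items() if v is not None}


opts = transcribe_options(
    speaker_labels=True,
    lang_tags=["hi", "en"],
    speech_separation=False,
)
print(opts)
# With the SDK: result = client.stt.transcribe("call.wav", **opts)
```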

---

## Supported Languages

| Code | Language | TTS | STT |
|------|----------|:---:|:---:|
| `en` | English | Yes | Yes |
| `hi` | Hindi | Yes | Yes |
| `ta` | Tamil | Yes | Yes |
| `te` | Telugu | Yes | Yes |
| `kn` | Kannada | Yes | Yes |
| `mr` | Marathi | — | Yes |
| `bn` | Bengali | — | Yes |
| `gu` | Gujarati | — | Yes |
| `ml` | Malayalam | — | Yes |
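
Since four of these languages are STT-only, it can help to check a language code locally before sending a request. A minimal sketch, assuming the table above is current; `SUPPORTED` and `check_language` are hypothetical names, and the SDK may already perform its own validation.

```python
# Mirrors the table above; update it if the service adds languages.
SUPPORTED = {
    "en": {"tts", "stt"},
    "hi": {"tts", "stt"},
    "ta": {"tts", "stt"},
    "te": {"tts", "stt"},
    "kn": {"tts", "stt"},
    "mr": {"stt"},
    "bn": {"stt"},
    "gu": {"stt"},
    "ml": {"stt"},
}


def check_language(code: str, service: str) -> None:
    """Raise ValueError if `code` is not listed for `service` ("tts" or "stt")."""
    if service not in SUPPORTED.get(code, set()):
        raise ValueError(f"{code!r} is not listed for {service.upper()}")


check_language("hi", "tts")  # passes silently
try:
    check_language("mr", "tts")
except ValueError as exc:
    print(exc)  # 'mr' is not listed for TTS
```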

---

## Check Credits

```python
import convozen

client = convozen.Client(api_key="your-api-key")
info = client.account.info()
print(info.credits.tts.balance)
print(info.credits.stt.balance)
```
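
Before a large batch job, the balances can be compared against a threshold. The sketch below assumes `balance` is a number, which the example above does not confirm; `low_balance` is a hypothetical helper and `demo_info` is a stand-in for the object returned by `client.account.info()`.

```python
from types import SimpleNamespace
from typing import List


def low_balance(info, threshold: float) -> List[str]:
    """Return the services ("tts", "stt") whose balance is below `threshold`."""
    return [
        name
        for name in ("tts", "stt")
        if getattr(info.credits, name).balance < threshold
    ]


# Stand-in with the same attribute layout as the example above.
demo_info = SimpleNamespace(
    credits=SimpleNamespace(
        tts=SimpleNamespace(balance=120.0),
        stt=SimpleNamespace(balance=3.5),
    )
)
print(low_balance(demo_info, threshold=10))  # ['stt']
```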

---

## Error Handling

```python
import convozen
from convozen import AuthenticationError, RateLimitError, APIError

client = convozen.Client(api_key="your-api-key")

try:
    audio = client.tts.synthesize("Hello")
except AuthenticationError:
    print("Invalid API key")
except RateLimitError:
    print("Too many requests")
except APIError as e:
    print(f"Server error: {e}")
```
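
For `RateLimitError`, a caller may prefer to wait and retry rather than fail immediately. The sketch below shows a generic exponential backoff wrapper. `with_retries` is a hypothetical helper, the delays are arbitrary, and `FakeRateLimit` stands in for `convozen.RateLimitError` so the demo runs offline. Whether the SDK already retries internally is not stated here, so treat this as optional.

```python
import time
from typing import Callable, Type, TypeVar

T = TypeVar("T")


def with_retries(
    call: Callable[[], T],
    retry_on: Type[Exception],
    attempts: int = 3,
    base_delay_s: float = 1.0,
) -> T:
    """Run `call`, retrying on `retry_on` with exponential backoff."""
    for attempt in range(attempts):
        try:
            return call()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts; let the caller handle it
            time.sleep(base_delay_s * 2 ** attempt)
    raise ValueError("attempts must be at least 1")


class FakeRateLimit(Exception):
    """Stand-in for convozen.RateLimitError."""


calls = {"n": 0}


def flaky() -> str:
    # Fails twice, then succeeds, to exercise the retry path.
    calls["n"] += 1
    if calls["n"] < 3:
        raise FakeRateLimit()
    return "ok"


print(with_retries(flaky, FakeRateLimit, base_delay_s=0.01))  # ok
```

With the SDK, the call would look like `with_retries(lambda: client.tts.synthesize("Hello"), RateLimitError)`.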

---

## Standalone Clients

```python
from convozen import TTS, STT

tts = TTS(api_key="your-api-key")
audio = tts.synthesize("Hello", language="hi")

stt = STT(api_key="your-api-key")

# Flat transcript
result = stt.transcribe("recording.wav")
print(result.text)

# Per-speaker diarization
result = stt.transcribe("call.wav", speaker_labels=True)
for turn in result.turns:
    print(f"[{turn.speaker_id}] {turn.start_sec:.2f}s – {turn.end_sec:.2f}s : {turn.transcript}")
```
