Metadata-Version: 2.4
Name: sifft
Version: 0.8.2
Summary: Spark Ingestion Framework for Tables (SIFFT) - a Python library providing a consistent interface for ingesting files into Apache Spark environments.
Author: Iwan Dyke, Fahad Khan, Michał Poręba
License-Expression: MIT
License-File: LICENSE
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Natural Language :: English
Requires-Python: >=3.10
Requires-Dist: fsspec>=2023.1.0
Requires-Dist: openpyxl>=3.1.0
Requires-Dist: pandas>=2.0.0
Requires-Dist: pyld>=2.0.4
Requires-Dist: s3fs>=2023.1.0
Requires-Dist: setuptools>=58.0
Requires-Dist: xlrd>=2.0.2
Provides-Extra: dev
Requires-Dist: mypy>=1.0.0; extra == 'dev'
Requires-Dist: pre-commit>=4.0.0; extra == 'dev'
Requires-Dist: pytest-bdd>=8.1.0; extra == 'dev'
Requires-Dist: pytest-cov>=6.3.0; extra == 'dev'
Requires-Dist: pytest-html>=4.1.1; extra == 'dev'
Requires-Dist: pytest>=9.0.2; extra == 'dev'
Requires-Dist: ruff>=0.15.1; extra == 'dev'
Provides-Extra: dev-spark3
Requires-Dist: delta-spark<4,>=3.2; extra == 'dev-spark3'
Requires-Dist: mypy>=1.0.0; extra == 'dev-spark3'
Requires-Dist: pre-commit>=4.0.0; extra == 'dev-spark3'
Requires-Dist: pyspark<4,>=3.5; extra == 'dev-spark3'
Requires-Dist: pytest-bdd>=8.1.0; extra == 'dev-spark3'
Requires-Dist: pytest-cov>=6.3.0; extra == 'dev-spark3'
Requires-Dist: pytest-html>=4.1.1; extra == 'dev-spark3'
Requires-Dist: pytest>=9.0.2; extra == 'dev-spark3'
Requires-Dist: ruff>=0.15.1; extra == 'dev-spark3'
Provides-Extra: dev-spark4
Requires-Dist: delta-spark<5,>=4; extra == 'dev-spark4'
Requires-Dist: mypy>=1.0.0; extra == 'dev-spark4'
Requires-Dist: pre-commit>=4.0.0; extra == 'dev-spark4'
Requires-Dist: pyspark<5,>=4; extra == 'dev-spark4'
Requires-Dist: pytest-bdd>=8.1.0; extra == 'dev-spark4'
Requires-Dist: pytest-cov>=6.3.0; extra == 'dev-spark4'
Requires-Dist: pytest-html>=4.1.1; extra == 'dev-spark4'
Requires-Dist: pytest>=9.0.2; extra == 'dev-spark4'
Requires-Dist: ruff>=0.15.1; extra == 'dev-spark4'
Provides-Extra: pyspark3
Requires-Dist: delta-spark<4,>=3.2; extra == 'pyspark3'
Requires-Dist: pyspark<4,>=3.5; extra == 'pyspark3'
Provides-Extra: pyspark4
Requires-Dist: delta-spark<5,>=4; extra == 'pyspark4'
Requires-Dist: pyspark<5,>=4; extra == 'pyspark4'
Description-Content-Type: text/markdown

# SIFFT

**Spark Ingestion Framework For Tables** — format-agnostic file reading, data validation, and table writing for PySpark.

## Install

```bash
pip install sifft                # Without PySpark (e.g. when your Databricks runtime provides it)
pip install "sifft[pyspark3]"    # With PySpark 3.5.x + Delta Lake 3.2+
pip install "sifft[pyspark4]"    # With PySpark 4.x + Delta Lake 4.x
```

## Quick Start

```python
from pyspark.sql import SparkSession
from file_processing import process_file
from dataframe_validation import validate_csvw_constraints
from table_writing import write_table, TableWriteOptions

spark = SparkSession.builder.appName("Pipeline").getOrCreate()

# Read (auto-detects format, delimiter, header)
result = process_file("data.csv", spark)

# Validate (if CSVW metadata exists alongside the file)
if result.metadata:
    report = validate_csvw_constraints(result.dataframe, result.metadata)
    if not report.valid:
        # Raise explicitly: assert statements are skipped under `python -O`
        raise ValueError(f"{len(report.violations)} violations")

# Write
write_table(result.dataframe, "catalog.schema.target", spark,
            TableWriteOptions(format="delta", mode="append"))
```

## Features

- **File Processing** — CSV, TSV, pipe-delimited, Excel. Auto-detection of delimiters and headers. Checksum-based deduplication.
- **Data Validation** — CSVW constraint checking: required, unique, min/max, pattern, enum, primary keys.
- **Table Writing** — Delta, Parquet, ORC. Append, overwrite, merge/upsert, schema evolution.
- **File Management** — Safe move/list across local, S3, Azure, GCS.
- **Extensibility** — Custom format handlers, write modes, and constraint validators.
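
As a rough illustration of two ideas from the list above, delimiter auto-detection and checksum-based deduplication, the sketch below uses only the Python standard library. It does not show how SIFFT implements them: the names `sniff_delimiter`, `file_checksum`, and `is_duplicate` are hypothetical, and the choice of SHA-256 is an assumption.

```python
import csv
import hashlib


def sniff_delimiter(sample: str, candidates: str = ",\t|") -> str:
    """Guess the delimiter of a text sample, restricted to the candidates."""
    return csv.Sniffer().sniff(sample, delimiters=candidates).delimiter


def file_checksum(data: bytes) -> str:
    """Return a SHA-256 hex digest used as a content fingerprint."""
    return hashlib.sha256(data).hexdigest()


def is_duplicate(data: bytes, seen: set[str]) -> bool:
    """Report whether identical content was seen before; record it otherwise."""
    digest = file_checksum(data)
    if digest in seen:
        return True
    seen.add(digest)
    return False
```

Keying on a content hash rather than a file name means a re-delivered file is skipped even if it was renamed. A real pipeline would persist the set of seen digests instead of holding it in memory.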

## Documentation

- [Getting Started](docs/getting-started.md) — installation, requirements, full quick start
- [File Processing](docs/file-processing.md) — reading files, tracking, deduplication
- [Data Validation](docs/data-validation.md) — schema validation and CSVW constraints
- [Table Writing](docs/table-writing.md) — write modes, formats, schema evolution
- [File Management](docs/file-management.md) — safe move/list with cloud support
- [CSVW Metadata](docs/csvw-metadata.md) — metadata format and constraint reference
- [Extensibility](docs/extensibility.md) — custom handlers and validators
- [AutoLoader Integration](docs/autoloader-integration.md) — using SIFFT with Databricks AutoLoader

## Compatibility

- Python 3.10+
- PySpark 3.5.x, 4.0.x, or 4.1.x
- Databricks / Unity Catalog compatible
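
If you rely on a PySpark installation provided by your runtime, it can help to check its version against the list above before running a pipeline. The helper below is a hypothetical sketch, not part of SIFFT, and assumes a version string of the usual `major.minor.patch` form.

```python
# Supported (major, minor) PySpark versions, taken from the list above
SUPPORTED_PYSPARK = {(3, 5), (4, 0), (4, 1)}


def pyspark_supported(version: str) -> bool:
    """Return True if a version string such as '3.5.1' is in the supported set."""
    major, minor = (int(part) for part in version.split(".")[:2])
    return (major, minor) in SUPPORTED_PYSPARK
```

To check the installed package, pass it the result of `importlib.metadata.version("pyspark")`.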

## Development

```bash
just test        # Run tests in Docker
just test-local  # Run tests locally (requires Java)
just --list      # All available commands
```

## Design Decisions

Architecture Decision Records are in [design_decisions/](design_decisions/).

## License

MIT — see [LICENSE](LICENSE).

## Contributors

- Iwan Dyke
- Fahad Khan
- Michał Poręba
