Architecture — Arnio

System shape

Runtime flow

CSV / JSONL / pandas records
        |
        v
C++ parser and columnar Frame
        |
        v
pybind11 extension (_arnio_cpp)
        |
        v
Python ArFrame wrapper
        |
        +--> cleaning pipeline
        +--> quality profile and gates
        +--> schema validation
        +--> pandas / Arrow / DuckDB / Parquet

C++ core

The native layer owns the columnar data model, CSV parsing, CSV writing, and performance-sensitive cleaning primitives.

Frame and Column

Typed columns store values and null masks. Frame-level guards enforce row-count consistency across columns and reject duplicate column names.
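As a rough illustration of the data model, the sketch below shows a typed column with a parallel null mask and a frame that checks row counts and column names. It is pure Python and entirely hypothetical: the `Column` and `Frame` classes here are illustrative names, not Arnio's native C++ types or their bindings.

```python
from dataclasses import dataclass, field


@dataclass
class Column:
    """Hypothetical typed column: values plus a parallel null mask."""
    name: str
    dtype: type
    values: list
    nulls: list  # True where the value is missing

    def __post_init__(self):
        if len(self.values) != len(self.nulls):
            raise ValueError(f"column {self.name!r}: values and null mask differ in length")
        # Only non-null slots are type-checked; null slots hold a placeholder.
        for value, is_null in zip(self.values, self.nulls):
            if not is_null and not isinstance(value, self.dtype):
                raise TypeError(f"column {self.name!r}: {value!r} is not {self.dtype.__name__}")


@dataclass
class Frame:
    """Hypothetical frame that guards row counts and column names."""
    columns: list = field(default_factory=list)

    def add(self, column: Column) -> None:
        if any(existing.name == column.name for existing in self.columns):
            raise ValueError(f"duplicate column name: {column.name!r}")
        if self.columns and len(column.values) != len(self.columns[0].values):
            raise ValueError(f"column {column.name!r}: row count mismatch")
        self.columns.append(column)


frame = Frame()
frame.add(Column("id", int, [1, 2, 3], [False, False, False]))
frame.add(Column("city", str, ["Oslo", "", "Lima"], [False, True, False]))
```

Keeping the null mask separate from the values lets a column stay homogeneously typed even when some entries are missing.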

CSV reader

Handles quoted records, multiline fields, malformed row widths, decimal separators, dtype overrides, Unicode paths, and decoding policy.
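The reader cases above can be illustrated with Python's standard `csv` module. This is only a sketch of the behaviour, not Arnio's native parser: the `read_rows` helper and its `on_bad_width` parameter are hypothetical names introduced for the example.

```python
import csv
import io

# One quoted multiline field, one quoted field with a comma, one short row.
raw = 'id,note\n1,"line one\nline two"\n2,"has, comma"\n3\n'


def read_rows(text, on_bad_width="error"):
    """Parse CSV text and apply a policy to rows whose width differs from the header."""
    # newline="" leaves embedded line breaks in quoted fields untouched.
    reader = csv.reader(io.StringIO(text, newline=""))
    header = next(reader)
    rows = []
    for record_number, row in enumerate(reader, start=1):
        if len(row) != len(header):
            if on_bad_width == "skip":
                continue
            raise ValueError(
                f"record {record_number}: expected {len(header)} fields, got {len(row)}"
            )
        rows.append(row)
    return header, rows


header, rows = read_rows(raw, on_bad_width="skip")
```

With `on_bad_width="skip"` the short third record is dropped; with the default policy the same input raises `ValueError` instead.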

CSV writer

Validates the delimiter, then writes headers, quoted values, multiline fields, and line terminators.
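A minimal sketch of the same idea using the standard `csv` module: check the delimiter and line terminator before writing, and let the writer quote fields that need it. The `write_csv` function and its parameters are hypothetical and do not reflect Arnio's writer signature.

```python
import csv
import io


def write_csv(header, rows, delimiter=",", line_terminator="\n"):
    """Serialize rows to CSV text after validating the delimiter and terminator."""
    if len(delimiter) != 1 or delimiter in ('"', "\n", "\r"):
        raise ValueError(f"invalid delimiter: {delimiter!r}")
    if line_terminator not in ("\n", "\r\n"):
        raise ValueError(f"invalid line terminator: {line_terminator!r}")
    buffer = io.StringIO(newline="")
    writer = csv.writer(
        buffer,
        delimiter=delimiter,
        lineterminator=line_terminator,
        quoting=csv.QUOTE_MINIMAL,  # quote only fields that contain special characters
    )
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


text = write_csv(["id", "note"], [["1", "a;b"], ["2", "x\ny"]], delimiter=";")

# Reading the text back recovers the original fields, including the multiline one.
round_trip = list(csv.reader(io.StringIO(text, newline=""), delimiter=";"))
```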

Native transforms

Common string and numeric operations run close to the stored columns where the implementation benefits from C++.

Python API layer

The Python package wraps native frames with stable, typed user-facing APIs. This layer validates Python inputs, translates lower-level errors into Arnio exceptions, preserves pandas attrs where supported, and exposes helpers that are easier to compose from notebooks and ETL jobs.

ArFrame, read_csv, pipeline, profile, validate, to_arrow
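The wrapper pattern this layer describes, validating Python inputs first and then translating lower-level errors into a package exception, might look roughly like the sketch below. Every name in it (`FrameError`, `_native_parse`, `read_table`) is a hypothetical stand-in; Arnio's real exception classes and function signatures are not shown here.

```python
import os


class FrameError(Exception):
    """Hypothetical stand-in for a package-level exception."""


def _native_parse(path, delimiter):
    """Stand-in for an extension call; it always fails so the translation is visible."""
    raise RuntimeError(f"parse failure in {path}")


def read_table(path, delimiter=","):
    # Validate Python-side inputs before crossing into native code.
    if not isinstance(path, (str, os.PathLike)):
        raise TypeError("path must be str or os.PathLike")
    if not isinstance(delimiter, str) or len(delimiter) != 1:
        raise ValueError("delimiter must be a single character")
    try:
        return _native_parse(os.fspath(path), delimiter)
    except RuntimeError as exc:
        # Re-raise as the package exception and keep the original as __cause__.
        raise FrameError(f"could not read {os.fspath(path)!r}: {exc}") from exc
```

Chaining with `from exc` keeps the low-level traceback available for debugging while callers only need to catch one exception family.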

Quality engine

The quality layer converts an ArFrame to a pandas DataFrame for inspection-oriented analytics. That keeps reporting expressive while the parser and storage stay native.

Object	Role
`ColumnProfile`	Column-level dtype, null, uniqueness, examples, warnings, and top-value signals.
`DataQualityReport`	Whole-frame score, score components, suggestions, Markdown/HTML/Pandas/JSON exports, and optional redaction/exclusions.
`ProfileComparison`	Drift summary between two quality reports.
`QualityGateResult`	Pass/fail wrapper for CI and release gates.
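To make the roles in the table concrete, here is a small standard-library sketch of a column-level summary and a pass/fail gate built on it. The names `ColumnSummary`, `summarize`, and `gate` are hypothetical and much simpler than the objects listed above; the null-ratio threshold is an assumed example rule, not Arnio's scoring.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class ColumnSummary:
    """Hypothetical column-level signals: null ratio, uniqueness, top values."""
    name: str
    null_ratio: float
    unique_count: int
    top_values: list


def summarize(name, values, top_n=2):
    present = [value for value in values if value is not None]
    null_ratio = 1 - len(present) / len(values) if values else 0.0
    counts = Counter(present)
    return ColumnSummary(name, null_ratio, len(counts), counts.most_common(top_n))


def gate(summaries, max_null_ratio):
    """Fail when any column exceeds the allowed null ratio."""
    failures = [s.name for s in summaries if s.null_ratio > max_null_ratio]
    return {"passed": not failures, "failures": failures}


summaries = [
    summarize("id", [1, 2, 3, 4]),
    summarize("city", ["Oslo", None, "Oslo", "Lima"]),
]
result = gate(summaries, max_null_ratio=0.1)
```

In a CI job, a result like this could be turned into a non-zero exit code when `passed` is false.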

Schema engine

The schema layer validates public data contracts after ingestion and cleaning. It supports strictness, row-level issues, warning severities, max-error limits, Markdown/Pandas output, JSON round-trip, schema diffs, YAML export helpers, and custom validators.
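Three of the features named above, strictness, row-level issues, and max-error limits, can be sketched in a few lines of plain Python. The `validate_rows` function, its parameters, and the tuple-shaped issues are assumptions made for illustration, not Arnio's schema API.

```python
def validate_rows(rows, schema, strict=True, max_errors=None):
    """Return (row_index, field, message) issues for rows checked against a schema.

    schema maps field names to expected Python types. In strict mode, fields
    that are not in the schema are reported as well.
    """
    issues = []
    for index, row in enumerate(rows):
        for name, expected in schema.items():
            if name not in row:
                issues.append((index, name, "missing field"))
            elif not isinstance(row[name], expected):
                issues.append((index, name, f"expected {expected.__name__}"))
        if strict:
            for name in row:
                if name not in schema:
                    issues.append((index, name, "unexpected field"))
        # Stop early once the error budget is used up.
        if max_errors is not None and len(issues) >= max_errors:
            return issues[:max_errors]
    return issues


schema = {"id": int, "email": str}
rows = [
    {"id": 1, "email": "a@example.com"},
    {"id": "2", "email": "b@example.com", "extra": True},
    {"id": 3},
]
strict_issues = validate_rows(rows, schema)
lenient_issues = validate_rows(rows, schema, strict=False)
capped_issues = validate_rows(rows, schema, max_errors=1)
```

Capping the issue list keeps validation cheap on large inputs where the first few failures are enough to reject a batch.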

Interop boundaries

Pandas

to_pandas, from_pandas, and the df.arnio accessor make pandas the analysis boundary.

Arrow and Parquet

to_arrow returns a pyarrow table; write_parquet writes files through the optional Parquet extra.

DuckDB

register_duckdb registers an ArFrame as a SQL relation through pandas interop.

scikit-learn

ArnioCleaner provides controlled transforms with row-count and feature-name checks.

Production hardening

Recent releases improved malformed CSV row handling, zero-column row-count preservation, duplicate column guards, JSONL nrows parsing, Windows clean targets, and validation error reporting. Duplicate CSV header handling is still being improved and is tracked through open issues and ongoing testing.