Documentation

Installation

Arnio is published on PyPI. The base install includes pandas and NumPy. Optional extras enable Arrow, Parquet, and scikit-learn workflows.

Shell

pip install arnio
pip install "arnio[arrow]"
pip install "arnio[parquet]"
pip install "arnio[sklearn]"
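After installing, it can help to confirm which optional dependencies are importable. The sketch below uses only the standard library. The mapping from extras to module names is an assumption for illustration (the Arrow and Parquet extras are assumed to provide pyarrow, and the scikit-learn extra to provide sklearn); it is not taken from Arnio's packaging metadata.

```python
from importlib.util import find_spec

# Assumed mapping from Arnio extras to importable modules (illustrative only).
ASSUMED_EXTRA_MODULES = {
    "arrow": "pyarrow",
    "parquet": "pyarrow",
    "sklearn": "sklearn",
}


def available_extras(mapping=ASSUMED_EXTRA_MODULES):
    """Return {extra_name: bool} depending on whether its module can be imported."""
    return {extra: find_spec(module) is not None for extra, module in mapping.items()}


print(available_extras())
```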

Quickstart

Use Arnio at the boundary where raw files enter analysis or automation. It keeps parsing, cleaning, profiling, and validation explicit.

Python

import arnio as ar

frame = ar.read_csv("customers.csv")
report = ar.profile(frame)
suggestions = ar.suggest_cleaning(report)
clean = ar.pipeline(frame, suggestions)

schema = ar.Schema({
    "customer_id": ar.String(nullable=False, unique=True),
    "email": ar.Email(nullable=True),
})
result = ar.validate(clean, schema, max_errors=50)

Note

The on_bad_lines parameter of read_csv defaults to "error". Use "warn" to log malformed rows without failing, or "skip" to silently skip them.

CSV Reading

read_csv handles the production edge cases listed below. These options were added across v1.17.0, v1.18.0, and current main.

Option	Use
`skiprows`	Skip leading rows before header or data parsing.
`has_header`	Control whether the first row is treated as column headers.
`usecols`	Read only a selected subset of columns.
`nrows`	Limit the number of rows read from the file.
`trim_headers`	Remove leading and trailing whitespace from column names.
`thousands_separator`	Parse numbers containing thousands separators such as commas.
`null_values`	Treat specified values as null or missing data.
`mode`	Control CSV parsing behavior for strict or relaxed ingestion workflows.
`decimal_separator`	Parse locale-specific decimal characters such as comma decimals.
`dtype`	Force selected columns to supported Arnio dtypes.
`encoding_errors` (unreleased)	Choose `strict`, `replace`, or `ignore` for decode failures.
`on_bad_lines`	Use `error`, `warn`, or `skip` for malformed row widths.

Python

frame = ar.read_csv(
    "orders.csv",
    has_header=True,
    usecols=["order_id", "amount", "customer"],
    nrows=50_000,
    skiprows=2,
    trim_headers=True,
    thousands_separator=",",
    null_values=["", "N/A", "NULL"],
    decimal_separator=".",
    dtype={"order_id": "string"},
    mode="strict",
    encoding_errors="replace",  # Unreleased
)
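To make the separator options concrete, here is a minimal pure-Python sketch of what parsing with a thousands separator and a decimal separator amounts to. The helper parse_number is hypothetical and is not part of Arnio; Arnio's native parser may treat edge cases differently.

```python
def parse_number(text, thousands_separator=",", decimal_separator="."):
    """Convert a locale-formatted numeric string to a float.

    Illustrative only: drop the grouping character, then normalize the
    decimal character to "." before calling float().
    """
    cleaned = text.strip().replace(thousands_separator, "")
    if decimal_separator != ".":
        cleaned = cleaned.replace(decimal_separator, ".")
    return float(cleaned)


# US-style grouping and European-style grouping yield the same value.
print(parse_number("1,234,567.5"))
print(parse_number("1.234.567,5", thousands_separator=".", decimal_separator=","))
```

The order matters: the grouping character is removed before the decimal character is normalized, otherwise a "." used for grouping would be mistaken for the decimal point.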

Note

on_bad_lines defaults to "error" for malformed rows. Use "warn" to log issues without failing, or "skip" to skip them silently.
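The three policies can be pictured with a small standard-library sketch. The function read_rows is hypothetical and only mimics the behavior described above for rows whose width differs from the header. Dropping the row under "warn" is an assumption of this sketch, not a documented Arnio guarantee.

```python
import csv
import io
import warnings


def read_rows(text, on_bad_lines="error"):
    """Yield data rows whose width matches the header; treat the rest per policy."""
    if on_bad_lines not in {"error", "warn", "skip"}:
        raise ValueError(f"unknown on_bad_lines value: {on_bad_lines!r}")
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    for row in reader:
        if len(row) == len(header):
            yield row
            continue
        message = f"line {reader.line_num}: expected {len(header)} fields, got {len(row)}"
        if on_bad_lines == "error":
            raise ValueError(message)
        if on_bad_lines == "warn":
            warnings.warn(message)
        # In this sketch, "warn" and "skip" both drop the malformed row and continue.


sample = "id,amount\n1,10\n2,20,extra\n3,30\n"
print(list(read_rows(sample, on_bad_lines="skip")))
```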

Chunked CSV

read_csv_chunked streams a CSV into ArFrame chunks. Chunked reads support malformed-row handling through on_bad_lines. Schema validation remains a whole-frame operation: materialize the data you want to validate first, then validate it.

Python

for chunk in ar.read_csv_chunked("huge.csv", chunksize=100_000, on_bad_lines="skip"):
    clean = ar.strip_whitespace(chunk)
    process(clean)
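One reason checks such as uniqueness do not decompose per chunk is that a duplicate can span two chunks. The standard-library sketch below illustrates the idea with hypothetical helpers (iter_chunks, duplicate_ids) that carry state across chunks. It does not use Arnio and says nothing about how read_csv_chunked is implemented.

```python
import csv
import io
from itertools import islice


def iter_chunks(lines, chunksize):
    """Yield lists of at most `chunksize` dict rows from an iterable of CSV lines."""
    reader = csv.DictReader(lines)
    while True:
        chunk = list(islice(reader, chunksize))
        if not chunk:
            return
        yield chunk


def duplicate_ids(chunks, key):
    """Find repeated key values across all chunks by keeping a running set."""
    seen, duplicates = set(), []
    for chunk in chunks:
        for row in chunk:
            value = row[key]
            if value in seen:
                duplicates.append(value)
            seen.add(value)
    return duplicates


data = io.StringIO("id,amount\na1,10\na2,18\na1,7\na3,2\n")
# "a1" appears once in each chunk, so no single chunk contains a duplicate.
print(duplicate_ids(iter_chunks(data, chunksize=2), key="id"))
```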

JSONL, CSV, and Parquet Exports

Arnio can read JSON Lines, write CSV, and write Parquet through the optional pyarrow extra. Use sniff_delimiter and scan_csv when you need a cheap look at unknown files.

Python

events = ar.read_jsonl("events.ndjson", nrows=10_000)
ar.write_csv(events, "events.csv")
ar.write_parquet(events, "events.parquet", compression="zstd")

Python

delimiter = ar.sniff_delimiter("raw_file.txt")
schema_preview = ar.scan_csv("raw_file.txt", delimiter=delimiter)
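For comparison, Python's standard library offers a similar cheap look through csv.Sniffer. The sketch below is not how sniff_delimiter works internally; it only shows the general idea of inferring a delimiter from a small text sample. The helper name guess_delimiter and the candidate set are illustrative choices.

```python
import csv


def guess_delimiter(sample, candidates=",;\t|"):
    """Guess the delimiter of a text sample with the standard library's csv.Sniffer."""
    dialect = csv.Sniffer().sniff(sample, delimiters=candidates)
    return dialect.delimiter


sample = "id;amount;customer\n1;10.5;acme\n2;18.0;globex\n"
print(repr(guess_delimiter(sample)))
```

csv.Sniffer raises csv.Error when it cannot decide, so callers that sniff untrusted files usually wrap the call and fall back to an explicit delimiter.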

ArFrame

ArFrame is the public Python wrapper around the native columnar frame. Current main adds more ergonomic frame operations after v1.18.0.

from_records, __getitem__, to_dict, drop_columns, describe, schema_summary.

Python

frame = ar.from_records([
    {"id": "a1", "amount": 10.5},
    {"id": "a2", "amount": 18.0},
])

ids = frame["id"]
numeric = frame.describe()
summary = frame.schema_summary
payload = frame.to_dict()

Cleaning

Cleaning functions are available directly and through pipeline. Recent releases added column removal helpers, winsorization, missing-token standardization, coalescing, safer division, unicode normalization, and stricter validation errors.

Column operations

select_columns, drop_columns, drop_empty_columns, drop_columns_matching, rename_columns, rename_columns_matching.

Value operations

replace_values, standardize_missing_tokens, coalesce_columns, parse_bool_strings, normalize_unicode.
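As a rough picture of what missing-token standardization and coalescing do, here is a pure-Python sketch. The token list and the helper names standardize_missing and coalesce are assumptions for illustration; Arnio's own defaults and signatures may differ.

```python
# Illustrative token list; not Arnio's default set.
MISSING_TOKENS = {"", "n/a", "na", "null", "none", "-"}


def standardize_missing(value, tokens=MISSING_TOKENS):
    """Map common textual placeholders for 'no value' to None."""
    if value is None:
        return None
    if isinstance(value, str) and value.strip().lower() in tokens:
        return None
    return value


def coalesce(*values):
    """Return the first value that is not None, or None if all are missing."""
    return next((v for v in values if v is not None), None)


rows = [
    {"work_email": "N/A", "home_email": "ana@example.com"},
    {"work_email": "bo@example.com", "home_email": "NULL"},
    {"work_email": " ", "home_email": "-"},
]
cleaned = [{k: standardize_missing(v) for k, v in row.items()} for row in rows]
emails = [coalesce(r["work_email"], r["home_email"]) for r in cleaned]
print(emails)
```

Standardizing first matters: without it, coalescing would happily pick "N/A" as the first available value.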

Numeric operations

clip_numeric, round_numeric_columns, winsorize_outliers, safe_divide_columns.
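The ideas behind safe division and winsorization can be sketched in plain Python. The helpers below are hypothetical and use a simple nearest-rank quantile on sorted data; Arnio's functions may use different quantile definitions, defaults, and null handling.

```python
def safe_divide(numerator, denominator, default=None):
    """Divide two values, returning `default` for missing or zero denominators."""
    if numerator is None or denominator is None or denominator == 0:
        return default
    return numerator / denominator


def winsorize(values, lower=0.05, upper=0.95):
    """Clamp values to the [lower, upper] quantiles (nearest rank on sorted data)."""
    if not values:
        return []
    ordered = sorted(values)
    last = len(ordered) - 1
    low = ordered[round(lower * last)]
    high = ordered[round(upper * last)]
    return [min(max(v, low), high) for v in values]


print(safe_divide(10, 4))
print(safe_divide(10, 0))
print(winsorize([1, 2, 3, 4, 100], lower=0.25, upper=0.75))
```

Unlike dropping outliers, winsorizing keeps the row count unchanged, which is convenient when other columns in the same rows are still valid.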

Rows and nulls

drop_nulls, keep_rows_with_nulls, fill_nulls, drop_duplicates, filter_rows.

Quality Reports

The quality engine goes beyond summary statistics. It supports privacy-aware exports, HTML/Markdown/Pandas output, near-constant and high-cardinality warnings, drift comparison, and CI gates.

Python

report = ar.profile(frame, exclude_columns=["private_notes"])  # exclude_columns is unreleased
report.to_markdown()
report.to_html("quality.html")
report.to_dict(exclude_columns=["private_notes"])  # Unreleased
report.to_dict()
report.to_json()

# Alternatively, profile every column and exclude sensitive ones only at export time
full_report = ar.profile(frame)
full_report.to_dict(exclude_columns=["private_notes"])  # Unreleased

baseline = ar.profile(old_frame)
current = ar.profile(new_frame)
gate = ar.check_quality_gates(baseline, current)
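The criteria that check_quality_gates applies are not spelled out here, so the following is only a conceptual sketch of one kind of drift gate: failing when a column's null rate rises by more than a threshold between a baseline and a current dataset. The names null_rate, null_rate_gate, and max_increase are hypothetical.

```python
def null_rate(values):
    """Fraction of entries that are None (0.0 for an empty column)."""
    return sum(v is None for v in values) / len(values) if values else 0.0


def null_rate_gate(baseline, current, max_increase=0.05):
    """Compare per-column null rates and list the columns that regressed too far."""
    failures = []
    for column, old_values in baseline.items():
        new_values = current.get(column)
        if new_values is None:
            failures.append((column, "column missing from current data"))
            continue
        increase = null_rate(new_values) - null_rate(old_values)
        if increase > max_increase:
            failures.append((column, f"null rate rose by {increase:.2f}"))
    return failures


old = {"email": ["a@x.io", "b@x.io", None, "c@x.io"], "age": [30, 41, 25, 38]}
new = {"email": ["a@x.io", None, None, None], "age": [30, 41, 25, 38]}
print(null_rate_gate(old, new))
```

In CI, a non-empty failure list would typically translate into a non-zero exit code so the pipeline stops before bad data is published.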

Safe auto_clean Workflow

auto_clean() is powerful and safety-sensitive. Use dry_run=True and explain=True, in separate calls because the two cannot be combined, to audit changes before applying them to real data.

Python


# Step 1 — Preview what would change, without touching the data
report = ar.auto_clean(frame, mode="strict", dry_run=True)
casts = dict(report.suggestions).get("cast_types")

# Step 2 — Inspect explanations separately (cannot combine with dry_run)
clean, explanation = ar.auto_clean(frame, explain=True)

# Step 3 — Apply intentionally using the previewed casts
clean = ar.auto_clean(
    frame,
    mode="strict",
    allow_lossy_casts=True,
    confirmed_casts=casts,
)
df = ar.to_pandas(clean)
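To show why a preview step is worth the extra call, here is a small pure-Python sketch of a dry-run check for one kind of lossy cast (text to integer). It reports what would be lost without modifying the input. The helper preview_int_cast is hypothetical and does not reflect how auto_clean decides on casts.

```python
import math


def preview_int_cast(values):
    """Dry run: classify each value as safely castable to int or lossy."""
    safe, lossy = [], []
    for index, value in enumerate(values):
        try:
            number = float(value)
        except (TypeError, ValueError):
            lossy.append((index, value, "not numeric"))
            continue
        if not math.isfinite(number):
            lossy.append((index, value, "not finite"))
        elif number != int(number):
            lossy.append((index, value, "fractional part would be dropped"))
        else:
            safe.append(index)
    return {"safe": safe, "lossy": lossy}


preview = preview_int_cast(["1", "2.0", "3.7", "abc", None])
print(preview["safe"])
for index, value, reason in preview["lossy"]:
    print(index, repr(value), reason)
```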

Schema Validation

Schemas validate the shape and semantics of a dataset. v1.18.0 added max_errors; v1.17.0 added URL scheme restrictions and YAML export helpers.

Python

ar.register_validator("positive", lambda value: value > 0)

schema = ar.Schema({
    "email": ar.Email(nullable=False),
    "homepage": ar.URL(allowed_schemes=["https"]),
    "created_at": ar.DateTime(format="%Y-%m-%dT%H:%M:%S"),
    "country": ar.CountryCode(),
    "amount": ar.Custom("positive"),
})

result = schema.validate(frame, max_errors=100)
yaml_text = ar.schema_to_yaml(schema)
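Two of the ideas above, restricting URL schemes and capping the number of reported errors, can be sketched with the standard library. The function check_urls is hypothetical; it uses urllib.parse.urlparse to read the scheme and stops collecting once max_errors is reached. Arnio's URL validator may apply additional checks.

```python
from urllib.parse import urlparse


def check_urls(values, allowed_schemes=("https",), max_errors=100):
    """Collect (row_index, message) pairs, stopping once max_errors is reached."""
    errors = []
    for index, value in enumerate(values):
        scheme = urlparse(value).scheme
        if scheme not in allowed_schemes:
            errors.append((index, f"scheme {scheme!r} not allowed in {value!r}"))
            if len(errors) >= max_errors:
                break  # bound the report size on very dirty data
    return errors


urls = ["https://a.example", "http://b.example", "ftp://c.example", "https://d.example"]
print(check_urls(urls, max_errors=1))
```

Capping errors keeps validation reports readable and bounds memory use when most rows in a large file are invalid.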

Schema-level uniqueness and cross-field rules

For production data contracts, use unique to enforce uniqueness across columns and rules for custom cross-field validation.

Python

schema = ar.Schema(
    fields={
        "customer_id": ar.Int64(nullable=False),
        "account_balance": ar.Float64(nullable=False),
        "credit_limit": ar.Float64(nullable=False),
    },
    # Ensure customer_id is unique across the dataset
    unique=["customer_id"],
    # Cross-field rule: balance cannot exceed credit limit
    rules=[
        lambda df: [
            ar.ValidationIssue(
                column="account_balance",
                rule="balance_exceeds_limit",
                message="Account balance exceeds credit limit",
                row_index=int(idx) + 1,
            )
            for idx in df[df["account_balance"] > df["credit_limit"]].index
        ],
    ],
)

result = ar.validate(frame, schema)
result.passed

Integrations

Arnio is designed to sit beside pandas, not replace it. Use the pandas accessor, Arrow export, DuckDB registration, Parquet output, and scikit-learn transformer where they fit your workflow.

Python

df = ar.to_pandas(frame, copy=True)
frame = df.arnio.to_arframe()
profile = df.arnio.profile()

suggestions = df.arnio.suggest_cleaning()
clean = df.arnio.to_arframe()
for step, params in suggestions:
    clean = ar.pipeline(clean, [(step, params)])

clean_df = df.arnio.auto_clean()

result = df.arnio.validate({"email": ar.Email(nullable=False), "age": ar.Int64(min=0)})

arrow_table = ar.to_arrow(frame)
ar.register_duckdb(frame, conn, "clean_sales")

Notebook Display

ArFrame and DataQualityReport render HTML automatically in Jupyter and IPython. When an object is the last expression in a notebook cell, the frontend calls _repr_html_ and renders a rich view, so no extra conversion is needed.

Python

import arnio as ar

frame = ar.read_csv("customers.csv")

# Cell 1: ArFrame renders a bounded HTML preview of up to 10 rows
frame

# Cell 2: DataQualityReport renders its full HTML dashboard
report = ar.profile(frame)
report

The preview is bounded so notebook output stays readable on large frames. Use frame.head(), report.to_html("file.html"), or report.to_dict() for fuller inspection.
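The _repr_html_ hook is a general IPython display convention, so the bounded-preview idea can be illustrated with a toy class. PreviewTable below is hypothetical and unrelated to ArFrame's implementation; it only shows how a fixed row cap keeps the rendered output small.

```python
from html import escape


class PreviewTable:
    """Toy container showing how a bounded _repr_html_ preview can work."""

    MAX_ROWS = 10

    def __init__(self, columns, rows):
        self.columns = list(columns)
        self.rows = [list(row) for row in rows]

    def _repr_html_(self):
        head = "".join(f"<th>{escape(str(c))}</th>" for c in self.columns)
        body = "".join(
            "<tr>" + "".join(f"<td>{escape(str(v))}</td>" for v in row) + "</tr>"
            for row in self.rows[: self.MAX_ROWS]
        )
        hidden = max(len(self.rows) - self.MAX_ROWS, 0)
        note = f"<p>{hidden} more rows not shown</p>" if hidden else ""
        return f"<table><tr>{head}</tr>{body}</table>{note}"


table = PreviewTable(["id", "note"], [[i, "<b>x</b>"] for i in range(25)])
html_text = table._repr_html_()
print(html_text.count("<tr>") - 1)  # number of data rows rendered
```

Cell values are escaped before rendering so that raw data containing markup is displayed as text instead of being interpreted by the notebook frontend.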