Installation

Arnio is published on PyPI. The base install includes pandas and NumPy. Optional extras enable Arrow, Parquet, and scikit-learn workflows.

Shell
pip install arnio
pip install "arnio[arrow]"
pip install "arnio[parquet]"
pip install "arnio[sklearn]"
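
When a script relies on an optional extra, it can help to fail early with a clear message. The following is a minimal standard-library sketch, not part of Arnio; `missing_modules` is a hypothetical helper, and the module names passed to it (`pyarrow`, `sklearn`) are assumptions about what the extras install.

```python
from importlib.util import find_spec


def missing_modules(module_names):
    """Return the top-level module names that cannot be found for import."""
    return [name for name in module_names if find_spec(name) is None]


# Assumed module names for the Arrow/Parquet and scikit-learn extras.
missing = missing_modules(["pyarrow", "sklearn"])
if missing:
    print("Optional modules not installed:", ", ".join(missing))
```

Checking with `find_spec` avoids importing heavy packages just to learn whether they are present.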

Quickstart

Use Arnio at the boundary where raw files enter analysis or automation. It keeps parsing, cleaning, profiling, and validation explicit.

Python
import arnio as ar

frame = ar.read_csv("customers.csv", on_bad_lines="warn")
report = ar.profile(frame)
suggestions = ar.suggest_cleaning(report)
clean = ar.pipeline(frame, suggestions)

schema = ar.Schema({
    "customer_id": ar.String(nullable=False, unique=True),
    "email": ar.Email(nullable=True),
})
result = ar.validate(clean, schema, max_errors=50)

CSV Reading

read_csv covers the production edge cases added across v1.17.0, v1.18.0, and current main.

Option             Use
skiprows           Skip leading rows before header or data parsing.
decimal_separator  Parse locale-specific decimal characters such as comma decimals.
dtype              Force selected columns to supported Arnio dtypes.
encoding_errors    Choose strict, replace, or ignore for decode failures.
on_bad_lines       Use error, warn, or skip for malformed row widths.

Python
frame = ar.read_csv(
    "orders.csv",
    delimiter=",",
    skiprows=2,
    decimal_separator=".",
    dtype={"order_id": "string"},
    encoding_errors="replace",
    on_bad_lines="warn",
)
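
To make the three on_bad_lines modes concrete, here is a rough standard-library sketch of what handling "malformed row widths" can look like. It is not Arnio's implementation: `read_rows` is a hypothetical helper, and the choice to drop the offending row in warn mode is an assumption.

```python
import csv
import io
import warnings


def read_rows(text, on_bad_lines="error", delimiter=","):
    """Parse CSV text, handling rows whose width differs from the header."""
    if on_bad_lines not in {"error", "warn", "skip"}:
        raise ValueError(f"unknown on_bad_lines mode: {on_bad_lines!r}")
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    header = next(reader, None)
    if header is None:
        return [], []
    rows = []
    for line_number, row in enumerate(reader, start=2):
        if len(row) != len(header):
            message = f"line {line_number}: expected {len(header)} fields, got {len(row)}"
            if on_bad_lines == "error":
                raise ValueError(message)
            if on_bad_lines == "warn":
                warnings.warn(message)
            continue  # assumption: "warn" and "skip" both drop the malformed row
        rows.append(row)
    return header, rows


sample = "id,amount\n1,10.5\n2,18.0,extra\n3,7.25\n"
header, rows = read_rows(sample, on_bad_lines="skip")
```

With the sample above, the three-field row is dropped and the two well-formed rows remain; switching to "error" raises on the same row instead.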

Chunked CSV

read_csv_chunked streams a CSV into ArFrame chunks. Chunked reads support malformed-row handling, but schema validation remains a whole-frame operation: materialize the data you want to validate first, then validate it.

Python
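# Stream the file in 100,000-row chunks, skipping malformed rows.
for chunk in ar.read_csv_chunked("huge.csv", chunksize=100_000, on_bad_lines="skip"):
    clean = ar.strip_whitespace(chunk)
    process(clean)  # placeholder: replace with your own per-chunk handler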
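
The streaming idea itself can be sketched with the standard library alone. This is an illustration of chunked iteration, not how read_csv_chunked is implemented; `iter_chunks` is a hypothetical helper that yields plain lists rather than ArFrame chunks.

```python
import csv
import io
from itertools import islice


def iter_chunks(handle, chunksize):
    """Yield (header, rows) pairs with at most `chunksize` rows each."""
    reader = csv.reader(handle)
    header = next(reader, None)
    if header is None:
        return
    while True:
        # islice pulls only the next `chunksize` rows, so memory stays bounded.
        rows = list(islice(reader, chunksize))
        if not rows:
            break
        yield header, rows


handle = io.StringIO("id,name\n1, Ann \n2,Bob\n3, Cy\n")
sizes = []
for header, rows in iter_chunks(handle, chunksize=2):
    stripped = [[cell.strip() for cell in row] for row in rows]
    sizes.append(len(stripped))
```

Each chunk is cleaned and handed off before the next one is read, which is the same shape as the loop above.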

JSONL, CSV, and Parquet Exports

Arnio can read JSON Lines, write CSV, and write Parquet when the optional Parquet extra (pyarrow) is installed. Use sniff_delimiter and scan_csv when you need a cheap look at unknown files.

Python
events = ar.read_jsonl("events.ndjson", nrows=10_000)
ar.write_csv(events, "events.csv")
ar.write_parquet(events, "events.parquet", compression="zstd")

Python
delimiter = ar.sniff_delimiter("raw_file.txt")
schema_preview = ar.scan_csv("raw_file.txt", delimiter=delimiter)
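
For intuition about delimiter sniffing, Python's standard library offers a comparable heuristic in csv.Sniffer. The sketch below uses it on an in-memory sample; it is not a description of how sniff_delimiter works internally, and `guess_delimiter` is a hypothetical helper.

```python
import csv


def guess_delimiter(sample, candidates=",;\t|"):
    """Guess the delimiter of a CSV sample, or return None if unclear."""
    try:
        return csv.Sniffer().sniff(sample, delimiters=candidates).delimiter
    except csv.Error:
        return None


sample = "id;name;amount\n1;Ann;10.5\n2;Bob;18.0\n"
delimiter = guess_delimiter(sample)
```

Restricting the candidate set keeps the heuristic from picking characters that merely happen to repeat in the data.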

ArFrame

ArFrame is the public Python wrapper around the native columnar frame. Since v1.18.0, current main has added more ergonomic frame operations.

from_records, __getitem__, to_dict, drop_columns, describe, schema_summary.

Python
frame = ar.from_records([
    {"id": "a1", "amount": 10.5},
    {"id": "a2", "amount": 18.0},
])

ids = frame["id"]
numeric = frame.describe()
summary = frame.schema_summary
payload = frame.to_dict()
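
Conceptually, building a columnar frame from records means pivoting row dictionaries into per-column sequences. The following plain-Python sketch shows only that pivot; it says nothing about ArFrame's native storage, and `records_to_columns` is a hypothetical helper.

```python
def records_to_columns(records):
    """Pivot a list of row dicts into a dict of column lists."""
    names = []
    for record in records:
        for name in record:
            if name not in names:
                names.append(name)  # keep first-seen column order
    # Missing keys become None so every column has the same length.
    return {name: [record.get(name) for record in records] for name in names}


columns = records_to_columns([
    {"id": "a1", "amount": 10.5},
    {"id": "a2", "amount": 18.0},
])
```

Selecting one column, as `frame["id"]` does above, then amounts to a single dictionary lookup in this simplified model.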

Cleaning

Cleaning functions are available directly and through pipeline. Recent releases added column removal helpers, winsorization, missing-token standardization, coalescing, safer division, unicode normalization, and stricter validation errors.

Column operations

select_columns, drop_columns, drop_empty_columns, drop_columns_matching, rename_columns.

Value operations

replace_values, standardize_missing_tokens, coalesce_columns, parse_bool_strings, normalize_unicode.
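
As a rough illustration of two of these ideas, the sketch below standardizes missing tokens and then coalesces two columns. It is written in plain Python under stated assumptions: the token list and both helper functions are hypothetical, and Arnio's own defaults and signatures may differ.

```python
MISSING_TOKENS = {"", "na", "n/a", "null", "none", "-"}  # hypothetical token list


def standardize_missing(values, tokens=MISSING_TOKENS):
    """Replace strings that spell 'missing' with None."""
    return [
        None if isinstance(v, str) and v.strip().lower() in tokens else v
        for v in values
    ]


def coalesce(*columns):
    """For each row, take the first value that is not None."""
    return [next((v for v in row if v is not None), None) for row in zip(*columns)]


work_email = standardize_missing(["a@x.io", "N/A", ""])
home_email = standardize_missing(["a@home.io", "b@home.io", "null"])
email = coalesce(work_email, home_email)
```

Standardizing first matters: without it, a placeholder such as "N/A" would count as a real value and win the coalesce.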

Numeric operations

clip_numeric, round_numeric_columns, winsorize_outliers, safe_divide_columns.
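
Winsorization and safe division are easy to misread, so here is a small plain-Python sketch of both. The quantile rule (a simple floor-index lookup), the default bounds, and the function names are assumptions for illustration, not Arnio's actual behavior.

```python
def winsorize(values, lower=0.05, upper=0.95):
    """Clamp values to the range between two floor-index quantiles."""
    if not values:
        return []
    ordered = sorted(values)
    last = len(ordered) - 1
    low = ordered[int(lower * last)]
    high = ordered[int(upper * last)]
    return [min(max(v, low), high) for v in values]


def safe_divide(numerators, denominators, default=None):
    """Divide element-wise, returning `default` for missing or zero denominators."""
    return [
        default if n is None or d is None or d == 0 else n / d
        for n, d in zip(numerators, denominators)
    ]


amounts = [1, 2, 3, 4, 5, 6, 7, 8, 9, 1000]
clipped = winsorize(amounts, lower=0.1, upper=0.9)
ratios = safe_divide([10, 5, 3], [2, 0, None])
```

Unlike dropping outliers, winsorizing keeps the row count unchanged: the extreme value 1000 is pulled in to the upper bound rather than removed.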

Rows and nulls

drop_nulls, keep_rows_with_nulls, fill_nulls, drop_duplicates, filter_rows.

Quality Reports

The quality engine goes beyond summary statistics. It supports privacy-aware exports, HTML/Markdown/Pandas output, near-constant and high-cardinality warnings, drift comparison, and CI gates.

Python
report = ar.profile(frame, exclude_columns=["private_notes"])
report.to_markdown()
report.to_html("quality.html")
report.to_dict(exclude_columns=["private_notes"])
report.to_json()

baseline = ar.profile(old_frame)
current = ar.profile(new_frame)
gate = ar.check_quality_gates(baseline, current)
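
The idea behind a drift-based gate can be sketched without Arnio at all. The example below computes a null rate and a distinct ratio for one column, then fails the gate when the null rate grows too much. Every name and the 0.05 threshold are hypothetical; the metrics and rules that check_quality_gates actually applies are not specified here.

```python
def column_stats(rows, column):
    """Compute null rate and distinct ratio for one column of row dicts."""
    values = [row.get(column) for row in rows]
    non_null = [v for v in values if v is not None]
    total = len(values)
    return {
        "null_rate": (total - len(non_null)) / total if total else 0.0,
        "distinct_ratio": len(set(non_null)) / len(non_null) if non_null else 0.0,
    }


def null_rate_gate(baseline, current, max_increase=0.05):
    """Pass only if the null rate did not grow by more than `max_increase`."""
    return current["null_rate"] - baseline["null_rate"] <= max_increase


old_rows = [{"email": "a@x.io"}, {"email": "b@x.io"}, {"email": None}, {"email": "c@x.io"}]
new_rows = [{"email": "a@x.io"}, {"email": None}, {"email": None}, {"email": "a@x.io"}]
baseline_stats = column_stats(old_rows, "email")
current_stats = column_stats(new_rows, "email")
passed = null_rate_gate(baseline_stats, current_stats)
```

In a CI job, a failed gate would typically translate into a non-zero exit code so the pipeline stops before bad data propagates.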

Schema Validation

Schemas validate the shape and semantics of a dataset. v1.18.0 added max_errors; v1.17.0 added URL scheme restrictions and YAML export helpers.

Python
ar.register_validator("positive", lambda value: value > 0)

schema = ar.Schema({
    "email": ar.Email(nullable=False),
    "homepage": ar.URL(allowed_schemes=["https"]),
    "created_at": ar.DateTime(format="%Y-%m-%dT%H:%M:%S"),
    "country": ar.CountryCode(),
    "amount": ar.Custom("positive"),
})

result = schema.validate(frame, max_errors=100)
yaml_text = ar.schema_to_yaml(schema)
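
To illustrate how a scheme restriction and an error cap interact, here is a standard-library sketch. It is not Arnio's validator: `validate_urls` is a hypothetical helper, and stopping the scan once the cap is reached is an assumption about what max_errors means.

```python
from urllib.parse import urlparse


def validate_urls(values, allowed_schemes=("https",), max_errors=None):
    """Collect (row_index, message) errors, stopping once max_errors is reached."""
    errors = []
    for index, value in enumerate(values):
        if max_errors is not None and len(errors) >= max_errors:
            break  # cap reached: stop scanning instead of collecting everything
        parsed = urlparse(value)
        if parsed.scheme not in allowed_schemes:
            errors.append((index, f"scheme {parsed.scheme!r} is not allowed"))
        elif not parsed.netloc:
            errors.append((index, "missing host"))
    return errors


urls = ["https://a.example", "http://b.example", "ftp://c.example", "https://d.example"]
all_errors = validate_urls(urls)
capped = validate_urls(urls, max_errors=1)
```

Capping errors keeps reports readable and bounds the work done on badly broken inputs, at the cost of not seeing every failure in one pass.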

Integrations

Arnio is designed to sit beside pandas, not replace it. Use the pandas accessor, Arrow export, DuckDB registration, Parquet output, and scikit-learn transformer where they fit your workflow.

Python
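import duckdb

# pandas round trip through the DataFrame accessor
df = ar.to_pandas(frame, copy=True)
frame = df.arnio.to_arframe()
profile = df.arnio.profile()

# Arrow export and DuckDB registration
arrow_table = ar.to_arrow(frame)
conn = duckdb.connect()  # in-memory database; assumes register_duckdb takes a DuckDB connection
ar.register_duckdb(frame, conn, "clean_sales")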