Installation
Arnio is published on PyPI. The base install includes pandas and NumPy. Optional extras enable Arrow, Parquet, and scikit-learn workflows.
pip install arnio
pip install "arnio[arrow]"
pip install "arnio[parquet]"
pip install "arnio[sklearn]"
Quickstart
Use Arnio at the boundary where raw files enter analysis or automation. It keeps parsing, cleaning, profiling, and validation explicit.
import arnio as ar
frame = ar.read_csv("customers.csv", on_bad_lines="warn")
report = ar.profile(frame)
suggestions = ar.suggest_cleaning(report)
clean = ar.pipeline(frame, suggestions)
schema = ar.Schema({
    "customer_id": ar.String(nullable=False, unique=True),
    "email": ar.Email(nullable=True),
})
result = ar.validate(clean, schema, max_errors=50)
CSV Reading
read_csv handles several edge cases that are common in production files. The options below were added across v1.17.0, v1.18.0, and current main.
| Option | Use |
|---|---|
| skiprows | Skip leading rows before header or data parsing. |
| decimal_separator | Parse locale-specific decimal characters such as comma decimals. |
| dtype | Force selected columns to supported Arnio dtypes. |
| encoding_errors | Choose strict, replace, or ignore for decode failures. |
| on_bad_lines | Use error, warn, or skip for malformed row widths. |
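To make the options in the table concrete, here is a minimal standard-library sketch of what skiprows, decimal_separator, and on_bad_lines mean for a reader. It is not Arnio's implementation: read_rows and its numeric_columns parameter are hypothetical names, and the "warn" policy is assumed to drop the row after warning (as pandas does).

```python
import csv
import warnings


def read_rows(text, delimiter=",", skiprows=0, decimal_separator=".",
              on_bad_lines="error", numeric_columns=()):
    """Parse CSV text into dict rows (hypothetical helper, not an Arnio API)."""
    if on_bad_lines not in {"error", "warn", "skip"}:
        raise ValueError(f"unknown on_bad_lines policy: {on_bad_lines!r}")

    # Drop leading rows before the header is parsed.
    # Note: splitlines() does not support quoted fields containing newlines.
    lines = text.splitlines()[skiprows:]
    reader = csv.reader(lines, delimiter=delimiter)
    header = next(reader)

    rows = []
    # The header sits on line skiprows + 1, so data starts at skiprows + 2.
    for line_no, fields in enumerate(reader, start=skiprows + 2):
        if len(fields) != len(header):
            message = (f"line {line_no}: expected {len(header)} fields, "
                       f"got {len(fields)}")
            if on_bad_lines == "error":
                raise ValueError(message)
            if on_bad_lines == "warn":
                warnings.warn(message)
            continue  # assumption: both "warn" and "skip" drop the row
        row = dict(zip(header, fields))
        for name in numeric_columns:
            # Normalise the locale decimal character to "." before float().
            row[name] = float(row[name].replace(decimal_separator, "."))
        rows.append(row)
    return rows


sample = "exported 2024\n\nid;amount\na1;10,5\na2\na3;18,0\n"
rows = read_rows(sample, delimiter=";", skiprows=2, decimal_separator=",",
                 on_bad_lines="skip", numeric_columns=["amount"])
print(rows)
```

In this sample the two preamble lines are skipped, the short row `a2` is dropped, and the comma decimals are parsed as floats.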
frame = ar.read_csv(
    "orders.csv",
    delimiter=",",
    skiprows=2,
    decimal_separator=".",
    dtype={"order_id": "string"},
    encoding_errors="replace",
    on_bad_lines="warn",
)
Chunked CSV
read_csv_chunked streams a CSV into ArFrame chunks. Chunked reads support malformed-row handling, but schema validation itself remains a whole-frame operation after you materialize the data you want to validate.
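The streaming idea can be sketched with the standard library alone. The example below is only an illustration of chunked iteration, not how read_csv_chunked is implemented; iter_chunks is a hypothetical helper that yields plain lists of dicts rather than ArFrame chunks.

```python
import csv
import io
from itertools import islice


def iter_chunks(handle, chunksize):
    """Yield lists of dict rows, at most `chunksize` rows per list."""
    reader = csv.DictReader(handle)
    while True:
        chunk = list(islice(reader, chunksize))
        if not chunk:
            return  # input exhausted
        yield chunk


source = io.StringIO("id,name\n1, Ada \n2,Grace\n3, Linus\n")
sizes = []
names = []
for chunk in iter_chunks(source, chunksize=2):
    # Per-chunk cleaning: only the current chunk is held in memory.
    cleaned = [{key: value.strip() for key, value in row.items()}
               for row in chunk]
    sizes.append(len(cleaned))
    names.extend(row["name"] for row in cleaned)
```

Because each chunk is processed and discarded, checks that need every row at once (uniqueness, for example) cannot be decided per chunk, which is why validation stays a whole-frame step.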
for chunk in ar.read_csv_chunked("huge.csv", chunksize=100_000, on_bad_lines="skip"):
    clean = ar.strip_whitespace(chunk)
    process(clean)  # process is a placeholder for your own per-chunk handler
JSONL, CSV, and Parquet Exports
Arnio can read JSON Lines, write CSV, and write Parquet through the optional parquet extra (pip install "arnio[parquet]"). Use sniff_delimiter and scan_csv when you need a cheap look at unknown files.
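For intuition about what delimiter sniffing involves, here is a small standard-library heuristic. It is a sketch under simple assumptions (no quoted fields, a short text sample), not the algorithm behind sniff_delimiter; guess_delimiter and its candidate list are hypothetical.

```python
def guess_delimiter(sample, candidates=(",", ";", "\t", "|")):
    """Pick the candidate that splits every non-empty line into the same
    number of fields (hypothetical helper; ignores quoting)."""
    lines = [line for line in sample.splitlines() if line.strip()]
    if not lines:
        raise ValueError("sample contains no non-empty lines")

    best, best_count = None, 0
    for delimiter in candidates:
        counts = {line.count(delimiter) for line in lines}
        # A plausible delimiter occurs the same non-zero number of times
        # on every line; prefer the one that yields the most fields.
        if len(counts) == 1:
            count = counts.pop()
            if count > best_count:
                best, best_count = delimiter, count
    if best is None:
        raise ValueError("no consistent delimiter found")
    return best


semicolon_sample = "id;amount;note\na1;10,5;ok\na2;18,0;late\n"
detected = guess_delimiter(semicolon_sample)
```

The comma is rejected here because it appears in the data rows but not in the header, so its per-line count is inconsistent.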
events = ar.read_jsonl("events.ndjson", nrows=10_000)
ar.write_csv(events, "events.csv")
ar.write_parquet(events, "events.parquet", compression="zstd")
delimiter = ar.sniff_delimiter("raw_file.txt")
schema_preview = ar.scan_csv("raw_file.txt", delimiter=delimiter)
ArFrame
ArFrame is the public Python wrapper around the native columnar frame. Current main adds more ergonomic frame operations after v1.18.0.
frame = ar.from_records([
    {"id": "a1", "amount": 10.5},
    {"id": "a2", "amount": 18.0},
])
ids = frame["id"]
numeric = frame.describe()
summary = frame.schema_summary
payload = frame.to_dict()
Cleaning
Cleaning functions are available directly and through pipeline. Recent releases added column removal helpers, winsorization, missing-token standardization, coalescing, safer division, unicode normalization, and stricter validation errors.
Column operations
select_columns, drop_columns, drop_empty_columns, drop_columns_matching, rename_columns.
Value operations
replace_values, standardize_missing_tokens, coalesce_columns, parse_bool_strings, normalize_unicode.
Numeric operations
clip_numeric, round_numeric_columns, winsorize_outliers, safe_divide_columns.
Rows and nulls
drop_nulls, keep_rows_with_nulls, fill_nulls, drop_duplicates, filter_rows.
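To show what three of the value-level helpers named above are for, the following pure-Python sketch applies missing-token standardization, coalescing, and safe division to plain dict records. The semantics are assumptions chosen for illustration (the token list, and returning None on division by zero); the function names are hypothetical and are not Arnio's signatures.

```python
# Assumed set of tokens treated as "missing"; a real token list may differ.
MISSING_TOKENS = {"", "na", "n/a", "null", "none", "-"}


def standardize_missing(value, tokens=MISSING_TOKENS):
    """Map common missing-value spellings to None."""
    if isinstance(value, str) and value.strip().lower() in tokens:
        return None
    return value


def coalesce(*values):
    """Return the first value that is not None, or None if all are missing."""
    for value in values:
        if value is not None:
            return value
    return None


def safe_divide(numerator, denominator):
    """Divide, returning None instead of raising on missing or zero input."""
    if numerator is None or denominator is None or denominator == 0:
        return None
    return numerator / denominator


records = [
    {"work_email": "N/A", "home_email": "ada@example.com",
     "revenue": 120.0, "orders": 4},
    {"work_email": "grace@example.com", "home_email": "",
     "revenue": 50.0, "orders": 0},
]
cleaned = [
    {
        "email": coalesce(standardize_missing(row["work_email"]),
                          standardize_missing(row["home_email"])),
        "avg_order": safe_divide(row["revenue"], row["orders"]),
    }
    for row in records
]
```

Standardizing tokens first matters: without it, the string "N/A" would win the coalesce and hide the usable home address.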
Quality Reports
The quality engine goes beyond a summary: it supports privacy-aware exports, HTML, Markdown, and pandas output, near-constant and high-cardinality warnings, drift comparison, and CI gates.
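As an illustration of the drift-comparison and CI-gate idea, the sketch below compares per-column null rates between a baseline and a current dataset and fails when a rate rises by more than a threshold. This is a simplified stand-in, not the logic of check_quality_gates; the function names, the max_increase threshold, and the result shape are all assumptions.

```python
def null_rates(rows):
    """Fraction of None values per column across a list of dict rows."""
    columns = sorted({name for row in rows for name in row})
    total = len(rows)
    return {
        name: (sum(row.get(name) is None for row in rows) / total
               if total else 0.0)
        for name in columns
    }


def check_null_drift(baseline_rows, current_rows, max_increase=0.05):
    """Fail when any column's null rate grows by more than max_increase."""
    before = null_rates(baseline_rows)
    after = null_rates(current_rows)
    failures = {}
    for name, rate in after.items():
        increase = rate - before.get(name, 0.0)
        if increase > max_increase:
            failures[name] = round(increase, 4)
    return {"passed": not failures, "failures": failures}


baseline_rows = [
    {"email": "a@example.com", "amount": 1},
    {"email": "b@example.com", "amount": None},
    {"email": "c@example.com", "amount": 3},
    {"email": "d@example.com", "amount": 4},
]
current_rows = [
    {"email": None, "amount": 1},
    {"email": None, "amount": None},
    {"email": "c@example.com", "amount": 3},
    {"email": "d@example.com", "amount": 4},
]
gate_result = check_null_drift(baseline_rows, current_rows)
```

In a CI job, a result with passed set to False would typically translate into a non-zero exit code.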
report = ar.profile(frame, exclude_columns=["private_notes"])
report.to_markdown()
report.to_html("quality.html")
report.to_dict(exclude_columns=["private_notes"])
report.to_json()
baseline = ar.profile(old_frame)
current = ar.profile(new_frame)
gate = ar.check_quality_gates(baseline, current)
Schema Validation
Schemas validate the shape and semantics of a dataset. v1.18.0 added max_errors; v1.17.0 added URL scheme restrictions and YAML export helpers.
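The two features mentioned here, an error cap and URL scheme restrictions, can be sketched with the standard library. This is not Arnio's validator: validate_rows, has_allowed_scheme, is_positive, and the error record shape are hypothetical, and the cap is assumed to stop validation as soon as it is reached.

```python
from urllib.parse import urlsplit


def has_allowed_scheme(value, allowed_schemes=("https",)):
    """True when value is a URL with a host and an allowed scheme."""
    if not isinstance(value, str):
        return False
    parts = urlsplit(value)
    return parts.scheme in allowed_schemes and bool(parts.netloc)


def is_positive(value):
    """True for numbers strictly greater than zero."""
    return isinstance(value, (int, float)) and value > 0


def validate_rows(rows, checks, max_errors=None):
    """Collect failing cells, stopping early once max_errors is reached."""
    errors = []
    for index, row in enumerate(rows):
        for column, check in checks.items():
            value = row.get(column)
            if not check(value):
                errors.append({"row": index, "column": column, "value": value})
                if max_errors is not None and len(errors) >= max_errors:
                    return errors
    return errors


rows_to_check = [
    {"homepage": "https://example.com", "amount": 10},
    {"homepage": "http://example.com", "amount": -1},
    {"homepage": "ftp://example.com", "amount": 0},
]
checks = {"homepage": has_allowed_scheme, "amount": is_positive}
all_errors = validate_rows(rows_to_check, checks)
capped_errors = validate_rows(rows_to_check, checks, max_errors=2)
```

Capping errors keeps reports readable and bounds the work done on badly broken inputs, at the cost of not seeing every failure in one pass.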
ar.register_validator("positive", lambda value: value > 0)
schema = ar.Schema({
    "email": ar.Email(nullable=False),
    "homepage": ar.URL(allowed_schemes=["https"]),
    "created_at": ar.DateTime(format="%Y-%m-%dT%H:%M:%S"),
    "country": ar.CountryCode(),
    "amount": ar.Custom("positive"),
})
result = schema.validate(frame, max_errors=100)
yaml_text = ar.schema_to_yaml(schema)
Integrations
Arnio is designed to sit beside pandas, not replace it. Use the pandas accessor, Arrow export, DuckDB registration, Parquet output, and scikit-learn transformer where they fit your workflow.
df = ar.to_pandas(frame, copy=True)
frame = df.arnio.to_arframe()
profile = df.arnio.profile()
arrow_table = ar.to_arrow(frame)
ar.register_duckdb(frame, conn, "clean_sales")