Scope: This page documents the public Python API exposed from arnio.__all__ on current origin/main. Items marked current-main are merged after the v1.18.0 tag.

I/O

read_csv(path, *, delimiter=",", has_header=True, usecols=None, nrows=None, skiprows=None, decimal_separator=".", dtype=None, encoding="utf-8", encoding_errors="strict", on_bad_lines="error")

Read CSV-like input into an ArFrame. Supports extension-flexible paths, TSV delimiter inference, explicit dtypes, decoding policy, and configurable malformed-row handling.

Returns: ArFrame. Raises CsvReadError for read and parse failures.

read_csv_chunked(path, *, chunksize, delimiter=",", has_header=True, decimal_separator=".", on_bad_lines="error")

Yield ArFrame chunks from large files. on_bad_lines controls whether rows with the wrong number of fields raise an error, emit a warning, or are skipped.
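
The three bad-line policies can be pictured with a small standard-library sketch. This is not arnio's reader: iter_chunks is a hypothetical helper, the policy strings "warn" and "skip" are assumed from the description above, and it is also an assumption that a warned row is dropped rather than kept.

```python
import csv
import io
import warnings
from itertools import islice


def iter_chunks(text, chunksize, on_bad_lines="error"):
    """Yield lists of rows, applying a policy to rows with the wrong field count."""
    if on_bad_lines not in {"error", "warn", "skip"}:
        raise ValueError(f"unknown on_bad_lines policy: {on_bad_lines!r}")
    reader = csv.reader(io.StringIO(text))
    width = len(next(reader))  # the header row defines the expected width

    def good_rows():
        # Record numbers start at 2 because the header is record 1.
        for record_no, row in enumerate(reader, start=2):
            if len(row) == width:
                yield row
            elif on_bad_lines == "error":
                raise ValueError(
                    f"record {record_no}: expected {width} fields, got {len(row)}"
                )
            elif on_bad_lines == "warn":
                warnings.warn(
                    f"dropping record {record_no}: expected {width} fields, got {len(row)}"
                )
            # "skip" drops the row without any message.

    rows = good_rows()
    while True:
        chunk = list(islice(rows, chunksize))
        if not chunk:
            return
        yield chunk


sample = "id,name\n1,ada\n2,grace,extra\n3,alan\n4,edsger\n"
chunks = list(iter_chunks(sample, chunksize=2, on_bad_lines="skip"))
```

With the "skip" policy the three-field record is dropped before chunking, so chunk boundaries are counted over the surviving rows only in this sketch.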

scan_csv(path, *, delimiter=",", has_header=True, trim_headers=True, decimal_separator=".", encoding="utf-8", encoding_errors="strict")

Infer column names and dtypes without loading the full file.

read_jsonl(path, *, nrows=None)

Read JSON Lines or NDJSON into an ArFrame; mixed object columns are coerced to strings.
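
As a rough illustration of what coercing a mixed column to strings can look like, the sketch below parses JSON Lines text with the standard json module. It is a stand-in, not arnio's implementation: read_jsonl_text is a hypothetical helper, and both the rule for calling a column "mixed" (more than one non-null Python type) and the use of str() for the coercion are assumptions.

```python
import json


def read_jsonl_text(text, nrows=None):
    """Parse JSON Lines text into a dict of column name -> list of values."""
    records = []
    for line in text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines between records
        if nrows is not None and len(records) >= nrows:
            break
        records.append(json.loads(line))

    # Collect column names in first-seen order, then fill missing keys with None.
    columns = {}
    for record in records:
        for key in record:
            columns.setdefault(key, [])
    for record in records:
        for key, values in columns.items():
            values.append(record.get(key))

    # Assumed rule: a column holding more than one non-null type becomes strings.
    for key in list(columns):
        kinds = {type(v) for v in columns[key] if v is not None}
        if len(kinds) > 1:
            columns[key] = [None if v is None else str(v) for v in columns[key]]
    return columns


sample = '{"id": 1, "tag": "a"}\n{"id": 2, "tag": 7}\n{"id": 3}\n'
columns = read_jsonl_text(sample)
first_two = read_jsonl_text(sample, nrows=2)
```

Here the id column stays numeric, while tag mixes a string and an integer and is therefore stored as strings, with the missing value left as None.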

write_csv(frame, path, *, delimiter=",", write_header=True, line_terminator="\n")

Write an ArFrame to CSV or TSV with delimiter and line terminator validation.

write_parquet(frame, path, *, compression=None, row_group_size=None)

Write an ArFrame to Parquet via the optional pyarrow dependency. Install with pip install "arnio[parquet]".

sniff_delimiter(path, *, encoding="utf-8")

Detect a likely delimiter for CSV-like input before reading or scanning.
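
For readers unfamiliar with delimiter sniffing, the standard library's csv.Sniffer shows the general idea. This is only an analogy under stated assumptions: guess_delimiter and the candidate set are hypothetical, and nothing here describes how sniff_delimiter itself is implemented.

```python
import csv

# Hypothetical candidate set; arnio's own candidates are not documented here.
CANDIDATES = ",;\t|"


def guess_delimiter(sample, default=","):
    """Return the most plausible delimiter in a text sample, or a default."""
    try:
        return csv.Sniffer().sniff(sample, delimiters=CANDIDATES).delimiter
    except csv.Error:
        # Raised when no candidate appears consistently across lines.
        return default


tsv_sample = "id\tname\tcity\n1\tada\tlondon\n2\tgrace\tnew york\n"
delimiter = guess_delimiter(tsv_sample)
fallback = guess_delimiter("hello\nworld\n")
```

Sniffing before reading is useful when the file extension cannot be trusted, since the detected value can then be passed as the delimiter argument of a reader.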

Frame API

class ArFrame

Python wrapper over the native C++ frame. Core properties include shape, columns, dtypes, and is_empty; core methods include memory_usage(), head(), tail(), and preview().

ArFrame.from_records(records, *, columns=None) / ar.from_records(...)

Build an ArFrame from record dictionaries or row-like records. Added in v1.17.0.

frame["column"] / frame[["a", "b"]]

Current-main API. Select one column as a Python list or multiple columns as an ArFrame.

frame.to_dict()

Return a dict[str, list] representation for serialization or tests.

frame.drop_columns(columns)

Current-main method equivalent to the top-level drop_columns helper.

frame.describe() / frame.schema_summary

Summary statistics and column summaries. schema_summary returns ColumnSummary objects with name, dtype, and nullability.

Conversion

to_pandas(frame, *, copy=False)

Convert an ArFrame to a pandas DataFrame. The default copy=False keeps the fast zero-copy path where supported; pass copy=True when the result must not share memory with the frame.
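
The trade-off between a zero-copy view and a defensive copy can be shown without pandas. The sketch below uses only the standard library (array and memoryview) as a stand-in; it makes no claim about how arnio or pandas manage buffers internally.

```python
from array import array

source = array("d", [1.0, 2.0, 3.0])

shared = memoryview(source)      # zero-copy: a view over the same buffer
isolated = array("d", source)    # defensive copy: an independent buffer

# Mutating the source is visible through the view but not through the copy.
source[0] = 99.0
shared_first = shared[0]
isolated_first = isolated[0]
```

A shared view is cheap but reflects later changes to the source, which is the situation a copy avoids at the cost of extra memory.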

from_pandas(df)

Convert a pandas DataFrame to ArFrame with dtype validation, duplicate-column checks, attrs preservation, and zero-column row-count preservation.

to_arrow(frame)

Export to pyarrow.Table. Install with pip install "arnio[arrow]". Added in v1.18.0.

Cleaning

Cleaning functions return new frames and are usable directly or from pipeline().

Function | Purpose
drop_nulls, keep_rows_with_nulls, fill_nulls | Control null-bearing rows and values.
drop_columns, select_columns, drop_empty_columns, drop_columns_matching | Manage columns explicitly or by regex.
strip_whitespace, normalize_case, normalize_unicode, trim_column_names | Normalize text and headers.
cast_types, parse_bool_strings, round_numeric_columns, clip_numeric, winsorize_outliers | Type and numeric cleanup.
replace_values, standardize_missing_tokens, combine_columns, coalesce_columns, safe_divide_columns | Common feature engineering and value repair helpers.

Pipeline

pipeline(frame, steps, *, verbose=False)

Run named cleaning steps or registered Python callables. verbose=True enables lightweight diagnostics through the arnio logger and optional PipelineContext.

register_step(name, fn, overwrite=False), list_steps(), reset_steps(), get_builtin_step_signatures()

Manage the step registry and discover available built-in signatures.
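
The registry pattern behind named steps can be sketched in plain Python. Everything below is a toy under stated assumptions: StepRegistry, drop_null_rows, and fill_missing are hypothetical names, a frame is modelled as a dict of column lists, the (name, kwargs) step form is invented for illustration, and raising on a duplicate name when overwrite=False is an assumed behavior.

```python
class StepRegistry:
    """Toy registry mapping step names to callables that return new frames."""

    def __init__(self):
        self._steps = {}

    def register(self, name, fn, overwrite=False):
        # Assumed behavior: refuse to replace an existing step silently.
        if name in self._steps and not overwrite:
            raise ValueError(f"step {name!r} is already registered")
        self._steps[name] = fn

    def names(self):
        return sorted(self._steps)

    def run(self, frame, steps):
        """Apply steps in order; a step is a callable, a name, or (name, kwargs)."""
        for step in steps:
            if callable(step):
                fn, kwargs = step, {}
            elif isinstance(step, str):
                fn, kwargs = self._steps[step], {}
            else:
                name, kwargs = step
                fn = self._steps[name]
            frame = fn(frame, **kwargs)
        return frame


def drop_null_rows(frame):
    """Return a new frame keeping only rows with no None in any column."""
    keep = [i for i, row in enumerate(zip(*frame.values())) if None not in row]
    return {col: [values[i] for i in keep] for col, values in frame.items()}


def fill_missing(frame, value):
    """Return a new frame with every None replaced by value."""
    return {
        col: [value if v is None else v for v in values]
        for col, values in frame.items()
    }


registry = StepRegistry()
registry.register("drop_null_rows", drop_null_rows)
registry.register("fill_missing", fill_missing)

raw = {"id": [1, 2, 3], "city": ["Oslo", None, "Lima"]}
cleaned = registry.run(raw, ["drop_null_rows"])
filled = registry.run(raw, [("fill_missing", {"value": "unknown"})])
```

Because each step builds a new dict instead of editing its input, raw is unchanged after both runs, mirroring the rule that cleaning functions return new frames.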

Quality

profile(frame, *, sample_size=5, exclude_columns=None, approx_top_values=False, top_n=5)

Return DataQualityReport with row/column counts, duplicate metrics, column profiles, warnings, quality score, score components, and cleaning suggestions.

DataQualityReport.to_dict(redact_sample_values=False, exclude_columns=None), to_json(), to_markdown(), to_html(), summary(), to_pandas()

Export reports for notebooks, CI, docs, and automation. exclude_columns and to_json() are current-main updates after v1.18.0.

compare_profiles(left, right), check_quality_gates(baseline, current), suggest_cleaning(frame_or_report), auto_clean(frame, *, mode="safe", dry_run=False, explain=False)

Compare quality over time, fail CI on drift, generate pipeline-compatible steps, or run automatic cleaning with optional audit output.

Schema

Schema(fields, strict=False).validate(frame, max_errors=None) / validate(frame, schema, max_errors=None)

Validate frames against field definitions. max_errors was added in v1.18.0.
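
One plausible reading of max_errors is a cap on how many failures are collected before validation stops. The sketch below illustrates that reading only; validate_rows and the predicate-based checks are hypothetical and do not reflect arnio's Schema or field classes.

```python
def validate_rows(frame, checks, max_errors=None):
    """Return (column, row_index, value) failures, stopping once max_errors is reached."""
    errors = []
    for column, check in checks.items():
        for index, value in enumerate(frame[column]):
            if check(value):
                continue
            errors.append((column, index, value))
            if max_errors is not None and len(errors) >= max_errors:
                return errors  # stop early instead of scanning the rest
    return errors


frame = {
    "age": [31, -4, 27, -1],
    "email": ["a@example.com", "nope", "b@example.com", "x"],
}
checks = {
    "age": lambda v: isinstance(v, int) and v >= 0,
    "email": lambda v: isinstance(v, str) and "@" in v,
}

all_errors = validate_rows(frame, checks)
first_two = validate_rows(frame, checks, max_errors=2)
```

Capping errors keeps reports short and bounds the work done on large frames that fail in many places.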

Field builders

Int64, Float64, String, Bool, Email, URL(allowed_schemes=...), PhoneNumber, CountryCode, CurrencyCode, Date, DateTime, Regex, and Custom.

diff_schema, schema_to_dict, schema_to_yaml, register_validator

Compare schema contracts, export schema definitions, and register custom semantic validators.

Integrations

pandas accessor: df.arnio.to_arframe(), pipeline(), clean(), profile(), suggest_cleaning(), auto_clean(), validate()

Run Arnio workflows directly from pandas DataFrames.

register_duckdb(frame, conn, name)

Register an ArFrame as a DuckDB relation through pandas interop.

ArnioCleaner

Optional scikit-learn transformer for pipeline-safe data preparation. Install with pip install "arnio[sklearn]".