API Reference

Scope: This page documents the public Python API exposed from arnio.__all__ on current origin/main.

I/O

read_csv(path, *, delimiter=None, has_header=True, usecols=None, nrows=None, skiprows=None, encoding="utf-8", trim_headers=True, decimal_separator=".", thousands_separator=None, null_values=None, dtype=None, mode="strict", encoding_errors="strict", on_bad_lines="error")

Read CSV-like input into an ArFrame. Supports extension-flexible paths, TSV delimiter inference, explicit dtypes, decoding policy, and configurable malformed-row handling. Key options: trim_headers strips whitespace from column names; null_values specifies additional tokens to treat as null; thousands_separator enables numeric parsing with grouping separators; mode controls parser strictness ("strict" or "lax").

Returns: ArFrame. Raises CsvReadError for read and parse failures.
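
The separator and null-token options can be pictured with a small standard-library sketch. This is not arnio's parser: parse_number is a hypothetical helper, and treating the empty string as a built-in null token is an assumption made only for this illustration.

```python
from typing import Optional, Sequence


def parse_number(
    text: str,
    *,
    decimal_separator: str = ".",
    thousands_separator: Optional[str] = None,
    null_values: Sequence[str] = (),
) -> Optional[float]:
    """Parse one cell into a float, or None when the cell is a null token."""
    token = text.strip()
    # Assumption: empty cells count as null in addition to the caller's tokens.
    if token == "" or token in null_values:
        return None
    # Remove grouping separators first so they cannot be confused
    # with the decimal separator in the next step.
    if thousands_separator:
        token = token.replace(thousands_separator, "")
    if decimal_separator != ".":
        token = token.replace(decimal_separator, ".")
    return float(token)


print(parse_number("1,234.5", thousands_separator=","))
print(parse_number("1.234,5", decimal_separator=",", thousands_separator="."))
print(parse_number("NA", null_values=["NA", "MISSING"]))
```

Stripping the grouping separator before swapping the decimal separator is what lets European-style input such as "1.234,5" parse correctly.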

read_csv_chunked(path, *, chunksize=10000, dtype=None, delimiter=None, has_header=True, usecols=None, nrows=None, skip_rows=0, skiprows=None, encoding="utf-8", trim_headers=True, decimal_separator=".", thousands_separator=None, null_values=None, mode="strict", on_bad_lines="error")

Yield ArFrame chunks from large files. delimiter=None infers tab delimiters for .tsv paths and comma otherwise. on_bad_lines determines whether records with an unexpected number of fields raise an error, emit a warning, or are skipped. Supports the same parser controls as read_csv(), including dtype, usecols, null_values, thousands_separator, trim_headers, and mode.
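
As a rough picture of the delimiter default and chunked iteration described above, the sketch below uses only the standard csv module. default_delimiter and iter_chunks are hypothetical names, the case-insensitive suffix check is an assumption, and real chunks would be ArFrame objects rather than lists of rows.

```python
import csv
import io
from itertools import islice
from pathlib import Path
from typing import Iterator, List, TextIO


def default_delimiter(path: str) -> str:
    """Tab for .tsv paths, comma otherwise (suffix compared case-insensitively)."""
    return "\t" if Path(path).suffix.lower() == ".tsv" else ","


def iter_chunks(
    handle: TextIO, *, delimiter: str, chunksize: int
) -> Iterator[List[List[str]]]:
    """Yield lists of at most `chunksize` data rows, skipping one header row."""
    reader = csv.reader(handle, delimiter=delimiter)
    next(reader, None)  # this sketch assumes has_header=True
    while True:
        chunk = list(islice(reader, chunksize))
        if not chunk:
            return
        yield chunk


text = "id\tname\n1\tada\n2\tbob\n3\tcy\n"
chunks = list(
    iter_chunks(
        io.StringIO(text),
        delimiter=default_delimiter("people.tsv"),
        chunksize=2,
    )
)
print(chunks)
```

Only one chunk of rows is held in memory at a time, which is the point of the chunked reader.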

scan_csv(path, *, delimiter=None, encoding="utf-8", trim_headers=True, decimal_separator=".", thousands_separator=None, sample_size=None, null_values=None, has_header=True, encoding_errors="strict", mode="strict", on_bad_lines="error")

Infer column names and dtypes without loading the full file. As with read_csv, an omitted delimiter resolves to tab for .tsv paths and comma otherwise.

Example: ar.scan_csv("raw.csv", sample_size=500, null_values=["NA", "MISSING"]) keeps sampling and null-token handling aligned with a later read_csv() call.

read_jsonl(path, *, encoding="utf-8", encoding_errors="strict", nrows=None)

Read JSON Lines or NDJSON into an ArFrame; mixed object columns are coerced to strings. The encoding and encoding_errors parameters control how file bytes are decoded into text.

read_jsonl_chunked(path, *, chunksize=10000, encoding="utf-8", encoding_errors="strict", nrows=None)

Yield JSON Lines or NDJSON records as ArFrame chunks without buffering the whole file. Uses the same validation, line-numbered errors, decoding controls, and nrows limit as read_jsonl().

write_csv(frame, path, *, delimiter=",", write_header=True, line_terminator="\n", escape_formulas=False, encoding="utf-8", encoding_errors="strict")

Write an ArFrame to CSV or TSV with delimiter and line terminator validation. Set escape_formulas=True to prefix string cells that start with spreadsheet formula trigger characters before CSV quoting. encoding defaults to "utf-8" (native fast path); any Python codec is accepted and transcoded in bounded chunks. encoding_errors controls unencodable character handling: "strict" (default), "replace", or "ignore".
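
The idea behind escape_formulas can be sketched with the standard csv module. The trigger characters ("=", "+", "-", "@") and the single-quote prefix below follow common CSV-injection guidance and are assumptions; the characters and prefix arnio actually uses are not stated on this page.

```python
import csv
import io

# Assumed trigger set; arnio's actual list may differ.
FORMULA_TRIGGERS = ("=", "+", "-", "@")


def escape_formula(cell):
    """Prefix risky string cells with a quote; leave other values untouched."""
    if isinstance(cell, str) and cell.startswith(FORMULA_TRIGGERS):
        return "'" + cell
    return cell


buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["name", "note"])
# Escaping happens on the raw value, before the writer applies CSV quoting.
writer.writerow([escape_formula(v) for v in ["ada", "=SUM(A1:A9)"]])
# Numeric cells are not strings, so a negative number is written as-is.
writer.writerow([escape_formula(v) for v in ["bob", -5]])
print(buffer.getvalue())
```

Restricting the check to string cells keeps legitimate negative numbers intact.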

write_json(frame, path, *, orient="records", indent=None)

Export an ArFrame to JSON without pandas conversion. Supported orient formats: "records", "list", and "split".

write_jsonl(frame, path, *, encoding="utf-8", encoding_errors="strict")

Export an ArFrame to JSON Lines or NDJSON. Emits one compact JSON object per row, preserves JSON null, and supports .jsonl and .ndjson output paths. encoding_errors controls unencodable character handling: "strict", "replace", or "ignore".
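
The output shape described above (one compact object per row, nulls kept as JSON null) looks roughly like this when produced with the standard json module. to_jsonl_lines is a hypothetical helper working on a plain dict of columns, not on an ArFrame, and the compact separators are an interpretation of "compact".

```python
import json
from typing import Dict, List


def to_jsonl_lines(columns: Dict[str, list]) -> List[str]:
    """Turn a column mapping into one compact JSON object per row."""
    names = list(columns)
    rows = zip(*(columns[name] for name in names))
    return [
        json.dumps(dict(zip(names, row)), separators=(",", ":"), ensure_ascii=False)
        for row in rows
    ]


lines = to_jsonl_lines({"id": [1, 2], "name": ["ada", None]})
print("\n".join(lines))
```

Python None serializes to JSON null, so missing values survive the round trip without becoming the string "None".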

read_parquet(path, *, columns=None, usecols=None)

Read a Parquet file into an ArFrame via the optional pyarrow dependency. Install with pip install "arnio[parquet]". Use usecols or columns for column subset selection.

write_parquet(frame, path, *, compression="snappy", row_group_size=None, preserve_attrs=True)

Write an ArFrame to Parquet via the optional pyarrow dependency. Install with pip install "arnio[parquet]". preserve_attrs=True writes JSON-serializable DataFrame.attrs into Parquet metadata; set to False to skip that export. Accepted compression values: "snappy" (default), "gzip", "brotli", "zstd", and "none".

sniff_delimiter(path, *, encoding="utf-8", sample_size=2048)

Detect a likely delimiter for CSV-like input before reading or scanning. sample_size controls how many characters are read from the start of the file for detection; the default of 2048 is sufficient for most files. Increase it when the file has an unusual delimiter that only appears further into the content.
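
One simple way to think about delimiter detection over a bounded sample is a consistency count per candidate. The heuristic, the candidate set, and the name sniff below are all assumptions for illustration; arnio's actual detection logic is not documented here.

```python
from collections import Counter

# Assumed candidate set, not necessarily arnio's.
CANDIDATES = (",", "\t", ";", "|")


def sniff(text: str, sample_size: int = 2048) -> str:
    """Pick the candidate that appears most consistently on every sampled line."""
    sample = text[:sample_size]
    lines = [line for line in sample.splitlines() if line.strip()]
    if len(text) > sample_size and len(lines) > 1:
        lines = lines[:-1]  # the last sampled line may be cut off mid-record
    best, best_score = ",", 0  # fall back to comma when nothing qualifies
    for candidate in CANDIDATES:
        counts = [line.count(candidate) for line in lines]
        if not counts or min(counts) == 0:
            continue  # a delimiter should appear on every sampled line
        count, agreeing = Counter(counts).most_common(1)[0]
        score = count * agreeing
        if score > best_score:
            best, best_score = candidate, score
    return best


print(repr(sniff("a;b;c\n1;2;3\n")))
print(repr(sniff("x\ty\n1\t2\n")))
```

Because only the first sample_size characters are inspected, a delimiter that first shows up later in the file cannot be detected without a larger sample.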

Frame API

class ArFrame

Python wrapper over the native C++ frame. Core properties include shape, columns, dtypes, and is_empty; core methods include memory_usage(), head(), tail(), and preview().

ArFrame.from_records(records, *, columns=None) / ar.from_records(...)

Build an ArFrame from record dictionaries or row-like records. Added in v1.17.0.

ArFrame._repr_html_()

Jupyter/IPython automatically calls this to render a bounded HTML table preview showing up to 10 rows, column names, dtypes, and shape summary.

frame["column"] / frame[["a", "b"]]

Select one column as a Python list or multiple columns as an ArFrame.

item in frame / frame.__contains__(item)

Return True when the item is a valid column name in the frame.

frame.select_columns(columns)

Return a new ArFrame containing only the requested columns in original order.

frame.select_dtypes(include=..., exclude=...)

Return a new ArFrame filtered by dtype names like int64, float64, string, bool, or null.

frame.astype(dtype)

Cast one or more columns to a specified dtype or per-column type mapping and return a new ArFrame.

frame.to_dict()

Return a dict[str, list] representation for serialization or tests.

frame.to_csv(path, *, delimiter=",", write_header=True, **kwargs)

Convenience wrapper to write the frame to a CSV file. Delegates to arnio.write_csv.

frame.drop_columns(columns)

Method equivalent of the top-level drop_columns helper, available on current main.

frame.describe() / frame.schema_summary

Summary statistics and column summaries. schema_summary returns ColumnSummary objects with name, dtype, and nullability.

Conversion

to_pandas(frame, *, copy=False)

Convert to pandas, preserving the fast zero-copy path where supported. Use copy=True for defensive isolation.

from_pandas(df)

Convert a pandas DataFrame to ArFrame with dtype validation, duplicate-column checks, attrs preservation, and zero-column row-count preservation.

from_dict(data)

Build an ArFrame from a mapping of column names to column values.

to_arrow(frame)

Export to pyarrow.Table. Install with pip install "arnio[arrow]". Added in v1.18.0.

from_arrow(table)

Convert a PyArrow Table to an ArFrame. Install with pip install "arnio[arrow]".

to_polars(frame)

Convert an ArFrame to a Polars DataFrame via Arrow. Requires optional Polars dependency; install with pip install "arnio[polars]".

from_polars(df)

Convert a Polars DataFrame to ArFrame with inferred types and null values preserved. Requires optional Polars dependency; install with pip install "arnio[polars]".

Cleaning

Cleaning functions return new frames and are usable directly or from pipeline().

Function	Purpose
`drop_nulls`, `keep_rows_with_nulls`, `fill_nulls`	Control null-bearing rows and values.
`validate_columns_exist`, `drop_columns`, `select_columns`, `drop_empty_columns`, `drop_constant_columns`, `drop_columns_matching`	Validate and manage columns explicitly, by emptiness, constants, or regex.
`filter_rows`, `drop_duplicates`, `find_fuzzy_duplicates`	Filter row sets, remove duplicate rows, and detect near-duplicate rows by similarity.
`hash_columns`, `strip_whitespace`, `normalize_whitespace`, `normalize_case`, `normalize_unicode`, `clean_column_names`, `trim_column_names`, `slugify_column_names`, `rename_columns`, `rename_columns_matching`	Normalize text, headers, and column names.
`cast_types`, `parse_bool_strings`, `round_numeric_columns`, `clip_numeric`, `winsorize_outliers`, `normalize_minmax`	Type and numeric cleanup.
`encode_categorical`	Encode STRING columns using one-hot indicator columns or caller-supplied ordinal mappings.
`replace_values`, `standardize_missing_tokens`, `combine_columns`, `coalesce_columns`, `safe_divide_columns`	Common feature engineering and value repair helpers.

hash_columns(frame, *, subset, algorithm="sha256")

Replace values in string columns with their lowercase hex-encoded hash digest. Uses the standard-library hashlib module — no custom digest code. Null cells are preserved as null; empty strings are hashed normally. algorithm accepts "sha256" (default, recommended) or "md5" (for deduplication workloads only — not cryptographically strong). Raises ValueError for missing columns; raises TypeError for non-string columns. Available as a pipeline step: ("hash_columns", {"subset": ["email"], "algorithm": "sha256"}).

Warning: hashing is deterministic pseudonymization, not encryption. This does not constitute anonymization under GDPR or equivalent regulations.
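
The per-cell behavior described for hash_columns can be approximated with hashlib directly. hash_values is a hypothetical stand-in that works on a plain list, and UTF-8 encoding of the input strings is an assumption.

```python
import hashlib
from typing import List, Optional


def hash_values(
    values: List[Optional[str]], algorithm: str = "sha256"
) -> List[Optional[str]]:
    """Hash each string to a lowercase hex digest, keeping None as None."""
    if algorithm not in {"sha256", "md5"}:
        raise ValueError(f"unsupported algorithm: {algorithm!r}")
    hashed: List[Optional[str]] = []
    for value in values:
        if value is None:
            hashed.append(None)  # nulls are preserved, not hashed
        elif not isinstance(value, str):
            raise TypeError("only string values can be hashed")
        else:
            # Assumption: strings are encoded as UTF-8 before hashing.
            digest = hashlib.new(algorithm, value.encode("utf-8")).hexdigest()
            hashed.append(digest)
    return hashed


result = hash_values(["", None, "ada@example.com"])
print(result[0])
print(result[1])
```

The empty string still produces a digest, which is why it stays distinguishable from a null cell after hashing.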

CastReport and CastFailure

Returned by cast_types(..., errors="report"). CastReport holds the cast frame and a failures list of CastFailure records. Each CastFailure has column, row, value, and target_dtype attributes. bool(report) is True when there is at least one failure.

Pipeline

pipeline(frame, steps, *, verbose=False, track_lineage=False)

Run named cleaning steps or registered Python callables. verbose=True enables lightweight diagnostics through the arnio logger and optional PipelineContext. Pass track_lineage=True to receive an (ArFrame, LineageReport) tuple that maps every dropped row back to its original index and the step that removed it.

LineageReport

Returned by pipeline(..., track_lineage=True). Attributes: dropped_by_step (dict mapping step name → sorted list of original row indices dropped by that step; steps that dropped no rows have an empty list) and total_dropped (int). Method: to_pandas() returns a flat DataFrame with columns original_index and step.
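
The relationship between dropped_by_step, total_dropped, and the flat table can be shown with plain Python data. The values below are made up for illustration, and the row order of the real to_pandas() output is not specified on this page.

```python
# Hypothetical lineage data shaped like LineageReport.dropped_by_step.
dropped_by_step = {
    "drop_nulls": [1, 4],
    "drop_duplicates": [],  # a step that dropped nothing keeps an empty list
    "filter_rows": [0],
}

# total_dropped is the number of dropped rows across all steps.
total_dropped = sum(len(indices) for indices in dropped_by_step.values())

# Flat form: one record per dropped row, with the two documented columns.
flat = [
    {"original_index": index, "step": step}
    for step, indices in dropped_by_step.items()
    for index in indices
]

print(total_dropped)
for record in flat:
    print(record)
```

The flat form is convenient for joining against the original data to inspect exactly which rows each step removed.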

register_step(name, fn, overwrite=False), unregister_step(name), list_steps(), reset_steps(), get_builtin_step_signatures()

Manage the step registry and discover available built-in signatures.

Pipeline Serialization

save_pipeline(steps, filepath): Save a list of pipeline steps to a JSON or YAML file for reproducible jobs.
load_pipeline(filepath): Load a list of pipeline steps from a JSON or YAML file.
PipelineSerializationError: Exception raised when a declarative pipeline cannot be saved or loaded.

Quality

profile(frame, *, exclude_columns=None, sample_size=5, approx_top_values=False, approx_top_values_min_unique=1000, approx_top_values_min_ratio=0.2, approx_top_values_sample_size=2000)

Return a DataQualityReport with row/column counts, duplicate metrics, column profiles, warnings, quality score, score components, and cleaning suggestions.

DataQualityReport._repr_html_()

Jupyter/IPython automatically calls this to render the report dashboard HTML directly in the notebook output area.

DataQualityReport.to_dict(redact_sample_values=False, exclude_columns=None), to_json(), to_markdown(), to_html(), summary(), to_pandas()

Export reports for notebooks, CI, docs, and automation.

compare_profiles(left, right), suggest_cleaning(frame_or_report), auto_clean(frame, *, mode="safe", return_report=False, dry_run=False, allow_lossy_casts=False, confirmed_casts=None, explain=False)

Compare quality over time, generate pipeline-compatible steps, or run automatic cleaning with optional reports and audit output.

Additional options:

return_report – return the pre-cleaning quality report alongside cleaned output.
allow_lossy_casts – opt in to strict-mode casts that may lose information.
confirmed_casts – explicit column-to-dtype confirmations required before strict-mode casts are applied.

Return values:

auto_clean(...) → ArFrame
auto_clean(..., return_report=True) → (ArFrame, DataQualityReport)
auto_clean(..., dry_run=True) → DataQualityReport
auto_clean(..., explain=True) → (ArFrame, CleanExplanation)
auto_clean(..., return_report=True, explain=True) → (ArFrame, DataQualityReport, CleanExplanation)

import arnio as ar

# Preview proposed changes (frame is an existing ArFrame)
report = ar.auto_clean(frame, mode="strict", dry_run=True)

# Clean and return report
cleaned, report = ar.auto_clean(
    frame,
    return_report=True,
)

# Clean with explanation
cleaned, explanation = ar.auto_clean(
    frame,
    explain=True,
)

check_quality_gates(baseline_profile, current_profile, *, max_row_count_delta_ratio=0.1, max_duplicate_ratio_delta=0.05, max_null_ratio_delta=0.05, max_numeric_mean_delta_ratio=0.1, max_numeric_std_delta_ratio=0.2, allow_new_columns=False, allow_missing_columns=False, fail_on_dtype_change=True)

Compare two DataQualityReport objects and return a QualityGateResult suitable for CI and release checks.

Option	Default	Gate behavior
`max_row_count_delta_ratio`	`0.1`	Maximum relative row-count drift. Set to `None` to disable.
`max_duplicate_ratio_delta`	`0.05`	Maximum absolute duplicate-ratio drift. Set to `None` to disable.
`max_null_ratio_delta`	`0.05`	Maximum absolute per-column null-ratio drift. Set to `None` to disable.
`max_numeric_mean_delta_ratio`	`0.1`	Maximum relative per-column numeric mean drift. Set to `None` to disable.
`max_numeric_std_delta_ratio`	`0.2`	Maximum relative per-column numeric standard-deviation drift. Set to `None` to disable.
`allow_new_columns`	`False`	Allow columns that appear only in the current profile.
`allow_missing_columns`	`False`	Allow baseline columns that are absent from the current profile.
`fail_on_dtype_change`	`True`	Fail when shared columns change dtype.

import arnio as ar

baseline = ar.profile(ar.read_csv("baseline.csv"))
current = ar.profile(ar.read_csv("current.csv"))

result = ar.check_quality_gates(
    baseline,
    current,
    max_null_ratio_delta=0.02,
    allow_new_columns=True,
)

if not result.passed:
    print(result.to_markdown())
    result.raise_for_failures()
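
The difference between the "relative" and "absolute" thresholds in the table can be made concrete with a little arithmetic. The formulas below, the strict greater-than comparison, and the handling of a zero baseline are assumptions; the page does not define how arnio computes these deltas.

```python
from typing import Optional


def relative_delta(baseline: float, current: float) -> float:
    """Change as a fraction of the baseline value (assumed definition)."""
    if baseline == 0:
        return 0.0 if current == 0 else float("inf")
    return abs(current - baseline) / abs(baseline)


def absolute_delta(baseline: float, current: float) -> float:
    """Plain difference, used here for values that are already ratios."""
    return abs(current - baseline)


def exceeds(delta: float, limit: Optional[float]) -> bool:
    """A limit of None disables the gate, mirroring the table above."""
    return limit is not None and delta > limit


# 1000 rows growing to 1080 is an 8% relative change, inside a 0.1 limit.
row_drift = relative_delta(1000, 1080)
print(row_drift, exceeds(row_drift, 0.1))

# A null ratio moving from 0.10 to 0.13 is a 0.03 absolute change,
# which is over a tightened 0.02 limit.
null_drift = absolute_delta(0.10, 0.13)
print(exceeds(null_drift, 0.02))
```

Relative limits suit quantities whose scale varies between datasets, such as row counts and means, while absolute limits suit values already bounded between 0 and 1.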

Quality result classes

ColumnProfile, CleaningSuggestion, CleanStepRecord, CleanExplanation, DriftReport, ProfileComparison, QualityGateIssue, and QualityGateResult describe structured profiling, cleaning, comparison, and gate output.

detect_drift(old, new)

Compare two ArFrame datasets and return a DriftReport describing structural and statistical changes. Unlike compare_profiles, this function handles added or removed columns without raising an error.

DriftReport fields:

added_columns – columns present in new but absent from old.
removed_columns – columns present in old but absent from new.
dtype_changes – shared columns whose dtype changed: {col: (old_dtype, new_dtype)}.
null_ratio_changes – shared columns whose null ratio changed: {col: (old_ratio, new_ratio)}.
row_count – row counts as (old_rows, new_rows).
has_drift – True when any of the above are non-empty or row counts differ.
to_dict() – JSON-safe export of the report.

import json

import arnio as ar

old_frame = ar.read_csv("data_v1.csv")
new_frame = ar.read_csv("data_v2.csv")
report = ar.detect_drift(old_frame, new_frame)

if report.has_drift:
    print("Added columns:", report.added_columns)
    print("Removed columns:", report.removed_columns)
    print("Dtype changes:", report.dtype_changes)
    print("Null ratio changes:", report.null_ratio_changes)

# Export the full report as JSON
print(json.dumps(report.to_dict(), indent=2))

Schema

Schema(fields, strict=False, unique=None, rules=None).validate(frame, max_errors=None) / validate(frame, schema, max_errors=None)

Validate frames against field definitions. max_errors was added in v1.18.0.

unique: list or tuple of column names that must contain no duplicate values in the frame.

rules: list of callables (pd.DataFrame) -> list[ValidationIssue] for cross-field or custom contract checks.

import arnio as ar

schema = ar.Schema(
    fields={"email": ar.Email(), "user_id": ar.Int64()},
    unique=["email"],
)
result = schema.validate(frame)

validate_chunked(chunks, schema, *, max_errors=None)

Validate an iterable of ArFrame chunks against a Schema and return a single ValidationResult. Row indices in all ValidationIssue objects are adjusted to reflect global positions across chunk boundaries. Supports max_errors for early termination. The behavior of validate() is unchanged.
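
The index adjustment across chunk boundaries amounts to adding a running row offset to each chunk-local index. The sketch below models a chunk as a (row_count, local_issue_rows) pair; that representation, the 0-based indices, and the name global_issue_rows are assumptions made for illustration.

```python
from typing import Iterable, List, Optional, Tuple


def global_issue_rows(
    chunks: Iterable[Tuple[int, List[int]]],
    max_errors: Optional[int] = None,
) -> List[int]:
    """Translate chunk-local row indices into positions in the whole dataset."""
    offset = 0  # number of rows seen in earlier chunks
    found: List[int] = []
    for row_count, local_rows in chunks:
        for local in local_rows:
            found.append(offset + local)
            if max_errors is not None and len(found) >= max_errors:
                return found  # stop early without reading further chunks
        offset += row_count
    return found


# Three chunks of 3, 3, and 2 rows; issues at local rows 1, none, and 0 and 1.
chunks = [(3, [1]), (3, []), (2, [0, 1])]
print(global_issue_rows(chunks))
print(global_issue_rows(chunks, max_errors=2))
```

Because the function returns as soon as the limit is reached, a lazy chunk iterator is never advanced past the chunk that produced the final reported issue.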

Field builders

Each builder returns a Field for use in Schema(fields={...}). All builders accept keyword-only arguments unless noted otherwise.

Builder	Group	Key parameters	Notes
`Int64(...)`	Numeric	`nullable=True`, `min=None`, `max=None`, `unique=False`, `severity="error"`, `required_if=None`	Integer range bounds are inclusive. `min` and `max` must be numeric.
`Float64(...)`	Numeric	`nullable=True`, `min=None`, `max=None`, `unique=False`, `severity="error"`, `required_if=None`	Floating-point range bounds. Same constraints as `Int64`.
`String(...)`	String	`nullable=True`, `pattern=None`, `allowed=None`, `case_sensitive=True`, `min_length=None`, `max_length=None`, `unique=False`, `severity="error"`, `required_if=None`	`pattern` is a regex applied to every non-null value. `allowed` accepts a set, list, or tuple — not a bare string.
`Bool(...)`	Boolean	`nullable=True`, `severity="error"`, `required_if=None`	No range or uniqueness constraints; dtype must be boolean.
`Email(...)`	Semantic	`nullable=True`, `unique=False`, `validation="light"`, `severity="error"`, `required_if=None`	`validation` is `"light"` (format check) or `"strict"` (RFC 5322 conformant).
`URL(...)`	Semantic	`nullable=True`, `unique=False`, `allowed_schemes=None`, `severity="error"`, `required_if=None`	Omit `allowed_schemes` to use the default http/https URL validator; pass `allowed_schemes=[...]` to permit a custom set such as `["https", "ftp"]`.
`PhoneNumber(...)`	Semantic	`nullable=True`, `unique=False`, `severity="error"`, `required_if=None`	Validates E.164-style phone numbers.
`CountryCode(...)`	Semantic	`nullable=True`, `unique=False`, `severity="error"`, `required_if=None`	Expects uppercase ISO 3166-1 alpha-2 codes (e.g. `"US"`, `"IN"`).
`CurrencyCode(...)`	Semantic	`nullable=True`, `unique=False`, `allowed=None`, `severity="error"`, `required_if=None`	Validates 3-letter uppercase ISO 4217 codes. `allowed` narrows to a custom subset of valid codes.
`LanguageCode(...)`	Semantic	`nullable=True`, `unique=False`, `severity="error"`, `required_if=None`	Expects lowercase ISO 639-1 two-letter codes (e.g. `"en"`, `"fr"`).
`TimeZone(...)`	Semantic	`nullable=True`, `unique=False`, `severity="error"`, `required_if=None`	Validates IANA timezone identifiers (e.g. `"America/New_York"`).
`UUID(...)`	Semantic	`nullable=True`, `unique=False`, `severity="error"`, `required_if=None`	Validates the canonical 8-4-4-4-12 hexadecimal UUID format (RFC 4122, e.g. `"550e8400-e29b-41d4-a716-446655440000"`). Both upper- and lower-case hex digits are accepted.
`IPv4(...)`	Semantic	`nullable=True`, `unique=False`, `severity="error"`, `required_if=None`	Validates strict dotted-decimal IPv4 addresses (e.g. `"192.168.1.1"`). Each octet must be 0–255; leading zeros are rejected.
`MACAddress(...)`	Semantic	`nullable=True`, `unique=False`, `severity="error"`, `required_if=None`	Validates IEEE 802 MAC-48 addresses in colon-separated (`"AA:BB:CC:DD:EE:FF"`) or hyphen-separated (`"AA-BB-CC-DD-EE-FF"`) notation.
`Date(...)`	Date / Time	`nullable=True`, `min=None`, `max=None`, `unique=False`, `severity="error"`, `required_if=None`	Validates ISO 8601 date strings (`YYYY-MM-DD`). `min` / `max` accept ISO date strings, `datetime.date`, or `pd.Timestamp`; numeric values raise `TypeError`.
`DateTime(...)`	Date / Time	`nullable=True`, `min=None`, `max=None`, `format=None`, `unique=False`, `severity="error"`, `required_if=None`	`format` is a `strptime` format string. `min` / `max` accept ISO strings or `datetime` objects.
`Regex(pattern, ...)`	Regex	`pattern` (positional, required), `nullable=True`, `unique=False`, `severity="error"`, `required_if=None`	Pattern is compiled at call time — invalid expressions raise `re.error` immediately, not at validation time.
`Custom(name, ...)`	Custom	`name` (positional, required), `nullable=True`, `unique=False`, `severity="error"`, `required_if=None`	`name` must match a validator registered via `register_validator()`. Raises `ValueError` if the name is not found.

Common parameters: nullable — allow null values; unique — require all non-null values to be distinct; severity — "error" (default) or "warning"; required_if — a (column, value) tuple that makes the field mandatory when another column equals a given value.
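
To make one of the semantic rules in the table concrete, the strict IPv4 rule (four octets, each 0 to 255, no leading zeros) can be written as a regular expression. This is an independent sketch of the stated rule, not arnio's validator, and is_strict_ipv4 is a hypothetical name.

```python
import re

# One octet: 0-255 with no leading zeros. [0-9] is used instead of \d
# so that non-ASCII digits are rejected.
_OCTET = r"(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])"
IPV4_RE = re.compile(rf"{_OCTET}(?:\.{_OCTET}){{3}}")


def is_strict_ipv4(value: str) -> bool:
    """True only for dotted-decimal addresses that satisfy the strict rule."""
    return IPV4_RE.fullmatch(value) is not None


for candidate in ["192.168.1.1", "256.1.1.1", "192.168.01.1", "1.2.3"]:
    print(candidate, is_strict_ipv4(candidate))
```

Using fullmatch rather than match rejects trailing characters, including a trailing newline.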

diff_schema, schema_to_dict, schema_to_yaml, schema_from_yaml, register_validator

Compare schema contracts, export schema definitions, and register custom semantic validators. Structured results include SchemaDiff, SchemaDiffEntry, ValidationIssue, and ValidationResult.

Integrations

ArnioPandasAccessor / pandas accessor: df.arnio.to_arframe(), pipeline(), clean(), profile(), suggest_cleaning(), auto_clean(), validate()

Run Arnio workflows directly from pandas DataFrames.

df.arnio.clean(steps=None, *, strip_whitespace=True, drop_nulls=False, drop_duplicates=False)

Clean a DataFrame with Arnio and return a pandas.DataFrame.

When steps is provided, it is passed directly to ar.pipeline() and the keyword flags are ignored. When steps is omitted, the convenience-clean path runs with the keyword flags controlling which operations apply.

strip_whitespace – strip leading and trailing whitespace from string columns (default True).
drop_nulls – drop rows that contain any null value (default False).
drop_duplicates – drop duplicate rows (default False).

import pandas as pd
import arnio as ar

df = pd.DataFrame({"name": [" Alice ", " Alice "], "score": [10, 10]})
cleaned = df.arnio.clean(drop_duplicates=True)
# cleaned: one row, name stripped to "Alice"

Returns: pandas.DataFrame

register_duckdb(frame, conn, name)

ArnioCleaner

Optional scikit-learn transformer for pipeline-safe data preparation. Install with pip install "arnio[sklearn]".

Exceptions

ArnioError and specialized exceptions

ArnioError is the package-level base exception. Public specialized exceptions include CsvReadError, JsonlReadError, RemoteReadError, TypeCastError, PipelineStepError, UnknownStepError, and SchemaValidationError.