RFC: typed-arrow Integration that Reduces Tonbo Complexity #468

@ethe

Summary

Refactor Tonbo in place to adopt the typed-arrow family (typed-arrow, typed-arrow-dyn, and optionally typed-arrow-unified) as the primary engine for Arrow schema/array handling. Tonbo will keep only its database semantics (primary key, _null/_ts, tombstones, Parquet sorting/pushdown, projection) as a thin layer. This reduces Tonbo macro/code complexity while increasing Arrow compatibility and maintaining behavior and performance.

Motivation

  • Better Arrow compatibility: support Struct, List/LargeList/FixedSizeList, Map, Union, Dictionary, timestamps, fixed-size binary.
  • Reduce Tonbo macro+array/builder code: centralize Arrow typing in typed-arrow; keep Tonbo focused on storage/LSM/WAL/compaction/scan.
  • Improve maintainability and interop (DataFusion/Parquet): use standard Arrow builders/arrays and schema metadata.
  • Avoid migration risk: keep working storage paths and public APIs stable while swapping the Arrow layer incrementally.

Background (Today)

  • Compile-time path: tonbo_macros::#[derive(Record)] generates <Type>Schema, <Type>Ref<'_>, <Type>ImmutableArrays, and <Type>Builder; the schema prepends _null and _ts, sets PK indices and Parquet sorting; projection/tombstones supported.
  • Runtime path: DynSchema + DynRecord{,Ref} + dynamic arrays/builder; supports lists (deep), partial coverage for other nested types; tombstones preserve PK.
  • Parquet/DataFusion: rely on Arrow RecordBatch and Parquet sorting/paths (TS then PK).

Goals

  • Maintain Tonbo DB semantics: PK extraction, _null/_ts sentinels, tombstones that still append PK, Parquet sorting/paths, projection.
  • Expand Arrow type coverage via typed-arrow.
  • Keep public API stable where possible (existing derive and traits), with migration shims if needed.
  • Achieve performance parity (or better) for row append and read/projection.

Non-Goals

  • Rewriting LSM, WAL, compaction, or transaction semantics.
  • Changing Tonbo’s high-level user API/ergonomics beyond necessary adapters.

High-Level Design

  1. typed-arrow as the source of truth for schemas/arrays/builders
  • Users write a single annotation #[tonbo::typed::record] on a struct, marking PK fields with #[record(primary_key)].
  • The attribute macro injects #[derive(typed_arrow::Record)] and encodes PK user-indices in schema metadata.
  • No per-type Tonbo builders/arrays are generated anymore for typed records.
  2. Minimal Tonbo DB glue, generic over the typed record
  • TonboTypedSchema<R> implements Tonbo’s record::Schema by composing [_null, _ts] + R::fields() and providing PK paths/sorting and indices. It embeds metadata (tonbo.primary_key_user_indices, tonbo.primary_key_full_indices, tonbo.sorting_columns).
  • TonboTypedBuilders<R> implements ArrowArraysBuilder and delegates user columns to R::Builders, while appending sentinels and preserving PK in tombstones.
  • TonboTypedArrays implements ArrowArrays and exposes a full RecordBatch. get() performs projection while keeping PK columns.
  • The macro still generates only the type glue needed by Tonbo DB: impl Record for T (with type Schema = TonboTypedSchema<T> and a TRef<'_>), plus Encode/Decode for WAL.
  3. Runtime (dynamic) path via typed-arrow-dyn
  • Replace custom dynamic arrays with typed-arrow-dyn under a thin Tonbo adapter that adds sentinels and preserves PK on tombstones. Provide an interim Value/ValueRef translation if needed.
  4. Optional unified facade
  • typed-arrow-unified can offer a single surface across compile-time and runtime records. The Tonbo layer remains focused on DB semantics either way.
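
To make the "thin adapter" idea in item 2 concrete, here is a minimal std-only sketch of a builder wrapper that owns the two sentinel columns and delegates user columns to an inner builder. It does not use the real typed-arrow or Tonbo APIs: `UserBuilders`, `SentinelBuilders`, `Write`, and `PersonBuilders` are hypothetical names, columns are modelled as plain `Vec`s, the `u32` timestamp is chosen only for illustration, and writing default values into non-nullable columns on a tombstone is an assumption.

```rust
/// Hypothetical stand-in for the per-record builders that typed-arrow would provide.
trait UserBuilders {
    type Row;
    type Key;
    /// Append a full user row.
    fn append_row(&mut self, row: Self::Row);
    /// Append a row that carries only the primary key (used for tombstones).
    fn append_key_only(&mut self, key: Self::Key);
    fn len(&self) -> usize;
}

/// A write is either a live row or a delete identified by its primary key.
enum Write<R, K> {
    Put(R),
    Delete(K),
}

/// Adapter: owns the `_null` and `_ts` sentinel columns, delegates the rest.
struct SentinelBuilders<B: UserBuilders> {
    null_flags: Vec<bool>, // models the `_null` column
    timestamps: Vec<u32>,  // models the `_ts` column
    user: B,
}

impl<B: UserBuilders> SentinelBuilders<B> {
    fn new(user: B) -> Self {
        Self { null_flags: Vec::new(), timestamps: Vec::new(), user }
    }

    /// Every write appends exactly one value to every column,
    /// so sentinel and user columns always stay the same length.
    fn push(&mut self, ts: u32, write: Write<B::Row, B::Key>) {
        self.timestamps.push(ts);
        match write {
            Write::Put(row) => {
                self.null_flags.push(false);
                self.user.append_row(row);
            }
            Write::Delete(key) => {
                // Tombstone: `_null = true`, but the PK value is still written.
                self.null_flags.push(true);
                self.user.append_key_only(key);
            }
        }
    }
}

struct Person {
    id: i64,
    name: String,
    age: Option<i16>,
}

#[derive(Default)]
struct PersonBuilders {
    id: Vec<i64>,
    name: Vec<String>,
    age: Vec<Option<i16>>,
}

impl UserBuilders for PersonBuilders {
    type Row = Person;
    type Key = i64;

    fn append_row(&mut self, row: Person) {
        self.id.push(row.id);
        self.name.push(row.name);
        self.age.push(row.age);
    }

    fn append_key_only(&mut self, key: i64) {
        self.id.push(key);
        self.name.push(String::new()); // assumption: default for non-nullable columns
        self.age.push(None);
    }

    fn len(&self) -> usize {
        self.id.len()
    }
}
```

The point of the shape is that tombstone handling lives in one generic place (`push`) instead of being regenerated per record type by a macro.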

Arrow Compatibility

  • Broader Arrow types (compile-time): Struct, List/LargeList/FixedSizeList, Map, Union, Dictionary, timestamps, fixed-size binary.
  • Dynamic runtime: typed-arrow-dyn builds from any arrow_schema::SchemaRef and supports nested Struct/List variants out-of-the-box.
  • Native Arrow builders/arrays: typed-arrow code uses arrow-rs typed builders/arrays directly.
  • Interop: improved integration with DataFusion/Parquet and tooling; schemas/arrays are standard.
  • Tonbo-specific layer remains: _null/_ts, PK sorting/paths, tombstones, and key ordering — implemented as thin adapters.

Detailed Plan & Phases

Phase 1 — Make typed-arrow the default path for typed records (no feature flag)

  • Implement and default to TonboTypedSchema<R>, TonboTypedBuilders<R>, and TonboTypedArrays for any struct annotated with #[tonbo::typed::record].
  • Attribute macro generates only DB glue (Record/Ref + Encode/Decode). It injects PK metadata and #[derive(typed_arrow::Record)] automatically.
  • Existing non-typed derive #[derive(tonbo_macros::Record)] remains supported and unchanged (legacy path stays intact).

Phase 2 — Projection, tombstones, Parquet parity

  • Ensure RecordRef::from_record_batch() on typed batches keeps PK cols and applies projection identically to legacy.
  • Validate tombstones: append _null = true rows while still writing PK column values; enforce sort order _ts(desc), pk(asc).
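
The ordering check in Phase 2 can be sketched as a comparator. This is an illustrative model only: it reads "_ts(desc), pk(asc)" as "compare `_ts` first, descending, then break ties by PK ascending", and the `Row` struct with a single `i64` key and `u32` timestamp is a hypothetical simplification, not a Tonbo type.

```rust
use std::cmp::Ordering;

/// Hypothetical flattened row: the two sentinels plus a single-column primary key.
#[derive(Debug, Clone, PartialEq, Eq)]
struct Row {
    null: bool, // `_null`: true marks a tombstone
    ts: u32,    // `_ts`
    pk: i64,    // primary key, present even on tombstones
}

/// `_ts` descending, then primary key ascending.
fn tonbo_order(a: &Row, b: &Row) -> Ordering {
    b.ts.cmp(&a.ts).then_with(|| a.pk.cmp(&b.pk))
}

/// Sort a batch of rows into the expected on-disk order.
fn sort_rows(rows: &mut [Row]) {
    rows.sort_by(tonbo_order);
}

/// Validation helper: true if no adjacent pair violates the expected order.
fn is_tonbo_sorted(rows: &[Row]) -> bool {
    rows.windows(2)
        .all(|w| tonbo_order(&w[0], &w[1]) != Ordering::Greater)
}
```

Because tombstones carry a real PK value, they participate in the comparator exactly like live rows; no special case is needed.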

Phase 3 — Dynamic/runtime path

  • Replace custom dynamic arrays with typed-arrow-dyn + Tonbo adapter; keep DynRecord compatibility until fully retired.

Phase 4 — Cleanup

  • Remove per-type Tonbo builder/arrays codegen for typed records (already unused by the attribute path). Keep legacy derive for users not ready to migrate.

Public API & Compatibility

  • New single-annotation usage for typed records:
    • #[tonbo::typed::record] on the struct, with one or more #[record(primary_key)] fields.
    • The macro injects #[derive(typed_arrow::Record)] and generates Tonbo DB glue (Record/Ref + Encode/Decode).
  • Legacy #[derive(tonbo_macros::Record)] remains supported; it keeps generating per-type builders/arrays and schema as before.
  • Arrow schema shape stays: _null, _ts, then user fields with the same names and nullability.
  • WAL format, transactions, and compaction are unaffected.
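
As a rough illustration of the schema-shape guarantee above, the sketch below prepends the two sentinel fields to a list of user fields without touching their names or nullability. `FieldDesc` and `compose_with_sentinels` are hypothetical stand-ins for Arrow fields and the real composition helper, and treating both sentinels as non-nullable is an assumption.

```rust
/// Hypothetical, simplified field description (name + nullability only).
#[derive(Debug, Clone, PartialEq, Eq)]
struct FieldDesc {
    name: String,
    nullable: bool,
}

impl FieldDesc {
    fn new(name: &str, nullable: bool) -> Self {
        Self { name: name.to_string(), nullable }
    }
}

/// Produce `[_null, _ts] + user_fields`, leaving user fields unchanged.
fn compose_with_sentinels(user_fields: &[FieldDesc]) -> Vec<FieldDesc> {
    let mut fields = Vec::with_capacity(user_fields.len() + 2);
    fields.push(FieldDesc::new("_null", false)); // assumption: non-nullable
    fields.push(FieldDesc::new("_ts", false)); // assumption: non-nullable
    fields.extend_from_slice(user_fields);
    fields
}
```

For the `Person` record used later in this RFC, the composed field order would be `_null`, `_ts`, `id`, `name`, `age`, with `age` still nullable.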

Testing & Benchmarks

  • Mirror existing tests for typed records using the new adapters: projection, from_record_batch, tombstones, compaction flows.
  • Criterion benches: compare append throughput/memory and projection cost for typed vs legacy per record.
  • Parquet/Dataset round-trips to verify sorting/paths and DataFusion compatibility.
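
One of the mirrored projection tests could assert the column set directly. The helper below is a sketch under assumptions: `projection_mask` is a hypothetical name, indices are positions in the full schema (`_null` = 0, `_ts` = 1, user fields from 2), and the sentinels are assumed to always be retained alongside the PK columns.

```rust
use std::collections::BTreeSet;

/// Number of sentinel columns (`_null`, `_ts`) that precede the user fields.
const SENTINELS: usize = 2;

/// Given the user-field indices a caller asked for and the PK user indices,
/// return the sorted, de-duplicated full-schema column indices to read.
fn projection_mask(requested_user: &[usize], pk_user: &[usize]) -> Vec<usize> {
    let mut cols = BTreeSet::new();
    // Sentinels are always kept so tombstones and versions remain visible.
    cols.insert(0);
    cols.insert(1);
    // PK columns are kept even if the caller did not request them.
    for &i in pk_user.iter().chain(requested_user) {
        cols.insert(i + SENTINELS);
    }
    cols.into_iter().collect()
}
```

Running the same assertions against the legacy path and the typed path is one way to check that projection behaves identically in both.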

Risks & Mitigations

  • Semantics drift (PK, tombstones): codify behavior in tests; adapters own sentinels and PK handling.
  • Performance regressions: measure with microbenches; if needed, optimize sentinel writes and PK preservation.
  • Type coverage mismatches: typed-arrow widens coverage; if gaps arise, add focused extensions in typed-arrow rather than Tonbo.

Alternatives

  • Big-bang migration to a new repo: simpler starting skeleton but high integration risk and slower validation. In-place refactor offers incremental value with lower risk.

Rollout & Milestones

  • M1: Adapters + pilot record behind a feature flag; tests green; initial benches.
  • M2: Derive bridge and attribute mapping; 2–3 records ported.
  • M3: Dynamic path via typed-arrow-dyn; end-to-end parity on reads/scans/projection.
  • M4: Parquet validation and DataFusion interop checks.
  • M5: Default on adapters; deprecate legacy code; documentation updates.

Success Metrics

  • Functional parity: existing tests pass; examples unchanged.
  • Performance parity: within ±5% on write/read microbenches.
  • Arrow coverage: nested types and advanced Arrow features available to Tonbo with minimal glue.
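
The ±5% parity target can be expressed as a small check over benchmark numbers. This is only a sketch: `relative_delta` and `within_parity` are hypothetical helpers, and the inputs are assumed to be comparable measurements (for example, mean throughput of the legacy and typed paths on the same bench).

```rust
/// Signed relative difference of `candidate` against `baseline`.
fn relative_delta(baseline: f64, candidate: f64) -> f64 {
    (candidate - baseline) / baseline
}

/// True if `candidate` is within `tolerance` (e.g. 0.05 for ±5%) of `baseline`.
/// A non-positive baseline is rejected rather than divided by.
fn within_parity(baseline: f64, candidate: f64, tolerance: f64) -> bool {
    baseline > 0.0 && relative_delta(baseline, candidate).abs() <= tolerance
}
```

For instance, a legacy result of 100 units and a typed result of 104 units would pass at a tolerance of 0.05, while 106 would not.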

Appendix: Key Types (Implemented/Planned)

  • record::typed::compose_arrow_schema_with_sentinels(user_fields) → Arrow SchemaRef with [_null, _ts] + fields.
  • record::typed::compute_pk_paths_and_sorting(user_fields, pk_user_indices) → Parquet PK paths + sorting columns.
  • record::typed::tonbo_schema_from_typed<R: SchemaMeta>() → full schema + PK paths/sorting + full indices.
  • record::typed::TonboTypedBuilders<R: BuildRows + Default> → sentinels + typed delegate; tombstone handling.
  • record::typed::TonboTypedArrays → wraps RecordBatch; ArrowArrays impl.
  • #[tonbo::typed::record] attribute macro → injects typed_arrow::Record derive; encodes PK metadata; generates only Tonbo DB glue.
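
The relationship between user indices, full indices, and sorting columns can be sketched as follows. The metadata keys come from this RFC; everything else is an assumption for illustration: the function names are hypothetical, the comma-separated encoding is invented, and sorting columns are modelled as `(column_index, descending)` pairs rather than Parquet types.

```rust
use std::collections::BTreeMap;

/// `_null` and `_ts` occupy full-schema positions 0 and 1.
const SENTINEL_COUNT: usize = 2;
const TS_INDEX: usize = 1;

/// Shift PK indices from user-field space into full-schema space.
fn pk_full_indices(pk_user_indices: &[usize]) -> Vec<usize> {
    pk_user_indices.iter().map(|i| i + SENTINEL_COUNT).collect()
}

/// `_ts` descending first, then each PK column ascending.
fn sorting_columns(pk_user_indices: &[usize]) -> Vec<(usize, bool)> {
    let mut cols = vec![(TS_INDEX, true)];
    cols.extend(pk_full_indices(pk_user_indices).into_iter().map(|i| (i, false)));
    cols
}

/// Assumed encoding: indices joined by commas, e.g. "0,2".
fn join_indices(indices: &[usize]) -> String {
    indices
        .iter()
        .map(|i| i.to_string())
        .collect::<Vec<_>>()
        .join(",")
}

/// Build the PK-related schema metadata entries.
fn pk_metadata(pk_user_indices: &[usize]) -> BTreeMap<String, String> {
    let mut meta = BTreeMap::new();
    meta.insert(
        "tonbo.primary_key_user_indices".to_string(),
        join_indices(pk_user_indices),
    );
    meta.insert(
        "tonbo.primary_key_full_indices".to_string(),
        join_indices(&pk_full_indices(pk_user_indices)),
    );
    meta
}
```

Storing both index spaces in metadata means readers never have to re-derive the sentinel offset themselves.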

Example usage (typed record + DB):

#[tonbo::typed::record]
#[derive(Debug, Clone, Default)]
pub struct Person {
    #[record(primary_key)]
    id: i64,
    name: String,
    age: Option<i16>,
}

let options = DbOption::new(Path::from_filesystem_path("./db_path/people")?, &PersonSchema);
let db: DB<Person, TokioExecutor> = DB::new(options, TokioExecutor::default(), PersonSchema).await?;
db.insert(Person { id: 1, name: "Alice".into(), age: Some(30) }).await?;
