RFC: typed-arrow Integration that Reduces Tonbo Complexity

## Summary

Refactor Tonbo in place to adopt the typed-arrow family (`typed-arrow`, `typed-arrow-dyn`, and optionally `typed-arrow-unified`) as the primary engine for Arrow schema/array handling. Tonbo will keep only its database semantics (primary key, `_null`/`_ts`, tombstones, Parquet sorting/pushdown, projection) as a thin layer. This reduces Tonbo macro/code complexity while increasing Arrow compatibility and maintaining behavior and performance.

## Motivation

- Better Arrow compatibility: support Struct, List/LargeList/FixedSizeList, Map, Union, Dictionary, timestamps, fixed-size binary.
- Reduce Tonbo macro+array/builder code: centralize Arrow typing in typed-arrow; keep Tonbo focused on storage/LSM/WAL/compaction/scan.
- Improve maintainability and interop (DataFusion/Parquet): use standard Arrow builders/arrays and schema metadata.
- Avoid migration risk: keep working storage paths and public APIs stable while swapping the Arrow layer incrementally.

## Background (Today)

- Compile-time path: `#[derive(Record)]` from `tonbo_macros` generates `<Type>Schema`, `<Type>Ref<'_>`, `<Type>ImmutableArrays`, and `<Type>Builder`. The generated schema prepends `_null` and `_ts` and sets PK indices and Parquet sorting; projection and tombstones are supported.
- Runtime path: `DynSchema` + `DynRecord{,Ref}` + dynamic arrays/builder; supports lists (deep), partial coverage for other nested types; tombstones preserve PK.
- Parquet/DataFusion: rely on Arrow `RecordBatch` and Parquet sorting/paths (TS then PK).

## Goals

- Maintain Tonbo DB semantics: PK extraction, `_null`/`_ts` sentinels, tombstones that still append PK, Parquet sorting/paths, projection.
- Expand Arrow type coverage via typed-arrow.
- Keep public API stable where possible (existing derive and traits), with migration shims if needed.
- Achieve performance parity (or better) for row append and read/projection.

## Non-Goals

- Rewriting LSM, WAL, compaction, or transaction semantics.
- Changing Tonbo’s high-level user API/ergonomics beyond necessary adapters.

## High-Level Design

1) typed-arrow as the source of truth for schemas/arrays/builders
- Users write a single annotation `#[tonbo::typed::record]` on a struct, marking PK fields with `#[record(primary_key)]`.
- The attribute macro injects `#[derive(typed_arrow::Record)]` and encodes PK user-indices in schema metadata.
- No per-type Tonbo builders/arrays are generated anymore for typed records.

2) Minimal Tonbo DB glue, generic over the typed record
- `TonboTypedSchema<R>` implements Tonbo’s `record::Schema` by composing `[_null, _ts] + R::fields()` and providing PK paths/sorting and indices. It embeds metadata (`tonbo.primary_key_user_indices`, `tonbo.primary_key_full_indices`, `tonbo.sorting_columns`).
- `TonboTypedBuilders<R>` implements `ArrowArraysBuilder` and delegates user columns to `R::Builders`, while appending sentinels and preserving PK in tombstones.
- `TonboTypedArrays` implements `ArrowArrays` and exposes a full `RecordBatch`. `get()` performs projection while keeping PK columns.
- The macro still generates only the type glue needed by Tonbo DB: `impl Record for T` (with `type Schema = TonboTypedSchema<T>` and a `TRef<'_>`), plus Encode/Decode for WAL.
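
The index bookkeeping described in this item can be sketched with plain vectors. This is a minimal illustration, not the adapter code: the function names are hypothetical, columns are modeled as names rather than Arrow `Field`s, and it assumes that a projection always keeps both sentinels and every PK column.

```rust
/// Sentinel columns that precede the user fields in the full schema.
const SENTINELS: [&str; 2] = ["_null", "_ts"];

/// Column names of the full schema: sentinels first, then user fields in order.
/// (Hypothetical helper; the real adapter would work on Arrow fields.)
fn compose_field_names(user_fields: &[&str]) -> Vec<String> {
    let mut names: Vec<String> = SENTINELS.iter().map(|s| s.to_string()).collect();
    names.extend(user_fields.iter().map(|s| s.to_string()));
    names
}

/// Map PK positions among the user fields to positions in the full schema.
fn pk_full_indices(pk_user_indices: &[usize]) -> Vec<usize> {
    pk_user_indices.iter().map(|i| i + SENTINELS.len()).collect()
}

/// Full-schema column indices to read for a projection over user fields.
/// Assumption: sentinels and PK columns are always kept.
fn projected_indices(projection_user: &[usize], pk_user_indices: &[usize]) -> Vec<usize> {
    let mut cols: Vec<usize> = (0..SENTINELS.len()).collect();
    cols.extend(pk_full_indices(pk_user_indices));
    cols.extend(projection_user.iter().map(|i| i + SENTINELS.len()));
    cols.sort_unstable();
    cols.dedup();
    cols
}

fn main() {
    // Hypothetical record: Person { id (PK), name, age }.
    let user_fields = ["id", "name", "age"];
    let pk_user = [0];

    // ["_null", "_ts", "id", "name", "age"]
    println!("{:?}", compose_field_names(&user_fields));
    // [2]
    println!("{:?}", pk_full_indices(&pk_user));
    // Projecting only `age` still reads `_null`, `_ts`, and `id`: [0, 1, 2, 4]
    println!("{:?}", projected_indices(&[2], &pk_user));
}
```

The fixed offset of two is what separates user indices from full indices, which is presumably why both `tonbo.primary_key_user_indices` and `tonbo.primary_key_full_indices` are recorded in the metadata.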

3) Runtime (dynamic) path via typed-arrow-dyn
- Replace custom dynamic arrays with `typed-arrow-dyn` under a thin Tonbo adapter that adds sentinels and preserves PK on tombstones. Provide an interim `Value/ValueRef` translation if needed.

4) Optional unified facade
- `typed-arrow-unified` can offer a single surface across compile-time and runtime records. The Tonbo layer remains focused on DB semantics either way.

## Arrow Compatibility

- Broader Arrow types (compile-time): Struct, List/LargeList/FixedSizeList, Map, Union, Dictionary, timestamps, fixed-size binary.
- Dynamic runtime: `typed-arrow-dyn` builds from any `arrow_schema::SchemaRef` and supports nested Struct/List variants out-of-the-box.
- Native Arrow builders/arrays: typed-arrow code uses arrow-rs typed builders/arrays directly.
- Interop: improved integration with DataFusion/Parquet and tooling; schemas/arrays are standard.
- Tonbo-specific layer remains: `_null`/`_ts`, PK sorting/paths, tombstones, and key ordering — implemented as thin adapters.

## Detailed Plan & Phases

Phase 1 — Make typed-arrow the default path for typed records (no feature flag)
- Implement and default to `TonboTypedSchema<R>`, `TonboTypedBuilders<R>`, and `TonboTypedArrays` for any struct annotated with `#[tonbo::typed::record]`.
- Attribute macro generates only DB glue (Record/Ref + Encode/Decode). It injects PK metadata and `#[derive(typed_arrow::Record)]` automatically.
- Existing non-typed derive `#[derive(tonbo_macros::Record)]` remains supported and unchanged (legacy path stays intact).

Phase 2 — Projection, tombstones, Parquet parity
- Ensure `RecordRef::from_record_batch()` on typed batches keeps PK cols and applies projection identically to legacy.
- Validate tombstones: append `_null = true` rows while still writing PK column values; enforce sort order `_ts(desc), pk(asc)`.
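
The tombstone and ordering checks in this phase can be illustrated with a small stand-alone model. It is a sketch under stated assumptions: `Row`, `make_row`, and `sort_order` are hypothetical names, the record has a single `i64` primary key, `_ts` is modeled as `u32`, and a tombstone fills the payload column with an empty placeholder.

```rust
use std::cmp::Ordering;

/// One row with both sentinels and a single-column primary key.
#[derive(Debug, Clone, PartialEq)]
struct Row {
    null: bool,   // `_null`: true marks a tombstone
    ts: u32,      // `_ts`: commit timestamp
    id: i64,      // primary key, written for live rows and tombstones alike
    name: String, // user payload; placeholder for tombstones
}

/// Build a row from a key, a timestamp, and an optional payload.
/// `None` produces a tombstone: `_null = true`, PK still written.
fn make_row(id: i64, ts: u32, name: Option<&str>) -> Row {
    Row {
        null: name.is_none(),
        ts,
        id,
        name: name.unwrap_or("").to_string(),
    }
}

/// Ordering stated in this phase: `_ts` descending, then PK ascending.
fn sort_order(a: &Row, b: &Row) -> Ordering {
    b.ts.cmp(&a.ts).then(a.id.cmp(&b.id))
}

fn main() {
    let mut rows = vec![
        make_row(2, 5, Some("Bob")),
        make_row(1, 5, Some("Alice")),
        make_row(1, 7, None), // delete of id = 1 at ts = 7
    ];
    rows.sort_by(sort_order);
    for row in &rows {
        println!("{:?}", row);
    }
}
```

Because the tombstone keeps `id = 1`, it sorts next to the live versions of the same key and can shadow them during reads and compaction.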

Phase 3 — Dynamic/runtime path
- Replace custom dynamic arrays with `typed-arrow-dyn` + Tonbo adapter; keep `DynRecord` compatibility until fully retired.

Phase 4 — Cleanup
- Remove per-type Tonbo builder/arrays codegen for typed records (already unused by the attribute path). Keep legacy derive for users not ready to migrate.

## Public API & Compatibility

- New single-annotation usage for typed records:
  - `#[tonbo::typed::record]` on the struct, with one or more `#[record(primary_key)]` fields.
  - The macro injects `#[derive(typed_arrow::Record)]` and generates Tonbo DB glue (Record/Ref + Encode/Decode).
- Legacy `#[derive(tonbo_macros::Record)]` remains supported; it keeps generating per-type builders/arrays and schema as before.
- Arrow schema shape stays: `_null`, `_ts`, then user fields with the same names and nullability.
- WAL format, transactions, and compaction are unaffected.

## Testing & Benchmarks

- Mirror existing tests for typed records using the new adapters: projection, `from_record_batch`, tombstones, compaction flows.
- Criterion benches: compare append throughput/memory and projection cost for typed vs legacy per record.
- Parquet/Dataset round-trips to verify sorting/paths and DataFusion compatibility.

## Risks & Mitigations

- Semantics drift (PK, tombstones): codify behavior in tests; adapters own sentinels and PK handling.
- Performance regressions: measure with microbenches; if needed, optimize sentinel writes and PK preservation.
- Type coverage mismatches: typed-arrow widens coverage; if gaps arise, add focused extensions in typed-arrow rather than Tonbo.

## Alternatives

- Big-bang migration to a new repo: simpler starting skeleton but high integration risk and slower validation. In-place refactor offers incremental value with lower risk.

## Rollout & Milestones

- M1: Adapters + pilot record behind feature; tests green; initial benches.
- M2: Derive bridge and attribute mapping; 2–3 records ported.
- M3: Dynamic path via typed-arrow-dyn; end-to-end parity on reads/scans/projection.
- M4: Parquet validation and DataFusion interop checks.
- M5: Default on adapters; deprecate legacy code; documentation updates.

## Success Metrics

- Functional parity: existing tests pass; examples unchanged.
- Performance parity: within ±5% on write/read microbenches.
- Arrow coverage: nested types and advanced Arrow features available to Tonbo with minimal glue.
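
One way to make the ±5% performance criterion checkable is to compare each typed-path measurement against its legacy baseline as a relative change. The helper names and the numbers below are hypothetical, not measurements.

```rust
/// Relative change of `candidate` versus `baseline`, as a fraction.
fn relative_change(baseline: f64, candidate: f64) -> f64 {
    (candidate - baseline) / baseline
}

/// True when the candidate is within `tolerance` (e.g. 0.05 for ±5%) of the baseline.
fn within_parity(baseline: f64, candidate: f64, tolerance: f64) -> bool {
    relative_change(baseline, candidate).abs() <= tolerance
}

fn main() {
    // Hypothetical append throughput in rows per second.
    let legacy = 1_000_000.0;
    let typed = 970_000.0;
    println!("within parity: {}", within_parity(legacy, typed, 0.05));
}
```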

## Appendix: Key Types (Implemented/Planned)

- `record::typed::compose_arrow_schema_with_sentinels(user_fields)` → Arrow `SchemaRef` with `[_null, _ts] + fields`.
- `record::typed::compute_pk_paths_and_sorting(user_fields, pk_user_indices)` → Parquet PK paths + sorting columns.
- `record::typed::tonbo_schema_from_typed<R: SchemaMeta>()` → full schema + PK paths/sorting + full indices.
- `record::typed::TonboTypedBuilders<R: BuildRows + Default>` → sentinels + typed delegate; tombstone handling.
- `record::typed::TonboTypedArrays` → wraps `RecordBatch`; `ArrowArrays` impl.
- `#[tonbo::typed::record]` attribute macro → injects `typed_arrow::Record` derive; encodes PK metadata; generates only Tonbo DB glue.

Example usage (typed record + DB):

```rust
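use fusio::path::Path;
use tonbo::{executor::tokio::TokioExecutor, DbOption, DB};

// Single annotation: injects the typed-arrow derive and generates Tonbo DB glue.
#[tonbo::typed::record]
#[derive(Debug, Clone, Default)]
pub struct Person {
    #[record(primary_key)]
    id: i64,
    name: String,
    age: Option<i16>,
}

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Open (or create) the database at a local path.
    let options = DbOption::new(Path::from_filesystem_path("./db_path/people")?, &PersonSchema);
    let db: DB<Person, TokioExecutor> =
        DB::new(options, TokioExecutor::default(), PersonSchema).await?;
    db.insert(Person { id: 1, name: "Alice".into(), age: Some(30) }).await?;
    Ok(())
}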
```

