Summary
Refactor Tonbo in place to adopt the typed-arrow family (`typed-arrow`, `typed-arrow-dyn`, and optionally `typed-arrow-unified`) as the primary engine for Arrow schema/array handling. Tonbo will keep only its database semantics (primary key, `_null`/`_ts`, tombstones, Parquet sorting/pushdown, projection) as a thin layer. This reduces Tonbo macro/code complexity while increasing Arrow compatibility and maintaining behavior and performance.
Motivation
- Better Arrow compatibility: support Struct, List/LargeList/FixedSizeList, Map, Union, Dictionary, timestamps, fixed-size binary.
- Reduce Tonbo macro+array/builder code: centralize Arrow typing in typed-arrow; keep Tonbo focused on storage/LSM/WAL/compaction/scan.
- Improve maintainability and interop (DataFusion/Parquet): use standard Arrow builders/arrays and schema metadata.
- Avoid migration risk: keep working storage paths and public APIs stable while swapping the Arrow layer incrementally.
Background (Today)
- Compile-time path: `tonbo_macros::#[derive(Record)]` generates `<Type>Schema`, `<Type>Ref<'_>`, `<Type>ImmutableArrays`, and `<Type>Builder`; the schema prepends `_null`/`_ts` and sets PK indices and Parquet sorting; projection and tombstones are supported.
- Runtime path: `DynSchema` + `DynRecord{,Ref}` + dynamic arrays/builders; supports lists (deep), with partial coverage for other nested types; tombstones preserve the PK.
- Parquet/DataFusion: rely on Arrow `RecordBatch` and Parquet sorting/paths (TS then PK).
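The schema shape described above (`_null`/`_ts` prepended to the user's fields) can be sketched with a minimal stand-in for Arrow fields. This is illustrative only: the real code would use `arrow_schema::Field`/`Schema`, and the `_ts` column type shown here is an assumption.

```rust
// Minimal stand-in for an Arrow field: (name, logical type, nullable).
// Illustrative only; Tonbo uses arrow_schema::Field/Schema in practice.
#[derive(Debug, Clone, PartialEq)]
struct Field {
    name: String,
    data_type: &'static str,
    nullable: bool,
}

impl Field {
    fn new(name: &str, data_type: &'static str, nullable: bool) -> Self {
        Self { name: name.into(), data_type, nullable }
    }
}

/// Prepend the Tonbo sentinel columns `_null` (tombstone flag) and `_ts`
/// (commit timestamp) to the user-declared fields.
fn compose_schema_with_sentinels(user_fields: Vec<Field>) -> Vec<Field> {
    let mut fields = vec![
        Field::new("_null", "Boolean", false),
        Field::new("_ts", "UInt32", false), // timestamp type is an assumption
    ];
    fields.extend(user_fields);
    fields
}

fn main() {
    let user = vec![
        Field::new("id", "Int64", false),
        Field::new("name", "Utf8", false),
        Field::new("age", "Int16", true),
    ];
    let full = compose_schema_with_sentinels(user);
    let names: Vec<&str> = full.iter().map(|f| f.name.as_str()).collect();
    println!("{:?}", names); // ["_null", "_ts", "id", "name", "age"]
}
```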
Goals
- Maintain Tonbo DB semantics: PK extraction, `_null`/`_ts` sentinels, tombstones that still append the PK, Parquet sorting/paths, projection.
- Expand Arrow type coverage via typed-arrow.
- Keep public API stable where possible (existing derive and traits), with migration shims if needed.
- Achieve performance parity (or better) for row append and read/projection.
Non-Goals
- Rewriting LSM, WAL, compaction, or transaction semantics.
- Changing Tonbo’s high-level user API/ergonomics beyond necessary adapters.
High-Level Design
- typed-arrow as the source of truth for schemas/arrays/builders
  - Users write a single annotation `#[tonbo::typed::record]` on a struct, marking PK fields with `#[record(primary_key)]`.
  - The attribute macro injects `#[derive(typed_arrow::Record)]` and encodes PK user-indices in schema metadata.
  - No per-type Tonbo builders/arrays are generated anymore for typed records.
- Minimal Tonbo DB glue, generic over the typed record
  - `TonboTypedSchema<R>` implements Tonbo's `record::Schema` by composing `[_null, _ts] + R::fields()` and providing PK paths/sorting and indices. It embeds metadata (`tonbo.primary_key_user_indices`, `tonbo.primary_key_full_indices`, `tonbo.sorting_columns`).
  - `TonboTypedBuilders<R>` implements `ArrowArraysBuilder` and delegates user columns to `R::Builders`, while appending sentinels and preserving the PK in tombstones.
  - `TonboTypedArrays` implements `ArrowArrays` and exposes a full `RecordBatch`; `get()` performs projection while keeping PK columns.
  - The macro still generates only the type glue needed by Tonbo DB: `impl Record for T` (with `type Schema = TonboTypedSchema<T>` and a `TRef<'_>`), plus Encode/Decode for WAL.
- Runtime (dynamic) path via typed-arrow-dyn
  - Replace custom dynamic arrays with `typed-arrow-dyn` under a thin Tonbo adapter that adds sentinels and preserves the PK on tombstones. Provide an interim `Value`/`ValueRef` translation if needed.
- Optional unified facade
  - `typed-arrow-unified` can offer a single surface across compile-time and runtime records. The Tonbo layer remains focused on DB semantics either way.
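The `get()` projection rule (a user projection plus always-kept sentinel and PK columns) can be sketched as index arithmetic. The helper name and the fixed sentinel offset of 2 are assumptions for illustration.

```rust
use std::collections::BTreeSet;

/// Map a user-level projection to full-batch column indices, always keeping
/// the sentinel columns (`_null` at 0, `_ts` at 1) and the PK columns.
/// `user_projection` and `pk_user_indices` index into the user fields only.
fn full_projection(user_projection: &[usize], pk_user_indices: &[usize]) -> Vec<usize> {
    const SENTINELS: usize = 2; // _null, _ts prepended before user fields
    let mut cols: BTreeSet<usize> = [0, 1].into(); // sentinels are always kept
    for &i in user_projection.iter().chain(pk_user_indices) {
        cols.insert(i + SENTINELS);
    }
    cols.into_iter().collect()
}

fn main() {
    // User fields: 0 = id (PK), 1 = name, 2 = age. Project only `age`.
    let cols = full_projection(&[2], &[0]);
    println!("{:?}", cols); // [0, 1, 2, 4]: sentinels, PK, then age
}
```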
Arrow Compatibility
- Broader Arrow types (compile-time): Struct, List/LargeList/FixedSizeList, Map, Union, Dictionary, timestamps, fixed-size binary.
- Dynamic runtime: `typed-arrow-dyn` builds from any `arrow_schema::SchemaRef` and supports nested Struct/List variants out of the box.
- Native Arrow builders/arrays: typed-arrow code uses arrow-rs typed builders/arrays directly.
- Interop: improved integration with DataFusion/Parquet and tooling; schemas/arrays are standard.
- Tonbo-specific layer remains: `_null`/`_ts`, PK sorting/paths, tombstones, and key ordering, implemented as thin adapters.
Detailed Plan & Phases
Phase 1 — Make typed-arrow the default path for typed records (no feature flag)
- Implement and default to `TonboTypedSchema<R>`, `TonboTypedBuilders<R>`, and `TonboTypedArrays` for any struct annotated with `#[tonbo::typed::record]`.
- The attribute macro generates only DB glue (Record/Ref + Encode/Decode). It injects PK metadata and `#[derive(typed_arrow::Record)]` automatically.
- The existing non-typed derive `#[derive(tonbo_macros::Record)]` remains supported and unchanged (the legacy path stays intact).
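The PK metadata the attribute macro would embed can be sketched as plain key/value schema metadata. The key names come from the design above; the comma-separated index encoding and the helper name are assumptions.

```rust
use std::collections::HashMap;

/// Encode primary-key indices into schema metadata, as the attribute macro
/// would. User indices are positions among user fields; full indices shift
/// by 2 to account for the `_null`/`_ts` sentinel columns.
fn pk_metadata(pk_user_indices: &[usize]) -> HashMap<String, String> {
    let join = |ix: &[usize]| {
        ix.iter().map(|i| i.to_string()).collect::<Vec<_>>().join(",")
    };
    let full: Vec<usize> = pk_user_indices.iter().map(|i| i + 2).collect();
    HashMap::from([
        ("tonbo.primary_key_user_indices".to_string(), join(pk_user_indices)),
        ("tonbo.primary_key_full_indices".to_string(), join(&full)),
    ])
}

fn main() {
    let meta = pk_metadata(&[0]);
    println!("{:?}", meta.get("tonbo.primary_key_full_indices")); // Some("2")
}
```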
Phase 2 — Projection, tombstones, Parquet parity
- Ensure `RecordRef::from_record_batch()` on typed batches keeps PK columns and applies projection identically to the legacy path.
- Validate tombstones: append `_null = true` rows while still writing PK column values; enforce the sort order `_ts(desc), pk(asc)`.
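The `_ts(desc), pk(asc)` ordering above can be sketched as a comparator over `(ts, pk, _null)` rows. The concrete key types and helper name are assumptions; the point is that tombstones carry their PK and sort like any other row.

```rust
use std::cmp::Reverse;

/// Sort rows by commit timestamp (descending) then primary key (ascending).
/// The `_null` tombstone flag does not affect ordering; the PK is still
/// written for tombstones so deletes can be matched during merge/scan.
fn sort_rows(rows: &mut Vec<(u32 /* _ts */, i64 /* pk */, bool /* _null */)>) {
    rows.sort_by_key(|&(ts, pk, _null)| (Reverse(ts), pk));
}

fn main() {
    let mut rows = vec![
        (1, 10, false), // older insert of pk 10
        (2, 10, true),  // newer tombstone of pk 10, still carries its PK
        (2, 5, false),
    ];
    sort_rows(&mut rows);
    // _ts desc, pk asc: (2, 5), (2, 10 tombstone), (1, 10)
    println!("{:?}", rows);
}
```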
Phase 3 — Dynamic/runtime path
- Replace custom dynamic arrays with `typed-arrow-dyn` + a Tonbo adapter; keep `DynRecord` compatibility until fully retired.
Phase 4 — Cleanup
- Remove per-type Tonbo builder/arrays codegen for typed records (already unused by the attribute path). Keep legacy derive for users not ready to migrate.
Public API & Compatibility
- New single-annotation usage for typed records: `#[tonbo::typed::record]` on the struct, with one or more `#[record(primary_key)]` fields. The macro injects `#[derive(typed_arrow::Record)]` and generates Tonbo DB glue (Record/Ref + Encode/Decode).
- Legacy `#[derive(tonbo_macros::Record)]` remains supported; it keeps generating per-type builders/arrays and schema as before.
- Arrow schema shape stays: `_null`, `_ts`, then user fields with the same names and nullability.
- WAL format, transactions, and compaction are unaffected.
Testing & Benchmarks
- Mirror existing tests for typed records using the new adapters: projection, `from_record_batch`, tombstones, compaction flows.
- Criterion benches: compare append throughput/memory and projection cost for typed vs legacy per record.
- Parquet/Dataset round-trips to verify sorting/paths and DataFusion compatibility.
Risks & Mitigations
- Semantics drift (PK, tombstones): codify behavior in tests; adapters own sentinels and PK handling.
- Performance regressions: measure with microbenches; if needed, optimize sentinel writes and PK preservation.
- Type coverage mismatches: typed-arrow widens coverage; if gaps arise, add focused extensions in typed-arrow rather than Tonbo.
Alternatives
- Big-bang migration to a new repo: simpler starting skeleton but high integration risk and slower validation. In-place refactor offers incremental value with lower risk.
Rollout & Milestones
- M1: Adapters + pilot record behind feature; tests green; initial benches.
- M2: Derive bridge and attribute mapping; 2–3 records ported.
- M3: Dynamic path via typed-arrow-dyn; end-to-end parity on reads/scans/projection.
- M4: Parquet validation and DataFusion interop checks.
- M5: Default on adapters; deprecate legacy code; documentation updates.
Success Metrics
- Functional parity: existing tests pass; examples unchanged.
- Performance parity: within ±5% on write/read microbenches.
- Arrow coverage: nested types and advanced Arrow features available to Tonbo with minimal glue.
Appendix: Key Types (Implemented/Planned)
- `record::typed::compose_arrow_schema_with_sentinels(user_fields)` → `ArrowSchemaRef` with `[_null, _ts] + fields`.
- `record::typed::compute_pk_paths_and_sorting(user_fields, pk_user_indices)` → Parquet PK paths + sorting columns.
- `record::typed::tonbo_schema_from_typed<R: SchemaMeta>()` → full schema + PK paths/sorting + full indices.
- `record::typed::TonboTypedBuilders<R: BuildRows + Default>` → sentinels + typed delegate; tombstone handling.
- `record::typed::TonboTypedArrays` → wraps `RecordBatch`; `ArrowArrays` impl.
- `#[tonbo::typed::record]` attribute macro → injects the `typed_arrow::Record` derive; encodes PK metadata; generates only Tonbo DB glue.
Example usage (typed record + DB):

```rust
#[tonbo::typed::record]
#[derive(Debug, Clone, Default)]
pub struct Person {
    #[record(primary_key)]
    id: i64,
    name: String,
    age: Option<i16>,
}

let options = DbOption::new(Path::from_filesystem_path("./db_path/people")?, &PersonSchema);
let db: DB<Person, TokioExecutor> = DB::new(options, TokioExecutor::default(), PersonSchema).await?;
db.insert(Person { id: 1, name: "Alice".into(), age: Some(30) }).await?;
```