Release Note 3.1.0 #55502

@morrySnow

Thanks to our devoted developers and supportive community users, the much-anticipated Apache Doris 3.1.0 is now available!

VARIANT

Sparse Columns and Sub-columns With Vertical Compaction

Traditional OLAP systems often encounter metadata bloat, compaction amplification, and query degradation when dealing with "extremely wide tables/excessive columns" (ranging from thousands to tens of thousands). Doris 3.1 leverages the sparsity of VARIANT sub-columns and sub-column-level Vertical Compaction to increase the manageable column limit to the order of tens of thousands.

Through in-depth optimizations at the storage layer, VARIANT delivers the following benefits to users:

  • Stable support for "thousands to tens of thousands" of sub-columns (columnar storage), with smoother query and compaction latencies.
  • Controllable metadata and indexes, avoiding exponential growth.
  • Proven capability to extract over 10,000 sub-columns (columnar storage) with efficient Compaction performance.

Schema Template

Using Schema Template provides the following benefits when working with the VARIANT data type:

  • Type Stability: Critical sub-paths can have their types fixed in the DDL, preventing query errors, index invalidation, and overhead from implicit conversions caused by type drift.
  • Faster and More Accurate Retrieval: Inverted indexing strategies (tokenized/non-tokenized, parsers, phrase search, etc.) can be customized for different sub-paths, resulting in lower latency and more stable hit rates for common queries.
  • Controllable Indexing and Costs: Moves away from "uniform column-wide index inheritance" (an approach in 2.1 that easily leads to bloat) to "fine-grained configuration by sub-path," significantly reducing the number of indexes, write amplification, and storage costs.
  • Improved Maintainability and Collaboration: Equivalent to adding a "data contract" to JSON, ensuring semantic consistency across teams; type and index states are more observable, making issues easier to diagnose.
  • Evolution-Friendly: Core high-frequency paths can be templated with optional indexing, while long-tail fields retain flexible extensibility, preserving scalability.
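
The "data contract" idea can be illustrated with a small conceptual sketch. This is not Doris syntax and not how Doris implements Schema Template; the `TEMPLATE` mapping, the helper names, and the sample rows are all hypothetical. It only shows what fixing the type of a few critical sub-paths means when incoming JSON drifts, while long-tail fields stay untyped.

```python
from typing import Any, Callable, Dict

# Hypothetical template: sub-path -> caster. Illustrative only; this is
# neither Doris DDL nor Doris internals.
TEMPLATE: Dict[str, Callable[[Any], Any]] = {
    "user.id": int,
    "user.name": str,
}


def flatten(doc: Dict[str, Any], prefix: str = "") -> Dict[str, Any]:
    """Flatten nested dicts into dotted sub-paths."""
    out: Dict[str, Any] = {}
    for key, value in doc.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            out.update(flatten(value, path))
        else:
            out[path] = value
    return out


def apply_template(doc: Dict[str, Any]) -> Dict[str, Any]:
    """Cast templated sub-paths to their fixed type; leave other paths as-is."""
    flat = flatten(doc)
    return {p: TEMPLATE[p](v) if p in TEMPLATE else v for p, v in flat.items()}


rows = [
    {"user": {"id": 42, "name": "a"}, "extra": 1.5},
    {"user": {"id": "43", "name": "b"}, "extra": "n/a"},  # id drifts to a string
]
typed = [apply_template(r) for r in rows]
print(typed)
```

In this sketch `user.id` ends up as an integer in both rows even though the second row supplied a string, while the untemplated `extra` field keeps whatever type arrived.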

Inverted Index

Inverted Index Storage Format V3

V3 further optimizes storage compared to V2: index files are smaller, which reduces disk usage and I/O overhead. Based on test results from the httplogs and logsbench datasets, V3 reduces storage space by up to 20%, making it well suited to large-scale text data and log analytics scenarios.

New Tokenizers

  • ICU (International Components for Unicode) Tokenizer - Handles internationalized text containing complex writing systems; particularly suitable for multilingual mixed documents
  • IK Tokenizer - Advanced algorithm-based Chinese tokenization that combines dictionary and statistical models
  • Basic Tokenizer - Basic tokenization that segments text by character type

Custom Tokenizer

The custom tokenization feature lets users assemble their own analysis pipeline, further improving text retrieval recall. It overcomes the limitations of the built-in tokenizers by allowing character filters, tokenizers, and token filters to be combined as needed. Because this combination precisely defines how text is segmented into searchable terms, it directly determines the relevance of search results and the accuracy of data analysis.
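
The three-stage composition described above can be sketched conceptually in Python. This is not Doris syntax or the Doris implementation; every function name here (`replace_separators`, `whitespace_tokenizer`, `lowercase`, `analyze`) is a hypothetical stand-in that only shows the order in which character filters, a tokenizer, and token filters are applied.

```python
import re
from typing import Callable, List

CharFilter = Callable[[str], str]
Tokenizer = Callable[[str], List[str]]
TokenFilter = Callable[[List[str]], List[str]]


def replace_separators(text: str) -> str:
    """Character filter: turn '_' and '-' into spaces before tokenizing."""
    return re.sub(r"[_\-]", " ", text)


def whitespace_tokenizer(text: str) -> List[str]:
    """Tokenizer: split the filtered text on whitespace."""
    return text.split()


def lowercase(tokens: List[str]) -> List[str]:
    """Token filter: normalize every token to lower case."""
    return [t.lower() for t in tokens]


def analyze(
    text: str,
    char_filters: List[CharFilter],
    tokenizer: Tokenizer,
    token_filters: List[TokenFilter],
) -> List[str]:
    """Run character filters, then the tokenizer, then token filters, in order."""
    for char_filter in char_filters:
        text = char_filter(text)
    tokens = tokenizer(text)
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return tokens


terms = analyze(
    "GET /api/User_Login-v2",
    [replace_separators],
    whitespace_tokenizer,
    [lowercase],
)
print(terms)  # ['get', '/api/user', 'login', 'v2']
```

Swapping any one stage changes which terms become searchable, which is why the choice of combination affects recall.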

LakeHouse

Asynchronous Materialized Views Fully Support Data Lakes

In version 3.1, asynchronous materialized views fully support partitioned incremental building and partition transparent rewriting for Paimon, Iceberg, and Hudi.

Iceberg

Version 3.1.0 introduces multiple optimizations and enhanced capabilities for the Iceberg table format, keeping pace with Iceberg's latest features.

  • Supports full lifecycle management of Branches and Tags
  • Supports querying Iceberg system tables
  • Supports querying Iceberg views
  • Supports modifying Iceberg table schema via ALTER statements

Paimon

Version 3.1.0 introduces multiple feature updates and capability enhancements for the Paimon table format, based on real user scenarios.

  • Supports Paimon Batch Incremental Query
  • Supports reading Branches and Tags
  • Supports querying Paimon system tables

Data Lake Query Performance

Version 3.1.0 introduces multiple deep optimizations for query performance on data lake table formats, aiming to provide users with more stable and efficient data lake analytics capabilities in real production environments.

  • Dynamic Partition Pruning
  • Batch Shard Execution

Storage

  • Flexible Column Updates
  • Optimizes MOW (Merge-on-Write) performance in high-concurrency scenarios in Compute-Storage Decoupled Mode

Query Performance

  • Enhanced partition pruning performance and expanded applicability
  • Provides the capability to optimize queries leveraging data characteristics

Behavior Changes

VARIANT

  • variant_max_subcolumns_count constraint. Within the same table, the variant_max_subcolumns_count setting for all Variant columns must be either all 0 or all greater than 0. Mixing these values will result in an error during table creation or schema change.
  • The new VARIANT read/write/serde and Compaction paths are compatible with existing data. However, queries on VARIANT data upgraded from older versions may show format differences (e.g., additional whitespace, or the '.' delimiter unintentionally creating a hierarchical structure with extra levels).
  • When creating an Inverted Index on a VARIANT column, if no fields in the data meet the indexing criteria, an empty index file will still be generated. This is the expected behavior.
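
The variant_max_subcolumns_count rule above can be expressed as a small pre-flight check. The sketch below is a hypothetical client-side helper, not Doris code; the function name `check_variant_limits` and the column names are assumptions made for illustration. It only encodes the stated rule that, within one table, the values must be all 0 or all greater than 0.

```python
from typing import Dict


def check_variant_limits(limits: Dict[str, int]) -> None:
    """Validate per-column variant_max_subcolumns_count values for one table.

    `limits` maps a VARIANT column name to its configured value.
    Raises ValueError unless the values are all 0 or all greater than 0.
    """
    values = list(limits.values())
    all_zero = all(v == 0 for v in values)
    all_positive = all(v > 0 for v in values)
    if not (all_zero or all_positive):
        raise ValueError(
            "variant_max_subcolumns_count must be all 0 or all > 0 "
            f"within one table, got: {limits}"
        )


# Hypothetical column names and values, for illustration only.
check_variant_limits({"v1": 0, "v2": 0})        # accepted: all 0
check_variant_limits({"v1": 2048, "v2": 100})   # accepted: all > 0
try:
    check_variant_limits({"v1": 0, "v2": 100})  # rejected: mixed
except ValueError as err:
    print(err)
```

Running such a check before issuing a table creation or schema change would surface the mismatch earlier than the server-side error.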

Permissions

  • The permission requirement for "SHOW TRANSACTION" has been changed from requiring ADMIN_PRIV to requiring LOAD_PRIV on the corresponding database for imports.
  • The permissions for SHOW FRONTENDS / BACKENDS and the NODE RESTful API have been unified. Access to these interfaces now requires SELECT_PRIV on the information_schema database.
