@mmaslankaprv mmaslankaprv commented Jun 20, 2025

It may sometimes be necessary to enable the batch cache for the __consumer_offsets topic. Added a cluster property that allows us to change the cache settings.

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v25.1.x
  • v24.3.x
  • v24.2.x

Release Notes

Improvements

  • ability to control batch cache settings for __consumer_offsets topic

@mmaslankaprv mmaslankaprv requested a review from a team as a code owner June 20, 2025 08:18
@mmaslankaprv mmaslankaprv requested review from dotnwat, bharathv, bashtanov, ztlpn and joe-redpanda and removed request for a team June 20, 2025 08:49
@mmaslankaprv mmaslankaprv force-pushed the co-batch-cache-option branch from 90a4884 to 283213e Compare June 20, 2025 15:49
bashtanov previously approved these changes Jun 20, 2025
joe-redpanda previously approved these changes Jun 20, 2025
"topic. By default cache for consumer offsets topic is disabled. "
"Changing this property is not recommended in production systems as it "
"may affect performance. The change is applied only after the restart",
{.needs_restart = needs_restart::yes, .visibility = visibility::tunable},
Contributor

not a blocker, just curious: why would this need a restart (probably a disruptive action in certain clusters)? It seems like it could be made to pick up the change for new segments without a restart.

Contributor

I would also recommend considering making this needs_restart::no because, to Bharath's point, cloud operators will automatically detect when a needs_restart::yes property is flipped and trigger a rolling restart of the cluster. If the intent here is quick debuggability, needs_restart::yes will probably cause headaches.
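
To make the concern concrete, here is a minimal sketch of the kind of check described above. It is not the cloud operator's actual logic. The `property_change` type, the `requires_rolling_restart` function, and any property names passed to it are hypothetical. Only the `needs_restart::yes` / `needs_restart::no` distinction comes from the property definition in this PR.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Mirrors the needs_restart flag seen in the property definition above.
enum class needs_restart { no, yes };

// Hypothetical record of one changed cluster property.
struct property_change {
    std::string name;
    needs_restart restart;
};

// Assumption: an operator restarts the cluster if any changed property
// is marked needs_restart::yes. Purely dynamic changes do not trigger it.
bool requires_rolling_restart(const std::vector<property_change>& changes) {
    return std::any_of(
      changes.begin(), changes.end(), [](const property_change& c) {
          return c.restart == needs_restart::yes;
      });
}
```

Under this assumption, flipping a single needs_restart::yes property is enough to restart every broker in turn, which is what makes it costly for quick debugging.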

Member Author

It is not so easy: the log property is set when a partition replica is created on the node, and we do not change that property dynamically. I wanted to make sure the cluster view is consistent, so I made it require a restart.
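
A simplified sketch of the behavior described here, not the actual Redpanda code: the cache flag is copied into the log settings once, when the replica object is constructed, so later changes to the cluster property are not seen by existing replicas. All type and member names (`cluster_config`, `consumer_offsets_cache_enabled`, `log_settings`, `partition_replica`) are hypothetical. Enabling the cache for every other topic is an assumption made only to keep the example small.

```cpp
#include <string>
#include <utility>

// Hypothetical stand-in for the cluster property added in this PR.
struct cluster_config {
    bool consumer_offsets_cache_enabled = false;
};

// Hypothetical per-replica log settings.
struct log_settings {
    bool batch_cache_enabled;
};

class partition_replica {
public:
    // The property is read exactly once, at replica creation time.
    partition_replica(std::string topic, const cluster_config& cfg)
      : _topic(std::move(topic))
      , _settings(make_settings(_topic, cfg)) {}

    bool batch_cache_enabled() const { return _settings.batch_cache_enabled; }

private:
    static log_settings
    make_settings(const std::string& topic, const cluster_config& cfg) {
        if (topic == "__consumer_offsets") {
            return log_settings{cfg.consumer_offsets_cache_enabled};
        }
        // Assumption for illustration: other topics keep the cache enabled.
        return log_settings{true};
    }

    std::string _topic;
    log_settings _settings; // a snapshot, never refreshed
};
```

In this sketch, a replica created before the property is flipped keeps the old value while a replica created afterwards gets the new one, so nodes could disagree until every replica has been recreated. Requiring a restart avoids that mixed state.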

Member

makes sense. i also didn't expect it to be nuanced like this

@vbotbuildovich
Collaborator

CI test results

test results on build#67663

| test_class | test_method | test_arguments | test_kind | job_url | test_status | passed | reason |
| --- | --- | --- | --- | --- | --- | --- | --- |
| cluster_metadata_uploader_fixture | test_upload_in_term | | unit | https://buildkite.com/redpanda/redpanda/builds/67663#01978e09-6f43-480e-8d4c-791cd67fe1a4 | FAIL | 0/1 | |
| DatalakeDelayedEnablementTest | test_enabling_iceberg_in_existing_cluster | {"catalog_type": "rest_jdbc", "cloud_storage_type": 1} | ducktape | https://buildkite.com/redpanda/redpanda/builds/67663#01978e3b-25cb-41f7-8d44-57bf7f53e4bc | FLAKY | 15/21 | upstream reliability is '75.75757575757575'. current run reliability is '71.42857142857143'. drift is 4.329 and the allowed drift is set to 50. The test should PASS |
| PartitionBalancerTest | test_fuzz_admin_ops | | ducktape | https://buildkite.com/redpanda/redpanda/builds/67663#01978e3b-25cd-426b-9ada-a604d4cd8291 | FLAKY | 18/21 | upstream reliability is '98.85057471264368'. current run reliability is '85.71428571428571'. drift is 13.13629 and the allowed drift is set to 50. The test should PASS |
| RandomNodeOperationsTest | test_node_operations | {"cloud_storage_type": 1, "compaction_mode": "sliding_window", "enable_failures": true, "mixed_versions": false, "with_iceberg": false} | ducktape | https://buildkite.com/redpanda/redpanda/builds/67663#01978e3b-25d0-4971-ab85-b707208c930b | FLAKY | 19/21 | upstream reliability is '97.97297297297297'. current run reliability is '90.47619047619048'. drift is 7.49678 and the allowed drift is set to 50. The test should PASS |

…sets

It may sometimes be beneficial for troubleshooting to enable the batch
cache for the `__consumer_offsets` topic. The cache is disabled for this
topic because it is read only once, when applying batches to the stm. The
batches written to the topic are usually very small. Not storing them in
the cache relieves pressure on the cache index.

Signed-off-by: Michał Maślanka <[email protected]>
Added code handling a configuration property that controls the presence
of the cache for the `__consumer_offsets` topic

Signed-off-by: Michał Maślanka <[email protected]>
Added a test case that validates that the `__consumer_offsets` cache is
enabled when the cluster property changes.

Signed-off-by: Michał Maślanka <[email protected]>
@mmaslankaprv mmaslankaprv merged commit 543ce49 into redpanda-data:dev Jun 24, 2025
18 checks passed
@vbotbuildovich
Collaborator

/backport v25.1.x

@vbotbuildovich
Collaborator

/backport v24.3.x

@vbotbuildovich
Collaborator

Failed to create a backport PR to v24.3.x branch. I tried:

git remote add upstream https://github.com/redpanda-data/redpanda.git
git fetch --all
git checkout -b backport-pr-26522-v24.3.x-137 remotes/upstream/v24.3.x
git cherry-pick -x d402ebdddf e3995b1260 094ddcb295

Workflow run logs.

8 participants