r-vasquez
Contributor

Backport of #26091

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v25.1.x
  • v24.3.x
  • v24.2.x

Release Notes

Improvements

  • rpk debug bundle: improve reliability of debug bundle collection in k8s environments.

r-vasquez added 6 commits May 21, 2025 10:24
For Mac users.

We won't need this now that we have upgraded to
Bazel 8.

(cherry picked from commit 17cbc78)
We had build tags for the other files already.

We previously missed the build tag on this file.

(cherry picked from commit 1dc1a5e)
rpk debug bundle works on a best-effort basis. It
always tries to return a bundle, even if some steps
(like logs or resource collection) fail.
Sometimes this is because the system is unhealthy,
which leads to expected errors. Seeing errors doesn't
mean the bundle is useless.

This change aims to make this clearer for the user.

(cherry picked from commit 066732a)
If --namespace is not provided, fall back to $NAMESPACE, then
/var/run/secrets/kubernetes.io/serviceaccount/namespace, and
finally default to "redpanda". This avoids hardcoding and better
supports various deployment environments.

(cherry picked from commit adb9d38)
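
The fallback order described above can be sketched roughly as follows. This is a minimal illustration, not the code from the commit: `resolveNamespace` and its injected `getenv`/`readFile` parameters are hypothetical names chosen so the order is easy to test.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// Path taken from the commit message; Kubernetes mounts it inside pods.
const saNamespacePath = "/var/run/secrets/kubernetes.io/serviceaccount/namespace"

// resolveNamespace is a hypothetical helper (not rpk's actual function).
// It applies the order: flag value, $NAMESPACE, service account file,
// and finally the "redpanda" default.
func resolveNamespace(
	flagValue string,
	getenv func(string) string,
	readFile func(string) ([]byte, error),
) string {
	if flagValue != "" {
		return flagValue
	}
	if ns := getenv("NAMESPACE"); ns != "" {
		return ns
	}
	if b, err := readFile(saNamespacePath); err == nil {
		if ns := strings.TrimSpace(string(b)); ns != "" {
			return ns
		}
	}
	return "redpanda"
}

func main() {
	// An empty flag value stands in for "--namespace was not provided".
	fmt.Println(resolveNamespace("", os.Getenv, os.ReadFile))
}
```

Passing the environment and file readers as parameters is only a convenience for this sketch; it lets each fallback step be exercised without a real cluster.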
When collecting a debug bundle from a
k8s environment, we now:

- Log a warning if admin addresses
  cannot be retrieved from the k8s API.
- Use the union of the profile-defined
  admin addresses and those returned by
  the k8s API.
- Fall back to 127.0.0.1 if no addresses
  are available from either source.
- Log the final list of admin addresses
  used.

This improves reliability and visibility
in environments with incomplete or
misconfigured cluster info.

(cherry picked from commit 939e1ba)
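
As a rough sketch of the address selection listed above: `resolveAdminAddrs` is a hypothetical function, and the real change may differ in log wording, ordering, and how addresses are formatted. The example addresses in `main` are made up.

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// resolveAdminAddrs is a hypothetical helper (not rpk's actual function).
// k8sErr is whatever error the k8s API lookup returned, or nil.
func resolveAdminAddrs(profileAddrs, k8sAddrs []string, k8sErr error) []string {
	if k8sErr != nil {
		log.Printf("warning: unable to get admin addresses from the k8s API: %v", k8sErr)
	}
	// Union of both sources, keeping first-seen order and dropping duplicates.
	seen := make(map[string]bool)
	var out []string
	for _, src := range [][]string{profileAddrs, k8sAddrs} {
		for _, addr := range src {
			if addr == "" || seen[addr] {
				continue
			}
			seen[addr] = true
			out = append(out, addr)
		}
	}
	// Fall back to 127.0.0.1 when neither source yielded an address.
	if len(out) == 0 {
		out = []string{"127.0.0.1"}
	}
	log.Printf("using admin addresses: %v", out)
	return out
}

func main() {
	// Simulate a failed k8s lookup with one profile-defined address.
	addrs := resolveAdminAddrs([]string{"10.0.0.1:9644"}, nil, errors.New("forbidden"))
	fmt.Println(addrs)
}
```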
We no longer assume `redpanda` is the default
container name. Instead, we fetch the logs of
all containers and initContainers in
the namespace/pod.

Logs will still be stored under logs/

(cherry picked from commit d66c6b0)
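
To illustrate the idea of enumerating every container rather than a single hardcoded one, here is a minimal sketch. The `pod` and `logTarget` types are simplified stand-ins (not the Kubernetes client types), `logTargets` is a hypothetical helper, and the pod and container names in `main` are invented for the example.

```go
package main

import "fmt"

// pod is a simplified stand-in for the parts of a pod spec used here.
type pod struct {
	Name           string
	InitContainers []string
	Containers     []string
}

// logTarget identifies one container whose logs should be fetched.
type logTarget struct {
	Pod, Container string
}

// logTargets lists every initContainer and container of every pod,
// instead of assuming a single container named "redpanda".
func logTargets(pods []pod) []logTarget {
	var targets []logTarget
	for _, p := range pods {
		for _, group := range [][]string{p.InitContainers, p.Containers} {
			for _, c := range group {
				targets = append(targets, logTarget{Pod: p.Name, Container: c})
			}
		}
	}
	return targets
}

func main() {
	pods := []pod{{
		Name:           "redpanda-0",
		InitContainers: []string{"init-example"},
		Containers:     []string{"redpanda", "sidecar-example"},
	}}
	for _, t := range logTargets(pods) {
		fmt.Printf("%s/%s\n", t.Pod, t.Container)
	}
}
```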
@r-vasquez
Contributor Author

Depends on #26217: we recently bumped Go, so we also need to bump TinyGo in the v24.2.x branch.

@vbotbuildovich
Collaborator

vbotbuildovich commented May 22, 2025

Retry command for Build#66319

please wait until all jobs are finished before running the slash command

/ci-repeat 1
tests/rptest/tests/partition_balancer_test.py::PartitionBalancerTest.test_transfer_controller_leadership
tests/rptest/tests/partition_balancer_test.py::PartitionBalancerTest.test_unavailable_nodes
tests/rptest/tests/cloud_storage_scrubber_test.py::CloudStorageScrubberTest.test_scrubber@{"cloud_storage_type":2}

@vbotbuildovich
Collaborator

CI test results

test results on build#66319
| test_class | test_method | test_arguments | test_kind | job_url | test_status | passed | reason |
| --- | --- | --- | --- | --- | --- | --- | --- |
| CloudArchiveRetentionTest | test_delete | {"cloud_storage_type": 1, "retention_type": "retention.ms"} | ducktape | https://buildkite.com/redpanda/redpanda/builds/66319#0196f91c-1c98-4ed8-ac11-a21197c44d85 | FLAKY | 20/21 | upstream reliability is '89.65517241379311'. current run reliability is '95.23809523809523'. drift is -5.58292 and the allowed drift is set to 0. The test should PASS |
| CloudStorageScrubberTest | test_scrubber | {"cloud_storage_type": 2} | ducktape | https://buildkite.com/redpanda/redpanda/builds/66319#0196f963-702b-40b6-8abd-e26fa1fa9e74 | FLAKY | 20/21 | upstream reliability is '100.0'. current run reliability is '95.23809523809523'. drift is 4.7619 and the allowed drift is set to 0. The test should FAIL |
| PartitionBalancerTest | test_transfer_controller_leadership | | ducktape | https://buildkite.com/redpanda/redpanda/builds/66319#0196f963-702a-4ef3-8086-f8a30f8c900e | FLAKY | 16/21 | upstream reliability is '100.0'. current run reliability is '76.19047619047619'. drift is 23.80952 and the allowed drift is set to 0. The test should FAIL |
| PartitionBalancerTest | test_unavailable_nodes | | ducktape | https://buildkite.com/redpanda/redpanda/builds/66319#0196f963-702b-40b6-8abd-e26fa1fa9e74 | FLAKY | 16/21 | upstream reliability is '100.0'. current run reliability is '76.19047619047619'. drift is 23.80952 and the allowed drift is set to 0. The test should FAIL |

@r-vasquez
Contributor Author

/ci-repeat 1
tests/rptest/tests/partition_balancer_test.py::PartitionBalancerTest.test_transfer_controller_leadership
tests/rptest/tests/partition_balancer_test.py::PartitionBalancerTest.test_unavailable_nodes
tests/rptest/tests/cloud_storage_scrubber_test.py::CloudStorageScrubberTest.test_scrubber@{"cloud_storage_type":2}

@r-vasquez r-vasquez enabled auto-merge May 23, 2025 02:53
@r-vasquez r-vasquez merged commit 9623e32 into redpanda-data:v24.2.x May 23, 2025
23 checks passed
@r-vasquez r-vasquez deleted the backport-pr-26091-v24.2.x-158 branch August 21, 2025 17:27
3 participants