Skip to content

Conversation

kkourt
Copy link
Contributor

@kkourt kkourt commented Apr 30, 2025

For every map we load, we find its program, load its spec and then load the map. This introduces unnecessary overhead when loading the maps. This commit, caches the specs and only loads them if we have not seen the program before.

This speeds up loading a sensor, but also tests.

For example, Without the patch, running:
go test -exec sudo ./pkg/sensors/tracing -bpf-lib $(pwd)/bpf/objs -test.run TestKprobeSelectors -count 1

three times, results in:
ok github.com/cilium/tetragon/pkg/sensors/tracing 8.773s
ok github.com/cilium/tetragon/pkg/sensors/tracing 8.703s
ok github.com/cilium/tetragon/pkg/sensors/tracing 8.739s

With the patch, the same command results in:
ok github.com/cilium/tetragon/pkg/sensors/tracing 7.419s
ok github.com/cilium/tetragon/pkg/sensors/tracing 7.532s
ok github.com/cilium/tetragon/pkg/sensors/tracing 7.491s

Which is a ~14% improvement.

@kkourt kkourt added the release-note/misc This PR makes changes that have no direct user impact. label Apr 30, 2025
@kkourt kkourt force-pushed the pr/kkourt/loader-cache-spec branch from 92fefa3 to 7422adb Compare April 30, 2025 14:02
@kkourt kkourt marked this pull request as ready for review April 30, 2025 14:04
@kkourt kkourt requested a review from a team as a code owner April 30, 2025 14:04
@kkourt kkourt requested a review from tixxdz April 30, 2025 14:04
For every map we load, we find its program, load its spec and then load
the map. This introduces unnecessary overhead when loading the maps.
This commit, caches the specs and only loads them if we have not seen
the program before.

This speeds up loading a sensor, but also tests.

For example, Without the patch, running:
  go test -exec sudo  ./pkg/sensors/tracing -bpf-lib $(pwd)/bpf/objs -test.run TestKprobeSelectors -count 1

three times, results in:
  ok      github.com/cilium/tetragon/pkg/sensors/tracing  8.773s
  ok      github.com/cilium/tetragon/pkg/sensors/tracing  8.703s
  ok      github.com/cilium/tetragon/pkg/sensors/tracing  8.739s

With the patch, the same command results in:
  ok      github.com/cilium/tetragon/pkg/sensors/tracing  7.419s
  ok      github.com/cilium/tetragon/pkg/sensors/tracing  7.532s
  ok      github.com/cilium/tetragon/pkg/sensors/tracing  7.491s

Which is a ~14% improvement.

Signed-off-by: Kornilios Kourtis <[email protected]>
@kkourt kkourt force-pushed the pr/kkourt/loader-cache-spec branch from 7422adb to e9ca74d Compare April 30, 2025 14:13
Copy link
Contributor

@olsajiri olsajiri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good, I wonder we could put loaderCache in Sensor object and factor the code so we could use it from within doLoadProgram .. but that might up with some struggle, so could be a follow up ;-) thanks

@jrfastab
Copy link
Contributor

jrfastab commented May 1, 2025

Might be nice to check memory usage from caching here.

@jrfastab jrfastab merged commit 8ce47ef into main May 1, 2025
4 checks passed
@jrfastab jrfastab deleted the pr/kkourt/loader-cache-spec branch May 1, 2025 15:45
kkourt added a commit that referenced this pull request May 2, 2025
Commit 8ce47ef, introduced a cache for the program specs when
loading maps. The go tests did not run in the corresponding PR
(#3685) and, as a result, we
ended up with the following failure in main for some (rhel8.9, 5.15,
5.10, 5.3) kernels for the TestKproveOverideMulti test:

    observer_test_helper.go:465: SensorManager.AddTracingPolicy error: sensor generic_kprobe from collection sys-openat-signal-override failed to load: failed prog /home/kkourt/src/tetragon/bpf/objs/bpf_generic_kprobe_v511.o kern_version 331700 loadInstance: opening collection '/home/kkourt/src/tetragon/bpf/objs/bpf_generic_kprobe_v511.o' failed: using replacement map override_tasks: MaxEntries: 32768 changed to 1: map spec is incompatible with existing map

The problem is that the map spec is updated and, because it is now
cached, it is reused across multiple instances of the map. These
instances have different pin paths and may have different
configurations. Specifically, in the TestKproveOverideMulti one instance
of the map is configured with a single entry, while all others are
configured with 32768 entries.

To address this problem, this commits makes a copy of the mapSpec before
modifying it. Compared to 8ce47ef, this leads to worse performance
but still better than without the cache.

Running:
  go test -exec sudo  ./pkg/sensors/tracing -bpf-lib $(pwd)/bpf/objs -test.run TestKprobeSelectors -count 1

This commit:
ok      github.com/cilium/tetragon/pkg/sensors/tracing  7.831s
ok      github.com/cilium/tetragon/pkg/sensors/tracing  7.849s
ok      github.com/cilium/tetragon/pkg/sensors/tracing  7.781s

Without the cache (21b326c):
ok      github.com/cilium/tetragon/pkg/sensors/tracing  9.078s
ok      github.com/cilium/tetragon/pkg/sensors/tracing  9.074s
ok      github.com/cilium/tetragon/pkg/sensors/tracing  9.068s

Which is a ~13.5% improvement.

NB: not sure why the "without the cache" numbers are different from the
ones reported in 8ce47ef. Maybe something changed in overhead in
the meantime.

Fixes: 8ce47ef ("sensors: cache spec when loading maps")

Signed-off-by: Kornilios Kourtis <[email protected]>
kkourt added a commit that referenced this pull request May 2, 2025
Commit 8ce47ef, introduced a cache for the program specs when
loading maps. The go tests did not run in the corresponding PR
(#3685) and, as a result, we
ended up with the following failure in main for some (rhel8.9, 5.15,
5.10, 5.3) kernels for the TestKprobeOverrideMulti test:

    observer_test_helper.go:465: SensorManager.AddTracingPolicy error: sensor generic_kprobe from collection sys-openat-signal-override failed to load: failed prog /home/kkourt/src/tetragon/bpf/objs/bpf_generic_kprobe_v511.o kern_version 331700 loadInstance: opening collection '/home/kkourt/src/tetragon/bpf/objs/bpf_generic_kprobe_v511.o' failed: using replacement map override_tasks: MaxEntries: 32768 changed to 1: map spec is incompatible with existing map

The problem is that the map spec is updated and, because it is now
cached, it is reused across multiple instances of the map. These
instances have different pin paths and may have different
configurations. Specifically, in the TestKproveOverideMulti one instance
of the map is configured with a single entry, while all others are
configured with 32768 entries.

To address this problem, this commits makes a copy of the mapSpec before
modifying it. Compared to 8ce47ef, this leads to worse performance
but still better than without the cache.

Running:
  go test -exec sudo  ./pkg/sensors/tracing -bpf-lib $(pwd)/bpf/objs -test.run TestKprobeSelectors -count 1

This commit:
ok      github.com/cilium/tetragon/pkg/sensors/tracing  7.831s
ok      github.com/cilium/tetragon/pkg/sensors/tracing  7.849s
ok      github.com/cilium/tetragon/pkg/sensors/tracing  7.781s

Without the cache (21b326c):
ok      github.com/cilium/tetragon/pkg/sensors/tracing  9.078s
ok      github.com/cilium/tetragon/pkg/sensors/tracing  9.074s
ok      github.com/cilium/tetragon/pkg/sensors/tracing  9.068s

Which is a ~13.5% improvement.

NB: not sure why the "without the cache" numbers are different from the
ones reported in 8ce47ef. Maybe something changed in overhead in
the meantime.

Fixes: 8ce47ef ("sensors: cache spec when loading maps")

Signed-off-by: Kornilios Kourtis <[email protected]>
@kkourt
Copy link
Contributor Author

kkourt commented May 2, 2025

looks good, I wonder we could put loaderCache in Sensor object and factor the code so we could use it from within doLoadProgram .. but that might up with some struggle, so could be a follow up ;-) thanks

Yeah, my initial patch did this, but I thought it was more substantial changes for not much performance benefit. We can revisit though.

@kkourt
Copy link
Contributor Author

kkourt commented May 2, 2025

Might be nice to check memory usage from caching here.

The cache is only maintained when we preload the maps of a sensor. When we are done with them, it goes away, so there shouldn't be a memory issue here.

kkourt added a commit that referenced this pull request May 2, 2025
Commit 8ce47ef, introduced a cache for the program specs when
loading maps. The go tests did not run in the corresponding PR
(#3685) and, as a result, we
ended up with the following failure in main for some (rhel8.9, 5.15,
5.10, 5.3) kernels for the TestKprobeOverrideMulti test:

    observer_test_helper.go:465: SensorManager.AddTracingPolicy error: sensor generic_kprobe from collection sys-openat-signal-override failed to load: failed prog /home/kkourt/src/tetragon/bpf/objs/bpf_generic_kprobe_v511.o kern_version 331700 loadInstance: opening collection '/home/kkourt/src/tetragon/bpf/objs/bpf_generic_kprobe_v511.o' failed: using replacement map override_tasks: MaxEntries: 32768 changed to 1: map spec is incompatible with existing map

The problem is that the map spec is updated and, because it is now
cached, it is reused across multiple instances of the map. These
instances have different pin paths and may have different
configurations. Specifically, in the TestKproveOverideMulti one instance
of the map is configured with a single entry, while all others are
configured with 32768 entries.

To address this problem, this commits makes a copy of the mapSpec before
modifying it. Compared to 8ce47ef, this leads to worse performance
but still better than without the cache.

Running:
  go test -exec sudo  ./pkg/sensors/tracing -bpf-lib $(pwd)/bpf/objs -test.run TestKprobeSelectors -count 1

This commit:
ok      github.com/cilium/tetragon/pkg/sensors/tracing  7.831s
ok      github.com/cilium/tetragon/pkg/sensors/tracing  7.849s
ok      github.com/cilium/tetragon/pkg/sensors/tracing  7.781s

Without the cache (21b326c):
ok      github.com/cilium/tetragon/pkg/sensors/tracing  9.078s
ok      github.com/cilium/tetragon/pkg/sensors/tracing  9.074s
ok      github.com/cilium/tetragon/pkg/sensors/tracing  9.068s

Which is a ~13.5% improvement.

NB: not sure why the "without the cache" numbers are different from the
ones reported in 8ce47ef. Maybe something changed in overhead in
the meantime.

Fixes: 8ce47ef ("sensors: cache spec when loading maps")

Signed-off-by: Kornilios Kourtis <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
release-note/misc This PR makes changes that have no direct user impact.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants