Skip to content

Conversation

Sachuman
Copy link
Contributor

@Sachuman Sachuman commented Aug 28, 2025

db: SeekPrefixGE lazy positioning

This commit implements the lazy positioning of SeekPrefixGE. During SeekPrefixGE, previously we had to perform the seek in each level.
We loaded the heap with all the seek results we got and returned the next entry, which would be at the top of the heap.
This optimization implements the idea of deferring an actual seek unless needed. We use the Block Interval Property to extract the latest suffix from the table, and create a fake or a synthetic key instead of actually loading the block, we do this if the following requirements are met

  1. The latest suffix we extracted must be less than the seeked suffix.
  2. Perform the synthetic key optimization if the prefix is wholly contained within the sstable's bounds

If the conditions are met we create a synthetic key, mark it as one and return it. We will keep it as a synthetic key unless we rise to the top of the heap, if we are at the top of the heap we must perform the actual seek. The following are the realized benefits of this optimization.

  1. Suppose we made three synthetic keys, and two of them never reached the top of the heap, we potentially saved two block reads.
  2. During these saves, we also saved bloom filter reads.

Here are some disadvantages of this optimization

  1. We will have to disable the trySeekUsingNext optimization during SeekPrefixGE calls if the key earlier we had was a synthetic one, since we never actually seeked we don't want this optimization to be used. The problem is that we also lose the opportunity to use this even when applicable.
  2. There can be times where we create synthetic keys all over the place and we end up actually seeking each one of them, in which case this optimization is basically useless.

Implements #2002

db: add probe tests to TestIterHistories

This commit adds probe tests to iterator_histories, also adds unit tests for the new optimization.

@Sachuman Sachuman requested a review from a team as a code owner August 28, 2025 22:38
@Sachuman Sachuman requested a review from xinhaoz August 28, 2025 22:38
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@Sachuman Sachuman marked this pull request as draft August 28, 2025 22:38
@Sachuman Sachuman force-pushed the optimize-MVCCGets-synthetic-keys branch from 7882e7b to e660b00 Compare September 2, 2025 17:17
This commit implements the lazy positioning of SeekPrefixGE. During SeekPrefixGE, previously we had to perform the seek in each level.
We loaded the heap with all the seek results we got and returned the next entry, which would be at the top of the heap.
This optimization implements the idea of deferring an actual seek unless needed. We use the Block Interval Property to extract the
latest suffix from the table, and create a fake or a synthetic key instead of actually loading the block, we do this if the following requirements are met

1. The latest suffix we extracted must be less than the seeked suffix.
2. Perform the synthetic key optimization if the prefix is wholly contained within the sstable's bounds

If the conditions are met we create a synthetic key, mark it as one and return it. We will keep it as a synthetic key unless we rise to the top of the heap,
if we are at the top of the heap we must perform the actual seek. The following are the realized benefits of this optimization.

1. Suppose we made three synthetic keys, and two of them never reached the top of the heap, we potentially saved two block reads.
2. During these saves, we also saved bloom filter reads.

Here are some disadvantages of this optimization

1. We will have to disable the trySeekUsingNext optimization during SeekPrefixGE calls if the key earlier we had was a synthetic one,
since we never actually seeked we don't want this optimization to be used. The problem is that we also lose the opportunity to use this even when applicable.
2. There can be times where we create synthetic keys all over the place and we end up actually seeking each one of them, in which case this
optimization is basically useless.

Implements cockroachdb#2002
@Sachuman Sachuman force-pushed the optimize-MVCCGets-synthetic-keys branch from e660b00 to 7a62afd Compare September 2, 2025 17:54
@Sachuman Sachuman marked this pull request as ready for review September 2, 2025 17:56
@Sachuman Sachuman force-pushed the optimize-MVCCGets-synthetic-keys branch from 7a62afd to b3e4587 Compare September 2, 2025 18:27
Copy link
Collaborator

@jbowens jbowens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looking good!

@jbowens reviewed 3 of 8 files at r1.
Reviewable status: 1 of 11 files reviewed, 8 unresolved discussions (waiting on @xinhaoz)


sstable/block_property_test_utils.go line 149 at r2 (raw file):

func (testprop MaxTestKeysSuffixProperty) Extract(
	encodedProperty []byte,

I think we should extend this func (in the interface) to take a dst []byte for the first parameter. Then down below we'll append the @ byte and AppendUint the value to dst, and return it. This allows the caller to reuse a buffer, reducing allocations


sstable/block_property_test_utils.go line 161 at r2 (raw file):

		return nil, false, nil
	}
	ret := strconv.AppendUint([]byte("@"), (interval.Upper - 1), 10)

What motivated the - 1? Just trying to get a tighter bound since Upper is exclusive, or was there a correctness issue you ran into?


file_cache.go line 735 at r1 (raw file):

				References:   blobReferences,
			},
			MaximumSuffixProperty: sstable.MaxTestKeysSuffixProperty{},

this should be opts.MaximumSuffixProperty


sstable/reader_iter_two_lvl.go line 861 at r2 (raw file):

func (i *twoLevelIterator[I, PI, D, PD]) Next() *base.InternalKV {

	if i.secondLevel.synthetic.atSyntheticKey {

I'm confused by this; why can't this usenseekPrefixGE?


sstable/reader.go line 111 at r2 (raw file):

type MaximumSuffixProperty interface {
	Name() string
	Extract(encodedProperty []byte) (suffix []byte, ok bool, err error)

let's document MaximumSuffixProperty and Extract, and include in Extract's comment that the implementation must not mutate encodedProperty


sstable/reader.go line 114 at r2 (raw file):

}

type SyntheticKey struct {

nit: this doesn't need to be exported (capitalized), right? I don't think we're accessing it from outside of this package.


internal/base/internal.go line 99 at r2 (raw file):

	//InternalKeyKindNoop                     InternalKeyKind = 13
	//InternalKeyKindColumnFamilyRangeDelete  InternalKeyKind = 14
	InternalKeyKindSyntheticKey InternalKeyKind = 14

what was the motivation for using InternalKeyKind=14?


sstable/reader_iter_single_lvl.go line 837 at r2 (raw file):

					}
				} else {
					maxSuffix, ok, err = i.maximumSuffixProperty.Extract([]byte(prop))

let's do an "unsafe" byte-slice conversion here:

maxSuffix, ok, err = i.maximumSuffixProperty.Extract(unsafe.Slice(unsafe.StringData(prop), len(prop)))

This will prevent an allocation. And it's safe because Extract does not mutate property.

@Sachuman Sachuman force-pushed the optimize-MVCCGets-synthetic-keys branch from b3e4587 to d6cf16c Compare September 2, 2025 22:12
Copy link
Contributor Author

@Sachuman Sachuman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 1 of 13 files reviewed, 7 unresolved discussions (waiting on @12, @13, @jbowens, and @xinhaoz)


file_cache.go line 735 at r1 (raw file):

Previously, jbowens (Jackson Owens) wrote…

this should be opts.MaximumSuffixProperty

Done.


internal/base/internal.go line 99 at r2 (raw file):

Previously, jbowens (Jackson Owens) wrote…

what was the motivation for using InternalKeyKind=14?

The reason was that TestInvalidInternalKey FAILED since x19 && 0xff returned 25. I fixed it by changing the test case to x20

Code snippet (i):

// Kind returns the key kind component of the trailer.
func (t InternalKeyTrailer) Kind() InternalKeyKind {
	return InternalKeyKind(t & 0xff)
}

Code snippet (ii):

		"foo\x08\x07\x06\x05\x04\x03\x02",
		"foo\x20\x07\x06\x05\x04\x03\x02\x01",

sstable/block_property_test_utils.go line 149 at r2 (raw file):

Previously, jbowens (Jackson Owens) wrote…

I think we should extend this func (in the interface) to take a dst []byte for the first parameter. Then down below we'll append the @ byte and AppendUint the value to dst, and return it. This allows the caller to reuse a buffer, reducing allocations

Good point. Done.


sstable/block_property_test_utils.go line 161 at r2 (raw file):

Previously, jbowens (Jackson Owens) wrote…

What motivated the - 1? Just trying to get a tighter bound since Upper is exclusive, or was there a correctness issue you ran into?

Its trying to get a tighter bound, I am imaging a case where we seek @13 and the latest suffix is @12, if we did not
do -1 there we would not perform the optimization even though we should ?


sstable/reader.go line 111 at r2 (raw file):

Previously, jbowens (Jackson Owens) wrote…

let's document MaximumSuffixProperty and Extract, and include in Extract's comment that the implementation must not mutate encodedProperty

Done.


sstable/reader.go line 114 at r2 (raw file):

Previously, jbowens (Jackson Owens) wrote…

nit: this doesn't need to be exported (capitalized), right? I don't think we're accessing it from outside of this package.

Done.


sstable/reader_iter_single_lvl.go line 837 at r2 (raw file):

Previously, jbowens (Jackson Owens) wrote…

let's do an "unsafe" byte-slice conversion here:

maxSuffix, ok, err = i.maximumSuffixProperty.Extract(unsafe.Slice(unsafe.StringData(prop), len(prop)))

This will prevent an allocation. And it's safe because Extract does not mutate property.

Good point. Done.


sstable/reader_iter_two_lvl.go line 861 at r2 (raw file):

Previously, jbowens (Jackson Owens) wrote…

I'm confused by this; why can't this usenseekPrefixGE?

During the two level SeekPrefixGE, we never loaded its index block. So during next we must load the index block before calling single level iterator.
If you were talking about replacing secondLevel.Next with seekPrefixGE. We can certainly do that and get some results.

Code snippet:

		if !dontSeekWithinSingleLevelIter {
			kv := i.secondLevel.Next()
			return kv
		}

@Sachuman Sachuman force-pushed the optimize-MVCCGets-synthetic-keys branch 2 times, most recently from 54babe7 to 992d055 Compare September 3, 2025 15:07
Copy link
Collaborator

@jbowens jbowens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jbowens reviewed 3 of 12 files at r7.
Reviewable status: 4 of 13 files reviewed, 9 unresolved discussions (waiting on @12, @13, and @xinhaoz)


sstable/block_property_test_utils.go line 161 at r2 (raw file):

Previously, Sachuman (Sachin) wrote…

Its trying to get a tighter bound, I am imaging a case where we seek @13 and the latest suffix is @12, if we did not
do -1 there we would not perform the optimization even though we should ?

Makes sense


sstable/reader_iter_single_lvl.go line 801 at r7 (raw file):

	if i.synthetic.atSyntheticKey {
		// TODO : We have to disable the optimization to avoid false data invalidation if there back to back SeekPrefixGE calls
		// so currently we cant take advantage of trySeekUsingNext in case of synthetic reseeks.

nit: wrap the comment lines at ~80-100 chars, and format the TODO prefix as TODO(sachin): ...


sstable/reader_iter_single_lvl.go line 819 at r7 (raw file):

	// a suffix (len(key) > len(prefix)), we might be able to defer actually
	// performing the seek and potentially loading additional blocks.
	// However, perform the synthetic key optimization if the prefix is wholly contained within the sstable's bounds

nit: I think this should say something like "However, we can only perform ..."

(also wrap at 80-100 chars)


sstable/reader_iter_single_lvl.go line 822 at r7 (raw file):

	// We do so because there might be cases which result in generating a synthetic key which was not supposed to be generated
	// But due to the Bounds we might end up creating the synthetic key, which will result in a seek failure. A case that can
	// be considered is when we have a RangeDel key at the upper bound of the sstable, which possibly can result in fake synthetic key.

nit: how about instead, something like:

"If the sought prefix is NOT wholly contained within the sstable, it might be split across multiple sstables. If the first sstable with bounds that overlap the prefix does not actually contain a point key with the prefix, the sstable's properties' max suffix doesn't reflect the max of the prefix."

Copy link
Collaborator

@jbowens jbowens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 4 of 13 files reviewed, 16 unresolved discussions (waiting on @12, @13, @Sachuman, and @xinhaoz)


sstable/reader_iter_two_lvl.go line 861 at r2 (raw file):

Previously, Sachuman (Sachin) wrote…

During the two level SeekPrefixGE, we never loaded its index block. So during next we must load the index block before calling single level iterator.
If you were talking about replacing secondLevel.Next with seekPrefixGE. We can certainly do that and get some results.

Ah I think I was confusing the code structure of twoLevelIterator and singleLevelIterator.

Can we extract the body of the seek prefix GE and share the code between twoLevelIterator.SeekPrefixGE and twoLevelIterator.Next?


sstable/reader_iter_single_lvl.go line 1299 at r7 (raw file):

func (i *singleLevelIterator[I, PI, D, PD]) Next() *base.InternalKV {

	if i.synthetic.atSyntheticKey {

nit: let's put a comment above this if statement explaining the optimization and that SeekPrefixGE may have returned a synthetic key with the "smallest" (smallest according to base.Compare) suffix contained in the sstable, avoiding performing the actual seek. If the caller is calling Next, they want to move past the synthetic key and Next is responsible for performing the actual seek.


internal/itertest/testdata/probes line 26 at r7 (raw file):

# Create probes to exercise the UserKey predicate. On Next/Prev return a KV
# with user key "bar"; on other ops return a KV with user key "foo". Then if
# the returned KV's user key is "bar", override the result with a specific KV.

I'm a bit confused as to what this test case is seeking to test?


iterator_histories_test.go line 358 at r7 (raw file):

						o.UseL6Filters = true
					case "probe-points":

nit: remove empty line


iterator_histories_test.go line 367 at r7 (raw file):

						pointProbes[base.TableNum(i)] = itertest.MustParseProbes(parser, arg.Vals[1:]...)
					case "probe-rangedels":

nit: remove empty line


sstable/reader_iter_two_lvl.go line 429 at r7 (raw file):

	// We do so because there might be cases which result in generating a synthetic key which was not supposed to be generated
	// But due to the Bounds we might end up creating the synthetic key, which will result in a seek failure. A case that can
	// be considered is when we have a RangeDel key at the upper bound of the sstable, which possibly can result in fake synthetic key.

nit: ditto the comment on the singleLevelIterator


sstable/reader_iter_two_lvl.go line 436 at r7 (raw file):

		largest = i.secondLevel.reader.Comparer.Split.Prefix(largest)

		if i.secondLevel.cmp(prefix, smallest) > 0 && i.secondLevel.cmp(prefix, largest) < 0 {

nit: let's move checking this condition to be the innermost. the vast majority of the time this condition will be met, whereas the other conditions (eg, the seek key's suffix being < the max suffix) are less likely to be met.


testdata/iter_histories/probe_prefix line 178 at r7 (raw file):

----

combined-iter probe-points=(4,(Log "OpSeekPrefixGE")) probe-rangedels=(4,(Log "OpSeekPrefixGE"))

these "OpSeekPrefixGE" prefixes are a bit confusing. how about

combined-iter probe-points=(4,(Log "000004.")) probe-rangedels=(4,(Log "000004.RangeDels."))

(and analogous changes throughout the test file)

Copy link
Collaborator

@jbowens jbowens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 4 of 13 files reviewed, 18 unresolved discussions (waiting on @12, @13, @Sachuman, and @xinhaoz)


sstable/reader_iter_two_lvl.go line 448 at r7 (raw file):

					}
				} else {
					maxSuffix, ok, err = i.secondLevel.maximumSuffixProperty.Extract(maxSuffix, unsafe.Slice(unsafe.StringData(prop), len(prop)))

in order to use the dst parameter to reduce allocations, we need to retain a buffer between calls to SeekPrefixGE. We can add a maxSuffixBuf []byte to i.synthetic.

Then we would do something like:

maxSuffix, ok, err = i.secondLevel.maximumSuffixProperty.Extract(i.synthetic.maxSuffixBuf[:0], unsafe.Slice(unsafe.StringData(prop), len(prop)))
if err != nil {
	i.secondLevel.err = err
	return nil
} else if ok {
	i.synthetic.maxSuffixBuf = maxSuffix
}

sstable/reader_iter_two_lvl.go line 471 at r7 (raw file):

				// to still be open by the time singleLevelIterator.Next is called
				// and we use the seek key to actually perform the seek.
				i.secondLevel.synthetic.seekKey = append(i.secondLevel.synthetic.seekKey[:0], key...)

We should retain seekKey and the new maxSuffix field on i.synthetic across Closes. To do that, I think we should move i.synthetic below i.clearForResetBoundary in the struct definition (so i.synthetic no longer has its memory zeroed automatically in resetForReuse), and then explicitly update resetForReuse to do:

clear(synthetic.seekKey[:cap(synthetic.seekKey)])
clear(synthetic.maxSuffix[:cap(synthetic.maxSuffix)])
synthetic.seekKey = synthetic.seekKey[:0]
synthetic.maxSuffix = synthetic.maxSuffix[:0]

and explicitly clear out the other fields of i.synthetic

@Sachuman Sachuman force-pushed the optimize-MVCCGets-synthetic-keys branch 2 times, most recently from 2469966 to b6f987a Compare September 4, 2025 20:31
Copy link
Contributor Author

@Sachuman Sachuman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 1 of 13 files reviewed, 11 unresolved discussions (waiting on @12, @13, @jbowens, and @xinhaoz)


sstable/reader_iter_single_lvl.go line 801 at r7 (raw file):

Previously, jbowens (Jackson Owens) wrote…

nit: wrap the comment lines at ~80-100 chars, and format the TODO prefix as TODO(sachin): ...

Done.


sstable/reader_iter_single_lvl.go line 819 at r7 (raw file):

Previously, jbowens (Jackson Owens) wrote…

nit: I think this should say something like "However, we can only perform ..."

(also wrap at 80-100 chars)

Done.


sstable/reader_iter_single_lvl.go line 1299 at r7 (raw file):

Previously, jbowens (Jackson Owens) wrote…

nit: let's put a comment above this if statement explaining the optimization and that SeekPrefixGE may have returned a synthetic key with the "smallest" (smallest according to base.Compare) suffix contained in the sstable, avoiding performing the actual seek. If the caller is calling Next, they want to move past the synthetic key and Next is responsible for performing the actual seek.

Done.


sstable/reader_iter_two_lvl.go line 861 at r2 (raw file):

Previously, jbowens (Jackson Owens) wrote…

Ah I think I was confusing the code structure of twoLevelIterator and singleLevelIterator.

Can we extract the body of the seek prefix GE and share the code between twoLevelIterator.SeekPrefixGE and twoLevelIterator.Next?

Done.


sstable/reader_iter_two_lvl.go line 429 at r7 (raw file):

Previously, jbowens (Jackson Owens) wrote…

nit: ditto the comment on the singleLevelIterator

Done.


sstable/reader_iter_two_lvl.go line 436 at r7 (raw file):

Previously, jbowens (Jackson Owens) wrote…

nit: let's move checking this condition to be the innermost. the vast majority of the time this condition will be met, whereas the other conditions (eg, the seek key's suffix being < the max suffix) are less likely to be met.

Done.


sstable/reader_iter_two_lvl.go line 448 at r7 (raw file):

Previously, jbowens (Jackson Owens) wrote…

in order to use the dst parameter to reduce allocations, we need to retain a buffer between calls to SeekPrefixGE. We can add a maxSuffixBuf []byte to i.synthetic.

Then we would do something like:

maxSuffix, ok, err = i.secondLevel.maximumSuffixProperty.Extract(i.synthetic.maxSuffixBuf[:0], unsafe.Slice(unsafe.StringData(prop), len(prop)))
if err != nil {
	i.secondLevel.err = err
	return nil
} else if ok {
	i.synthetic.maxSuffixBuf = maxSuffix
}

I completely missed the point of reducing allocations with buffer. Thanks for this, made changes as suggested.


testdata/iter_histories/probe_prefix line 178 at r7 (raw file):

Previously, jbowens (Jackson Owens) wrote…

these "OpSeekPrefixGE" prefixes are a bit confusing. how about

combined-iter probe-points=(4,(Log "000004.")) probe-rangedels=(4,(Log "000004.RangeDels."))

(and analogous changes throughout the test file)

Agreed, Done.


internal/itertest/testdata/probes line 26 at r7 (raw file):

Previously, jbowens (Jackson Owens) wrote…

I'm a bit confused as to what this test case is seeking to test?

I added a test case for testing dsl function 'UserKey' since there wasn't one I believe. Made some change to the test case to make it more clear.


iterator_histories_test.go line 358 at r7 (raw file):

Previously, jbowens (Jackson Owens) wrote…

nit: remove empty line

Done.


iterator_histories_test.go line 367 at r7 (raw file):

Previously, jbowens (Jackson Owens) wrote…

nit: remove empty line

Done.

This commit adds probe tests to iterator_histories, also adds unit tests
for the new optimization.
@Sachuman Sachuman force-pushed the optimize-MVCCGets-synthetic-keys branch from b6f987a to 1f2d281 Compare September 4, 2025 20:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants