
Conversation

NoahStapp (Contributor):

Please complete the following before merging:

  • Update changelog.
  • Test changes in at least one language driver.
  • Test these changes against all server versions and topologies (including standalone, replica set, and sharded
    clusters).

Python Django implementation: mongodb/django-mongodb-backend#366.

@@ -0,0 +1 @@
{"field1":"miNVpaKW","field2":"CS5VwrwN","field3":"Oq5Csk1w","field4":"ZPm57dhu","field5":"gxUpzIjg","field6":"Smo9whci","field7":"TW34kfzq","field8":55336395,"field9":41992681,"field10":72188733,"field11":46660880,"field12":3527055,"field13":74094448} No newline at end of file

Contributor:

format


### Benchmark Server

The MongoDB ODM Performance Benchmark must be run against a standalone MongoDB server running the latest stable database version without authentication or SSL enabled.

Contributor:

I think we can open this up to be either a standalone server or a replica set of size 1. (This is because some ODMs leverage transactions.)

Contributor Author:

Using a replica set of size 1 makes more sense here, agreed.


### Benchmark placement and scheduling

The MongoDB ODM Performance Benchmark should be placed within the ODM's test directory as an independent test suite. Due to the relatively long runtime of the benchmarks, including them as part of an automated suite that runs against every PR is not recommended.

Contributor:

I still think we should leave open the option for folks to create their own benchmarking repo if that helps out. I'm open to others' takes on this one, seeing as I worry about maintainers not wanting a benchmark repo.

Reviewer:

We don't agree that they should be in the tests directory but haven't ruled out including them in the ODM. For the purposes of getting the spec done, I wonder if requiring the ODM to document the location of the test suite is enough. If not, I would definitely remove the "test directory" requirement and make it "should be placed within the ODM". I think that is enough to make it clear that the goal is to have the perf tests included in the ODM.

Contributor Author:

I don't think a separate benchmark repo is a good choice here. We could reach out to existing maintainers and see if they want to weigh in, but I imagine having a separate repo for benchmarking is more trouble than it's worth.

Contributor:

I worry that ODMs might not be receptive to a large addition of performance tests to their repository. The ticket makes it sound like we (DBX) planned to run these tests ourselves, probably in a CI we maintain:

> The testing criteria would be documented in a human readable form (such as either a docs page or a markdown file), and once benchmarks have been developed we would run these against each new notable release of the client library. Providing well documented benchmarks will hopefully also encourage the developer community to contribute additional tests to further improve coverage.

And from the comments: https://jira.mongodb.org/browse/DRIVERS-2917#:~:text=Shane%20Harvey%20FYI%20that%20on%20https%3A//www.mongodb.com/services/support/mongoose%2Dodm%2Dsupport%20we%20specifically%20call%20out%20that%20%22MongoDB%E2%80%99s%20team%20of%20experts%20rigorously%20tests%20new%20Mongoose%20releases%20to%20ensure%20they%20are%20compatible%20with%20MongoDB%20and%20meet%20appropriate%20performance%20benchmarks.%22.

I don't see any mention of where these tests will live in the scope, either.

Why do we plan on contributing spec tests to ODM repos, instead of creating a pipeline similar to the AI testing pipeline? Or just integrating perf testing into drivers' existing performance testing? We already have the test runners and infrastructure to run these ourselves. And to @ajcvickers's point, we already have dedicated performance test hosts in evergreen that are stable and isolated from other hosts in CI.

Contributor Author:

I don't believe there was any concrete plan one way or the other at the time the ticket and scope were created.

In my view, there are a few fundamental differences between the libraries being tested here versus for AI integrations.

  • Many ODMs are or are planned to be first-party libraries rather than contributions to third-party AI frameworks.
  • The AI space moves extremely rapidly and broken CI/CD or testing suites are extremely common. Both factors were significant motivators in the creation of our AI testing pipeline. Those motivations don't seem to exist here.
  • AI frameworks tend to have several to dozens of integrations all housed within a single repo, each with their own dependencies and tests. Third-party ODMs are more often standalone repos with far less complexity in this manner, so adding a single additional test file for performance testing is much less significant.

What would integrating perf testing within the existing drivers perf testing look like? Would all of the ODM benchmarks live in a separate repo, with each driver cloning and using the specific subdirectory that contains the ODMs they want to test?

Using the same skeleton of test runners and infrastructure for the ODM testing makes it very easy to get these tests up and running without polluting the existing drivers tests.

Contributor Author (@NoahStapp, Sep 10, 2025):

Here's the scope doc, which covers the motivations of this work: https://docs.google.com/document/d/1GCle2vTQLdoSaDJJXyXeXYqtcAfymr8pM5oyV4gSI4A/edit?tab=t.0#heading=h.b1os3ai9s8t3.

Integrating testing into ODM processes is preferable for both visibility and maintenance reasons. Users will likely be more comfortable using a library with very public and integrated performance tests, and having all testing for an ODM live within a single repo streamlines maintenance work. Having the performance tests be integrated also shows a higher level of accountability and transparency, especially if we end up adding performance tests that directly compare against Postgres or other SQL databases.

That said, I agree that maintainers refusing to let us add the perf test suite to a third-party repo puts us in a difficult spot. One option would be a split approach: first-party ODMs keep performance tests within their own repos, while third-party ODMs have theirs in an odm-testing-pipeline repo created explicitly for that purpose. Then, if maintainers tell us that they'd actually prefer to have the performance tests inside the ODM repo directly, we can migrate that suite out of the odm-testing-pipeline repo.

Do any third-party ODMs already have robust performance testing that we would be competing with? What are the most common reasons we've gotten for pushback against similar work being contributed in the past?

Contributor:

If we don't intend to leverage existing in-house performance tooling, how are we intending to track any actual performance regressions? (To address the scope goal: "Identify performance regressions and provide an ongoing, automated mechanism for doing so.") Likewise, how will we track apples-to-oranges or apples-to-apples comparison goals? Internally, we have, e.g., dedicated distros (like Bailey mentioned), perf release dashboards that can easily pull the relevant data into one place, etc. It's fine to have tests living in the ODMs' own repos so that there is transparency and so that they can be used to inform local development, but I don't see how we can realistically meet the scope goals I mentioned if we don't run them in our own pipeline.

Reviewer:

With regard to including perf tests in externally owned ODM repos, I think it's good form to let the community know what we're doing, and to have them be involved and even host if they want to--this gives us visibility in that community, which is great. That being said, I would be very surprised if any community wants to do this. Perf testing is controversial, especially when people are designing and implementing tests for the perf of their own code. So, realistically, I think this is all going to end up internal.

Contributor Author:

For ODMs where we own the repo, it's easy to run the performance suite on our existing test infrastructure. For external ODMs, unless the maintainers are fine with us adding config to run evergreen tasks directly into the repo, we'll have to set up a separate ODM testing pipeline repo similar to the AI/ML one that already exists.

Contributor:

Perhaps it's worth adjusting the wording here accordingly, then.

to the relatively long runtime of the benchmarks, including them as part of an automated suite that runs against every
PR is not recommended. Instead, scheduling benchmark runs on a regular cadence is the recommended method of automating
this suite of tests.

Contributor:

Per your suggestion earlier, we should include some new information about testing mainline use cases.

Comment on lines +379 to +382
As discussed earlier in this document, ODM feature sets vary significantly across libraries. Many ODMs have features
unique to them or their niche in the wider ecosystem, which makes specifying concrete benchmark test cases for every
possible API unfeasible. Instead, ODM authors should determine what mainline use cases of their library are not covered
by the benchmarks specified above and expand this testing suite with additional benchmarks to cover those areas.

Contributor Author:

This section is attempting to specify that ODMs should implement additional benchmark tests covering mainline use cases that are not included in this specification. One example would be the use of Django's `in` filter operator: `Model.objects.filter(field__in=["some_val"])`.
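
For illustration, here is a minimal sketch of what such an ODM-specific benchmark could look like, following the Setup/Task/Teardown phase structure used elsewhere in the spec. The `SmallDoc` model, its `field1` attribute, and the harness that would call these functions are assumptions made for the example, not part of the spec or the Django backend.

```python
# Hypothetical ODM-specific benchmark for Django's `in` lookup, structured as
# Setup / Task / Teardown phases. `SmallDoc` is an assumed model; a separate
# harness (not shown) would apply the spec's iteration and timing rules.

from myapp.models import SmallDoc  # assumed model, not defined by the spec


def setup():
    # Insert 10,000 small documents so the `in` query has data to match.
    SmallDoc.objects.bulk_create(
        SmallDoc(field1=f"value-{i}") for i in range(10_000)
    )


def task():
    # Mainline use case not covered by the core benchmarks: the `in` lookup.
    values = [f"value-{i}" for i in range(0, 10_000, 100)]
    list(SmallDoc.objects.filter(field1__in=values))  # force query evaluation


def teardown():
    SmallDoc.objects.all().delete()
```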

@NoahStapp NoahStapp marked this pull request as ready for review August 21, 2025 21:22
@NoahStapp NoahStapp requested a review from a team as a code owner August 21, 2025 21:22
@NoahStapp NoahStapp requested review from JamesKovacs, alexbevi, aclark4life, ajcvickers, rozza, damieng and R-shubham and removed request for a team August 21, 2025 21:22
@rozza rozza removed their request for review August 26, 2025 08:55
### Benchmark Server

The MongoDB ODM Performance Benchmark must be run against a MongoDB replica set of size 1 running the latest stable
database version without authentication or SSL enabled.

Contributor:

Are we concerned at all about accounting for performance variation due to server performance differences? In the drivers, we keep the server version patch-pinned and upgrade rarely and intentionally via independent commits in order to ensure that our performance testing results are meaningful and are only reflective of the changes in the system under test (the driver, or, in this case, the ODM). If the goal is only to track the performance of ODMs relative to each other and relative to the corresponding drivers, is the intention to have the drivers also implement these tests against the latest server so that we could get that apples-to-apples comparison?

@aclark4life (Aug 29, 2025):

> Are we concerned at all about accounting for performance variation due to server performance differences?

From the Django implementation:

> This is NOT intended to be a comprehensive test suite for every operation, only the most common and widely applicable

@NoahStapp and @Jibola are working on this project for DBX Python (although I am reviewing the implementation PR), so this is just a drive-by comment from me, but my impression is that the spec is at least initially focused on getting all the ODMs to agree on what to test.

> In the drivers, we keep the server version patch-pinned and upgrade rarely and intentionally via independent commits in order to ensure that our performance testing results are meaningful and are only reflective of the changes in the system under test (the driver, or, in this case, the ODM). If the goal is only to track the performance of ODMs relative to each other and relative to the corresponding drivers, is the intention to have the drivers also implement these tests against the latest server so that we could get that apples-to-apples comparison?

One more drive-by comment: I'd expect each ODM to "perform well" under similar server circumstances (testing the driver is a good call-out!), but I'm not sure apples-to-apples is the goal. If other ODMs test their performance using the spec and can demonstrate "good performance" and/or catch performance issues they would otherwise have missed, that would indicate some measure of success to me in the spec design.

Contributor Author:

I chose latest stable server version here for the following reason: we've made server performance an explicit company-wide goal. When users experience performance issues on older server versions, one of the first things we recommend is that they upgrade to a newer version. At least in the Python driver, we only run performance tests against 8.0. Using the latest stable version ensures that our performance tests always take advantage of any server improvements and isolate performance issues in the ODM or underlying driver.

Implementing these same tests in the driver for a direct apples-to-apples comparison is a significant amount of work. Several of the tests here use datasets similar to those in the driver tests for easier comparison, so using the same server version as the driver tests to reduce differences could be useful.

Contributor:

> Using the latest stable version ensures that our performance tests always take advantage of any server improvements and isolate performance issues in the ODM or underlying driver.

I think we should be careful about our goals here: if it is to take advantage of any server improvements and track performance explicitly relative to the most current server performance, then this approach is fine. However, this approach will not isolate performance issues in the ODM or driver because: 1) server performance is not guaranteed to always improve in every release for every feature: the overall trends of the server performance for most features will hopefully keep moving up, but between releases there may be "acceptable" regressions to certain features that are considered a tradeoff to an improvement in another area, and 2) server performance improvements could mask ODM regressions that happen concurrently with the server upgrade. We should be explicit about accepting both of these risks if we are going to move forward with this approach (i.e., note this somewhere in the spec text).

Contributor Author:

Good callouts. What if we test the benchmarks against both the latest stable version as well as the latest major release? Currently that would be 8.1 and 8.0, for example. That would give us a yearly cadence of upgrading that should allow us to catch server regressions without blindly masking ODM regressions.

Contributor:

Good point about the stability. If we see a perf regression (or improvement), we then have to consider whether we actually made things worse (or better) or if we happened to run on a newer server version that had different perf characteristics. We have correctness tests against different server versions. I don't think there is value in testing the server's performance in our ODM tests. Thus I would suggest we choose 8.0.13 (latest stable as of today) and make an explicit choice to update it on an annual cadence.

Contributor Author:

The main advantage of testing against rapid server versions is query performance improvements. Since ODMs necessarily construct database queries for the user, they don't have any control over what's actually sent to the server barring a feature like raw_aggregate that allows them to specify the actual query itself. With the server improving query performance and optimization (for example, $in inside $expr using indexes starting in 8.1: SERVER-32549), it's possible we run into situations where the best way to improve performance is for a user to upgrade their server version. Some of these, such as using $expr where it's not necessary, can be fixed with ODM code improvements, but that isn't a guarantee. Being able to tell users that upgrading to the latest rapid release will improve performance for their use case could be helpful, but I can see the downside of testing an additional server version besides latest stable.

@JamesKovacs (Contributor) left a comment:

Overall looking good. The most pressing concerns are around the percentile calculation and picking a stable server version to test against.

- Sort the array into ascending order (i.e. shortest time first)
- Let the index i for percentile p in the range [1,100] be defined as: `i = int(N * p / 100) - 1`

*N.B. This is the [Nearest Rank](https://en.wikipedia.org/wiki/Percentile#The_Nearest_Rank_method) algorithm, chosen for

Contributor:

The `#The_Nearest_Rank_method` anchor should be `#The_nearest-rank_method`.


- Given a 0-indexed array A of N iteration wall clock times
- Sort the array into ascending order (i.e. shortest time first)
- Let the index i for percentile p in the range [1,100] be defined as: `i = int(N * p / 100) - 1`

Contributor:

Given that the maximum iteration count is 10 (see line 109 above), the 90th, 95th, 98th, and 99th percentiles will all be A[8] since int(float) truncates the float.
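
A small sketch of the nearest-rank formula quoted above makes the truncation behavior concrete (the function name is mine, not spec text):

```python
def nearest_rank_index(n: int, p: int) -> int:
    # Index for percentile p in [1, 100] over a sorted, 0-indexed array of n
    # samples, using the formula from the spec excerpt: i = int(N * p / 100) - 1
    return int(n * p / 100) - 1


# With only 10 iterations, the high percentiles collapse onto the same sample:
print([(p, nearest_rank_index(10, p)) for p in (50, 90, 95, 98, 99, 100)])
# [(50, 4), (90, 8), (95, 8), (98, 8), (99, 8), (100, 9)]
```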

Contributor:

As noted in the Wikipedia article, fewer than 100 measurements will result in the same value being reported for multiple percentiles.

Contributor Author:

Good points--this whole section is copied from the existing driver benchmark spec for consistency, which raises the question of whether we should (as a separate ticket) update that spec as well. I would say yes, to keep both benchmarking specs as consistent in behavior and design as possible.

The data size, shape, and specific operation of a benchmark are the limiting factors for how many iterations are ultimately run. We expect most of the tests to run more than 100 iterations in the allotted time, but the more expensive ones don't. Have we historically used these percentiles, or do we plan to in the future? From my experience, at least the Python team primarily uses the MB/s metric to identify regressions. If this is a consistent pattern across teams and continues to be, recording this additional data doesn't seem useful.
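
For context, here is a minimal sketch of the MB/s metric mentioned above, assuming it mirrors the driver benchmark approach of dividing a task's dataset size by the median (50th percentile) wall-clock time; the function and parameter names are mine, not spec text.

```python
def megabytes_per_second(task_bytes: int, iteration_times: list[float]) -> float:
    # Composite throughput for one benchmark task: dataset size in MB divided
    # by the median iteration time (nearest-rank 50th percentile, as above).
    ordered = sorted(iteration_times)
    median = ordered[int(len(ordered) * 50 / 100) - 1]
    return (task_bytes / 1_000_000) / median
```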

Unless otherwise specified, the number of iterations to measure per task is variable:

- iterations should loop for at least 30 seconds cumulative execution time
- iterations should stop after 10 iterations or 1 minute cumulative execution time, whichever is shorter

Contributor:

Those two conditions seem to be working at cross purposes. The measurement should loop for at least 30 seconds but not more than 60, but stop after 10 iterations. This caps the number of iterations at 10, possibly fewer if each iteration takes longer than 6 seconds.

Contributor Author:

This is confusingly worded on my part (it was also taken from the driver benchmarking spec).

The intent is a 30-second minimum execution time with a 120-second cap: once the minimum time is reached, execution stops as soon as at least 10 iterations have completed or cumulative execution time reaches 120 seconds, whichever comes first.
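
A minimal sketch of that intent, with the harness name and defaults as assumptions rather than spec text:

```python
import time


def run_task(task, min_seconds=30, max_seconds=120, min_iterations=10):
    # Run `task` repeatedly: always accumulate at least `min_seconds` of
    # execution time, then stop once `min_iterations` iterations have completed
    # or cumulative time reaches `max_seconds`, whichever comes first.
    times = []
    elapsed = 0.0
    while True:
        start = time.monotonic()
        task()
        times.append(time.monotonic() - start)
        elapsed += times[-1]
        if elapsed < min_seconds:
            continue
        if len(times) >= min_iterations or elapsed >= max_seconds:
            return times
```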

The data will be stored as strict JSON with no extended types. These JSON representations must be converted into
equivalent models as part of each benchmark task.

Flat model benchmark tasks include:s

Contributor:

Extraneous s at the end of the line.


| Phase | Description |
| ----------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| Setup | Load the SMALL_DOC dataset into memory as an ODM-appropriate model object. Insert 10,000 instances into the database, saving the inserted `id` field for each into a list. |

Contributor:

Is this the _id?

Contributor Author:

Yes. I'll update the wording to clarify, since ODM naming conventions for the document `_id` will vary.
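
As an illustration of that Setup phase, here is a hedged sketch using a Django-style model; `SmallDoc`, the file path, and the assumption that the model's primary key maps to the document's `_id` are example choices, not requirements of the spec.

```python
import json

from myapp.models import SmallDoc  # assumed model mirroring the SMALL_DOC dataset


def setup_small_doc_benchmark(path="small_doc.json"):
    # Load SMALL_DOC into memory, insert 10,000 model instances, and record
    # each inserted document's _id (exposed here via the model's primary key).
    with open(path) as f:
        template = json.load(f)

    inserted_ids = []
    for _ in range(10_000):
        doc = SmallDoc(**template)
        doc.save()
        inserted_ids.append(doc.pk)
    return inserted_ids
```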

Summary: This benchmark tests ODM performance creating a single large model.

Dataset: The dataset (LARGE_DOC) is contained within `large_doc.json` and consists of a sample document stored as strict
JSON with an encoded length of approximately 8,000 bytes.

Contributor:

8,000 bytes is still relatively small. Do we want to have perf tests for huge documents close to the 16MB limit? While we may not recommend large models, customers will run into these scenarios especially if their models contain large arrays of subdocuments.

Contributor Author:

Close to the 16MB limit seems excessive and would increase execution time significantly. Increasing the size here to be a few MB, similar to what the driver benchmarks use for their large document tests, would likely result in similar performance characteristics without as large of a latency hit. The downside to increasing the size of documents here is that we need to define the data's structure carefully to not significantly complicate the process of model creation for implementing ODMs, which is not a concern for the driver benchmarks.

@NoahStapp NoahStapp removed the request for review from alexbevi September 9, 2025 19:12
- Nested models -- reading and writing nested models of various sizes, to explore basic operation efficiency for complex
data

The suite is intentionally kept small for several reasons. One, ODM feature sets vary significantly across libraries.

Reviewer:

May prefer a bulleted list here, e.g.


The suite is intentionally kept small for the following reasons:

  • ODM feature sets vary …
  • Several popular MongoDB ODMs are maintained by third-parties …

@aclark4life left a comment:

LGTM pending clarification of whether "in the repo" can replace "in the repo's test dir".


We expect substantial performance differences between ODMs based on both their language families (e.g. static vs.
dynamic or compiled vs. virtual-machine-based) as well as their inherent design (e.g. web frameworks such as Django vs.
application-agnostic such as Mongoose). However we still expect "vertical" comparison within families of ODMs to expose

Reviewer:

I don't think it is worthwhile to compare different ODMs to each other. The performance of ODMs doing different types of things varies widely based on the approach taken by the ODM, as opposed to anything the provider/adapter for Mongo is doing.

I do think it could be valuable to compare a given ODM with Mongo to that same ODM but with a similar (e.g. Cosmos) database, and a different (e.g. PostgreSQL) database. Whether or not this will show differences in the client is dependent on many things. For example, in .NET making the PostgreSQL provider faster is measurable because the data transfer and server can keep up. On the other hand, making the SQL Server provider faster makes no difference, because the wire protocol and server blocking is already the limiting factor.

It may also be useful to test raw driver to ODM perf, especially since customers often ask about this. However, in most cases the performance overhead will come from the core ODM code, rather than anything we are doing, so I doubt that there will be much actionable to come out of this.

Contributor Author:

Comparing ODMs to each other could be useful in identifying potential design or implementation issues. For example, if one ODM implements embedded document querying in an inefficient way, comparing its performance on a benchmark to a similar ODM with much better performance could unlock improvements that would be difficult to identify otherwise. Outside of that specific case, I agree that ODM comparisons are not very useful.

Comparing performance across databases is an interesting idea. Django did apples-to-apples comparisons with benchmarks against both MongoDB and Postgres and got a lot of useful data out of that. ODMs make doing so relatively easy as only the backend configuration and models (for differences like embedded documents and relational links) need to change. We'd need to be careful to isolate performance differences to the database alone as much as possible, due to all the factors you state.

Comparing raw driver to ODM perf is part of the stated goals of this project. Determining exactly which benchmarks should be directly compared is still under consideration, for both maintainability and overhead concerns.

Reviewer:

I talked to @JamesKovacs and then @NoahStapp offline about my concerns. In a nutshell, there is a lot here, and perf testing is hard and resource (human and otherwise) intensive. It would be my preference to start with one goal, such as testing for ODM perf regressions, and get that working well end-to-end. In terms of the details here (iterations, payloads, and the scenarios themselves) they look fine as starting points, and then we can modify them as we start getting data based on what we are seeing.

I'm going to say LGTM here, because this feels like a plan we can get started with and then iterate on.

to the relatively long runtime of the benchmarks, including them as part of an automated suite that runs against every
PR is not recommended. Instead, scheduling benchmark runs on a regular cadence is the recommended method of automating
this suite of tests.

Reviewer:

Do we have a dedicated, isolated perf lab, with machines that won't get changes unless we know about it? My experience with perf testing over many years is that unless you have such a system, then the noise makes it very difficult to see when things really change. For example, OS updates, platform/language changes, virus checking, auto-updates kicking in mid run, and so on, all make the data hard to interpret.

How do you currently handle driver perf test machines? Can you point me to charts, or even raw data I guess, that show variation/noise over time? Also, how often do they run? Is there only a single change between each run so that it's feasible to trace back a perf difference to a single change, be that external or a code change?

Contributor Author:

Here's an example of what the Python driver perf tests output. The driver perf tests have experienced all of the issues you've stated, but still provide useful metrics that let us catch regressions and identify places for improvement. Running on Evergreen doesn't allow us (AFAIK) to have our own dedicated set of machines.

The Python driver perf tests run weekly.

Contributor:

Drivers do have access to a special host in evergreen to run dedicated performance tasks on to ensure stability and consistency (rhel90-dbx-perf-large).

| After task | n/a. |
| Teardown | Drop the collection associated with the LARGE_DOC_NESTED model. |

## Benchmark platform, configuration and environments

Contributor:

We call out client and server, but not the underlying platform configuration. Given how much the underlying hardware can impact performance, if we want meaningfully comparable perf results across different ODMs (or between ODMs and drivers), we need to keep as much of the surrounding environment fixed as possible.

Contributor Author:

Is calling out a specific evergreen host/distro configuration sufficient for that purpose? The drivers benchmarking spec doesn't call out specifics, so we should choose a standard for both and ensure it's present.

Contributor:

I agree that we should be consistent. For drivers benchmarks, we call it out in our internal "Drivers Performance Testing Infrastructure Guidelines" doc, and it was a requirement added in DRIVERS-2666. We don't capture that in the spec itself because the precise distro names are specific to our internal tooling. We also don't capture driver-specific platform pinning requirements (e.g., the Node.js runtime version for the node driver) because these can vary a lot by driver and also depend on the precise CI setup these drivers use to run the perf benchmarks. But I don't think it's unreasonable to have a general requirement to keep invariant as much of the system in which the perf tests execute as can be reasonably achieved.

@aclark4life aclark4life self-requested a review September 19, 2025 15:10

@aclark4life left a comment:

Agree with @ajcvickers ! I think this is a reasonable framework to get folks started doing something collaboratively across ODMs. We'll see how it goes and update the spec accordingly.
