Update Evaluate pipeline with easy, med, hard prompts and restructure artifact #474
base: main
Conversation
ready for pull request
Walkthrough
The changes introduce support for JSON-formatted datasets, composite artifact creation with detailed metrics and error tracking, expanded prompt complexity options, and enhanced evaluation result management. The CLI and documentation are updated for clarity and new features, while verbose logging is suppressed and error handling is improved throughout the evaluation pipeline.
Changes
Sequence Diagram(s)
sequenceDiagram
    participant User
    participant CLI
    participant EvaluationPipeline
    participant SyntheticIntentGenerator
    participant SearchEvaluator
    participant W&B
    User->>CLI: Run evaluation/generation command
    CLI->>EvaluationPipeline: Initialize with prompt_type, dataset, etc.
    alt Generate-and-evaluate mode
        EvaluationPipeline->>SyntheticIntentGenerator: Generate synthetic intents (prompt_type)
        SyntheticIntentGenerator-->>EvaluationPipeline: Return DataFrame with prompts
        EvaluationPipeline->>SearchEvaluator: Evaluate dataset
        SearchEvaluator-->>EvaluationPipeline: Return metrics, incorrect results
        EvaluationPipeline->>W&B: Log composite artifact (dataset, metrics, errors)
    else Generate-only mode
        EvaluationPipeline->>SyntheticIntentGenerator: Generate and save dataset as JSON
        SyntheticIntentGenerator->>W&B: Save dataset artifact
    else Evaluate-only mode
        EvaluationPipeline->>W&B: Download dataset artifact (JSON)
        EvaluationPipeline->>SearchEvaluator: Evaluate dataset
        SearchEvaluator-->>EvaluationPipeline: Return metrics, incorrect results
        EvaluationPipeline->>W&B: Log composite artifact (metrics, errors)
    end
    EvaluationPipeline-->>CLI: Print summary metrics
Possibly related PRs
Suggested reviewers
Poem
Actionable comments posted: 2
Nitpick comments (4)
backend/evals/README.md (3)
62-62: Fix markdown heading style. Remove the trailing colon from the heading to comply with markdown best practices.
-##### Examples:
+##### Examples
Tools: markdownlint-cli2 (0.17.2): 62-62: Trailing punctuation in heading. Punctuation: ':' (MD026, no-trailing-punctuation)
91-91: Fix markdown heading style. Remove the trailing colon from the heading to comply with markdown best practices.
-##### Use Cases:
+##### Use Cases
Tools: markdownlint-cli2 (0.17.2): 91-91: Trailing punctuation in heading. Punctuation: ':' (MD026, no-trailing-punctuation)
118-118: Fix markdown heading style. Remove the trailing colon from the heading to comply with markdown best practices.
-##### Important Notes:
+##### Important Notes
Tools: markdownlint-cli2 (0.17.2): 118-118: Trailing punctuation in heading. Punctuation: ':' (MD026, no-trailing-punctuation)
backend/evals/evaluation_pipeline.py (1)
92-96: Simplify control flow by removing the unnecessary else block. The else block is unnecessary after the return statement; removing it improves readability.
-    if dataset_filename.endswith(".json"):
-        return pd.read_json(dataset_path, orient="records")
-    else:
-        return pd.read_csv(dataset_path)
+    if dataset_filename.endswith(".json"):
+        return pd.read_json(dataset_path, orient="records")
+    return pd.read_csv(dataset_path)
Tools: Pylint (3.3.7): [refactor] 92-95: Unnecessary "else" after "return", remove the "else" and de-indent the code inside it (R1705)
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Files selected for processing (6)
- .gitignore (1 hunks)
- backend/evals/README.md (1 hunks)
- backend/evals/evaluation_pipeline.py (9 hunks)
- backend/evals/intent_prompts.py (1 hunks)
- backend/evals/search_evaluator.py (3 hunks)
- backend/evals/synthetic_intent_generator.py (3 hunks)
Additional context used
Code Graph Analysis (1)
backend/evals/evaluation_pipeline.py (2)
- backend/evals/search_evaluator.py (1): SearchEvaluator (20-184)
- backend/evals/synthetic_intent_generator.py (3): SyntheticIntentGenerator (17-196), _fetch_app_function_data (52-72), _generate_intent (80-97)
markdownlint-cli2 (0.17.2)
backend/evals/README.md
62-62: Trailing punctuation in heading
Punctuation: ':'
(MD026, no-trailing-punctuation)
91-91: Trailing punctuation in heading
Punctuation: ':'
(MD026, no-trailing-punctuation)
118-118: Trailing punctuation in heading
Punctuation: ':'
(MD026, no-trailing-punctuation)
Pylint (3.3.7)
backend/evals/evaluation_pipeline.py
[refactor] 92-95: Unnecessary "else" after "return", remove the "else" and de-indent the code inside it
(R1705)
[refactor] 97-97: Too many arguments (6/5)
(R0913)
[refactor] 97-97: Too many positional arguments (6/5)
(R0917)
[refactor] 97-97: Too many local variables (19/15)
(R0914)
[refactor] 138-138: Consider using 'with' for resource-allocating operations
(R1732)
[refactor] 148-148: Consider using 'with' for resource-allocating operations
(R1732)
[refactor] 158-158: Consider using 'with' for resource-allocating operations
(R1732)
[refactor] 415-415: Too many arguments (6/5)
(R0913)
[refactor] 415-415: Too many positional arguments (6/5)
(R0917)
Additional comments (12)
.gitignore (1)
167-168: LGTM! Appropriate ignore patterns added. The addition of the tmp/ and wandb/ entries is appropriate for ignoring temporary files and Weights & Biases artifacts generated by the evaluation system.
backend/evals/synthetic_intent_generator.py (3)
8-8: LGTM! Import reordering is clean. The wandb import repositioning doesn't affect functionality and maintains good import organization.
143-149: LGTM! Good file format flexibility with backward compatibility. The enhancement to support both JSON and CSV formats based on file extension is well implemented. Defaulting to CSV ensures backward compatibility while adding flexibility for different use cases.
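The extension-based branching amounts to something like this minimal sketch (not the repository's exact code; `df` and `dataset_filename` are assumed names):

```python
import pandas as pd

def save_dataset(df: pd.DataFrame, dataset_filename: str) -> None:
    """Persist the dataset as JSON, or fall back to CSV, based on the file extension."""
    if dataset_filename.endswith(".json"):
        # orient="records" writes one JSON object per row, matching how the pipeline reads it back
        df.to_json(dataset_filename, orient="records", indent=2)
    else:
        df.to_csv(dataset_filename, index=False)
```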
186-189: LGTM! Dynamic prompt column naming aligns with the new prompt types. The change to use self.prompt_type as the column name is consistent with the new multi-prompt system and maintains proper data organization.
backend/evals/search_evaluator.py (3)
16-17: LGTM! Appropriate logging level adjustment. Suppressing verbose httpx logging to the WARNING level reduces noise while maintaining important error visibility.
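For reference, this kind of suppression is typically a single line placed at import time, along these lines:

```python
import logging

# Drop httpx's per-request INFO logs; warnings and errors still come through.
logging.getLogger("httpx").setLevel(logging.WARNING)
```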
55-57: LGTM! Reasonable timeout increase for search operations. The 120-second timeout is appropriate for potentially complex search operations and provides better reliability than default timeouts.
148-182: LGTM! Improved evaluation loop with better progress tracking. The refactored evaluation loop provides better control over progress reporting and maintains the same evaluation logic. The use of the walrus operator in the results construction is correct, though complex: it efficiently tracks when the expected function is found while building the results list.
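As a standalone illustration of that pattern (not the evaluator's actual code; the function names and result fields here are made up):

```python
# Track whether the expected function has appeared while building the ranked results list.
expected_function = "github__create_issue"  # hypothetical expected match
search_results = ["github__list_issues", "github__create_issue", "slack__send_message"]

found = False
results = [
    {
        "rank": rank,
        "function": name,
        # The assignment expression updates `found` as the comprehension runs.
        "is_expected": (found := found or name == expected_function),
    }
    for rank, name in enumerate(search_results, start=1)
]

print(found)       # True
print(results[1])  # {'rank': 2, 'function': 'github__create_issue', 'is_expected': True}
```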
backend/evals/README.md (1)
1-182: Excellent, comprehensive documentation! The README provides thorough coverage of the evaluation system, including prompt types, usage examples, and configuration details. The documentation will greatly help users understand and utilize the multi-difficulty evaluation framework.
Tools: markdownlint-cli2 (0.17.2)
62-62: Trailing punctuation in heading. Punctuation: ':' (MD026, no-trailing-punctuation)
91-91: Trailing punctuation in heading. Punctuation: ':' (MD026, no-trailing-punctuation)
118-118: Trailing punctuation in heading. Punctuation: ':' (MD026, no-trailing-punctuation)
backend/evals/intent_prompts.py (4)
4-21: LGTM! Well-designed easy prompt template. The easy prompt provides clear, direct instructions that will generate straightforward user intents. The constraints are appropriate for the difficulty level.
24-41: LGTM! Appropriate medium-difficulty prompt. The medium prompt strikes a good balance between conversational naturalness and clear intent expression. The contextual requirements will generate more realistic user scenarios.
44-64: LGTM! Sophisticated hard prompt design. The hard prompt effectively creates complex, enterprise-level scenarios that will thoroughly test the search system's ability to understand indirect and contextual intents. The constraints appropriately increase the challenge.
68-72: LGTM! Clean prompt dictionary interface. The PROMPTS dictionary provides a clean, extensible interface for accessing the different prompt types throughout the system.
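For orientation, the interface presumably looks something like the sketch below. Only prompt_easy and prompt_hard appear by name in this PR, so the medium key, the builder bodies, and the row column names are assumptions:

```python
import pandas as pd

def prompt_easy(row: pd.Series) -> str:
    # Hypothetical column names; the real templates live in intent_prompts.py.
    return f"Write a short, direct request that uses {row['function_name']} in {row['app_name']}."

def prompt_medium(row: pd.Series) -> str:
    return f"Write a conversational request, with some context, that implies {row['function_name']} in {row['app_name']}."

def prompt_hard(row: pd.Series) -> str:
    return f"Describe an enterprise scenario that only indirectly calls for {row['function_name']} in {row['app_name']}."

PROMPTS = {
    "prompt_easy": prompt_easy,
    "prompt_medium": prompt_medium,
    "prompt_hard": prompt_hard,
}

# The pipeline applies the selected builder row by row:
# df[prompt_type] = df.apply(PROMPTS[prompt_type], axis=1)
```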
Reviewed hunk (backend/evals/evaluation_pipeline.py):

df = self.generator._fetch_app_function_data()

if df.empty:
    raise ValueError(
        "No app and function data found in the database. Please seed the database."
    )

if generation_limit:
    df = df[:generation_limit]

# Generate intents
from tqdm import tqdm

from evals.intent_prompts import PROMPTS

df[self.prompt_type] = df.apply(PROMPTS[self.prompt_type], axis=1)
df["synthetic_output"] = [
    self.generator._generate_intent(prompt)
    for prompt in tqdm(df[self.prompt_type])
]

logger.info(f"Generated {len(df)} synthetic intents")
Refactor suggestion
Refactor inline generation logic to maintain proper encapsulation.
The current implementation bypasses the generator's public interface by directly accessing private methods (_fetch_app_function_data and _generate_intent). This violates encapsulation principles and duplicates logic that should be centralized in the SyntheticIntentGenerator class.
Consider adding a method to the generator class that returns the dataset without saving it as an artifact, then use that method here:
# In SyntheticIntentGenerator, add:
def generate_without_saving(self, limit: int | None = None) -> pd.DataFrame:
    """Generate synthetic intents without saving to wandb."""
    df = self._fetch_app_function_data()
    if df.empty:
        raise ValueError(
            "No app and function data found in the database. Please seed the database."
        )
    if limit:
        df = df[:limit]
    # Generate intents
    from tqdm import tqdm
    df[self.prompt_type] = df.apply(PROMPTS[self.prompt_type], axis=1)
    df["synthetic_output"] = [
        self._generate_intent(prompt)
        for prompt in tqdm(df[self.prompt_type])
    ]
    return df

# Then in this method:
df = self.generator.generate_without_saving(limit=generation_limit)
logger.info(f"Generated {len(df)} synthetic intents")
Prompt for AI Agents
In backend/evals/evaluation_pipeline.py around lines 351 to 373, the code
directly calls private methods of the generator (_fetch_app_function_data and
_generate_intent), breaking encapsulation. To fix this, add a new public method
generate_without_saving(limit) to the SyntheticIntentGenerator class that
encapsulates the data fetching, validation, limiting, and intent generation
logic, returning the resulting DataFrame. Then replace the current inline logic
with a single call to this new method, passing generation_limit as the limit
argument, and log the result count as before.
Reviewed hunk (backend/evals/evaluation_pipeline.py):

from evals.intent_prompts import PROMPTS
Remove duplicate import statement.
PROMPTS is already imported at line 10, so this duplicate import is unnecessary.
- from tqdm import tqdm
-
- from evals.intent_prompts import PROMPTS
+ from tqdm import tqdm
Committable suggestion
IMPORTANT: Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test and benchmark the code to ensure it meets the requirements.
from tqdm import tqdm
Prompt for AI Agents
In backend/evals/evaluation_pipeline.py around lines 363 to 364, there is a
duplicate import of PROMPTS from evals.intent_prompts which is already imported
at line 10. Remove the import statement at line 363 to eliminate redundancy.
backend/evals/evaluation_pipeline.py (Outdated)
@@ -73,7 +79,108 @@ def _load_dataset_from_wandb(self, artifact_name: str, dataset_filename: str) ->
        """
        artifact = wandb.use_artifact(f"{artifact_name}:latest")
        artifact_dir = artifact.download()
        return pd.read_csv(os.path.join(artifact_dir, dataset_filename))

        # Support both JSON and CSV formats for backward compatibility
We can drop support for CSV, delete the old artifacts, and only keep JSON.
backend/evals/evaluation_pipeline.py (Outdated)
        # For evaluation artifacts, the dataset file is prefixed with "dataset_"
        if "_evaluation_" in artifact_name and any(char.isdigit() for char in artifact_name):
            # This is an evaluation artifact, look for dataset_<filename>
            dataset_path = os.path.join(artifact_dir, f"dataset_{dataset_filename}")
We can remove the `dataset_` prefix so we won't need this logic.
backend/evals/evaluation_pipeline.py (Outdated)
        else:
            return pd.read_csv(dataset_path)

    def _create_comprehensive_artifact(
A better name for this would be composite artifact
backend/evals/evaluation_pipeline.py (Outdated)
        from datetime import datetime

        # Create comprehensive artifact with timestamp
        timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
The timestamp is unnecessary; it's already in the artifact's metadata, and by baking a timestamp into the name we wouldn't get the advantage of wandb's versioning.
backend/evals/evaluation_pipeline.py (Outdated)

        comprehensive_artifact = wandb.Artifact(
            name=comprehensive_artifact_name,
            type="evaluation_results",
For the type, I think it's better to use dataset, since we are providing the dataset along with the metrics etc. From evaluation_results I would expect only the accuracy, MRR, etc.
@@ -140,8 +140,13 @@ def _save_to_wandb(self, df: pd.DataFrame, dataset_artifact: str, dataset_filena
            },
        )

        # Write dataframe to the temporary file
        df.to_csv(dataset_filename, index=False)
        # Save dataframe in the appropriate format based on file extension
No need to handle csv files anymore
backend/evals/README.md (Outdated)
@@ -0,0 +1,181 @@
# Function Search Evaluation System
If we have a README here, we should move over the content we already had in the backend README.
backend/evals/README.md (Outdated)
python evaluation_pipeline.py --mode generate-and-evaluate --prompt-type prompt_hard --dataset hard_intent_dataset
```

#### Dataset Filename Parameter (`--dataset-filename`)
I think we should get rid of this section; the dataset-filename can be anything the user wants and is not very important.
backend/evals/README.md (Outdated)
- **Backward compatibility**: Existing datasets without custom filenames will continue to work
- **Case sensitivity**: Filenames are case-sensitive

### Programmatic Usage
I think that just by looking at the intent_prompts file it should be clear what it does; do you think this section is needed?
backend/evals/README.md (Outdated)
python example_usage.py
```

## Environment Variables
This should be in the setup section, similar to how we have it in the backend README.
No issues found! Your code is sparkling clean! Need help? Join our Discord for support!
Actionable comments posted: 0
Duplicate comments (1)
backend/evals/evaluation_pipeline.py (1)
332-353: Address the encapsulation violation and the duplicate import. This implementation has two issues that were flagged in previous reviews:
1. Encapsulation violation: the code directly calls the private methods _fetch_app_function_data and _generate_intent, bypassing the generator's public interface.
2. Duplicate import: PROMPTS is imported twice, once at the top of the file (line 8) and again at line 345.
Apply this diff to fix the duplicate import:
- from evals.intent_prompts import PROMPTS
For the encapsulation issue, consider implementing the previously suggested solution of adding a generate_without_saving method to the SyntheticIntentGenerator class, then replace this inline logic with a call to that method.
Nitpick comments (1)
backend/evals/evaluation_pipeline.py (1)
416-425: Consider simplifying the dataset naming logic. The current logic works but could be more readable. Consider consolidating the conditions:
-    # Automatically append prompt type to dataset name for better organization
-    # BUT don't append if it's already an evaluation artifact
-    if "_evaluation" in dataset:
-        # This is already an evaluation artifact, use as-is
-        dataset_with_prompt = dataset
-    elif dataset == DEFAULT_DATASET_ARTIFACT:
-        dataset_with_prompt = f"{dataset}_{prompt_type}"
-    else:
-        # If user provided custom dataset name, still append prompt type
-        dataset_with_prompt = f"{dataset}_{prompt_type}"
+    # Automatically append prompt type to dataset name for better organization
+    # Skip if it's already an evaluation artifact
+    if "_evaluation" in dataset:
+        dataset_with_prompt = dataset
+    else:
+        dataset_with_prompt = f"{dataset}_{prompt_type}"
This eliminates the redundant logic in the elif and else branches.
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Files selected for processing (2)
- backend/.pre-commit-config.yaml (1 hunks)
- backend/evals/evaluation_pipeline.py (9 hunks)
Files skipped from review due to trivial changes (1)
- backend/.pre-commit-config.yaml
Additional context used
Code Graph Analysis (1)
backend/evals/evaluation_pipeline.py (2)
- backend/evals/search_evaluator.py (1): SearchEvaluator (20-264)
- backend/evals/synthetic_intent_generator.py (3): SyntheticIntentGenerator (17-192), _fetch_app_function_data (52-72), _generate_intent (80-97)
Additional comments (9)
backend/evals/evaluation_pipeline.py (9)
7-8: LGTM: Import additions are appropriate. The addition of the wandb and PROMPTS imports aligns with the new functionality for composite artifacts and prompt type support.
15-16: LGTM: Suppressing verbose logging improves the user experience. Suppressing verbose httpx logging is a good practice to reduce noise in the output, consistent with similar changes in the search evaluator.
19-19: LGTM: JSON format change is consistent with the overall architecture. Changing the default dataset filename from CSV to JSON aligns with the broader move to JSON format throughout the codebase.
40-40: LGTM: Default prompt type change aligns with the new prompt structure. Changing the default from "task" to "prompt_easy" is consistent with the new prompt functions introduced in the codebase.
80-82: LGTM: JSON dataset loading is consistent with the format change. The change from pd.read_csv to pd.read_json with orient="records" is consistent with the switch to JSON format for datasets.
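A minimal sketch of this load path, assuming an active wandb run (the helper name is illustrative):

```python
import os

import pandas as pd
import wandb

def load_dataset_from_wandb(artifact_name: str, dataset_filename: str) -> pd.DataFrame:
    """Download the latest dataset artifact and read its JSON records file."""
    artifact = wandb.use_artifact(f"{artifact_name}:latest")  # requires wandb.init() beforehand
    artifact_dir = artifact.download()
    return pd.read_json(os.path.join(artifact_dir, dataset_filename), orient="records")
```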
84-166: LGTM: Well-implemented composite artifact creation with proper cleanup. The _create_composite_artifact method is well structured, with:
- Proper temporary file handling
- Explicit file closing to ensure data is written
- Comprehensive cleanup in the finally block
- Clear logging of results
The method effectively bundles dataset, metrics, and incorrect results into a single artifact.
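A rough, self-contained sketch of such a composite artifact using the public wandb API; the file names and metrics payload are illustrative, and type="dataset" follows the reviewer's earlier suggestion in this thread:

```python
import json
import os
import shutil
import tempfile

import pandas as pd
import wandb

def log_composite_artifact(name: str, dataset: pd.DataFrame,
                           metrics: dict, incorrect: pd.DataFrame) -> None:
    """Bundle the dataset, summary metrics, and incorrect results into one artifact."""
    tmpdir = tempfile.mkdtemp()
    try:
        dataset_path = os.path.join(tmpdir, "dataset.json")
        metrics_path = os.path.join(tmpdir, "metrics.json")
        errors_path = os.path.join(tmpdir, "incorrect_results.json")

        dataset.to_json(dataset_path, orient="records", indent=2)
        incorrect.to_json(errors_path, orient="records", indent=2)
        with open(metrics_path, "w") as f:
            json.dump(metrics, f, indent=2)

        artifact = wandb.Artifact(name=name, type="dataset")
        for path in (dataset_path, metrics_path, errors_path):
            artifact.add_file(path)
        wandb.log_artifact(artifact)  # requires an active wandb run
    finally:
        # Remove the temporary files once the artifact has been logged.
        shutil.rmtree(tmpdir, ignore_errors=True)
```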
225-262: LGTM: Enhanced evaluation with detailed metrics and composite artifacts. The evaluation method improvements include:
- Extracting incorrect results for separate analysis
- Creating detailed metrics with evaluation configuration
- Using the new composite artifact method
- Logging only summary metrics to wandb to avoid clutter
This provides better tracking and analysis capabilities while maintaining clean wandb logs.
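As a standalone illustration of the summary-only logging idea (the is_correct column and metric names here are assumptions, not the evaluator's actual schema):

```python
import pandas as pd
import wandb

def log_summary(results: pd.DataFrame) -> dict:
    """Split out the incorrect rows and send only compact summary metrics to wandb."""
    incorrect = results[~results["is_correct"]]
    summary = {
        "accuracy": float(results["is_correct"].mean()),
        "num_evaluated": int(len(results)),
        "num_incorrect": int(len(incorrect)),
    }
    wandb.log(summary)  # the full per-row results stay in the composite artifact instead
    return summary
```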
375-380: LGTM: Well-designed CLI option for prompt type selection. The new --prompt-type option provides a good user experience, with:
- Clear choices from the PROMPTS dictionary
- Sensible default value
- Helpful description and default display
This allows users to easily select different prompt complexity levels.
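For reference, an option like this can be declared in a few lines; this sketch assumes a click-based CLI (not confirmed by the PR) and makes up the default dataset name:

```python
import click

# Stand-in for the real PROMPTS mapping in intent_prompts.py.
PROMPTS = {"prompt_easy": None, "prompt_medium": None, "prompt_hard": None}

@click.command()
@click.option(
    "--prompt-type",
    type=click.Choice(list(PROMPTS.keys())),
    default="prompt_easy",
    show_default=True,
    help="Complexity of the synthetic intents to generate.",
)
@click.option("--dataset", default="intent_dataset", show_default=True,
              help="Base name of the dataset artifact.")
def main(prompt_type: str, dataset: str) -> None:
    click.echo(f"prompt_type={prompt_type}, dataset={dataset}")

if __name__ == "__main__":
    main()
```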
382-386: LGTM: Simplified option naming improves CLI usability. Renaming --dataset-artifact to --dataset makes the CLI more concise and user-friendly while maintaining the same functionality.
Ticket
#473
Description
Implements the solution proposed in the issue.
Demo (if applicable)
Screenshots (if applicable)
Checklist
Summary by CodeRabbit
New Features
Improvements
Documentation
Chores
Updated .gitignore to exclude wandb-related files.