
Conversation


@irmadong irmadong commented Jun 10, 2025

🏷️ Ticket

#473

πŸ“ Description

Implements the solution proposed in the linked issue.

🎥 Demo (if applicable)

📸 Screenshots (if applicable)

✅ Checklist

  • [x] I have signed the Contributor License Agreement (CLA) and read the contributing guide (required)
  • [x] I have linked this PR to an issue or a ticket (required)
  • [x] I have updated the documentation related to my change if needed
  • [x] I have updated the tests accordingly (required for a bug fix or a new feature)
  • [x] All checks on CI passed

Summary by CodeRabbit

  • New Features

    • Added support for multiple prompt types ("easy", "medium", "hard") for generating user intents, selectable via a new command-line option.
    • Introduced composite artifact creation, bundling datasets, detailed metrics, and incorrect results for enhanced evaluation tracking.
    • Added API connection testing functionality.
  • Improvements

    • Switched dataset format from CSV to JSON for all evaluation and generation processes.
    • Enhanced error handling and logging for search operations.
    • Refined dataset naming conventions and prompt column handling.
    • Improved progress tracking and result detail in evaluation output.
  • Documentation

    • Expanded and restructured the evaluation pipeline documentation, clarifying setup, usage, metrics, and results management.
  • Chores

    • Updated .gitignore to exclude wandb-related files.

Contributor

coderabbitai bot commented Jun 10, 2025

Caution

Review failed

The head commit changed during the review from 4428ea0 to b7df0e4.

Walkthrough

The changes introduce support for JSON-formatted datasets, composite artifact creation with detailed metrics and error tracking, expanded prompt complexity options, and enhanced evaluation result management. The CLI and documentation are updated for clarity and new features, while verbose logging is suppressed and error handling is improved throughout the evaluation pipeline.

Changes

| File(s) | Change Summary |
| --- | --- |
| .gitignore | Added wandb/ to ignore the Weights & Biases local directory. |
| backend/evals/evaluation_pipeline.py | Updated to load/save JSON datasets, create composite artifacts with metrics and errors, support prompt type selection, refine the CLI (renamed flags, added --prompt-type), and enhance artifact naming logic. Suppressed verbose logging and changed the default prompt type and dataset filename. |
| backend/evals/intent_prompts.py | Replaced the single prompt function with three: prompt_easy, prompt_medium, prompt_hard, each offering increasing complexity. Updated the PROMPTS mapping. |
| backend/evals/search_evaluator.py | Suppressed httpx logging. Improved error handling and timeout in _search. Enhanced metrics tracking (added a "correct" count), rewrote rank finding, and refactored result processing. Added test_connection and _search_functions methods. |
| backend/evals/synthetic_intent_generator.py | Changed dataset output from CSV to JSON. Made the prompt column dynamic based on prompt_type. Adjusted synthetic output generation to use the new column naming. |
| backend/README.md | Expanded and restructured documentation: clarified evaluation pipeline purpose, prerequisites, metrics, CLI usage, prompt types, and artifact/result management. Updated for JSON datasets and new CLI options. |
| backend/.pre-commit-config.yaml | Added a pre-commit configuration with a ruff hook for automatic code fixing on commit. |

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant CLI
    participant EvaluationPipeline
    participant SyntheticIntentGenerator
    participant SearchEvaluator
    participant W&B

    User->>CLI: Run evaluation/generation command
    CLI->>EvaluationPipeline: Initialize with prompt_type, dataset, etc.
    alt Generate-and-evaluate mode
        EvaluationPipeline->>SyntheticIntentGenerator: Generate synthetic intents (prompt_type)
        SyntheticIntentGenerator-->>EvaluationPipeline: Return DataFrame with prompts
        EvaluationPipeline->>SearchEvaluator: Evaluate dataset
        SearchEvaluator-->>EvaluationPipeline: Return metrics, incorrect results
        EvaluationPipeline->>W&B: Log composite artifact (dataset, metrics, errors)
    else Generate-only mode
        EvaluationPipeline->>SyntheticIntentGenerator: Generate and save dataset as JSON
        SyntheticIntentGenerator->>W&B: Save dataset artifact
    else Evaluate-only mode
        EvaluationPipeline->>W&B: Download dataset artifact (JSON)
        EvaluationPipeline->>SearchEvaluator: Evaluate dataset
        SearchEvaluator-->>EvaluationPipeline: Return metrics, incorrect results
        EvaluationPipeline->>W&B: Log composite artifact (metrics, errors)
    end
    EvaluationPipeline-->>CLI: Print summary metrics

Possibly related PRs

  • Fixed docs #467: Updated CLI documentation and flag naming in the README, which is comprehensively expanded and revised in this PR.
  • Added filename #440: Added dataset filename handling and artifact logic, which is further refined in this PR with JSON support and composite artifacts.
  • Feat/include evals for function search #367: Introduced the initial evaluation pipeline, which the current PR extends with prompt complexity, artifact management, and improved metrics.

Suggested reviewers

  • thisisfixer

Poem

In JSON fields the data grows,
With prompts from easy to hard it flows.
Artifacts bundled, errors in tow,
Metrics and logs now clearly show.
Rabbits hop, evaluations run,
With every tweak, more work is doneβ€”
Hooray for progress, code and fun! πŸ‡βœ¨


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Explain this complex logic.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai explain this code block.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and explain its main purpose.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR.
  • @coderabbitai generate sequence diagram to generate a sequence diagram of the changes in this PR.
  • @coderabbitai resolve to resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (4)
backend/evals/README.md (3)

62-62: Fix markdown heading style.

Remove the trailing colon from the heading to comply with markdown best practices.

-##### Examples:
+##### Examples
🧰 Tools
🪛 markdownlint-cli2 (0.17.2)

62-62: Trailing punctuation in heading
Punctuation: ':'

(MD026, no-trailing-punctuation)


91-91: Fix markdown heading style.

Remove the trailing colon from the heading to comply with markdown best practices.

-##### Use Cases:
+##### Use Cases
🧰 Tools
🪛 markdownlint-cli2 (0.17.2)

91-91: Trailing punctuation in heading
Punctuation: ':'

(MD026, no-trailing-punctuation)


118-118: Fix markdown heading style.

Remove the trailing colon from the heading to comply with markdown best practices.

-##### Important Notes:
+##### Important Notes
🧰 Tools
🪛 markdownlint-cli2 (0.17.2)

118-118: Trailing punctuation in heading
Punctuation: ':'

(MD026, no-trailing-punctuation)

backend/evals/evaluation_pipeline.py (1)

92-96: Simplify control flow by removing unnecessary else block.

The else block is unnecessary after the return statement. This refactoring improves code readability.

-        if dataset_filename.endswith(".json"):
-            return pd.read_json(dataset_path, orient="records")
-        else:
-            return pd.read_csv(dataset_path)
+        if dataset_filename.endswith(".json"):
+            return pd.read_json(dataset_path, orient="records")
+        return pd.read_csv(dataset_path)
🧰 Tools
πŸͺ› Pylint (3.3.7)

[refactor] 92-95: Unnecessary "else" after "return", remove the "else" and de-indent the code inside it

(R1705)

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 04d3db5 and d2f3297.

📒 Files selected for processing (6)
  • .gitignore (1 hunks)
  • backend/evals/README.md (1 hunks)
  • backend/evals/evaluation_pipeline.py (9 hunks)
  • backend/evals/intent_prompts.py (1 hunks)
  • backend/evals/search_evaluator.py (3 hunks)
  • backend/evals/synthetic_intent_generator.py (3 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (1)
backend/evals/evaluation_pipeline.py (2)
backend/evals/search_evaluator.py (1)
  • SearchEvaluator (20-184)
backend/evals/synthetic_intent_generator.py (3)
  • SyntheticIntentGenerator (17-196)
  • _fetch_app_function_data (52-72)
  • _generate_intent (80-97)
🪛 markdownlint-cli2 (0.17.2)
backend/evals/README.md

62-62: Trailing punctuation in heading
Punctuation: ':'

(MD026, no-trailing-punctuation)


91-91: Trailing punctuation in heading
Punctuation: ':'

(MD026, no-trailing-punctuation)


118-118: Trailing punctuation in heading
Punctuation: ':'

(MD026, no-trailing-punctuation)

🪛 Pylint (3.3.7)
backend/evals/evaluation_pipeline.py

[refactor] 92-95: Unnecessary "else" after "return", remove the "else" and de-indent the code inside it

(R1705)


[refactor] 97-97: Too many arguments (6/5)

(R0913)


[refactor] 97-97: Too many positional arguments (6/5)

(R0917)


[refactor] 97-97: Too many local variables (19/15)

(R0914)


[refactor] 138-138: Consider using 'with' for resource-allocating operations

(R1732)


[refactor] 148-148: Consider using 'with' for resource-allocating operations

(R1732)


[refactor] 158-158: Consider using 'with' for resource-allocating operations

(R1732)


[refactor] 415-415: Too many arguments (6/5)

(R0913)


[refactor] 415-415: Too many positional arguments (6/5)

(R0917)

🔇 Additional comments (12)
.gitignore (1)

167-168: LGTM! Appropriate ignore patterns added.

The addition of tmp/ and wandb/ entries is appropriate for ignoring temporary files and Weights & Biases artifacts generated by the evaluation system.

backend/evals/synthetic_intent_generator.py (3)

8-8: LGTM! Import reordering is clean.

The wandb import repositioning doesn't affect functionality and maintains good import organization.


143-149: LGTM! Good file format flexibility with backward compatibility.

The enhancement to support both JSON and CSV formats based on file extension is well-implemented. The default to CSV ensures backward compatibility while adding flexibility for different use cases.


186-189: LGTM! Dynamic prompt column naming aligns with new prompt types.

The change to use self.prompt_type as the column name is consistent with the new multi-prompt system and maintains proper data organization.
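To make those two behaviours concrete, here is a minimal sketch (illustrative only, assuming pandas; the helper name is hypothetical and the real logic lives in SyntheticIntentGenerator):

```python
import pandas as pd

def save_dataset(df: pd.DataFrame, dataset_filename: str) -> None:
    """Persist the dataset in the format implied by the file extension."""
    if dataset_filename.endswith(".json"):
        df.to_json(dataset_filename, orient="records")
    else:
        # CSV path kept only for backward compatibility with older artifacts.
        df.to_csv(dataset_filename, index=False)

# The prompt column is named after the selected prompt type, so a run with
# --prompt-type prompt_hard produces a "prompt_hard" column:
# df[self.prompt_type] = df.apply(PROMPTS[self.prompt_type], axis=1)
```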

backend/evals/search_evaluator.py (3)

16-17: LGTM! Appropriate logging level adjustment.

Suppressing verbose httpx logging to WARNING level reduces noise while maintaining important error visibility.


55-57: LGTM! Reasonable timeout increase for search operations.

The 120-second timeout is appropriate for potentially complex search operations and provides better reliability than default timeouts.
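Both adjustments follow standard patterns; a minimal sketch, assuming httpx is the client in use (the logger name suggests so) and with a placeholder endpoint:

```python
import logging

import httpx

# Silence per-request INFO logs from httpx while keeping warnings and errors.
logging.getLogger("httpx").setLevel(logging.WARNING)

# Allow slow search requests up to 120 seconds before timing out.
client = httpx.Client(timeout=httpx.Timeout(120.0))
response = client.get(
    "http://localhost:8000/v1/functions/search",  # placeholder URL, not the project's actual route
    params={"intent": "send an email to my manager"},
)
response.raise_for_status()
```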


148-182: LGTM! Improved evaluation loop with better progress tracking.

The refactored evaluation loop provides better control over progress reporting and maintains the same evaluation logic. The use of the walrus operator in the results construction is correct, though complex - it efficiently tracks when the expected function is found while building the results list.
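The pattern being described looks roughly like this sketch (hypothetical data and field names, not the actual SearchEvaluator code):

```python
expected = "GMAIL__SEND_EMAIL"                            # hypothetical expected function
returned = ["SLACK__POST_MESSAGE", "GMAIL__SEND_EMAIL"]   # hypothetical search results

found_rank = None
results = []
for rank, name in enumerate(returned, start=1):
    # The walrus expression records the match flag while the result row is built.
    results.append({"rank": rank, "function": name, "correct": (hit := name == expected)})
    if hit and found_rank is None:
        found_rank = rank  # rank of the first correct hit, usable for MRR
```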

backend/evals/README.md (1)

1-182: Excellent comprehensive documentation!

The README provides thorough coverage of the evaluation system, including prompt types, usage examples, and configuration details. The documentation will greatly help users understand and utilize the multi-difficulty evaluation framework.

🧰 Tools
🪛 markdownlint-cli2 (0.17.2)

62-62: Trailing punctuation in heading
Punctuation: ':'

(MD026, no-trailing-punctuation)


91-91: Trailing punctuation in heading
Punctuation: ':'

(MD026, no-trailing-punctuation)


118-118: Trailing punctuation in heading
Punctuation: ':'

(MD026, no-trailing-punctuation)

backend/evals/intent_prompts.py (4)

4-21: LGTM! Well-designed easy prompt template.

The easy prompt provides clear, direct instructions that will generate straightforward user intents. The constraints are appropriate for the difficulty level.


24-41: LGTM! Appropriate medium difficulty prompt.

The medium prompt strikes a good balance between conversational naturalness and clear intent expression. The contextual requirements will generate more realistic user scenarios.


44-64: LGTM! Sophisticated hard prompt design.

The hard prompt effectively creates complex, enterprise-level scenarios that will thoroughly test the search system's ability to understand indirect and contextual intents. The constraints appropriately increase the challenge.


68-72: LGTM! Clean prompt dictionary interface.

The PROMPTS dictionary provides a clean, extensible interface for accessing the different prompt types throughout the system.
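For reference, the interface has roughly this shape (a sketch only; the real templates in intent_prompts.py are far more detailed, and the function_name column is an assumption):

```python
def prompt_easy(row) -> str:
    return f"Write a short, direct user request that asks for '{row['function_name']}'."

def prompt_medium(row) -> str:
    return f"Write a conversational request that implies the need for '{row['function_name']}' without naming it."

def prompt_hard(row) -> str:
    return f"Write an indirect, enterprise-style scenario that can only be resolved by '{row['function_name']}'."

PROMPTS = {
    "prompt_easy": prompt_easy,
    "prompt_medium": prompt_medium,
    "prompt_hard": prompt_hard,
}
```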

Comment on lines +351 to 373
df = self.generator._fetch_app_function_data()

if df.empty:
raise ValueError(
"No app and function data found in the database. Please seed the database."
)

if generation_limit:
df = df[:generation_limit]

# Generate intents
from tqdm import tqdm

from evals.intent_prompts import PROMPTS

df[self.prompt_type] = df.apply(PROMPTS[self.prompt_type], axis=1)
df["synthetic_output"] = [
self.generator._generate_intent(prompt)
for prompt in tqdm(df[self.prompt_type])
]

logger.info(f"Generated {len(df)} synthetic intents")

Contributor


πŸ› οΈ Refactor suggestion

Refactor inline generation logic to maintain proper encapsulation.

The current implementation bypasses the generator's public interface by directly accessing private methods (_fetch_app_function_data and _generate_intent). This violates encapsulation principles and duplicates logic that should be centralized in the SyntheticIntentGenerator class.

Consider adding a method to the generator class that returns the dataset without saving it as an artifact, then use that method here:

# In SyntheticIntentGenerator, add:
def generate_without_saving(self, limit: int | None = None) -> pd.DataFrame:
    """Generate synthetic intents without saving to wandb."""
    df = self._fetch_app_function_data()
    
    if df.empty:
        raise ValueError(
            "No app and function data found in the database. Please seed the database."
        )
    
    if limit:
        df = df[:limit]
    
    # Generate intents
    from tqdm import tqdm
    df[self.prompt_type] = df.apply(PROMPTS[self.prompt_type], axis=1)
    df["synthetic_output"] = [
        self._generate_intent(prompt)
        for prompt in tqdm(df[self.prompt_type])
    ]
    
    return df

# Then in this method:
df = self.generator.generate_without_saving(limit=generation_limit)
logger.info(f"Generated {len(df)} synthetic intents")
🤖 Prompt for AI Agents
In backend/evals/evaluation_pipeline.py around lines 351 to 373, the code
directly calls private methods of the generator (_fetch_app_function_data and
_generate_intent), breaking encapsulation. To fix this, add a new public method
generate_without_saving(limit) to the SyntheticIntentGenerator class that
encapsulates the data fetching, validation, limiting, and intent generation
logic, returning the resulting DataFrame. Then replace the current inline logic
with a single call to this new method, passing generation_limit as the limit
argument, and log the result count as before.

Comment on lines +363 to +364

from evals.intent_prompts import PROMPTS
Contributor


⚠️ Potential issue

Remove duplicate import statement.

The PROMPTS module is already imported at line 10. This duplicate import is unnecessary.

-                    from tqdm import tqdm
-
-                    from evals.intent_prompts import PROMPTS
+                    from tqdm import tqdm
πŸ“ Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
from evals.intent_prompts import PROMPTS
from tqdm import tqdm
🤖 Prompt for AI Agents
In backend/evals/evaluation_pipeline.py around lines 363 to 364, there is a
duplicate import of PROMPTS from evals.intent_prompts which is already imported
at line 10. Remove the import statement at line 363 to eliminate redundancy.

@@ -73,7 +79,108 @@ def _load_dataset_from_wandb(self, artifact_name: str, dataset_filename: str) ->
"""
artifact = wandb.use_artifact(f"{artifact_name}:latest")
artifact_dir = artifact.download()
return pd.read_csv(os.path.join(artifact_dir, dataset_filename))

# Support both JSON and CSV formats for backward compatibility
Contributor


We can drop support for CSV, delete the old artifacts, and keep only JSON.

# For evaluation artifacts, the dataset file is prefixed with "dataset_"
if "_evaluation_" in artifact_name and any(char.isdigit() for char in artifact_name):
# This is an evaluation artifact, look for dataset_<filename>
dataset_path = os.path.join(artifact_dir, f"dataset_{dataset_filename}")
Contributor


We can remove the dataset prefix so we won't need this logic

else:
return pd.read_csv(dataset_path)

def _create_comprehensive_artifact(
Contributor


A better name for this would be composite artifact

from datetime import datetime

# Create comprehensive artifact with timestamp
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
Contributor


The timestamp is unnecessary; it's already in the artifact's metadata, and by baking it into the name we would lose the advantage of wandb's built-in versioning.


comprehensive_artifact = wandb.Artifact(
name=comprehensive_artifact_name,
type="evaluation_results",
Contributor


For the type, I think it's better to use dataset, since we are providing the dataset along with the metrics etc. From evaluation_results I would expect only the accuracy, MRR, etc.

@@ -140,8 +140,13 @@ def _save_to_wandb(self, df: pd.DataFrame, dataset_artifact: str, dataset_filena
},
)

# Write dataframe to the temporary file
df.to_csv(dataset_filename, index=False)
# Save dataframe in the appropriate format based on file extension
Contributor


No need to handle CSV files anymore.

@@ -0,0 +1,181 @@
# Function Search Evaluation System

Contributor


If we have a README here, we should move over the content we already had in the backend README.

python evaluation_pipeline.py --mode generate-and-evaluate --prompt-type prompt_hard --dataset hard_intent_dataset
```

#### Dataset Filename Parameter (`--dataset-filename`)
Contributor


I think we should get rid of this section; the dataset filename can be anything the user wants and isn't very important.

- **Backward compatibility**: Existing datasets without custom filenames will continue to work
- **Case sensitivity**: Filenames are case-sensitive

### Programmatic Usage
Contributor


I think just by looking at the intent_prompts file it should be clear what it does; do you think this section is needed?

python example_usage.py
```

## Environment Variables
Contributor


This should be in the setup section, similar to how we have it in the backend README.

@irmadong irmadong temporarily deployed to CICD_FOR_FORKED_REPO July 11, 2025 03:57 — with GitHub Actions Inactive
@irmadong irmadong had a problem deploying to CICD_FOR_FORKED_REPO July 11, 2025 03:57 — with GitHub Actions Failure
@irmadong irmadong requested a review from alex-aipolabs July 13, 2025 03:01

recurseml bot commented Jul 14, 2025

✨ No issues found! Your code is sparkling clean! ✨

Need help? Join our Discord for support!
https://discord.gg/qEjHQk64Z9

@irmadong irmadong had a problem deploying to CICD_FOR_FORKED_REPO July 15, 2025 15:11 — with GitHub Actions Failure
@irmadong irmadong had a problem deploying to CICD_FOR_FORKED_REPO July 15, 2025 15:11 — with GitHub Actions Failure
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (1)
backend/evals/evaluation_pipeline.py (1)

332-353: Address encapsulation violation and duplicate import.

This implementation has two issues that were flagged in previous reviews:

  1. Encapsulation violation: The code directly calls private methods _fetch_app_function_data and _generate_intent, bypassing the generator's public interface.

  2. Duplicate import: PROMPTS is imported twice - once at the top of the file (line 8) and again at line 345.

Apply this diff to fix the duplicate import:

-                    from evals.intent_prompts import PROMPTS

For the encapsulation issue, consider implementing the previously suggested solution of adding a generate_without_saving method to the SyntheticIntentGenerator class, then replace this inline logic with a call to that method.

🧹 Nitpick comments (1)
backend/evals/evaluation_pipeline.py (1)

416-425: Consider simplifying the dataset naming logic.

The current logic works but could be more readable. Consider consolidating the conditions:

-    # Automatically append prompt type to dataset name for better organization
-    # BUT don't append if it's already an evaluation artifact
-    if "_evaluation" in dataset:
-        # This is already an evaluation artifact, use as-is
-        dataset_with_prompt = dataset
-    elif dataset == DEFAULT_DATASET_ARTIFACT:
-        dataset_with_prompt = f"{dataset}_{prompt_type}"
-    else:
-        # If user provided custom dataset name, still append prompt type
-        dataset_with_prompt = f"{dataset}_{prompt_type}"
+    # Automatically append prompt type to dataset name for better organization
+    # Skip if it's already an evaluation artifact
+    if "_evaluation" in dataset:
+        dataset_with_prompt = dataset
+    else:
+        dataset_with_prompt = f"{dataset}_{prompt_type}"

This eliminates the redundant logic in the elif and else branches.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2157346 and 6866fb6.

📒 Files selected for processing (2)
  • backend/.pre-commit-config.yaml (1 hunks)
  • backend/evals/evaluation_pipeline.py (9 hunks)
✅ Files skipped from review due to trivial changes (1)
  • backend/.pre-commit-config.yaml
🧰 Additional context used
🧬 Code Graph Analysis (1)
backend/evals/evaluation_pipeline.py (2)
backend/evals/search_evaluator.py (1)
  • SearchEvaluator (20-264)
backend/evals/synthetic_intent_generator.py (3)
  • SyntheticIntentGenerator (17-192)
  • _fetch_app_function_data (52-72)
  • _generate_intent (80-97)
🔇 Additional comments (9)
backend/evals/evaluation_pipeline.py (9)

7-8: LGTM: Import additions are appropriate.

The addition of wandb and PROMPTS imports aligns with the new functionality for composite artifacts and prompt type support.


15-16: LGTM: Suppressing verbose logging improves user experience.

Suppressing httpx verbose logging is a good practice to reduce noise in the output, consistent with similar changes in the search evaluator.


19-19: LGTM: JSON format change is consistent with overall architecture.

Changing the default dataset filename from CSV to JSON aligns with the broader move to JSON format throughout the codebase.


40-40: LGTM: Default prompt type change aligns with new prompt structure.

Changing the default from "task" to "prompt_easy" is consistent with the new prompt functions introduced in the codebase.


80-82: LGTM: JSON dataset loading is consistent with format change.

The change from pd.read_csv to pd.read_json with orient="records" is consistent with the switch to JSON format for datasets.


84-166: LGTM: Well-implemented composite artifact creation with proper cleanup.

The _create_composite_artifact method is well-structured with:

  • Proper temporary file handling
  • Explicit file closing to ensure data is written
  • Comprehensive cleanup in the finally block
  • Clear logging of results

The method effectively bundles dataset, metrics, and incorrect results into a single artifact.
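In outline, that structure looks like the sketch below (assumed file names and helper signature; the actual _create_composite_artifact differs in detail):

```python
import json
import os
import shutil
import tempfile

import pandas as pd
import wandb

def log_composite_artifact(run, name: str, df: pd.DataFrame,
                           metrics: dict, incorrect_results: list) -> None:
    """Bundle the dataset, metrics, and incorrect results into one wandb artifact."""
    tmpdir = tempfile.mkdtemp()
    try:
        dataset_path = os.path.join(tmpdir, "dataset.json")
        metrics_path = os.path.join(tmpdir, "metrics.json")
        errors_path = os.path.join(tmpdir, "incorrect_results.json")

        df.to_json(dataset_path, orient="records")
        with open(metrics_path, "w") as f:
            json.dump(metrics, f, indent=2)
        with open(errors_path, "w") as f:
            json.dump(incorrect_results, f, indent=2)

        artifact = wandb.Artifact(name=name, type="evaluation_results")
        artifact.add_file(dataset_path)
        artifact.add_file(metrics_path)
        artifact.add_file(errors_path)
        run.log_artifact(artifact)
    finally:
        # Cleanup always runs, matching the finally-block behaviour noted above.
        shutil.rmtree(tmpdir, ignore_errors=True)
```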


225-262: LGTM: Enhanced evaluation with detailed metrics and composite artifacts.

The evaluation method improvements include:

  • Extracting incorrect results for separate analysis
  • Creating detailed metrics with evaluation configuration
  • Using the new composite artifact method
  • Logging only summary metrics to wandb to avoid clutter

This provides better tracking and analysis capabilities while maintaining clean wandb logs.


375-380: LGTM: Well-designed CLI option for prompt type selection.

The new --prompt-type option provides good user experience with:

  • Clear choices from the PROMPTS dictionary
  • Sensible default value
  • Helpful description and default display

This allows users to easily select different prompt complexity levels.
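A sketch of such an option, assuming a click-based CLI (the framework, default dataset name, and command body are assumptions, not confirmed by the diff):

```python
import click

from evals.intent_prompts import PROMPTS

@click.command()
@click.option(
    "--prompt-type",
    type=click.Choice(list(PROMPTS.keys())),
    default="prompt_easy",
    show_default=True,
    help="Prompt complexity used when generating synthetic user intents.",
)
@click.option("--dataset", default="intent_dataset", show_default=True,
              help="Base name of the dataset artifact (illustrative default).")
def main(prompt_type: str, dataset: str) -> None:
    click.echo(f"Evaluating dataset '{dataset}' with prompt type '{prompt_type}'")

if __name__ == "__main__":
    main()
```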


382-386: LGTM: Simplified option naming improves CLI usability.

Renaming --dataset-artifact to --dataset makes the CLI more concise and user-friendly while maintaining the same functionality.

@irmadong irmadong had a problem deploying to CICD_FOR_FORKED_REPO July 15, 2025 15:36 — with GitHub Actions Failure
@irmadong irmadong had a problem deploying to CICD_FOR_FORKED_REPO July 15, 2025 15:36 — with GitHub Actions Failure
@irmadong irmadong had a problem deploying to CICD_FOR_FORKED_REPO July 15, 2025 15:37 — with GitHub Actions Failure
@irmadong irmadong had a problem deploying to CICD_FOR_FORKED_REPO July 15, 2025 15:37 — with GitHub Actions Failure
@irmadong irmadong requested a review from alex-aipolabs July 15, 2025 15:38
@dev-aipolabs dev-aipolabs requested review from dev-aipolabs and removed request for alex-aipolabs July 23, 2025 10:16
@dev-aipolabs dev-aipolabs requested a review from zizixcm July 25, 2025 21:49