Update Evaluate pipeline with easy, med, hard prompts and restructure artifact #474
base: main
Conversation
ready for pull request
Walkthrough
The changes introduce support for JSON-formatted datasets, composite artifact creation with detailed metrics and error tracking, expanded prompt complexity options, and enhanced evaluation result management. The CLI and documentation are updated for clarity and new features, while verbose logging is suppressed and error handling is improved throughout the evaluation pipeline.
Changes
Sequence Diagram(s)
sequenceDiagram
    participant User
    participant CLI
    participant EvaluationPipeline
    participant SyntheticIntentGenerator
    participant SearchEvaluator
    participant W&B
    User->>CLI: Run evaluation/generation command
    CLI->>EvaluationPipeline: Initialize with prompt_type, dataset, etc.
    alt Generate-and-evaluate mode
        EvaluationPipeline->>SyntheticIntentGenerator: Generate synthetic intents (prompt_type)
        SyntheticIntentGenerator-->>EvaluationPipeline: Return DataFrame with prompts
        EvaluationPipeline->>SearchEvaluator: Evaluate dataset
        SearchEvaluator-->>EvaluationPipeline: Return metrics, incorrect results
        EvaluationPipeline->>W&B: Log composite artifact (dataset, metrics, errors)
    else Generate-only mode
        EvaluationPipeline->>SyntheticIntentGenerator: Generate and save dataset as JSON
        SyntheticIntentGenerator->>W&B: Save dataset artifact
    else Evaluate-only mode
        EvaluationPipeline->>W&B: Download dataset artifact (JSON)
        EvaluationPipeline->>SearchEvaluator: Evaluate dataset
        SearchEvaluator-->>EvaluationPipeline: Return metrics, incorrect results
        EvaluationPipeline->>W&B: Log composite artifact (metrics, errors)
    end
    EvaluationPipeline-->>CLI: Print summary metrics
Possibly related PRs
Suggested reviewers
Poem
Actionable comments posted: 2
Nitpick comments (4)
backend/evals/README.md (3)
62-62: Fix markdown heading style. Remove the trailing colon from the heading to comply with markdown best practices.
-##### Examples:
+##### Examples
Tools: markdownlint-cli2 (0.17.2): 62-62: Trailing punctuation in heading. Punctuation: ':' (MD026, no-trailing-punctuation)
91-91: Fix markdown heading style. Remove the trailing colon from the heading to comply with markdown best practices.
-##### Use Cases:
+##### Use Cases
Tools: markdownlint-cli2 (0.17.2): 91-91: Trailing punctuation in heading. Punctuation: ':' (MD026, no-trailing-punctuation)
118-118: Fix markdown heading style. Remove the trailing colon from the heading to comply with markdown best practices.
-##### Important Notes:
+##### Important Notes
Tools: markdownlint-cli2 (0.17.2): 118-118: Trailing punctuation in heading. Punctuation: ':' (MD026, no-trailing-punctuation)
backend/evals/evaluation_pipeline.py (1)
92-96: Simplify control flow by removing the unnecessary else block. The else block is unnecessary after the return statement; removing it improves readability.
-    if dataset_filename.endswith(".json"):
-        return pd.read_json(dataset_path, orient="records")
-    else:
-        return pd.read_csv(dataset_path)
+    if dataset_filename.endswith(".json"):
+        return pd.read_json(dataset_path, orient="records")
+    return pd.read_csv(dataset_path)
Tools: Pylint (3.3.7): [refactor] 92-95: Unnecessary "else" after "return", remove the "else" and de-indent the code inside it (R1705)
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Files selected for processing (6)
- .gitignore (1 hunks)
- backend/evals/README.md (1 hunks)
- backend/evals/evaluation_pipeline.py (9 hunks)
- backend/evals/intent_prompts.py (1 hunks)
- backend/evals/search_evaluator.py (3 hunks)
- backend/evals/synthetic_intent_generator.py (3 hunks)
Additional context used
Code Graph Analysis (1)
backend/evals/evaluation_pipeline.py (2)
- backend/evals/search_evaluator.py (1): SearchEvaluator (20-184)
- backend/evals/synthetic_intent_generator.py (3): SyntheticIntentGenerator (17-196), _fetch_app_function_data (52-72), _generate_intent (80-97)
markdownlint-cli2 (0.17.2)
backend/evals/README.md
62-62: Trailing punctuation in heading
Punctuation: ':'
(MD026, no-trailing-punctuation)
91-91: Trailing punctuation in heading
Punctuation: ':'
(MD026, no-trailing-punctuation)
118-118: Trailing punctuation in heading
Punctuation: ':'
(MD026, no-trailing-punctuation)
Pylint (3.3.7)
backend/evals/evaluation_pipeline.py
[refactor] 92-95: Unnecessary "else" after "return", remove the "else" and de-indent the code inside it
(R1705)
[refactor] 97-97: Too many arguments (6/5)
(R0913)
[refactor] 97-97: Too many positional arguments (6/5)
(R0917)
[refactor] 97-97: Too many local variables (19/15)
(R0914)
[refactor] 138-138: Consider using 'with' for resource-allocating operations
(R1732)
[refactor] 148-148: Consider using 'with' for resource-allocating operations
(R1732)
[refactor] 158-158: Consider using 'with' for resource-allocating operations
(R1732)
[refactor] 415-415: Too many arguments (6/5)
(R0913)
[refactor] 415-415: Too many positional arguments (6/5)
(R0917)
Additional comments (12)
.gitignore (1)
167-168: LGTM! Appropriate ignore patterns added. The addition of the tmp/ and wandb/ entries is appropriate for ignoring temporary files and Weights & Biases artifacts generated by the evaluation system.
backend/evals/synthetic_intent_generator.py (3)
8-8: LGTM! Import reordering is clean. The wandb import repositioning doesn't affect functionality and maintains good import organization.
143-149: LGTM! Good file format flexibility with backward compatibility. The enhancement to support both JSON and CSV formats based on file extension is well implemented. Defaulting to CSV ensures backward compatibility while adding flexibility for different use cases.
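The extension-based branching amounts to something like this minimal sketch (not the repository's exact code; `df` and `dataset_filename` are assumed names):

```python
import pandas as pd

def save_dataset(df: pd.DataFrame, dataset_filename: str) -> None:
    """Persist the dataset as JSON, or fall back to CSV, based on the file extension."""
    if dataset_filename.endswith(".json"):
        # orient="records" writes one JSON object per row, matching how the pipeline reads it back
        df.to_json(dataset_filename, orient="records", indent=2)
    else:
        df.to_csv(dataset_filename, index=False)
```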
186-189: LGTM! Dynamic prompt column naming aligns with the new prompt types. The change to use self.prompt_type as the column name is consistent with the new multi-prompt system and maintains proper data organization.
backend/evals/search_evaluator.py (3)
16-17: LGTM! Appropriate logging level adjustment. Suppressing verbose httpx logging to the WARNING level reduces noise while maintaining important error visibility.
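For reference, this kind of suppression is typically a single line placed at import time, along these lines:

```python
import logging

# Drop httpx's per-request INFO logs; warnings and errors still come through.
logging.getLogger("httpx").setLevel(logging.WARNING)
```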
55-57: LGTM! Reasonable timeout increase for search operations. The 120-second timeout is appropriate for potentially complex search operations and provides better reliability than default timeouts.
148-182: LGTM! Improved evaluation loop with better progress tracking. The refactored evaluation loop provides better control over progress reporting and maintains the same evaluation logic. The use of the walrus operator in the results construction is correct, though complex: it efficiently tracks when the expected function is found while building the results list.
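As a standalone illustration of that pattern (not the evaluator's actual code; the function names and result fields here are made up):

```python
# Track whether the expected function has appeared while building the ranked results list.
expected_function = "github__create_issue"  # hypothetical expected match
search_results = ["github__list_issues", "github__create_issue", "slack__send_message"]

found = False
results = [
    {
        "rank": rank,
        "function": name,
        # The assignment expression updates `found` as the comprehension runs.
        "is_expected": (found := found or name == expected_function),
    }
    for rank, name in enumerate(search_results, start=1)
]

print(found)       # True
print(results[1])  # {'rank': 2, 'function': 'github__create_issue', 'is_expected': True}
```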
backend/evals/README.md (1)
1-182: Excellent, comprehensive documentation! The README provides thorough coverage of the evaluation system, including prompt types, usage examples, and configuration details. The documentation will greatly help users understand and utilize the multi-difficulty evaluation framework.
Tools: markdownlint-cli2 (0.17.2)
62-62: Trailing punctuation in heading. Punctuation: ':' (MD026, no-trailing-punctuation)
91-91: Trailing punctuation in heading. Punctuation: ':' (MD026, no-trailing-punctuation)
118-118: Trailing punctuation in heading. Punctuation: ':' (MD026, no-trailing-punctuation)
backend/evals/intent_prompts.py (4)
4-21: LGTM! Well-designed easy prompt template. The easy prompt provides clear, direct instructions that will generate straightforward user intents. The constraints are appropriate for the difficulty level.
24-41: LGTM! Appropriate medium-difficulty prompt. The medium prompt strikes a good balance between conversational naturalness and clear intent expression. The contextual requirements will generate more realistic user scenarios.
44-64: LGTM! Sophisticated hard prompt design. The hard prompt effectively creates complex, enterprise-level scenarios that will thoroughly test the search system's ability to understand indirect and contextual intents. The constraints appropriately increase the challenge.
68-72: LGTM! Clean prompt dictionary interface. The PROMPTS dictionary provides a clean, extensible interface for accessing the different prompt types throughout the system.
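For orientation, the interface presumably looks something like the sketch below. Only prompt_easy and prompt_hard appear by name in this PR, so the medium key, the builder bodies, and the row column names are assumptions:

```python
import pandas as pd

def prompt_easy(row: pd.Series) -> str:
    # Hypothetical column names; the real templates live in intent_prompts.py.
    return f"Write a short, direct request that uses {row['function_name']} in {row['app_name']}."

def prompt_medium(row: pd.Series) -> str:
    return f"Write a conversational request, with some context, that implies {row['function_name']} in {row['app_name']}."

def prompt_hard(row: pd.Series) -> str:
    return f"Describe an enterprise scenario that only indirectly calls for {row['function_name']} in {row['app_name']}."

PROMPTS = {
    "prompt_easy": prompt_easy,
    "prompt_medium": prompt_medium,
    "prompt_hard": prompt_hard,
}

# The pipeline applies the selected builder row by row:
# df[prompt_type] = df.apply(PROMPTS[prompt_type], axis=1)
```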
Reviewed hunk (backend/evals/evaluation_pipeline.py):

df = self.generator._fetch_app_function_data()

if df.empty:
    raise ValueError(
        "No app and function data found in the database. Please seed the database."
    )

if generation_limit:
    df = df[:generation_limit]

# Generate intents
from tqdm import tqdm

from evals.intent_prompts import PROMPTS

df[self.prompt_type] = df.apply(PROMPTS[self.prompt_type], axis=1)
df["synthetic_output"] = [
    self.generator._generate_intent(prompt)
    for prompt in tqdm(df[self.prompt_type])
]

logger.info(f"Generated {len(df)} synthetic intents")
Refactor suggestion
Refactor inline generation logic to maintain proper encapsulation.
The current implementation bypasses the generator's public interface by directly accessing private methods (_fetch_app_function_data and _generate_intent). This violates encapsulation principles and duplicates logic that should be centralized in the SyntheticIntentGenerator class.
Consider adding a method to the generator class that returns the dataset without saving it as an artifact, then use that method here:
# In SyntheticIntentGenerator, add:
def generate_without_saving(self, limit: int | None = None) -> pd.DataFrame:
    """Generate synthetic intents without saving to wandb."""
    df = self._fetch_app_function_data()
    if df.empty:
        raise ValueError(
            "No app and function data found in the database. Please seed the database."
        )
    if limit:
        df = df[:limit]
    # Generate intents
    from tqdm import tqdm
    df[self.prompt_type] = df.apply(PROMPTS[self.prompt_type], axis=1)
    df["synthetic_output"] = [
        self._generate_intent(prompt)
        for prompt in tqdm(df[self.prompt_type])
    ]
    return df

# Then in this method:
df = self.generator.generate_without_saving(limit=generation_limit)
logger.info(f"Generated {len(df)} synthetic intents")
Prompt for AI Agents
In backend/evals/evaluation_pipeline.py around lines 351 to 373, the code
directly calls private methods of the generator (_fetch_app_function_data and
_generate_intent), breaking encapsulation. To fix this, add a new public method
generate_without_saving(limit) to the SyntheticIntentGenerator class that
encapsulates the data fetching, validation, limiting, and intent generation
logic, returning the resulting DataFrame. Then replace the current inline logic
with a single call to this new method, passing generation_limit as the limit
argument, and log the result count as before.
Reviewed hunk (backend/evals/evaluation_pipeline.py):

from evals.intent_prompts import PROMPTS
Remove duplicate import statement.
PROMPTS is already imported at line 10, so this duplicate import is unnecessary.
- from tqdm import tqdm
-
- from evals.intent_prompts import PROMPTS
+ from tqdm import tqdm
Committable suggestion
IMPORTANT: Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test and benchmark the code to ensure it meets the requirements.
from tqdm import tqdm
Prompt for AI Agents
In backend/evals/evaluation_pipeline.py around lines 363 to 364, there is a
duplicate import of PROMPTS from evals.intent_prompts which is already imported
at line 10. Remove the import statement at line 363 to eliminate redundancy.
backend/evals/evaluation_pipeline.py (Outdated)
@@ -73,7 +79,108 @@ def _load_dataset_from_wandb(self, artifact_name: str, dataset_filename: str) ->
        """
        artifact = wandb.use_artifact(f"{artifact_name}:latest")
        artifact_dir = artifact.download()
        return pd.read_csv(os.path.join(artifact_dir, dataset_filename))

        # Support both JSON and CSV formats for backward compatibility
We can drop support for CSV, delete the old artifacts, and only keep JSON.
backend/evals/evaluation_pipeline.py (Outdated)
        # For evaluation artifacts, the dataset file is prefixed with "dataset_"
        if "_evaluation_" in artifact_name and any(char.isdigit() for char in artifact_name):
            # This is an evaluation artifact, look for dataset_<filename>
            dataset_path = os.path.join(artifact_dir, f"dataset_{dataset_filename}")
We can remove the `dataset_` prefix so we won't need this logic.
backend/evals/evaluation_pipeline.py (Outdated)
        else:
            return pd.read_csv(dataset_path)

    def _create_comprehensive_artifact(
A better name for this would be composite artifact
backend/evals/evaluation_pipeline.py (Outdated)
        from datetime import datetime

        # Create comprehensive artifact with timestamp
        timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
The timestamp is unnecessary; it's already in the artifact's metadata, and by baking a timestamp into the name we wouldn't get the advantage of wandb's versioning.
backend/evals/evaluation_pipeline.py (Outdated)

        comprehensive_artifact = wandb.Artifact(
            name=comprehensive_artifact_name,
            type="evaluation_results",
For the type, I think it's better to use dataset, since we are providing the dataset along with the metrics etc. From evaluation_results I would expect only the accuracy, MRR, etc.
@@ -140,8 +140,13 @@ def _save_to_wandb(self, df: pd.DataFrame, dataset_artifact: str, dataset_filena
            },
        )

        # Write dataframe to the temporary file
        df.to_csv(dataset_filename, index=False)
        # Save dataframe in the appropriate format based on file extension
No need to handle csv files anymore
backend/evals/README.md (Outdated)
@@ -0,0 +1,181 @@
# Function Search Evaluation System
If we have a README here, we should move over the content we already had in the backend README.
backend/evals/README.md (Outdated)
python evaluation_pipeline.py --mode generate-and-evaluate --prompt-type prompt_hard --dataset hard_intent_dataset
```

#### Dataset Filename Parameter (`--dataset-filename`)
I think we should get rid of this section; the dataset-filename can be anything the user wants and is not very important.
backend/evals/README.md (Outdated)
- **Backward compatibility**: Existing datasets without custom filenames will continue to work
- **Case sensitivity**: Filenames are case-sensitive

### Programmatic Usage
I think that just by looking at the intent_prompts file it should be clear what it does; do you think this section is needed?
backend/evals/README.md (Outdated)
python example_usage.py
```

## Environment Variables
This should be in the setup section, similar to how we have it in the backend README.
No issues found! Your code is sparkling clean! Need help? Join our Discord for support!
Actionable comments posted: 0
Duplicate comments (1)
backend/evals/evaluation_pipeline.py (1)
332-353: Address the encapsulation violation and the duplicate import. This implementation has two issues that were flagged in previous reviews:
1. Encapsulation violation: the code directly calls the private methods _fetch_app_function_data and _generate_intent, bypassing the generator's public interface.
2. Duplicate import: PROMPTS is imported twice, once at the top of the file (line 8) and again at line 345.
Apply this diff to fix the duplicate import:
- from evals.intent_prompts import PROMPTS
For the encapsulation issue, consider implementing the previously suggested solution of adding a generate_without_saving method to the SyntheticIntentGenerator class, then replace this inline logic with a call to that method.
Nitpick comments (1)
backend/evals/evaluation_pipeline.py (1)
416-425: Consider simplifying the dataset naming logic. The current logic works but could be more readable. Consider consolidating the conditions:
-    # Automatically append prompt type to dataset name for better organization
-    # BUT don't append if it's already an evaluation artifact
-    if "_evaluation" in dataset:
-        # This is already an evaluation artifact, use as-is
-        dataset_with_prompt = dataset
-    elif dataset == DEFAULT_DATASET_ARTIFACT:
-        dataset_with_prompt = f"{dataset}_{prompt_type}"
-    else:
-        # If user provided custom dataset name, still append prompt type
-        dataset_with_prompt = f"{dataset}_{prompt_type}"
+    # Automatically append prompt type to dataset name for better organization
+    # Skip if it's already an evaluation artifact
+    if "_evaluation" in dataset:
+        dataset_with_prompt = dataset
+    else:
+        dataset_with_prompt = f"{dataset}_{prompt_type}"
This eliminates the redundant logic in the elif and else branches.
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Files selected for processing (2)
- backend/.pre-commit-config.yaml (1 hunks)
- backend/evals/evaluation_pipeline.py (9 hunks)
Files skipped from review due to trivial changes (1)
- backend/.pre-commit-config.yaml
Additional context used
Code Graph Analysis (1)
backend/evals/evaluation_pipeline.py (2)
- backend/evals/search_evaluator.py (1): SearchEvaluator (20-264)
- backend/evals/synthetic_intent_generator.py (3): SyntheticIntentGenerator (17-192), _fetch_app_function_data (52-72), _generate_intent (80-97)
Additional comments (9)
backend/evals/evaluation_pipeline.py (9)
7-8: LGTM: Import additions are appropriate. The addition of the wandb and PROMPTS imports aligns with the new functionality for composite artifacts and prompt type support.
15-16: LGTM: Suppressing verbose logging improves the user experience. Suppressing verbose httpx logging is a good practice to reduce noise in the output, consistent with similar changes in the search evaluator.
19-19: LGTM: JSON format change is consistent with the overall architecture. Changing the default dataset filename from CSV to JSON aligns with the broader move to JSON format throughout the codebase.
40-40: LGTM: Default prompt type change aligns with the new prompt structure. Changing the default from "task" to "prompt_easy" is consistent with the new prompt functions introduced in the codebase.
80-82: LGTM: JSON dataset loading is consistent with the format change. The change from pd.read_csv to pd.read_json with orient="records" is consistent with the switch to JSON format for datasets.
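A minimal sketch of this load path, assuming an active wandb run (the helper name is illustrative):

```python
import os

import pandas as pd
import wandb

def load_dataset_from_wandb(artifact_name: str, dataset_filename: str) -> pd.DataFrame:
    """Download the latest dataset artifact and read its JSON records file."""
    artifact = wandb.use_artifact(f"{artifact_name}:latest")  # requires wandb.init() beforehand
    artifact_dir = artifact.download()
    return pd.read_json(os.path.join(artifact_dir, dataset_filename), orient="records")
```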
84-166: LGTM: Well-implemented composite artifact creation with proper cleanup. The _create_composite_artifact method is well structured, with:
- Proper temporary file handling
- Explicit file closing to ensure data is written
- Comprehensive cleanup in the finally block
- Clear logging of results
The method effectively bundles dataset, metrics, and incorrect results into a single artifact.
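A rough, self-contained sketch of such a composite artifact using the public wandb API; the file names and metrics payload are illustrative, and type="dataset" follows the reviewer's earlier suggestion in this thread:

```python
import json
import os
import shutil
import tempfile

import pandas as pd
import wandb

def log_composite_artifact(name: str, dataset: pd.DataFrame,
                           metrics: dict, incorrect: pd.DataFrame) -> None:
    """Bundle the dataset, summary metrics, and incorrect results into one artifact."""
    tmpdir = tempfile.mkdtemp()
    try:
        dataset_path = os.path.join(tmpdir, "dataset.json")
        metrics_path = os.path.join(tmpdir, "metrics.json")
        errors_path = os.path.join(tmpdir, "incorrect_results.json")

        dataset.to_json(dataset_path, orient="records", indent=2)
        incorrect.to_json(errors_path, orient="records", indent=2)
        with open(metrics_path, "w") as f:
            json.dump(metrics, f, indent=2)

        artifact = wandb.Artifact(name=name, type="dataset")
        for path in (dataset_path, metrics_path, errors_path):
            artifact.add_file(path)
        wandb.log_artifact(artifact)  # requires an active wandb run
    finally:
        # Remove the temporary files once the artifact has been logged.
        shutil.rmtree(tmpdir, ignore_errors=True)
```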
225-262: LGTM: Enhanced evaluation with detailed metrics and composite artifacts. The evaluation method improvements include:
- Extracting incorrect results for separate analysis
- Creating detailed metrics with evaluation configuration
- Using the new composite artifact method
- Logging only summary metrics to wandb to avoid clutter
This provides better tracking and analysis capabilities while maintaining clean wandb logs.
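As a standalone illustration of the summary-only logging idea (the is_correct column and metric names here are assumptions, not the evaluator's actual schema):

```python
import pandas as pd
import wandb

def log_summary(results: pd.DataFrame) -> dict:
    """Split out the incorrect rows and send only compact summary metrics to wandb."""
    incorrect = results[~results["is_correct"]]
    summary = {
        "accuracy": float(results["is_correct"].mean()),
        "num_evaluated": int(len(results)),
        "num_incorrect": int(len(incorrect)),
    }
    wandb.log(summary)  # the full per-row results stay in the composite artifact instead
    return summary
```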
375-380: LGTM: Well-designed CLI option for prompt type selection. The new --prompt-type option provides a good user experience, with:
- Clear choices from the PROMPTS dictionary
- Sensible default value
- Helpful description and default display
This allows users to easily select different prompt complexity levels.
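For reference, an option like this can be declared in a few lines; this sketch assumes a click-based CLI (not confirmed by the PR) and makes up the default dataset name:

```python
import click

# Stand-in for the real PROMPTS mapping in intent_prompts.py.
PROMPTS = {"prompt_easy": None, "prompt_medium": None, "prompt_hard": None}

@click.command()
@click.option(
    "--prompt-type",
    type=click.Choice(list(PROMPTS.keys())),
    default="prompt_easy",
    show_default=True,
    help="Complexity of the synthetic intents to generate.",
)
@click.option("--dataset", default="intent_dataset", show_default=True,
              help="Base name of the dataset artifact.")
def main(prompt_type: str, dataset: str) -> None:
    click.echo(f"prompt_type={prompt_type}, dataset={dataset}")

if __name__ == "__main__":
    main()
```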
382-386: LGTM: Simplified option naming improves CLI usability. Renaming --dataset-artifact to --dataset makes the CLI more concise and user-friendly while maintaining the same functionality.
Ticket
#473
Description
Implements the solution proposed in the issue.
Demo (if applicable)
Screenshots (if applicable)
Checklist
Summary by CodeRabbit
New Features
Improvements
Documentation
Chores
Updated .gitignore to exclude wandb-related files.