@parssky parssky commented May 14, 2025

This PR introduces recursive Markdown parsing to AutoRAG without breaking existing behaviour. When the parser receives a glob pattern such as **/*.md, it discovers *.md files at any depth, processes them once, and guarantees that original absolute paths are preserved in all downstream artifacts.

Features

autorag/parse.py:
• Added recursive arg to Parser.start_parsing (default False).
• Converts project_dir to Path to avoid .rglob() error.
• When recursive=True:
    1. Collects all matching Markdown files via glob(..., recursive=True).
    2. Copies each file with shutil.copy2 into a flat temp dir with path‑encoded filenames.
    3. Runs run_parser once.
    4. Rewrites temp paths in the Parquet/JSONL outputs back to the originals.
• Cleans up temp dir on exit.
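
For reviewers who want to picture the collect-and-copy step, here is a minimal sketch. It is not the code in this PR: the helper name flatten_markdown and the "__" path-encoding scheme are assumptions made for illustration, and the actual encoding in autorag/parse.py may differ.

```python
import glob
import shutil
from pathlib import Path


def flatten_markdown(data_path_glob: str, temp_dir: str) -> dict[str, str]:
    """Copy every file matched by the glob into temp_dir under a flat name.

    Returns a mapping from each flat temp path to the original absolute path.
    (Hypothetical helper; name and encoding are assumptions.)
    """
    mapping: dict[str, str] = {}
    for match in sorted(glob.glob(data_path_glob, recursive=True)):
        src = Path(match).resolve()
        if not src.is_file():
            continue  # skip directories that happen to match the pattern
        # Assumed encoding: drop the filesystem root and join the remaining
        # path parts with "__" so files from different folders stay distinct.
        flat_name = "__".join(src.parts[1:])
        dst = Path(temp_dir) / flat_name
        shutil.copy2(src, dst)  # copy2 also preserves the modification time
        mapping[str(dst)] = str(src)
    return mapping
```

The returned mapping is what a later step would need to translate temp paths back to the real ones. Note that this simple encoding can collide if directory names themselves contain "__", and very deep trees can exceed filename length limits; a hash-based name would avoid both.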

Tests:
New tests/autorag/test_parser_recursive.py adds coverage for deep folders; all tests now pass.

Lints
Fixed F823 in api/tests/test_app.py; repo‑wide trailing‑whitespace / EOF cleanup by pre‑commit.

Usage

parser = Parser(data_path_glob="./raw_data/**/*.md", project_dir="./parse")
parser.start_parsing("./parse.yaml", recursive=True)

Temp‑dir strategy
AutoRAG’s run_parser still expects a flat *.md glob. Copying files into a temp folder preserves modification time (via shutil.copy2) and avoids macOS multiprocessing issues with symlinks. After parsing, a two‑line _rewrite_output_paths() post‑process restores the real paths.
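
To illustrate the path-restoring idea for the JSONL output, a rough sketch follows. It is not the PR's _rewrite_output_paths(): the function name rewrite_jsonl_paths, the "path" field name, and the mapping argument (temp path to original path) are all assumptions, and the Parquet case is not covered here.

```python
import json
from pathlib import Path


def rewrite_jsonl_paths(jsonl_path: str, mapping: dict[str, str], key: str = "path") -> None:
    """Replace temp paths stored under `key` with their original paths, in place.

    Hypothetical helper: `key` and `mapping` are illustrative assumptions.
    """
    target = Path(jsonl_path)
    rewritten = []
    for line in target.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue  # ignore blank lines
        record = json.loads(line)
        if key in record:
            # Paths that are not in the mapping are left unchanged.
            record[key] = mapping.get(record[key], record[key])
        rewritten.append(json.dumps(record, ensure_ascii=False))
    target.write_text("\n".join(rewritten) + "\n", encoding="utf-8")
```

Leaving unknown paths untouched keeps the step safe to run on outputs that were produced without the temp-dir copy.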

Backward Compatibility
Default remains non‑recursive; existing integrations are unchanged.
No LangChain modules were edited.

parssky commented May 17, 2025

Hi, several tests are failing in CI, but as far as I can tell, they’re not related to my changes. My own tests are passing, and the errors seem to stem from unrelated modules (e.g., llama_cloud_services, recency_filter, time_reranker, etc.). Could you please confirm if these are unrelated to this PR?
