Recursive Markdown Parsing & Path Preservation #1114
This PR introduces recursive Markdown parsing to AutoRAG without breaking existing behaviour. When the parser receives a glob pattern such as **/*.md, it discovers *.md files at any depth, processes them once, and guarantees that original absolute paths are preserved in all downstream artifacts.
Features
autorag/parse.py:
• Added recursive arg to Parser.start_parsing (default False).
• Converts project_dir to Path to avoid .rglob() error.
• When recursive=True:
  1. Collects all matching Markdown files via glob(..., recursive=True).
  2. Copies each file with shutil.copy2 into a flat temp dir with path‑encoded filenames.
  3. Runs run_parser once.
  4. Rewrites temp paths in Parquet/JSONL outputs back to the originals.
• Cleans up temp dir on exit.
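As an illustration of the collect-and-copy step, here is a minimal sketch using only the standard library. It is not the code from this PR: the function name `flatten_markdown`, the `__` separator, and the numeric prefix are assumptions chosen to show one way a path-encoded filename could work.

```python
import glob
import shutil
from pathlib import Path


def flatten_markdown(pattern: str, dest: Path) -> dict[str, str]:
    """Copy every file matching `pattern` into `dest` under a flat name.

    Returns a mapping from each temp path to the original absolute path.
    (Hypothetical helper; the name and encoding are not from the PR.)
    """
    mapping: dict[str, str] = {}
    matches = sorted(glob.glob(pattern, recursive=True))
    for index, match in enumerate(matches):
        src = Path(match).resolve()
        if not src.is_file():
            continue
        # Assumed encoding: join the path parts (minus the root anchor) with
        # "__". The index prefix keeps names unique even when a directory
        # name itself contains "__".
        encoded = f"{index:05d}__" + "__".join(src.parts[1:])
        target = dest / encoded
        # copy2 copies file contents plus metadata such as modification time.
        shutil.copy2(src, target)
        mapping[str(target)] = str(src)
    return mapping
```

Called with a pattern such as `docs/**/*.md` and a directory from `tempfile.mkdtemp()`, the returned mapping is what a later step would need to translate temp paths back to the originals.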
Tests:
New tests/autorag/test_parser_recursive.py adds coverage for deep folders; all tests now pass.
Lints
Fixed F823 in api/tests/test_app.py; repo‑wide trailing‑whitespace / EOF cleanup by pre‑commit.
Usage
Temp‑dir strategy
AutoRAG’s run_parser still expects a flat *.md glob. Copying files into a temp folder preserves modification time (via shutil.copy2) and avoids macOS multiprocessing issues with symlinks. After parsing, a two‑line _rewrite_output_paths() post‑process restores the real paths.
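The body of `_rewrite_output_paths()` is not shown in this description. The sketch below illustrates the idea for a JSONL output only, under stated assumptions: the function name `rewrite_jsonl_paths`, the `path` field name, and the temp-to-original `mapping` dict are hypothetical. Parquet outputs would need a DataFrame library and are not covered here.

```python
import json
from pathlib import Path


def rewrite_jsonl_paths(
    jsonl_path: Path, mapping: dict[str, str], field: str = "path"
) -> None:
    """Replace temp paths stored in `field` with their originals, in place.

    `mapping` maps temp path -> original absolute path (assumed shape).
    """
    rewritten = []
    for line in jsonl_path.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        value = record.get(field)
        # Only touch string values we recognise; leave everything else as is.
        if isinstance(value, str) and value in mapping:
            record[field] = mapping[value]
        rewritten.append(json.dumps(record, ensure_ascii=False))
    jsonl_path.write_text(
        "".join(row + "\n" for row in rewritten), encoding="utf-8"
    )
```

Records whose path is not in the mapping pass through unchanged, so running the rewrite twice on the same file has no further effect.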
Backward Compatibility
Default remains non‑recursive; existing integrations are unchanged.
No LangChain modules were edited.