This project provides a Python script to parse ClinVar RCV XML files and output the data as NDJSON files, chunked by a configurable number of records per file.
An example of the expected output is provided in the examples/output_sample
directory.
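The chunking behaviour described above can be sketched as follows. This is a minimal illustration, not the code in parse_clinvar.py: the function name `write_ndjson_chunks` and the `chunk_00000.ndjson` file naming are assumptions made for the example, and the records are plain dicts standing in for parsed ClinVar entries.

```python
import json
from pathlib import Path


def write_ndjson_chunks(records, output_dir, max_rows_per_file=500000):
    """Stream dict records into NDJSON files of at most max_rows_per_file lines.

    Hypothetical helper: the chunk_NNNNN.ndjson naming is an assumption,
    not necessarily what the real script produces.
    """
    if max_rows_per_file < 1:
        raise ValueError("max_rows_per_file must be at least 1")
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    handle = None
    try:
        for count, record in enumerate(records):
            # Start a new file every max_rows_per_file records.
            if count % max_rows_per_file == 0:
                if handle is not None:
                    handle.close()
                path = out / f"chunk_{len(written):05d}.ndjson"
                handle = path.open("w", encoding="utf-8")
                written.append(path)
            # NDJSON: one JSON document per line.
            handle.write(json.dumps(record) + "\n")
    finally:
        if handle is not None:
            handle.close()
    return written
```

Because records are written one at a time, memory use stays flat regardless of the chunk size, which matters for large inputs.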
- Python 3.7+
- lxml
Install dependencies:
pip install -r requirements.txt
Run the parser from the command line:
python parse_clinvar.py -i INPUT_FILE.xml.gz -o OUTPUT_DIR [-m MAX_ROWS_PER_FILE]
- -i, --input: Path to the input ClinVar RCV XML file. Input can be compressed (.gz) or not. (required)
- -o, --output: Directory to write NDJSON output files (required)
- -m, --max_rows_per_file: Maximum records per NDJSON output file (optional, default: 500000)
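For readers who want to see how options like these are typically declared, here is a minimal argparse sketch. It mirrors the flags, required status, and default listed above, but the function name `build_arg_parser` and the help strings are assumptions; the actual script may define its interface differently.

```python
import argparse


def build_arg_parser():
    """Build a parser mirroring the documented options (illustrative only)."""
    parser = argparse.ArgumentParser(
        description="Parse a ClinVar RCV XML file into chunked NDJSON files."
    )
    parser.add_argument(
        "-i", "--input", required=True,
        help="Path to the input XML file, gzip-compressed or not",
    )
    parser.add_argument(
        "-o", "--output", required=True,
        help="Directory to write NDJSON output files",
    )
    parser.add_argument(
        "-m", "--max_rows_per_file", type=int, default=500000,
        help="Maximum records per NDJSON output file",
    )
    return parser
```

Calling `build_arg_parser().parse_args()` would then expose the values as `args.input`, `args.output`, and `args.max_rows_per_file`.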
Example:
python parse_clinvar.py -i data/clinvar.xml.gz -o output/ -m 1000
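Since the input may be gzip-compressed or plain, one common way to accept both is to inspect the first two bytes of the file rather than rely on the extension. The sketch below shows that approach using only the standard library; `open_xml_input` is a hypothetical helper name, and the real script may detect compression some other way.

```python
import gzip


def open_xml_input(path):
    """Open path for binary reading, transparently decompressing gzip.

    Hypothetical helper: detection uses the gzip magic number (0x1f 0x8b)
    instead of the file extension.
    """
    with open(path, "rb") as probe:
        magic = probe.read(2)
    if magic == b"\x1f\x8b":
        return gzip.open(path, "rb")
    return open(path, "rb")
```

The returned file object can be passed to a streaming parser such as `lxml.etree.iterparse`, which accepts file-like objects, so the XML never has to be fully decompressed on disk.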