radiant-network/parse-clinvar-xml

ClinVar RCV Parser

This project provides a Python script that parses ClinVar RCV XML files and writes the data as NDJSON files, chunked by a configurable number of records per file. An example of the expected output is provided in the examples/output_sample directory.
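The general approach described above (stream the XML, emit one JSON object per record, start a new file every N records) can be sketched roughly as follows. This is an illustration only, not the code in parse_clinvar.py: it uses the standard library's xml.etree.ElementTree rather than lxml, and the record tag `ClinVarSet`, the extracted fields, the helper names, and the `part-00000.ndjson` file naming are all assumptions made for the example.

```python
import gzip
import json
import os
import xml.etree.ElementTree as ET


def open_maybe_gzip(path):
    """Open a file for binary reading; .gz inputs are decompressed on the fly."""
    return gzip.open(path, "rb") if path.endswith(".gz") else open(path, "rb")


def iter_records(xml_path, record_tag="ClinVarSet"):
    """Yield one dict per record element without loading the whole file.

    record_tag is an assumption; adjust it to the element that wraps
    one record in your input file.
    """
    with open_maybe_gzip(xml_path) as fh:
        for _event, elem in ET.iterparse(fh, events=("end",)):
            if elem.tag == record_tag:
                record = dict(elem.attrib)  # placeholder: attributes only
                yield record
                elem.clear()  # drop the element's children once it is processed


def write_ndjson_chunks(records, output_dir, max_rows_per_file=500000):
    """Write records as NDJSON, opening a new file every max_rows_per_file rows."""
    os.makedirs(output_dir, exist_ok=True)
    paths = []
    out = None
    for i, record in enumerate(records):
        if i % max_rows_per_file == 0:
            if out is not None:
                out.close()
            # Hypothetical naming scheme, chosen only for this sketch.
            path = os.path.join(output_dir, f"part-{len(paths):05d}.ndjson")
            paths.append(path)
            out = open(path, "w", encoding="utf-8")
        out.write(json.dumps(record) + "\n")
    if out is not None:
        out.close()
    return paths
```

Calling `write_ndjson_chunks(iter_records("data/clinvar.xml.gz"), "output/", 1000)` would then write files of at most 1000 lines each. Streaming with `iterparse` keeps memory use low, which matters for large ClinVar releases.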

Requirements

  • Python 3.7+
  • lxml

Install dependencies:

pip install -r requirements.txt

Usage

Run the parser from the command line:

python parse_clinvar.py -i INPUT_FILE.xml.gz -o OUTPUT_DIR [-m MAX_ROWS_PER_FILE]

  • -i, --input: Path to the input ClinVar RCV XML file, either plain or gzip-compressed (.gz) (required)
  • -o, --output: Directory in which to write the NDJSON output files (required)
  • -m, --max_rows_per_file: Maximum number of records per NDJSON output file (optional, default: 500000)
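For readers who want to see how these options map onto Python's argparse, a minimal sketch follows. The flag names and the default come from the list above; the function name `build_parser` and the help strings are assumptions, and the actual script may define its arguments differently.

```python
import argparse


def build_parser():
    """Build a parser that mirrors the documented command-line options."""
    parser = argparse.ArgumentParser(
        description="Parse a ClinVar RCV XML file into chunked NDJSON files."
    )
    parser.add_argument(
        "-i", "--input", required=True,
        help="Path to the input ClinVar RCV XML file (plain or .gz)",
    )
    parser.add_argument(
        "-o", "--output", required=True,
        help="Directory to write NDJSON output files",
    )
    parser.add_argument(
        "-m", "--max_rows_per_file", type=int, default=500000,
        help="Maximum records per NDJSON output file",
    )
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(["-i", "data/clinvar.xml.gz", "-o", "output/"])
print(args.input, args.output, args.max_rows_per_file)
```

Using `type=int` means a non-numeric value for `-m` is rejected by argparse before any parsing work starts.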

Example:

python parse_clinvar.py -i data/clinvar.xml.gz -o output/ -m 1000
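After a run like the one above, it can be useful to confirm that every output file is valid NDJSON and that no file exceeds the requested limit. The following is a small sketch for that check; the helper name `count_records` is hypothetical, and it assumes the output directory contains only the NDJSON files written by the parser.

```python
import json
from pathlib import Path


def count_records(output_dir):
    """Return {file name: record count}, validating each line as JSON."""
    counts = {}
    for path in sorted(Path(output_dir).iterdir()):
        if not path.is_file():
            continue
        n = 0
        with path.open(encoding="utf-8") as fh:
            for line in fh:
                if not line.strip():
                    continue  # tolerate blank lines
                json.loads(line)  # raises ValueError on a malformed line
                n += 1
        counts[path.name] = n
    return counts
```

With `-m 1000`, every value in `count_records("output/")` should be at most 1000.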
