-
Notifications
You must be signed in to change notification settings - Fork 2
Add random subsampling as alternative to filtering #39
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: dev
Are you sure you want to change the base?
Changes from 21 commits
63416b8
b874f83
5c3b49d
1f6c285
c793536
97d262b
d1ad332
b76a746
b002497
f580449
87ae761
0ca1868
464adb4
bf3424c
a19de46
94ffe9d
42e169e
28cdbfa
cd16475
aa26396
2176b83
268f7cd
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,39 @@ | ||
#!/usr/bin/env bash | ||
|
||
set -euo pipefail | ||
|
||
# Usage: randomsubset.sh <input.vcf> <output.vcf> <fraction> | ||
if [[ $# -ne 3 ]]; then | ||
echo "Usage: $0 <input.vcf> <output.vcf> <fraction (e.g. 0.1 for 10%)>" | ||
exit 1 | ||
fi | ||
|
||
input_vcf="$1" | ||
output_vcf="$2" | ||
fraction="$3" | ||
|
||
# Create temp files and directories | ||
tmpdir=$(mktemp -d) | ||
tmp_vcf="$tmpdir/tmp.vcf" | ||
tmp_sorted_vcf="$tmpdir/tmp.sorted.vcf" | ||
|
||
# Calculate number of records to sample | ||
subset_count=$(bcftools stats "$input_vcf" | awk -v frac="$fraction" -F'\t' '$3=="number of records:" {print int($4*frac)}') | ||
|
||
echo "Sampling $subset_count records from $input_vcf" | ||
|
||
# Write header | ||
bcftools view --header-only "$input_vcf" > "$tmp_vcf" | ||
|
||
# Randomly sample records | ||
bcftools view --no-header "$input_vcf" | \ | ||
awk '{printf("%f\t%s\n",rand(),$0);}' | \ | ||
sort -t $'\t' -T "$tmpdir" -k1,1g | \ | ||
head -n "$subset_count" | \ | ||
cut -f 2- >> "$tmp_vcf" || true | ||
|
||
# Sort and write to output | ||
bcftools sort -T "$tmpdir" -o "$output_vcf" "$tmp_vcf" | ||
|
||
# Clean up | ||
rm -rf "$tmpdir" |
Original file line number | Diff line number | Diff line change | ||||||||
---|---|---|---|---|---|---|---|---|---|---|
|
@@ -13,6 +13,7 @@ The pipeline is built using [Nextflow](https://www.nextflow.io/) and processes d | |||||||||
- [Tabix](#tabix) - Indexes (g.)vcf files | ||||||||||
- [GenotypeGVCFs](#genotypegvcfs) - Converts g.vcf files to vcf with GATK | ||||||||||
- [Filter VCFs](#filter-vcfs) - Filters the VCF based on a string given to the `filter` param with bcftools/view | ||||||||||
- [Subset VCFs](#subsetvcfs) - Keeps only a fraction of random variants based on the `subset` param | ||||||||||
- [Concatenate VCFs](#concatenate-vcfs) - Concatenates all vcfs that have the same id and the same label with bcftools/concat | ||||||||||
- [Rename Samples](#rename-samples) - Changes the sample name in the vcf file to the label with bcftools/reheader | ||||||||||
- [Merge VCFs](#merge-vcfs) - Merges all vcfs from the same sample with bcftools/merge | ||||||||||
|
@@ -59,6 +60,19 @@ The GATK GenotypeGVCFs module translates genotype (g) vcf files into classic vcf | |||||||||
|
||||||||||
VEP annotated VCF files can be filtered for certain flags present after VEP annotation. Notably, this enables filtering for variants with certain impact levels or consequences. Filtering will produces VCF files holding just the variants matching the specific patterns. | ||||||||||
|
||||||||||
### Subset VCFs | ||||||||||
|
||||||||||
<details markdown="1"> | ||||||||||
<summary>Output files</summary> | ||||||||||
|
||||||||||
- `bcftools/subset/{meta.label}/` | ||||||||||
- `{filename}.subset.vcf.gz`: vcf file with fraction of random variants. | ||||||||||
- `{filename}.seubset.vcf.gz.tbi`: tabix index of the vcf file. | ||||||||||
|
||||||||||
</details> | ||||||||||
|
||||||||||
VCF files can be randomly subsampled to keep only a specific fraction of variants. This enables comparison to the filtered variants. | ||||||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Suggested change
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I would however suggest to have it like this: and down below: Filtering optionsString-based filtering... Random subset filtering |
||||||||||
|
||||||||||
### Concatenate VCFs | ||||||||||
|
||||||||||
<details markdown="1"> | ||||||||||
|
Original file line number | Diff line number | Diff line change | ||||
---|---|---|---|---|---|---|
|
@@ -96,6 +96,12 @@ Notably, this enables filtering for variants with certain impact levels or conse | |||||
> [!NOTE] | ||||||
> The filtering step only works with conda for nextflow versions above 24.10.2 (use docker or singularity if you want to use an older nextflow version) | ||||||
|
||||||
### Subset VCFs | ||||||
|
||||||
VCF files can be randomly subsetted to keep only a specific fraction of variants. This enables comparison to the filtered variants. | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Suggested change
|
||||||
|
||||||
You can determine appropriate fractions by comparing the number of filtered variants with the total number of variants. This can be done with a script that collects the number of variants by using `bcftools stats` from both files and dividing them. The more VCF files you use to compare the more robust the fraction becomes. (We compared f.ex. around 90 files and landed at an average fraction of 0.00175 when using `--filter 'INFO/CSQ ~ "HIGH"'`). | ||||||
famosab marked this conversation as resolved.
Outdated
Show resolved
Hide resolved
|
||||||
|
||||||
### Updating the pipeline | ||||||
|
||||||
When you run the above command, Nextflow automatically pulls the pipeline code from GitHub and stores it as a cached version. When running the pipeline after this, it will always use the cached version if available - even if the pipeline has been updated since. To make sure that you're running the latest version of the pipeline, make sure that you regularly update the cached version of the pipeline: | ||||||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,7 @@ | ||
--- | ||
# yaml-language-server: $schema=https://raw.githubusercontent.com/nf-core/modules/master/modules/environment-schema.json | ||
channels: | ||
- conda-forge | ||
- bioconda | ||
dependencies: | ||
- "bioconda::bcftools=1.21" |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,49 @@ | ||
process RANDOMSUBSET { | ||
tag "$meta.id" | ||
label 'process_medium' | ||
|
||
conda "${moduleDir}/environment.yml" | ||
container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ? | ||
'https://community-cr-prod.seqera.io/docker/registry/v2/blobs/sha256/5a/5acacb55c52bec97c61fd34ffa8721fce82ce823005793592e2a80bf71632cd0/data': | ||
'community.wave.seqera.io/library/bcftools:1.21--4335bec1d7b44d11' }" | ||
|
||
input: | ||
tuple val(meta), path(vcf), path(index) | ||
val(fraction) | ||
|
||
output: | ||
tuple val(meta), path("*.vcf.gz"), emit: vcf | ||
tuple val(meta), path("*.tbi") , emit: tbi | ||
path "versions.yml" , emit: versions | ||
|
||
when: | ||
task.ext.when == null || task.ext.when | ||
|
||
script: | ||
def args = task.ext.args ?: '' | ||
def prefix = task.ext.prefix ?: "${meta.id}" | ||
""" | ||
randomsubset.sh ${vcf} ${prefix}.vcf ${fraction} | ||
|
||
bgzip ${prefix}.vcf | ||
tabix -p vcf ${prefix}.vcf.gz | ||
|
||
cat <<-END_VERSIONS > versions.yml | ||
"${task.process}": | ||
bcftools: \$(bcftools --version 2>&1 | head -n1 | sed 's/^.*bcftools //; s/ .*\$//') | ||
END_VERSIONS | ||
""" | ||
|
||
stub: | ||
def args = task.ext.args ?: '' | ||
def prefix = task.ext.prefix ?: "${meta.id}" | ||
""" | ||
echo | gzip > ${prefix}.vcf.gz | ||
touch ${prefix}.tbi | ||
|
||
cat <<-END_VERSIONS > versions.yml | ||
"${task.process}": | ||
bcftools: \$(bcftools --version 2>&1 | head -n1 | sed 's/^.*bcftools //; s/ .*\$//') | ||
END_VERSIONS | ||
""" | ||
} |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,65 @@ | ||
nextflow_process { | ||
|
||
name "Test Process RANDOMSUBSET" | ||
script "../main.nf" | ||
process "RANDOMSUBSET" | ||
|
||
tag "modules" | ||
tag "modules_" | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. What is this tag for? |
||
tag "randomsubset" | ||
|
||
test("sarscov2 - [vcf, tbi]") { | ||
|
||
when { | ||
process { | ||
""" | ||
// The input VCF has 9 records so we expect 4 records in the output VCF | ||
input[0] = [ | ||
[ id:'out', single_end:false ], // meta map | ||
file('https://github.com/nf-core/test-datasets/raw/refs/heads/modules/data/genomics/sarscov2/illumina/vcf/test.vcf.gz', checkIfExists: true), | ||
file('https://github.com/nf-core/test-datasets/raw/refs/heads/modules/data/genomics/sarscov2/illumina/vcf/test.vcf.gz.tbi', checkIfExists: true) | ||
] | ||
input[1] = 0.5 | ||
""" | ||
} | ||
} | ||
|
||
then { | ||
assertAll( | ||
{ assert process.success }, | ||
{ assert snapshot( | ||
path(process.out.vcf.get(0).get(1)).vcf.summary, | ||
file(process.out.tbi.get(0).get(1)).name, | ||
process.out.versions | ||
).match() }, | ||
) | ||
} | ||
|
||
} | ||
|
||
test("sarscov2 - [vcf, tbi] - stub") { | ||
|
||
options "-stub" | ||
|
||
when { | ||
process { | ||
""" | ||
input[0] = [ | ||
[ id:'out', single_end:false ], // meta map | ||
file('https://github.com/nf-core/test-datasets/raw/refs/heads/modules/data/genomics/sarscov2/illumina/vcf/test.vcf.gz', checkIfExists: true), | ||
file('https://github.com/nf-core/test-datasets/raw/refs/heads/modules/data/genomics/sarscov2/illumina/vcf/test.vcf.gz.tbi', checkIfExists: true) | ||
] | ||
input[1] = 0.00175 | ||
""" | ||
} | ||
} | ||
|
||
then { | ||
assertAll( | ||
{ assert process.success }, | ||
{ assert snapshot(process.out).match() }, | ||
) | ||
} | ||
|
||
} | ||
} |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,69 @@ | ||
{ | ||
"sarscov2 - [vcf, tbi] - stub": { | ||
"content": [ | ||
{ | ||
"0": [ | ||
[ | ||
{ | ||
"id": "out", | ||
"single_end": false | ||
}, | ||
"out.subset.vcf.gz:md5,68b329da9893e34099c7d8ad5cb9c940" | ||
] | ||
], | ||
"1": [ | ||
[ | ||
{ | ||
"id": "out", | ||
"single_end": false | ||
}, | ||
"out.subset.tbi:md5,d41d8cd98f00b204e9800998ecf8427e" | ||
] | ||
], | ||
"2": [ | ||
"versions.yml:md5,ee7626565a01c36b7fb7a05f41e0653e" | ||
], | ||
"tbi": [ | ||
[ | ||
{ | ||
"id": "out", | ||
"single_end": false | ||
}, | ||
"out.subset.tbi:md5,d41d8cd98f00b204e9800998ecf8427e" | ||
] | ||
], | ||
"vcf": [ | ||
[ | ||
{ | ||
"id": "out", | ||
"single_end": false | ||
}, | ||
"out.subset.vcf.gz:md5,68b329da9893e34099c7d8ad5cb9c940" | ||
] | ||
], | ||
"versions": [ | ||
"versions.yml:md5,ee7626565a01c36b7fb7a05f41e0653e" | ||
] | ||
} | ||
], | ||
"meta": { | ||
"nf-test": "0.9.2", | ||
"nextflow": "25.04.3" | ||
}, | ||
"timestamp": "2025-06-18T10:53:32.966380045" | ||
}, | ||
"sarscov2 - [vcf, tbi]": { | ||
"content": [ | ||
"VcfFile [chromosomes=[MT192765.1], sampleCount=1, variantCount=4, phased=false, phasedAutodetect=false]", | ||
"out.subset.vcf.gz.tbi", | ||
[ | ||
"versions.yml:md5,ee7626565a01c36b7fb7a05f41e0653e" | ||
] | ||
], | ||
"meta": { | ||
"nf-test": "0.9.2", | ||
"nextflow": "25.04.3" | ||
}, | ||
"timestamp": "2025-06-18T10:53:26.441286474" | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.