Changes from 21 commits
6 changes: 3 additions & 3 deletions .nf-core.yml
@@ -34,8 +34,8 @@ repository_type: pipeline

template:
author: "Famke Bäuerle, Dorothy Ellis"
description: Nextflow pipeline to convert (g)vcfs to matrices suitable for statistical
analysis
description: Nextflow pipeline to convert (g)vcfs to matrices suitable for
statistical analysis
force: false
is_nfcore: false
name: vcftocounts
@@ -46,4 +46,4 @@ template:
- codespaces
- fastqc
- adaptivecard
version: 2.0.2dev
version: 2.1.0dev
3 changes: 2 additions & 1 deletion CHANGELOG.md
@@ -3,11 +3,12 @@
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/)
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## v2.0.2dev
## v2.1.0dev

### `Added`

- [#34](https://github.com/qbic-pipelines/vcftocounts/pull/34) - Swap CI tests to nf-test and fix small channel issue
- [#39](https://github.com/qbic-pipelines/vcftocounts/pull/39) - Add random subsampling as alternative to filtering

### `Fixed`

12 changes: 12 additions & 0 deletions CITATIONS.md
@@ -10,6 +10,18 @@

## Pipeline tools

- [BCFTools](https://pubmed.ncbi.nlm.nih.gov/21903627/)

> Li H: A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. Bioinformatics. 2011 Nov 1;27(21):2987-93. doi: 10.1093/bioinformatics/btr509. PubMed PMID: 21903627; PubMed Central PMCID: PMC3198575.

- [GATK](https://pubmed.ncbi.nlm.nih.gov/20644199/)

> McKenna A, Hanna M, Banks E, et al.: The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 2010 Sep;20(9):1297-303. doi: 10.1101/gr.107524.110. Epub 2010 Jul 19. PubMed PMID: 20644199; PubMed Central PMCID: PMC2928508.

- [Tabix](https://academic.oup.com/bioinformatics/article/27/5/718/262743)

> Li H, Tabix: fast retrieval of sequence features from generic TAB-delimited files, Bioinformatics, Volume 27, Issue 5, 1 March 2011, Pages 718–719, doi: 10.1093/bioinformatics/btq671. PubMed PMID: 21208982. PubMed Central PMCID: PMC3042176.

- [MultiQC](https://pubmed.ncbi.nlm.nih.gov/27312411/)

> Ewels P, Magnusson M, Lundin S, Käller M. MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics. 2016 Oct 1;32(19):3047-8. doi: 10.1093/bioinformatics/btw354. Epub 2016 Jun 16. PubMed PMID: 27312411; PubMed Central PMCID: PMC5039924.
6 changes: 3 additions & 3 deletions README.md
@@ -17,7 +17,9 @@

1. Indexes (g.)vcf files ([`tabix`](http://www.htslib.org/doc/tabix.html))
2. Converts g.vcf files to vcf with `genotypegvcf` ([`GATK`](https://gatk.broadinstitute.org/hc/en-us))
3. Filters the VCF based on a string given to the `filter` param with `bcftools/view` ([`bcftools`](https://samtools.github.io/bcftools/bcftools.html)) - Turned off by default.
3. Optional filtering of VCF files
3.1 Filtering based on a string given to the `filter` param with `bcftools/view` ([`bcftools`](https://samtools.github.io/bcftools/bcftools.html)) - Turned off by default.
3.2 Keeping only a random fraction of the variants based on the `subset` param with a custom bash script using `bcftools/stats`, `view` and `sort` ([`bcftools`](https://samtools.github.io/bcftools/bcftools.html)) - Turned off by default; intended as an alternative to filtering.
4. Concatenates all vcfs that have the same id and the same label with `bcftools/concat` ([`bcftools`](https://samtools.github.io/bcftools/bcftools.html))
5. Changes the sample name in the vcf file to the filename with `bcftools/reheader` ([`bcftools`](https://samtools.github.io/bcftools/bcftools.html)) - This can be turned off by adding `--rename false` to the `nextflow run` command.
6. Merges all vcfs from the same sample with `bcftools/merge` ([`bcftools`](https://samtools.github.io/bcftools/bcftools.html))
@@ -73,8 +75,6 @@ If you would like to contribute to this pipeline, please see the [contributing g

If you use qbic-pipelines/vcftocounts for your analysis, please cite it using the following doi: [10.5281/zenodo.14616650](https://doi.org/10.5281/zenodo.14616650)

<!-- TODO nf-core: Add bibliography of tools and data used in your pipeline -->

An extensive list of references for the tools used by the pipeline can be found in the [`CITATIONS.md`](CITATIONS.md) file.

This pipeline uses code and infrastructure developed and maintained by the [nf-core](https://nf-co.re) community, reused here under the [MIT license](https://github.com/nf-core/tools/blob/main/LICENSE).
39 changes: 39 additions & 0 deletions bin/randomsubset.sh
@@ -0,0 +1,39 @@
#!/usr/bin/env bash

set -euo pipefail

# Usage: randomsubset.sh <input.vcf> <output.vcf> <fraction>
if [[ $# -ne 3 ]]; then
    # Print the usage message to stderr so it is not mistaken for regular output
    echo "Usage: $0 <input.vcf> <output.vcf> <fraction (e.g. 0.1 for 10%)>" >&2
    exit 1
fi

input_vcf="$1"
output_vcf="$2"
fraction="$3"

# Create temp files and directories
tmpdir=$(mktemp -d)
tmp_vcf="$tmpdir/tmp.vcf"
tmp_sorted_vcf="$tmpdir/tmp.sorted.vcf"

# Calculate number of records to sample
subset_count=$(bcftools stats "$input_vcf" | awk -v frac="$fraction" -F'\t' '$3=="number of records:" {print int($4*frac)}')

echo "Sampling $subset_count records from $input_vcf"

# Write header
bcftools view --header-only "$input_vcf" > "$tmp_vcf"

# Randomly sample records
bcftools view --no-header "$input_vcf" | \
awk '{printf("%f\t%s\n",rand(),$0);}' | \
sort -t $'\t' -T "$tmpdir" -k1,1g | \
head -n "$subset_count" | \
cut -f 2- >> "$tmp_vcf" || true

# Sort and write to output
bcftools sort -T "$tmpdir" -o "$output_vcf" "$tmp_vcf"

# Clean up
rm -rf "$tmpdir"
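
The sampling step in `randomsubset.sh` follows a decorate, sort, truncate pattern: each record gets a random key, the records are sorted by that key, and the first N are kept. The sketch below reproduces that pattern on plain text lines instead of VCF records, so it needs no bcftools. The function name `sample_lines` and its `seed` argument are illustrative assumptions and not part of the PR. Whether awk's `rand()` repeats the same sequence without a call to `srand()` depends on the awk implementation, so the sketch seeds it explicitly to make the draw reproducible.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the PR): keep COUNT random lines from stdin.
# SEED is passed to srand() so that repeated runs give the same selection.
sample_lines() {
    local count="$1" seed="$2"
    local tab
    tab=$(printf '\t')
    # 1. prefix each line with a random key, 2. sort by that key,
    # 3. keep the first COUNT lines, 4. strip the key again
    awk -v seed="$seed" 'BEGIN { srand(seed) } { printf("%f\t%s\n", rand(), $0) }' |
        sort -t "$tab" -k1,1g |
        head -n "$count" |
        cut -f 2-
}

# Same arithmetic as the script: int(9 * 0.5) = 4 lines out of 9
count=$(awk -v n=9 -v frac=0.5 'BEGIN { print int(n * frac) }')
seq 1 9 | sample_lines "$count" 42
```

The selected lines come out in random-key order, which is why the script in this PR passes the sampled records through `bcftools sort` before writing the output.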
8 changes: 8 additions & 0 deletions conf/modules.config
@@ -67,6 +67,14 @@ process {
]
}

withName: 'RANDOMSUBSET' {
ext.prefix = { "${meta.id}.subset" }
publishDir = [
mode: params.publish_dir_mode,
path: { "${params.outdir}/bcftools/subset/${meta.label}/" },
]
}

withName: 'BCFTOOLS_MERGE' {
ext.args = { "--force-samples --output-type z --write-index=tbi" }
ext.prefix = { "${meta.id}.merge" }
Binary file modified docs/images/vcftocounts-subway.excalidraw.png
14 changes: 14 additions & 0 deletions docs/output.md
@@ -13,6 +13,7 @@ The pipeline is built using [Nextflow](https://www.nextflow.io/) and processes d
- [Tabix](#tabix) - Indexes (g.)vcf files
- [GenotypeGVCFs](#genotypegvcfs) - Converts g.vcf files to vcf with GATK
- [Filter VCFs](#filter-vcfs) - Filters the VCF based on a string given to the `filter` param with bcftools/view
- [Subset VCFs](#subsetvcfs) - Keeps only a fraction of random variants based on the `subset` param
- [Concatenate VCFs](#concatenate-vcfs) - Concatenates all vcfs that have the same id and the same label with bcftools/concat
- [Rename Samples](#rename-samples) - Changes the sample name in the vcf file to the label with bcftools/reheader
- [Merge VCFs](#merge-vcfs) - Merges all vcfs from the same sample with bcftools/merge
@@ -59,6 +60,19 @@ The GATK GenotypeGVCFs module translates genotype (g) vcf files into classic vcf

VEP-annotated VCF files can be filtered for certain flags present after VEP annotation. Notably, this enables filtering for variants with certain impact levels or consequences. Filtering produces VCF files holding only the variants that match the specified patterns.

### Subset VCFs

<details markdown="1">
<summary>Output files</summary>

- `bcftools/subset/{meta.label}/`
- `{filename}.subset.vcf.gz`: vcf file with fraction of random variants.
- `{filename}.seubset.vcf.gz.tbi`: tabix index of the vcf file.
Collaborator Author

Suggested change
- `{filename}.seubset.vcf.gz.tbi`: tabix index of the vcf file.
- `{filename}.subset.vcf.gz.tbi`: tabix index of the vcf file.


</details>

VCF files can be randomly subsampled to keep only a specific fraction of variants. This enables comparison to the filtered variants.
Suggested change
VCF files can be randomly subsampled to keep only a specific fraction of variants. This enables comparison to the filtered variants.
### Subsample VCFs
VCF files can be randomly subsampled to keep only a specific fraction of variants. This enables comparison to the filtered variants.


I would however suggest to have it like this:

and down below:

Filtering options

String-based filtering

...

Random subset filtering


### Concatenate VCFs

<details markdown="1">
6 changes: 6 additions & 0 deletions docs/usage.md
@@ -96,6 +96,12 @@ Notably, this enables filtering for variants with certain impact levels or conse
> [!NOTE]
> The filtering step only works with conda for nextflow versions above 24.10.2 (use docker or singularity if you want to use an older nextflow version)

### Subset VCFs

VCF files can be randomly subsetted to keep only a specific fraction of variants. This enables comparison to the filtered variants.
Suggested change
VCF files can be randomly subsetted to keep only a specific fraction of variants. This enables comparison to the filtered variants.
VCF files can be randomly selected to keep only a specific fraction of variants. This enables comparison to the filtered variants.


You can determine an appropriate fraction by comparing the number of filtered variants with the total number of variants. This can be done with a script that collects the number of variants from both files using `bcftools stats` and divides one by the other. The more VCF files you compare, the more robust the fraction becomes. (For example, we compared around 90 files and arrived at an average fraction of 0.00175 when using `--filter 'INFO/CSQ ~ "HIGH"'`.)
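
The division described above can be sketched as follows. This is a minimal illustration, not a script shipped with the pipeline: `fraction_from_counts` is a hypothetical helper, and the two counts in the example call are made-up numbers chosen to reproduce the 0.00175 mentioned above. In practice the counts would be read from `bcftools stats`, using the same `number of records:` field that `bin/randomsubset.sh` parses.

```shell
#!/usr/bin/env bash
# Hypothetical helper: fraction = filtered record count / total record count.
fraction_from_counts() {
    local filtered="$1" total="$2"
    awk -v f="$filtered" -v t="$total" 'BEGIN {
        if (t == 0) exit 1          # avoid dividing by zero for empty VCFs
        printf "%.5f\n", f / t
    }'
}

# The record counts would come from commands such as (file names are placeholders):
#   bcftools stats all.vcf.gz      | awk -F'\t' '$3=="number of records:" {print $4}'
#   bcftools stats filtered.vcf.gz | awk -F'\t' '$3=="number of records:" {print $4}'
fraction_from_counts 175 100000
```

When comparing many files, running this once per file pair and averaging the results gives the kind of pooled estimate described above.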

### Updating the pipeline

When you run the above command, Nextflow automatically pulls the pipeline code from GitHub and stores it as a cached version. When running the pipeline after this, it will always use the cached version if available - even if the pipeline has been updated since. To make sure that you're running the latest version of the pipeline, make sure that you regularly update the cached version of the pipeline:
7 changes: 7 additions & 0 deletions modules/local/randomsubset/environment.yml
@@ -0,0 +1,7 @@
---
# yaml-language-server: $schema=https://raw.githubusercontent.com/nf-core/modules/master/modules/environment-schema.json
channels:
- conda-forge
- bioconda
dependencies:
- "bioconda::bcftools=1.21"
49 changes: 49 additions & 0 deletions modules/local/randomsubset/main.nf
@@ -0,0 +1,49 @@
process RANDOMSUBSET {
tag "$meta.id"
label 'process_medium'

conda "${moduleDir}/environment.yml"
container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ?
'https://community-cr-prod.seqera.io/docker/registry/v2/blobs/sha256/5a/5acacb55c52bec97c61fd34ffa8721fce82ce823005793592e2a80bf71632cd0/data':
'community.wave.seqera.io/library/bcftools:1.21--4335bec1d7b44d11' }"

input:
tuple val(meta), path(vcf), path(index)
val(fraction)

output:
tuple val(meta), path("*.vcf.gz"), emit: vcf
tuple val(meta), path("*.tbi") , emit: tbi
path "versions.yml" , emit: versions

when:
task.ext.when == null || task.ext.when

script:
def args = task.ext.args ?: ''
def prefix = task.ext.prefix ?: "${meta.id}"
"""
randomsubset.sh ${vcf} ${prefix}.vcf ${fraction}

bgzip ${prefix}.vcf
tabix -p vcf ${prefix}.vcf.gz

cat <<-END_VERSIONS > versions.yml
"${task.process}":
bcftools: \$(bcftools --version 2>&1 | head -n1 | sed 's/^.*bcftools //; s/ .*\$//')
END_VERSIONS
"""

stub:
def args = task.ext.args ?: ''
def prefix = task.ext.prefix ?: "${meta.id}"
"""
echo | gzip > ${prefix}.vcf.gz
touch ${prefix}.tbi

cat <<-END_VERSIONS > versions.yml
"${task.process}":
bcftools: \$(bcftools --version 2>&1 | head -n1 | sed 's/^.*bcftools //; s/ .*\$//')
END_VERSIONS
"""
}
65 changes: 65 additions & 0 deletions modules/local/randomsubset/tests/main.nf.test
@@ -0,0 +1,65 @@
nextflow_process {

name "Test Process RANDOMSUBSET"
script "../main.nf"
process "RANDOMSUBSET"

tag "modules"
tag "modules_"

What is this tag for?

tag "randomsubset"

test("sarscov2 - [vcf, tbi]") {

when {
process {
"""
// The input VCF has 9 records so we expect 4 records in the output VCF
input[0] = [
[ id:'out', single_end:false ], // meta map
file('https://github.com/nf-core/test-datasets/raw/refs/heads/modules/data/genomics/sarscov2/illumina/vcf/test.vcf.gz', checkIfExists: true),
file('https://github.com/nf-core/test-datasets/raw/refs/heads/modules/data/genomics/sarscov2/illumina/vcf/test.vcf.gz.tbi', checkIfExists: true)
]
input[1] = 0.5
"""
}
}

then {
assertAll(
{ assert process.success },
{ assert snapshot(
path(process.out.vcf.get(0).get(1)).vcf.summary,
file(process.out.tbi.get(0).get(1)).name,
process.out.versions
).match() },
)
}

}

test("sarscov2 - [vcf, tbi] - stub") {

options "-stub"

when {
process {
"""
input[0] = [
[ id:'out', single_end:false ], // meta map
file('https://github.com/nf-core/test-datasets/raw/refs/heads/modules/data/genomics/sarscov2/illumina/vcf/test.vcf.gz', checkIfExists: true),
file('https://github.com/nf-core/test-datasets/raw/refs/heads/modules/data/genomics/sarscov2/illumina/vcf/test.vcf.gz.tbi', checkIfExists: true)
]
input[1] = 0.00175
"""
}
}

then {
assertAll(
{ assert process.success },
{ assert snapshot(process.out).match() },
)
}

}
}
69 changes: 69 additions & 0 deletions modules/local/randomsubset/tests/main.nf.test.snap
@@ -0,0 +1,69 @@
{
"sarscov2 - [vcf, tbi] - stub": {
"content": [
{
"0": [
[
{
"id": "out",
"single_end": false
},
"out.subset.vcf.gz:md5,68b329da9893e34099c7d8ad5cb9c940"
]
],
"1": [
[
{
"id": "out",
"single_end": false
},
"out.subset.tbi:md5,d41d8cd98f00b204e9800998ecf8427e"
]
],
"2": [
"versions.yml:md5,ee7626565a01c36b7fb7a05f41e0653e"
],
"tbi": [
[
{
"id": "out",
"single_end": false
},
"out.subset.tbi:md5,d41d8cd98f00b204e9800998ecf8427e"
]
],
"vcf": [
[
{
"id": "out",
"single_end": false
},
"out.subset.vcf.gz:md5,68b329da9893e34099c7d8ad5cb9c940"
]
],
"versions": [
"versions.yml:md5,ee7626565a01c36b7fb7a05f41e0653e"
]
}
],
"meta": {
"nf-test": "0.9.2",
"nextflow": "25.04.3"
},
"timestamp": "2025-06-18T10:53:32.966380045"
},
"sarscov2 - [vcf, tbi]": {
"content": [
"VcfFile [chromosomes=[MT192765.1], sampleCount=1, variantCount=4, phased=false, phasedAutodetect=false]",
"out.subset.vcf.gz.tbi",
[
"versions.yml:md5,ee7626565a01c36b7fb7a05f41e0653e"
]
],
"meta": {
"nf-test": "0.9.2",
"nextflow": "25.04.3"
},
"timestamp": "2025-06-18T10:53:26.441286474"
}
}
3 changes: 2 additions & 1 deletion nextflow.config
@@ -13,6 +13,7 @@ params {
input = null
rename = true
filter = null
subset = null
removeIDs = true

// References
@@ -249,7 +250,7 @@ manifest {
mainScript = 'main.nf'
defaultBranch = 'master'
nextflowVersion = '!>=24.04.2'
version = '2.0.2dev'
version = '2.1.0dev'
doi = ''
}

4 changes: 4 additions & 0 deletions nextflow_schema.json
@@ -32,6 +32,10 @@
"type": "string",
"description": "Add a filtering criterion suitable for bcftools/view. For example 'INFO/CSQ ~ \"HIGH\"'."
},
"subset": {
"type": "number",
"description": "Get a random subset of variants. Set this parameter to the fraction you want to keep (e.g. 0.5 to keep half of the variants)."
},
"removeIDs": {
"type": "boolean",
"default": true,
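
The schema entry shown in this hunk types `subset` as a number and states no bounds. A minimal sketch of a range check that a wrapper script might run before launching the pipeline, assuming the fraction is meant to lie in the interval (0, 1]; `is_valid_fraction` is a hypothetical helper and not part of the pipeline.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed only for a plain decimal number in (0, 1].
is_valid_fraction() {
    awk -v f="$1" 'BEGIN {
        ok = (f ~ /^[0-9]*\.?[0-9]+$/) && (f + 0 > 0) && (f + 0 <= 1)
        exit !ok
    }'
}

# Try a few candidate values for the subset fraction
for value in 0.5 0.00175 1.5 abc; do
    if is_valid_fraction "$value"; then
        echo "$value: ok"
    else
        echo "$value: rejected"
    fi
done
```

Rejecting 0 is a choice made for this sketch: with `int(records * fraction)` as used in `bin/randomsubset.sh`, a fraction of 0 would select no records at all.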