qbic-pipelines · famosab · Jun 13, 2025 · Jun 16, 2025 · Jun 16, 2025 · Jun 16, 2025
diff --git a/.nf-core.yml b/.nf-core.yml
@@ -34,8 +34,8 @@ repository_type: pipeline
 
 template:
   author: "Famke Bäuerle, Dorothy Ellis"
-  description: Nextflow pipeline to convert (g)vcfs to matrices suitable for statistical
-    analysis
+  description: Nextflow pipeline to convert (g)vcfs to matrices suitable for
+    statistical analysis
   force: false
   is_nfcore: false
   name: vcftocounts
@@ -46,4 +46,4 @@ template:
     - codespaces
     - fastqc
     - adaptivecard
-  version: 2.0.2dev
+  version: 2.1.0dev
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -3,11 +3,12 @@
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/)
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
-## v2.0.2dev
+## v2.1.0dev
 
 ### `Added`
 
 - [#34](https://github.com/qbic-pipelines/vcftocounts/pull/34) - Swap CI tests to nf-test and fix small channel issue
+- [#39](https://github.com/qbic-pipelines/vcftocounts/pull/39) - Add random subsampling as alternative to filtering
 
 ### `Fixed`
 

diff --git a/CITATIONS.md b/CITATIONS.md
@@ -10,6 +10,18 @@
 
 ## Pipeline tools
 
+- [BCFTools](https://pubmed.ncbi.nlm.nih.gov/21903627/)
+
+  > Li H: A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. Bioinformatics. 2011 Nov 1;27(21):2987-93. doi: 10.1093/bioinformatics/btr509. PubMed PMID: 21903627; PubMed Central PMCID: PMC3198575.
+
+- [GATK](https://pubmed.ncbi.nlm.nih.gov/20644199/)
+
+  > McKenna A, Hanna M, Banks E, et al.: The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 2010 Sep;20(9):1297-303. doi: 10.1101/gr.107524.110. Epub 2010 Jul 19. PubMed PMID: 20644199; PubMed Central PMCID: PMC2928508.
+
+- [Tabix](https://academic.oup.com/bioinformatics/article/27/5/718/262743)
+
+  > Li H, Tabix: fast retrieval of sequence features from generic TAB-delimited files, Bioinformatics, Volume 27, Issue 5, 1 March 2011, Pages 718–719, doi: 10.1093/bioinformatics/btq671. PubMed PMID: 21208982. PubMed Central PMCID: PMC3042176.
+
 - [MultiQC](https://pubmed.ncbi.nlm.nih.gov/27312411/)
 
 > Ewels P, Magnusson M, Lundin S, Käller M. MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics. 2016 Oct 1;32(19):3047-8. doi: 10.1093/bioinformatics/btw354. Epub 2016 Jun 16. PubMed PMID: 27312411; PubMed Central PMCID: PMC5039924.

diff --git a/README.md b/README.md
@@ -17,7 +17,9 @@
 
 1. Indexes (g.)vcf files ([`tabix`](http://www.htslib.org/doc/tabix.html))
 2. Converts g.vcf files to vcf with `genotypegvcf` ([`GATK`](https://gatk.broadinstitute.org/hc/en-us))
-3. Filters the VCF based on a string given to the `filter` param with `bcftools/view` ([`bcftools`](https://samtools.github.io/bcftools/bcftools.html)) - Turned off by default.
+3. Optional filtering of VCF files
+   3.1 Filtering based on a string given to the `filter` param with `bcftools/view` ([`bcftools`](https://samtools.github.io/bcftools/bcftools.html)) - Turned off by default.
+   3.2 Keeping only a random fraction of variants based on the `subset` param with a custom bash script using `bcftools/stats`, `view` and `sort` ([`bcftools`](https://samtools.github.io/bcftools/bcftools.html)) - Turned off by default; use it as an alternative to filtering.
 4. Concatenates all vcfs that have the same id and the same label with `bcftools/concat` ([`bcftools`](https://samtools.github.io/bcftools/bcftools.html))
 5. Changes the sample name in the vcf file to the filename with `bcftools/reheader` ([`bcftools`](https://samtools.github.io/bcftools/bcftools.html)) - This can be turned off by adding `--rename false` to the `nextflow run` command.
 6. Merges all vcfs from the same sample with `bcftools/merge` ([`bcftools`](https://samtools.github.io/bcftools/bcftools.html))
@@ -73,8 +75,6 @@ If you would like to contribute to this pipeline, please see the [contributing g
 
 If you use qbic-pipelines/vcftocounts for your analysis, please cite it using the following doi: [10.5281/zenodo.14616650](https://doi.org/10.5281/zenodo.14616650)
 
-<!-- TODO nf-core: Add bibliography of tools and data used in your pipeline -->
-
 An extensive list of references for the tools used by the pipeline can be found in the [`CITATIONS.md`](CITATIONS.md) file.
 
 This pipeline uses code and infrastructure developed and maintained by the [nf-core](https://nf-co.re) community, reused here under the [MIT license](https://github.com/nf-core/tools/blob/main/LICENSE).

diff --git a/bin/randomsubset.sh b/bin/randomsubset.sh
@@ -0,0 +1,39 @@
+#!/usr/bin/env bash
+
+set -euo pipefail
+
+# Usage: randomsubset.sh <input.vcf> <output.vcf> <fraction>
+if [[ $# -ne 3 ]]; then
+    echo "Usage: $0 <input.vcf> <output.vcf> <fraction (e.g. 0.1 for 10%)>"
+    exit 1
+fi
+
+input_vcf="$1"
+output_vcf="$2"
+fraction="$3"
+
+# Create temp files and directories
+tmpdir=$(mktemp -d)
+tmp_vcf="$tmpdir/tmp.vcf"
+tmp_sorted_vcf="$tmpdir/tmp.sorted.vcf"
+
+# Calculate number of records to sample
+subset_count=$(bcftools stats "$input_vcf" | awk -v frac="$fraction" -F'\t' '$3=="number of records:" {print int($4*frac)}')
+
+echo "Sampling $subset_count records from $input_vcf"
+
+# Write header
+bcftools view --header-only "$input_vcf" > "$tmp_vcf"
+
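+# Randomly sample records: prefix each with a random key, sort by key, keep the first N, drop the key (`|| true` keeps pipefail from aborting when `head` closes the pipe early)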
+bcftools view --no-header "$input_vcf" | \
+    awk '{printf("%f\t%s\n",rand(),$0);}' | \
+    sort -t $'\t' -T "$tmpdir" -k1,1g | \
+    head -n "$subset_count" | \
+    cut -f 2- >> "$tmp_vcf" || true
+
+# Sort and write to output
+bcftools sort -T "$tmpdir" -o "$output_vcf" "$tmp_vcf"
+
+# Clean up
+rm -rf "$tmpdir"
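
The sampling pipeline in `bin/randomsubset.sh` can be tried out on plain text lines without `bcftools`. The sketch below is a hypothetical helper, not part of this PR: `sample_lines` and its seed argument are assumptions added for illustration, and it swaps `head` for an `awk 'NR <= n'` filter so that no stage exits early and the `|| true` guard is not needed.

```shell
# Sketch of the decorate / sort / cut sampling idea on plain text lines.
# sample_lines and its seed argument are illustrative, not part of randomsubset.sh.
#
# Usage: sample_lines <count> <seed> < input > output
sample_lines() {
    tab=$(printf '\t')
    # 1. prefix every line with a random key
    # 2. sort by that key (general numeric order)
    # 3. keep the first <count> lines; awk reads all input, so no SIGPIPE
    # 4. strip the key again
    awk -v seed="$2" 'BEGIN { srand(seed) } { printf("%f\t%s\n", rand(), $0) }' \
        | sort -t "$tab" -k1,1g \
        | awk -v n="$1" 'NR <= n' \
        | cut -f 2-
}
```

For example, `seq 1 9 | sample_lines 4 42` prints four of the nine input lines. Passing the same seed again reproduces the same selection with the same awk implementation, which may help when a sampled result needs to be repeatable.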
diff --git a/conf/modules.config b/conf/modules.config
@@ -67,6 +67,14 @@ process {
             ]
     }
 
+    withName: 'RANDOMSUBSET' {
+        ext.prefix = { "${meta.id}.subset" }
+        publishDir = [
+                mode: params.publish_dir_mode,
+                path: { "${params.outdir}/bcftools/subset/${meta.label}/" },
+            ]
+    }
+
     withName: 'BCFTOOLS_MERGE' {
         ext.args   = { "--force-samples --output-type z --write-index=tbi" }
         ext.prefix = { "${meta.id}.merge" }

diff --git a/docs/images/vcftocounts-subway.excalidraw.png b/docs/images/vcftocounts-subway.excalidraw.png
diff --git a/docs/output.md b/docs/output.md
@@ -13,6 +13,7 @@ The pipeline is built using [Nextflow](https://www.nextflow.io/) and processes d
 - [Tabix](#tabix) - Indexes (g.)vcf files
 - [GenotypeGVCFs](#genotypegvcfs) - Converts g.vcf files to vcf with GATK
 - [Filter VCFs](#filter-vcfs) - Filters the VCF based on a string given to the `filter` param with bcftools/view
+- [Subset VCFs](#subset-vcfs) - Keeps only a fraction of random variants based on the `subset` param
 - [Concatenate VCFs](#concatenate-vcfs) - Concatenates all vcfs that have the same id and the same label with bcftools/concat
 - [Rename Samples](#rename-samples) - Changes the sample name in the vcf file to the label with bcftools/reheader
 - [Merge VCFs](#merge-vcfs) - Merges all vcfs from the same sample with bcftools/merge
@@ -59,6 +60,19 @@ The GATK GenotypeGVCFs module translates genotype (g) vcf files into classic vcf
 
 VEP annotated VCF files can be filtered for certain flags present after VEP annotation. Notably, this enables filtering for variants with certain impact levels or consequences. Filtering will produces VCF files holding just the variants matching the specific patterns.
 
+### Subset VCFs
+
+<details markdown="1">
+<summary>Output files</summary>
+
+- `bcftools/subset/{meta.label}/`
+  - `{filename}.subset.vcf.gz`: VCF file containing a random fraction of the variants.
+  - `{filename}.subset.vcf.gz.tbi`: tabix index of the VCF file.
+
+</details>
+
+VCF files can be randomly subsampled to keep only a specific fraction of variants. This enables comparison to the filtered variants.
+
 ### Concatenate VCFs
 
 <details markdown="1">

diff --git a/docs/usage.md b/docs/usage.md
@@ -96,6 +96,12 @@ Notably, this enables filtering for variants with certain impact levels or conse
 > [!NOTE]
 > The filtering step only works with conda for nextflow versions above 24.10.2 (use docker or singularity if you want to use an older nextflow version)
 
+### Subset VCFs
+
+VCF files can be randomly subsampled to keep only a specific fraction of variants. This enables comparison to the filtered variants.
+
+You can determine an appropriate fraction by comparing the number of variants that pass the filter with the total number of variants. One way is a script that runs `bcftools stats` on both the filtered and the unfiltered file and divides the two record counts. The more VCF files you include in the comparison, the more robust the fraction becomes. For example, we compared around 90 files and arrived at an average fraction of 0.00175 when using `--filter 'INFO/CSQ ~ "HIGH"'`.
+
 ### Updating the pipeline
 
 When you run the above command, Nextflow automatically pulls the pipeline code from GitHub and stores it as a cached version. When running the pipeline after this, it will always use the cached version if available - even if the pipeline has been updated since. To make sure that you're running the latest version of the pipeline, make sure that you regularly update the cached version of the pipeline:
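
The fraction estimate described in the `docs/usage.md` addition can be sketched as follows. This is a minimal illustration under stated assumptions, not a script shipped with the pipeline: `count_records`, `subset_fraction`, and the report file names are hypothetical, and the reports are assumed to be saved output of `bcftools stats` for the unfiltered and the filtered VCF.

```shell
# Sketch: estimate a value for --subset from two saved `bcftools stats` reports.
# Function and file names are illustrative, not part of the pipeline.

# Print the "number of records" value from one bcftools stats report.
count_records() {
    awk -F'\t' '$1 == "SN" && $3 == "number of records:" { print $4 }' "$1"
}

# Usage: subset_fraction <unfiltered.stats.txt> <filtered.stats.txt>
# Prints filtered / total with five decimal places.
subset_fraction() {
    awk -v total="$(count_records "$1")" -v filtered="$(count_records "$2")" 'BEGIN {
        if (total + 0 == 0) {
            print "no records in the unfiltered report" > "/dev/stderr"
            exit 1
        }
        printf "%.5f\n", filtered / total
    }'
}
```

The reports could be produced with `bcftools stats unfiltered.vcf.gz > unfiltered.stats.txt` and the same command on the filtered file. Averaging the printed value over many file pairs gives a more robust fraction, as noted in the usage text.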

diff --git a/modules/local/randomsubset/environment.yml b/modules/local/randomsubset/environment.yml
@@ -0,0 +1,7 @@
+---
+# yaml-language-server: $schema=https://raw.githubusercontent.com/nf-core/modules/master/modules/environment-schema.json
+channels:
+  - conda-forge
+  - bioconda
+dependencies:
+  - "bioconda::bcftools=1.21"
diff --git a/modules/local/randomsubset/main.nf b/modules/local/randomsubset/main.nf
@@ -0,0 +1,49 @@
+process RANDOMSUBSET {
+    tag "$meta.id"
+    label 'process_medium'
+
+    conda "${moduleDir}/environment.yml"
+    container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ?
+        'https://community-cr-prod.seqera.io/docker/registry/v2/blobs/sha256/5a/5acacb55c52bec97c61fd34ffa8721fce82ce823005793592e2a80bf71632cd0/data':
+        'community.wave.seqera.io/library/bcftools:1.21--4335bec1d7b44d11' }"
+
+    input:
+    tuple val(meta), path(vcf), path(index)
+    val(fraction)
+
+    output:
+    tuple val(meta), path("*.vcf.gz"), emit: vcf
+    tuple val(meta), path("*.tbi")   , emit: tbi
+    path "versions.yml"              , emit: versions
+
+    when:
+    task.ext.when == null || task.ext.when
+
+    script:
+    def args = task.ext.args ?: ''
+    def prefix = task.ext.prefix ?: "${meta.id}"
+    """
+    randomsubset.sh ${vcf} ${prefix}.vcf ${fraction}
+
+    bgzip ${prefix}.vcf
+    tabix -p vcf ${prefix}.vcf.gz
+
+    cat <<-END_VERSIONS > versions.yml
+    "${task.process}":
+        bcftools: \$(bcftools --version 2>&1 | head -n1 | sed 's/^.*bcftools //; s/ .*\$//')
+    END_VERSIONS
+    """
+
+    stub:
+    def args = task.ext.args ?: ''
+    def prefix = task.ext.prefix ?: "${meta.id}"
+    """
+    echo | gzip > ${prefix}.vcf.gz
+    touch ${prefix}.tbi
+
+    cat <<-END_VERSIONS > versions.yml
+    "${task.process}":
+        bcftools: \$(bcftools --version 2>&1 | head -n1 | sed 's/^.*bcftools //; s/ .*\$//')
+    END_VERSIONS
+    """
+}
diff --git a/modules/local/randomsubset/tests/main.nf.test b/modules/local/randomsubset/tests/main.nf.test
@@ -0,0 +1,65 @@
+nextflow_process {
+
+    name "Test Process RANDOMSUBSET"
+    script "../main.nf"
+    process "RANDOMSUBSET"
+
+    tag "modules"
+    tag "modules_"
+    tag "randomsubset"
+
+    test("sarscov2 - [vcf, tbi]") {
+
+        when {
+            process {
+                """
+                // The input VCF has 9 records so we expect 4 records in the output VCF
+                input[0] = [
+                    [ id:'out', single_end:false ], // meta map
+                    file('https://github.com/nf-core/test-datasets/raw/refs/heads/modules/data/genomics/sarscov2/illumina/vcf/test.vcf.gz', checkIfExists: true),
+                    file('https://github.com/nf-core/test-datasets/raw/refs/heads/modules/data/genomics/sarscov2/illumina/vcf/test.vcf.gz.tbi', checkIfExists: true)
+                ]
+                input[1] = 0.5
+                """
+            }
+        }
+
+        then {
+            assertAll(
+                { assert process.success },
+                { assert snapshot(
+                    path(process.out.vcf.get(0).get(1)).vcf.summary,
+                    file(process.out.tbi.get(0).get(1)).name,
+                    process.out.versions
+                ).match() },
+            )
+        }
+
+    }
+
+    test("sarscov2 - [vcf, tbi] - stub") {
+
+        options "-stub"
+
+        when {
+            process {
+                """
+                input[0] = [
+                    [ id:'out', single_end:false ], // meta map
+                    file('https://github.com/nf-core/test-datasets/raw/refs/heads/modules/data/genomics/sarscov2/illumina/vcf/test.vcf.gz', checkIfExists: true),
+                    file('https://github.com/nf-core/test-datasets/raw/refs/heads/modules/data/genomics/sarscov2/illumina/vcf/test.vcf.gz.tbi', checkIfExists: true)
+                ]
+                input[1] = 0.00175
+                """
+            }
+        }
+
+        then {
+            assertAll(
+                { assert process.success },
+                { assert snapshot(process.out).match() },
+            )
+        }
+
+    }
+}
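
The expected count in the test comment follows from the truncation in `randomsubset.sh`, which computes `int(records * fraction)`. A minimal sketch of that arithmetic, with `subset_count` as a hypothetical helper name that does not exist in the pipeline:

```shell
# Number of records kept for a given total and fraction, truncated toward zero
# as in the awk expression int($4*frac) used by randomsubset.sh.
# subset_count is an illustrative name, not a function from the pipeline.
#
# Usage: subset_count <total_records> <fraction>
subset_count() {
    awk -v total="$1" -v frac="$2" 'BEGIN { print int(total * frac) }'
}
```

With the nine-record test file, `subset_count 9 0.5` gives 4, while `subset_count 9 0.00175` gives 0, so a very small fraction yields an empty subset on small inputs.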
diff --git a/modules/local/randomsubset/tests/main.nf.test.snap b/modules/local/randomsubset/tests/main.nf.test.snap
@@ -0,0 +1,69 @@
+{
+    "sarscov2 - [vcf, tbi] - stub": {
+        "content": [
+            {
+                "0": [
+                    [
+                        {
+                            "id": "out",
+                            "single_end": false
+                        },
+                        "out.subset.vcf.gz:md5,68b329da9893e34099c7d8ad5cb9c940"
+                    ]
+                ],
+                "1": [
+                    [
+                        {
+                            "id": "out",
+                            "single_end": false
+                        },
+                        "out.subset.tbi:md5,d41d8cd98f00b204e9800998ecf8427e"
+                    ]
+                ],
+                "2": [
+                    "versions.yml:md5,ee7626565a01c36b7fb7a05f41e0653e"
+                ],
+                "tbi": [
+                    [
+                        {
+                            "id": "out",
+                            "single_end": false
+                        },
+                        "out.subset.tbi:md5,d41d8cd98f00b204e9800998ecf8427e"
+                    ]
+                ],
+                "vcf": [
+                    [
+                        {
+                            "id": "out",
+                            "single_end": false
+                        },
+                        "out.subset.vcf.gz:md5,68b329da9893e34099c7d8ad5cb9c940"
+                    ]
+                ],
+                "versions": [
+                    "versions.yml:md5,ee7626565a01c36b7fb7a05f41e0653e"
+                ]
+            }
+        ],
+        "meta": {
+            "nf-test": "0.9.2",
+            "nextflow": "25.04.3"
+        },
+        "timestamp": "2025-06-18T10:53:32.966380045"
+    },
+    "sarscov2 - [vcf, tbi]": {
+        "content": [
+            "VcfFile [chromosomes=[MT192765.1], sampleCount=1, variantCount=4, phased=false, phasedAutodetect=false]",
+            "out.subset.vcf.gz.tbi",
+            [
+                "versions.yml:md5,ee7626565a01c36b7fb7a05f41e0653e"
+            ]
+        ],
+        "meta": {
+            "nf-test": "0.9.2",
+            "nextflow": "25.04.3"
+        },
+        "timestamp": "2025-06-18T10:53:26.441286474"
+    }
+}
diff --git a/nextflow.config b/nextflow.config
@@ -13,6 +13,7 @@ params {
     input                      = null
     rename                     = true
     filter                     = null
+    subset                     = null
     removeIDs                  = true
 
     // References
@@ -249,7 +250,7 @@ manifest {
     mainScript      = 'main.nf'
     defaultBranch   = 'master'
     nextflowVersion = '!>=24.04.2'
-    version         = '2.0.2dev'
+    version         = '2.1.0dev'
     doi             = ''
 }
 

diff --git a/nextflow_schema.json b/nextflow_schema.json
@@ -32,6 +32,10 @@
                     "type": "string",
                     "description": "Add a filtering criterium suitable for bcftools/view. For example 'INFO/CSQ ~ \"HIGH\"'."
                 },
+                "subset": {
+                    "type": "number",
+                    "description": "Get a random subset of variants. Set this to the fraction of variants to keep (e.g. 0.5 to keep half of the variants)."
+                },
                 "removeIDs": {
                     "type": "boolean",
                     "default": true,