Add random subsampling as alternative to filtering #39

famosab · 2025-06-16T07:17:46Z

PR checklist

README.md

docs/output.md

SusiJo · 2025-06-18T11:53:26Z

docs/output.md

+
+</details>
+
+VCF files can be randomly subsampled to keep only a specific fraction of variants. This enables comparison to the filtered variants.


Suggested change

VCF files can be randomly subsampled to keep only a specific fraction of variants. This enables comparison to the filtered variants.

### Subsample VCFs

VCF files can be randomly subsampled to keep only a specific fraction of variants. This enables comparison to the filtered variants.

However, I would suggest structuring it like this:

Filtering options

and down below:

Filtering options

String-based filtering

...

Random subset filtering

Co-authored-by: SusiJo <[email protected]>

…t/random

README.md

SusiJo · 2025-06-18T11:58:39Z

docs/usage.md

 > [!NOTE]
 > The filtering step only works with conda for nextflow versions above 24.10.2 (use docker or singularity if you want to use an older nextflow version)

+### Subsample VCFs


If you change the heading in docs/output.md then please change it here as well.

docs/usage.md

SusiJo · 2025-06-18T13:45:59Z

modules/local/randomsubset/tests/main.nf.test

+    process "RANDOMSUBSET"
+
+    tag "modules"
+    tag "modules_"


What is this tag for?

SusiJo

Generally looks good :) I just have a few minor comments.

Co-authored-by: SusiJo <[email protected]>

SusiJo · 2025-06-23T12:43:47Z

docs/usage.md


+### Subset VCFs
+
+VCF files can be randomly subsetted to keep only a specific fraction of variants. This enables comparison to the filtered variants.


Suggested change

VCF files can be randomly subsetted to keep only a specific fraction of variants. This enables comparison to the filtered variants.

VCF files can be randomly selected to keep only a specific fraction of variants. This enables comparison to the filtered variants.
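
For readers following the thread, here is a minimal sketch of what "keeping only a specific fraction of variants" can mean in practice. It is not the pipeline's `RANDOMSUBSET` implementation, which is not shown in this conversation. The function name `subsample_vcf_lines` and its `seed` parameter are hypothetical, and the sketch assumes an uncompressed VCF read line by line, where header lines start with `#` and each remaining line is one variant record.

```python
import random


def subsample_vcf_lines(lines, fraction, seed=None):
    """Yield all header lines and a random subset of variant records.

    Each record is kept independently with probability `fraction`, so the
    number of kept records is only approximately fraction * total.
    Hypothetical helper for illustration, not part of the pipeline.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    rng = random.Random(seed)  # a fixed seed makes the subset reproducible
    for line in lines:
        # Header lines (## meta lines and the #CHROM line) are always kept.
        if line.startswith("#") or rng.random() < fraction:
            yield line
```

Because every record is drawn independently, two runs with different seeds give different subsets of roughly the same size, which is what makes the result usable as a random baseline for the filtered variants.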

SusiJo · 2025-06-23T12:45:56Z

docs/usage.md

+
+VCF files can be randomly subsetted to keep only a specific fraction of variants. This enables comparison to the filtered variants.
+
+You can determine appropriate fractions by comparing the number of filtered variants with the total number of variants. This can be done with a script that collects the number of variants by using `bcftools stats` from both files and dividing them. The more VCF files you use for comparison, the more robust the fraction becomes. (We compared around 90 files and obtained an average fraction of 0.00175 when using `--filter 'INFO/CSQ ~ "HIGH"'`).


Suggested change

You can determine appropriate fractions by comparing the number of filtered variants with the total number of variants. This can be done with a script that collects the number of variants by using `bcftools stats` from both files and dividing them. The more VCF files you use for comparison, the more robust the fraction becomes. (We compared around 90 files and obtained an average fraction of 0.00175 when using `--filter 'INFO/CSQ ~ "HIGH"'`).

You can determine appropriate fractions by comparing the number of filtered variants with the total number of variants. This is done using a script that collects the number of variants with `bcftools stats` from both files and divides the number of variants before and after filtering. The more VCF files are used for comparison, the more robust the fraction becomes. (We compared around 90 files and obtained an average fraction of 0.00175 when using `--filter 'INFO/CSQ ~ "HIGH"'`).
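
The script mentioned in this paragraph is not part of the thread, so the following is only a sketch of the calculation it describes. It assumes you have already captured the text output of `bcftools stats` for each unfiltered and filtered VCF, and it reads the record count from the `SN` summary line labelled `number of records:`. The function names `count_records` and `mean_fraction` are hypothetical.

```python
def count_records(stats_text):
    """Return the 'number of records' value from `bcftools stats` text output."""
    for line in stats_text.splitlines():
        fields = line.split("\t")
        # Summary lines look like: SN <id> number of records: <count>
        if len(fields) >= 4 and fields[0] == "SN" and fields[2] == "number of records:":
            return int(fields[3])
    raise ValueError("no 'number of records' line found")


def mean_fraction(pairs):
    """Average filtered/total over (total_stats, filtered_stats) text pairs.

    Pairs whose unfiltered file has zero records are skipped, since no
    fraction can be computed for them.
    """
    fractions = []
    for total_stats, filtered_stats in pairs:
        total = count_records(total_stats)
        if total == 0:
            continue
        fractions.append(count_records(filtered_stats) / total)
    if not fractions:
        raise ValueError("no usable file pairs")
    return sum(fractions) / len(fractions)
```

Averaging the per-file fractions, rather than dividing summed counts, gives every VCF equal weight regardless of its size. Whether that matches how the 0.00175 figure above was obtained is not stated in the thread.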

famosab · 2025-07-14T08:27:06Z

docs/output.md

+
+- `bcftools/subset/{meta.label}/`
+  - `{filename}.subset.vcf.gz`: vcf file with fraction of random variants.
+  - `{filename}.seubset.vcf.gz.tbi`: tabix index of the vcf file.


Suggested change

- `{filename}.seubset.vcf.gz.tbi`: tabix index of the vcf file.

- `{filename}.subset.vcf.gz.tbi`: tabix index of the vcf file.

famosab · 2025-07-17T10:55:20Z

I will convert this to a draft for now, as we decided that the subsampling has to happen at a different stage of the overall analysis workflow.

famosab and others added 16 commits June 13, 2025 17:28

Add random subsampling as alternative to filtering

63416b8

Change order to see changes

b874f83

Changelog

5c3b49d

Changelog

1f6c285

Merge branch 'dev' into feat/random

c793536

Merge branch 'dev' into feat/random

97d262b

Update dev version to 2.1.0dev

d1ad332

prettier

b76a746

fix whitespace

b002497

Update nextflow schema

f580449

Update docs

87ae761

Update readme and citations

0ca1868

udpate subway map

464adb4

update snap

bf3424c

update local snap

a19de46

ignore csv md5

94ffe9d