@famosab (Collaborator) commented Jun 16, 2025

PR checklist

  • This comment contains a description of changes (with reason).
  • If you've fixed a bug or added code that should be tested, add tests!
  • If you've added a new tool - have you followed the pipeline conventions in the contribution docs?
  • Make sure your code lints (`nf-core pipelines lint`).
  • Ensure the test suite passes (`nextflow run . -profile test,docker --outdir <OUTDIR>`).
  • Check for unexpected warnings in debug mode (`nextflow run . -profile debug,test,docker --outdir <OUTDIR>`).
  • Usage Documentation in `docs/usage.md` is updated.
  • Output Documentation in `docs/output.md` is updated.
  • `CHANGELOG.md` is updated.
  • `README.md` is updated (including new tool citations and authors/contributors).


VCF files can be randomly subsampled to keep only a specific fraction of variants. This enables comparison to the filtered variants.

Suggested change
VCF files can be randomly subsampled to keep only a specific fraction of variants. This enables comparison to the filtered variants.
### Subsample VCFs
VCF files can be randomly subsampled to keep only a specific fraction of variants. This enables comparison to the filtered variants.

I would, however, suggest structuring it like this:

and down below:

Filtering options

String-based filtering

...

Random subset filtering

docs/usage.md Outdated
> [!NOTE]
> The filtering step only works with conda for nextflow versions above 24.10.2 (use docker or singularity if you want to use an older nextflow version)

### Subsample VCFs

If you change the heading in `docs/output.md`, then please change it here as well.

process "RANDOMSUBSET"

tag "modules"
tag "modules_"

What is this tag for?

@SusiJo left a comment

Generally looks good :) Just have minor comments


### Subset VCFs

VCF files can be randomly subsetted to keep only a specific fraction of variants. This enables comparison to the filtered variants.

Suggested change
VCF files can be randomly subsetted to keep only a specific fraction of variants. This enables comparison to the filtered variants.
VCF files can be randomly selected to keep only a specific fraction of variants. This enables comparison to the filtered variants.


VCF files can be randomly subsetted to keep only a specific fraction of variants. This enables comparison to the filtered variants.

You can determine appropriate fractions by comparing the number of filtered variants with the total number of variants. This can be done with a script that collects the number of variants by using `bcftools stats` from both files and dividing them. The more VCF files you use for comparison, the more robust the fraction becomes. (We compared around 90 files and obtained an average fraction of 0.00175 when using `--filter 'INFO/CSQ ~ "HIGH"'`).

Suggested change
You can determine appropriate fractions by comparing the number of filtered variants with the total number of variants. This can be done with a script that collects the number of variants by using `bcftools stats` from both files and dividing them. The more VCF files you use for comparison, the more robust the fraction becomes. (We compared around 90 files and obtained an average fraction of 0.00175 when using `--filter 'INFO/CSQ ~ "HIGH"'`).
You can determine appropriate fractions by comparing the number of filtered variants with the total number of variants. This is done using a script that collects the number of variants with `bcftools stats` from both files and divides the number of variants before and after filtering. The more VCF files are used for comparison, the more robust the fraction becomes. (We compared around 90 files and obtained an average fraction of 0.00175 when using `--filter 'INFO/CSQ ~ "HIGH"'`).
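
The fraction calculation described in this paragraph could be sketched roughly as follows. This is a minimal Python sketch, not the script used in the PR: it assumes the `bcftools stats` output for each unfiltered and filtered VCF has already been captured as text, and it reads the tab-separated `SN ... number of records:` summary line from each. The function names `count_records` and `mean_fraction` and the example snippets are hypothetical.

```python
from statistics import mean


def count_records(stats_text: str) -> int:
    """Return the record count from the text output of `bcftools stats`.

    Looks for the tab-separated summary line
    `SN <id> number of records: <count>`.
    """
    for line in stats_text.splitlines():
        fields = line.split("\t")
        if len(fields) >= 4 and fields[0] == "SN" and fields[2] == "number of records:":
            return int(fields[3])
    raise ValueError("no 'number of records:' line found in bcftools stats output")


def mean_fraction(pairs):
    """Average filtered/total fraction over (total_stats, filtered_stats) text pairs."""
    fractions = []
    for total_stats, filtered_stats in pairs:
        total = count_records(total_stats)
        if total == 0:
            continue  # skip empty VCFs to avoid division by zero
        fractions.append(count_records(filtered_stats) / total)
    if not fractions:
        raise ValueError("no non-empty VCF pairs to average")
    return mean(fractions)


# Illustrative, made-up stats snippets (assumption: not real results)
total_example = "SN\t0\tnumber of samples:\t1\nSN\t0\tnumber of records:\t4000\n"
filtered_example = "SN\t0\tnumber of samples:\t1\nSN\t0\tnumber of records:\t7\n"
print(mean_fraction([(total_example, filtered_example)]))
```

In practice, each text pair would come from running `bcftools stats` on a VCF before and after filtering. Averaging over more pairs gives a more robust fraction, as the paragraph notes.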


- `bcftools/subset/{meta.label}/`
- `{filename}.subset.vcf.gz`: vcf file with fraction of random variants.
- `{filename}.seubset.vcf.gz.tbi`: tabix index of the vcf file.
Collaborator Author

Suggested change
- `{filename}.seubset.vcf.gz.tbi`: tabix index of the vcf file.
- `{filename}.subset.vcf.gz.tbi`: tabix index of the vcf file.

@famosab (Collaborator, Author) commented Jul 17, 2025

I will convert this to draft for now as we decided that the subsampling has to happen at a different stage in the whole analysis workflow.

@famosab marked this pull request as draft July 17, 2025 10:55