Add random subsampling as alternative to filtering #39
base: dev
</details> | ||
VCF files can be randomly subsampled to keep only a specific fraction of variants. This enables comparison to the filtered variants.
VCF files can be randomly subsampled to keep only a specific fraction of variants. This enables comparison to the filtered variants.
### Subsample VCFs
VCF files can be randomly subsampled to keep only a specific fraction of variants. This enables comparison to the filtered variants.
I would, however, suggest structuring it like this:
and down below:
Filtering options
String-based filtering
...
Random subset filtering
Co-authored-by: SusiJo <[email protected]>
docs/usage.md
Outdated
> [!NOTE]
> The filtering step only works with conda for nextflow versions above 24.10.2 (use docker or singularity if you want to use an older nextflow version)

### Subsample VCFs
If you change the heading in `docs/output.md`, then please change it here as well.
process "RANDOMSUBSET"

tag "modules"
tag "modules_"
What is this tag for?
Generally looks good :) Just have minor comments
Co-authored-by: SusiJo <[email protected]>
### Subset VCFs

VCF files can be randomly subsetted to keep only a specific fraction of variants. This enables comparison to the filtered variants.
VCF files can be randomly subsetted to keep only a specific fraction of variants. This enables comparison to the filtered variants.
VCF files can be randomly selected to keep only a specific fraction of variants. This enables comparison to the filtered variants.
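For readers unfamiliar with the idea, random subsetting can be illustrated with a minimal Python sketch. This is not the pipeline's implementation (the output directory name suggests the pipeline uses bcftools for this step); the function name `subsample_vcf_lines` and its parameters are hypothetical and chosen only for illustration. Header lines are always kept, and each variant record is kept independently with probability `fraction`, so the number of retained records is only approximately `fraction` times the total.

```python
import random


def subsample_vcf_lines(lines, fraction, seed=None):
    """Keep all header lines and roughly `fraction` of the variant records.

    Hypothetical helper for illustration only; operates on uncompressed
    VCF text lines rather than on bgzipped files.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    rng = random.Random(seed)  # a fixed seed makes the subset reproducible
    kept = []
    for line in lines:
        if line.startswith("#"):
            kept.append(line)  # header lines are always retained
        elif rng.random() < fraction:
            kept.append(line)  # each record is kept independently
    return kept
```

Passing a fixed `seed` gives the same subset on every run, which makes comparisons against the filtered variants repeatable.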
VCF files can be randomly subsetted to keep only a specific fraction of variants. This enables comparison to the filtered variants.

You can determine appropriate fractions by comparing the number of filtered variants with the total number of variants. This can be done with a script that collects the number of variants by using `bcftools stats` from both files and dividing them. The more VCF files you use for comparison, the more robust the fraction becomes. (We compared around 90 files and obtained an average fraction of 0.00175 when using `--filter 'INFO/CSQ ~ "HIGH"'`).
You can determine appropriate fractions by comparing the number of filtered variants with the total number of variants. This can be done with a script that collects the number of variants by using `bcftools stats` from both files and dividing them. The more VCF files you use for comparison, the more robust the fraction becomes. (We compared around 90 files and obtained an average fraction of 0.00175 when using `--filter 'INFO/CSQ ~ "HIGH"'`).
You can determine appropriate fractions by comparing the number of filtered variants with the total number of variants. This is done using a script that collects the number of variants with `bcftools stats` from both files and divides the number of variants before and after filtering. The more VCF files are used for comparison, the more robust the fraction becomes. (We compared around 90 files and obtained an average fraction of 0.00175 when using `--filter 'INFO/CSQ ~ "HIGH"'`).
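The script mentioned in this paragraph is not part of the excerpt, so the following is only a sketch of how such a calculation might look. It assumes the text output of `bcftools stats` has already been captured for the filtered and unfiltered file of each pair, and it reads the record count from the tab-separated `SN` summary line labelled `number of records:`. The function names are hypothetical.

```python
from statistics import mean


def count_records(stats_text):
    """Return the record count from the text output of `bcftools stats`."""
    for line in stats_text.splitlines():
        fields = line.split("\t")
        # Summary lines look like: SN <id> number of records: <count>
        if len(fields) >= 4 and fields[0] == "SN" and fields[2] == "number of records:":
            return int(fields[3])
    raise ValueError("no 'number of records:' line found")


def retained_fraction(filtered_stats, unfiltered_stats):
    """Fraction of variants that remain after filtering, for one file pair."""
    total = count_records(unfiltered_stats)
    if total == 0:
        raise ValueError("unfiltered file has no records")
    return count_records(filtered_stats) / total


def mean_fraction(pairs):
    """Average the retained fraction over (filtered, unfiltered) stats pairs."""
    return mean([retained_fraction(f, u) for f, u in pairs])
```

The value returned by `mean_fraction` is the kind of number the paragraph refers to; averaging over more file pairs makes it less sensitive to any single file.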
- `bcftools/subset/{meta.label}/`
  - `{filename}.subset.vcf.gz`: vcf file with fraction of random variants.
  - `{filename}.seubset.vcf.gz.tbi`: tabix index of the vcf file.
- `{filename}.seubset.vcf.gz.tbi`: tabix index of the vcf file.
- `{filename}.subset.vcf.gz.tbi`: tabix index of the vcf file.
I will convert this to draft for now as we decided that the subsampling has to happen at a different stage in the whole analysis workflow.
PR checklist
- `nf-core pipelines lint`
- `nextflow run . -profile test,docker --outdir <OUTDIR>`
- `nextflow run . -profile debug,test,docker --outdir <OUTDIR>`
- `docs/usage.md` is updated.
- `docs/output.md` is updated.
- `CHANGELOG.md` is updated.
- `README.md` is updated (including new tool citations and authors/contributors).