MAPPED - Modular Automated Pipeline for Public Expression Data

MAPPED is a Nextflow-based workflow for analyzing public RNA-seq data from the NCBI Sequence Read Archive (SRA). It automates every step from metadata retrieval to expression matrix generation, so that large-scale transcriptomics analyses are reproducible and can be launched with a single command.

Overview

MAPPED consists of four integrated modules that work together to process public expression data:

Metadata Download: Retrieves and formats metadata from NCBI SRA based on organism name
FASTQ Download: Efficiently downloads sequencing data using optimized protocols
Reference Genome Download: Obtains reference genome sequences and annotations
Expression Quantification: Performs quality control, trimming, and gene expression quantification

The pipeline is designed to handle large-scale datasets with built-in error handling, resume capabilities, and resource optimization.

Features

Automated end-to-end workflow: From organism name to expression matrices in a single command
Flexible reference genome selection: Use default reference strains or specify custom genome accessions
Robust error handling: Automatic retries and graceful failure management
Resume capability: Continue from any interruption point without re-processing
Resource optimization: Configurable CPU allocation and efficient storage management
Clean mode: Automatic cleanup of intermediate files to save disk space
Docker integration: No manual dependency installation required
Comprehensive quality control: FastQC and MultiQC reports included
Strain filtering: Optionally restrict samples by strain token in ScientificName

Prerequisites

Nextflow (version 21.04.0 or later)
Docker (version 20.10 or later)

Installation

Clone the MAPPED repository:

git clone https://github.com/Gaoyuan-Li/MAPPED.git
cd MAPPED

Ensure the wrapper script is executable:

chmod +x run_MAPPED.sh

Verify Nextflow and Docker are installed:

nextflow -version
docker --version
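
The version floors listed under Prerequisites can also be checked from a script. The sketch below is one possible preflight helper, not part of MAPPED: the function names `have_tool`, `version_ge`, and `check_prereqs` are hypothetical, and the version comparison assumes plain dotted numeric versions.

```bash
#!/bin/sh
# Hypothetical preflight helpers; none of these functions ship with MAPPED.

# have_tool NAME: succeed if NAME is found on PATH.
have_tool() {
    command -v "$1" >/dev/null 2>&1
}

# version_ge ACTUAL MINIMUM: succeed if dotted version ACTUAL >= MINIMUM.
# Fields are compared numerically, so 21.04.0 and 21.4.0 count as equal,
# and a missing field is treated as 0.
version_ge() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        na = split(a, x, ".")
        nb = split(b, y, ".")
        n = (na > nb) ? na : nb
        for (i = 1; i <= n; i++) {
            xi = (i <= na) ? x[i] + 0 : 0
            yi = (i <= nb) ? y[i] + 0 : 0
            if (xi > yi) exit 0
            if (xi < yi) exit 1
        }
        exit 0
    }'
}

# check_prereqs: report which of the two required tools are missing.
check_prereqs() {
    rc=0
    for tool in nextflow docker; do
        if ! have_tool "$tool"; then
            printf 'missing prerequisite: %s\n' "$tool" >&2
            rc=1
        fi
    done
    return "$rc"
}
```

After sourcing these functions, `check_prereqs` reports missing tools, and a call such as `version_ge "$(docker version --format '{{.Client.Version}}')" 20.10` tests the Docker floor. Extracting the version number from the `nextflow -version` banner is left to the caller.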

Quick Start

Process RNA-seq data for an organism using the default reference genome:

./run_MAPPED.sh \
    --organism "Escherichia coli" \
    --outdir ./results \
    --workdir ./work \
    --library_layout paired \
    --cpu 48

Usage

Basic Usage

The run_MAPPED.sh wrapper script orchestrates all pipeline modules:

./run_MAPPED.sh [OPTIONS]

Using a Specific Reference Genome

To use a specific genome assembly instead of the default reference strain:

./run_MAPPED.sh \
    --organism "Streptomyces coelicolor" \
    --ref-accession GCA_008931305.1 \
    --outdir ./results \
    --workdir ./work \
    --library_layout paired \
    --cpu 24

Clean Mode

To automatically clean up intermediate files after successful completion:

./run_MAPPED.sh \
    --organism "Pseudomonas putida" \
    --outdir ./results \
    --workdir ./work \
    --library_layout paired \
    --cpu 16 \
    --clean-mode

Pipeline Modules

1. Download Metadata (Module 1)

Queries NCBI SRA for RNA-seq experiments matching the specified organism
Filters samples based on library layout (single-end, paired-end, or both)
Generates formatted metadata files for downstream processing
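
The library-layout filter described above can be pictured as a column filter over a tab-separated metadata table. This is a minimal sketch, not MAPPED's actual implementation: the function name `filter_layout` is hypothetical, and it assumes the table has a header row containing a `LibraryLayout` column (the field name used in SRA run metadata) with values such as `PAIRED` or `SINGLE`.

```bash
#!/bin/sh
# Hypothetical helper; assumes a TSV with a header row and a LibraryLayout column.

# filter_layout LAYOUT FILE: print the header plus the rows whose
# LibraryLayout matches LAYOUT (single, paired, or both), ignoring case.
filter_layout() {
    awk -F '\t' -v want="$1" '
        BEGIN { want = tolower(want) }
        NR == 1 {
            # Locate the column by name rather than by fixed position.
            for (i = 1; i <= NF; i++) if ($i == "LibraryLayout") col = i
            if (!col) { print "LibraryLayout column not found" > "/dev/stderr"; exit 2 }
            print
            next
        }
        # "both" keeps every row; otherwise the value must match exactly.
        want == "both" || tolower($col) == want
    ' "$2"
}
```

For example, `filter_layout paired metadata.tsv > paired_only.tsv` would keep the header and the paired-end rows of a hypothetical `metadata.tsv`.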

2. Download FASTQ (Module 2)

Downloads raw sequencing data
Validates downloaded files
Creates a samplesheet for downstream analysis
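
This README does not say how downloaded files are validated. As an illustration only, a basic integrity check for a gzip-compressed FASTQ file could look like the sketch below; the function name `validate_fastq_gz` is hypothetical, and the two criteria (intact gzip stream, line count divisible by four) are assumptions about what a reasonable check would cover.

```bash
#!/bin/sh
# Hypothetical validation helper; MAPPED's own checks may differ.

# validate_fastq_gz FILE: succeed only if FILE exists, is non-empty, is an
# intact gzip stream, and decompresses to a non-zero multiple of four lines
# (each FASTQ record occupies four lines).
validate_fastq_gz() {
    [ -s "$1" ] || return 1                 # missing or empty file
    gzip -t "$1" 2>/dev/null || return 1    # truncated or corrupt gzip stream
    gzip -cd "$1" | awk 'END { exit (NR == 0 || NR % 4 != 0) }'
}
```

A caller could loop over the downloaded files and re-queue any accession for which `validate_fastq_gz` fails.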

3. Download Reference Genome (Module 3)

Downloads reference genome assemblies from NCBI
Retrieves genome sequence (FASTA), annotations (GFF), and protein sequences (FAA)
Supports two modes:
- Default mode: Automatically selects the largest reference genome for the organism
- Accession mode: Downloads a specific genome assembly using its accession number

4. Generate Count Matrix (Module 4)

Performs quality control on raw reads (FastQC)
Trims adapters and low-quality bases (TrimGalore)
Quantifies gene expression using Salmon
Generates normalized expression matrices (TPM and raw counts)
Creates comprehensive quality reports (MultiQC)
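
To make the matrix-generation step concrete, the sketch below merges per-sample Salmon outputs into one CSV. It is an illustration rather than the code MAPPED runs: the function name `build_matrix` and the `gene_id` header label are hypothetical. It relies on the standard `quant.sf` column order (Name, Length, EffectiveLength, TPM, NumReads) and on the `salmon/<sample_id>/quant.sf` layout.

```bash
#!/bin/sh
# Hypothetical helper; assumes SALMON_DIR/<sample_id>/quant.sf files.

# build_matrix SALMON_DIR COLUMN: print a CSV with one row per feature and
# one column per sample. COLUMN selects the quant.sf field to collect:
# 4 for TPM, 5 for NumReads (estimated read counts).
build_matrix() {
    awk -F '\t' -v col="$2" '
        FNR == 1 {                       # header line of each quant.sf
            n = split(FILENAME, parts, "/")
            sample = parts[n - 1]        # parent directory is the sample ID
            samples[++ns] = sample
            next
        }
        {
            if (!($1 in seen)) { seen[$1] = 1; genes[++ng] = $1 }
            value[$1, sample] = $col
        }
        END {
            line = "gene_id"
            for (s = 1; s <= ns; s++) line = line "," samples[s]
            print line
            for (g = 1; g <= ng; g++) {
                line = genes[g]
                for (s = 1; s <= ns; s++) {
                    key = genes[g] SUBSEP samples[s]
                    # Features absent from a sample are written as 0.
                    line = line "," ((key in value) ? value[key] : 0)
                }
                print line
            }
        }
    ' "$1"/*/quant.sf
}
```

With these assumptions, `build_matrix ./results/salmon 4` would print a TPM table and `build_matrix ./results/salmon 5` a count table. Sample columns follow the shell's glob order of the sample directories.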

Parameters

Required Parameters

Parameter	Description	Example
`--organism`	Full taxonomic name of the target organism	`"Escherichia coli"`
`--outdir`	Output directory for all results	`/path/to/results`
`--workdir`	Nextflow work directory for temporary files	`/path/to/work`
`--library_layout`	Sequencing library type: `single`, `paired`, or `both`	`paired`
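
When calling the pipeline from another script, it can help to reject an invalid `--library_layout` value before launching anything. The following is a small sketch of such a guard; the function name `validate_layout` is hypothetical, and whether the wrapper itself performs an equivalent check is not stated here.

```bash
#!/bin/sh
# Hypothetical guard for scripts that call run_MAPPED.sh.

# validate_layout VALUE: succeed if VALUE is one of the three documented
# choices; otherwise print a message to stderr and fail.
validate_layout() {
    case "$1" in
        single|paired|both)
            return 0 ;;
        *)
            printf 'invalid --library_layout: %s (expected single, paired, or both)\n' "$1" >&2
            return 1 ;;
    esac
}
```

A calling script might use it as `validate_layout "$layout" || exit 1` just before invoking `./run_MAPPED.sh`.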

Optional Parameters

Parameter	Description	Default	Example
`--cpu`	Number of CPUs to allocate per process	System dependent	`16`
`--ref-accession`	Specific reference genome accession	Auto-selected	`GCA_008931305.1`
`--max_concurrent_downloads`	Maximum number of concurrent FASTQ downloads	`20`	`10`
`--strain`	Keep only samples whose `ScientificName` has a token equal to or containing this value (case-insensitive)	none	`K-12`
`--clean-mode`	Remove intermediate files after completion	`false`	(flag)
`-h, --help`	Display help message	-	(flag)
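
The idea behind `--max_concurrent_downloads` is to cap how many downloads run at the same time. How MAPPED enforces the cap is not described here; the sketch below shows the general pattern using `xargs -P`, which is available in the GNU and BSD versions of `xargs`. The function name `run_limited` is hypothetical, and `echo` stands in for a real download command.

```bash
#!/bin/sh
# Hypothetical illustration of a concurrency cap; not MAPPED's implementation.

# run_limited MAX CMD [ARGS...]: read one item per line from stdin and run
# "CMD ARGS... ITEM" for each item, with at most MAX jobs running at once.
run_limited() {
    max=$1
    shift
    xargs -n 1 -P "$max" "$@"
}
```

For example, `cut -d, -f1 sample_id.csv | run_limited 10 echo fetching` would start at most ten jobs at a time, one per accession. Completion order is not guaranteed, so any downstream step should not depend on it.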

Output Structure

The pipeline creates a well-organized output directory structure:

${outdir}/
├── metadata/                    # Downloaded and formatted metadata
│   ├── <Organism>_metadata.tsv  # Cleaned metadata (optionally strain-filtered)
│   └── sample_id.csv            # List of SRA accessions (optionally strain-filtered)
├── samplesheet/                 # Sample information for processing
│   ├── samplesheet_download.csv # Metadata for all available samples from NCBI
│   └── samplesheet.csv          # Metadata for samples that passed QC and were quantified
├── seqFiles/                    # Reference genome files
│   └── ref_genome/
│       ├── *.fna                # Genome sequence (FASTA)
│       ├── *.gff                # Gene annotations (GFF3)
│       ├── *.faa                # Protein sequences (FASTA)
│       └── datasets_summary.json
├── fastqc/                      # Quality control reports
│   ├── *_fastqc.html            # Per-sample QC reports
│   └── *_fastqc.zip             # QC data files
├── trimmed/                     # Adapter-trimmed FASTQ files
│   ├── *_trimmed.fq.gz          # Trimmed sequences
│   └── *_trimming_report.txt
├── salmon/                      # Expression quantification
│   └── <sample_id>/
│       └── quant.sf             # Quantification results
├── expression_matrices/         # Final expression data
│   ├── counts.csv               # Raw count matrix
│   ├── tpm.csv                  # TPM normalized matrix
│   ├── log_tpm.csv              # Log-transformed TPM
│   └── log_tpm_norm.csv         # Log-transformed normalized TPM
└── multiqc/                     # Aggregated quality reports
    ├── multiqc_report.html      # Interactive report
    └── multiqc_data/            # Raw MultiQC data
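
A quick way to confirm that a run finished is to check for the final files named in the tree above. The sketch below does that; the function name `check_outputs` is hypothetical, and the list of files is taken from this section. In clean mode the MultiQC report is not retained, so that entry would need to be dropped.

```bash
#!/bin/sh
# Hypothetical post-run check based on the documented output layout.

# check_outputs OUTDIR: report every expected final output that is missing
# or empty; succeed only if all of them are present.
check_outputs() {
    rc=0
    for rel in \
        expression_matrices/counts.csv \
        expression_matrices/tpm.csv \
        expression_matrices/log_tpm.csv \
        expression_matrices/log_tpm_norm.csv \
        samplesheet/samplesheet.csv \
        multiqc/multiqc_report.html
    do
        if [ ! -s "$1/$rel" ]; then
            printf 'missing or empty: %s\n' "$1/$rel" >&2
            rc=1
        fi
    done
    return "$rc"
}
```

Calling `check_outputs ./results` after the pipeline exits gives a non-zero status if any of these files is absent.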

Clean Mode Output

When using --clean-mode, only essential outputs are retained:

${outdir}/
├── expression_matrices/    # Final expression matrices
├── samplesheet/            # Sample metadata
└── ref_genome/             # Reference genome files

Strain Filtering

Restrict analysis to samples whose `ScientificName` contains a specific strain token. The value is matched case-insensitively against space-delimited tokens of `ScientificName`; a row is kept if any token equals or contains the provided string.

Example:

```bash
./run_MAPPED.sh \
    --organism "Escherichia coli" \
    --strain "K-12" \
    --outdir ./results \
    --workdir ./work \
    --library_layout paired \
    --cpu 24
```

This keeps only the samples whose `ScientificName` has a token equal to or containing `K-12`. The filtered set propagates to `metadata/sample_id.csv` and all downstream steps.
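
The matching rule can be written out in a few lines. The sketch below is an illustration of the rule as described in this section, not the pipeline's own code: the function name `filter_strain` is hypothetical, and it assumes a tab-separated metadata file with a header row that includes a `ScientificName` column.

```bash
#!/bin/sh
# Hypothetical re-implementation of the documented matching rule.

# filter_strain TOKEN FILE: print the header plus every row whose
# ScientificName has a space-delimited token that equals or contains TOKEN,
# ignoring case.
filter_strain() {
    awk -F '\t' -v needle="$1" '
        BEGIN { needle = tolower(needle) }
        NR == 1 {
            for (i = 1; i <= NF; i++) if ($i == "ScientificName") col = i
            if (!col) { print "ScientificName column not found" > "/dev/stderr"; exit 2 }
            print
            next
        }
        {
            # Split the name on whitespace and test each token in turn.
            n = split(tolower($col), tokens, " ")
            for (i = 1; i <= n; i++) {
                if (index(tokens[i], needle) > 0) { print; break }
            }
        }
    ' "$2"
}
```

Under this rule a value of `K-12` keeps a row named `Escherichia coli str. K-12 substr. MG1655` and drops a row named only `Escherichia coli`. Because a token only has to contain the value, a short value can match more strains than intended.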

Repository: Gaoyuan-Li/MAPPED