ReproVM - Reproducible Task Virtual Machine with Content-Addressed Caching

Overview
Why ReproVM?
Key Concepts
- Key Technologies
Quickstart
Installation & Build
Manifest Specification
Task Lifecycle
Cache & CAS Internals
CLI Usage Reference
Environment & Configuration
Simulated Realistic Sessions & Logs
Advanced Usage Patterns
Extension Points / Developer Notes
Performance Tuning
Security Considerations
Troubleshooting & Error Codes
FAQ
Contributing
Glossary
Parallel Execution
- What's Different
- Building the Parallel Binary
- Usage
- Examples
- Failure Behavior
- Integration Notes
- Simulated Parallel Run Output (Realistic)
- Tuning
Docker
Example File Layout After Successful Run
License

Overview

ReproVM is a lightweight yet powerful task execution virtual machine written entirely in C (C99). It is designed to make complex workflows — builds, data pipelines, test orchestrations, analyses — reproducible, incremental, and efficient through content-addressed caching and explicit dependency tracking. Think of it as a miniaturized, transparent cousin of modern build systems and pipeline orchestrators (e.g., parts of Docker, Bazel, or data lineage tooling) without hidden state or heavyweight dependencies.

It operates on a declarative manifest of tasks, each describing what to run, what files it consumes and produces, and what it depends on. ReproVM automatically computes identities (hashes), reuses work when possible, restores outputs, and visualizes the dependency graph with status in the terminal.

Why ReproVM?

Real-world workflows suffer from repeated, wasted computation, hidden dependencies, and brittle pipelines. ReproVM addresses these problems:

Reproducibility: Every piece of work is content-addressed; if inputs and commands are identical, outputs are reused deterministically.
Incrementality: Only tasks whose inputs, commands, or upstream results have changed are rerun.
Transparency: ASCII diagrams show exactly what ran, what was skipped, and why.
Portability: No external runtime; pure C implementation, minimal assumptions.
Extensibility: Clear separation of CAS, task graph, hashing, and execution logic makes it a foundation for more advanced systems.

Key Concepts

Task: Named unit of work with cmd, inputs, outputs, and declared dependencies.
Task Hash: Identity of a task: derived from its command string, input blob hashes, and dependencies’ result hashes.
Result Hash: Hash representing the outputs of a task; feeds into dependents for invalidation logic.
CAS (Content-Addressed Storage): Stores file blobs (inputs/outputs) keyed by their SHA-256 hash.
Cache Record: Metadata per task hash stored under .reprovm/cache/ that records what was produced, allowing skip on repeat.
Dependency Graph: Directed acyclic graph (DAG) of tasks; ReproVM topologically orders and executes it.

Key Technologies

C99 / GCC: Core VM and logic implemented in portable, optimized C.
Make: Build orchestration for building ReproVM itself.
Bash / POSIX shell: Task commands run via shell; manifest authoring assumes standard shell semantics.
pthreads: Optional parallel executor for dependency-aware concurrent task execution.
CAS + SHA-256: Content-addressed storage and hashing form the backbone of caching and reproducibility.
Git: Recommended for versioning manifests/source.
Coreutils (curl, tar, sha256sum): Used in example pipelines for data fetching, packaging, checksum generation.
Python: Example extension for richer pipeline steps (e.g., data preprocessing).
ASCII UI / CLI: Terminal-based visualization and control.
Unix compatibility: Targeted at Unix-like environments.
MIT License: Permissive open-source licensing.

flowchart TB
    U[User invokes CLI with manifest & optional targets/env flags] --> CLI["CLI Entry Point<br>(read env: FORCE, VERBOSE, targets)"]

    subgraph "Input & Initialization"
        CLI --> Parser["Manifest Parser<br>(parse DSL, validate syntax)"]
        Parser --> DAG["Dependency DAG Builder / Scheduler<br>(build task graph, apply target filtering)"]
        DAG --> ReadyCheck{"Any ready tasks?<br>(dependencies satisfied)"}
    end

    subgraph "Execution Engine"
        ReadyCheck -- yes --> Dispatch["Parallel Executor / Worker Pool<br>(schedule ready tasks)"]
        Dispatch --> TaskInstance["Task Instance"]

        subgraph "Task Instance Logic"
            TaskInstance --> ComputeHash["Compute Task Hash<br>(cmd + sorted input blob hashes + deps result hashes)"]
            ComputeHash --> CacheLookup{"Cache record exists?<br>(task_hash) & not forced"}
            CacheLookup -- hit & not forced --> RestoreOutputs["Restore outputs from CAS<br>(mark skipped)"]
            CacheLookup -- miss or forced --> RunCmd["Execute shell command<br>(capture output, exit code)"]
            RunCmd --> CmdSuccess{"Exit code == 0?"}
            CmdSuccess -- yes --> StoreCAS["Store outputs in CAS<br>(SHA-256 content-addressed)"]
            StoreCAS --> WriteMeta["Write cache metadata<br>(task_hash, result_hash, output map)"]
            CmdSuccess -- no --> MarkFailed["Mark task failed<br>(record error)"]
            RestoreOutputs --> UpdateSkipped["Update graph: skipped"]
            WriteMeta --> UpdateSuccess["Update graph: success"]
            MarkFailed --> UpdateFailed["Update graph: failed"]
        end

        UpdateSkipped --> DAG
        UpdateSuccess --> DAG
        UpdateFailed --> DAG
        Dispatch --> ReadyCheck
    end

    ReadyCheck -- no --> AllDone{"All tasks processed?"}
    AllDone -- yes --> Visualizer["Graph Visualizer<br>(render ASCII DAG with statuses)"]
    Visualizer --> Summary["Final Summary & Exit Code<br>(success/failure, cache hits/misses)"]
    AllDone -- no --> ReadyCheck

    MarkFailed --> OverallFailure["Overall run marked failed"]
    OverallFailure --> Visualizer
    OverallFailure --> Summary

    ChangeInput["Upstream input changed or manifest edited"] --> InvalidateDeps["Invalidate dependent task hashes"]
    InvalidateDeps --> DAG

    CLI --> ForceBypass["Force rebuild flag<br>(overrides cache hits)"]
    ForceBypass --> CacheLookup

Quickstart

Write a manifest describing your tasks (see Manifest Specification).
Build ReproVM: make
Run the VM:
```
./reprovm manifest.txt
```

If nothing has changed, subsequent runs are near-instant due to cache hits.

Installation & Build

Requirements

POSIX-compatible OS (Linux, macOS, BSD)
gcc or other C99-compatible compiler
Standard UNIX shell available for task commands (sh, bash etc.)

Build

git clone https://github.com/hoangsonww/ReproVM-Virtual-Machine
cd ReproVM-Virtual-Machine
make

Expected output:

$ make
gcc -std=c99 -O2 -Wall -Wextra -g -c main.c -o main.o
gcc -std=c99 -O2 -Wall -Wextra -g -c task.c -o task.o
gcc -std=c99 -O2 -Wall -Wextra -g -c cas.c -o cas.o
gcc -std=c99 -O2 -Wall -Wextra -g -c util.c -o util.o
gcc -std=c99 -O2 -Wall -Wextra -g -o reprovm main.o task.o cas.o util.o

This produces the reprovm binary.

Manifest Specification

ReproVM manifest is a simple, custom DSL. Whitespaces are flexible; comments begin with #.

Grammar Summary

manifest      := { task_block }+
task_block   := "task" <name> "{" { field_line } "}"
field_line   := <key> "=" <value>
key          := "cmd" | "inputs" | "outputs" | "deps"
value        := arbitrary string (for cmd), or comma-separated list (for others)

Valid Fields

cmd — Shell command to execute. Should produce the declared outputs and respect inputs.
inputs — Comma-separated list of file paths consumed by the task.
outputs — Comma-separated list of file paths produced by the task.
deps — Comma-separated list of other task names that must run before this one.

Example

task build {
  cmd = gcc -o hello hello.c
  inputs = hello.c
  outputs = hello
  deps =
}

task test {
  cmd = ./hello > result.txt
  inputs = hello
  outputs = result.txt
  deps = build
}

task checksum {
  cmd = sha256sum result.txt > result.sha
  inputs = result.txt
  outputs = result.sha
  deps = test
}

Notes

Dependencies (deps) are used to control ordering beyond just file-based inference.
Tasks with no dependencies can be listed with deps = or omitted after the equals (empty).

Task Lifecycle

For each task in topological order:

Compute Task Hash

Concatenate its command, sorted input blob hashes, and dependencies’ result hashes.
SHA-256 of that describes the task identity.

Cache Lookup

Look for .reprovm/cache/<task_hash>.meta.
If present, restore outputs from CAS, mark as skipped ([*]).

Execution (if no cache hit)

Run the cmd via system().
After success, compute result hash from its outputs and store each output into CAS.
Write .meta file containing task_hash, result_hash, and output-to-hash mapping.

Graph Update & Display

Update status and reprint ASCII dependency graph.

Statuses:

[ ] pending
[~] running
[*] skipped (cache hit)
[✔] success
[X] failed

Cache & CAS Internals

CAS Layout

Blobs are stored under:

.reprovm/cas/objects/<first-two-hex>/<remaining-hash>

Example:

.reprovm/cas/objects/a3/f5d9c0e1b2...  # blob for a file or output

This two-level split avoids directory explosion.

Metadata Record

Each task produces a metadata file:

.reprovm/cache/<task_hash>.meta

Example contents:

task_hash: a3f5d9c0e1b2c3d4e5f6a7b8c9d0e1f2a3b4c5d6e7f8a9b0c1d2e3f4a5b6c7
result_hash: 7d9a4f2e5b3c1a0d6e8f9b7c4d3e2f1a6b5c4d3e2f1a0b9c8d7e6f5a4b3c2
output hello 5e2f1a3b4c6d7e8f9a0b1c2d3e4f5a6b7c8d9e0f1a2b3c4d5e6f7a8b9c0d1
output result.txt 9c4d7a1e2f3b5c6d7e8f9a0b1c2d3e4f5a6b7c8d9e0f1a2b3c4d5e6f7a8b9

Identity Propagation

Downstream tasks include upstream result hashes in their own task hash, so any change propagates invalidation automatically.

CLI Usage Reference

./reprovm <manifest> [target1 target2 ...]

<manifest>: path to manifest file.
[target...]: optional list of task names to build. If omitted, all tasks are considered targets.

Examples

Run full pipeline:
```
./reprovm manifest.txt
```
Run specific target (and its dependencies):
```
./reprovm manifest.txt checksum
```
Force re-computation of everything:
```
rm -rf .reprovm
./reprovm manifest.txt
```

Environment & Configuration

ReproVM does not require environment variables to function, but the following are recognized/usable for extension or debugging if you choose to augment it:

REPROVM_CACHE_DIR (future extension) — override .reprovm/cache path.
REPROVM_VERBOSE=1 — hypothetical future verbose mode to log internal decisions.
REPROVM_FORCE=<task> — (not implemented in base) could be used to bypass cache for a particular task.

The current implementation uses the project root (.) as base; you can change this by modifying cas_init parameter in main.c.

Simulated Realistic Sessions & Logs

Note: The following outputs are realistic simulations, with plausible hash values and formatting.

1. First Full Run (cold cache)

Manifest: as above (build, test, checksum). hello.c prints “Hello, ReproVM!”.

$ ./reprovm manifest.txt
Will execute 3 tasks in order:
  build
  test
  checksum
==> Running task 'build': gcc -o hello hello.c
==> Task 'build' completed.
=== Task Graph ===
[✔] build (hash=a3f5d9c0e1b2c3d4e5f6a7b8c9d0e1f2a3b4c5d6e7f8a9b0c1d2e3f4a5b6c7) res=1f2e3d4c5b6a7980a9b8c7d6e5f4a3b2c1d0e9f8a7b6c5d4e3f2a1b0c9d8e7)
[ ] test (hash=5f6e7d8c9b0a1f2e3d4c5b6a7d8e9f0a1b2c3d4e5f6a7b8c9d0e1f2a3b4c5d) res=
[ ] checksum (hash=9a8b7c6d5e4f3a2b1c0d9e8f7a6b5c4d3e2f1a0b9c8d7e6f5a4b3c2d1e0f9a) res=
==================
==> Running task 'test': ./hello > result.txt
==> Task 'test' completed.
=== Task Graph ===
[✔] build (hash=...) res=...
[✔] test (hash=5f6e7d8c9b0a1f2e3d4c5b6a7d8e9f0a1b2c3d4e5f6a7b8c9d0e1f2a3b4c5d) res=3c4d5e6f7a8b9c0d1e2f3a4b5c6d7e8f9a0b1c2d3e4f5a6b7c8d9e0f1a2b3c)
[ ] checksum (hash=9a8b7c6d5e4f3a2b1c0d9e8f7a6b5c4d3e2f1a0b9c8d7e6f5a4b3c2d1e0f9a) res=
==================
==> Running task 'checksum': sha256sum result.txt > result.sha
==> Task 'checksum' completed.
=== Task Graph ===
[✔] build (hash=...) res=...
[✔] test (hash=...) res=...
[✔] checksum (hash=9a8b7c6d5e4f3a2b1c0d9e8f7a6b5c4d3e2f1a0b9c8d7e6f5a4b3c2d1e0f9a) res=7f8e9d0c1b2a3e4f5d6c7b8a9f0e1d2c3b4a5f6e7d8c9b0a1c2d3e4f5a6b7c8)
==================
All tasks completed (some may have been cached). Final graph:
=== Task Graph ===
[✔] build ... 
[✔] test ...
[✔] checksum ...
==================

2. Repeat Run (warm cache)

$ ./reprovm manifest.txt
Will execute 3 tasks in order:
  build
  test
  checksum
[*] build (cache hit) (hash=a3f5d9c0e1b2c3d4...) res=1f2e3d4c...
[*] test (cache hit) (hash=5f6e7d8c9b0a1f2...) res=3c4d5e6f...
[*] checksum (cache hit) (hash=9a8b7c6d5e4f3a2b...) res=7f8e9d0c...
All tasks completed (some may have been cached). Final graph:
=== Task Graph ===
[*] build ...
[*] test ...
[*] checksum ...
==================

3. Change in Upstream Input (`hello.c` modified)

Modify hello.c to print “Hello, Updated ReproVM!” then:

$ ./reprovm manifest.txt
Will execute 3 tasks in order:
  build
  test
  checksum
==> Running task 'build': gcc -o hello hello.c
==> Task 'build' completed.
=== Task Graph ===
[✔] build (hash=de4f1a2b3c4d5e6f7a8b9c0d1e2f3a4b5c6d7e8f9a0b1c2d3e4f5a6b7c8d9e0) res=2a3b4c5d6e7f8a9b0c1d2e3f4a5b6c7d8e9f0a1b2c3d4e5f6a7b8c9d0e1f2a3)
[ ] test ...
[ ] checksum ...
==================
==> Running task 'test': ./hello > result.txt
==> Task 'test' completed.
=== Task Graph ===
[✔] build ...
[✔] test (hash=8c7d6e5f4a3b2c1d0e9f8a7b6c5d4e3f2a1b0c9d8e7f6a5b4c3d2e1f0a9b8c7) res=4d5e6f7a8b9c0d1e2f3a4b5c6d7e8f9a0b1c2d3e4f5a6b7c8d9e0f1a2b3c4d)
[ ] checksum ...
==================
==> Running task 'checksum': sha256sum result.txt > result.sha
==> Task 'checksum' completed.
=== Task Graph ===
[✔] build ...
[✔] test ...
[✔] checksum ...
==================
All tasks completed (some may have been cached). Final graph:
...

4. Failure Case: Missing Input

Manifest has a typo: inputs = missing.c but missing.c does not exist.

$ ./reprovm manifest.txt
Will execute 3 tasks in order:
  build
  test
  checksum
==> Running task 'build': gcc -o hello missing.c
Task 'build' failed with exit code 1
=== Task Graph ===
[X] build (hash=...) res=
[ ] test ...
[ ] checksum ...
==================
One or more tasks failed.

Advanced Usage Patterns

Partial Target Execution

Run only a high-level result:

$ ./reprovm manifest.txt checksum
Will execute 3 tasks in order:
  build
  test
  checksum
[*] build (cached)
[*] test (cached)
[*] checksum (cached)
Final graph:
=== Task Graph ===
[*] build ...
[*] test ...
[*] checksum ...
==================

Forcing a Rebuild

To ignore the cached result of a task:

$ rm .reprovm/cache/a3f5d9c0e1b2c3d4e5f6a7b8c9d0e1f2a3b4c5d6e7f8a9b0c1d2e3f4a5b6c7.meta
$ ./reprovm manifest.txt build

Inspect Cache / Provenance

View metadata manually:

$ cat .reprovm/cache/a3f5d9c0e1b2c3d4e5f6a7b8c9d0e1f2a3b4c5d6e7f8a9b0c1d2e3f4a5b6c7.meta
task_hash: a3f5d9c0e1b2...
result_hash: 1f2e3d4c5b6a...
output hello 5e2f1a3b...

Restore any blob:

$ ./reprovm  # normal run will auto-restore, or use cas APIs to manually extract

Example: More Complex Pipeline Manifest

task fetch_data {
  cmd = curl -s -o raw.csv https://example.com/data.csv
  inputs =
  outputs = raw.csv
  deps =
}

task clean {
  cmd = python3 scripts/clean.py raw.csv > cleaned.csv
  inputs = raw.csv
  outputs = cleaned.csv
  deps = fetch_data
}

task train {
  cmd = python3 scripts/train_model.py cleaned.csv model.bin
  inputs = cleaned.csv
  outputs = model.bin
  deps = clean
}

task evaluate {
  cmd = python3 scripts/evaluate.py model.bin cleaned.csv > metrics.json
  inputs = model.bin, cleaned.csv
  outputs = metrics.json
  deps = train
}

task bundle {
  cmd = tar -czf package.tar.gz model.bin metrics.json
  inputs = model.bin, metrics.json
  outputs = package.tar.gz
  deps = evaluate
}

Run the final bundle:

./reprovm pipeline_manifest.txt bundle

Re-running after only editing clean.py results in re-running clean, train, evaluate, and bundle — earlier cached steps (fetch_data) are reused.

Extension Points / Developer Notes

Parallelism: Current ordering is serial. Hooks exist for executing independent zero-indegree tasks concurrently.
Remote CAS: The local CAS abstraction could be extended to push/pull blobs from remote storage (HTTP, S3, peer-to-peer).
Manifest Parsers: Replace the ad-hoc parser with YAML/JSON frontends while emitting the same internal Task structures.
Custom Cache Policies: Add TTL, manual invalidation, or signature-based validation.
Remote Execution: Swap system() with RPC to remote workers.
Result Verification: Integrate signature checking of cached outputs for tamper detection.

Performance Tuning

Use -O2 or higher for building ReproVM itself (Makefile already uses -O2).
Keep tasks fine-grained to maximize cache reuse.
Avoid unnecessary outputs: declaring only real outputs prevents wasted hashing overhead.
Batch small files if desired (could be an extension) to reduce CAS fragmentation.
Consider adding in-memory memoization layer for extremely hot repeated task hash computations in long sessions.

Security Considerations

Command Execution: Tasks execute arbitrary shell commands. Only run manifests from trusted sources.
Cache Poisoning: If multiple users share the .reprovm directory, ensure trust boundaries; consider signing metadata in a hardened variant.
Path Safety: Manifest authors should avoid tasks that write outside expected output declarations to prevent confusion or overwrites.
Replay Attacks: Without cryptographic signing, someone could replace CAS blob contents on disk. A secure variant would enforce integrity verification with signatures beyond hash existence.

Troubleshooting & Error Codes

Symptom / Error	Meaning	Action
`Unknown target 'foo'`	Specified target not defined	Verify spelling in manifest
`Cycle detected among tasks`	Dependency loop	Break cycle by reordering or removing dependency
`Task 'X' failed with exit code N`	Command returned nonzero	Inspect task `cmd`, check inputs, run manually for detailed error
Missing output but cache hit reported	Downstream expected file not declared as output	Ensure outputs reflect actual produced files
Corrupted `.meta` file or CAS object	Cache metadata unreadable	Delete relevant `.meta` to force recompute
`Failed to hash input file`	Input file missing or unreadable	Confirm file exists and permissions are correct

Exit codes:

0 — All tasks succeeded (possibly with skips).
1 — One or more errors (parsing, execution failure, cycle, internal error).

FAQ

Q: Can I use ReproVM inside CI to speed up repeated runs? A: Yes. Persist the .reprovm/cas and .reprovm/cache directories between pipeline runs to reuse cached steps and drastically reduce execution time.

Q: What happens if I rename a file but its content is same? A: File identity in CAS is content-based, but the manifest refers to file paths. Renaming changes input/output names in the manifest, so tasks will rerun unless you create symlinks or adapt the manifest.

Q: Can one task produce outputs not declared? A: The system won’t cache undeclared outputs properly. Always declare all side-effect files for reproducibility.

Q: Is the cache sharable across machines? A: Out of the box, the CAS is local. You can copy .reprovm/cas/objects and .reprovm/cache/ to another machine as long as relative paths and environment are compatible. For robust sharing, one could extend to a remote CAS backend.

Contributing

Contributions are welcome. Suggested ways to help:

Add parallel execution of independent tasks.
Implement remote CAS backend (e.g., via HTTP/S3).
Add manifest validation and richer syntax.
Introduce a GUI or web visualization of task DAGs.
Wrap ReproVM in other language bindings (Python, Rust) while reusing core logic.

Guidelines:

Fork the repository.
Create a feature branch.
Write tests or example manifests demonstrating your addition.
Submit a pull request with a clear description and any limitations.

Glossary

Blob: Arbitrary file contents stored in CAS, addressed by SHA-256 hash.
Task Hash: Identity of a task; changes if its command, inputs, or upstream dependencies change.
Result Hash: Digest representing the outputs of a task.
Cache Hit: When a task's metadata already exists, and its outputs are restored instead of re-running.
DAG: Directed Acyclic Graph of task dependencies.
CAS: Content-Addressed Storage, deduplicating by content.

Parallel Execution

ReproVM can be optionally extended to execute independent tasks in parallel while still respecting explicit dependencies. This is provided via a companion parallel executor and alternate entry point; it is non-invasive and does not require changes to the original codebase. If you don’t opt into parallel mode, the original serial behavior remains exactly the same.

What's Different

Tasks with no mutual dependency (i.e., whose prerequisites are satisfied and that don’t depend on each other) can run concurrently.
The dependency DAG is observed: a task only becomes eligible when all of its upstream dependencies (within the current target set) have finished (successfully or been skipped).
Failures are propagated: if any task fails, the overall run reports failure, but other currently runnable tasks may still finish to give more visibility before termination.
ASCII graph updates still happen after each task completion; the output is serialized to reduce interleaving so you can watch the evolving state even under concurrency.

Building the Parallel Binary

The parallel version lives alongside the regular reprovm binary. Compile it by adding the new sources and linking pthreads:

gcc -std=c99 -O2 -Wall -Wextra -g reprovm_parallel.c parallel_executor.c task.c cas.c util.c -lpthread -o reprovm_parallel

You can keep both versions:

gcc -std=c99 -O2 -Wall -Wextra -g main.c task.c cas.c util.c -o reprovm
gcc -std=c99 -O2 -Wall -Wextra -g reprovm_parallel.c parallel_executor.c task.c cas.c util.c -lpthread -o reprovm_parallel

Usage

./reprovm_parallel [-j N] <manifest> [target1 target2 ...]

-j N / --jobs N: number of worker threads to use. If omitted, it defaults to the number of online CPUs (fallbacking to 4).
Manifest and target semantics are identical to the serial version; dependencies are resolved automatically.

You can also influence parallelism via environment variable (future extension support):

export REPROVM_JOBS=8   # if implemented, can be read as default parallelism

Examples

Full Parallel Run (auto worker count)

$ ./reprovm_parallel manifest.txt
Will execute 3 tasks (parallel workers: 8)
==> Scheduling...
[~] build (hash=...) res=
[~] test  (waiting on build)
[~] checksum (waiting on test)

==> Running task 'build': gcc -o hello hello.c
==> Running task 'test': ./hello > result.txt       # if build finished early enough and test became eligible
[✔] build ...
[*] test  (cache hit or skipped if unchanged)
[ ] checksum ...
...

Specifying 4 Workers

$ ./reprovm_parallel -j 4 manifest.txt
Will execute 3 tasks (parallel workers: 4)
==> Running task 'build': gcc -o hello hello.c
==> Task 'build' completed.
=== Task Graph ===
[✔] build ...
[ ] test ...
[ ] checksum ...
==================
==> Running task 'test': ./hello > result.txt
==> Task 'test' completed.
=== Task Graph ===
[✔] build ...
[✔] test ...
[ ] checksum ...
==================
==> Running task 'checksum': sha256sum result.txt > result.sha
==> Task 'checksum' completed.
=== Task Graph ===
[✔] build ...
[✔] test ...
[✔] checksum ...
==================
All tasks completed (some may have been cached). Final graph:
=== Task Graph ===
[✔] build ...
[✔] test ...
[✔] checksum ...
==================

Mixed Parallelism with Independent Tasks

Given a manifest with two independent tasks A and B that both feed into C, they can run concurrently:

Will execute 3 tasks (parallel workers: 4)
==> Running task 'A': ...
==> Running task 'B': ...
[✔] A ...
[✔] B ...
==> Running task 'C': ...
[✔] C ...
Final graph:
[✔] A ...
[✔] B ...
[✔] C ...

Failure Behavior

If one worker encounters a failure (non-zero exit), the failure is recorded but other in-flight eligible tasks are allowed to finish so you get a full snapshot. The final exit code is non-zero, and the ASCII graph will show [X] for failed tasks.

Integration Notes

The parallel executor uses the same task hashing, cache lookup, execution, and graph-printing logic.
No modifications are required in existing source files—just use reprovm_parallel when you want concurrency.
You can mix: use the serial ./reprovm for debugging or simple runs, and the parallel ./reprovm_parallel -j N for performance on larger DAGs.

Simulated Parallel Run Output (Realistic)

$ ./reprovm_parallel -j 3 manifest.txt
Will execute 3 tasks (parallel workers: 3)
==> Scheduling and starting workers...
[~] build (running)         [ ] test (waiting)        [ ] checksum (waiting)
==> Running task 'build': gcc -o hello hello.c
[✔] build (completed)
=== Task Graph ===
[✔] build (hash=...) res=...
[ ] test  (ready) res=
[ ] checksum (waiting on test)
==================
==> Running task 'test': ./hello > result.txt
[~] test (running)
[ ] checksum (waiting on test)
[✔] test (completed)
=== Task Graph ===
[✔] build ...
[✔] test ...
[ ] checksum ...
==================
==> Running task 'checksum': sha256sum result.txt > result.sha
[✔] checksum (completed)
=== Task Graph ===
[✔] build ...
[✔] test ...
[✔] checksum ...
==================
All tasks completed (some may have been cached). Final graph:
=== Task Graph ===
[✔] build ...
[✔] test ...
[✔] checksum ...
==================

If a task is a cache hit and skipped, it shows as [*] and its dependents may immediately become eligible without delay.

Tuning

Use -j to control throughput; too many threads on small DAGs may add scheduling overhead, so match worker count to workload size.
Because output rendering is serialized, you get consistent task graph snapshots even under concurrency.

Here’s a Docker section you can insert into the README (e.g., right after Installation & Build or before Manifest Specification):

Docker

ReproVM can be run inside a container for reproducible, isolated execution environments. The provided Dockerfile and docker-compose.yml (see repo) bundle all required tools (GCC, make, Python, Node.js, Ruby, etc.), build both serial and parallel binaries, and give you a consistent runtime.

Overview

Container encapsulates build/runtime dependencies: C toolchain, shell, optional polyglot helpers (Go, Rust, Node.js, Ruby), and ReproVM itself.
Workspace is mounted from the host so manifests, source files, and the .reprovm cache/CAS are shared (and persist across container invocations).
Parallelism is exposed via the reprovm_parallel binary and optionally controlled with -j or environment variable REPROVM_JOBS.

Building the Image

From the repo root:

docker build -t reprovm:latest .

This builds the image, compiling ReproVM (serial + parallel) and preparing ancillary scripts.

Running ReproVM with Docker

Run the pipeline against your manifest with:

docker run --rm -v "$(pwd)":/workspace -w /workspace reprovm:latest ./run_pipeline.sh manifest.txt

Or invoke the parallel binary directly (auto-detects CPU count unless overridden):

docker run --rm -v "$(pwd)":/workspace -w /workspace reprovm:latest ./reprovm_parallel -j 4 manifest.txt

To pass parallelism via environment variable:

docker run --rm -v "$(pwd)":/workspace -w /workspace -e REPROVM_JOBS=8 reprovm:latest ./run_pipeline.sh manifest.txt

Using docker-compose

With the included docker-compose.yml, you can spin up the ReproVM service easily:

docker compose up reprovm

This mounts the current directory into /workspace, builds/uses the image, and runs ./run_pipeline.sh by default. To target a specific manifest or override:

docker compose run --rm reprovm ./reprovm_parallel -j 2 pipeline_manifest.txt bundle

Persisting Cache

Since .reprovm lives in the mounted workspace, cache and CAS entries survive between container runs. To maximize CI or repeated-run performance, persist the host-side .reprovm directory (e.g., in CI artifacts or volume-backed storage).

Examples

Cold build:

docker run --rm -v "$(pwd)":/workspace -w /workspace reprovm:latest ./reprovm_parallel manifest.txt

Warm rebuild (uses cache):

docker run --rm -v "$(pwd)":/workspace -w /workspace reprovm:latest ./reprovm_parallel manifest.txt

Run a specific target:

docker run --rm -v "$(pwd)":/workspace -w /workspace reprovm:latest ./reprovm_parallel manifest.txt checksum

Shell into container for inspection/debugging:

docker run --rm -it -v "$(pwd)":/workspace -w /workspace reprovm:latest bash
# then inside:
./reprovm_parallel -j 4 manifest.txt

Cleanup

To remove the built image when you no longer need it:

docker image rm reprovm:latest

If using docker-compose, tear down (and optionally remove volumes):

docker compose down

Tips

Bind mount the workspace so intermediate outputs, manifests, and the cache are visible/editable from the host.
Use the same .reprovm directory across container invocations for maximum cache reuse.
Override commands by replacing the entrypoint/command in docker run or docker-compose if you need to run custom diagnostics (e.g., inspect .reprovm/cache contents).
CI integration: Build the image once, mount the repo, run the manifest, and persist .reprovm between jobs to drastically cut repeat run time.

Example File Layout After Successful Run

.
├── hello.c
├── manifest.txt
├── hello                    # compiled binary
├── result.txt              # program output
├── result.sha             # checksum
├── reprovm                # compiled VM executable
└── .reprovm
    ├── cache
    │   ├── a3f5d9c0e1b2...meta
    │   ├── 5f6e7d8c9b0a...meta
    │   └── 9a8b7c6d5e4f...meta
    └── cas
        └── objects
            ├── a3/
            │   └── f5d9c0e1b2c3d4...  # blobs: e.g., input hello.c
            ├── 5f/
            │   └── 6e7d8c9b0a1f2e...  # more blobs
            └── 9a/
                └── 8b7c6d5e4f3a2b...  # output blobs

License

MIT License. See the LICENSE file for details.

Thank you for using ReproVM! We hope it helps you build reproducible, efficient workflows with ease. If you have any questions or suggestions, feel free to open an issue or contribute to the project.

Name		Name	Last commit message	Last commit date
Latest commit History 16 Commits
.devcontainer		.devcontainer
.githooks		.githooks
.github		.github
.idea		.idea
scripts		scripts
tests		tests
.astylerc		.astylerc
.clang-format		.clang-format
.clang-tidy		.clang-tidy
.gitignore		.gitignore
.uncrustify.cfg		.uncrustify.cfg
Dockerfile		Dockerfile
LICENSE		LICENSE
Makefile		Makefile
README.md		README.md
cas.c		cas.c
cas.h		cas.h
ci_restore_cache.sh		ci_restore_cache.sh
clean_cache.sh		clean_cache.sh
crc32_asm.asm		crc32_asm.asm
docker-compose.yml		docker-compose.yml
format.sh		format.sh
hello.c		hello.c
hello_updated.c		hello_updated.c
main.c		main.c
parallel_executor.c		parallel_executor.c
pipeline_manifest.txt		pipeline_manifest.txt
reprovm_parallel.c		reprovm_parallel.c
run_pipeline.sh		run_pipeline.sh
task.c		task.c
task.h		task.h
util.c		util.c
util.h		util.h

License

hoangsonww/ReproVM-Virtual-Machine

Folders and files

Latest commit

History

Repository files navigation

ReproVM - Reproducible Task Virtual Machine with Content-Addressed Caching

Table of Contents

Overview

Why ReproVM?

Key Concepts

Key Technologies

Quickstart

Installation & Build

Requirements

Build

Manifest Specification

Grammar Summary

Valid Fields

Example

Notes

Task Lifecycle

Cache & CAS Internals

CAS Layout

Metadata Record

Identity Propagation

CLI Usage Reference

Examples

Environment & Configuration

Simulated Realistic Sessions & Logs

1. First Full Run (cold cache)

2. Repeat Run (warm cache)

3. Change in Upstream Input (hello.c modified)

4. Failure Case: Missing Input

Advanced Usage Patterns

Partial Target Execution

Forcing a Rebuild

Inspect Cache / Provenance

Example: More Complex Pipeline Manifest

Extension Points / Developer Notes

Performance Tuning

Security Considerations

Troubleshooting & Error Codes

FAQ

Contributing

Glossary

Parallel Execution

What's Different

Building the Parallel Binary

Usage

Examples

Full Parallel Run (auto worker count)

Specifying 4 Workers

Mixed Parallelism with Independent Tasks

Failure Behavior

Integration Notes

Simulated Parallel Run Output (Realistic)

Tuning

Docker

Overview

Building the Image

Running ReproVM with Docker

Using docker-compose

Persisting Cache

Examples

Cleanup

Tips

Example File Layout After Successful Run

License

About

Topics

Resources

License

Code of conduct

Contributing

Security policy

Uh oh!

Stars

Watchers

Forks

3. Change in Upstream Input (`hello.c` modified)

Packages