Simple Workflow with tq_apply

Overview

tq_apply() provides a simplified workflow for running parallel tasks on HPC clusters. It combines multiple steps (project creation, resource assignment, task addition, and worker scheduling) into a single function call, similar to base R’s lapply() or sapply().

This is the easiest way to get started with taskqueue if you:

Have a simple function to run multiple times
Don’t need complex project management
Want to quickly parallelize work on an HPC cluster

Before using taskqueue, ensure you have:

PostgreSQL installed and configured (see PostgreSQL Setup vignette)
SSH access configured for remote resources (see SSH Setup vignette)
Database initialized:
```
library(taskqueue)
db_init()
```

A resource already defined:

resource_add(
  name = "hpc",
  type = "slurm",
  host = "hpc.example.com",
  nodename = "hpc",
  workers = 500,
  log_folder = "/home/user/log_folder/"
)

Basic Usage

The simplest use of tq_apply() requires just a few arguments:

library(taskqueue)

# Define your function
my_simulation <- function(i) {
  # Your computation here
  result <- i^2
  Sys.sleep(1)  # Simulate some work
  return(result)
}

# Run 100 tasks in parallel
tq_apply(
  n = 100,
  fun = my_simulation,
  project = "my_project",
  resource = "hpc"
)

This will:

Create or update the project “my_project”
Add the resource “hpc” to the project
Create 100 tasks
Schedule workers on the SLURM cluster
Execute my_simulation(1), my_simulation(2), …, my_simulation(100) in parallel
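
For intuition, the values computed by the tasks match what a local lapply() over the task indices would produce. The sketch below runs on the local machine only and does not call taskqueue; it assumes, as described above, that each task receives its index as the first argument. The name test_ids is chosen for illustration.

```r
# Local stand-in for the work done by the tasks (no cluster or database needed).
my_simulation <- function(i) {
  result <- i^2
  return(result)
}

# Evaluate a handful of task indices locally.
test_ids <- 1:5
local_results <- lapply(test_ids, my_simulation)
print(unlist(local_results))
```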

Function Arguments

Required Arguments

n: Number of tasks to run (integer)
fun: The function to execute for each task
project: Project name (string)
resource: Resource name (string, must already exist)

Optional Arguments

memory: Memory per task in GB (default: 10)
hour: Maximum runtime in hours (default: 24)
account: Account name for cluster billing (optional)
working_dir: Working directory on cluster (default: getwd())
...: Additional arguments passed to your function

Passing Arguments to Your Function

You can pass additional arguments to your function using ...:

my_function <- function(i, multiplier, offset = 0) {
  result <- i * multiplier + offset
  return(result)
}

tq_apply(
  n = 50,
  fun = my_function,
  project = "test_args",
  resource = "hpc",
  multiplier = 10,    # Passed to my_function
  offset = 5          # Passed to my_function
)

Each task will call:

- Task 1: my_function(1, multiplier = 10, offset = 5)
- Task 2: my_function(2, multiplier = 10, offset = 5)
- And so on…
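
Because every task receives the same extra arguments and differs only in its index, one common way to run different parameter combinations is to let the index select a row of a parameter grid. The following is a minimal sketch of that pattern in base R, checked locally; run_one and its arguments multipliers and offsets are hypothetical names, not part of taskqueue.

```r
# Hypothetical task function: the task index selects one parameter combination.
run_one <- function(i, multipliers = c(1, 10, 100), offsets = c(0, 5)) {
  # Rebuild the grid inside the function so each task is self-contained.
  grid <- expand.grid(multiplier = multipliers, offset = offsets)
  stopifnot(i >= 1, i <= nrow(grid))
  params <- grid[i, ]
  i * params$multiplier + params$offset
}

# Local check: the grid has six rows, so n = 6 tasks would cover it.
n_tasks <- nrow(expand.grid(multiplier = c(1, 10, 100), offset = c(0, 5)))
grid_results <- vapply(seq_len(n_tasks), run_one, numeric(1))
print(grid_results)
```

With this layout, the value of n passed to tq_apply() would be the number of rows in the grid.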

Complete Example

Here’s a practical example running a Monte Carlo simulation:

library(taskqueue)

# Define simulation function
run_monte_carlo <- function(task_id, n_samples = 10000, seed_base = 12345) {
  # Set unique seed for each task
  set.seed(seed_base + task_id)
  
  # Run simulation
  samples <- rnorm(n_samples)
  result <- list(
    task_id = task_id,
    mean = mean(samples),
    sd = sd(samples),
    quantiles = quantile(samples, probs = c(0.025, 0.5, 0.975))
  )
  
  # Save results
  out_file <- sprintf("results/simulation_%04d.Rds", task_id)
  dir.create("results", showWarnings = FALSE)
  saveRDS(result, out_file)
  
  return(invisible(NULL))
}

# Run 1000 simulations in parallel
tq_apply(
  n = 1000,
  fun = run_monte_carlo,
  project = "monte_carlo_study",
  resource = "hpc",
  memory = 8,           # 8 GB per task
  hour = 2,             # 2 hour time limit
  working_dir = "/home/user/monte_carlo",
  n_samples = 50000,    # Argument for run_monte_carlo
  seed_base = 99999     # Argument for run_monte_carlo
)

Monitoring Progress

After calling tq_apply(), monitor your tasks:

# Check task status
task_status("monte_carlo_study")

# Check overall project status
project_status("monte_carlo_study")

Collecting Results

After all tasks complete, collect your results:

# Read all result files
result_files <- list.files("results", pattern = "simulation_.*\\.Rds$", 
                          full.names = TRUE)

# Combine results
all_results <- lapply(result_files, readRDS)

# Analyze
means <- sapply(all_results, function(x) x$mean)
hist(means, main = "Distribution of Means")
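
Before combining results, it can help to confirm that every task actually wrote its file. The sketch below uses the file naming scheme from the Monte Carlo example and compares the expected file names with what is on disk. It runs in a temporary directory with made-up placeholder results, and results_dir, n_tasks and missing_ids are illustrative names.

```r
# Self-contained demo in a temporary directory (stand-in for "results").
results_dir <- file.path(tempdir(), "results_demo")
dir.create(results_dir, showWarnings = FALSE)

n_tasks <- 10
# Pretend tasks 3 and 7 failed: write placeholder files for all the others.
for (id in setdiff(seq_len(n_tasks), c(3, 7))) {
  saveRDS(list(task_id = id, mean = 0),
          file.path(results_dir, sprintf("simulation_%04d.Rds", id)))
}

# Compare the expected file names against what is on disk.
expected <- sprintf("simulation_%04d.Rds", seq_len(n_tasks))
missing_ids <- which(!file.exists(file.path(results_dir, expected)))
print(missing_ids)
```

The indices in missing_ids point to the tasks whose logs are worth inspecting first.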

Best Practices

1. Save Results to Files

Your function should save results to the file system:

my_task <- function(i) {
  out_file <- sprintf("output/result_%04d.Rds", i)
  
  # Skip if already done
  if (file.exists(out_file)) {
    return(invisible(NULL))
  }
  
  # Make sure the output folder exists (safe if another task created it first)
  dir.create("output", showWarnings = FALSE)
  
  # Do computation
  result <- expensive_computation(i)
  
  # Save result
  saveRDS(result, out_file)
}

2. Make Functions Idempotent

Check if output already exists to avoid re-running completed tasks:

my_task <- function(i) {
  out_file <- sprintf("output/task_%d.Rds", i)
  if (file.exists(out_file)) return(invisible(NULL))
  
  # ... computation and save
}
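
A file.exists() check only makes a task idempotent if an existing file is always complete. One way to get closer to that is to write to a temporary name and rename it into place once saving has finished. The sketch below shows that idea in base R. save_atomically and out_dir are hypothetical names introduced here. A rename within one directory usually happens in a single step on local file systems, but this is an assumption that may not hold on every network file system.

```r
# Hypothetical helper: write to a temporary name, then rename into place,
# so a file at `out_file` is only visible once it has been fully written.
save_atomically <- function(object, out_file) {
  dir.create(dirname(out_file), recursive = TRUE, showWarnings = FALSE)
  tmp_file <- paste0(out_file, ".tmp")
  saveRDS(object, tmp_file)
  ok <- file.rename(tmp_file, out_file)
  if (!ok) stop("Could not move ", tmp_file, " to ", out_file)
  invisible(out_file)
}

my_task <- function(i, out_dir = file.path(tempdir(), "output_demo")) {
  out_file <- file.path(out_dir, sprintf("task_%d.Rds", i))
  # Skip work that has already been completed.
  if (file.exists(out_file)) return(invisible("skipped"))
  save_atomically(i^2, out_file)
  invisible("computed")
}

first_run <- my_task(4)   # does the work and saves the result
second_run <- my_task(4)  # finds the file and skips
print(c(first_run, second_run))
```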

3. Specify Working Directory

Ensure your working directory on the cluster is correct:

tq_apply(
  n = 100,
  fun = my_function,
  project = "my_project",
  resource = "hpc",
  working_dir = "/home/user/project_folder"
)

4. Set Appropriate Resources

Configure memory and time limits based on your task requirements:

tq_apply(
  n = 100,
  fun = memory_intensive_task,
  project = "big_analysis",
  resource = "hpc",
  memory = 64,    # 64 GB for large tasks
  hour = 48       # 48 hour time limit
)

Comparison with Manual Workflow

tq_apply() simplifies the workflow by combining these steps:

Manual approach:

# Multiple steps
project_add("test", memory = 10)
project_resource_add("test", "hpc", working_dir = "/path", hours = 24)
task_add("test", num = 100, clean = TRUE)
project_reset("test")
worker_slurm("test", "hpc", fun = my_function)

With tq_apply():

# Single step
tq_apply(n = 100, fun = my_function, project = "test", resource = "hpc",
         working_dir = "/path", hour = 24)

Troubleshooting

Tasks fail immediately:

- Check the log folder specified in your resource configuration
- Verify your function works locally first
- Ensure the working directory exists on the cluster

Tasks remain in “idle” status:

- Check that the project is started: project_start("my_project")
- Verify the resource is correctly configured
- Check SLURM queue: squeue -u $USER

“Resource not found” error:

- The resource must be created before using tq_apply()
- Use resource_list() to see available resources
- Create resource with resource_add()
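
For the advice above to verify a function locally first, a small harness that runs a few indices and reports which ones raise an error can save a round trip to the cluster. This is a sketch in base R only; fragile_task and check_task are hypothetical names, and the failure at index 3 is contrived for the demonstration.

```r
# Hypothetical task that fails for one index.
fragile_task <- function(i) {
  if (i == 3) stop("bad input for task ", i)
  sqrt(i)
}

# Run a few indices locally and record which ones raise an error.
check_task <- function(fun, ids, ...) {
  failed <- vapply(ids, function(i) {
    tryCatch({
      fun(i, ...)
      FALSE
    }, error = function(e) {
      message("Task ", i, " failed: ", conditionMessage(e))
      TRUE
    })
  }, logical(1))
  ids[failed]
}

failed_ids <- check_task(fragile_task, 1:5)
print(failed_ids)
```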

When to Use tq_apply()

Use tq_apply() when:

- You have a simple parallel task
- You want to quickly run many iterations of a function
- You don’t need fine-grained control over project settings

Use the manual workflow when:

- You need to manage multiple projects simultaneously
- You want to reuse a project for different task sets
- You need more control over resource scheduling
- You’re running different types of tasks in the same project
