The putior package helps you document and visualize workflows by extracting structured annotations from your R and Python source files. This vignette shows you how to get started with PUT annotations and workflow extraction.
The name putior combines PUT + Input + Output + R, reflecting the package's core purpose: tracking data inputs and outputs through your analysis pipeline using special annotations.
The fastest way to see putior in action is to run the built-in example:
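# Run the example script bundled with the package
source(system.file("examples", "reprex.R", package = "putior"))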
This creates a sample multi-language workflow and demonstrates the workflow extraction capabilities of putior.
PUT annotations are special comments that describe workflow nodes. Here’s how to add them to your source files:
R script example:
# data_processing.R
library(dplyr)

#put id:"load_data", label:"Load Customer Data", node_type:"input", output:"raw_data.csv"
# Your actual code
data <- read.csv("customer_data.csv")
write.csv(data, "raw_data.csv")

#put id:"clean_data", label:"Clean and Validate", node_type:"process", input:"raw_data.csv", output:"clean_data.csv"
# Data cleaning code
cleaned_data <- data %>%
  filter(!is.na(customer_id)) %>%
  mutate(purchase_date = as.Date(purchase_date))
write.csv(cleaned_data, "clean_data.csv")
Python script example:
# analysis.py
#put id:"analyze_sales", label:"Sales Analysis", node_type:"process", input:"clean_data.csv", output:"sales_report.json"
import pandas as pd
import json

# Load cleaned data
data = pd.read_csv("clean_data.csv")

# Perform analysis
sales_summary = {
    "total_sales": data["amount"].sum(),
    "avg_order": data["amount"].mean(),
    "customer_count": data["customer_id"].nunique()
}

# Save results
with open("sales_report.json", "w") as f:
    json.dump(sales_summary, f)
Use the put() function to scan your files and extract workflow information:
# Scan all R and Python files in a directory
workflow <- put("./src/")
# View the extracted workflow
print(workflow)
Expected output:
#>           file_name file_type          input              label            id
#> 1 data_processing.R         r           <NA> Load Customer Data     load_data
#> 2 data_processing.R         r   raw_data.csv Clean and Validate    clean_data
#> 3       analysis.py        py clean_data.csv     Sales Analysis analyze_sales
#>   node_type             output
#> 1     input       raw_data.csv
#> 2   process     clean_data.csv
#> 3   process  sales_report.json
The output is a data frame where each row represents a workflow node. The columns include file_name, file_type, id, label, node_type, input, and output: the annotation properties plus the source file's name and type.
The general syntax for PUT annotations is:
#put property1:"value1", property2:"value2", property3:"value3"
PUT annotations support several formats to fit different coding styles:
#put id:"my_node", label:"My Process" # Standard format
# put id:"my_node", label:"My Process" # Space after #
#put| id:"my_node", label:"My Process" # Pipe separator
#put id:'my_node', label:'Single quotes' # Single quotes
#put id:"my_node", label:'Mixed quotes' # Mixed quote styles
While putior accepts any properties you define, these are commonly used:
Property | Purpose | Example Values |
---|---|---|
id | Unique identifier | "load_data", "process_sales" |
label | Human-readable description | "Load Customer Data" |
node_type | Operation type | "input", "process", "output" |
input | Input files | "raw_data.csv", "data/*.json" |
output | Output files | "processed_data.csv" |
For consistency across projects, consider using these standard node types:

- input: Data collection, file loading, API calls
- process: Data transformation, analysis, computation
- output: Report generation, data export, visualization
- decision: Conditional logic, branching workflows

Add any properties you need for visualization or metadata:
#put id:"train_model", label:"Train ML Model", node_type:"process", color:"green", group:"machine_learning", duration:"45min", priority:"high"
These custom properties can be used by visualization tools or workflow management systems.
You can process single files instead of entire directories:
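For example (a minimal sketch; the file path is illustrative):
workflow <- put("./src/data_processing.R")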
Include subdirectories in your scan:
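A sketch, assuming put() exposes a recursive argument:
workflow <- put("./src/", recursive = TRUE)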
Control which files are processed:
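A sketch, assuming a pattern argument that filters files by regular expression:
workflow <- put("./src/", pattern = "\\.R$")  # scan R files only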
For debugging annotation issues, include line numbers:
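A sketch, assuming an include_line_numbers argument:
workflow <- put("./src/", include_line_numbers = TRUE)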
Control annotation validation:
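A sketch, assuming a validate argument that can switch validation warnings off:
workflow <- put("./src/", validate = FALSE)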
If you omit the id field, putior will automatically generate a unique UUID:
# Annotations without explicit IDs get auto-generated UUIDs
#put label:"Load Data", node_type:"input", output:"data.csv"
#put label:"Process Data", node_type:"process", input:"data.csv", output:"clean.csv"
# Extract workflow - IDs will be auto-generated
workflow <- put("./")
print(workflow$id) # Will show UUIDs like "a1b2c3d4-e5f6-7890-abcd-ef1234567890"
Note: If you provide an empty id (e.g., id:""), you'll get a validation warning.
If you omit the output field, putior automatically uses the file name as the output:
# In process_data.R:
#put label:"Process Step", node_type:"process", input:"raw.csv"
# No output specified - will default to "process_data.R"
# In analyze_data.R:
#put label:"Analyze", node_type:"process", input:"process_data.R", output:"results.csv"
# This creates a connection from process_data.R to analyze_data.R
This feature ensures that scripts can be connected in workflows even when explicit output files aren’t specified.
When you have scripts that source other scripts, use this annotation pattern:
# In main.R (sources other scripts):
#put label:"Main Analysis", input:"load_data.R,process_data.R", output:"report.pdf"
source("load_data.R") # Reading load_data.R into main.R
source("process_data.R") # Reading process_data.R into main.R
# In load_data.R (sourced by main.R):
#put label:"Data Loader", node_type:"input"
# output defaults to "load_data.R"
# In process_data.R (sourced by main.R, depends on load_data.R):
#put label:"Data Processor", input:"load_data.R"
# output defaults to "process_data.R"
This correctly shows the flow: sourced scripts are inputs to the main script.
Let’s walk through a complete data science workflow:
# 01_collect_data.py
#put id:"fetch_api_data", label:"Fetch Data from API", node_type:"input", output:"raw_api_data.json"
import requests
import json
response = requests.get("https://api.example.com/sales")
data = response.json()
with open("raw_api_data.json", "w") as f:
    json.dump(data, f)
# 02_process_data.R
#put id:"clean_api_data", label:"Clean and Structure Data", node_type:"process", input:"raw_api_data.json", output:"processed_sales.csv"
library(jsonlite)
library(dplyr)
# Load raw data
raw_data <- fromJSON("raw_api_data.json")
# Process and clean
processed <- raw_data %>%
  filter(!is.na(sale_amount)) %>%
  mutate(
    sale_date = as.Date(sale_date),
    sale_amount = as.numeric(sale_amount)
  ) %>%
  arrange(sale_date)
# Save processed data
write.csv(processed, "processed_sales.csv", row.names = FALSE)
# 03_analyze_report.R
#put id:"sales_analysis", label:"Perform Sales Analysis", node_type:"process", input:"processed_sales.csv", output:"analysis_results.rds"
#put id:"generate_report", label:"Generate HTML Report", node_type:"output", input:"analysis_results.rds", output:"sales_report.html"
library(dplyr)

# Load processed data (read.csv returns dates as character, so convert)
sales_data <- read.csv("processed_sales.csv")
sales_data$sale_date <- as.Date(sales_data$sale_date)

# Perform analysis
analysis_results <- list(
  total_sales = sum(sales_data$sale_amount),
  monthly_trends = sales_data %>%
    group_by(month = format(sale_date, "%Y-%m")) %>%
    summarise(monthly_total = sum(sale_amount)),
  top_products = sales_data %>%
    group_by(product) %>%
    summarise(product_sales = sum(sale_amount)) %>%
    arrange(desc(product_sales)) %>%
    head(10)
)

# Save analysis
saveRDS(analysis_results, "analysis_results.rds")

# Generate report
rmarkdown::render("report_template.Rmd",
                  output_file = "sales_report.html")
Choose clear, descriptive names that explain what each step does:
# Good
#put id:"load_customer_transactions", label:"Load Customer Transaction Data"
#put id:"calculate_monthly_revenue", label:"Calculate Monthly Revenue Totals"

# Less descriptive
#put id:"step1", label:"Load data"
#put id:"process", label:"Do calculations"
Always specify inputs and outputs for data processing steps:
#put id:"merge_datasets", label:"Merge Customer and Transaction Data", input:"customers.csv,transactions.csv", output:"merged_data.csv"
Stick to a standard set of node types across your team:
#put id:"load_raw_data", label:"Load Raw Sales Data", node_type:"input"
#put id:"clean_data", label:"Clean and Validate", node_type:"process"
#put id:"export_results", label:"Export Final Results", node_type:"output"
Include metadata that helps with workflow understanding:
#put id:"train_model", label:"Train Random Forest Model", node_type:"process", estimated_time:"30min", requires:"tidymodels", memory_intensive:"true"
If put() returns an empty data frame, use is_valid_put_annotation() to test individual annotations.

If you see validation warnings:

- Add an id property to all annotations
- Use the standard node types (input, process, output)

If annotations aren't parsed correctly, make sure property values are quoted; unquoted values that contain commas will confuse the parser.
Good example:
#put id:"step1", description:"Process data, clean outliers", type:"process"
Problematic example (unquoted values containing commas):
#put id:"step1", description:Process data, clean outliers, type:process
Now that you understand the basics of putior:
source(system.file("examples", "reprex.R", package = "putior"))
For more detailed information, see:

- ?put - Complete function documentation
- Advanced usage vignette - Complex workflows and integration
- Best practices vignette - Team collaboration and style guides