pairwiseLLM provides a unified, extensible framework for generating, submitting, and modeling pairwise comparisons of writing quality using large language models (LLMs).
Several vignettes are available to demonstrate functionality, covering:

- Basic function usage
- Advanced batch processing workflows
- Prompt evaluation and positional-bias diagnostics
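Once the package is installed with its vignettes, they can be listed and opened directly from R. For example, the positional-bias vignette referenced later in this README:

``` r
# List all vignettes shipped with the package.
browseVignettes("pairwiseLLM")

# Open the prompt evaluation / positional-bias vignette directly.
vignette("prompt-template-bias", package = "pairwiseLLM")
```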
The following models are confirmed to work for pairwise comparisons:
| Provider | Model | Reasoning Mode? |
|---|---|---|
| OpenAI | gpt-5.2 | ✅ Yes |
| OpenAI | gpt-5.1 | ✅ Yes |
| OpenAI | gpt-4o | ❌ No |
| OpenAI | gpt-4.1 | ❌ No |
| Anthropic | claude-sonnet-4-5 | ✅ Yes |
| Anthropic | claude-haiku-4-5 | ✅ Yes |
| Anthropic | claude-opus-4-5 | ✅ Yes |
| Google/Gemini | gemini-3-pro-preview | ✅ Yes |
| DeepSeek-AI¹ | DeepSeek-R1 | ✅ Yes |
| DeepSeek-AI¹ | DeepSeek-V3 | ❌ No |
| Moonshot-AI¹ | Kimi-K2-Instruct-0905 | ❌ No |
| Qwen¹ | Qwen3-235B-A22B-Instruct-2507 | ❌ No |
| Qwen² | qwen3:32b | ✅ Yes |
| Google² | gemma3:27b | ❌ No |
| Mistral² | mistral-small3.2:24b | ❌ No |

¹ via the together.ai API
² via Ollama on a local machine
Batch APIs are currently available for OpenAI, Anthropic, and Gemini only. Models accessed via Together.ai and Ollama are supported for live comparisons via `submit_llm_pairs()` / `llm_compare_pair()` (see the sketch after the backend table below).
| Backend | Live | Batch |
|---|---|---|
| openai | ✅ | ✅ |
| anthropic | ✅ | ✅ |
| gemini | ✅ | ✅ |
| together | ✅ | ❌ |
| ollama | ✅ | ❌ |
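Together.ai and Ollama models therefore run through the live path only. As a sketch (it assumes the `pairs`, `td`, and `tmpl` objects built in the live example later in this README), a locally hosted Ollama model could be queried like this:

``` r
# Live comparisons against a local Ollama model (no batch API for this backend).
# Assumes `pairs`, `td`, and `tmpl` are created as in the live example below.
res_local <- submit_llm_pairs(
  pairs             = pairs,
  backend           = "ollama",       # or "together" for the together.ai API
  model             = "gemma3:27b",   # any live-capable model from the table above
  trait_name        = td$name,
  trait_description = td$description,
  prompt_template   = tmpl
)
```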
Once the package is available on CRAN, install with:

``` r
install.packages("pairwiseLLM")
```

To install the development version from GitHub:

``` r
# install.packages("pak")
pak::pak("shmercer/pairwiseLLM")
```

Load the package:

``` r
library(pairwiseLLM)
```

At a high level, pairwiseLLM workflows follow this structure:

1. Prepare writing samples and build pairs.
2. Choose a trait and a prompt template; templates use the placeholders {TRAIT_NAME}, {TRAIT_DESCRIPTION}, {SAMPLE_1}, and {SAMPLE_2}.
3. Submit the pairs to an LLM, either live or in batch.
4. Model the resulting comparisons.

The package provides helpers for each step.
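Under the hood, the submission helpers fill the template placeholders with the trait and the two samples before sending the prompt to the model. As a rough illustration only (plain base R, not the package's internal code; `fill_prompt()` is not a pairwiseLLM function):

``` r
# Illustration of what placeholder substitution looks like conceptually.
fill_prompt <- function(template, trait_name, trait_description,
                        sample_1, sample_2) {
  out <- gsub("{TRAIT_NAME}",        trait_name,        template, fixed = TRUE)
  out <- gsub("{TRAIT_DESCRIPTION}", trait_description, out,      fixed = TRUE)
  out <- gsub("{SAMPLE_1}",          sample_1,          out,      fixed = TRUE)
  out <- gsub("{SAMPLE_2}",          sample_2,          out,      fixed = TRUE)
  out
}

# Example: fill the default template for one pair of samples.
td     <- trait_description("overall_quality")
tmpl   <- get_prompt_template("default")
prompt <- fill_prompt(tmpl, td$name, td$description,
                      "First writing sample...", "Second writing sample...")
```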
Use the unified API:

- `llm_compare_pair()` — compare one pair
- `submit_llm_pairs()` — compare many pairs at once

Example:
data("example_writing_samples")
pairs <- example_writing_samples |>
make_pairs() |>
sample_pairs(5, seed = 123) |>
randomize_pair_order()
td <- trait_description("overall_quality")
tmpl <- get_prompt_template("default")
res <- submit_llm_pairs(
pairs = pairs,
backend = "openai",
model = "gpt-4o",
trait_name = td$name,
trait_description = td$description,
prompt_template = tmpl
)Large-scale runs use:
llm_submit_pairs_batch()llm_download_batch_results()Example:
``` r
batch <- llm_submit_pairs_batch(
  backend = "anthropic",
  model = "claude-sonnet-4-5",
  pairs = pairs,
  trait_name = td$name,
  trait_description = td$description,
  prompt_template = tmpl
)

results <- llm_download_batch_results(batch)
```

pairwiseLLM reads keys only from environment variables. Keys are never printed, never stored, and never written to disk.
You can verify which providers are available using:

``` r
check_llm_api_keys()
```

This returns a tibble showing whether R can see the required keys for each supported provider (OpenAI, Anthropic, Gemini, and Together.ai).
You may set keys temporarily for the current R session:

``` r
Sys.setenv(OPENAI_API_KEY = "your-key-here")
Sys.setenv(ANTHROPIC_API_KEY = "your-key-here")
Sys.setenv(GEMINI_API_KEY = "your-key-here")
Sys.setenv(TOGETHER_API_KEY = "your-key-here")
```

…but for normal use and for reproducible analyses, it is strongly recommended to store them in your `~/.Renviron` file.

Open your `~/.Renviron` file:

``` r
usethis::edit_r_environ()
```

Add the following lines:
```
OPENAI_API_KEY="your-openai-key"
ANTHROPIC_API_KEY="your-anthropic-key"
GEMINI_API_KEY="your-gemini-key"
TOGETHER_API_KEY="your-together-key"
```

Save the file, then restart R. You can confirm that R now sees the keys:
``` r
check_llm_api_keys()
```

pairwiseLLM includes a set of built-in prompt templates, plus helpers to register and list them:

- `register_prompt_template()`
- `list_prompt_templates()`

``` r
list_prompt_templates()
#> [1] "default" "test1" "test2" "test3" "test4" "test5"
```

The default template can be inspected with `get_prompt_template()`:

``` r
tmpl <- get_prompt_template("default")
cat(substr(tmpl, 1, 400), "...\n")
#> You are a debate adjudicator. Your task is to weigh the comparative strengths of two writing samples regarding a specific trait.
#>
#> TRAIT: {TRAIT_NAME}
#> DEFINITION: {TRAIT_DESCRIPTION}
#>
#> SAMPLES:
#>
#> === SAMPLE_1 ===
#> {SAMPLE_1}
#>
#> === SAMPLE_2 ===
#> {SAMPLE_2}
#>
#> EVALUATION PROCESS (Mental Simulation):
#>
#> 1. **Advocate for SAMPLE_1**: Mentally list the single strongest point of evidence that makes SAMPLE_1 the ...
```

You can register your own template with `register_prompt_template()`:

``` r
register_prompt_template("my_template", "
Compare two essays for {TRAIT_NAME}…
{TRAIT_NAME} is defined as {TRAIT_DESCRIPTION}.
SAMPLE 1:
{SAMPLE_1}
SAMPLE 2:
{SAMPLE_2}
<BETTER_SAMPLE>SAMPLE_1</BETTER_SAMPLE> or
<BETTER_SAMPLE>SAMPLE_2</BETTER_SAMPLE>
")
```

Use it in a submission:

``` r
tmpl <- get_prompt_template("my_template")
```
trait_description("overall_quality")
#> $name
#> [1] "Overall Quality"
#>
#> $description
#> [1] "Overall quality of the writing, considering how well ideas are expressed,\n how clearly the writing is organized, and how effective the language and\n conventions are."You can also provide custom traits:
``` r
trait_description(
  custom_name = "Clarity",
  custom_description = "How understandable, coherent, and well structured the ideas are."
)
```

LLMs often show a first-position or second-position bias. pairwiseLLM includes explicit tools for testing this.
``` r
pairs_fwd <- make_pairs(example_writing_samples)
pairs_rev <- sample_reverse_pairs(pairs_fwd, reverse_pct = 1.0)
```

Submit:

``` r
res_fwd <- submit_llm_pairs(pairs_fwd, model = "gpt-4o", backend = "openai", ...)
res_rev <- submit_llm_pairs(pairs_rev, model = "gpt-4o", backend = "openai", ...)
```

Compute bias:

``` r
cons <- compute_reverse_consistency(res_fwd, res_rev)
bias <- check_positional_bias(cons)

cons$summary
bias$summary
```

Five included templates have been tested across different backend providers. Complete details are presented in the vignette: `vignette("prompt-template-bias")`.
Fit a Bradley–Terry model to the comparison results:

``` r
bt_data <- build_bt_data(res)
bt_fit <- fit_bt_model(bt_data)
summarize_bt_fit(bt_fit)
```

Or fit an Elo model:

``` r
# res: output from submit_llm_pairs() / llm_submit_pairs_batch()
elo_data <- build_elo_data(res)
elo_fit <- fit_elo_model(elo_data, runs = 5)
elo_fit$elo
elo_fit$reliability
elo_fit$reliability_weighted
```

The two submission workflows are summarized below:

| Workflow | Use Case | Functions |
|---|---|---|
| Live | small or interactive runs | submit_llm_pairs, llm_compare_pair |
| Batch | large jobs, cost control | llm_submit_pairs_batch, llm_download_batch_results |
Contributions to pairwiseLLM are very welcome!

If you encounter a problem, run:

``` r
devtools::session_info()
```

Include the output of `devtools::session_info()` in your report, and open an issue at:
https://github.com/shmercer/pairwiseLLM/issues
MIT License. See LICENSE.
Mercer, S. H. (2025). pairwiseLLM: Pairwise writing quality comparisons with large language models (Version 1.0.0) [R package; Computer software]. https://github.com/shmercer/pairwiseLLM