pairwiseLLM provides a unified workflow for generating and analyzing pairwise comparisons of writing quality using LLM APIs (OpenAI, Anthropic, Gemini, Together) and local models via Ollama.
A typical workflow is demonstrated below.
For prompt evaluation and positional-bias diagnostics, see vignette("prompt-template-bias").
For advanced batch-processing workflows, see the dedicated batch-processing vignette.
pairwiseLLM reads provider keys only from
environment variables, never from R options or global
variables.
| Provider | Environment Variable |
|---|---|
| OpenAI | OPENAI_API_KEY |
| Anthropic | ANTHROPIC_API_KEY |
| Gemini | GEMINI_API_KEY |
| Together | TOGETHER_API_KEY |
You should put these in your ~/.Renviron:
OPENAI_API_KEY="sk-..."
ANTHROPIC_API_KEY="..."
GEMINI_API_KEY="..."
TOGETHER_API_KEY="..."
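For a quick interactive test you can also set a key for the current session only with base R's Sys.setenv() (your ~/.Renviron remains the recommended place):

# Session-only alternative; replace with your real key
Sys.setenv(OPENAI_API_KEY = "sk-...")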
Check which keys are available:
library(pairwiseLLM)
check_llm_api_keys()
#> All known LLM API keys are set: OPENAI_API_KEY, ANTHROPIC_API_KEY, GEMINI_API_KEY, TOGETHER_API_KEY.
#> # A tibble: 4 × 4
#>   backend   service       env_var           has_key
#>   <chr>     <chr>         <chr>             <lgl>
#> 1 openai    OpenAI        OPENAI_API_KEY    TRUE
#> 2 anthropic Anthropic     ANTHROPIC_API_KEY TRUE
#> 3 gemini    Google Gemini GEMINI_API_KEY    TRUE
#> 4 together  Together.ai   TOGETHER_API_KEY  TRUE
Ollama runs locally and does not require an API key; it only requires that the Ollama server is running.
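Assuming the ollama command-line tool is installed, a quick way to confirm the server is reachable from R is:

# Lists locally pulled models; fails with a connection error if the server is down
system("ollama list")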
The package ships with 20 authentic student writing samples:
data("example_writing_samples", package = "pairwiseLLM")
dplyr::slice_head(example_writing_samples, n = 3)
#> # A tibble: 3 × 3
#> ID text quality_score
#> <chr> <chr> <int>
#> 1 S01 "Writing assessment is hard. People write different thing… 1
#> 2 S02 "It is hard to grade writing. Some are long and some are … 2
#> 3 S03 "Assessing writing is difficult because everyone writes d… 3Each sample has:
IDtextCreate all unordered pairs:
pairs <- example_writing_samples |>
make_pairs()
dplyr::slice_head(pairs, n = 5)
#> # A tibble: 5 × 4
#> ID1 text1 ID2 text2
#> <chr> <chr> <chr> <chr>
#> 1 S01 "Writing assessment is hard. People write different things.… S02 "It …
#> 2 S01 "Writing assessment is hard. People write different things.… S03 "Ass…
#> 3 S01 "Writing assessment is hard. People write different things.… S04 "Gra…
#> 4 S01 "Writing assessment is hard. People write different things.… S05 "Wri…
#> 5 S01 "Writing assessment is hard. People write different things.… S06 "It …Sample a subset of pairs:
Randomize SAMPLE_1 / SAMPLE_2 order:
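The package likely provides a helper for this; a base-R sketch that swaps the two samples in a random half of the rows, so neither position is systematically favoured, is:

# Flip ID1/text1 with ID2/text2 for a random ~50% of the rows
set.seed(456)
flip <- runif(nrow(pairs_small)) < 0.5
pairs_small[flip, c("ID1", "text1", "ID2", "text2")] <-
  pairs_small[flip, c("ID2", "text2", "ID1", "text1")]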
td <- trait_description("overall_quality")
td
#> $name
#> [1] "Overall Quality"
#>
#> $description
#> [1] "Overall quality of the writing, considering how well ideas are expressed,\n how clearly the writing is organized, and how effective the language and\n conventions are."Or define your own:
Load default prompt:
tmpl <- set_prompt_template()
cat(substr(tmpl, 1, 300))
#> You are a debate adjudicator. Your task is to weigh the comparative strengths of two writing samples regarding a specific trait.
#>
#> TRAIT: {TRAIT_NAME}
#> DEFINITION: {TRAIT_DESCRIPTION}
#>
#> SAMPLES:
#>
#> === SAMPLE_1 ===
#> {SAMPLE_1}
#>
#> === SAMPLE_2 ===
#> {SAMPLE_2}
#>
#> EVALUATION PROCESS (Mental Simulation):
#>
#> 1. **Ad

Placeholders required:

- {TRAIT_NAME}
- {TRAIT_DESCRIPTION}
- {SAMPLE_1}
- {SAMPLE_2}

Load a template from file:
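One package-agnostic approach is to read the file with base R and confirm the required placeholders are present (the file path here is hypothetical):

# Read a plain-text template (hypothetical path) and collapse to one string
tmpl <- paste(readLines("my_prompt_template.txt"), collapse = "\n")

# Sanity check: all four placeholders must appear in the template
placeholders <- c("{TRAIT_NAME}", "{TRAIT_DESCRIPTION}", "{SAMPLE_1}", "{SAMPLE_2}")
all(vapply(placeholders, grepl, logical(1), x = tmpl, fixed = TRUE))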
The unified wrapper works for OpenAI, Anthropic, Gemini, Together, and Ollama.
res_live <- submit_llm_pairs(
pairs = pairs_small,
backend = "openai", # also "anthropic", "gemini", "together", "ollama"
model = "gpt-4o",
trait_name = td$name,
trait_description = td$description,
prompt_template = tmpl
)

Preview results:
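As a minimal check with dplyr:

# Inspect the structure and the first few comparisons
dplyr::glimpse(res_live)
dplyr::slice_head(res_live, n = 3)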
Each row includes:
- pair_id
- sample1_id, sample2_id
- <BETTER_SAMPLE> tag → better_sample and better_id

Convert LLM output to a 3-column BT dataset:
# res_live: output from submit_llm_pairs()
bt_data <- build_bt_data(res_live)
dplyr::slice_head(bt_data, n = 5)

You can also build a dataset for Elo modeling:
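pairwiseLLM may provide its own Elo-format builder; a minimal dplyr sketch, using the column names documented above, derives one winner/loser match per comparison:

# One match per comparison: the winner is better_id, the loser is the other sample
elo_data <- res_live |>
  dplyr::transmute(
    winner = better_id,
    loser  = ifelse(better_id == sample1_id, sample2_id, sample1_id)
  )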
Fit model:
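pairwiseLLM's own model-fitting step is not reproduced here; as one option outside the package, the sirt package fits a Bradley-Terry model to a three-column (object 1, object 2, result) data frame, which is the layout bt_data is assumed to follow:

# Assumption: bt_data uses the (ID1, ID2, result) layout expected by sirt::btm()
# install.packages("sirt")
fit <- sirt::btm(as.data.frame(bt_data))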
Summarize results:
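Continuing the sirt-based sketch, the fitted object can be inspected with:

# Prints model fit and ability estimates; per-sample estimates are assumed
# to be stored in fit$effects
summary(fit)
head(fit$effects)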
The output includes:
Outputs:
Most users will use the unified submit_llm_pairs() interface, but backend-specific helpers are also available:
- OpenAI: submit_openai_pairs_live(), build_openai_batch_requests(), run_openai_batch_pipeline(), parse_openai_batch_output()
- Anthropic: submit_anthropic_pairs_live(), build_anthropic_batch_requests(), run_anthropic_batch_pipeline(), parse_anthropic_batch_output()
- Gemini: submit_gemini_pairs_live(), build_gemini_batch_requests(), run_gemini_batch_pipeline(), parse_gemini_batch_output()
- Together: together_compare_pair_live(), submit_together_pairs_live()
- Ollama: ollama_compare_pair_live(), submit_ollama_pairs_live()

You can re-check API keys at any time:

check_llm_api_keys()
#> All known LLM API keys are set: OPENAI_API_KEY, ANTHROPIC_API_KEY, GEMINI_API_KEY, TOGETHER_API_KEY.
#> # A tibble: 4 × 4
#>   backend   service       env_var           has_key
#>   <chr>     <chr>         <chr>             <lgl>
#> 1 openai    OpenAI        OPENAI_API_KEY    TRUE
#> 2 anthropic Anthropic     ANTHROPIC_API_KEY TRUE
#> 3 gemini    Google Gemini GEMINI_API_KEY    TRUE
#> 4 together  Together.ai   TOGETHER_API_KEY  TRUE

Tips:

- Use the default template or set include_thoughts = FALSE.
- Use batch APIs for more than 40 pairs.
- Use compute_reverse_consistency() + check_positional_bias() (see vignette("prompt-template-bias") for a full example).
Mercer, S. (2025). Getting started with pairwiseLLM (Version 1.0.0) [R package vignette]. In pairwiseLLM: Pairwise Comparison Tools for Large Language Model-Based Writing Evaluation. https://shmercer.github.io/pairwiseLLM/