
Getting Started with pairwiseLLM

1. Introduction

pairwiseLLM provides a unified workflow for generating and analyzing pairwise comparisons of writing quality using commercial LLM APIs (OpenAI, Anthropic, Gemini, Together) and local models served via Ollama.

A typical workflow (sketched in code after this list):

  1. Select writing samples
  2. Construct pairwise comparison sets
  3. Submit comparisons to an LLM (live or batch API)
  4. Parse model outputs
  5. Fit Bradley–Terry or Elo models to obtain latent writing-quality scores
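
As a compact sketch, the whole workflow chains together the functions described in the sections below. This is an outline rather than a turnkey script: it assumes a valid OPENAI_API_KEY and will incur API calls.

library(pairwiseLLM)

data("example_writing_samples", package = "pairwiseLLM")

# Steps 1-2: select samples and build a randomized subset of pairs
pairs <- example_writing_samples |>
  make_pairs() |>
  sample_pairs(n_pairs = 10, seed = 123) |>
  randomize_pair_order(seed = 99)

# Step 3: submit comparisons to a live LLM backend
td   <- trait_description("overall_quality")
tmpl <- set_prompt_template()
res  <- submit_llm_pairs(
  pairs             = pairs,
  backend           = "openai",
  model             = "gpt-4o",
  trait_name        = td$name,
  trait_description = td$description,
  prompt_template   = tmpl
)

# Steps 4-5: convert the parsed results to BT data and fit the model
bt_fit <- fit_bt_model(build_bt_data(res))
summarize_bt_fit(bt_fit)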

For prompt evaluation and positional-bias diagnostics, see vignette("prompt-template-bias").

For advanced batch processing workflows, see the package's batch processing vignette.


2. Setting API Keys

pairwiseLLM reads provider keys only from environment variables, never from R options or global variables.

Provider    Environment variable
OpenAI      OPENAI_API_KEY
Anthropic   ANTHROPIC_API_KEY
Gemini      GEMINI_API_KEY
Together    TOGETHER_API_KEY

You should put these in your ~/.Renviron:

OPENAI_API_KEY="sk-..."
ANTHROPIC_API_KEY="..."
GEMINI_API_KEY="..."
TOGETHER_API_KEY="..."
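
If you prefer not to edit ~/.Renviron, a key can also be set for the current session only (the value below is just a placeholder):

# Session-only alternative; not persisted across restarts.
Sys.setenv(OPENAI_API_KEY = "sk-...")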

Check which keys are available:

library(pairwiseLLM)

check_llm_api_keys()
#> All known LLM API keys are set: OPENAI_API_KEY, ANTHROPIC_API_KEY, GEMINI_API_KEY, TOGETHER_API_KEY.
#> # A tibble: 4 × 4
#>   backend   service        env_var           has_key
#> 1 openai    OpenAI         OPENAI_API_KEY    TRUE
#> 2 anthropic Anthropic      ANTHROPIC_API_KEY TRUE
#> 3 gemini    Google Gemini  GEMINI_API_KEY    TRUE
#> 4 together  Together.ai    TOGETHER_API_KEY  TRUE

Ollama runs locally and does not require an API key; it only needs a running Ollama server.
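
If you want to confirm the server is reachable before submitting comparisons, a quick check along these lines works (a sketch only: it assumes Ollama's default address http://localhost:11434 and uses the httr package, which pairwiseLLM itself does not require):

# TRUE if a local Ollama server answers on its default port.
ollama_running <- tryCatch(
  httr::status_code(httr::GET("http://localhost:11434/api/tags")) == 200,
  error = function(e) FALSE
)
ollama_running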


3. Example Writing Data

The package ships with 20 authentic student writing samples:

data("example_writing_samples", package = "pairwiseLLM")
dplyr::slice_head(example_writing_samples, n = 3)
#> # A tibble: 3 × 3
#>   ID    text                                                       quality_score
#>   <chr> <chr>                                                              <int>
#> 1 S01   "Writing assessment is hard. People write different thing…             1
#> 2 S02   "It is hard to grade writing. Some are long and some are …             2
#> 3 S03   "Assessing writing is difficult because everyone writes d…             3

Each sample has an ID, the full text of the sample, and a reference quality_score.


4. Constructing Pairwise Comparisons

Create all unordered pairs:

pairs <- example_writing_samples |>
  make_pairs()

dplyr::slice_head(pairs, n = 5)
#> # A tibble: 5 × 4
#>   ID1   text1                                                        ID2   text2
#>   <chr> <chr>                                                        <chr> <chr>
#> 1 S01   "Writing assessment is hard. People write different things.… S02   "It …
#> 2 S01   "Writing assessment is hard. People write different things.… S03   "Ass…
#> 3 S01   "Writing assessment is hard. People write different things.… S04   "Gra…
#> 4 S01   "Writing assessment is hard. People write different things.… S05   "Wri…
#> 5 S01   "Writing assessment is hard. People write different things.… S06   "It …

Sample a subset of pairs:

pairs_small <- sample_pairs(pairs, n_pairs = 10, seed = 123)

Randomize which sample is presented as SAMPLE_1 and which as SAMPLE_2 (to mitigate positional bias):

pairs_small <- randomize_pair_order(pairs_small, seed = 99)

5. Traits and Prompt Templates

5.1 Using a built-in trait

td <- trait_description("overall_quality")
td
#> $name
#> [1] "Overall Quality"
#> 
#> $description
#> [1] "Overall quality of the writing, considering how well ideas are expressed,\n      how clearly the writing is organized, and how effective the language and\n      conventions are."

Or define your own:

td_custom <- trait_description(
  custom_name = "Clarity",
  custom_description = "How clearly and effectively ideas are expressed."
)

5.2 Using or customizing prompt templates

Load default prompt:

tmpl <- set_prompt_template()
cat(substr(tmpl, 1, 300))
#> You are a debate adjudicator. Your task is to weigh the comparative strengths of two writing samples regarding a specific trait.
#> 
#> TRAIT: {TRAIT_NAME}
#> DEFINITION: {TRAIT_DESCRIPTION}
#> 
#> SAMPLES:
#> 
#> === SAMPLE_1 ===
#> {SAMPLE_1}
#> 
#> === SAMPLE_2 ===
#> {SAMPLE_2}
#> 
#> EVALUATION PROCESS (Mental Simulation):
#> 
#> 1.  **Ad

The template must contain the placeholders {TRAIT_NAME}, {TRAIT_DESCRIPTION}, {SAMPLE_1}, and {SAMPLE_2}.

Load a template from file:

set_prompt_template(file = "my_template.txt")
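
For example, a minimal custom template could be written to disk and then loaded (a sketch only: it keeps the four required placeholders, but the response format you ask for must match what the package expects to parse):

my_template <- "Compare the two writing samples on the trait below.

TRAIT: {TRAIT_NAME}
DEFINITION: {TRAIT_DESCRIPTION}

=== SAMPLE_1 ===
{SAMPLE_1}

=== SAMPLE_2 ===
{SAMPLE_2}

Decide which sample is stronger on this trait."
writeLines(my_template, "my_template.txt")
tmpl_custom <- set_prompt_template(file = "my_template.txt")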

6. Live Pairwise Comparisons

The unified wrapper works for OpenAI, Anthropic, Gemini, Together, and Ollama.

res_live <- submit_llm_pairs(
  pairs             = pairs_small,
  backend           = "openai", # also "anthropic", "gemini", "together", "ollama"
  model             = "gpt-4o",
  trait_name        = td$name,
  trait_description = td$description,
  prompt_template   = tmpl
)
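
The same call can target a local model; for example (assuming the Ollama server is running, and with "llama3" standing in for whatever model you have pulled locally):

res_ollama <- submit_llm_pairs(
  pairs             = pairs_small,
  backend           = "ollama",
  model             = "llama3",   # replace with a locally installed model
  trait_name        = td$name,
  trait_description = td$description,
  prompt_template   = tmpl
)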

Preview results:

dplyr::slice_head(res_live, n = 5)

Each row includes:


7. Preparing Data for BT or Elo Modeling

Convert LLM output to a 3-column BT dataset:

# res_live: output from submit_llm_pairs()
bt_data <- build_bt_data(res_live)
dplyr::slice_head(bt_data, n = 5)

and/or a dataset for Elo modeling:

# res_live: output from submit_llm_pairs()
elo_data <- build_elo_data(res_live)

8. Bradley–Terry Modeling

Fit model:

bt_fit <- fit_bt_model(bt_data)

Summarize results:

summarize_bt_fit(bt_fit)

The output includes:

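As an informal sanity check, the fitted scores can be compared with the reference quality_score shipped with the example data. This is a hypothetical sketch: the column names "ID" and "theta" are assumptions, so inspect the summary returned in your session and adjust accordingly.

bt_summary <- summarize_bt_fit(bt_fit)
scored <- merge(bt_summary,
                example_writing_samples[, c("ID", "quality_score")],
                by = "ID")
cor(scored$theta, scored$quality_score, method = "spearman")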

9. Elo Modeling

elo_fit <- fit_elo_model(elo_data, runs = 5)
elo_fit

Outputs:


10. Batch APIs (Large Jobs)

10.1 Submit a batch

batch <- llm_submit_pairs_batch(
  backend            = "openai",
  model              = "gpt-4o",
  pairs              = pairs_small,
  trait_name         = td$name,
  trait_description  = td$description,
  prompt_template    = tmpl
)

10.2 Download results

res_batch <- llm_download_batch_results(batch)
head(res_batch)
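
Assuming the downloaded batch results share the structure of live results, they feed the same downstream steps:

# Convert batch results for Bradley-Terry (or Elo) modeling as before.
bt_data_batch <- build_bt_data(res_batch)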

11. Backend-Specific Tools

Most users can rely on the unified interface, but backend-specific helpers are also available:

11.1 OpenAI

11.2 Anthropic

11.3 Google Gemini

11.4 Together.ai (live only)

11.5 Ollama (local, live only)


12. Troubleshooting

Missing API keys

check_llm_api_keys()
#> All known LLM API keys are set: OPENAI_API_KEY, ANTHROPIC_API_KEY, GEMINI_API_KEY, TOGETHER_API_KEY.
#> # A tibble: 4 × 4
#>   backend   service       env_var           has_key
#>   <chr>     <chr>         <chr>             <lgl>  
#> 1 openai    OpenAI        OPENAI_API_KEY    TRUE   
#> 2 anthropic Anthropic     ANTHROPIC_API_KEY TRUE   
#> 3 gemini    Google Gemini GEMINI_API_KEY    TRUE   
#> 4 together  Together.ai   TOGETHER_API_KEY  TRUE

Chain-of-thought leakage

Use the default template or set include_thoughts = FALSE.
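
For example (assuming include_thoughts is an argument of submit_llm_pairs(); check the function's help page for where this option actually lives):

res_live <- submit_llm_pairs(
  pairs             = pairs_small,
  backend           = "openai",
  model             = "gpt-4o",
  trait_name        = td$name,
  trait_description = td$description,
  prompt_template   = tmpl,
  include_thoughts  = FALSE   # assumption: see ?submit_llm_pairs
)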

Timeouts

Use the batch APIs (Section 10) for jobs with more than about 40 pairs.

Positional bias

Use compute_reverse_consistency() and check_positional_bias() (see vignette("prompt-template-bias") for a full example).


13. Citation

Mercer, S. (2025). Getting started with pairwiseLLM (Version 1.0.0) [R package vignette]. In pairwiseLLM: Pairwise Comparison Tools for Large Language Model-Based Writing Evaluation. https://shmercer.github.io/pairwiseLLM/
