
Getting Started with pairwiseLLM

1. Introduction

pairwiseLLM provides a unified workflow for generating and analyzing pairwise comparisons of writing quality using commercial LLM APIs (OpenAI, Anthropic, Gemini, Together) and local models served via Ollama.

A typical workflow (sketched in code after this list):

  1. Select writing samples
  2. Construct pairwise comparison sets
  3. Submit comparisons to an LLM (live or batch API)
  4. Parse model outputs
  5. Fit Bradley–Terry or Elo models to obtain latent writing-quality scores
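
As a compact sketch, the whole workflow chains together the functions described in the sections below. This is an outline rather than a turnkey script: it assumes a valid OPENAI_API_KEY and will incur API calls.

library(pairwiseLLM)

data("example_writing_samples", package = "pairwiseLLM")

# Steps 1-2: select samples and build a randomized subset of pairs
pairs <- example_writing_samples |>
  make_pairs() |>
  sample_pairs(n_pairs = 10, seed = 123) |>
  randomize_pair_order(seed = 99)

# Step 3: submit comparisons to a live LLM backend
td   <- trait_description("overall_quality")
tmpl <- set_prompt_template()
res  <- submit_llm_pairs(
  pairs             = pairs,
  backend           = "openai",
  model             = "gpt-4o",
  trait_name        = td$name,
  trait_description = td$description,
  prompt_template   = tmpl
)

# Steps 4-5: convert the parsed results to BT data and fit the model
bt_fit <- fit_bt_model(build_bt_data(res))
summarize_bt_fit(bt_fit)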

For prompt evaluation and positional-bias diagnostics, see vignette("prompt-template-bias").

For advanced batch processing workflows, see the package's batch processing vignette.


2. Setting API Keys

pairwiseLLM reads provider keys only from environment variables, never from R options or global variables.

Provider    Environment variable
OpenAI      OPENAI_API_KEY
Anthropic   ANTHROPIC_API_KEY
Gemini      GEMINI_API_KEY
Together    TOGETHER_API_KEY

You should put these in your ~/.Renviron:

OPENAI_API_KEY="sk-..."
ANTHROPIC_API_KEY="..."
GEMINI_API_KEY="..."
TOGETHER_API_KEY="..."
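
If you prefer not to edit ~/.Renviron, a key can also be set for the current session only (the value below is just a placeholder):

# Session-only alternative; not persisted across restarts.
Sys.setenv(OPENAI_API_KEY = "sk-...")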

Check which keys are available:

library(pairwiseLLM)

check_llm_api_keys()
#> All known LLM API keys are set: OPENAI_API_KEY, ANTHROPIC_API_KEY, GEMINI_API_KEY, TOGETHER_API_KEY.
#> # A tibble: 4 × 4
#>   backend   service        env_var           has_key
#> 1 openai    OpenAI         OPENAI_API_KEY    TRUE
#> 2 anthropic Anthropic      ANTHROPIC_API_KEY TRUE
#> 3 gemini    Google Gemini  GEMINI_API_KEY    TRUE
#> 4 together  Together.ai    TOGETHER_API_KEY  TRUE

Ollama runs locally and does not require an API key; it only needs a running Ollama server.
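
If you want to confirm the server is reachable before submitting comparisons, a quick check along these lines works (a sketch only: it assumes Ollama's default address http://localhost:11434 and uses the httr package, which pairwiseLLM itself does not require):

# TRUE if a local Ollama server answers on its default port.
ollama_running <- tryCatch(
  httr::status_code(httr::GET("http://localhost:11434/api/tags")) == 200,
  error = function(e) FALSE
)
ollama_running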


3. Example Writing Data

The package ships with 20 authentic student writing samples:

data("example_writing_samples", package = "pairwiseLLM")
dplyr::slice_head(example_writing_samples, n = 3)
#> # A tibble: 3 × 3
#>   ID    text                                                       quality_score
#>   <chr> <chr>                                                              <int>
#> 1 S01   "Writing assessment is hard. People write different thing…             1
#> 2 S02   "It is hard to grade writing. Some are long and some are …             2
#> 3 S03   "Assessing writing is difficult because everyone writes d…             3

Each sample has an ID, the full text of the sample, and a reference quality_score.


4. Constructing Pairwise Comparisons

Create all unordered pairs:

pairs <- example_writing_samples |>
  make_pairs()

dplyr::slice_head(pairs, n = 5)
#> # A tibble: 5 × 4
#>   ID1   text1                                                        ID2   text2
#>   <chr> <chr>                                                        <chr> <chr>
#> 1 S01   "Writing assessment is hard. People write different things.… S02   "It …
#> 2 S01   "Writing assessment is hard. People write different things.… S03   "Ass…
#> 3 S01   "Writing assessment is hard. People write different things.… S04   "Gra…
#> 4 S01   "Writing assessment is hard. People write different things.… S05   "Wri…
#> 5 S01   "Writing assessment is hard. People write different things.… S06   "It …

Sample a subset of pairs:

pairs_small <- sample_pairs(pairs, n_pairs = 10, seed = 123)

Randomize which sample is presented as SAMPLE_1 and which as SAMPLE_2 (to mitigate positional bias):

pairs_small <- randomize_pair_order(pairs_small, seed = 99)

5. Traits and Prompt Templates

5.1 Using a built-in trait

td <- trait_description("overall_quality")
td
#> $name
#> [1] "Overall Quality"
#> 
#> $description
#> [1] "Overall quality of the writing, considering how well ideas are expressed,\n      how clearly the writing is organized, and how effective the language and\n      conventions are."

Or define your own:

td_custom <- trait_description(
  custom_name = "Clarity",
  custom_description = "How clearly and effectively ideas are expressed."
)

5.2 Using or customizing prompt templates

Load default prompt:

tmpl <- set_prompt_template()
cat(substr(tmpl, 1, 300))
#> You are a debate adjudicator. Your task is to weigh the comparative strengths of two writing samples regarding a specific trait.
#> 
#> TRAIT: {TRAIT_NAME}
#> DEFINITION: {TRAIT_DESCRIPTION}
#> 
#> SAMPLES:
#> 
#> === SAMPLE_1 ===
#> {SAMPLE_1}
#> 
#> === SAMPLE_2 ===
#> {SAMPLE_2}
#> 
#> EVALUATION PROCESS (Mental Simulation):
#> 
#> 1.  **Ad

The template must contain the placeholders {TRAIT_NAME}, {TRAIT_DESCRIPTION}, {SAMPLE_1}, and {SAMPLE_2}.

Load a template from file:

set_prompt_template(file = "my_template.txt")
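
For example, a minimal custom template could be written to disk and then loaded (a sketch only: it keeps the four required placeholders, but the response format you ask for must match what the package expects to parse):

my_template <- "Compare the two writing samples on the trait below.

TRAIT: {TRAIT_NAME}
DEFINITION: {TRAIT_DESCRIPTION}

=== SAMPLE_1 ===
{SAMPLE_1}

=== SAMPLE_2 ===
{SAMPLE_2}

Decide which sample is stronger on this trait."
writeLines(my_template, "my_template.txt")
tmpl_custom <- set_prompt_template(file = "my_template.txt")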

6. Live Pairwise Comparisons

The unified wrapper works for OpenAI, Anthropic, Gemini, Together, and Ollama.

res_live <- submit_llm_pairs(
  pairs             = pairs_small,
  backend           = "openai", # also "anthropic", "gemini", "together", "ollama"
  model             = "gpt-4o",
  trait_name        = td$name,
  trait_description = td$description,
  prompt_template   = tmpl
)
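
The same call can target a local model; for example (assuming the Ollama server is running, and with "llama3" standing in for whatever model you have pulled locally):

res_ollama <- submit_llm_pairs(
  pairs             = pairs_small,
  backend           = "ollama",
  model             = "llama3",   # replace with a locally installed model
  trait_name        = td$name,
  trait_description = td$description,
  prompt_template   = tmpl
)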

Preview results:

dplyr::slice_head(res_live, n = 5)

Each row includes:


7. Preparing Data for BT or Elo Modeling

Convert LLM output to a 3-column BT dataset:

# res_live: output from submit_llm_pairs()
bt_data <- build_bt_data(res_live)
dplyr::slice_head(bt_data, n = 5)

and/or a dataset for Elo modeling:

# res_live: output from submit_llm_pairs()
elo_data <- build_elo_data(res_live)

8. Bradley–Terry Modeling

Fit model:

bt_fit <- fit_bt_model(bt_data)

Summarize results:

summarize_bt_fit(bt_fit)

The output includes:

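As an informal sanity check, the fitted scores can be compared with the reference quality_score shipped with the example data. This is a hypothetical sketch: the column names "ID" and "theta" are assumptions, so inspect the summary returned in your session and adjust accordingly.

bt_summary <- summarize_bt_fit(bt_fit)
scored <- merge(bt_summary,
                example_writing_samples[, c("ID", "quality_score")],
                by = "ID")
cor(scored$theta, scored$quality_score, method = "spearman")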

9. Elo Modeling

elo_fit <- fit_elo_model(elo_data, runs = 5)
elo_fit

Outputs:


10. Batch APIs (Large Jobs)

10.1 Submit a batch

batch <- llm_submit_pairs_batch(
  backend            = "openai",
  model              = "gpt-4o",
  pairs              = pairs_small,
  trait_name         = td$name,
  trait_description  = td$description,
  prompt_template    = tmpl
)

10.2 Download results

res_batch <- llm_download_batch_results(batch)
head(res_batch)
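
Assuming the downloaded batch results share the structure of live results, they feed the same downstream steps:

# Convert batch results for Bradley-Terry (or Elo) modeling as before.
bt_data_batch <- build_bt_data(res_batch)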

11. Backend-Specific Tools

Most users can rely on the unified interface, but backend-specific helpers are also available:

11.1 OpenAI

11.2 Anthropic

11.3 Google Gemini

11.4 Together.ai (live only)

11.5 Ollama (local, live only)


12. Troubleshooting

Missing API keys

check_llm_api_keys()
#> All known LLM API keys are set: OPENAI_API_KEY, ANTHROPIC_API_KEY, GEMINI_API_KEY, TOGETHER_API_KEY.
#> # A tibble: 4 × 4
#>   backend   service       env_var           has_key
#>   <chr>     <chr>         <chr>             <lgl>  
#> 1 openai    OpenAI        OPENAI_API_KEY    TRUE   
#> 2 anthropic Anthropic     ANTHROPIC_API_KEY TRUE   
#> 3 gemini    Google Gemini GEMINI_API_KEY    TRUE   
#> 4 together  Together.ai   TOGETHER_API_KEY  TRUE

Chain-of-thought leakage

Use the default template or set include_thoughts = FALSE.
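
For example (assuming include_thoughts is an argument of submit_llm_pairs(); check the function's help page for where this option actually lives):

res_live <- submit_llm_pairs(
  pairs             = pairs_small,
  backend           = "openai",
  model             = "gpt-4o",
  trait_name        = td$name,
  trait_description = td$description,
  prompt_template   = tmpl,
  include_thoughts  = FALSE   # assumption: see ?submit_llm_pairs
)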

Timeouts

Use the batch APIs (Section 10) for jobs with more than about 40 pairs.

Positional bias

Use compute_reverse_consistency() and check_positional_bias() (see vignette("prompt-template-bias") for a full example).


13. Citation

Mercer, S. (2025). Getting started with pairwiseLLM (Version 1.0.0) [R package vignette]. In pairwiseLLM: Pairwise Comparison Tools for Large Language Model-Based Writing Evaluation. https://shmercer.github.io/pairwiseLLM/
