This package implements an adaptation of the Higher-Criticism (HC) test to discriminate between two frequency tables [1].
The package includes two main functions:
- two.sample.pvals – produces a list of P-values, one for each feature in the two tables.
- HC.vals – computes the HC score of the P-values.
A third function, two.sample.HC, combines the two functions above so that the HC score of the two tables is obtained with a single function call.
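To make the idea of an "HC score of the P-values" concrete, here is a minimal sketch of the textbook Higher-Criticism statistic in base R. It is an illustration under stated assumptions, not the package's own code: the helper name `hc_sketch`, the `gamma` argument (the fraction of smallest P-values to scan), and the exact normalization are all assumptions, so its values may differ from those returned by HC.vals.

```r
# Hypothetical helper, NOT part of the package: textbook HC statistic.
# Assumptions: NA P-values are dropped; only the smallest `gamma`
# fraction of the ordered P-values is scanned; 0 < gamma < 1.
hc_sketch <- function(pvals, gamma = 0.2) {
  stopifnot(gamma > 0, gamma < 1)
  p <- sort(pvals[!is.na(pvals)])   # ordered P-values, NAs removed
  n <- length(p)
  stopifnot(n >= 2)
  i <- seq_len(n)
  keep <- i <= max(1, floor(gamma * n))  # indices that are scanned
  u <- i[keep] / n                  # uniform quantiles expected under the null
  # standardized gap between the expected quantile and the observed P-value
  z <- sqrt(n) * (u - p[keep]) / sqrt(u * (1 - u))
  i_max <- which.max(z)
  list(HC = z[i_max],               # largest standardized gap
       p  = p[i_max])               # P-value at which it is attained
}

# Usage: one very small P-value among otherwise unremarkable ones
res <- hc_sketch(c(0.001, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 0.3, 0.4, 0.2))
print(res$HC)
print(res$p)
```

A large score indicates that the smallest P-values are smaller than a uniform sample would make them, which is the kind of departure the HC test is designed to pick up.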
# Can be used to check similarity of word-frequencies in texts:
text1 = "On the day House Democrats opened an impeachment inquiry of
President Trump last week, Pete Buttigieg was being grilled by Iowa
voters on other subjects: how to loosen the grip of the rich on
government, how to restore science to policymaking, how to reduce child
poverty. At an event in eastern Iowa, a woman rose to say that her four
adult children were “stuck” in life, unable to afford what she had in
the 1980s when a $10-an-hour job paid for rent, utilities and an
annual vacation."
text2 = "How can the federal government help our young people that want to do
what’s right and to get to those things that their parents worked so hard for?”
the voter asked. This is the conversation Mr. Buttigieg wants to have.
Boasting a huge financial war chest but struggling in the polls, Mr. Buttigieg
is now staking his presidential candidacy on Iowa, and particularly on
connecting with rural white voters who want to talk about personal concerns
more than impeachment. In doing so, Mr. Buttigieg is also trying to show how
Democrats can win back counties that flipped from Barack Obama to Donald
Trump in 2016 — there are more of them in Iowa than any other state —
by focusing, he said, on “the things that are going to affect folks’
lives in a concrete way."
tb1 = table(strsplit(tolower(text1),' '))
tb2 = table(strsplit(tolower(text2),' '))
pv = two.sample.pvals(tb1,tb2)
print(pv$pv)
> [1] 1.0000 1.0000 0.2304 1.0000 1.0000 1.0000 NA 0.1936 NA
print(pv$Var1)
> go i or say should stay you and not
HC.vals(pv$pv)
> $HC
> 0.323954762194625
> $HC.star
> 0.323954762194625
> $p
> 0.2304
> $p.star
> 0.2304
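Splitting on a single space, as in the example above, leaves punctuation and line breaks attached to words, so "Iowa," and "Iowa" are counted as different features. The following is a sketch of a slightly more careful tokenization in base R. The helper name `word_table` is hypothetical and not part of the package, and it is assumed that two.sample.pvals accepts the resulting one-dimensional table in the same way it accepts the tables built above.

```r
# Hypothetical helper, NOT part of the package: build a word-frequency
# table after lowercasing, removing punctuation and splitting on any
# run of whitespace (spaces, tabs, line breaks).
word_table <- function(text) {
  cleaned <- gsub("[[:punct:]]", "", tolower(text))  # drop punctuation marks
  words <- strsplit(cleaned, "[[:space:]]+")[[1]]    # split on whitespace runs
  table(words[nzchar(words)])                        # count non-empty tokens
}

# Usage on a short two-line string
tb <- word_table("The cat, the dog.\nThe end")
print(tb)
```

Note that removing all punctuation also merges hyphenated terms into a single token; whether that is desirable depends on the texts being compared.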
n = 1000 #number of features
N = 10*n #number of observations
k = 0.1*n #number of perturbed features
seq = seq(1,n) #feature ranks
P = 1 / seq #Zipf-law weights (proportional to 1/rank)
P = P / sum(P) #normalize to a probability vector
tb1 = data.frame(Feature = seq(1,n), # sample 1
Freq = rmultinom(n = 1, prob = P, size = N))
seq[sample(seq,k)] <- seq[sample(seq,k)] #overwrite the ranks of k random features with the ranks of k other random features
Q = 1 / seq
Q = Q / sum(Q)
tb2 = data.frame(Feature = seq(1,n), # sample 2
Freq = rmultinom(n = 1, prob = Q, size = N))
PV = two.sample.pvals(tb1, tb2) #compute P-values
HC.vals(PV$pv) # HC test
# can also test using a single function call
two.sample.HC(tb1,tb2)
[1] Kipnis, A. Higher Criticism for Discriminating Word-Frequency Tables and Testing Authorship (2019).