Shareholder quorum subsampling R function

Shareholder quorum subsampling is a method I invented in 2009 that is similar in some ways to rarefaction, but conceptually very different. Rarefaction tells you how many species you would find in a given ecological sample given a fixed, uniform sample size. SQS tells you how many species you would find given fixed "coverage" of the underlying abundance distribution. "Coverage" (Good 1953) is simply the sum of the frequencies of the species you have sampled (so, it is 100% when all species are known). When coverage is fixed, individual samples may be large (given even abundances) or small (given uneven abundances). It all depends on the data.

The important point is that when samples have identical coverage, the ratio of their species counts should exactly mirror the ratio one would obtain if one could examine their entire respective sampling universes. That is, with SQS "twice as diverse" really does mean TWICE AS DIVERSE. The same is not true of rarefaction (much less raw species counts).

This page gives instructions for downloading and using version 3.3 of the R function I have written to perform SQS. Version 1.0 was released in July, 2009; versions 2.0 and 3.1 featured successively improved subsampling algorithms (respectively released in March and July, 2011); version 3.2 has slightly better reporting of stats (released August 2011); and version 3.3 (released December 2011) includes a minor bug fix. And you know something, 3.3 really rocks — it's almost 100% unbiased regardless of what the abundance distribution looks like.

Meanwhile, Fossilworks has a built-in diversity curve generator that implements SQS, and I have also written a Perl script to produce SQS diversity curves from PaleoDB occurrence data files. Write me an e-mail if you want it.

Do not use the R function with taxonomic occurrence data because literature compilations introduce biases that have to be dealt with using algorithms I describe in my 2014 Paleobiology paper on turnover rates. Use the Perl program or PaleoDB website instead.

And who am I, anyway? I love those existential questions.

Arguments

The R function is called sqs(). It expects a single specimen count array (i.e., the abundance data from a single fossil collection/quadrat/census/whatever) and takes the following arguments:

ab = abundance data (= count) vector
q = quorum (desired frequency coverage level, between 0 and 1)
trials = number of subsampling trials
method = subsampling method (default SQS, option "rarefaction" a.k.a. "CR")
dominant = include/exclude dominant (most common) taxon (default include, option "exclude" a.k.a. "no")

Important note: the 3.3 algorithm is not guaranteed to return subsamples with the absolute exact quorum that was requested. Version 4.0 will. But at the moment, for example, if you enter q=0.80 you may get back a mean subsampled u (see below) of 0.78. This problem is more serious for samples with very high dominance and/or analyses employing very low q values. My recommendation is to (1) see if it helps to exclude the dominant; (2) not perform analyses with q below about 0.4; and (3) show subsampling curves with subsampled richness on the y-axis and the returned average u values on the x-axis (instead of the submitted q values).

Returned values

On the plus side, the R script is dead-simple and returns a whole bunch of standard ecological statistics. Use them at your own peril, however, as most of them either don't work or don't tell you much. The returned values are:

raw richness = the number of cells in the data array (i.e., the number of taxa)
Good's u = the index of frequency coverage described by Good (1953); the same equation is used during subsampling trials to see whether the quorum has been achieved, but this value pertains to the entire data set
subsampled richness = the geometric mean number of taxa found in subsamples (as generated either by SQS or rarefaction)
subsampled u = the mean value of Good's u based on the randomly drawn subsamples
Chao 1 = the simple estimate of species pool size produced by the index described in Chao (1984), which may yield noisy and/or sample size-biased estimates (included for comparative purposes only)
subsampled Chao 1 = the same estimate based only on counts included in subsamples (averaged across subsampling trials), which can be much lower than the overall Chao 1 estimate (unfortunately)
k = the generating parameter of the geometric series distribution, which is computed with a simple regression of log abundance vs. abundance rank
Fisher's alpha = the generating parameter of the log series distribution, which is generated with a standard recursive equation involving observed richness and total abundance
Shannon's H = the standard information index, i.e., -1 times the sum of f_i log f_i where f_i = the proportional frequency of the ith taxon in the sample
Hurlbert's PIE = the sample-size corrected probability of interspecific encounter index defined by Hurlbert (1971)
dominance = the Berger-Parker dominance index, i.e., the proportional frequency of the most common taxon
specimens = the total count across the sample
singletons = the number of taxa with a count of 1 (used to compute Good's u and Chao 1)
doubletons = the number of taxa with a count of 2 (used to compute Chao 1)
specimens drawn = the geometric mean number of specimens drawn in each random subsample

Examples

Read an array and run a basic SQS analysis with a quorum of 0.8:

x = scan("/Users/me/datafile")
sqs(x,0.8,100)["subsampled richness"]

Perform a standard rarefaction analysis with a quota of 200 items and print all the stats:

sqs(x,200,100,"rarefaction")

Iterate through a series of quorum levels and plot the results (a subsampling curve):

levels = seq(.05,.95,.05)
r = array()
u = array()
for (i in 1:length(levels)) {
  s = sqs(x,levels[i],300,"","no")
  r[i] = s["subsampled richness"]
  u[i] = s["subsampled u"]
}
plot(u,r,col="darkorange",cex=0.6,type="o",log="y",xlab="quorum",ylab="richness")

Read a PaleoDB occurrence file, turn the data into a count vector, and analyze it (even though I just told you not to):

o = read.table("/Users/me/download.csv",header=T,sep=",")
t = table(o[,"occurrence.genus_name"])
sqs(t,0.5,100)

A note on nomenclature

As mentioned, SQS and rarefaction aim to do fundamentally different things: the first equates fair sampling with uniform "coverage," and the second equates fair sampling with uniform sampling. In other words, the disagreement is on how to answer the classic philosophical question "What Is The Good?" We might just as well be talking about the Meaning Of Life here.

So, whatever you do, do not call SQS rarefaction. If the phrase "quorum subsampling" makes you want to puke or something, call it "fixed coverage subsampling" or some other bland boring thing — but don't confuse yourself and everyone else by equating it with a method whose founding premise couldn't possibly be more different.

Nonetheless, there are general similarities between rarefaction and SQS. Both seek to make sampling "fair" in some sense by looking at subsamples, and both are general methodologies with different possible implementations. For example, there is a standard algorithmic method for performing rarefaction (Simberloff 1972) in addition to an exact equation (Hurlbert 1971). And there are several algorithms that yield good approximations of fixed coverage under differing assumptions (e.g., under the assumption that published literature tends to overrepresent occurences of newly described taxa). These algorithms are described in my three 2010 papers and in one of my 2014 papers (see my CV).

The general idea of quorum subsampling and my name for it were both first published in my 2009 GSA abstract. By a remarkable coincidence, Lou Jost shortly afterwards published a paper that briefly described the idea of fixed coverage subsampling, which was suggested to him by Anne Chao. However, Jost's paper gave no details about the method, didn't name it, didn't use it, and advocated it only as a means of improving certain statistics that are used to describe abundance distributions (such as Pielou's J).

Chao and Jost (2012) went on to rename the method "coverage-based rarefaction". Ecologists continue to use that name and attribute the method to them. But there shouldn't be a priority issue here — SQS is not "rarefaction" in any sense, and because Jost thought he was dealing with some kind of "rarefaction" but didn't actually name the method, I certainly have name priority in addition to general priority.