An R implementation of the s-test

This is an R implementation of the s-test for label-free mass spectrometry-based proteomics data presented in

Pham TV, Jimenez CR. Simulated linear test applied to quantitative proteomics. Bioinformatics. 2016 Sep 1; 32(17):i702-i709.

This implementation uses moderated variances as the Limma method (Smyth G.K., 2004, Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Stat. Appl. Genet. Mol. Biol. 3:3).

The original MATLAB implementation is here, and its equivalent R implementation is here.

Quick start

The first step is to load the libary. This will create a variable called stest containing all the neccessary functions in your R working environment.

source("https://tvpham.github.io/stest/stest.r")

Next, load an example dataset with 4 samples in 2 conditions (group1 and group2)

dat <- read.delim("https://tvpham.github.io/stest/data-axl.txt")

colnames(dat)

# extract intensity data
intensity_data <- dat[, c("A.1", "A.2", "B.1", "B.2")]

head(intensity_data)

The following two statements estimate technical variation and perform the s-test using the estimated variation. In this example, the comparision is the first two samples versus the last two samples.

sdata <- stest$prepare(intensity_data)

ret <- stest$compare(sdata$d_log2[, c("A.1", "A.2", "B.1", "B.2")], 
                     c("group1", "group1", "group2", "group2"), 
                     sdata)

By default, the algorithm provides a model for missing data. One may turn off this feature by setting the parameter ignore_missing_data to TRUE in the first statement.

sdata <- stest$prepare(intensity_data, ignore_missing_data = TRUE)

ret <- stest$compare(sdata$d_log2[, c("A.1", "A.2", "B.1", "B.2")], 
                     c("group1", "group1", "group2", "group2"), 
                     sdata)

Finally, write out the result to a tab-deliminated text file

write.table(cbind(dat, ret), "stest-result.txt", sep ="\t", row.names = FALSE)

The result is concatenated to the original data. Data are log2-transformed. The column log_fc contains log2 of the fold change. The column pval contains the p-values of the test and the column pval.BH Benjamini-Hochberg corrected p-values. The column n_detected counts the number of data points detected.

Usages

Estimation of technical variation


stest$prepare(dat, 
              d_log2 = NULL, 
              pdf_output = "_stest.pdf", 
              moderate_factor = 1.0, 
              min_factor = 1.0, 
              ignore_missing_data = FALSE)

Note that dat is the non-log data. Missing values are denoted by 0.

The function creates a pdf file called _stest.pdf containing diagnostic information about technical variation.

By default, robust CV is used in subsequent steps. If you have a reason to use another value, it is possible by providing cv in the testing step.

You can also use a different dataset to estimate technical variation, for example when you have a dedicated set of samples for technical reproducibility analysis. The main data is then provided via the d_log2 parameter.

Performing the s-test


stest$compare(d_log2, 
              group, 
              sdata, 
              s2 = NULL, 
              cv = NULL, 
              id = NULL, 
              n.threads = 1, 
              index = 1:nrow(d_log2))

Some main parameters:

d_log2: log2-transformed data.
group: group labels of each column.
sdata: technical variation estimation from the previous step. The cv estimation can be overriden by cv.
id: matched ids for a paired test. In our example, if our data were paired, this would be c(1,2,1,2).
n.threads: the number of threads to use.
index: a subset of rows.

Updated 2019, Thang Pham