This is an R implementation of the s-test for label-free mass spectrometry-based proteomics data presented in
Pham TV, Jimenez CR. Simulated linear test applied to quantitative proteomics. Bioinformatics. 2016 Sep 1; 32(17):i702-i709.
This implementation uses moderated variances as the Limma method (Smyth G.K., 2004, Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Stat. Appl. Genet. Mol. Biol. 3:3).
The original MATLAB implementation is here, and its equivalent R implementation is here.
The first step is to load the libary. This will create a variable called stest
containing all the neccessary functions in your R working environment.
source("https://tvpham.github.io/stest/stest.r")
Next, load an example dataset with 4 samples in 2 conditions (group1
and group2
)
dat <- read.delim("https://tvpham.github.io/stest/data-axl.txt")
colnames(dat)
# extract intensity data
intensity_data <- dat[, c("A.1", "A.2", "B.1", "B.2")]
head(intensity_data)
The following two statements estimate technical variation and perform the s-test using the estimated variation. In this example, the comparision is the first two samples versus the last two samples.
sdata <- stest$prepare(intensity_data)
ret <- stest$compare(sdata$d_log2[, c("A.1", "A.2", "B.1", "B.2")],
c("group1", "group1", "group2", "group2"),
sdata)
By default, the algorithm provides a model for missing data. One may turn off this feature by setting the parameter ignore_missing_data
to TRUE in the first statement.
sdata <- stest$prepare(intensity_data, ignore_missing_data = TRUE)
ret <- stest$compare(sdata$d_log2[, c("A.1", "A.2", "B.1", "B.2")],
c("group1", "group1", "group2", "group2"),
sdata)
Finally, write out the result to a tab-deliminated text file
write.table(cbind(dat, ret), "stest-result.txt", sep ="\t", row.names = FALSE)
The result is concatenated to the original data. Data are log2-transformed. The column log_fc contains log2 of the fold change. The column pval contains the p-values of the test and the column pval.BH Benjamini-Hochberg corrected p-values. The column n_detected counts the number of data points detected.
stest$prepare(dat,
d_log2 = NULL,
pdf_output = "_stest.pdf",
moderate_factor = 1.0,
min_factor = 1.0,
ignore_missing_data = FALSE)
Note that dat
is the non-log data. Missing values are denoted by 0.
The function creates a pdf file called _stest.pdf containing diagnostic information about technical variation.
By default, robust CV is used in subsequent steps. If you have a reason to use another value, it is possible by providing cv
in the testing step.
You can also use a different dataset to estimate technical variation, for example when you have a dedicated set of samples for technical reproducibility analysis. The main data is then provided via the d_log2
parameter.
stest$compare(d_log2,
group,
sdata,
s2 = NULL,
cv = NULL,
id = NULL,
n.threads = 1,
index = 1:nrow(d_log2))
Some main parameters:
d_log2
: log2-transformed data.group
: group labels of each column.sdata
: technical variation estimation from the previous step. The cv estimation can be overriden by cv
.id
: matched ids for a paired test. In our example, if our data were paired, this would be c(1,2,1,2)
.n.threads
: the number of threads to use.index
: a subset of rows.