【CASK EFFECT】0910F Comprehensive Reading Practice: Hurdle Reading 【SCI】 1-20
Today's Topic: Ranking differentially expressed genes from Affymetrix gene expression data: methods with reproducibility, sensitivity, and specificity
Abstract
Background: To identify differentially expressed genes (DEGs) from microarray data, users of the Affymetrix GeneChip system need to select both a preprocessing algorithm to obtain expression-level measurements and a way of ranking genes to obtain the most plausible candidates. We recently recommended suitable combinations of a preprocessing algorithm and gene ranking method that can be used to identify DEGs with a higher level of sensitivity and specificity. However, in addition to these recommendations, researchers also want to know which combinations enhance reproducibility.
Results: We compared eight conventional methods for ranking genes: weighted average difference (WAD), average difference (AD), fold change (FC), rank products (RP), moderated t statistic (modT), significance analysis of microarrays (samT), shrinkage t statistic (shrinkT), and intensity-based moderated t statistic (ibmT), with six preprocessing algorithms (PLIER, VSN, FARMS, multi-mgMOS (mmgMOS), MBEI, and GCRMA). A total of 36 real experimental data sets were evaluated on the basis of the area under the receiver operating characteristic curve (AUC) as a measure of both sensitivity and specificity. We found that the RP method performed well for VSN-, FARMS-, MBEI-, and GCRMA-preprocessed data, and the WAD method performed well for mmgMOS-preprocessed data. Our analysis of the MicroArray Quality Control (MAQC) project's data sets showed that the FC-based gene ranking methods (WAD, AD, FC, and RP) had a higher level of reproducibility: the percentages of overlapping genes (POGs) across different sites for the FC-based methods were higher overall than those for the t-statistic-based methods (modT, samT, shrinkT, and ibmT). In particular, POG values for WAD were the highest overall among the FC-based methods irrespective of the choice of preprocessing algorithm.
Conclusion: Our results demonstrate that to increase sensitivity, specificity, and reproducibility in microarray analyses, we need to select suitable combinations of preprocessing algorithms and gene ranking methods. We recommend the use of FC-based methods, in particular RP or WAD.
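For readers unfamiliar with rank products, the idea can be sketched in a few lines of Python. This is only an illustration of the general rank-product idea (the geometric mean of a gene's ranks across replicate comparisons), not the RankProd implementation used in the paper. The function name `rank_products` and the toy numbers are hypothetical, ties are broken by gene index, and no permutation-based significance estimate is included.

```python
import math

def rank_products(log_ratios):
    """Geometric mean of per-replicate ranks for each gene.

    log_ratios: list of replicate comparisons, each a list of per-gene
    log fold changes (same gene order in every replicate).
    Rank 1 = most up-regulated gene in that replicate.
    """
    n_reps = len(log_ratios)
    n_genes = len(log_ratios[0])
    log_rank_sum = [0.0] * n_genes
    for rep in log_ratios:
        # Gene indices sorted from largest to smallest log fold change.
        order = sorted(range(n_genes), key=lambda g: rep[g], reverse=True)
        for rank, gene in enumerate(order, start=1):
            log_rank_sum[gene] += math.log(rank)
    # exp(mean(log(rank))) is the geometric mean of the ranks.
    return [math.exp(s / n_reps) for s in log_rank_sum]

# Toy data (made up): 3 genes, 2 replicate comparisons.
reps = [[2.0, 0.1, -1.0],
        [1.5, 0.3, -0.8]]
rp = rank_products(reps)
best_gene = min(range(len(rp)), key=lambda g: rp[g])
```

A small rank product means the gene was ranked near the top consistently, so genes are sorted by ascending value.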
Background
Microarray analysis is often used to detect differentially expressed genes (DEGs) under different conditions. As there are considerable differences [1,2] in how well the available methods perform, choosing the best method of ranking these genes is important. Furthermore, Affymetrix GeneChip users need to choose a preprocessing algorithm from a number of competitors in order to obtain expression-level measurements [3].
We and another group have each recently reported that there are suitable combinations of preprocessing algorithms and gene ranking methods [1,2]. We evaluated three preprocessing algorithms, MAS [4], RMA [5], and DFW [6], and eight gene ranking methods, WAD [1], AD, FC, RP [7], modT [8], samT [9], shrinkT [10], and ibmT [11], by using a total of 38 data sets (including 36 real experimental datasets) [1]. Meanwhile, Pearson [2] evaluated nine preprocessing algorithms, MAS [4], RMA [5], DFW [6], MBEI [12], CP [13], PLIER [14], GCRMA [15], mmgMOS [16], and FARMS [17], and five gene ranking methods, modT [8], FC, a standard t-test, cyberT [18], and PPLR [19], by using only one artificial 'spike-in' dataset, the Golden Spike dataset [13].
When we re-evaluated the two reports using the common algorithms and methods, we found that the suitable gene ranking methods for each of the three preprocessing algorithms, i.e., MAS, RMA, and DFW, converge on the same result: combinations of MAS and modT (MAS/modT), RMA/FC, and DFW/FC can thus be recommended. However, the final conclusions of the original reports are understandably different: our recommendations [1] are MAS/WAD, RMA/FC, and DFW/RP, while Pearson [2] recommends mmgMOS/PPLR, GCRMA/FC, and so on. This difference arises mainly because fewer preprocessing algorithms were evaluated in our previous study [1].
We investigated suitable gene ranking methods for each of six preprocessing algorithms: MBEI, VSN [20], PLIER, GCRMA, FARMS, and mmgMOS. We also investigated the best combination of a preprocessing algorithm and gene ranking method using another evaluation metric, i.e., the percentage of overlapping genes (POG), proposed by the MAQC study [21].
Most authors of methodological papers have claimed that their methods achieve greater area under the receiver operating characteristic curve (AUC) values, i.e., both high sensitivity and specificity [1,2]. However, reproducibility is rarely mentioned [21]. A good method should produce high POG values, which indicate reproducibility, as well as high AUC values, which indicate sensitivity and specificity. We will discuss suitable combinations of preprocessing algorithms and gene ranking methods.
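To make the AUC criterion concrete, here is a minimal Python sketch. It assumes each gene has a ranking score (larger means "more likely a DEG") and a known true/false DEG label. It uses the pairwise (Mann-Whitney) formulation of AUC rather than the ROC package used in the paper, and the function name `auc_from_scores` and the example values are made up for illustration.

```python
def auc_from_scores(scores, is_deg):
    """AUC = probability that a true DEG outscores a non-DEG (ties count 0.5)."""
    pos = [s for s, d in zip(scores, is_deg) if d]
    neg = [s for s, d in zip(scores, is_deg) if not d]
    if not pos or not neg:
        raise ValueError("need at least one DEG and one non-DEG")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Toy example (made up): both true DEGs outrank every non-DEG.
perfect = auc_from_scores([5.0, 3.0, 4.0, 1.0, 2.0],
                          [True, False, True, False, False])
# Toy example (made up): only one of four DEG/non-DEG pairs is ordered correctly.
poor = auc_from_scores([1.0, 2.0, 3.0, 4.0],
                       [True, False, True, False])
```

An AUC of 1 means every true DEG is ranked above every non-DEG, while 0.5 corresponds to a random ranking.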
Conclusion
We evaluated the performance of combinations of six preprocessing algorithms and eight gene ranking methods in terms of the AUC value, i.e., both sensitivity and specificity, and the POG value, i.e., reproducibility. Our comprehensive evaluation confirmed the importance of using suitable combinations of preprocessing algorithms and gene ranking methods.
Overall, two FC-based gene ranking methods (RP and WAD) can be recommended. Our current and previous results indicate that any of the following combinations, RMA/RP, DFW/RP, PLIER/RP, VSN/RP, FARMS/RP, MBEI/RP, GCRMA/RP, MAS/WAD, and mmgMOS/WAD, enhances both sensitivity and specificity, and also that using the WAD method enhances reproducibility.
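The excerpt does not spell out how WAD is computed. The Python sketch below follows my understanding of the WAD paper [1]: AD is the difference of the two group means on the log scale, and WAD multiplies AD by a weight that grows with the gene's average signal, rescaled to the range 0 to 1 across genes. Treat the formula, the function name `ad_and_wad`, and the toy values as assumptions to check against [1].

```python
def mean(values):
    return sum(values) / len(values)

def ad_and_wad(group_a, group_b):
    """Return (AD, WAD) per gene.

    group_a, group_b: per-gene lists of log-scale signals for the two
    conditions, with genes in the same order.
    Assumed formula: WAD = AD * (avg - min(avg)) / (max(avg) - min(avg)),
    where avg is the mean of the two group means for that gene.
    """
    mean_a = [mean(g) for g in group_a]
    mean_b = [mean(g) for g in group_b]
    ad = [b - a for a, b in zip(mean_a, mean_b)]
    avg = [(a + b) / 2 for a, b in zip(mean_a, mean_b)]
    lo, hi = min(avg), max(avg)
    span = hi - lo
    # If all genes have the same average signal, fall back to weight 1
    # (a choice made for this sketch), so that WAD reduces to AD.
    weights = [(x - lo) / span if span > 0 else 1.0 for x in avg]
    wad = [d * w for d, w in zip(ad, weights)]
    return ad, wad

# Toy data (made up): genes 1 and 2 have the same AD, but gene 2 is
# expressed at a much higher level and so gets the larger WAD.
a = [[2.0, 2.0], [10.0, 10.0], [6.0, 6.0]]
b = [[4.0, 4.0], [12.0, 12.0], [6.0, 6.0]]
ad, wad = ad_and_wad(a, b)
```

The weighting pushes weakly expressed genes down the ranking, where a given log difference is more likely to be noise.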
Methods
The raw data (Affymetrix CEL files) for Datasets 3–38 were obtained from the Gene Expression Omnibus (GEO) website [32]. All analysis was performed using R (ver. 2.7.2) [33] and Bioconductor [34]. The versions of the R libraries used in this study are as follows: plier (ver. 1.10.0), vsn (3.2.1), farms (1.3), puma (1.6.0), affy (1.16.0) [35], gcrma (2.10.0), RankProd (2.12.0) [36], st (1.0.3) [10], limma (2.14.7) [8], and ROC (1.14.0). The main functions in the R libraries are as follows: justPlier for PLIER, vsnrma for VSN, q.farms for FARMS, mmgmos for mmgMOS, expresso for MBEI (PM only model), gcrma for GCRMA, mas5 for MAS, rma for RMA, expresso and the R codes available in [37] for DFW, RP for RP, modt.stat for modT, sam.stat for samT, shrinkt.stat for shrinkT, IBMT for ibmT [38], and pumaComb and pumaDE for PPLR [19].
Since the MBEI and MAS expression measures do not output logged values, signal intensities under 1 in those preprocessed data were set to 1 so that the logarithm of the data could be found. Logged values smaller than 0 in PLIER-, VSN-, FARMS-, mmgMOS-, and GCRMA-preprocessed data were set to 0. For reproducible research, we made the R code for analyzing Dataset 4 (GEO ID: GSM189708–189713) available as an additional file [see Additional file 3]. The R codes for the other datasets are available upon request.
The raw data for the MAQC datasets were obtained from the MAQC website [39]. The evaluation based on POG was done with 12 datasets produced by the MAQC project [21], in which two RNA sample types and two mixtures of the original samples were used: Sample A, a universal human reference RNA; Sample B, a human brain reference RNA; Sample C, which consisted of 75% Sample A and 25% Sample B; and Sample D, which consisted of 25% Sample A and 75% Sample B. Five replicate experiments for each of the four sample types were conducted at six independent test sites (Sites 1–6), and thus there are 20 files at each site. The data preprocessing was performed at each site. The gene ranking methods were applied independently to the comparisons of "Sample A versus B" and "Sample C versus D".
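The two flooring rules above are easy to state in code. The Python sketch below assumes a base-2 logarithm (the text only says "logarithm"), and the function names `log2_with_floor` and `clamp_logged` are hypothetical. The paper's own analysis was done in R.

```python
import math

def log2_with_floor(intensities, floor=1.0):
    """For unlogged measures (MBEI, MAS): raise values under 1 to 1, then log.

    Base 2 is an assumption; the source does not name the base.
    """
    return [math.log2(max(x, floor)) for x in intensities]

def clamp_logged(values, floor=0.0):
    """For already-logged measures: set logged values below 0 to 0."""
    return [max(v, floor) for v in values]

# Toy values (made up).
logged = log2_with_floor([0.5, 1.0, 8.0])
clamped = clamp_logged([-1.2, 0.0, 3.4])
```

Both rules leave every logged value at 0 or above, so the two kinds of preprocessed data share the same lower bound.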
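As a rough illustration of the POG idea, the Python sketch below takes per-gene ranking scores from two test sites for the same comparison (e.g., Sample A versus B), picks the top k genes at each site by absolute score, and reports the percentage that appear in both lists. This is a simplification: the MAQC definition [21] may also require agreement in the direction of regulation, which is not checked here. The names `top_genes`, `pog`, and the example scores are made up.

```python
def top_genes(scores, k):
    """Return the set of k gene IDs with the largest absolute score."""
    ranked = sorted(scores, key=lambda gene: abs(scores[gene]), reverse=True)
    return set(ranked[:k])

def pog(scores_site1, scores_site2, k):
    """Percentage of overlapping genes between two top-k lists."""
    if k <= 0:
        raise ValueError("k must be positive")
    shared = top_genes(scores_site1, k) & top_genes(scores_site2, k)
    return 100.0 * len(shared) / k

# Toy scores (made up) for four genes at two sites.
site1 = {"g1": 3.0, "g2": -2.5, "g3": 0.4, "g4": 0.1}
site2 = {"g1": 2.8, "g2": 0.2, "g3": -1.9, "g4": 0.3}
overlap = pog(site1, site2, k=2)
```

A ranking method with high reproducibility yields top lists that largely agree between sites, giving a POG close to 100.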