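How to find the variables with the largest variance between different groups
I am trying to run a regression model in which I want to find the best predictor variables. However, the data contain more than 100,000 variables (this is a microarray-based experiment) and 2 outcomes (tumor and non-tumor). Here is a snapshot of the data: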
>dt
Tumor cg15560884 cg15979415 cg21482377 cg27346986 cg13565718 cg04359978 cg00328058 cg07787977 cg02632261
1 No -0.2480779 -3.541635298 0.1930965 -0.2855506 1.7873570 -0.05663302 -0.3248885 -2.9448065 0.6228754
2 No 0.9172439 0.055514083 0.4655855 0.3226286 2.0404916 1.93954213 1.0556121 0.1188842 0.6394047
3 No 0.4115322 -2.688796456 -0.3414734 -0.5690240 1.4191325 0.23146577 -0.3843809 -2.3456532 2.1169214
4 No 0.9564983 -0.284362579 0.9074372 0.3841181 1.9482238 2.30368166 1.6791506 0.1672436 1.8456432
5 No 0.4373796 -0.847716026 -0.3654539 0.6407850 0.4981430 1.41772759 0.1211819 -0.3048053 0.9258699
6 No 1.1842945 0.184265452 -0.5769016 0.2349631 0.4897802 1.96881485 2.1653989 -0.2689008 -0.7468990
7 No 1.2802583 1.510712751 -0.4231337 -0.2016259 2.7457824 2.78922437 2.1925534 0.5288933 1.0394935
8 Yes 1.1831400 0.103893327 0.3585823 0.5774059 2.9961775 3.00736681 1.9211571 1.9990507 2.5718125
9 Yes 1.1419304 0.009477014 1.7524348 0.6827657 3.1542609 3.11282241 2.3964859 1.2965353 2.9299558
10 Yes 1.5811014 2.612363269 4.0609050 2.5058440 3.4295390 2.74999398 3.5159891 4.0156051 4.1311138
11 Yes 0.4145909 -0.375025614 0.2912988 0.3032374 3.1445856 2.20233921 0.8775737 0.8418369 1.9903667
12 Yes 0.9668263 -0.272698105 -0.1731778 0.2230170 2.5546191 1.91083215 2.3383876 2.3296599 2.1821964
13 Yes -0.1230484 -0.625187944 0.2620956 -0.0419292 2.9346895 2.45153644 1.9039218 1.5932535 2.4690055
14 Yes 0.8659252 0.175015222 1.0062097 0.3605752 2.0769247 1.52875829 1.5361073 0.8493504 2.3467234
I have run a series of logit analyses, which seem to perform well. In short, this is the pipeline:
- A series of linear models for arrays (lmFit()) with an empirical Bayes approach (eBayes()):
#Designing the contrast model:
design <- data.frame(Tumor=c(rep("0",7),rep("1",7)),Benign=c(rep("1",7),rep("0",7)))
#Running linear regression with empirical Bayes approach:
lmFit(dt[,-1],design) %>% eBayes() -> fit
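As a rough sketch of how the inputs to this step could be laid out: limma's lmFit() expects probes in rows and samples in columns, with a numeric design matrix that has one row per sample. The example below uses only base R and a three-probe subset of the mock data (values rounded as in the snapshot); the names expr and design_mm are illustrative, and the limma calls are left as comments because they assume the package is installed.

```r
# Three probes from the mock data, values rounded as in the snapshot.
dt <- data.frame(
  Tumor = factor(rep(c("No", "Yes"), each = 7)),
  cg15560884 = c(-0.2480779, 0.9172439, 0.4115322, 0.9564983, 0.4373796,
                 1.1842945, 1.2802583, 1.1831400, 1.1419304, 1.5811014,
                 0.4145909, 0.9668263, -0.1230484, 0.8659252),
  cg13565718 = c(1.7873570, 2.0404916, 1.4191325, 1.9482238, 0.4981430,
                 0.4897802, 2.7457824, 2.9961775, 3.1542609, 3.4295390,
                 3.1445856, 2.5546191, 2.9346895, 2.0769247),
  cg07787977 = c(-2.9448065, 0.1188842, -2.3456532, 0.1672436, -0.3048053,
                 -0.2689008, 0.5288933, 1.9990507, 1.2965353, 4.0156051,
                 0.8418369, 2.3296599, 1.5932535, 0.8493504)
)

# Probes in rows, samples in columns.
expr <- t(as.matrix(dt[, -1]))

# Numeric design matrix: intercept plus a Tumor (Yes vs. No) indicator.
design_mm <- model.matrix(~ Tumor, data = dt)

# With limma installed, the fit could then look like this (not run here):
# fit <- limma::eBayes(limma::lmFit(expr, design_mm))
# limma::topTable(fit, coef = "TumorYes")
```

With this layout, the TumorYes coefficient is the estimated tumor minus non-tumor difference for each probe.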
- Univariate analysis (glm()) of the candidates found, filtered by P value:
glm(Tumor ~ [*],family=binomial(link='logit'),data=dt)
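The [*] placeholder above stands for one probe at a time. A minimal sketch of that loop, assuming a three-probe subset of the mock data (values rounded as in the snapshot) and a hypothetical result vector p_values, might look like this:

```r
# Three probes from the mock data, values rounded as in the snapshot.
dt <- data.frame(
  Tumor = factor(rep(c("No", "Yes"), each = 7)),
  cg15560884 = c(-0.2480779, 0.9172439, 0.4115322, 0.9564983, 0.4373796,
                 1.1842945, 1.2802583, 1.1831400, 1.1419304, 1.5811014,
                 0.4145909, 0.9668263, -0.1230484, 0.8659252),
  cg13565718 = c(1.7873570, 2.0404916, 1.4191325, 1.9482238, 0.4981430,
                 0.4897802, 2.7457824, 2.9961775, 3.1542609, 3.4295390,
                 3.1445856, 2.5546191, 2.9346895, 2.0769247),
  cg07787977 = c(-2.9448065, 0.1188842, -2.3456532, 0.1672436, -0.3048053,
                 -0.2689008, 0.5288933, 1.9990507, 1.2965353, 4.0156051,
                 0.8418369, 2.3296599, 1.5932535, 0.8493504)
)

probes <- setdiff(names(dt), "Tumor")

# One logistic regression per probe; keep the Wald P value of the slope.
# Warnings are suppressed only to keep the sketch quiet.
p_values <- sapply(probes, function(p) {
  fit <- suppressWarnings(
    glm(reformulate(p, response = "Tumor"),
        family = binomial(link = "logit"), data = dt)
  )
  coef(summary(fit))[p, "Pr(>|z|)"]
})

sort(p_values)
```

One caveat about this subset: cg07787977 separates the two groups completely (every tumor value is above every non-tumor value), and in that situation glm() warns about fitted probabilities of 0 or 1 and its Wald P value is not reliable, so a cleanly separating probe can look unremarkable under this filter.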
- Multivariate analysis with lasso penalty and cross-validation (cv.glmnet()) of the candidates found (filtered by P value; 1,000 candidates):
#This is just an example (i.e. these are not necessarily the candidates).
x <- model.matrix(as.formula(c("~cg15560884 + cg15979415 + cg21482377 + cg27346986 + cg13565718 + cg04359978 + cg00328058 + cg07787977 + cg02632261")),dt)
y <- dt$Tumor
cv.lasso <- cv.glmnet(x,y,family="binomial",standardize=TRUE,alpha=1,nfolds=10)
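At this step, about 25 candidates pass all the analyses.
When I plot these potential candidates, I find that the difference between the two groups (tumor vs. non-tumor) is not that good for some of them. Since the goal here is to find candidates with potential clinical relevance, the question is:
What is the best way to identify variables with (1) low variance within each group and (2) a clear, relatively large difference between groups?
I have considered starting the pipeline with the sum of squares between groups or something similar, but I am not sure what the best approach is.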
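To make the between-group idea concrete, here is a minimal sketch of one possible per-probe score, not necessarily the best one: the difference in group means, the pooled within-group standard deviation, their ratio (a standardized effect size), and the one-way ANOVA F statistic, which is the between-group mean square divided by the within-group mean square. It uses a three-probe subset of the mock data (values rounded as in the snapshot); the names score_probe and stats_tbl are illustrative.

```r
# Three probes from the mock data, values rounded as in the snapshot.
dt <- data.frame(
  Tumor = factor(rep(c("No", "Yes"), each = 7)),
  cg15560884 = c(-0.2480779, 0.9172439, 0.4115322, 0.9564983, 0.4373796,
                 1.1842945, 1.2802583, 1.1831400, 1.1419304, 1.5811014,
                 0.4145909, 0.9668263, -0.1230484, 0.8659252),
  cg13565718 = c(1.7873570, 2.0404916, 1.4191325, 1.9482238, 0.4981430,
                 0.4897802, 2.7457824, 2.9961775, 3.1542609, 3.4295390,
                 3.1445856, 2.5546191, 2.9346895, 2.0769247),
  cg07787977 = c(-2.9448065, 0.1188842, -2.3456532, 0.1672436, -0.3048053,
                 -0.2689008, 0.5288933, 1.9990507, 1.2965353, 4.0156051,
                 0.8418369, 2.3296599, 1.5932535, 0.8493504)
)

group <- dt$Tumor
is_tumor <- group == "Yes"

score_probe <- function(x) {
  n1 <- sum(is_tumor)
  n0 <- sum(!is_tumor)
  mean_diff <- mean(x[is_tumor]) - mean(x[!is_tumor])
  # Pooled within-group standard deviation (low values = tight groups).
  pooled_sd <- sqrt(((n1 - 1) * var(x[is_tumor]) +
                     (n0 - 1) * var(x[!is_tumor])) / (n1 + n0 - 2))
  # One-way ANOVA F: between-group mean square / within-group mean square.
  f_stat <- anova(lm(x ~ group))[1, "F value"]
  c(mean_diff = mean_diff, pooled_sd = pooled_sd,
    effect_size = mean_diff / pooled_sd, F = f_stat)
}

# One row per probe, ranked by absolute standardized difference.
stats_tbl <- t(sapply(dt[, -1], score_probe))
stats_tbl[order(-abs(stats_tbl[, "effect_size"])), ]
```

With two groups the F statistic is just the squared pooled-variance t statistic, so ranking by F and ranking by the absolute effect size give the same order. Reporting mean_diff and pooled_sd separately still helps, because it allows filtering on a minimum absolute difference and a maximum within-group spread independently.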
Any help would be greatly appreciated.
PS: Here is the mock data shown above:
> dput(dt)
structure(list(Tumor = structure(c(1L,1L,1L,1L,1L,1L,1L,2L,2L,2L,2L,2L,2L,2L),.Label = c("No","Yes"),class = "factor"),cg15560884 = c(-0.248077910261345,0.917243931527906,0.411532204288758,0.956498270834689,0.437379596251596,1.18429454839675,1.28025825354934,1.18313995574906,1.14193044361971,1.58110142968133,0.414590861304658,0.966826317765609,-0.123048367553966,0.865925151474944),cg15979415 = c(-3.54163529762454,0.0555140832614802,-2.68879645560804,-0.284362579485303,-0.847716026488968,0.184265451680517,1.51071275115752,0.103893326861259,0.00947701421375391,2.61236326867269,-0.375025613783568,-0.272698105021754,-0.62518794357036,0.175015221908592),cg21482377 = c(0.19309650443158,0.465585470446969,-0.341473357667879,0.907437241260519,-0.36545393340153,-0.576901593963131,-0.423133690212283,0.358582311092238,1.75243481045574,4.06090501077815,0.291298796940103,-0.173177826492712,0.262095579024894,1.00620967854383),cg27346986 = c(-0.285550598277455,0.322628631286669,-0.569023984702458,0.384118141327422,0.640785025494895,0.234963083143655,-0.201625866484334,0.577405892650868,0.682765746636268,2.50584400408156,0.303237442139183,0.223016985745948,-0.0419291977006327,0.360575219255877),cg13565718 = c(1.78735699619159,2.04049160554018,1.41913246191327,1.94822376321837,0.498142972793614,0.489780156080168,2.74578238467836,2.99617752225596,3.15426093568808,3.42953897985088,3.14458563100296,2.55461911595142,2.93468952565516,2.07692465631039
),cg04359978 = c(-0.056633023447488,1.93954212965755,0.231465769910894,2.30368165948499,1.41772758917327,1.96881485075863,2.78922437354038,3.00736681386976,3.11282240855254,2.74999398469954,2.2023392119177,1.91083214812378,2.45153643876517,1.52875829347186),cg00328058 = c(-0.324888517874121,1.05561214264499,-0.384380901476841,1.67915058188375,0.121181903594745,2.16539894626557,2.19255343382776,1.9211570874648,2.39648589262158,3.51598910725581,0.877573664935301,2.33838763841662,1.90392175186133,1.53610733484261),cg07787977 = c(-2.94480653748164,0.118884228237407,-2.34565316735368,0.167243632510614,-0.304805256674929,-0.268900821807674,0.528893346493377,1.99905071007526,1.29653531766456,4.01560505952943,0.841836891217517,2.32965985284038,1.5932534862773,0.849350444895923),cg02632261 = c(0.622875369361909,0.63940465080861,2.11692137500146,1.84564321288801,0.925869893712036,-0.746898982059526,1.03949352765268,2.57181250869727,2.9299557731721,4.13111383622201,1.9903666807218,2.18219637136446,2.46900554766421,2.34672341454479)),row.names = c(NA,14L),class = "data.frame")