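What I'm trying to do
I'm fairly new to Python and have limited experience with the pandas library. I've been trying to modify the dataframes below so that the program reads the contents of 3 CSV files, creates two new variables from the data in the first and second dataframes, and then concatenates them into a variable called Pred_arg, which is the reference dataframe for the comparison.
The third CSV file holds the test results and is loaded into the variable df.
After that, I'm trying to write a script that scans each column of that variable and returns True or False (in an output table) depending on whether each cluster group has at least one value from ABCPred and BCEPred. The goal is then to print a result table with True or False for each cluster: if a cluster has at least 1 True among its results, that cluster is marked True.
My goal is: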
Cluster Number Status
clu1 True
clu2 True
clu3 False
... ...
clu57 True
Later I can use a group by on this to count all the True and all the False values. Eventually I need to remove every row that returned False, but I can manage that part myself.
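For illustration, here is a minimal sketch of how a per-cluster True/False table like the one above could be derived. It assumes an exact string match between a test peptide and a predicted sequence is what counts as a hit, which may not be the intended rule. The `predicted` set and the tiny `df` are hypothetical stand-ins, not the real data.

```python
import pandas as pd

# Toy stand-in for the test results (a few rows copied from the tables in this post).
df = pd.DataFrame({
    "Cluster Number": [1, 1, 56, 56, 57, 57],
    "Peptide": [
        "QDVNCTEVPVAIHADQLTPT", "DVNCTEVPVAIHADQLTPTW",
        "CCSCGSCCKFDEDDSE", "CKFDEDDS",
        "CCSCLKGCCSCGSCCKFD", "CCSCLKGCCSCGSCCK",
    ],
})
# Hypothetical pooled set of predicted sequences (ABCPred + BCEPred).
predicted = {"CKFDEDDS", "IHVSGTNGT"}

# True for each peptide that exactly matches a predicted sequence.
hit = df["Peptide"].isin(predicted)

# A cluster is True if at least one of its peptides is a hit.
status = (
    hit.groupby(df["Cluster Number"]).any()
    .rename("Status")
    .reset_index()
)
print(status)

# Count True/False clusters, then keep only rows belonging to True clusters.
counts = status["Status"].value_counts()
keep = status.loc[status["Status"], "Cluster Number"]
df_true = df[df["Cluster Number"].isin(keep)]
```

Grouping the boolean mask by the cluster column keeps the whole check vectorised, and `any()` expresses the "at least 1 True" rule directly.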
What I've done so far
Step 1 - read the results from ABCPred and tidy them
import pandas as pd

# Read the ABCPred results; only the first of the five columns is needed
ABCPred = pd.read_csv(r"C:\Users\tonyr\OneDrive - Ulster University\PhD - Stratified medicine research projects\COVID19 paper 2020\Data\Output data\ABCPred_res(254).csv")
ABCPred.columns = ['Seq','drop1','drop2','drop3','drop4']
ABCPred = ABCPred[ABCPred['Seq'].notna()] # discard rows with no sequence
ABCPred = ABCPred.drop(columns = ['drop1','drop2','drop3','drop4']) # keep only 'Seq'
print(ABCPred)
Seq
0 AGAAAYYVGYLQPRTF
1 AGCLIGAEHVNNSY
2 AGTITSGWTFGAGAAL
3 AGTITSGWTFGAGAALQIPF
4 ALEPLVDLPIGI
.. ...
248 YQTQTNSPRRARSVASQS
249 YSSANNCTFEYVSQPFLM
250 YSSANNCTFEYVSQPFLMDL
251 YTSALLAGTITSGWTFGA
252 YVGYLQPRTFLLKYNE
Step 2 - read the BCEPred results and tidy them
BCEPred = pd.read_csv(r"C:\Users\tonyr\OneDrive - Ulster University\PhD - Stratified medicine research projects\COVID19 paper 2020\Data\Output data\BCEPred_res_cor.csv")
print(BCEPred)
Seq
0 IHVSGTNGT
1 VYFASTEK
2 TTLDSKTQ
3 VYYHKNN
4 MDLEGKQ
5 SYLTPGDSS
6 DPLSETK
7 YAWNRKRI
8 QIAPGQT
9 NNLDSKVG
10 RLFRKSNL
11 ATVCGPKKST
12 GVLTESNK
13 VITPGTNTS
14 RVYSTGS
15 ASYQTQTNSPRRA
16 LPVSMTK
17 ICGDSTEC
18 IAVEQDKNT
19 QILPDPSKPSKR
20 GKIQDSLS
21 TLVKQLS
22 ECVLGQSKR
23 EVAKNLN
24 CKFDEDDS
Step 3 - I add these dataframes to a new dataframe called Pred_arg
Pred_arg = ABCPred.assign(ABCSeq = ABCPred['Seq'],BCEPred = BCEPred['Seq']).reset_index()
Pred_arg = Pred_arg.drop(columns = ['index','Seq'])
print(Pred_arg)
ABCSeq BCEPred
0 AGAAAYYVGYLQPRTF IHVSGTNGT
1 AGCLIGAEHVNNSY VYFASTEK
2 AGTITSGWTFGAGAAL TTLDSKTQ
3 AGTITSGWTFGAGAALQIPF VYYHKNN
4 ALEPLVDLPIGI MDLEGKQ
.. ... ...
248 YQTQTNSPRRARSVASQS NaN
249 YSSANNCTFEYVSQPFLM NaN
250 YSSANNCTFEYVSQPFLMDL NaN
251 YTSALLAGTITSGWTFGA NaN
252 YVGYLQPRTFLLKYNE NaN
So now I have created the reference dataframe to compare against
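One thing worth noting about the table above: the two columns are aligned purely by row index, so the shorter BCEPred column is padded with NaN and values sharing a row are not otherwise related. The sketch below reproduces that layout on toy data with `pd.concat` and then pools both columns into a single lookup set. The name `predicted` is a hypothetical helper introduced here, and pooling is only one possible design, assuming the side-by-side pairing carries no meaning.

```python
import pandas as pd

# Toy stand-ins for the two prediction tables (values taken from the post).
ABCPred = pd.DataFrame({"Seq": ["AGAAAYYVGYLQPRTF", "AGCLIGAEHVNNSY", "ALEPLVDLPIGI"]})
BCEPred = pd.DataFrame({"Seq": ["IHVSGTNGT", "VYFASTEK"]})

# Side-by-side layout, as in step 3: the shorter column is padded with NaN.
Pred_arg = pd.concat(
    [ABCPred["Seq"].rename("ABCSeq"), BCEPred["Seq"].rename("BCEPred")],
    axis=1,
)

# Pooled lookup set; dropna() keeps the NaN padding out of later comparisons.
predicted = set(Pred_arg["ABCSeq"].dropna()) | set(Pred_arg["BCEPred"].dropna())
print(Pred_arg)
print(sorted(predicted))
```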
Step 4 - import the test results for comparison
df = pd.read_csv(r"C:\Users\tonyr\OneDrive - Ulster University\PhD - Stratified medicine research projects\COVID19 paper 2020\Data\IEDB_dataset_run1.csv")
df = df.drop(columns = ['Alignment','Position','Description'])
df = df.drop(df[df.Peptide == '-'].index) # removes all rows where '-' exists in the Peptide column
df = df.drop(df[df['Peptide Number'] == 'Singleton'].index) #remove singletons
Cluster Number Peptide Number Peptide
1 1 1 QDVNCTEVPVAIHADQLTPT
2 1 2 DVNCTEVPVAIHADQLTPTW
3 1 3 EVPVAIHADQLTPTWRVYST
4 1 4 PVAIHADQLTPTWRVYSTGS
5 1 5 DQLTPTWRVYSTGSNV
.. ... ... ...
307 55 2 TQRNFYEPQIITTDNTFV
309 56 1 CCSCGSCCKFDEDDSE
310 56 2 CKFDEDDS
312 57 1 CCSCLKGCCSCGSCCKFD
313 57 2 CCSCLKGCCSCGSCCK
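As a side note, the two `drop(...index)` calls in step 4 can also be written as a single boolean mask. The sketch below assumes the 'Peptide Number' column is read as strings (it mixes numbers with the word 'Singleton'). The toy rows reuse peptides from this post in a made-up arrangement.

```python
import pandas as pd

# Toy rows (made-up arrangement) with a '-' placeholder and a 'Singleton' entry.
raw = pd.DataFrame({
    "Cluster Number": [1, 1, 1, 2],
    "Peptide Number": ["1", "2", "3", "Singleton"],
    "Peptide": [
        "QDVNCTEVPVAIHADQLTPT", "-",
        "EVPVAIHADQLTPTWRVYST", "ISVTTEILPVSMTKTSVDCT",
    ],
})

# Keep rows that have a real peptide and are not singletons.
keep = (raw["Peptide"] != "-") & (raw["Peptide Number"] != "Singleton")
df = raw[keep].reset_index(drop=True)
print(df)
```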
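This is where I'm stuck
I tried grouping by the clusters from step 4. Everything displays fine, with the cluster number as an index from 0 to 57, but I can't use that grouping to check whether ABCPred and BCEPred values are in clu1.
If I try to use isin for one case (i.e. the ABCPred results), everything comes back False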
df_groups = df.groupby(["Cluster Number"])["Peptide"].apply(list)
df_groups.columns = ['Cluster Number','Seq(s)']
print(df_groups)
Cluster Number
1 [QDVNCTEVPVAIHADQLTPT,DVNCTEVPVAIHADQLTPTW,E...
2 [ISVTTEILPVSMTKTSVDCT,EILPVSMTKTSVDCTMYI,ILP...
3 [STEKSNIIRGWIFGTTLD,KSNIIRGWIFGTTLDS,IRGWIFG...
4 [YQPYRVVVLSFELLHAPATV,SFELLHAPATVCGP,FELLHAP...
5 [LHRSYLTPGDSSSG,HRSYLTPGDSSSGWTA,SYLTPGDSSSG...
6 [VYSSANNCTFEYVSQPFL,YSSANNCTFEYVSQPFLMDL,YSS...
7 [QIPFAMQMAYRFNG,PFAMQMAYRFNGIGVT,FAMQMAYRFNG...
8 [ASYQTQTNSPRRA,YQTQTNSPRRARSVASQS,YQTQTNSPRR...
9 [EMIAQYTSALLAGTITSG,YTSALLAGTITSGWTFGA,LAGTI...
10 [TPCSFGGVSVITPGTNTSNQ,PCSFGGVSVITPGTNTSNQV,P...
11 [RGVYYPDKVFRSSVLHSTQD,GVYYPDKVFRSSVLHSTQ,KVF...
12 [YNENGTITDAVDCA,NENGTITDAVDCALDP,ENGTITDAVDC...
13 [GVSPTKLNDLCFTNVYADSF,TKLNDLCFTNVYADSFVI,NDL...
14 [GVYYHKNNKSWMESEFRV,VYYHKNNKSWMESEFRVYSS,VYY...
15 [PFGEVFNATRFASVYAWNRK,TRFASVYAWNRKRI,RFASVYA...
16 [AGCLIGAEHVNNSY,GCLIGAEHVNNSYECD,LIGAEHVNNSY...
17 [TEIYQAGSTPCNGVEG,YQAGSTPCNGVEGFNC,QAGSTPCNG...
18 [QQFGRDIADTTDAVRDPQTL,QQFGRDIADTTDAV,QFGRDIA...
19 [YFPLQSYGFQ,LQSYGFQPTNGVGYQP,YGFQPTNGVGYQPYR...
20 [IHVSGTNGTKRFDNPVLPFN,IHVSGTNGT,VSGTNGTKRFDN...
21 [NLREFVFKNIDGYFKIYS,EFVFKNIDGYFKIYSKHT,FKNID...
22 [IAVEQDKNT,AVEQDKNTQEVFAQ,VEQDKNTQEVFAQV,QD...
23 [DKVEAEVQIDRLITGRLQSL,EAEVQIDRLITGRLQSLQTY,Q...
24 [DSLSSTASALGKLQDV,LSSTASALGKLQDVVNQN,LSSTASA...
25 [PGQTGKIADYNYKLPD,GQTGKIADYNYKLP,TGKIADYNYKL...
26 [YEQYIKWPWYIWLGFIAG,YEQYIKWPWYIWLGFI,YIKWPWY...
27 [TVEKGIYQTSNFRVQP,EKGIYQTSNFRVQPTE,KGIYQTSNF...
28 [KSNLKPFERDISTEIYQA,SNLKPFERDISTEIYQAGST,FER...
29 [VLYNSASFSTFKCYGVSP,FSTFKCYGVSPTKL,STFKCYGVSP]
30 [HGVVFLHVTYVPAQEK,GVVFLHVTYVPAQEKNFT,HVTYVPA...
31 [PGTNTSNQVAVLYQDV,GTNTSNQVAVLYQDVNCT,TSNQVAV...
32 [KQIYKTPPIKDFGGFN,KTPPIKDFGGFN,TPPIKDFGGFNFS...
33 [VTQQLIRAAEIRASANLAAT,VTQQLIRAAEIRASANLA,TQQ...
34 [GCVIAWNSNNLDSKVGGNYN,CVIAWNSNNLDSKV,NNLDSKVG]
35 [GNYNYLYRLFRKSNLKPF,NYLYRLFRKSNL,RLFRKSNL]
36 [GGFNFSQILPDPSKPSKR,SQILPDPSKPSKRSFI,QILPDPS...
37 [SSNFGAISSVLNDI,SNFGAISSVLNDILSRLD,ISSVLNDIL...
38 [QKEIDRLNEVAKNLNE,KEIDRLNEVAKNLNESLI,EVAKNLN]
39 [FPNITNLCPFGEVFNA,PNITNLCPFGEVFN,NITNLCPFGEV...
40 [LTGTGVLTESNKKF,GVLTESNK]
41 [VLPFNDGVYFASTE,VYFASTEK]
42 [ECSNLLLQYGSFCTQLNRAL,LQYGSFCTQL]
43 [EVRQIAPGQTGKIADY,QIAPGQT]
44 [QLPPAYTNSFTR,PPAYTNSFTRGVYY]
45 [VTLADAGFIKQYGDCLGDIA,GFIKQYGDCLGDIAARDLIC]
46 [TLVKQLS,LVKQLSSNFGAISS]
47 [IGKIQDSLSSTASALG,GKIQDSLS]
48 [TNVVIKVCEFQFCNDP,VVIKVCEFQFCNDPFLGVYY]
49 [ESLIDLQELGKYEQYI,DLQELGKYEQYIKWPWYI]
50 [GDIAARDLICAQKFNGLT,RDLICAQKFNGLTVLP]
51 [PQGFSALEPLVDLPIGIN,ALEPLVDLPIGI]
52 [VVIGIVNNTVYDPLQPEL,VIGIVNNTVYDPLQPE]
53 [EILDITPCSFGGVSVI,EILDITPCSFGGVS]
54 [NFRVQPTESIVRFPNITN,VQPTESIVRFPNITNL]
55 [WFVTQRNFYEPQII,TQRNFYEPQIITTDNTFV]
56 [CCSCGSCCKFDEDDSE,CKFDEDDS]
57 [CCSCLKGCCSCGSCCKFD,CCSCLKGCCSCGSCCK]
rslt_df = Pred_arg['ABCSeq'].isin(df_groups)
print(rslt_df.describe()) # comparison comes back all False!
count 253
unique 1
top False
freq 253
Name: ABCSeq,dtype: object
I know I'm missing something very simple, but some fresh eyes and guidance would really help me improve my practice.
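A likely reason for the all-False result: `isin` checks each sequence against the elements of `df_groups`, and each of those elements is a whole Python list, not an individual peptide string. A string never equals a list, so nothing matches. The sketch below, on toy data taken from the tables above, flattens the lists with `Series.explode()` before comparing. It assumes exact matches are wanted, and `preds` is a hypothetical stand-in for one `Pred_arg` column.

```python
import pandas as pd

# Toy version of df_groups: one list of peptides per cluster (values from the post).
df_groups = pd.Series(
    {20: ["IHVSGTNGTKRFDNPVLPFN", "IHVSGTNGT"], 41: ["VLPFNDGVYFASTE", "VYFASTEK"]},
    name="Peptide",
)
df_groups.index.name = "Cluster Number"
# Hypothetical stand-in for one column of Pred_arg.
preds = pd.Series(["IHVSGTNGT", "TTLDSKTQ"], name="BCEPred")

# A string compared with a whole list is never equal, even if the list contains it.
print("IHVSGTNGT" == df_groups.loc[20])

# explode() gives one peptide string per row, keeping the cluster as the index.
flat = df_groups.explode()
matches = preds.isin(flat)
print(matches.tolist())
```

Comparing against the original `df['Peptide']` column would avoid building the lists in the first place.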
Update
I seem to be able to compare cell contents using the approach below, although it is fairly crude
#comparing group to pred_arg
rslt_df1 = Pred_arg['ABCSeq'].isin(df['Peptide'])
rslt_df2 = Pred_arg['BCEPred'].isin(df['Peptide'])
rslt_df = df.assign(ABCSeq = rslt_df1,BCEPred = rslt_df2).reset_index()
concencus = Pred_arg['ABCSeq'].isin(df['Peptide']) & Pred_arg['BCEPred'].isin(df['Peptide'])
print(concencus.describe()) # working better
count 253
unique 2
top False
freq 232
dtype: object
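Two caveats about this update seem worth spelling out. First, `&` is True only when the ABCSeq and the BCEPred value in the same row are both found, yet those rows are paired only by position and most BCEPred cells are NaN. Second, the result has one entry per prediction (253) instead of one per cluster. Below is a hedged sketch of the reverse direction on toy data: flag each test peptide per predictor, then reduce per cluster. The column names `in_ABCPred`, `in_BCEPred`, `either` and `both` are invented for the example.

```python
import pandas as pd

# Toy stand-ins (values taken from the tables in the post).
Pred_arg = pd.DataFrame({
    "ABCSeq": ["AGCLIGAEHVNNSY", "ALEPLVDLPIGI", "YVGYLQPRTFLLKYNE"],
    "BCEPred": ["CKFDEDDS", "MDLEGKQ", None],
})
df = pd.DataFrame({
    "Cluster Number": [16, 16, 56, 56, 57],
    "Peptide": ["AGCLIGAEHVNNSY", "GCLIGAEHVNNSYECD",
                "CCSCGSCCKFDEDDSE", "CKFDEDDS", "CCSCLKGCCSCGSCCK"],
})

# Flag each test peptide per predictor (the reverse direction of the update code).
flags = pd.DataFrame({
    "Cluster Number": df["Cluster Number"],
    "in_ABCPred": df["Peptide"].isin(Pred_arg["ABCSeq"].dropna()),
    "in_BCEPred": df["Peptide"].isin(Pred_arg["BCEPred"].dropna()),
})

# One row per cluster: True if any peptide in the cluster was flagged.
per_cluster = flags.groupby("Cluster Number")[["in_ABCPred", "in_BCEPred"]].any()
# "|" means at least one predictor matched; "&" means both matched.
per_cluster["either"] = per_cluster["in_ABCPred"] | per_cluster["in_BCEPred"]
per_cluster["both"] = per_cluster["in_ABCPred"] & per_cluster["in_BCEPred"]
print(per_cluster)
```

Whether `either` or `both` is the right column depends on how "at least one value from ABCPred and BCEPred" is meant to be read.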
Thanks :)