How to one-hot encode a string with a custom vocabulary
I have a charset, defined as follows.
charset =set([ '$','^','#','(',')','-','.','/','1','2','3','4','5','6','7','=','Br','C','Cl','F','I','N','O','P','S','[2H]','[Br-]','[C@@H]','[C@@]','[C@H]','[C@]','[Cl-]','[H]','[I-]','[N+]','[N-]','[N@+]','[N@@+]','[NH+]','[NH2+]','[NH3+]','[N]','[Na+]','[O-]','[P+]','[S+]','[S-]','[S@+]','[S@@+]','[SH]','[Si]','[n+]','[n-]','[nH+]','[nH]','[o+]','[se]','\\','c','n','o','s','!','E'])
Based on this charset, I create char_to_int as follows.
char_to_int = dict((c,i) for i,c in enumerate(charset))
{'[nH]': 0, '[2H]': 1, '2': 2, 'N': 3, 'Cl': 4, 'c': 5, '$': 6, 'O': 7, '(': 8, '6': 9, 's': 10, '[S@+]': 11, '[C@@H]': 12, 'C': 13, '[nH+]': 14, '/': 15, '[NH+]': 16, '[Br-]': 17, '[Si]': 18, '4': 19, '[N@+]': 20, '[se]': 21, 'P': 22, '[SH]': 23, '[N+]': 24, '[N]': 25, '^': 26, '5': 27, '7': 28, 'n': 29, '!': 30, '\\': 31, '[n-]': 32, 'S': 33, '[NH3+]': 34, '#': 35, 'I': 36, '[O-]': 37, '1': 38, '[NH2+]': 39, '[S@@+]': 40, 'Br': 41, 'F': 42, '[Na+]': 43, 'E': 44, '[S-]': 45, '.': 46, ')': 47, '[C@]': 48, '=': 49, '3': 50, '-': 51, '[C@H]': 52, '[Cl-]': 53, '[I-]': 54, '[H]': 55, '[P+]': 56, '[S+]': 57, 'o': 58, '[N@@+]': 59, '[N-]': 60, '[n+]': 61, '[o+]': 62, '[C@@]': 63}
And int_to_char as follows.
int_to_char = dict((i,c) for i,c in enumerate(charset))
{0: '[nH]', 1: '[2H]', 2: '2', 3: 'N', 4: 'Cl', 5: 'c', 6: '$', 7: 'O', 8: '(', 9: '6', 10: 's', 11: '[S@+]', 12: '[C@@H]', 13: 'C', 14: '[nH+]', 15: '/', 16: '[NH+]', 17: '[Br-]', 18: '[Si]', 19: '4', 20: '[N@+]', 21: '[se]', 22: 'P', 23: '[SH]', 24: '[N+]', 25: '[N]', 26: '^', 27: '5', 28: '7', 29: 'n', 30: '!', 31: '\\', 32: '[n-]', 33: 'S', 34: '[NH3+]', 35: '#', 36: 'I', 37: '[O-]', 38: '1', 39: '[NH2+]', 40: '[S@@+]', 41: 'Br', 42: 'F', 43: '[Na+]', 44: 'E', 45: '[S-]', 46: '.', 47: ')', 48: '[C@]', 49: '=', 50: '3', 51: '-', 52: '[C@H]', 53: '[Cl-]', 54: '[I-]', 55: '[H]', 56: '[P+]', 57: '[S+]', 58: 'o', 59: '[N@@+]', 60: '[N-]', 61: '[n+]', 62: '[o+]', 63: '[C@@]'}
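One caveat about these mappings: a Python set has no guaranteed iteration order, and string hashing is randomized per process by default, so enumerating a set can assign different indices on different runs. A minimal sketch of one way to make the mapping reproducible, assuming it is acceptable to fix the order by sorting; the name `vocab` and the shortened token list are illustrative and not part of the original post:

```python
# Illustrative subset of the charset; the full list works the same way.
tokens = ['C', 'Cl', 'N', '[C@H]', '[C@@H]', '(', ')', '1', 'c', 'F']

# Sorting fixes the order, so every run assigns the same index to each token.
vocab = sorted(set(tokens))
char_to_int = {c: i for i, c in enumerate(vocab)}
int_to_char = {i: c for c, i in char_to_int.items()}

# The two mappings are inverses of each other.
assert all(int_to_char[char_to_int[c]] == c for c in vocab)
print(char_to_int)
```

Building int_to_char by inverting char_to_int, rather than enumerating the set a second time, also guarantees the two dictionaries stay consistent.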
I have a string that I want to convert to a one-hot encoding based on char_to_int and int_to_char.
string = 'N[C@H]1C[C@@H](N2Cc3nn4cccnc4c3C2)CC[C@@H]1c1cc(F)c(F)cc1F'
Is there an efficient way to convert string into one-hot vectors using the custom char_to_int and int_to_char?
Solution
import re

string = 'N[C@H]1C[C@@H](N2Cc3nn4cccnc4c3C2)CC[C@@H]1c1cc(F)c(F)cc1F'
items_list=[ '$','^','#','(',')','-','.','/','1','2','3','4','5','6','7','=','Br','C','Cl','F','I','N','O','P','S','[2H]','[Br-]','[C@@H]','[C@@]','[C@H]','[C@]','[Cl-]','[H]','[I-]','[N+]','[N-]','[N@+]','[N@@+]','[NH+]','[NH2+]','[NH3+]','[N]','[Na+]','[O-]','[P+]','[S+]','[S-]','[S@+]','[S@@+]','[SH]','[Si]','[n+]','[n-]','[nH+]','[nH]','[o+]','[se]','\\','c','n','o','s','!','E']
charset = set(items_list)
# Note: set order is arbitrary, so these indices may differ between runs.
char_to_int = dict((c,i) for i,c in enumerate(charset))
# Try longer tokens first so that 'Cl' is not matched as 'C'.
pattern = '|'.join(re.escape(item) for item in sorted(items_list, key=len, reverse=True))
# Split the string into vocabulary tokens.
tokens = re.findall(pattern,string)
# Map each token to its integer index.
x=[char_to_int[k] for k in tokens]
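Here, x is the list of integer indices of the tokens in string; each index gives the position of the 1 in that token's one-hot vector.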
x=[3,52,38,13,12,8,3,2,5,50,29,19,47,42,42]
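The list of indices can be expanded into actual one-hot vectors, one row per token. The following is a minimal sketch in plain Python, not the answerer's code: it assumes the vocabulary is indexed by list position (so the indices differ from those shown above), matches longer tokens first, and uses the illustrative names `indices`, `one_hot`, and `decoded`:

```python
import re

string = 'N[C@H]1C[C@@H](N2Cc3nn4cccnc4c3C2)CC[C@@H]1c1cc(F)c(F)cc1F'
items_list = ['$','^','#','(',')','-','.','/','1','2','3','4','5','6','7','=',
              'Br','C','Cl','F','I','N','O','P','S','[2H]','[Br-]','[C@@H]',
              '[C@@]','[C@H]','[C@]','[Cl-]','[H]','[I-]','[N+]','[N-]','[N@+]',
              '[N@@+]','[NH+]','[NH2+]','[NH3+]','[N]','[Na+]','[O-]','[P+]',
              '[S+]','[S-]','[S@+]','[S@@+]','[SH]','[Si]','[n+]','[n-]','[nH+]',
              '[nH]','[o+]','[se]','\\','c','n','o','s','!','E']

# Assumption: index tokens by list position so results are reproducible.
char_to_int = {c: i for i, c in enumerate(items_list)}
int_to_char = {i: c for c, i in char_to_int.items()}

# Longest tokens first, so 'Cl' is tried before 'C'.
pattern = '|'.join(re.escape(t)
                   for t in sorted(items_list, key=len, reverse=True))
tokens = re.findall(pattern, string)

# re.findall silently skips characters outside the vocabulary,
# so check that nothing was dropped.
assert ''.join(tokens) == string

indices = [char_to_int[t] for t in tokens]

# One row per token, one column per vocabulary entry.
one_hot = [[1 if j == idx else 0 for j in range(len(items_list))]
           for idx in indices]

# Decoding: the position of the 1 in each row recovers the token.
decoded = ''.join(int_to_char[row.index(1)] for row in one_hot)
assert decoded == string
```

If NumPy is available, the same matrix could be built more compactly with np.eye(len(items_list), dtype=int)[indices].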