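How to extract the letters that follow a $ sign using Pandas
I am trying to extract data that contains the $ sign from a spreadsheet.
I have already narrowed the data down to the one column that contains it, but what I want to do is extract every symbol that follows a $ sign.
For example: $AAPL, $LOW, $TSLA and so on from the whole dataset, but I do not need or want $1000, $600 and so on. I only want the letters: they may be followed by a period or a space, but the characters a-z are all I want to get.
I have not managed to extract them completely and my code was getting messy, so I am providing the code that brings back the data so you can see it for yourself. I am using Jupyter Notebook.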
import mysql.connector
import pandas
googleSheedID = '15fhpxqWDRWkNtEFhi9bQyWUg8pDn4B-R2N18s1xFYTU'
worksheetName = 'Sheet1'
URL = 'https://docs.google.com/spreadsheets/d/{0}/gviz/tq?tqx=out:csv&sheet={1}'.format(
    googleSheedID, worksheetName
)
df = pandas.read_csv(URL)
del df['DATE']
del df['USERNAME']
del df['LINK']
del df['LINK2']
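# Keep only the rows whose tweet does not contain "RT"; the result must be assigned back to df.
df = df[~df["TWEET"].str.contains("RT", na=False)]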
print(df)
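Solution
I am not sure I understood correctly what you want, but the following code returns every token that starts at a $ and runs up to the next space.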
import mysql.connector
import pandas
googleSheedID = '15fhpxqWDRWkNtEFhi9bQyWUg8pDn4B-R2N18s1xFYTU'
worksheetName = 'Sheet1'
URL = 'https://docs.google.com/spreadsheets/d/{0}/gviz/tq?tqx=out:csv&sheet={1}'.format(
    googleSheedID, worksheetName
)
df = pandas.read_csv(URL)
del df['DATE']
del df['USERNAME']
del df['LINK']
del df['LINK2']
unique_results = []
for tweet in df['TWEET']:
    # Skip retweets (any tweet that contains 'RT').
    if 'RT' in tweet:
        continue
    for j in range(len(tweet) - 1):
        # Only consider a '$' that is not followed by a digit (skips $1000, $600).
        if tweet[j] == '$' and not tweet[j + 1].isdigit():
            start = j
            # Default to the end of the tweet in case no space or newline follows.
            end = len(tweet)
            for k in range(start, len(tweet)):
                if tweet[k] == ' ' or tweet[k] == '\n':
                    end = k
                    break
            results = tweet[start:end]
            if results not in unique_results:
                unique_results.append(results)
print(unique_results)
Edit: fixed the code.
The output is:
['$GME','$SNDL','$FUBO','$AMC','$LOTZ','$CLOV','$USAS','$AIHS','$PLM','$LODE','$TTNP','$IMTE','','$NAK.','$NAK','$CRBP','$AREC','$NTEC','$NTN','$CBAT','$ZYNE','$HOFV','$GWPH','$KERN','$ZYNE,','$AIM','$WWR','$CARV','$VISL','$SINO','$NAKD','$GRPS','$RSHN','$MARA','$RIOT','$NXTD','$LAC','$BTC','$ITRM','$CHCI','$VERU','$GMGI','$WNBD','$KALV','$EGOC','$Veru','$MRNA','$PVDG','$DROP','$EFOI','$LLIT','$AUVI','$CGIX','$RELI','$TLRY','$ACB','$TRCH','$TRCH.','$TSLA','$cciv','$sndl','$ANCN','$TGC','$tlry','$KXIN','$AMZN','$INFI','$LMND','$COMS','$VXX','$LEDS','$ACY','$RHE','$SINO.','$GPL','$SPCE','$OXY','$CLSN','$FTFT','$FTFT.....','$BIEI','$EDRY','$CLEU','$FSR','$SPY','$NIO','$LI','$XPEV,','$UL','$RGLG','$SOS','$QS','$THCB','$SUNW','$MICT','$BTC.X','$T','$ADOM','$EBON','$CLPS','$HIHO','$ONTX','$WNRS','$SOLO','$Mara,','$Riot,','$SOS,','$GRNQ,','$RCON,','$FTFT,','$BTBT,','$MOGO,','$EQOS,','$CCNC','$CCIV','$tsla','$fsr','$wkhs','$ride','$nio','$NETE','$DPW','$MOSY','$SSNT','$PLTR','$GSAH:','$EQOS','$MTSL','$CMPS','$CHIF','$MU','$HST','$SNAP','$CTXR','$acy','$FUBOTV','$DPBE','$HYLN','$SPOT','$NSAV','$HYLN,','$aabb','$AAL','$BBIG','$ITNS','$CTIB','$AMPG','$ZI','$NUVI','$INTC','$TSM','$AAPL','$MRJT','$RCMT','$IZEA','$BBIG,','$ARKK','$LIAUTO','$MARA:','$SOS:','$XOM','$ET','$BRNW','$SYPR','$LCID','$QCOM','$FIZZ','$TRVG','$SLV','$RAFA','$TGCTengasco,','$BYND','$XTNT','$NBY','$sos','$KMPH','$','$(0.60)','$(0.64)','$BIDU','$rkt','$GTT','$CHUC','$CLF','$INUV','$RKT','$COST','$MDCN','$HCMC','$UWMC','$riot','$OVID','$HZON','$SKT','$FB','$PLUG','$BA','$PYPL','$PSTH.','$NVDA','$AMPG.','$aese.','$spy','$pltr','$MSFT','$AMD','$QQQ','$LTNC','$WKHS','$EYES','$RMO','$GNUS','$gme','$mdmp','$kern','$AEI','$BABA','$YALA','$TWTR','$WISH','$GE','$ORCL','$JUPW','$TMBR','$SSYS','$NKE','$AMPGAmpliTech','$$$','$$','$RGLS','$HOGE','$GEGR','$nclh','$IGAC','$FCEL','$TKAT','$OCG','$YVR','$IPDN.','$IPDN',"$SINO's",'$WIMI','$TKAT.','$BAC','$LZR','$LGHL','$F','$GM','$KODK','$atvk','$ATVK','$AIKI','$DS','$AI','$WTII','$oxy','$DYAI','$DSS',
'$ZKIN','$MFH','$WKEY','$MKGI','$DLPN','$PSWW','$SNOW','$ALYA','$AESE','$CSCW','$CIDM','$HOFV.','$LIVX','$FNKO','$HPR','$BRQS','$GIGM','$APOP','$EA','$CUEN','$TMBR?','$FLNT,','$APPS','$METX','$STG','$WSRC','$AMHC','$VIAC','$MO','$UAVL','$CS','$MDT','$GYST','$CBBT','$ASTC','$AACG','$WAFU.','$WAFU','$CASI','$mmmw','$MVIS','$SNOA','$C','$KR','$EWZ','$VALE','$EWZ.','$CSCO','$PINS','$XSPA','$VPRX','$CEMI','$M','$BMRA','$SPX','$akt','$SURG','$NCLH','$ARSN','$ODT','$SGBX','$CRWD.','$TGRR','$PENN','$BB','$XOP','$XL','$FREQ','$IDRA','$DKNG','$COHN','$ADHC','$ISWH','$LEGO','$OTRA','$NAAC','$HCAR','$PPGH','$SDAC','$PNTM','$OUST','$IO','$HQGE','$HENC','$KYNC','$ATNF','$BNSO','$HDSN','$AABB','$SGH','$BMY','$VERY','$EARS','$ROKU','$PIXY','$APRE','$SFET','$SQ','$EEIQ','$REDU','$CNWT','$NFLX','$RGBPP','$RGBP','$SHOP','$VITL','$RAAS','$CPNG','$JKS','$COMP','$NAFS']
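The output above still contains trailing punctuation ('$NAK.', '$ZYNE,'), case variants ('$sndl' and '$SNDL') and non-symbols ('', '$', '$(0.60)'). One possible way to tidy such a list is sketched below. The helper name `clean_symbols` is hypothetical, and the sample is just a handful of entries copied from the output above; upper-casing every symbol is an assumption about what counts as a duplicate.

```python
import re


def clean_symbols(tokens):
    """Keep the letters directly after '$', upper-case them, drop duplicates in order."""
    seen = {}
    for token in tokens:
        # Match only at the start of the token: a '$' followed by letters.
        match = re.match(r"\$([A-Za-z]+)", token)
        if match:
            # A dict keeps insertion order, so it doubles as an ordered set.
            seen["$" + match.group(1).upper()] = None
    return list(seen)


# A few entries taken from the output above.
sample = ["$NAK.", "$NAK", "$ZYNE,", "$sndl", "$SNDL", "", "$",
          "$(0.60)", "$SINO's", "$BTC.X", "$$$"]
print(clean_symbols(sample))
# ['$NAK', '$ZYNE', '$SNDL', '$SINO', '$BTC']
```

Note that this cannot separate text that was glued onto a symbol in the tweet itself, such as '$TGCTengasco,', and it shortens '$BTC.X' to '$BTC', which may or may not be what you want.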
You can use a regular expression:
\$[a-zA-Z]+
Run the code below after reading df.
import re

# Create empty lists for the results
results = []
final_results = []
for row_num in range(len(df['TWEET'])):
    string_to_check = df['TWEET'][row_num]
    # Check for RT at the beginning of the string only.
    # `if 'RT' in df["TWEET"][row_num]` would have found "RT" anywhere in the string.
    if re.match(r"^RT", string_to_check):
        continue
    else:
        # Find every $ that is directly followed by one or more letters.
        # This finds $FOOBAR but not $600 or $6FOOBAR; from $FOO6BAR it returns only $FOO.
        rel_text_l = re.findall(r"\$[a-zA-Z]+", string_to_check)
        # Check for an empty list
        if rel_text_l:
            # Add the elements of the list to the other list directly
            results.extend(rel_text_l)
# Convert to a set and back to a list to remove duplicates
final_results = list(set(results))
print(results)
print(final_results)
The result is:
['$GME','$FOOBAR','$FOO','$GME','$GOBLIN','$LTNC']
['$LTNC','$FUBO']
Note that the duplicate $GME is removed in final_results.
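To see what the pattern does and does not match, here is a small sketch on made-up tweets (the sample strings are hypothetical, not taken from the spreadsheet). It also tries a stricter variant with a trailing `\b`, which is my own assumption about what you might prefer: it rejects symbols that run straight into a digit instead of cutting them short.

```python
import re

# Hypothetical sample tweets, made up for illustration.
tweets = [
    "Buying $AAPL and $TSLA. Target $1000",
    "$FOO6BAR $6FOOBAR $LOW, $aapl",
]

loose = re.compile(r"\$[a-zA-Z]+")     # the pattern from this answer
strict = re.compile(r"\$[a-zA-Z]+\b")  # assumption: also require a word boundary after the letters

for tweet in tweets:
    print(loose.findall(tweet), strict.findall(tweet))
# ['$AAPL', '$TSLA'] ['$AAPL', '$TSLA']
# ['$FOO', '$LOW', '$aapl'] ['$LOW', '$aapl']
```

With the plain pattern, $FOO6BAR yields $FOO; with the word boundary it is skipped entirely. Trailing periods and commas are left out of the match in both cases.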
If you do not need to filter out the tweets that start with RT, all of this can be done in a single line of code.
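# Join the tweets explicitly: str(df['TWEET']) is only the printed summary, which pandas may truncate.
direct_result = list(set(re.findall(r"\$[a-zA-Z]+", " ".join(df['TWEET'].astype(str)))))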
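If you would rather stay inside pandas, the same idea can probably be expressed with the vectorized string methods. This is only a sketch: the small DataFrame below is a hypothetical stand-in for the spreadsheet's TWEET column, and upper-casing the symbols is an assumption about how you want duplicates handled.

```python
import pandas as pd

# Hypothetical sample standing in for the spreadsheet's TWEET column.
df = pd.DataFrame({"TWEET": [
    "Buying $AAPL and $TSLA. Target $1000",
    "RT @someone: $GME to the moon",
    "Watching $LOW, $aapl and $TSLA",
]})

# Drop retweets: keep the rows that do not start with "RT".
not_retweet = ~df["TWEET"].str.startswith("RT", na=False)

tickers = (
    df.loc[not_retweet, "TWEET"]
    .str.findall(r"\$([A-Za-z]+)")  # one list of letter-only symbols per tweet
    .explode()                      # one row per symbol
    .dropna()                       # tweets without any symbol yield NaN
    .str.upper()                    # treat $aapl and $AAPL as the same symbol
    .unique()
    .tolist()
)
print(tickers)
# ['AAPL', 'TSLA', 'LOW']
```

Because the pattern uses a capture group, the $ itself is not part of the result; drop the parentheses if you want to keep it.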