How to replace/remove all URLs in a string in Python
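I'm trying to replace every URL in a string with an empty string. The string below is JSON text, not a parsed object, and I'm struggling to catch all the variations.
If you look at https://regex101.com/r/r6tQ3B/2/, you'll notice that my regex also removes the trailing ", and it doesn't really capture the shorthand "t.co" links or URLs in the middle of the text. Here is my Python script: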
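import re

# dataFiles is assumed to be a list of file names defined elsewhere
for filename in dataFiles:
    path = 'data/' + filename
    with open(path) as r:
        text = r.read()  # read the whole file before substituting
    text = re.sub(r'https?:\/\/\S*', '"', text, flags=re.MULTILINE)
    with open(path, "w") as w:
        w.write(text)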
Test: https://regex101.com/r/r6tQ3B/1/
{
"created_at":"Fri Aug 12 10:04:00 +0000 2016","id":764039724818272256,"text":"@theblaze https://t.com/TY9DlZ584c @realDonaldTrump https://t.com/TY9DlZ584c","in_reply_to_screen_name":"theblaze","source":"<a href=\"http://twitter.com/download/iphone\" rel=\"nofollow\">Twitter for iPhone</a>","user":{
"id":366636488,"id_str":"366636488","name":"GIL DUPUY","screen_name":"DUPUY77","location":"Miami","url":"http://ggm-dupuy.com","description":"Fashion photographer,love action and adventure,care for the less fortunate,don't tolerate any kind of racism regardless of race or religion","verified":false,"followers_count":186,"friends_count":446,"utc_offset":null,"time_zone":null,"lang":"en","default_profile_image":false,"following":null,"notifications":null
},"geo":null,"coordinates":null,"place":{
"name":"Frontenac","full_name":"Frontenac,MO","country_code":"US","country":"United States","attributes":{
}
},"retweet_count":0,"favorite_count":0,"extended_entities":{
"media":[
{
"id":764039718237409281,"id_str":"764039718237409281","indices":[
27,50
],"media_url":"http://pbs.twimg.com/media/CppqE1_UkAE2qFj.jpg","media_url_https":"https://pbs.twimg.com/media/CppqE1_UkAE2qFj.jpg","url":"https://t.com/TY9DlZ584c","display_url":"pic.twitter.com/TY9DlZ584c","expanded_url":"http://twitter.com/DUPUY77/status/764039724818272256/photo/1","type":"photo","sizes":{
"medium":{
"w":640,"h":1136,"resize":"fit"
},"large":{
"w":640,"thumb":{
"w":150,"h":150,"resize":"crop"
},"small":{
"w":383,"h":680,"resize":"fit"
}
}
}
]
},"favorited":false,"retweeted":false,"possibly_sensitive":false,"lang":"und"
}
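To make the problem described above concrete, here is a minimal sketch that runs the question's pattern against a short fragment of the sample. The variable names are only illustrative. Because \S* keeps matching until the next whitespace character, the match runs past the closing " and into the following JSON keys.

```python
import re

# A short fragment of the sample above (one value shortened for brevity).
fragment = '"url":"http://ggm-dupuy.com","description":"Fashion photographer"'

# The pattern from the question: \S* only stops at whitespace,
# so the closing quote and the next key are swallowed too.
match = re.search(r'https?:\/\/\S*', fragment)
print(match.group(0))
# prints: http://ggm-dupuy.com","description":"Fashion
```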
Solution
Try this pattern: \s?(https?:\/\/[^\\\s"]*)
It's not very clean, but it works for your example.
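As a rough illustration, the pattern could be applied with re.sub as sketched below. The names URL_PATTERN and strip_urls are placeholders, not part of the answer, and the sample string is a shortened piece of the JSON from the question.

```python
import re

# Pattern from the answer: an optional leading whitespace character,
# then http:// or https://, then everything up to the next backslash,
# whitespace character, or double quote.
URL_PATTERN = re.compile(r'\s?(https?:\/\/[^\\\s"]*)')

def strip_urls(text):
    """Return text with every match of URL_PATTERN removed."""
    return URL_PATTERN.sub('', text)

sample = ('"text":"@theblaze https://t.com/TY9DlZ584c '
          '@realDonaldTrump https://t.com/TY9DlZ584c",'
          '"url":"http://ggm-dupuy.com"')
print(strip_urls(sample))
# prints: "text":"@theblaze @realDonaldTrump","url":""
```

Because the character class stops at a backslash, the escaped quotes in the "source" field (href=\"...\") are left intact. Links without a scheme, such as pic.twitter.com/..., are not matched by this pattern.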
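Alternatively, remove all URLs enclosed in "", all URLs not enclosed in "", and the URLs starting with pic.twitter (these seem to be the only ones without http(s)), assuming that URLs contain no whitespace or " characters.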
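The code for this second suggestion did not survive in the post, so the following is only a sketch of what the description seems to mean. The pattern and the names URL_RE and remove_urls are assumptions, as is the extra exclusion of backslashes, which keeps JSON-escaped quotes (\") intact.

```python
import re

# Hypothetical pattern, not the answerer's original:
#   https?://       links with an explicit scheme, quoted or not
#   pic\.twitter\.  scheme-less links starting with pic.twitter
# followed by any run of characters that are not whitespace, a double
# quote, or a backslash (the stated assumption, plus the backslash).
URL_RE = re.compile(r'(?:https?://|pic\.twitter\.)[^\s"\\]*')

def remove_urls(text):
    """Return text with every URL_RE match replaced by an empty string."""
    return URL_RE.sub('', text)

sample = ('"url":"https://t.com/TY9DlZ584c",'
          '"display_url":"pic.twitter.com/TY9DlZ584c",'
          '"text":"see https://t.com/TY9DlZ584c now"')
print(remove_urls(sample))
```

Unlike the first pattern, this one does not consume the whitespace before a URL, so a link in the middle of a sentence leaves the spaces on both sides of it in place.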