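How to use Beautiful Soup to retrieve features embedded in tag text in an XML file
I am trying to parse a series of XML files and use Beautiful Soup, with the following functions, to extract certain values embedded in the tags: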
case_feature_keys = ['year','offenceCategory','offenceSubcategory']
person_feature_keys = ['gender','age','occupation','given']
outcome_key = 'verdictCategory'
case_feature_keys = ['year','given']

def get_person_features(trial_account,person_type: str):
    person_features = {}
    for key in person_feature_keys:
        matches = [x for x in trial_account.find_all(type=key) if person_type in
                   x.parent.attrs.get("type","")]
        if matches:
            person_features[person_type + "_" + key] = matches[0]
    return person_features

def process_trial_account(trial_account) -> dict:
    """
    Takes in a single account and returns a dictionary representing a row of the table.
    """
    case_features = {key: trial_account.find(type=key) for key in case_feature_keys}
    defendant_features = get_person_features(trial_account,'defendant')
    victim_features = get_person_features(trial_account,'victim')
    outcome = trial_account.find(type=outcome_key)
    features = {**case_features,**defendant_features,**victim_features,"outcome": outcome or {}}
    return {key: value.get("value") for key,value in features.items()}
The XML files look like this:
</persName>
.</p>
<p>
<persName id="t18100221-1-person52">
GEORGE
ROSS
<interp inst="t18100221-1-person52" type="surname" value="ROSS"/>
<interp inst="t18100221-1-person52" type="given" value="GEORGE"/>
<interp inst="t18100221-1-person52" type="gender" value="male"/>
</persName>
. Q. Were you in trade - A. Yes,as a <rs id="t18100221-1-viclabel3" type="occupation">merchant</rs>
<join result="persNameOccupation" targOrder="Y" targets="t18100221-1-victim51 t18100221-1-viclabel3"/>; I lived in <placeName id="t18100221-1-crimeloc4">New Basinghall-street</placeName>
<interp inst="t18100221-1-crimeloc4" type="placeName" value="New Basinghall-street"/>
<interp inst="t18100221-1-crimeloc4" type="type" value="crimeLocation"/>
<join result="offencePlace" targOrder="Y" targets="t18100221-1-off1 t18100221-1-crimeloc4"/>; the prisoner was my <rs id="t18100221-1-deflabel5" type="occupation">clerk</rs>
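The problem I am running into is that the occupation is not stored in a 'value=' attribute the way most of the other features are. As the sample above shows, the occupation is the text content of the tag itself, like this: 'id="t18100221-1-viclabel3" type="occupation">merchant', whereas gender, for example, is stored in an attribute: 'type="gender" value="male"/>'. So the functions above can retrieve gender, because it comes as a type/value pair, but they cannot retrieve occupation.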
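As a small illustration of that difference, here is a sketch on a trimmed copy of the sample (the variable names are only for illustration, and "html.parser" is used so that no extra parser needs to be installed):

```python
from bs4 import BeautifulSoup

# Trimmed copy of the sample: one attribute-style feature, one text-style feature.
sample = """
<interp inst="t18100221-1-person52" type="gender" value="male"/>
as a <rs id="t18100221-1-viclabel3" type="occupation">merchant</rs>
"""
soup = BeautifulSoup(sample, "html.parser")

gender_tag = soup.find(type="gender")
occupation_tag = soup.find(type="occupation")

# Gender lives in the "value" attribute, so .get("value") finds it.
gender_value = gender_tag.get("value")
# The occupation tag has no "value" attribute, so .get("value") returns None.
occupation_value = occupation_tag.get("value")
# The occupation is the tag's text content instead.
occupation_text = occupation_tag.get_text(strip=True)

print(gender_value, occupation_value, occupation_text)
```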
Does anyone know how I can retrieve the occupation for both the victim and the defendant?
Solution
To get the occupation and the person type, you can use the following example:
from bs4 import BeautifulSoup
html_data = """
<persName id="t18100221-1-person52">
GEORGE
ROSS
<interp inst="t18100221-1-person52" type="surname" value="ROSS"/>
<interp inst="t18100221-1-person52" type="given" value="GEORGE"/>
<interp inst="t18100221-1-person52" type="gender" value="male"/>
</persName>
. Q. Were you in trade - A. Yes,as a <rs id="t18100221-1-viclabel3" type="occupation">merchant</rs>
<join result="persNameOccupation" targOrder="Y" targets="t18100221-1-victim51 t18100221-1-viclabel3"/>; I lived in <placeName id="t18100221-1-crimeloc4">New Basinghall-street</placeName>
<interp inst="t18100221-1-crimeloc4" type="placeName" value="New Basinghall-street"/>
<interp inst="t18100221-1-crimeloc4" type="type" value="crimeLocation"/>
<join result="offencePlace" targOrder="Y" targets="t18100221-1-off1 t18100221-1-crimeloc4"/>; the prisoner was my <rs id="t18100221-1-deflabel5" type="occupation">clerk</rs>
"""
soup = BeautifulSoup(html_data,"html.parser")
for occupation in soup.select('[type="occupation"]'):
    id_ = occupation["id"]
    o = occupation.text
    person_type = "victim" if "vic" in id_ else "defendant"
    print("ID: {} Occupation: {} Person type: {}".format(id_,o,person_type))
Prints:
ID: t18100221-1-viclabel3 Occupation: merchant Person type: victim
ID: t18100221-1-deflabel5 Occupation: clerk Person type: defendant
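The substring check on the id works for this sample. As an alternative sketch, the join elements with result="persNameOccupation" in the sample appear to link a person id to an occupation label id through their targets attribute. The code below assumes that the first target is always the person and the second is always the label; that assumption rests only on the one complete join visible in the sample, so check it against your own files before relying on it.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the sample: one occupation label and the join that refers to it.
sample = """
as a <rs id="t18100221-1-viclabel3" type="occupation">merchant</rs>
<join result="persNameOccupation" targOrder="Y" targets="t18100221-1-victim51 t18100221-1-viclabel3"/>
"""
soup = BeautifulSoup(sample, "html.parser")

occupations = {}  # person id -> occupation text
for join in soup.find_all("join", result="persNameOccupation"):
    targets = join.get("targets", "").split()
    if len(targets) != 2:
        continue  # skip joins that do not match the assumed "person label" shape
    person_id, label_id = targets  # assumed order: person first, label second
    label = soup.find(id=label_id)
    if label is not None:
        occupations[person_id] = label.get_text(strip=True)

print(occupations)
```

Keying the result by person id rather than by label id should make it easier to merge the occupation with the other person features later on.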