How to append text to groups in an HTML document based on a condition
I am parsing a large html document. I have used groups to "group" the text, with \n\n as the separator. The whole text sits inside the document's <font> </font> tags.
Each group has 5 fields: Serial#........., Cust#..........., Customer Name..., BILL TO NO NAME., DATE......
I need to take the Cust#........... of each "group" and compare it with every other group in the list to find duplicate Cust#........... values.
If a duplicate is found, I need to append the BILL TO NO NAME. value to the group and remove the duplicate Cust#........... entry.
Example html:
Serial#......... 12345678974566321\nCust#........... 123456\nCustomer Name... Humpfrey Bear\nBILL TO NO NAME. Bill To: 001166 - Some Company\nDATE...... 01/01/00\n\n'Serial#......... sgfdsfd546545645\nCust#........... 123456\nCustomer Name... Humpfrey Bear\nBILL TO NO NAME. Bill To: 0165487 - Some Other Company\nDATE...... 01/01/00\n\n'Serial#......... Jgfdhdgfhgfdh4545\nCust#........... 88483\nCustomer Name... John Smith\nBILL TO NO NAME. Bill To: 0146897 - Some Company\nDATE...... 01/01/00\n\n'Serial#......... JF2SJads5dsafdsaf\nCust#........... 015648\nCustomer Name... Eric Cantona\nBILL TO NO NAME. Bill To: 8888154 - Man Utd\nDATE...... 01/01/00\n\n'Serial#......... JdsfrfdsgHG091797\nCust#........... 015648\nCustomer Name... Eric Cantona\nBILL TO NO NAME. Bill To: 9876524 - Big Big Company\nDATE...... 01/01/00\n\n'
The output I need is:
Serial#......... 12345678974566321\nCust#........... 123456\nCustomer Name... Humpfrey Bear\nBILL TO NO NAME. Bill To: 001166 - Some Company Bill To: 0165487 - Some Other Company\nDATE...... 01/01/00\n\n',Serial#......... JF2SJads5dsafdsaf\nCust#........... 015648\nCustomer Name... Eric Cantona\nBILL TO NO NAME. Bill To: 8888154 - Man Utd Bill To: 9876524 - Big Big Company\nDATE...... 01/01/00\n\n'
My output omits the duplicates based on Cust#...... xxxxx, but I just wanted to make my expected result clearer. I can clean up the duplicates later.
What I have so far is only an abbreviated version; the rest of my code is not relevant here.
Solution
This is not exactly the expected output from your question, but it may get you close enough:
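bills = """[your html above]"""
# Mark every "Serial#" so the text can be split into one chunk per group
groups = bills.replace('Serial#', 'xxxSerial#').split('xxx')
cust_nums = []  # Cust# values seen so far
ng = []         # groups whose Cust# had already been seen
for group in groups[1:]:
    items = group.split('\n')
    cn = items[1].split('. ')[1]    # customer number
    bill = items[3].split('. ')[1]  # the "Bill To: ..." value
    if cn not in cust_nums:
        cust_nums.append(cn)
    else:
        # Repeated Cust#: append the Bill To text to this group's own line
        items[3] += ' ' + bill
        ng.append(items[:-2])       # drop the two trailing leftover items
ng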
Output:
[['Serial#......... sgfdsfd546545645',
  'Cust#........... 123456',
  'Customer Name... Humpfrey Bear',
  'BILL TO NO NAME. Bill To: 0165487 - Some Other Company Bill To: 0165487 - Some Other Company',
  'DATE...... 01/01/00'],
 ['Serial#......... JdsfrfdsgHG091797',
  'Cust#........... 015648',
  'Customer Name... Eric Cantona',
  'BILL TO NO NAME. Bill To: 9876524 - Big Big Company Bill To: 9876524 - Big Big Company',
  'DATE...... 01/01/00']]
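The output above repeats each duplicate's own Bill To value rather than adding it to the first group with the same Cust#. The following is a minimal sketch of the merge the question describes, not the answer author's code. It assumes the text has real newlines, a blank line between groups, and the five fields in the order shown in the question; `SAMPLE`, `field_value` and `merge_bill_to` are hypothetical names introduced for illustration.

```python
import re

# Hypothetical sample in the shape shown in the question (stray quotes omitted).
SAMPLE = """\
Serial#......... 12345678974566321
Cust#........... 123456
Customer Name... Humpfrey Bear
BILL TO NO NAME. Bill To: 001166 - Some Company
DATE...... 01/01/00

Serial#......... sgfdsfd546545645
Cust#........... 123456
Customer Name... Humpfrey Bear
BILL TO NO NAME. Bill To: 0165487 - Some Other Company
DATE...... 01/01/00

Serial#......... Jgfdhdgfhgfdh4545
Cust#........... 88483
Customer Name... John Smith
BILL TO NO NAME. Bill To: 0146897 - Some Company
DATE...... 01/01/00

Serial#......... JF2SJads5dsafdsaf
Cust#........... 015648
Customer Name... Eric Cantona
BILL TO NO NAME. Bill To: 8888154 - Man Utd
DATE...... 01/01/00

Serial#......... JdsfrfdsgHG091797
Cust#........... 015648
Customer Name... Eric Cantona
BILL TO NO NAME. Bill To: 9876524 - Big Big Company
DATE...... 01/01/00
"""


def field_value(line):
    """Return the text after the dotted label, e.g. '123456' for the Cust# line."""
    return line.partition(". ")[2].strip()


def merge_bill_to(text):
    """Keep the first group per Cust# and append later Bill To values to it."""
    kept = {}  # Cust# -> field lines of the first group seen (insertion-ordered)
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = [ln.strip() for ln in block.splitlines() if ln.strip()]
        if len(lines) < 5:
            continue  # not a complete group; skip it
        cust = field_value(lines[1])  # assumption: Cust# is the 2nd field
        bill = field_value(lines[3])  # assumption: BILL TO is the 4th field
        if cust in kept:
            kept[cust][3] += " " + bill  # duplicate: merge, do not keep the group
        else:
            kept[cust] = lines
    return list(kept.values())


merged = merge_bill_to(SAMPLE)
for group in merged:
    print(group[3])
```

Unlike the expected output in the question, this sketch also keeps customers that appear only once (John Smith here); they can be filtered out afterwards if only merged groups are wanted.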
If your goal is to identify duplicate Cust# values, here is one approach:
import re

sep = "Serial#"

def get_cust_number(entry):
    """Isolate the customer ID number."""
    pattern = re.compile(r"Cust#[^0-9]+(\d+)")
    return pattern.findall(entry)[0]

# data is the text being parsed.
# Build a list of dicts; each element has the full entry and its customer id.
parsed = [{"full_entry": sep + x, "cust_id": get_cust_number(x)}
          for x in data.split(sep) if x]

# example parsed element
{'full_entry': "Serial#......... 12345678974566321\nCust#........... 123456\nCustomer Name... Humpfrey Bear\nBILL TO NO NAME. Bill To: 001166 - Some Company\nDATE...... 01/01/00\n\n'", 'cust_id': '123456'}
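Note that `pattern.findall(entry)[0]` raises an `IndexError` for any entry without a Cust# line. As a hedged illustration, the variant below returns `None` in that case instead; `get_cust_number_or_none` is a hypothetical name and not part of the answer above.

```python
import re

# Same idea as the pattern above: "Cust#", then non-digits, then the digits.
CUST_RE = re.compile(r"Cust#\D+(\d+)")


def get_cust_number_or_none(entry):
    """Return the customer number as a string, or None if none is found."""
    match = CUST_RE.search(entry)
    return match.group(1) if match else None


entry = (
    "Serial#......... JF2SJads5dsafdsaf\n"
    "Cust#........... 015648\n"
    "Customer Name... Eric Cantona\n"
)
print(get_cust_number_or_none(entry))             # 015648
print(get_cust_number_or_none("no customer id"))  # None
```

The number is kept as a string on purpose, so a leading zero such as the one in 015648 is preserved when values are compared.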
Now use a set to find the repeated Cust# values and store the duplicate entries in dupes:
seen = set()
dupes = list()

# Iterate over each entry and store its cust_id in the seen set.
# If the cust_id is already in seen, it is a dupe; store the entry in dupes.
for entry in parsed:
    if entry['cust_id'] in seen:
        dupes.append(entry['full_entry'])
    else:
        seen.add(entry['cust_id'])

dupes
["Serial#......... sgfdsfd546545645\nCust#........... 123456\nCustomer Name... Humpfrey Bear\nBILL TO NO NAME. Bill To: 0165487 - Some Other Company\nDATE...... 01/01/00\n\n'",
 "Serial#......... JdsfrfdsgHG091797\nCust#........... 015648\nCustomer Name... Eric Cantona\nBILL TO NO NAME. Bill To: 9876524 - Big Big Company\nDATE...... 01/01/00\n\n'"]
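I am not sure why your output has the second "Humpfrey Bear" entry but the first "Eric Cantona" entry. This answer produces a list containing only the duplicates (meaning the first entry for each Cust# is not included).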
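If the first occurrence should be kept alongside its duplicates, one option is to group every entry by its Cust# and keep only the customers that appear more than once. This is a sketch under the same assumptions as the answer above (entries start with "Serial#" and contain a "Cust#" line); `group_by_customer` and the shortened `data` sample are hypothetical.

```python
import re
from collections import defaultdict


def group_by_customer(text, sep="Serial#"):
    """Map each repeated Cust# to the list of all full entries that carry it."""
    groups = defaultdict(list)
    for chunk in text.split(sep):
        match = re.search(r"Cust#\D+(\d+)", chunk)
        if match:  # skips the empty chunk before the first separator
            groups[match.group(1)].append(sep + chunk)
    # Keep only customers that occur more than once.
    return {cust: entries for cust, entries in groups.items() if len(entries) > 1}


# Shortened, made-up sample with two entries for customer 123456.
data = (
    "Serial#......... 12345678974566321\nCust#........... 123456\n\n"
    "Serial#......... Jgfdhdgfhgfdh4545\nCust#........... 88483\n\n"
    "Serial#......... sgfdsfd546545645\nCust#........... 123456\n\n"
)

repeated = group_by_customer(data)
for cust, entries in repeated.items():
    print(cust, len(entries))
```

Each value in the returned dict lists the entries in their original order, so the first element is the entry to keep and the rest are the duplicates to merge or drop.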