Extracting the postal code from a full-address column of a DataFrame with regex str.extract and adding it as a new column in pandas

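I have a DataFrame with a column holding full addresses, and I need to create a separate column in the same DataFrame that contains only the 5-digit postal code (which starts with 7). Some addresses may be empty, or may not contain a postal code at all.

How can I split the column to get the postal code? Postal codes start with 7; for example, 76000 is the postal code at index 0.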

MedicalCenters["Postcode"][0]
Location(75,Avenida Corregidora,Centro,Delegación Centro Histórico,Santiago de Querétaro,Municipio de Querétaro,Querétaro,76000,México,(20.5955795,-100.39274225,0.0))

Sample data

    Venue         Venue Latitude Venue Longitude Venue Category Address
0 Lab. Corregidora  20.595621   -100.392677      Medical Center Location(75,0.0))

I tried using a regular expression, but I get an error:

# get zipcode from full address
import re 
MedicalCenters['Postcode'] = MedicalCenters['Address'].str.extract(r'\b\d{5}\b',expand=False)

Error

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-185-84c21a29d484> in <module>
      1 # get zipcode from full address
      2 import re
----> 3 MedicalCenters['Postcode'] = MedicalCenters['Address'].str.extract(r'\b\d{5}\b',expand=False)

~/opt/anaconda3/lib/python3.7/site-packages/pandas/core/strings.py in wrapper(self,*args,**kwargs)
   1950                 )
   1951                 raise TypeError(msg)
-> 1952             return func(self,**kwargs)
   1953 
   1954         wrapper.__name__ = func_name

~/opt/anaconda3/lib/python3.7/site-packages/pandas/core/strings.py in extract(self,pat,flags,expand)
   3037     @forbid_nonstring_types(["bytes"])
   3038     def extract(self,flags=0,expand=True):
-> 3039         return str_extract(self,flags=flags,expand=expand)
   3040 
   3041     @copy(str_extractall)

~/opt/anaconda3/lib/python3.7/site-packages/pandas/core/strings.py in str_extract(arr,expand)
   1010         return _str_extract_frame(arr._orig,flags=flags)
   1011     else:
-> 1012         result,name = _str_extract_noexpand(arr._parent,flags=flags)
   1013         return arr._wrap_result(result,name=name,expand=expand)
   1014 

~/opt/anaconda3/lib/python3.7/site-packages/pandas/core/strings.py in _str_extract_noexpand(arr,flags)
    871 
    872     regex = re.compile(pat,flags=flags)
--> 873     groups_or_na = _groups_or_na_fun(regex)
    874 
    875     if regex.groups == 1:

~/opt/anaconda3/lib/python3.7/site-packages/pandas/core/strings.py in _groups_or_na_fun(regex)
    835     """Used in both extract_noexpand and extract_frame"""
    836     if regex.groups == 0:
--> 837         raise ValueError("pattern contains no capture groups")
    838     empty_row = [np.nan] * regex.groups
    839 

ValueError: pattern contains no capture groups

time: 39.5 ms

Solution

You need to add parentheses so that the pattern contains a capture group, which str.extract requires:

MedicalCenters['Address'].str.extract(r"\b(\d{5})\b")
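Since the question says the postal codes start with 7 and that some addresses are empty, a slightly stricter pattern may be worth trying. The following is only a sketch: the DataFrame `df` and its three sample rows are made up for illustration, and `7\d{4}` is a narrowing of the pattern above, not something taken from the original answer.

```python
import pandas as pd

# Hypothetical sample data: a full address, an address with no postal code,
# and a missing address.
df = pd.DataFrame({
    "Address": [
        "75,Avenida Corregidora,Centro,Santiago de Querétaro,Querétaro,76000,México",
        "Calle sin código,Querétaro,México",
        None,
    ]
})

# The parentheses form the capture group that str.extract needs.
# 7\d{4} only matches five-digit numbers that begin with 7.
# expand=False returns a Series, which can be assigned to a single column.
df["Postcode"] = df["Address"].str.extract(r"\b(7\d{4})\b", expand=False)

print(df["Postcode"])
```

Rows without a match, including the missing address, end up as missing values in the new column instead of raising an error.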

Alternatively, you can split the string first, which makes the postal code easier to match:

address = '75,Avenida Corregidora,Centro,Delegación Centro Histórico,Santiago de Querétaro,Municipio de Querétaro,Querétaro,76000,México,(20.5955795,-100.39274225,0.0'

matches = list(filter(lambda x: x.startswith('7') and len(x) == 5,address.split(','))) # ['76000']

So you can populate the DataFrame like this:

df['postcode'] = df['address'].apply(lambda address: list(filter(lambda x: x.startswith('7') and len(x) == 5,address.split(',')))[0])
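Indexing the filtered list with `[0]` raises an IndexError when no field qualifies, and the question mentions that some addresses are empty or have no postal code. A more defensive variant of the split approach could look like the sketch below. The helper name `first_postcode` and the sample rows are hypothetical, and the extra `isdigit()` check is an addition to the original filter.

```python
import pandas as pd

def first_postcode(address):
    """Return the first 5-digit field starting with '7', or None if there is none."""
    if not isinstance(address, str):
        return None  # covers None/NaN entries
    for part in address.split(","):
        part = part.strip()
        if len(part) == 5 and part.isdigit() and part.startswith("7"):
            return part
    return None

# Hypothetical sample data, including a row without a postal code and a missing row.
df = pd.DataFrame({
    "address": [
        "75,Avenida Corregidora,Centro,Querétaro,76000,México",
        "Sin código,México",
        None,
    ]
})

df["postcode"] = df["address"].apply(first_postcode)
print(df["postcode"])
```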

The address data is stored as objects rather than strings, which is why the regular expression was not working:

MedicalCenters.dtypes
Venue               object
Venue Latitude     float64
Venue Longitude    float64
Venue Category      object
Health System       object
geom                object
Address             object
Postcode            object
dtype: object
time: 6.41 ms

After converting the objects to strings:

MedicalCenters['Address'] = MedicalCenters['Address'].astype('str')

Thanks to glam, I was then able to apply the modified regular expression:

# get zipcode from full address
import re 
MedicalCenters['Postcode'] = MedicalCenters['Address'].str.extract(r"\b(\d{5})\b")
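To see the conversion step in isolation, here is a small sketch. `FakeLocation` is a hypothetical stand-in for whatever non-string object the Address column holds; it is not the real class, and it only assumes that the object's `str()` form contains the full address text.

```python
import pandas as pd

class FakeLocation:
    """Hypothetical stand-in for a non-string address object."""
    def __init__(self, address):
        self.address = address

    def __str__(self):
        return self.address

# A column of objects, not strings.
addresses = pd.Series(
    [FakeLocation("75,Avenida Corregidora,Querétaro,76000,México")],
    dtype=object,
)

# astype(str) calls str() on each element, so the regex sees plain text.
as_text = addresses.astype(str)
postcode = as_text.str.extract(r"\b(\d{5})\b", expand=False)

print(postcode.iloc[0])
```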
