Official pandas documentation: https://pandas.pydata.org/pandas-docs/stable/?v=20190307135750
pandas is built on top of NumPy and is designed for working with text and tabular data. It has two main data structures: Series, which behaves like a one-dimensional NumPy array, and DataFrame, which is a two-dimensional tabular structure.
pandas is the core module for data analysis in Python. Its functionality falls into five main areas:
- File I/O: reading and writing SQL databases, HTML, JSON, pickle, CSV (txt, Excel), SAS, Stata, HDF, and more.
- Single-table operations: create/read/update/delete, slicing, higher-order functions, group-by aggregation, and conversion to and from dict and list.
- Joining and merging multiple tables.
- Basic plotting.
- Basic statistical analysis.
Series (familiarize)
import numpy as np
import pandas as pd
arr = np.array([1, 2, 3, 4, np.nan])
print(arr)
[ 1.  2.  3.  4. nan]
s = pd.Series(arr)
print(s)
0 1.0
1 2.0
2 3.0
3 4.0
4 NaN
dtype: float64
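A Series also accepts custom labels in place of the default integer index, which enables label-based access; a minimal sketch (the subject labels below are made up for illustration):

```python
import pandas as pd

# Custom labels replace the default 0..n-1 index (labels are hypothetical)
s2 = pd.Series([90, 85, 78], index=['math', 'english', 'physics'])
print(s2['math'])   # label-based access -> 90
print(s2.mean())
```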
import random
random.randint(1,10)
1
import numpy as np
np.random.randn(6,4)
array([[-0.42660201, 2.61346133, 0.01214827, -1.43370137],
[-0.28285711, 0.14871693, 0.22235496, -2.63142648],
[ 0.78324411, -0.72633723, -0.23258796, 0.03855565],
[-0.30033472, -1.19873979, -1.72660722, 0.75214317],
[ 1.48194193, 0.11089792, 0.8845003 , -1.26433672],
[ 1.29958399, -1.75092753, 0.06823543, -0.64219199]])
DataFrame (master)
dates = pd.date_range('20190101', periods=6)
print(dates)
DatetimeIndex(['2019-01-01', '2019-01-02', '2019-01-03', '2019-01-04',
'2019-01-05', '2019-01-06'],
dtype='datetime64[ns]', freq='D')
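As a side note, date_range can also space the stamps with its freq argument; a weekly sketch (not from the original text):

```python
import pandas as pd

# freq='W' produces week-anchored stamps (Sundays by default)
idx = pd.date_range('2019-01-01', periods=3, freq='W')
print(idx)
```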
np.random.seed(1)
arr = 10*np.random.randn(6, 4)
print(arr)
[[ 16.24345364 -6.11756414 -5.28171752 -10.72968622]
[ 8.65407629 -23.01538697 17.44811764 -7.61206901]
[ 3.19039096 -2.49370375 14.62107937 -20.60140709]
[ -3.22417204 -3.84054355 11.33769442 -10.99891267]
[ -1.72428208 -8.77858418 0.42213747 5.82815214]
[-11.00619177 11.4472371 9.01590721 5.02494339]]
df = pd.DataFrame(arr, index=dates, columns=['c1', 'c2', 'c3', 'c4'])
df
|  | c1 | c2 | c3 | c4 |
| --- | --- | --- | --- | --- |
| 2019-01-01 | 16.243454 | -6.117564 | -5.281718 | -10.729686 |
| 2019-01-02 | 8.654076 | -23.015387 | 17.448118 | -7.612069 |
| 2019-01-03 | 3.190391 | -2.493704 | 14.621079 | -20.601407 |
| 2019-01-04 | -3.224172 | -3.840544 | 11.337694 | -10.998913 |
| 2019-01-05 | -1.724282 | -8.778584 | 0.422137 | 5.828152 |
| 2019-01-06 | -11.006192 | 11.447237 | 9.015907 | 5.024943 |
# Build a DataFrame from a dict
df2 = pd.DataFrame({'a': 1, 'b': [2, 3], 'c': np.arange(2), 'd': 'hello'})
df2
|  | a | b | c | d |
| --- | --- | --- | --- | --- |
| 0 | 1 | 2 | 0 | hello |
| 1 | 1 | 3 | 1 | hello |
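Worth noting in the dict constructor above: scalar values such as 'a' and 'd' are broadcast to every row, while list-like values must all share one length. A minimal check:

```python
import numpy as np
import pandas as pd

# Scalars broadcast; the two list-like values fix the row count at 2
df2 = pd.DataFrame({'a': 1, 'b': [2, 3], 'c': np.arange(2), 'd': 'hello'})
print(df2.shape)            # (2, 4)
print(df2['a'].tolist())    # [1, 1]
```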
DataFrame attributes (master)
| Attribute/method | Description |
| --- | --- |
| dtypes | data type of each column |
| index | the row labels (the index) |
| columns | the column labels |
| values | the data inside the frame, i.e. without the header or index |
| describe | per-column extremes, mean, and quantiles; numeric columns only |
| transpose | transpose the frame (also available as `T`) |
| sort_index | sort by row or column labels |
| sort_values | sort by data values |
# Inspect each column's data type
print(df2.dtypes)
a int64
b int64
c int64
d object
dtype: object
df
|  | c1 | c2 | c3 | c4 |
| --- | --- | --- | --- | --- |
| 2019-01-01 | 16.243454 | -6.117564 | -5.281718 | -10.729686 |
| 2019-01-02 | 8.654076 | -23.015387 | 17.448118 | -7.612069 |
| 2019-01-03 | 3.190391 | -2.493704 | 14.621079 | -20.601407 |
| 2019-01-04 | -3.224172 | -3.840544 | 11.337694 | -10.998913 |
| 2019-01-05 | -1.724282 | -8.778584 | 0.422137 | 5.828152 |
| 2019-01-06 | -11.006192 | 11.447237 | 9.015907 | 5.024943 |
print(df.index)
DatetimeIndex(['2019-01-01', '2019-01-02', '2019-01-03', '2019-01-04',
'2019-01-05', '2019-01-06'],
dtype='datetime64[ns]', freq='D')
print(df.columns)
Index(['c1', 'c2', 'c3', 'c4'], dtype='object')
print(df.values)
[[ 16.24345364 -6.11756414 -5.28171752 -10.72968622]
[ 8.65407629 -23.01538697 17.44811764 -7.61206901]
[ 3.19039096 -2.49370375 14.62107937 -20.60140709]
[ -3.22417204 -3.84054355 11.33769442 -10.99891267]
[ -1.72428208 -8.77858418 0.42213747 5.82815214]
[-11.00619177 11.4472371 9.01590721 5.02494339]]
df.describe()
|  | c1 | c2 | c3 | c4 |
| --- | --- | --- | --- | --- |
| count | 6.000000 | 6.000000 | 6.000000 | 6.000000 |
| mean | 2.022213 | -5.466424 | 7.927203 | -6.514830 |
| std | 9.580084 | 11.107772 | 8.707171 | 10.227641 |
| min | -11.006192 | -23.015387 | -5.281718 | -20.601407 |
| 25% | -2.849200 | -8.113329 | 2.570580 | -10.931606 |
| 50% | 0.733054 | -4.979054 | 10.176801 | -9.170878 |
| 75% | 7.288155 | -2.830414 | 13.800233 | 1.865690 |
| max | 16.243454 | 11.447237 | 17.448118 | 5.828152 |
df.T
|  | 2019-01-01 00:00:00 | 2019-01-02 00:00:00 | 2019-01-03 00:00:00 | 2019-01-04 00:00:00 | 2019-01-05 00:00:00 | 2019-01-06 00:00:00 |
| --- | --- | --- | --- | --- | --- | --- |
| c1 | 16.243454 | 8.654076 | 3.190391 | -3.224172 | -1.724282 | -11.006192 |
| c2 | -6.117564 | -23.015387 | -2.493704 | -3.840544 | -8.778584 | 11.447237 |
| c3 | -5.281718 | 17.448118 | 14.621079 | 11.337694 | 0.422137 | 9.015907 |
| c4 | -10.729686 | -7.612069 | -20.601407 | -10.998913 | 5.828152 | 5.024943 |
# Sort by row labels (ascending by default; pass ascending=False to reverse)
df.sort_index(axis=0)

|  | c1 | c2 | c3 | c4 |
| --- | --- | --- | --- | --- |
| 2019-01-01 | 16.243454 | -6.117564 | -5.281718 | -10.729686 |
| 2019-01-02 | 8.654076 | -23.015387 | 17.448118 | -7.612069 |
| 2019-01-03 | 3.190391 | -2.493704 | 14.621079 | -20.601407 |
| 2019-01-04 | -3.224172 | -3.840544 | 11.337694 | -10.998913 |
| 2019-01-05 | -1.724282 | -8.778584 | 0.422137 | 5.828152 |
| 2019-01-06 | -11.006192 | 11.447237 | 9.015907 | 5.024943 |
# Sort by column labels (ascending by default)
df2.sort_index(axis=1)

|  | a | b | c | d |
| --- | --- | --- | --- | --- |
| 0 | 1 | 2 | 0 | hello |
| 1 | 1 | 3 | 1 | hello |
# Sort by the values in column a (ascending by default)
df2.sort_values(by='a')

|  | a | b | c | d |
| --- | --- | --- | --- | --- |
| 0 | 1 | 2 | 0 | hello |
| 1 | 1 | 3 | 1 | hello |
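To actually sort in descending order, pass ascending=False; a small sketch with made-up data:

```python
import pandas as pd

frame = pd.DataFrame({'a': [3, 1, 2], 'b': [9, 8, 7]})
out = frame.sort_values(by='a', ascending=False)  # descending by column a
print(out['a'].tolist())  # [3, 2, 1]
```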
Selecting values from a DataFrame (master)
df
|  | c1 | c2 | c3 | c4 |
| --- | --- | --- | --- | --- |
| 2019-01-01 | 16.243454 | -6.117564 | -5.281718 | -10.729686 |
| 2019-01-02 | 8.654076 | -23.015387 | 17.448118 | -7.612069 |
| 2019-01-03 | 3.190391 | -2.493704 | 14.621079 | -20.601407 |
| 2019-01-04 | -3.224172 | -3.840544 | 11.337694 | -10.998913 |
| 2019-01-05 | -1.724282 | -8.778584 | 0.422137 | 5.828152 |
| 2019-01-06 | -11.006192 | 11.447237 | 9.015907 | 5.024943 |
df['c2']
2019-01-01 -6.117564
2019-01-02 -23.015387
2019-01-03 -2.493704
2019-01-04 -3.840544
2019-01-05 -8.778584
2019-01-06 11.447237
Freq: D, Name: c2, dtype: float64
df[0:3]
|  | c1 | c2 | c3 | c4 |
| --- | --- | --- | --- | --- |
| 2019-01-01 | 16.243454 | -6.117564 | -5.281718 | -10.729686 |
| 2019-01-02 | 8.654076 | -23.015387 | 17.448118 | -7.612069 |
| 2019-01-03 | 3.190391 | -2.493704 | 14.621079 | -20.601407 |
loc/iloc
# Select rows by their custom labels; a loc slice includes both endpoints
df.loc['2019-01-01':'2019-01-05']

|  | c1 | c2 | c3 | c4 |
| --- | --- | --- | --- | --- |
| 2019-01-01 | 16.243454 | -6.117564 | -5.281718 | -10.729686 |
| 2019-01-02 | 8.654076 | -23.015387 | 17.448118 | -7.612069 |
| 2019-01-03 | 3.190391 | -2.493704 | 14.621079 | -20.601407 |
| 2019-01-04 | -3.224172 | -3.840544 | 11.337694 | -10.998913 |
| 2019-01-05 | -1.724282 | -8.778584 | 0.422137 | 5.828152 |
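loc can also take a column selection in the same call; rebuilding the same frame from earlier in the text for a self-contained sketch:

```python
import numpy as np
import pandas as pd

# Same construction as used earlier in this tutorial
np.random.seed(1)
dates = pd.date_range('20190101', periods=6)
df = pd.DataFrame(10 * np.random.randn(6, 4), index=dates,
                  columns=['c1', 'c2', 'c3', 'c4'])

# One row label plus a list of column labels
sub = df.loc['2019-01-02', ['c1', 'c2']]
print(sub)
```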
df
|  | c1 | c2 | c3 | c4 |
| --- | --- | --- | --- | --- |
| 2019-01-01 | 16.243454 | -6.117564 | -5.281718 | -10.729686 |
| 2019-01-02 | 8.654076 | -23.015387 | 17.448118 | -7.612069 |
| 2019-01-03 | 3.190391 | -2.493704 | 14.621079 | -20.601407 |
| 2019-01-04 | -3.224172 | -3.840544 | 11.337694 | -10.998913 |
| 2019-01-05 | -1.724282 | -8.778584 | 0.422137 | 5.828152 |
| 2019-01-06 | -11.006192 | 11.447237 | 9.015907 | 5.024943 |
df.values
array([[ 16.24345364, -6.11756414, -5.28171752, -10.72968622],
[ 8.65407629, -23.01538697, 17.44811764, -7.61206901],
[ 3.19039096, -2.49370375, 14.62107937, -20.60140709],
[ -3.22417204, -3.84054355, 11.33769442, -10.99891267],
[ -1.72428208, -8.77858418, 0.42213747, 5.82815214],
[-11.00619177, 11.4472371 , 9.01590721, 5.02494339]])
# Select a single value by integer position (row 2, column 1)
print(df.iloc[2, 1])
-2.49370375477
df.iloc[1:4, 1:4]
|  | c2 | c3 | c4 |
| --- | --- | --- | --- |
| 2019-01-02 | -23.015387 | 17.448118 | -7.612069 |
| 2019-01-03 | -2.493704 | 14.621079 | -20.601407 |
| 2019-01-04 | -3.840544 | 11.337694 | -10.998913 |
df
|  | c1 | c2 | c3 | c4 |
| --- | --- | --- | --- | --- |
| 2019-01-01 | 16.243454 | -6.117564 | -5.281718 | -10.729686 |
| 2019-01-02 | 8.654076 | -23.015387 | 17.448118 | -7.612069 |
| 2019-01-03 | 3.190391 | -2.493704 | 14.621079 | -20.601407 |
| 2019-01-04 | -3.224172 | -3.840544 | 11.337694 | -10.998913 |
| 2019-01-05 | -1.724282 | -8.778584 | 0.422137 | 5.828152 |
| 2019-01-06 | -11.006192 | 11.447237 | 9.015907 | 5.024943 |
Selecting values with a boolean condition
df[df['c1'] > 0]
|  | c1 | c2 | c3 | c4 |
| --- | --- | --- | --- | --- |
| 2019-01-01 | 16.243454 | -6.117564 | -5.281718 | -10.729686 |
| 2019-01-02 | 8.654076 | -23.015387 | 17.448118 | -7.612069 |
| 2019-01-03 | 3.190391 | -2.493704 | 14.621079 | -20.601407 |
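Conditions can be combined with & and |, each side wrapped in parentheses; a sketch with a few made-up values:

```python
import pandas as pd

frame = pd.DataFrame({'c1': [16.2, 8.7, -3.2], 'c3': [-5.3, 17.4, 11.3]})
# Rows where both columns are positive
both = frame[(frame['c1'] > 0) & (frame['c3'] > 0)]
print(len(both))  # 1
```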
Replacing values in a DataFrame (master)
df
|  | c1 | c2 | c3 | c4 |
| --- | --- | --- | --- | --- |
| 2019-01-01 | 16.243454 | -6.117564 | -5.281718 | -10.729686 |
| 2019-01-02 | 8.654076 | -23.015387 | 17.448118 | -7.612069 |
| 2019-01-03 | 3.190391 | -2.493704 | 14.621079 | -20.601407 |
| 2019-01-04 | -3.224172 | -3.840544 | 11.337694 | -10.998913 |
| 2019-01-05 | -1.724282 | -8.778584 | 0.422137 | 5.828152 |
| 2019-01-06 | -11.006192 | 11.447237 | 9.015907 | 5.024943 |
df.iloc[0:3, 0:2] = 0
df
|  | c1 | c2 | c3 | c4 |
| --- | --- | --- | --- | --- |
| 2019-01-01 | 0.000000 | 0.000000 | -5.281718 | -10.729686 |
| 2019-01-02 | 0.000000 | 0.000000 | 17.448118 | -7.612069 |
| 2019-01-03 | 0.000000 | 0.000000 | 14.621079 | -20.601407 |
| 2019-01-04 | -3.224172 | -3.840544 | 11.337694 | -10.998913 |
| 2019-01-05 | -1.724282 | -8.778584 | 0.422137 | 5.828152 |
| 2019-01-06 | -11.006192 | 11.447237 | 9.015907 | 5.024943 |
# Rebuild the original frame (the zeros assigned above would otherwise persist)
df = pd.DataFrame(arr, index=dates, columns=['c1', 'c2', 'c3', 'c4'])
df

|  | c1 | c2 | c3 | c4 |
| --- | --- | --- | --- | --- |
| 2019-01-01 | 16.243454 | -6.117564 | -5.281718 | -10.729686 |
| 2019-01-02 | 8.654076 | -23.015387 | 17.448118 | -7.612069 |
| 2019-01-03 | 3.190391 | -2.493704 | 14.621079 | -20.601407 |
| 2019-01-04 | -3.224172 | -3.840544 | 11.337694 | -10.998913 |
| 2019-01-05 | -1.724282 | -8.778584 | 0.422137 | 5.828152 |
| 2019-01-06 | -11.006192 | 11.447237 | 9.015907 | 5.024943 |
df[df['c1'] > 0] = 100
df
|  | c1 | c2 | c3 | c4 |
| --- | --- | --- | --- | --- |
| 2019-01-01 | 100.000000 | 100.000000 | 100.000000 | 100.000000 |
| 2019-01-02 | 100.000000 | 100.000000 | 100.000000 | 100.000000 |
| 2019-01-03 | 100.000000 | 100.000000 | 100.000000 | 100.000000 |
| 2019-01-04 | -3.224172 | -3.840544 | 11.337694 | -10.998913 |
| 2019-01-05 | -1.724282 | -8.778584 | 0.422137 | 5.828152 |
| 2019-01-06 | -11.006192 | 11.447237 | 9.015907 | 5.024943 |
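For replacements that keep the frame's shape without boolean-index assignment, mask (and its mirror, where) does the same job; a sketch with made-up numbers:

```python
import pandas as pd

frame = pd.DataFrame({'c1': [5.0, -2.0], 'c2': [-1.0, 3.0]})
# mask replaces entries where the condition is True; where keeps them instead
clipped = frame.mask(frame < 0, 0)
print(clipped['c2'].tolist())  # [0.0, 3.0]
```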
Reading CSV files (master)
from io import StringIO
test_data = '''
5.1,,1.4,0.2
4.9,3.0,1.4,0.2
4.7,3.2,,0.2
7.0,3.2,4.7,1.4
6.4,3.2,4.5,1.5
6.9,3.1,4.9,
,,,
'''
# To read from a file on disk instead: df = pd.read_csv('C:/Users/test_data.csv')
test_data = StringIO(test_data)
df = pd.read_csv(test_data)
# Note: with the default header=0, the first data line ('5.1,,1.4,0.2') is
# consumed as the header, which is why the frame below has only 6 rows
df.columns = ['c1', 'c2', 'c3', 'c4']
df
|  | c1 | c2 | c3 | c4 |
| --- | --- | --- | --- | --- |
| 0 | 4.9 | 3.0 | 1.4 | 0.2 |
| 1 | 4.7 | 3.2 | NaN | 0.2 |
| 2 | 7.0 | 3.2 | 4.7 | 1.4 |
| 3 | 6.4 | 3.2 | 4.5 | 1.5 |
| 4 | 6.9 | 3.1 | 4.9 | NaN |
| 5 | NaN | NaN | NaN | NaN |
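If the first data line should not be consumed as a header, pass header=None and supply names up front; a sketch on a two-line sample:

```python
from io import StringIO

import pandas as pd

raw = StringIO("5.1,,1.4,0.2\n4.9,3.0,1.4,0.2\n")
# header=None keeps the first line as data; names labels the columns directly
frame = pd.read_csv(raw, header=None, names=['c1', 'c2', 'c3', 'c4'])
print(frame.shape)  # (2, 4)
```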
Handling missing data (master)
df.isnull()
|  | c1 | c2 | c3 | c4 |
| --- | --- | --- | --- | --- |
| 0 | False | False | False | False |
| 1 | False | False | True | False |
| 2 | False | False | False | False |
| 3 | False | False | False | False |
| 4 | False | False | False | True |
| 5 | True | True | True | True |
# Chaining sum() after isnull() counts the missing values in each column
print(df.isnull().sum())
c1 1
c2 1
c3 2
c4 2
dtype: int64
df
|  | c1 | c2 | c3 | c4 |
| --- | --- | --- | --- | --- |
| 0 | 4.9 | 3.0 | 1.4 | 0.2 |
| 1 | 4.7 | 3.2 | NaN | 0.2 |
| 2 | 7.0 | 3.2 | 4.7 | 1.4 |
| 3 | 6.4 | 3.2 | 4.5 | 1.5 |
| 4 | 6.9 | 3.1 | 4.9 | NaN |
| 5 | NaN | NaN | NaN | NaN |
# axis=0 drops rows that contain any NaN value
df.dropna(axis=0)

|  | c1 | c2 | c3 | c4 |
| --- | --- | --- | --- | --- |
| 0 | 4.9 | 3.0 | 1.4 | 0.2 |
| 2 | 7.0 | 3.2 | 4.7 | 1.4 |
| 3 | 6.4 | 3.2 | 4.5 | 1.5 |
# axis=1 drops columns that contain any NaN value
# (here every column contains a NaN, so the result is an empty frame)
df.dropna(axis=1)
# how='all' drops only rows that consist entirely of NaN values
df.dropna(how='all')
|  | c1 | c2 | c3 | c4 |
| --- | --- | --- | --- | --- |
| 0 | 4.9 | 3.0 | 1.4 | 0.2 |
| 1 | 4.7 | 3.2 | NaN | 0.2 |
| 2 | 7.0 | 3.2 | 4.7 | 1.4 |
| 3 | 6.4 | 3.2 | 4.5 | 1.5 |
| 4 | 6.9 | 3.1 | 4.9 | NaN |
# thresh=4 keeps only rows that have at least 4 non-NaN values
df.dropna(thresh=4)

|  | c1 | c2 | c3 | c4 |
| --- | --- | --- | --- | --- |
| 0 | 4.9 | 3.0 | 1.4 | 0.2 |
| 2 | 7.0 | 3.2 | 4.7 | 1.4 |
| 3 | 6.4 | 3.2 | 4.5 | 1.5 |
# Drop rows where column c2 contains NaN
df.dropna(subset=['c2'])
|  | c1 | c2 | c3 | c4 |
| --- | --- | --- | --- | --- |
| 0 | 4.9 | 3.0 | 1.4 | 0.2 |
| 1 | 4.7 | 3.2 | NaN | 0.2 |
| 2 | 7.0 | 3.2 | 4.7 | 1.4 |
| 3 | 6.4 | 3.2 | 4.5 | 1.5 |
| 4 | 6.9 | 3.1 | 4.9 | NaN |
df
|  | c1 | c2 | c3 | c4 |
| --- | --- | --- | --- | --- |
| 0 | 4.9 | 3.0 | 1.4 | 0.2 |
| 1 | 4.7 | 3.2 | NaN | 0.2 |
| 2 | 7.0 | 3.2 | 4.7 | 1.4 |
| 3 | 6.4 | 3.2 | 4.5 | 1.5 |
| 4 | 6.9 | 3.1 | 4.9 | NaN |
| 5 | NaN | NaN | NaN | NaN |
# Fill NaN values with a constant
df.fillna(value=10)
|  | c1 | c2 | c3 | c4 |
| --- | --- | --- | --- | --- |
| 0 | 4.9 | 3.0 | 1.4 | 0.2 |
| 1 | 4.7 | 3.2 | 10.0 | 0.2 |
| 2 | 7.0 | 3.2 | 4.7 | 1.4 |
| 3 | 6.4 | 3.2 | 4.5 | 1.5 |
| 4 | 6.9 | 3.1 | 4.9 | 10.0 |
| 5 | 10.0 | 10.0 | 10.0 | 10.0 |
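fillna also accepts a Series, which fills each column with its own value; filling with per-column means is a common variant (a sketch on made-up data):

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'c1': [4.9, np.nan, 7.0], 'c2': [3.0, 3.2, np.nan]})
# frame.mean() is a Series indexed by column name, so each column
# is filled with its own mean
filled = frame.fillna(frame.mean())
print(filled['c1'].iloc[1])  # (4.9 + 7.0) / 2 = 5.95
```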
Importing and exporting data (master)
Use df = pd.read_csv(filename) to read a file and df.to_csv(filename) to save one.
# df = pd.read_csv("filename")
# ... process the data ...
# df.to_csv("filename", header=True, index=False)
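The round trip can be demonstrated without touching the filesystem: with no path argument, to_csv returns the CSV text, which read_csv can consume again via StringIO (a sketch):

```python
from io import StringIO

import pandas as pd

original = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
csv_text = original.to_csv(index=False)      # no path -> CSV returned as a string
restored = pd.read_csv(StringIO(csv_text))   # read it straight back
print(restored.equals(original))  # True
```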
Merging data (master)
df1 = pd.DataFrame(np.zeros((3, 4)))
df1
|  | 0 | 1 | 2 | 3 |
| --- | --- | --- | --- | --- |
| 0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 |
df2 = pd.DataFrame(np.ones((3, 4)))
df2
|  | 0 | 1 | 2 | 3 |
| --- | --- | --- | --- | --- |
| 0 | 1.0 | 1.0 | 1.0 | 1.0 |
| 1 | 1.0 | 1.0 | 1.0 | 1.0 |
| 2 | 1.0 | 1.0 | 1.0 | 1.0 |
# axis=0 stacks the frames vertically, i.e. concatenates along rows
pd.concat((df1, df2), axis=0)

|  | 0 | 1 | 2 | 3 |
| --- | --- | --- | --- | --- |
| 0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 |
| 0 | 1.0 | 1.0 | 1.0 | 1.0 |
| 1 | 1.0 | 1.0 | 1.0 | 1.0 |
| 2 | 1.0 | 1.0 | 1.0 | 1.0 |
# axis=1 places the frames side by side, i.e. concatenates along columns
pd.concat((df1, df2), axis=1)

|  | 0 | 1 | 2 | 3 | 0 | 1 | 2 | 3 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 |
# append stacks rows, like concat with axis=0
# (deprecated in newer pandas versions; prefer pd.concat)
df1.append(df2)

|  | 0 | 1 | 2 | 3 |
| --- | --- | --- | --- | --- |
| 0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 |
| 0 | 1.0 | 1.0 | 1.0 | 1.0 |
| 1 | 1.0 | 1.0 | 1.0 | 1.0 |
| 2 | 1.0 | 1.0 | 1.0 | 1.0 |
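concat and append stack frames positionally; joining on a shared key column is the job of merge (SQL-style joins). A sketch with made-up keys:

```python
import pandas as pd

left = pd.DataFrame({'key': ['a', 'b', 'c'], 'x': [1, 2, 3]})
right = pd.DataFrame({'key': ['b', 'c', 'd'], 'y': [4, 5, 6]})
# The default inner join keeps only keys present in both frames
inner = pd.merge(left, right, on='key')
print(inner['key'].tolist())  # ['b', 'c']
```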
Reading from SQL (familiarize)
import numpy as np
import pandas as pd
import pymysql

def query(sql):
    # Connect to the MySQL database
    conn = pymysql.connect(
        host="localhost",
        port=3306,
        user="root",
        passwd="123",
        db="db1",
    )
    try:
        data = pd.read_sql(sql, con=conn)
        return data
    except Exception as e:
        print("SQL is not correct!", e)
    finally:
        conn.close()

sql = "select * from test1 limit 0, 10"  # the SQL statement
data = query(sql)
print(data.columns.tolist())  # inspect the column names
print(data)                   # inspect the data
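The snippet above needs a running MySQL server; the same read_sql pattern works against SQLite's in-memory database for a self-contained check (the table and columns below are made up):

```python
import sqlite3

import pandas as pd

# In-memory stand-in for the MySQL connection above
con = sqlite3.connect(':memory:')
con.execute("CREATE TABLE test1 (id INTEGER, name TEXT)")
con.executemany("INSERT INTO test1 VALUES (?, ?)", [(1, 'a'), (2, 'b')])

data = pd.read_sql("select * from test1 limit 10", con)
con.close()
print(data.columns.tolist())  # ['id', 'name']
```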