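How to generate more rows of data from a sample and an input row count
I have a sample dataset and need to generate more rows from it (the number of rows will be given as input). The new rows should be random combinations of the column values (including NULL) and should share almost the same distribution as my sample.
Sample data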
gender  marital status  occupation  ethnic background
Male    Single          Doctor      Caucasian
Male    Divorced        NA          African American
NA      Widow           Teacher     NA
Female  Married         Doctor      Caucasian
Male    Divorced        Engineer    African American
NA      Widow           Teacher     NA
Desired data
gender  marital status  occupation  ethnic background
Male    Divorced        NA          African American
Male    Single          Doctor      Caucasian
Male    Divorced        NA          African American
NA      Widow           Teacher     NA
NA      Widow           Teacher     NA
Female  Married         Doctor      Caucasian
Female  Married         Doctor      Caucasian
Male    Divorced        Engineer    African American
NA      Widow           Teacher     NA
Male    Single          Doctor      Caucasian
NA      Widow           Teacher     NA
Female  Married         Doctor      Caucasian
Male    Divorced        NA          African American
NA      Widow           Teacher     NA
Male    Divorced        Engineer    African American
NA      Widow           Teacher     NA
Male    Single          Doctor      Caucasian
Male    Divorced        Engineer    African American
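Solution
The idea comes from this solution: you only need to replace the missing values first, so that groupby does not drop them (as it does in older pandas versions). Then apply the sampling code to each column, collect the resulting Series in a list, and finally join them together.
Note: how closely the distribution matches depends on the number of rows, so if possible use a multiple of the original length. Here the original length is 6 and the new length is 6*4=24.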
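The snippets in this answer assume that a DataFrame named df already holds the sample. A minimal sketch of building it, assuming that NA in the question stands for a missing value (stored here as np.nan):

```python
import numpy as np
import pandas as pd

# The six sample rows from the question; NA is stored as np.nan (an assumption).
df = pd.DataFrame({
    "gender": ["Male", "Male", np.nan, "Female", "Male", np.nan],
    "marital status": ["Single", "Divorced", "Widow", "Married", "Divorced", "Widow"],
    "occupation": ["Doctor", np.nan, "Teacher", "Doctor", "Engineer", "Teacher"],
    "ethnic background": ["Caucasian", "African American", np.nan,
                          "Caucasian", "African American", np.nan],
})
print(df)
```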
import pandas as pd
# test distribution of original
print(df.fillna('missing').apply(lambda x: x.value_counts(normalize=True)))
                    gender  marital status  occupation  ethnic background
African American       NaN             NaN         NaN           0.333333
Caucasian              NaN             NaN         NaN           0.333333
Divorced               NaN        0.333333         NaN                NaN
Doctor                 NaN             NaN    0.333333                NaN
Engineer               NaN             NaN    0.166667                NaN
Female            0.166667             NaN         NaN                NaN
Male              0.500000             NaN         NaN                NaN
Married                NaN        0.166667         NaN                NaN
Single                 NaN        0.166667         NaN                NaN
Teacher                NaN             NaN    0.333333                NaN
Widow                  NaN        0.333333         NaN                NaN
missing           0.333333             NaN    0.166667           0.333333
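import numpy as np

df = df.fillna('missing')   # keep missing values as their own group in groupby
nrows = len(df)
total_sample_size = 24      # a multiple of nrows keeps the proportions exact

out = []
for c in df.columns:
    # draw each value in proportion to its share of the original column
    f = lambda x: x.sample(int((x.count() / nrows) * total_sample_size), replace=True)
    # shuffle the column on its own and drop the group index
    out.append(df.groupby(c)[c].apply(f).sample(frac=1).reset_index(drop=True))

# join the columns side by side and restore the missing values
df1 = pd.concat(out, axis=1).replace('missing', np.nan)
print(df1)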
    gender  marital status  occupation  ethnic background
0      NaN          Single     Teacher   African American
1     Male        Divorced     Teacher   African American
2     Male           Widow         NaN                NaN
3     Male         Married    Engineer                NaN
4      NaN        Divorced     Teacher   African American
5      NaN        Divorced      Doctor                NaN
6      NaN        Divorced     Teacher          Caucasian
7     Male           Widow     Teacher          Caucasian
8     Male        Divorced      Doctor          Caucasian
9   Female           Widow     Teacher                NaN
10     NaN           Widow    Engineer          Caucasian
11  Female          Single     Teacher          Caucasian
12  Female           Widow    Engineer   African American
13    Male         Married      Doctor   African American
14     NaN          Single      Doctor   African American
15  Female         Married    Engineer          Caucasian
16    Male        Divorced         NaN          Caucasian
17    Male           Widow         NaN   African American
18    Male          Single      Doctor                NaN
19    Male           Widow      Doctor                NaN
20     NaN           Widow     Teacher                NaN
21    Male        Divorced         NaN   African American
22     NaN         Married      Doctor                NaN
23    Male        Divorced      Doctor          Caucasian
# test distribution of new
print(df1.fillna('missing').apply(lambda x: x.value_counts(normalize=True)))
                    gender  marital status  occupation  ethnic background
African American       NaN             NaN         NaN           0.333333
Caucasian              NaN             NaN         NaN           0.333333
Divorced               NaN        0.333333         NaN                NaN
Doctor                 NaN             NaN    0.333333                NaN
Engineer               NaN             NaN    0.166667                NaN
Female            0.166667             NaN         NaN                NaN
Male              0.500000             NaN         NaN                NaN
Married                NaN        0.166667         NaN                NaN
Single                 NaN        0.166667         NaN                NaN
Teacher                NaN             NaN    0.333333                NaN
Widow                  NaN        0.333333         NaN                NaN
missing           0.333333             NaN    0.166667           0.333333
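The per-column counts above are truncated to integers, so when the requested size is not a multiple of the original length the columns can come out shorter than requested and with unequal lengths. One possible workaround, sketched here and not part of the original answer, is largest-remainder rounding: round every quota down, then hand the leftover rows to the values with the largest fractional parts. The helper names column_quotas and resample_columns and the seed parameter are hypothetical.

```python
import numpy as np
import pandas as pd

def column_quotas(col, total):
    """Split `total` rows among the values of `col` in proportion to their counts."""
    counts = col.value_counts()
    exact = counts * total / len(col)        # ideal quotas, possibly fractional
    quotas = np.floor(exact).astype(int)     # round every quota down first
    leftover = total - int(quotas.sum())     # rows that are still unassigned
    if leftover > 0:
        # hand the leftover rows to the values with the largest fractional parts
        order = (exact - quotas).sort_values(ascending=False).index
        quotas.loc[order[:leftover]] += 1
    return quotas

def resample_columns(df, total, seed=None):
    """Hypothetical helper: build `total` rows, shuffling each column independently."""
    rng = np.random.default_rng(seed)
    filled = df.fillna("missing")            # treat missing values as a category
    out = {}
    for c in filled.columns:
        quotas = column_quotas(filled[c], total)
        values = np.repeat(quotas.index.to_numpy(), quotas.to_numpy())
        out[c] = rng.permutation(values)     # shuffle this column on its own
    result = pd.DataFrame(out)
    return result.mask(result == "missing")  # turn the placeholder back into NaN

# Two columns of the sample, resampled to a size that is not a multiple of 6.
df = pd.DataFrame({
    "gender": ["Male", "Male", np.nan, "Female", "Male", np.nan],
    "occupation": ["Doctor", np.nan, "Teacher", "Doctor", "Engineer", "Teacher"],
})
df1 = resample_columns(df, total=25, seed=0)
print(df1["gender"].value_counts(dropna=False))
```

As in the answer's loop, each column is shuffled independently, so the value combinations found in the original rows are not preserved.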
Edit:
If the solution can be simplified to repeating the original data N times and shuffling the rows:
N = 4
# repeat the original rows N times, then shuffle them
df = pd.concat([df] * N, ignore_index=True).sample(frac=1)
print(df)
    gender  marital status  occupation  ethnic background
12    Male          Single      Doctor          Caucasian
14     NaN           Widow     Teacher                NaN
4     Male        Divorced    Engineer   African American
8      NaN           Widow     Teacher                NaN
16    Male        Divorced    Engineer   African American
1     Male        Divorced         NaN   African American
7     Male        Divorced         NaN   African American
5      NaN           Widow     Teacher                NaN
15  Female         Married      Doctor          Caucasian
23     NaN           Widow     Teacher                NaN
22    Male        Divorced    Engineer   African American
17     NaN           Widow     Teacher                NaN
18    Male          Single      Doctor          Caucasian
0     Male          Single      Doctor          Caucasian
9   Female         Married      Doctor          Caucasian
19    Male        Divorced         NaN   African American
21  Female         Married      Doctor          Caucasian
20     NaN           Widow     Teacher                NaN
10    Male        Divorced    Engineer   African American
3   Female         Married      Doctor          Caucasian
11     NaN           Widow     Teacher                NaN
13    Male        Divorced         NaN   African American
6     Male          Single      Doctor          Caucasian
2      NaN           Widow     Teacher                NaN
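Repeating the data N times only yields row counts that are multiples of the original length. If another size is needed, one option (a sketch, not from the original answer) is to draw whole rows with replacement using DataFrame.sample. Every generated row is then a copy of an original row, but the distribution only matches approximately, especially for small sizes. The name n_rows and the fixed random_state are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# The six sample rows from the question; NA is stored as np.nan (an assumption).
df = pd.DataFrame({
    "gender": ["Male", "Male", np.nan, "Female", "Male", np.nan],
    "marital status": ["Single", "Divorced", "Widow", "Married", "Divorced", "Widow"],
    "occupation": ["Doctor", np.nan, "Teacher", "Doctor", "Engineer", "Teacher"],
    "ethnic background": ["Caucasian", "African American", np.nan,
                          "Caucasian", "African American", np.nan],
})

n_rows = 20  # hypothetical target size, not a multiple of len(df)

# Each draw picks one original row at random, so value combinations are kept.
sampled = df.sample(n=n_rows, replace=True, random_state=0).reset_index(drop=True)
print(sampled)
```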