如何解决解释密度标度geom_density
我对如何计算geom_density的值有疑问: 这是重现我的结果的代码:
a <- structure(c(1.94603297055443,NA,0.543010708224088,-0.694950111479158,1.12072575741894,0.955664314791837,-1.76784948855531,1.20325647873249,-1.43772660330111,1.03819503610539,0.0478263803427896,1.78097152792733,-0.28229650491141,-1.27266516067401,0.873133593478287,-1.52025732461466,0.625541429537638,-2.51062598037725,0.130357101656339,0.295418544283439,-0.612419390165609,-1.02507299673336,-0.364827226224959,-2.01544165249595,0.460479986910538,1.94603297055443,-0.777480832792708,-0.860011554106258,0.212887822969889,0.790602872164737,-0.447357947538509,1.36831792135959,-0.19976578359786,-0.942542275419808,-1.19013443936046,1.86350224924088,-0.529888668852059,-1.85038020986886,-0.0347043409707605,-1.35519588198756,-0.11723506228431,0.708072150851187,0.377949265596989,-1.93291093118241,1.53337936398669,-2.4280952590637,-1.60278804592821,1.45084864267314,-1.68531876724176,1.69844080661379,-2.0979723738095,-2.2630338164366,-2.34556453775015,-1.10760371804691,1.28578720004604,1.61591008530023,NA),.Dim = c(614L,1L
))
a= data.frame(a=a)
ggplot(a,aes(x=a)) +
geom_density()
但是,当我计算比例时,我的值要小得多
table(a$a) %>% data.frame() %>% mutate(freq= Freq/459)%>% arrange(desc(freq))
那到底是什么绘制成密度?
解决方法
这只是注释中提到的内容,以使其在视觉上稍微显示出来。
就像直方图具有bin宽度一样,KDE也具有带宽。对于默认的高斯核,这是法线密度函数的标准偏差参数。
library(ggplot2)
# Pick some data
data <- faithful$eruptions
bandwidth <- 0.5
# Get data range
range <- range(data)
# Setup a sequence of x-coordinates
x_coord <- seq(range[1] - bandwidth,range[2] + bandwidth,length.out = 512)
现在,我们可以为每个数据点计算正常密度函数,其中mu
是数据点值,标准偏差是带宽。我在下面所做的只是为每个数据点计算这些钟形曲线。
# Calculate individual densities
indi_dens <- vapply(data,function(mu) {
dnorm(x_coord,mu,sd = bandwidth)
},x_coord)
# Plot individual densities
indi_data <- reshape2::melt(indi_dens)
indi_data$Var1 <- x_coord[indi_data$Var1]
ggplot(indi_data) +
geom_line(aes(Var1,value,group = Var2))
下一步是简单地汇总每个x坐标的所有密度。我们将其除以观察次数,以使所得KDE的面积等于1。
我将以蓝色显示我们计算出的密度,并以红色显示相同数据的geom_density()
。
# Sum density for every x-coordinate
summed_dens <- rowSums(indi_dens)
# Make it integrate to 1 by dividing by the number of observations
summed_dens <- summed_dens / length(data)
# Prepare plotting data
plotdata <- data.frame(
x = x_coord,y = summed_dens
)
ggplot(plotdata) +
geom_line(aes(x,y),colour = "blue",size = 2) +
# Overlay density calculated by geom_density() for the same data
geom_density(data = faithful,aes(x = eruptions),bw = bandwidth,color = "red",linetype = 2,size = 2)
您可以看到它们非常接近。为了进一步详细说明,由于KDE积分为1,这意味着曲线下的总面积必须为1,所以对于较大范围的数据,密度将较低,而对于较小范围的数据,密度将较高。这不仅适用于KDE,而且几乎适用于任何密度。例如,将sd = 1
和sd = 5
与正态分布的正态密度曲线进行比较,后者比前者的范围更大。
ggplot() +
geom_function(fun = dnorm,args = list(sd = 1),colour = "blue") +
geom_function(fun = dnorm,args = list(sd = 5),colour = 'red') +
scale_x_continuous(limits = c(-10,10)) +
scale_y_continuous(name = "density")
版权声明:本文内容由互联网用户自发贡献,该文观点与技术仅代表作者本人。本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如发现本站有涉嫌侵权/违法违规的内容, 请发送邮件至 dio@foxmail.com 举报,一经查实,本站将立刻删除。