gprof vs cachegrind profiles

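While trying to optimize some code, I am puzzled by the differences between the profiles produced by kcachegrind and gprof. Specifically, if I use gprof (compiling with the -pg switch, etc.), I get this: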

Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls  ms/call  ms/call  name    
 89.62      3.71     3.71   204626     0.02     0.02  objR<true>::R_impl(std::vector<coords_t,std::allocator<coords_t> > const&,std::vector<unsigned long,std::allocator<unsigned long> > const&) const
  5.56      3.94     0.23 18018180     0.00     0.00  W2(coords_t const&,coords_t const&)
  3.87      4.10     0.16   200202     0.00     0.00  build_matrix(std::vector<coords_t,std::allocator<coords_t> > const&)
  0.24      4.11     0.01   400406     0.00     0.00  std::vector<double,std::allocator<double> >::vector(std::vector<double,std::allocator<double> > const&)
  0.24      4.12     0.01   100000     0.00     0.00  Wrat(std::vector<coords_t,std::vector<coords_t,std::allocator<coords_t> > const&)
  0.24      4.13     0.01        9     1.11     1.11  std::vector<short,std::allocator<short> >* std::__uninitialized_copy_a<__gnu_cxx::__normal_iterator<std::vector<short,std::alloca

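which seems to suggest that I need not bother looking anywhere but ::R_impl(...).

At the same time, if I compile without the -pg switch and run valgrind --tool=callgrind ./a.out instead, I get a rather different picture (the original post showed a screenshot of the kcachegrind output here).

If I interpret this correctly, it seems to suggest that ::R_impl(...) takes only about 50% of the time, while the other half is spent in linear algebra (Wrat(...), eigenvalues and the underlying lapack calls), which sat far down in the gprof profile.

I understand that gprof and cachegrind use different techniques, and I would not worry if their results were somewhat different. But here they look very different, and I am at a loss as to how to interpret them. Any ideas or suggestions?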

Answers

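You are looking at the wrong column. You have to look at the second column in the kcachegrind output, the one named "self". This is the time spent in the particular subroutine only, without counting its children. The first column holds the cumulative time (for main it equals 100% of the machine time), and it is not very informative (in my opinion). Note that the kcachegrind output shows a total time for the process of 53.64 seconds, while the time spent in the subroutine "R_impl" is 46.72 seconds, which is 87% of the total. So gprof and kcachegrind agree almost perfectly.

gprof is an instrumented profiler, callgrind is a sampling profiler. With an instrumented profiler you get overhead for every function entry and exit, which can skew the profile, particularly if you have relatively small functions that are called many times. Sampling profilers tend to be more accurate: they slow the overall program execution slightly, but this tends to have the same relative effect on all functions. Try the free 30-day evaluation of Zoom from RotateRight. I suspect it will give you a profile that agrees more with callgrind than with gprof.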
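The distinction the first answer draws between "self" and cumulative (inclusive) cost can be sketched in a few lines of C++. This is only an illustration, not how either profiler is implemented: the Node and CallGraph types, the helper functions and the "everything_else" entry are all hypothetical, and the graph is assumed to be a tree (no recursion, one caller per function). Only the two figures 46.72 and 53.64 come from the answer; the remainder is lumped into one made-up node so that the total matches.

```cpp
#include <cassert>
#include <cmath>
#include <map>
#include <string>
#include <vector>

// Hypothetical call-graph node: the cost of the function body itself
// ("self") plus the names of the functions it calls.
struct Node {
    double self_cost;
    std::vector<std::string> callees;
};

using CallGraph = std::map<std::string, Node>;

// Inclusive (cumulative) cost = self cost + inclusive cost of every callee.
// Assumes a tree-shaped graph: no recursion, one caller per function.
double inclusive_cost(const CallGraph& g, const std::string& name) {
    const Node& n = g.at(name);
    double total = n.self_cost;
    for (const std::string& callee : n.callees) {
        total += inclusive_cost(g, callee);
    }
    return total;
}

// Share of the whole run spent in the body of `name`, in percent.
// This is the number to compare with the "% time" column of gprof's flat profile.
double self_percent(const CallGraph& g, const std::string& name,
                    const std::string& root) {
    return 100.0 * g.at(name).self_cost / inclusive_cost(g, root);
}

// Toy graph. 46.72 and the 53.64 total are quoted in the answer;
// "everything_else" is an invented node holding the remainder.
CallGraph example_graph() {
    CallGraph g;
    g["main"] = Node{0.0, {"R_impl", "everything_else"}};
    g["R_impl"] = Node{46.72, {}};
    g["everything_else"] = Node{6.92, {}};
    return g;
}
```

With this toy graph, inclusive_cost(g, "main") is the full 53.64 even though main does no work itself, while self_percent(g, "R_impl", "main") comes out at roughly 87%, in line with the 89.62% that gprof reports for R_impl.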
