c – 并行for_each比std :: for_each慢两倍以上

我正在读安东尼·威廉姆斯的 C++ Concurrency in Action.在关于设计并发代码的章节中,有 std::for_each algorihtm的并行版本.以下是本书略有修改的代码：

join_thread.hpp

#pragma once

#include <vector>
#include <thread>

class join_threads
{
public:
  explicit join_threads(std::vector<std::thread>& threads)
    : threads_(threads) {}

  ~join_threads()
  {
    for (size_t i = 0; i < threads_.size(); ++i)
    {
      if(threads_[i].joinable())
      {
        threads_[i].join();
      }
    }
  }

private:
  std::vector<std::thread>& threads_;
};

parallel_for_each.hpp

#pragma once

#include <future>
#include <algorithm>

#include "join_threads.hpp"

template<typename Iterator,typename Func>
void parallel_for_each(Iterator first,Iterator last,Func func)
{
  const auto length = std::distance(first,last);
  if (0 == length) return;

  const auto min_per_thread = 25u;
  const unsigned max_threads = (length + min_per_thread - 1) / min_per_thread;

  const auto hardware_threads = std::thread::hardware_concurrency();

  const auto num_threads= std::min(hardware_threads != 0 ?
        hardware_threads : 2u,max_threads);

  const auto block_size = length / num_threads;

  std::vector<std::future<void>> futures(num_threads - 1);
  std::vector<std::thread> threads(num_threads-1);
  join_threads joiner(threads);

  auto block_start = first;
  for (unsigned i = 0; i < num_threads - 1; ++i)
  {
    auto block_end = block_start;
    std::advance(block_end,block_size);
    std::packaged_task<void (void)> task([block_start,block_end,func]()
    {
      std::for_each(block_start,func);
    });
    futures[i] = task.get_future();
    threads[i] = std::thread(std::move(task));
    block_start = block_end;
  }

  std::for_each(block_start,last,func);

  for (size_t i = 0; i < num_threads - 1; ++i)
  {
    futures[i].get();
  }
}

我使用以下程序使用顺序版本std::for_each对其进行基准测试：

main.cpp中

#include <iostream>
#include <random>
#include <chrono>

#include "parallel_for_each.hpp"

using namespace std;

constexpr size_t ARRAY_SIZE = 500'000'000;
typedef std::vector<uint64_t> Array;

template <class FE,class F>
void test_for_each(const Array& a,FE fe,F f,atomic<uint64_t>& result)
{
  auto time_begin = chrono::high_resolution_clock::now();
  result = 0;
  fe(a.begin(),a.end(),f);
  auto time_end = chrono::high_resolution_clock::now();

  cout << "Result = " << result << endl;
  cout << "Time: " << chrono::duration_cast<chrono::milliseconds>(
            time_end - time_begin).count() << endl;
}

int main()
{
  random_device device;
  default_random_engine engine(device());
  uniform_int_distribution<uint8_t> distribution(0,255);

  Array a;
  a.reserve(ARRAY_SIZE);

  cout << "Generating array ... " << endl;
  for (size_t i = 0; i < ARRAY_SIZE; ++i)
    a.push_back(distribution(engine));

  atomic<uint64_t> result;
  auto acc = [&result](uint64_t value) { result += value; };

  cout << "parallel_for_each ..." << endl;
  test_for_each(a,parallel_for_each<Array::const_iterator,decltype(acc)>,acc,result);
  cout << "for_each ..." << endl;
  test_for_each(a,for_each<Array::const_iterator,result);

  return 0;
}

我机器上算法的并行版本比顺序版本慢两倍：

parallel_for_each ...
Result = 63750301073
Time: 5448
for_each ...
Result = 63750301073
Time: 2496

我在运行在Intel(R)Core(TM)i3-6100 CPU @ 3.70GHz上的Ubuntu Linux上使用GCC 6.2编译器.

如何解释这种行为？这是因为共享原子< uint64_t>线程和缓存乒乓之间的变量？

我用perf分别对它们进行了分析.对于并行版本,统计数据如下：

1137982167      cache-references                                            
  247652893      cache-misses              #   21,762 % of all cache refs    
60868183996      cycles                                                      
27409239189      instructions              #    0,45  insns per cycle        
 3287117194      branches                                                    
      80895      faults                                                      
          4      migrations

对于顺序的：

402791485      cache-references                                            
  246561299      cache-misses              #   61,213 % of all cache refs    
40284812779      cycles                                                      
26515783790      instructions              #    0,66  insns per cycle
 3188784664      branches                                                    
      48179      faults
          3      migrations

很明显,并行版本会产生更多的缓存引用,循环和故障,但为什么呢？

解决方法

您正在共享相同的结果变量：所有线程都在atomic< uint64_t>上累积.结果,捶打缓存！

每次线程写入结果时,其他核心中的所有高速缓存都将失效：这会导致高速缓存行争用.

[…] to write to a memory location a core must additionally have exclusive ownership of the cache line containing that location. While one core has exclusive use,all other cores trying to write the same memory location must wait and take turns — that is,they must run serially. Conceptually,it’s as if each cache line were protected by a hardware mutex,where only one core can hold the hardware lock on that cache line at a time.

> This article on “false sharing”,其中涉及类似问题,更深入地解释了缓存中发生的情况.

我对您的程序进行了一些修改并获得了以下结果(在具有i7-4770K [8线程超线程]的机器上)：

Generating array ...
parallel_for_each ...
Result = 63748111806
Time: 195
for_each ...
Result = 63748111806
Time: 2727

并行版本比串行版本快约92％.

> std :: future和std :: packaged_task是重量级抽象.在这种情况下,std::experimental::latch就足够了.
>每个任务都发送到线程池这可以最大限度地减少线程创建开销.
>每个任务都有自己的累加器.这消除了共享.

代码可用于here on my GitHub.它使用了一些个人依赖项,但您应该了解这些更改.

以下是最重要的变化：

// A latch is being used instead of a vector of futures.
ecst::latch l(num_threads - 1);

l.execute_and_wait_until_zero([&]
{
    auto block_start = first;
    for (unsigned i = 0; i < num_threads - 1; ++i)
    {
        auto block_end = block_start;
        std::advance(block_end,block_size);

        // `p` is a thread pool.
        // Every task posted in the thread pool has its own `tempacc` accumulator.
        p.post([&,block_start,tempacc = 0ull]() mutable
        {
            // The task accumulator is filled up...
            std::for_each(block_start,[&tempacc](auto x){ tempacc += x; });

            // ...and then the atomic variable is incremented ONCE.
            func(tempacc);
            l.decrement_and_notify_all();
        });

        block_start = block_end;
    }

    // Same idea here: accumulate to local non-atomic counter,then
    // add the partial result to the atomic counter ONCE.
    auto tempacc2 = 0ull;
    std::for_each(block_start,[&tempacc2](auto x){ tempacc2 += x; });
    func(tempacc2);
});

c – 并行for_each比std :: for_each慢两倍以上

解决方法

相关推荐