Assembly line / Pipeline

Problem Definition

Cache Data Refresh

  • We have 576 cache files. Each file (in fact a file pair: one index file and one data file) contains a point-to-address mapping, (latitude, longitude) -> Address, i.e., the result of reverse geocoding.
  • Some of the data (addresses) may be out of date, so we need to get updated addresses for those points.


The refresh is subject to the following constraints:
  1. There is no date information for a cache entry [point, address], so we refresh all points in a cache file; that is, we regenerate the whole cache file.
  2. The refresh can be interrupted and must resume from the breakpoint; we do not want to start from the beginning.
  3. The total number of points is about 80M. The refresh must finish within a given time, or at least its duration must be controllable.
  4. The refresh is done online. After the refresh finishes, the server switches to the new cache file and backs up the old one.
  5. There are several peer servers, each holding the same data (576 cache files), and we want all of them to get the updated addresses at the same time.


A straightforward solution handles one point at a time:
  1. For each cache file
    • For each point in the cache file
      • read the point from the cache file - R
      • get the updated address - G
      • write the new address into the new cache file - W
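
The 'R-G-W' loop above can be sketched in a few lines of Python. This is only an illustration: `refresh_sequential` and the `get_address` callback are hypothetical names, and an in-memory dict stands in for the real index/data file pair.

```python
from typing import Callable, Dict, Iterable, Tuple

Point = Tuple[float, float]  # (latitude, longitude)


def refresh_sequential(points: Iterable[Point],
                       get_address: Callable[[Point], str]) -> Dict[Point, str]:
    """Refresh every point one at a time: read (R), geocode (G), write (W)."""
    new_cache: Dict[Point, str] = {}
    for point in points:              # R: read the next point from the old cache
        address = get_address(point)  # G: look up the current address (slow step)
        new_cache[point] = address    # W: store it in the new cache
    return new_cache
```

Everything runs in one thread, so the total time is roughly the number of points multiplied by the cost of one `get_address` call.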


This approach has two drawbacks:
  • If the refresh is interrupted, the current cache file has to restart from its very beginning. Suppose a cache file contains 5M points and the server crashes while the refresh is on the last of them: the work on almost all 5M points is lost.
  • The whole process runs in a single thread, so its rate is limited by the bottleneck step: getting the updated address.


Instead of running 'R-G-W' one point at a time, we first snapshot all the points in the cache files in one pass and distribute them across 4096 point files, so that each point file holds a fixed number of points. Several threads then run the 'R-G' steps over the point files, and a single thread runs the 'W' step. Once a point file is finished, it is deleted and the thread moves on to the next file.

  1. For each cache file
    • For each point in the cache file
      • read the point from the cache file
      • write the point into a point file, starting a new file every 80M/4096 (about 19,500) points
  2. For each point file
    • For each point in the point file - multiple threads, each thread handles one file at a time
      • read the point from the point file
      • get the updated address
      • add the new address to the write thread's queue
    • Delete the point file once all of its points are finished
  3. For each point in the write queue
    • Write the new address into the new cache file
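
The three stages above could look roughly like the following Python sketch. It is a minimal illustration under stated assumptions, not the original implementation: all function names are hypothetical, `get_address` stands in for the reverse-geocoding call, point files are plain text with one `lat,lon` pair per line, and a dict stands in for the new cache file.

```python
import os
import queue
import tempfile
import threading
from typing import Callable, Dict, List, Tuple

Point = Tuple[float, float]  # (latitude, longitude)
_DONE = object()             # sentinel that tells the writer thread to stop


def split_into_point_files(points: List[Point], directory: str,
                           points_per_file: int) -> List[str]:
    """Stage 1: snapshot all points into small point files."""
    paths = []
    for start in range(0, len(points), points_per_file):
        path = os.path.join(directory, f"points_{start // points_per_file:04d}.txt")
        with open(path, "w") as f:
            for lat, lon in points[start:start + points_per_file]:
                f.write(f"{lat!r},{lon!r}\n")
        paths.append(path)
    return paths


def refresh_from_point_files(paths: List[str],
                             get_address: Callable[[Point], str],
                             workers: int = 4) -> Dict[Point, str]:
    """Stages 2 and 3: several 'R-G' threads feed a single 'W' thread."""
    pending = queue.Queue()       # point files that still need processing
    for path in paths:
        pending.put(path)
    write_queue = queue.Queue()   # (point, address) pairs for the writer
    new_cache: Dict[Point, str] = {}

    def geocode_worker() -> None:
        while True:
            try:
                path = pending.get_nowait()   # claim the next point file
            except queue.Empty:
                return                        # no files left
            with open(path) as f:
                for line in f:
                    lat, lon = (float(x) for x in line.split(","))
                    write_queue.put(((lat, lon), get_address((lat, lon))))
            os.remove(path)                   # file finished: delete it

    def writer() -> None:
        while True:
            item = write_queue.get()
            if item is _DONE:
                return
            point, address = item
            new_cache[point] = address        # W: only this thread writes

    writer_thread = threading.Thread(target=writer)
    writer_thread.start()
    threads = [threading.Thread(target=geocode_worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    write_queue.put(_DONE)                    # all producers are done
    writer_thread.join()
    return new_cache


def run_demo() -> Dict[Point, str]:
    """Refresh six made-up points through three point files."""
    points = [(float(i), float(-i)) for i in range(6)]
    with tempfile.TemporaryDirectory() as tmp:
        paths = split_into_point_files(points, tmp, points_per_file=2)
        result = refresh_from_point_files(paths, lambda p: f"address of {p}", workers=3)
        assert os.listdir(tmp) == []          # every point file was deleted
    return result
```

After an interruption, the point files still on disk are exactly the work that remains, so a restart only needs to list the directory and pass those paths to `refresh_from_point_files`. Note that this sketch deletes a point file as soon as its results are queued; a more careful version might wait until the writer has persisted them.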


This design has two advantages:
  • If the refresh is interrupted, at most the work on the point files currently in progress (one per worker thread) is lost.
  • Multiple threads run the bottleneck function ('G'), which speeds up the whole refresh. The refresh rate is controlled by the number of threads.
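
To make the two points above concrete, here is a back-of-the-envelope sketch. The 80M points and 4096 point files come from the text; the function names are hypothetical, and the 0.01 s per lookup used in the usage note is an assumed figure, not a measurement. The model also assumes that the 'G' step dominates and scales linearly with the number of threads.

```python
import math

TOTAL_POINTS = 80_000_000   # from the problem statement
POINT_FILES = 4096          # from the design above


def points_per_file(total_points: int = TOTAL_POINTS,
                    files: int = POINT_FILES) -> int:
    """Upper bound on the points redone when one point file is interrupted."""
    return math.ceil(total_points / files)


def refresh_hours(total_points: int, seconds_per_lookup: float,
                  threads: int) -> float:
    """Estimated refresh duration in hours (assumes linear scaling)."""
    return total_points * seconds_per_lookup / threads / 3600


def threads_needed(total_points: int, seconds_per_lookup: float,
                   deadline_hours: float) -> int:
    """Smallest thread count that meets the deadline under the same model."""
    return math.ceil(total_points * seconds_per_lookup / (deadline_hours * 3600))
```

For example, with an assumed 0.01 s per lookup, `refresh_hours(TOTAL_POINTS, 0.01, 8)` is about 27.8 hours, and `threads_needed(TOTAL_POINTS, 0.01, 24)` gives 10 threads for a 24-hour deadline. `points_per_file()` gives 19532, the most work lost per interrupted file.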


1. How to handle a batch process

1) Break down (split) and reassemble to get pipelining

Consider the assembly of a car: assume that certain steps in the assembly line are to install the engine, install the hood, and install the wheels (in that order, with arbitrary interstitial steps); only one of these steps can be done at a time. In traditional production, only one car would be assembled at a time. If engine installation takes 20 minutes, hood installation takes 5 minutes, and wheel installation takes 10 minutes, then a car can be produced every 35 minutes.

In an assembly line, car assembly is split between several stations, all working simultaneously. When one station is finished with a car, it passes it on to the next. By having three stations, a total of three different cars can be operated on at the same time, each one at a different stage of its assembly.

After finishing its work on the first car, the engine installation crew can begin working on the second car. While the engine installation crew works on the second car, the first car can be moved to the hood station and fitted with a hood, then to the wheels station and be fitted with wheels. After the engine has been installed on the second car, the second car moves to the hood assembly. At the same time, the third car moves to the engine assembly. When the third car’s engine has been mounted, it then can be moved to the hood station; meanwhile, subsequent cars (if any) can be moved to the engine installation station.

Assuming no loss of time when moving a car from one station to another, the longest stage on the assembly line determines the throughput (20 minutes for the engine installation), so a car can be produced every 20 minutes once the first car, which takes 35 minutes, has been produced.
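
The arithmetic in this example can be checked with a short sketch. The function names are hypothetical, and the model assumes identical cars and no transfer time between stations, as in the text.

```python
from typing import Sequence


def sequential_minutes(stage_minutes: Sequence[int], cars: int) -> int:
    """One car at a time: every car pays for every stage in full."""
    return sum(stage_minutes) * cars


def pipelined_minutes(stage_minutes: Sequence[int], cars: int) -> int:
    """Assembly line: the first car takes the sum of all stages, and each
    later car finishes one bottleneck interval (the longest stage) after it."""
    if cars == 0:
        return 0
    return sum(stage_minutes) + max(stage_minutes) * (cars - 1)
```

With stages of 20, 5 and 10 minutes, three cars take `sequential_minutes([20, 5, 10], 3)` = 105 minutes one at a time, but `pipelined_minutes([20, 5, 10], 3)` = 75 minutes on the line: the first car is done at 35 minutes, and the next two follow at 55 and 75.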

2) Find the bottleneck at the proper granularity

3) Intermediate results or state can be helpful. In the case above, these are the point files.

2. Cache Design

Consider cache expiry. Generally, a date or lifetime is stored together with each cache entry when it is created.
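
As an illustration of that idea, the sketch below stores a creation time and a lifetime with each entry, so that only stale points need refreshing. The `CacheEntry` layout and the `stale_points` helper are hypothetical and are not the format of the cache files described above.

```python
import time
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

Point = Tuple[float, float]  # (latitude, longitude)


@dataclass
class CacheEntry:
    address: str
    created_at: float    # seconds since the epoch when the entry was written
    ttl_seconds: float   # lifetime after which the entry counts as stale

    def is_expired(self, now: Optional[float] = None) -> bool:
        """True once the entry has lived for at least its lifetime."""
        if now is None:
            now = time.time()
        return now - self.created_at >= self.ttl_seconds


def stale_points(cache: Dict[Point, CacheEntry], now: float) -> List[Point]:
    """Return only the points whose addresses need to be refreshed."""
    return [point for point, entry in cache.items() if entry.is_expired(now)]
```

With per-entry expiry, a refresh could be limited to the result of `stale_points` instead of regenerating every cache file in full.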

3. Quadtrees

Why not use a database to store the entries?

Please see Quadtrees.

1) C++ Implementation

2) Use Case
