《数据结构》学习-- Hash(3) --Open Addressing

1. 回顾

上一次我们讲了Hash冲突解决方案之开散列（Separate Chaining）。其优点是思路简单，实现也容易。这一回我们介绍另一种Hash冲突解决方案，名为闭散列法，或叫Open Addressing

你可能觉得闭散列和Open有些矛盾。其实，看了Open Addressing的核心思想后，你就明白了。

2. Open Addressing核心思想

Open Addressing思想非常简单。如果第一次Hash寻找到得位置失败，那就不断进行位移，直到找到满足条件的位置。

即：我们不断尝试h0(x),h1(x),h2(x)...这些位置。其中：hi(x) = (Hash(x) + Function(i)) % tableSize。其中Function(0) = 0.

2.1 Linear Open Addressing

顾名思义，就是Function(i) = i。也就是说，如果第一次Hash寻找位置失败，那么就顺序找下去，直到找到一个满足要求的位置为止。

优点：思路简单，而且只要Hash表不满，总能找到满足条件的位置。

缺点：容易产生主聚合效应（primary clustering）。简单来说，就是插入的点容易聚集到一块地方，从而使得第一次Hash到这块范围的数都必须顺序搜索这块范围。根据复杂的计算，我们可以得到，当load factor（此概念在上一章介绍）为0.5时，平均每次插入（等同于非成功寻找）需要位移2.5次，平均每次成功寻找需要位移1.5次。将load factor保证在0.5以下，那么时间是比较理想的。

2.2 Quadratic Open Addressing

顾名思义，就是Function(i) = i^2。简单地计算可以得到：h(i+1)(x) = hi(x) + 2i -1. 另外，只有当load factor小于0.5且Hash表大小为质数时，才能保证每次插入都成功（可以证明，这里略）。

优点：不会产生主聚合效应。

缺点：虽然Quadratic方法不会产生主聚合效应。但会产生次聚合效应（secondary clustering）。即，第一次Hash到同一个位置的点，他们之后的搜索过程都完全一样，需要重复。

3. 延迟删除（lazy deletion）

如果我们需要删除一个值，不能简单的把那个位置的值去掉。简单思索便可明白，因为这个点后面的值可能是通过位移过去的，如果这点被挖空，那么我们想寻找后面的值就变得不可能了。

因此，我们使用一个延迟删除的技术。思想很简单，我们给每个点赋予一个状态，分别是被占用（legitimate），空（empty），被删除（deleted）。初始时所有点都为空，当被插入一个值时将状态设为被占用，但被删除时状态设为被删除。这样的话，如果我们要寻找一个点，只要搜索路径上的点非空，且其值与我们想要搜索的值不同，那么就不断搜索下去，直到找到空点或者相同值得点。（如果觉得拗口，请看下面的代码）。

4. Open Addressing实现

4.1 基本数据结构

enum Kind  {LEGITIMATE,EMPTY,DELETED};
struct HashNode{
	ElementType elementValue;
	enum Kind kind;
};
struct HashTbl{
	int tableSize;
	int content;
	HashNode* table;
};
HashTbl* hashTable;

4.2 初始化

template<class elementtype="">
void HashTable<elementtype>::initialize(HashTbl*& newHashTable,int minSize)
{
	int tableSize=nextPrime(minSize);//寻找下一个比minSize大的质数

	try{
		newHashTable=new HashTbl;
	}catch(std::bad_alloc&){
		errorDisplay("new memory failed!",__FILE__,__FUNCTION__,__LINE__);//如果new失败，报错
	}

	try{
		newHashTable->table=new HashNode[tableSize];
	}catch(std::bad_alloc&){
		errorDisplay("new memory failed!",__LINE__);//如果new失败，报错
	}

	for(int i=0;i<tablesize i="" newhashtable-="">table[i].kind=EMPTY;
	}

	newHashTable->tableSize=tableSize;
	newHashTable->content=0;
}
</tablesize></elementtype></class>

4.3 寻找Find

Find函数可以说是Open Addressing的关键。

template<class elementtype="">
int HashTable<elementtype>::findInner(HashTbl* _hashTable,ElementType& elementValue)
{
	int key=getElementKey(elementValue)%_hashTable->tableSize; //第一次Hash，getElementKey是根据输入数据获得一个初始key值，详细可参考上一章
	int hashTimes=0;
	while(_hashTable->table[key].kind!=EMPTY && _hashTable->table[key].elementValue!=elementValue){
		key=hash2(key,hashTimes)%_hashTable->tableSize; //hash2就是上面所提到的Function，具体见下面
	}
	return key;
}

template<class elementtype="">
bool HashTable<elementtype>::find(ElementType elementValue)
{
	int pos=findInner(hashTable,elementValue);
	return hashTable->table[pos].kind==LEGITIMATE;
}

template<class elementtype="">
int HashTable<elementtype>::hash2(int key,int hashTimes)
{
	switch(OPEN_ADDRESS){ //根据不同的Open Addressing方法，选择不同的位移方式
		case LINEAR:
			return key+hashTimes;
		case QUDRATIC:
			return key+2*(hashTimes+1)-1;
		default:
			errorDisplay("OPEN_ADDRESS method error!",__LINE__);
			return -1;
	}
}
</elementtype></class></elementtype></class></elementtype></class>

4.4 插入Insertion

template<class elementtype="">
bool HashTable<elementtype>::insertInner(HashTbl*& _hashTable,ElementType& elementValue) 
{
	//rehash
	if(loadFactor>MAX_LOAD_FACTOR){ //MAX_LOAD_FACTOR一般取0.5
		_hashTable=rehash(_hashTable->tableSize); //rehash的概念在上一章讲过
	}
	
	int pos=findInner(_hashTable,elementValue);

	HashNode& hashNode=_hashTable->table[pos];
	if(hashNode.kind==LEGITIMATE) //该值已经存在，无需插入
		return false;
	else{ //该值不存在，或者已被删除
		hashNode.elementValue=elementValue;
		hashNode.kind=LEGITIMATE;
		_hashTable->content++;
		loadFactor=(double)(_hashTable->content)/(double)(_hashTable->tableSize);
		return true;
	}
}

template<class elementtype="">
bool HashTable<elementtype>::insert(ElementType elementValue)
{
	return insertInner(hashTable,elementValue);
}
</elementtype></class></elementtype></class>

4.5 删除Remove

template<class elementtype="">
bool HashTable<elementtype>::removeInner(HashTbl* _hashTable,ElementType& elementValue)
{
	int pos=findInner(_hashTable,elementValue);
	
	HashNode& hashNode=_hashTable->table[pos];
	if(hashNode.kind==LEGITIMATE){ //这个点存在
		hashNode.kind=DELETED;
		_hashTable->content--;
		loadFactor=(double)(_hashTable->content)/(double)(_hashTable->tableSize);
		return true;
	}
	else //这个点不存在，或已被删除
		return false;
}

template<class elementtype="">
bool HashTable<elementtype>::remove(ElementType elementValue)
{
	return removeInner(hashTable,elementValue);
}
</elementtype></class></elementtype></class>

4.6 扩充Hash表 rehash

template<class elementtype="">
class HashTable<elementtype>::HashTbl* HashTable<elementtype>::rehash(int currentSize)
{
	HashTbl* newHashTable;
	initialize(newHashTable,currentSize*10);//扩充一个比原来大十倍的Hash表，这个数字是我简单设定的，没有经过考量！
	loadFactor=(double)(hashTable->content)/(double)(newHashTable->tableSize);

	for(int i=0;i<hashtable->tableSize;i++){
		insertInner(newHashTable,hashTable->table[i].elementValue);
	}

	return newHashTable;
}
</hashtable-></elementtype></elementtype></class>

5. 性能测试

我们创建一个使用Quadratic方式位移的Hash表。初始大小设为1,000,000.然后不断插入10,000个随机数。测试需要多少时间。

int main()
{
	HashTable<int> hashTable(1000000,&getElementKey,&isEqual,HashTable<int>::QUDRATIC,0.49);
	clock_t start=clock();

	for(int i=0;i<10000000;i++){
		int r=rand();
		hashTable.insert(r);
	}

	clock_t finish=clock();
	printf("time is %fs\n",(double)(finish-start)/CLOCKS_PER_SEC);

	return 0;
}
</int></int>

使用clang++编译，O3速度优化。测试结果：

time is 2.239344s
time is 2.059147s
time is 2.318181s

6.总结

这次我们介绍了闭散列法（Open Addressing），实测下来，这种方法比开散列速度更快。个人认为主要原因是避免了内存分配/释放操作这一非常耗时的过程。至此为止，我们已经把主流的Hash方法都介绍了。对于一般的应用基本足够。Hash冲突解决方案，Hash Function的设计，都是需要具体问题具体分析的，没有一个放之四海而皆准的方案，关于这一点我也并没有经验，请大家参考其他资源。最后，在下一章（应该也是最终章）中，将介绍C++ STL，以及Python中的Hash库。敬请期待吧！