Design and Application of Hash Table
What is a hash table
The English name of the hash table is “Hash Table”, and we usually call it “Hash Table” or “Hash Table”
The hash table uses the feature that arrays support random access to data according to subscripts, so the hash table is actually an extension of the array, evolved from the array. It can be said that if there is no array, there is no hash table.
Why do you say that? In fact, the storage of the so-called hash table is to extract some data from the data to be stored, calculate the index of an array according to a specific rule, and then store the data in the corresponding index position. The extracted data is called ** key (key), or the keyword **, this rule is called ** hash function **, and the calculated array index is ** hash value **, or Hash value.
Let’s take a specific example. We want to store a doc named ** abcd.txt ** in the hash table. We use the file name as the key to design a hash function. This hash function imported parameter is the file name. After a series of calculations, the index of the array is 3, so the content of this doc is stored in the place where the index of the array is 3. Next time I want to see if the doc of abcd.txt has been stored, I can execute the hash function again., get subscript 3, and then go to query whether there is data in the place where the subscript is 3.
Hash function
From the above example, we can see that the hash function plays a very key role in the hash table. Now let’s learn about the hash function.
Hash function, as its name suggests, is a function. We can define it as hash (key), where key represents the key value of the element, and hash (key) represents the hash value calculated by the hash function.
So how should we design a hash function?
- The hash value calculated by the hash function is a non-negative integer;
If key1 = key2, then hash (key1) hash(key2);
If key1 ≠ key2, then hash (key1) ≠ hash (key2).
The first point is easy to understand because our hash table is an array, and the index of the array is a non-negative integer.
The second point is also easier to understand. The hash value I get for the same content should be the same, so as to ensure that I can find it accurately the next time I want to find it.
The third point is actually difficult to do. We can hardly find a perfect conflict-free hash function. Even if we can find it, the time cost and calculation cost are very large, so for the hash conflict problem, we need to use other ways to solve it. For a simple example, if the length of the hash table is 10,000, no matter how your hash function is designed, the 10001 data will definitely be the same as the array index obtained by a certain record. This is the case when the loading factor reaches 1, which is bound to conflict.
The loading factor of the hash table = the number of elements filled in the table/the length of the hash table
When key1 ≠ key2, then hash (key1) = hash (key2), which is called hash conflict.
Solutions to hash conflicts
Now that we have said that we can hardly avoid hash conflicts, we need to think of a relatively low-cost way to solve hash conflicts.
There are two commonly used hash conflict resolution methods, open addressing and chaining.
Open addressing method
The core idea of open addressing is that if a hash conflict occurs, we re-probe a free location and insert it.
Linear detection method
- Insertion process:
How to re-probe the new location? Let me first talk about a relatively simple detection method, Linear Probing. When we insert data into the hash table, if a certain data has been hashed by the hash function and the storage location has been occupied, we start from the current location and look back in turn to see if there is a free location until we find it.
- Search process:
The process of searching for elements in the hash table is somewhat similar to the insertion process. We use the hash function to obtain the hash value corresponding to the key value of the element to be found, and then compare the element subscribed as the hash value in the array with the element to be found. If they are equal, it means that they are the elements we are looking for; otherwise, we search in order. If we traverse to the free position in the array and have not found it yet, it means that the element to be found is not in the hash table.
- Removal process:
Hash tables, like arrays, support not only insert and find operations, but also delete operations. For hash tables that use linear probing to resolve conflicts, the delete operation is slightly special. We cannot simply set the element to be deleted to null. Why is this? Remember the search operation we just talked about? During the search, once we find a free position through linear probing method, we can assume that the data does not exist in the hash table. However, if this free position is later deleted by us, it will invalidate the original search algorithm. The data that originally existed will be considered non-existent. How to solve this problem? We can mark the deleted element as deleted in particular. When the linear probe searches, it encounters a space marked as deleted, and instead of stopping, it continues to probe down.
Quadratic detection method
The so-called quadratic detection is very similar to linear detection. The step size of each linear detection is 1, so the subscript sequence it detects is hash (key) + 0, hash (key) + 1, hash (key) + 2… And the step size of the quadratic detection becomes the original “quadratic”, that is to say, the subscript sequence it detects is hash (key) + 0, hash (key) + 12, hash (key) + 22…
Double hashing
The so-called double hashing means not only using one hash function. We use a set of hash functions hash1 (key), hash2 (key), hash3 (key)… We first use the first hash function, if the calculated storage location is already occupied, then use the second hash function, and so on, until a free storage location is found. No matter which detection method is used, when there are not many free positions in the hash table, the probability of hash collision will be greatly increased. In order to ensure the operational efficiency of the hash table as much as possible, in general, we will try our best to ensure that there are a certain proportion of free slots in the hash table. We use the load factor to represent the number of vacancies. The calculation formula of the loading factor is: the loading factor of the hash table = the number of elements filled in the table/the length of the hash table. The larger the loading factor, the fewer free positions and the more conflicts, the performance of the hash table will decrease.
Linked list method
The linked list method is a more commonly used hash conflict resolution method, which is much simpler than the open addressing method.
In the hash table, each “bucket” or “slot” corresponds to a linked list, and all elements with the same hash value are placed in the linked list corresponding to the same slot.
When inserting, we only need to calculate the corresponding hash slot through the hash function and insert it into the corresponding linked list, so the time complexity of insertion is O (1).
When searching and deleting an element, we also calculate the corresponding slot through the hash function, and then traverse the linked list to find or delete.
What is the time complexity of the find or delete operation? In fact, the time complexity of these two operations is proportional to the length k of the linked list, which is O (k). For a hash function with a uniform hash, theoretically, k = n/m, where n represents the number of data in the hash and m represents the number of “slots” in the hash table. The solution begins.
Thinking
Since the hash table is actually an array, why do we need to make it so complicated?
Personally, I think the most important point is to be able to determine whether the value you are looking for is in the hash table in O (1) time complexity, and if so, return it.
How to design a good hash table
The query efficiency of hash table cannot be generalized as O (1). It is related to hash function, load factor, hash conflict, etc.
If the hash function is not well designed or the loading factor is too high, it may lead to an increase in the probability of hash collision and a decrease in query efficiency.
In extreme cases, some malicious attackers may also use carefully constructed data so that all data is hashed into the same slot after a hash function. If we use a list-based conflict resolution method, then at this time, the hash table will degenerate into a linked list, and the time complexity of the query will rapidly degrade from O (1) to O (n).
From our description of the hash table above, we should be able to understand the three main points of designing a good hash table:
- Design an efficient and evenly distributed hash function
- Efficient expansion of hash tables
- Choose the right hash conflict solution
Design a hash function
First of all, the design of the hash function should not be too complicated. An overly complex hash function will inevitably consume a lot of computing time, which indirectly affects the performance of the hash table. Secondly, the values generated by the hash function should be as random and evenly distributed as possible, so as to avoid or minimize hash conflicts, and even if there is a conflict, the data hashed into each slot will be relatively average, and there will be no A situation where there is a lot of data in one slot.
Efficient expansion
When the load factor of the hash table exceeds a certain threshold, it needs to be expanded. The load factor threshold needs to be selected properly. If it is too large, it will lead to too many collisions; if it is too small, it will lead to serious memory waste
For array expansion, the data move operation is relatively simple. However, for hash table expansion, the data move operation is much more complicated.
Because the size of the hash table has changed, the storage location of the data has also changed, so we need to recalculate the storage location of each data through the hash function.
Insert a data, in the best case, no expansion is required, and the best time complexity is O (1). In the worst case, the hash table loading factor is too high, and the expansion is started. We need to reapply the Memory Space, recalculate the hash position, and move the data, so the time complexity is O (n)
In most cases, inserting data into a dynamically expanded hash table is very fast, but in special cases, when the loading factor has reached the threshold, it is necessary to expand first and then insert data. At this time, inserting data will become very slow and even unacceptable.
In order to solve the situation that one-time expansion takes too much time, we can interspersed the expansion operation with the insertion operation and complete it in batches. When the loading factor reaches the threshold, ** we only apply for new space, but do not move the old data to the new hash table **.
** When there is new data to be inserted, we insert the new data into the new hash table, and take a data from the old hash table and put it into the new hash table **. Each time a data is inserted into the hash table, we repeat the above process. After multiple insert operations, all the data in the old hash table is moved to the new hash table bit by bit. In this way, there is no centralized one-time data movement, and the insertion operation becomes very fast.
How to choose a conflict resolution method?
Open addressing method
The open addressing method is not like the linked list method, which requires pulling a lot of linked lists. The data in the hash table is stored in an array, which can effectively use the CPU cache to speed up the query. Moreover, the hash table implemented by this method is relatively simple to serialize.
Deleting data is more troublesome, and it is necessary to mark the data that has been deleted. Moreover, in the open addressing method, all data is stored in an array, which is more expensive than the linked list method. Therefore, using the open addressing method to resolve conflicting hash tables, the upper limit of the loading factor should not be too large. This also leads to this method wasting Memory Space more than the linked list method.
Therefore, when the amount of data is relatively small and the loading factor is small, it is suitable to use the open addressing method. This is also the reason why the ThreadLocalMap in the Java uses the open addressing method to solve the hash conflict
Linked list method
The linked list method has a higher tolerance for large loading factors than the open addressing method. The open addressing method can only be used when the loading factor is less than 1. When it is close to 1, there may be a lot of hash collisions, resulting in a large number of probes, re-hashes, etc., and the performance will drop a lot. But for the linked list method, as long as the value of the hash function is randomly uniform, even if the loading factor becomes 10, that is, the length of the linked list becomes longer. Although the search efficiency has decreased, it is still much faster than the sequential search.
Because the linked list needs to store pointers, it consumes more memory for the storage of relatively small objects, and may double the memory consumption. Moreover, because the nodes in the linked list are scattered in memory, not continuous, it is not friendly to CPU cache, which also has a certain impact on execution efficiency.
If we are storing large objects, that is to say, the size of the object to be stored is much larger than the size of a pointer (4 bytes or 8 bytes), then the memory consumption of the pointer in the linked list can be ignored in front of the large object.
In fact, we can achieve a more efficient hash table by modifying the linked list method a little. That is, we transform the linked list in the linked list method into other efficient dynamic data structures, such as jump tables and Red-Black Trees. In this way, even if there is a hash collision, in extreme cases, all the data is hashed into the same bucket, and the search time of the final hash table is only O (logn). This also effectively avoids the hash collision attack mentioned earlier.
Why hash tables are often used with linked lists
Arrays have the advantage of random access, but have the disadvantage of requiring continuous memory.
Linked lists have the advantage of non-contiguous storage, but access lookups are linear.
The mixed use of hash tables and linked lists is to combine the advantages of arrays and linked lists and avoid their shortcomings.
Through the hash plus linked list, you can achieve discontinuous storage but random access, such as we need to access an element, you can calculate its hash value and then directly to the hash table to find, and because each item in the hash table is a linked list, but also can achieve discontinuous storage of data.