Comparison of Several Basic Data Structures

**A computer only knows how to access data in memory by its memory address.** On top of that, we have artificially distinguished two storage forms: contiguous storage (arrays) and chained storage (linked lists). Every other data structure is, in the end, just a particular use of these two storage forms. Any data structure can be implemented with either form; the question is only which one is more suitable, never whether it is possible, because ultimately everything comes down to accessing memory through addresses. Queues and stacks are usually implemented with arrays but can just as well be built on linked lists; trees are implemented with linked lists in most cases, yet a complete binary tree, for example, works perfectly well as an array.

This post uses a few examples to analyze which features of a data structure each scenario actually needs, and whether an array or a linked list is the better way to provide those features. I will not go into every detail, let alone the concrete implementations, because they are very basic but would turn this into a very long discussion. Instead, we will look at these structures from a higher level and focus on a few key characteristics.

What is a Data Structure?

I personally think of a data structure as having two parts.

One is the storage structure of the data: sequential (contiguous) storage or chained storage.

The other is the logical structure of the data: the special relationships among a set of data of the same type, together with the insert, delete, and update operations defined to maintain those relationships.

How to choose and design a data structure

Different data structures suit different scenarios. The contiguity of an array gives us efficient random access, but insertion and deletion are costly and resizing is troublesome. A linked list cannot be randomly accessed efficiently, but insertion and deletion are very simple and there is no such thing as resizing. When we design data structures, these two trade-offs are exactly what we keep playing with.

For example, a chained ("zipper-method") hash table uses a hash function to compute an array index quickly, and then uses linked lists to make insertion and deletion fast.

For example, a heap (a complete binary tree) can reduce an O(n) search over a linked list to O(log n), and so on.

Our usual thinking process for choosing a data structure should be like this:

  • What kind of algorithm does this scenario need? For example, evaluating Reverse Polish Notation requires first-in, last-out.

  • What characteristics does that algorithm need from a data structure? If it only needs push and pop, then choose a stack.

  • Based on our specific requirements, do we need to grow and shrink frequently, and how demanding is the access time? **Then choose the appropriate storage form for the data structure**.

In fact, this process is the reverse of the process by which data structures came into being.

**Whether it is an array or a linked list, the computer accesses data in memory through memory addresses.** The difference is that an array's memory is contiguous, so we can compute the address of the element we want directly as the base address plus an offset, while a linked list's memory is not contiguous, so each node has to record the memory address of the next node instead of relying on an offset calculation.
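
As a rough illustration (purely illustrative: JavaScript/TypeScript does not expose raw memory addresses, and the names here are made up for this example):

```typescript
// "Array" access: the address is one calculation away, so access is O(1).
function arrayAddress(base: number, index: number, elementSize: number): number {
  return base + index * elementSize;
}

// Linked list: each node only knows where the next node lives.
interface ListNode<T> {
  value: T;
  next: ListNode<T> | null;
}

// To reach the i-th node we must walk i links, which is O(n).
function nodeAt<T>(head: ListNode<T> | null, index: number): ListNode<T> | null {
  let node = head;
  for (let i = 0; i < index && node !== null; i++) {
    node = node.next;
  }
  return node;
}
```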

Therefore, arrays and linked lists are really just two different ways of using the computer's storage space.

And **different data structures are essentially special relationships that we impose on a set of data of the same type; these relationships help our algorithms run more efficiently, so special insert and delete operations are needed to maintain them**. For example, an array that only allows first-in, first-out access is called a queue, and a binary tree in which every left child is less than its parent and every right child is greater than its parent is called a binary search tree.

Arrays and linked lists

Arrays and linked lists are not only the two ways a computer stores data in memory, they are also the most basic data structures, used for the most basic data storage.

Queue

The defining characteristic of a queue is first-in, first-out (FIFO). Its operations are enqueue and dequeue: enqueue appends an item at the tail of the queue, and dequeue removes an item from the head.

For example, the event loop in JavaScript and nextTick in Vue both take this form.

After each event fires, its callback function is put into a queue; the loop then keeps traversing from the head of the queue, taking callbacks off the head and executing them.
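
A minimal sketch of that shape might look like the following (not the actual event-loop implementation; `enqueueTask` and `drainQueue` are names invented for this example):

```typescript
// A plain array used as a FIFO queue of callbacks.
type Callback = () => void;

const taskQueue: Callback[] = [];

// "Event fires": push the callback onto the tail of the queue.
function enqueueTask(cb: Callback): void {
  taskQueue.push(cb);
}

// The loop repeatedly takes the callback at the head and runs it.
function drainQueue(): void {
  while (taskQueue.length > 0) {
    const cb = taskQueue.shift()!; // dequeue from the head
    cb();
  }
}

enqueueTask(() => console.log("first"));
enqueueTask(() => console.log("second"));
drainQueue(); // prints "first" then "second" -- FIFO order
```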

Whenever we need first-in, first-out behavior, the queue is the data structure to reach for.

A queue can be implemented with either an array or a linked list without any problem. Normally, its enqueue and dequeue operations only take O(1).

However, if an array is used, the contiguous-memory nature of arrays means we must decide the size in advance and request one contiguous block of memory. If the queue ever needs to grow beyond that initial length, we run into a resizing problem: we have to request a larger block, copy the data over, and then release the old block, and at that moment the cost of insertion becomes O(n).

If a linked list is used, there is no resizing problem, because its memory is not contiguous: we can request memory for each enqueued item separately. But precisely because the memory is not contiguous, there is no way to compute a node's address from a base address plus an offset, so we need extra space to record the addresses, that is, each node has to store the address of the next node.

So whether to use an array or a linked list depends on the specific requirements. If the queue length is fixed or changes little, an array is better, because contiguous memory is friendlier to the CPU cache. If the queue length keeps changing and the stored data is much larger than a pointer, a linked list is actually the better choice.
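
For reference, a minimal linked-list-based queue might be sketched like this (the class and method names are just for illustration):

```typescript
// Enqueue at the tail, dequeue at the head -- both O(1), no resizing needed.
class QueueNode<T> {
  constructor(public value: T, public next: QueueNode<T> | null = null) {}
}

class LinkedQueue<T> {
  private head: QueueNode<T> | null = null;
  private tail: QueueNode<T> | null = null;

  enqueue(value: T): void {
    const node = new QueueNode(value);
    if (this.tail === null) {
      this.head = this.tail = node; // queue was empty
    } else {
      this.tail.next = node;
      this.tail = node;
    }
  }

  dequeue(): T | undefined {
    if (this.head === null) return undefined;
    const value = this.head.value;
    this.head = this.head.next;
    if (this.head === null) this.tail = null; // queue became empty
    return value;
  }
}
```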

For more about queues, you can read my other article about queues.

Stack

The defining characteristic of a stack is first-in, last-out (LIFO). Its operations are push and pop: push places an item on the top of the stack, and pop removes an item from the top. In other words, items enter and leave the stack from the same end.

For example, the LeetCode problem mentioned above, Evaluate Reverse Polish Notation, has a solution that is a very classic application of a stack.
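
The core of that solution, sketched roughly (this assumes well-formed input with only the four basic operators, as in the LeetCode problem):

```typescript
// Evaluate Reverse Polish Notation with a stack:
// push operands; on an operator, pop two operands, compute, push the result.
function evalRPN(tokens: string[]): number {
  const stack: number[] = [];
  for (const token of tokens) {
    if (token === "+" || token === "-" || token === "*" || token === "/") {
      const b = stack.pop()!; // the later operand is on top
      const a = stack.pop()!;
      switch (token) {
        case "+": stack.push(a + b); break;
        case "-": stack.push(a - b); break;
        case "*": stack.push(a * b); break;
        case "/": stack.push(Math.trunc(a / b)); break; // truncate toward zero
      }
    } else {
      stack.push(Number(token)); // operand: push onto the stack
    }
  }
  return stack.pop()!;
}

// ["2", "1", "+", "3", "*"] means (2 + 1) * 3 = 9
console.log(evalRPN(["2", "1", "+", "3", "*"])); // 9
```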

Similarly, a stack can also be implemented with either an array or a linked list.

There is not much more to analyze about the stack; it is essentially similar to the queue.

Skip list

As we said above, the problem with a linked list is that random access takes O(n).

Even if the data in the linked list is already sorted in ascending order, there is no way to use binary search to find an element in O(log n) time the way we can with an array.

To address this, the classic space-for-time trade-off of the computer world comes in. We can promote the first of every k nodes up into a new linked list, and have each promoted node keep a pointer down to its original node.

In this way, we keep extracting layer by layer until some layer contains only one node.

Now, to look up a value, we start comparing at the top layer: while the next node's value is still less than the target, we move forward; once we have stopped at the last node smaller than the target, we sink down one layer and continue, until we reach the bottom layer.

In this way, the time complexity of a search drops to O(log_k n), the logarithm of n with base k.
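
A simplified sketch of just the search path (it assumes the nodes are already linked into levels and that a head sentinel appears on every level; insertion and level promotion are omitted):

```typescript
interface SkipNode {
  value: number;
  forward: (SkipNode | null)[]; // forward[i] is the next node on level i
}

function skipSearch(head: SkipNode, target: number): boolean {
  let node = head;
  // start at the top level and work downwards
  for (let level = head.forward.length - 1; level >= 0; level--) {
    // walk right while the next node is still smaller than the target
    while (node.forward[level] !== null && node.forward[level]!.value < target) {
      node = node.forward[level]!;
    }
    // then sink one level and continue
  }
  const candidate = node.forward[0];
  return candidate !== null && candidate.value === target;
}

// Tiny hand-built example: bottom level 1 -> 3 -> 5 -> 7, top level 1 -> 5.
const n7: SkipNode = { value: 7, forward: [null] };
const n5: SkipNode = { value: 5, forward: [n7, null] };
const n3: SkipNode = { value: 3, forward: [n5] };
const n1: SkipNode = { value: 1, forward: [n3, n5] };
const head: SkipNode = { value: -Infinity, forward: [n1, n1] };

console.log(skipSearch(head, 5)); // true
console.log(skipSearch(head, 4)); // false
```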

The Redis database we mention so often famously uses skip lists (for its sorted sets).

Hash table

For an introduction to hash tables, you can first take a look at my article about hash tables.

After reading it, we can roughly summarize: a hash table can be nothing more than an array. The key property it exploits is that an array supports O(1) random access by index, and a hash function maps each piece of data to the index where it should be stored. But because of the pigeonhole principle (put n + 1 pigeons into n holes and some hole must contain more than one pigeon), hash collisions are unavoidable no matter how well the hash function is designed.

So we have to find a way to resolve hash collisions. One approach is open addressing, which finds another free slot in the table by some probing rule; the other exploits the linked list, the chaining ("zipper") method, where each slot of the hash table is actually the head node of a linked list and colliding data is inserted into that list.

The chaining method combines the strengths of arrays and linked lists nicely: the hash function computes the slot where the data should live, the array's O(1) random access finds the corresponding linked list quickly, and the lookup or insertion then happens inside that list. As long as the hash function is well designed, the average chain stays short and the time complexity can still be regarded as O(1).
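
A minimal sketch of such a chained hash table might look like this (the hash function here is a toy one, chosen only for illustration):

```typescript
// An array of buckets; each bucket is a singly linked list of key/value pairs.
interface Entry<V> {
  key: string;
  value: V;
  next: Entry<V> | null;
}

class ChainedHashMap<V> {
  private buckets: (Entry<V> | null)[];

  constructor(capacity = 16) {
    this.buckets = new Array(capacity).fill(null);
  }

  // A toy hash function; a real one would be chosen far more carefully.
  private hash(key: string): number {
    let h = 0;
    for (let i = 0; i < key.length; i++) {
      h = (h * 31 + key.charCodeAt(i)) >>> 0;
    }
    return h % this.buckets.length;
  }

  set(key: string, value: V): void {
    const index = this.hash(key); // O(1) random access into the array
    let entry = this.buckets[index];
    while (entry !== null) {
      if (entry.key === key) { entry.value = value; return; } // update in place
      entry = entry.next;
    }
    // insert at the head of the chain -- O(1)
    this.buckets[index] = { key, value, next: this.buckets[index] };
  }

  get(key: string): V | undefined {
    let entry = this.buckets[this.hash(key)];
    while (entry !== null) {
      if (entry.key === key) return entry.value;
      entry = entry.next;
    }
    return undefined;
  }
}
```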

Tree

So far it may feel like the hash table already covers our needs very well. Why do we still need trees?

Here we will talk about a few limitations of hash tables:

  • The data in a hash table is unordered. If we want to look up one specific value, the hash table works well, but if we want all the data within a range, it cannot help.
  • Hash functions are hard to design, so hash table performance can be unstable, and growing or shrinking the table is troublesome.
  • To keep hash collisions rare, the load factor cannot be too large, so some space is inevitably wasted.

Binary search tree

For the first limitation above, we can use a binary search tree (BST).

The binary search tree was born for fast lookup. But it supports not only fast lookup of a value; it also supports fast insertion and deletion. How does it do this? It all comes from the BST's special structure: for any node in the tree, every node in its left subtree has a value less than that node's value, and every node in its right subtree has a value greater than that node's value.

Besides insert, delete, and find, a binary search tree also supports quickly finding the maximum and minimum nodes and a node's predecessor and successor. I won't demonstrate each of these operations here; I will put the corresponding code on GitHub, so you can try implementing them yourself first and then compare. Beyond those operations, the BST has another important property: an in-order traversal outputs the data in sorted order in O(n) time, which is very efficient. For this reason, the binary search tree is also called a binary sorted tree.
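
For a rough idea (this is only a minimal, non-balancing sketch, not the full code on GitHub), insert, search, and in-order traversal might look like this:

```typescript
// A minimal binary search tree: duplicates go to the right here, an arbitrary choice.
class BSTNode {
  left: BSTNode | null = null;
  right: BSTNode | null = null;
  constructor(public value: number) {}
}

function insert(root: BSTNode | null, value: number): BSTNode {
  if (root === null) return new BSTNode(value);
  if (value < root.value) root.left = insert(root.left, value);
  else root.right = insert(root.right, value);
  return root;
}

function search(root: BSTNode | null, value: number): boolean {
  if (root === null) return false;
  if (value === root.value) return true;
  return value < root.value ? search(root.left, value) : search(root.right, value);
}

// In-order traversal visits left subtree, node, right subtree,
// so it outputs the values in sorted order in O(n).
function inOrder(root: BSTNode | null, out: number[] = []): number[] {
  if (root === null) return out;
  inOrder(root.left, out);
  out.push(root.value);
  inOrder(root.right, out);
  return out;
}

let root: BSTNode | null = null;
for (const v of [5, 2, 8, 1, 3]) root = insert(root, v);
console.log(inOrder(root)); // [1, 2, 3, 5, 8]
```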

Balanced binary search tree

The binary search tree described above looks fine, but there is still a problem. Suppose we have the nine values 1 through 9; after a series of insertions and deletions, it is quite possible that every node from the root down ends up with no left child at all. At that point the binary search tree has degenerated into a linked list, and its efficiency drops drastically to O(n).

So we need to keep the height difference between the left and right subtrees from getting too large, which leads to the balanced binary tree: for every node in the tree, the height difference between its left and right subtrees is at most 1.

If our BST is also a balanced binary tree, it solves the problem above very well.

Of course, what we usually use in practice is the **Red-Black Tree**, an approximately balanced binary search tree.

Heap (complete binary tree)

A heap is a special kind of tree. Let's look at exactly what kind of tree qualifies as a heap.

I have listed two requirements; as long as both are met, it is a heap.

  • The heap is a complete binary tree;

  • The value of each node in the heap must be greater than or equal to (or less than or equal to) the value of every node in its subtree.

The first point is that a heap must be a complete binary tree. Remember the definition of a complete binary tree we covered earlier? A complete binary tree requires that every layer except the last is full, and that the nodes in the last layer are packed to the left.

Second, each node's value must be greater than or equal to (or less than or equal to) the value of every node in its subtree. We can also phrase it another way: each node's value is greater than or equal to (or less than or equal to) the values of its left and right children. The two statements are equivalent.

A heap in which every node's value is greater than or equal to every node in its subtree is called a max-heap ("big top heap"); a heap in which every node's value is less than or equal to every node in its subtree is called a min-heap ("small top heap").

As for heap applications, recommending the Top-K keywords is a very common one.

For this requirement, we can maintain a min-heap of K nodes and then iterate over the keyword occurrence counts in turn: if a count is no greater than the heap's root, we just continue; if it is greater, we replace the root with it and run the heap (heapify) operation to restore the heap.
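
A minimal sketch of that Top-K idea (a size-k min-heap with hand-written sift-up and sift-down; the numbers in the usage example are made up):

```typescript
class MinHeap {
  private data: number[] = [];

  size(): number { return this.data.length; }
  peek(): number { return this.data[0]; }
  items(): number[] { return [...this.data]; }

  push(value: number): void {
    this.data.push(value);
    let i = this.data.length - 1;
    while (i > 0) {                       // sift up
      const parent = (i - 1) >> 1;
      if (this.data[parent] <= this.data[i]) break;
      [this.data[parent], this.data[i]] = [this.data[i], this.data[parent]];
      i = parent;
    }
  }

  // Replace the root and sift down to restore the heap property.
  replaceRoot(value: number): void {
    this.data[0] = value;
    let i = 0;
    while (true) {
      const left = 2 * i + 1;
      const right = 2 * i + 2;
      let smallest = i;
      if (left < this.data.length && this.data[left] < this.data[smallest]) smallest = left;
      if (right < this.data.length && this.data[right] < this.data[smallest]) smallest = right;
      if (smallest === i) break;
      [this.data[smallest], this.data[i]] = [this.data[i], this.data[smallest]];
      i = smallest;
    }
  }
}

// Keep a min-heap of size k: the root is the smallest of the current top k,
// so any count larger than the root replaces it.
function topK(counts: number[], k: number): number[] {
  const heap = new MinHeap();
  for (const count of counts) {
    if (heap.size() < k) heap.push(count);
    else if (count > heap.peek()) heap.replaceRoot(count); // beats the current minimum
  }
  return heap.items(); // the k largest counts, in heap (not sorted) order
}

console.log(topK([3, 10, 1, 7, 5, 8], 3)); // [7, 10, 8] -- the three largest
```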

Graph

A new data structure appears precisely because there are problems that the structures above cannot handle well.

As the other non-linear data structure besides the tree, a graph can express arbitrary, criss-crossing relationships; the relationship between nodes is no longer necessarily a parent-child one.

For example, in a social network everyone is a node, and two people are connected if there is a relationship between them; this kind of relationship cannot be expressed by a tree.

Of course, a graph can also be stored with arrays or linked lists: with arrays we use an adjacency matrix, and with linked lists we use an adjacency list.
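
A minimal sketch of both storage options for a small undirected graph (the vertex count and edges here are made up for illustration):

```typescript
const n = 4; // number of vertices, numbered 0..3

// Adjacency matrix: an n x n array, matrix[u][v] === 1 means an edge u--v.
const matrix: number[][] = Array.from({ length: n }, () => new Array(n).fill(0));

// Adjacency list: for each vertex, a list of its neighbours.
const adjacency: number[][] = Array.from({ length: n }, () => []);

function addEdge(u: number, v: number): void {
  matrix[u][v] = 1;
  matrix[v][u] = 1;        // undirected: the matrix is symmetric
  adjacency[u].push(v);
  adjacency[v].push(u);
}

addEdge(0, 1);
addEdge(1, 2);
addEdge(0, 3);

console.log(matrix[0][1]);   // 1 -- O(1) "are u and v connected?"
console.log(adjacency[1]);   // [0, 2] -- neighbours of vertex 1
```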