In a vector of length 2n, an element appears at least n+1 times. Find the most frequent element of the vector.
Introduction:
One obvious algorithm scans the vector in one pass keeping a running tally of the occurrences for each distinct element encountered.
If the number of distinct elements is fixed (i.e. 2), this algorithm will have a time complexity of O(n), and memory complexity of O(1).
However, looking at the worst case, as the number of distinct elements can be as high as n, this solution for finding the most frequent element requires an array of length n to store the frequencies and will have a higher than linear time complexity.
Let’s try and find a solution that has a linear time complexity and constant memory usage in this worst-case scenario.
Algorithm:
If our vector has only two possible entries, A & B, the following approach would work:
- pass the vector and keep in memory the difference between the frequencies of A and B up until that point, and which element has the higher frequency.
To do this, we just add one to the counter when we encounter the most frequent element up to that point, and subtract one when we encounter the other one. When the counter is 0, we ‘forget’ the most frequent value, and set it in the next step.
In the end, we will have our desired most frequent element and the frequency delta.
It seems that this solution might work even when we add more distinct values to the array, replacing some of the B’s in it.
Let’s start from the beginning of the array and move along it. We will keep in memory a pair consisting of a favorite and a counter.
Initially, the favorite is unknown and the counter is 0.
When we move incrementally in the array, passing an element:
- if the counter is 0, we set the favorite to the observed element and we set the counter to 1.
- if the counter is not 0, we increment or decrement the counter by 1 according to whether the observed element is the favorite or not.
When we are done, the favorite is the mode, if such an element exists.
Let’s see how this works on a given vector:
def find_mode(v, debug=False):
cnt = 0
fav = None
for i in range(0, len(v)):
if cnt == 0:
cnt = 1; fav = v[i]
else:
if v[i] == fav:
cnt += 1
else:
cnt -= 1
if cnt ==0:
fav = None
if debug:
print(f"Step {i+1}: Fav: {fav}, Counter: {cnt}")
return fav
Proof that the algorithm works:
First Method
We can look at our array of length 2n and the state of the running algorithm after it has processed the ith element.
The elements up to this point can always be divided into two disjoint groups.
One group with all the elements equal to the current favorite and another group with elements that can be paired in such a way that no pair has two equal elements, meaning that they cancel each other out. In this group, we can still have some elements equal to the current favorite.
We can use this property to prove that the algorithm produces the mode of the vector.
Second Method
Let X and k be the Favorite and the counter at the end of the vector. We can observe that k must be even since, at any point, the counter has the same parity as the number of elements processed in the array, and our array has length 2n.
We can split the elements of the vector as seen before:
- the favorite group of k times X
- the 2n-k paired elements and distinct from one another.
Suppose that X is not the mode. Then there must be Y, such that the frequency of Y is more than n.
But Y can only appear in the second group and at most once in every pair, so he has at most apparitions, which is less than n. This comes in contradiction with our initial supposition, so the favorite is the mode.
Note:
This algorithm was invented by Robert S Boyer and J Strother Moore in 1980. You can read the paper published by them that proves in more detail that it is possible to split the elements into the two groups we mentioned before. It also contains more examples and a FORTRAN implementation and verification system. You can also check this step-by-step example of the application of the algorithm written by the authors.