Misra-Gries Algorithm
Estimating the most frequent items in a dataset (with an example)
The Misra-Gries algorithm is used in big data settings to find the most frequent items in a data stream. With smaller datasets, it is straightforward to find the most frequent items: simply count the occurrences of each item and identify those that appear most often. However, this approach becomes impractical for big data due to the sheer volume of information and the real-time nature of data streams.
This is where the algorithm comes in handy. The Misra-Gries algorithm finds the most frequent items without storing the entire data set in memory, which saves a great deal of memory and computing power while still attaining the goal. The algorithm does have its drawbacks, though, as you'll see in the example.
In industry, however, there are existing tools and libraries for this, and you will most likely never be required to write it by hand.
The example provided here is meant to explain the underlying logic of the algorithm, purely for educational purposes.
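To make the idea concrete before the walkthrough, here is a minimal Python sketch of the algorithm. The function name and structure are my own; treat it as an illustration of the logic, not a production implementation:

```python
def misra_gries(stream, k):
    """Approximate the most frequent items in `stream` using at most k counters."""
    counters = {}  # item -> approximate count
    for item in stream:
        if item in counters:
            # Already tracked: just increment its counter.
            counters[item] += 1
        elif len(counters) < k:
            # A slot is free: start tracking this item.
            counters[item] = 1
        else:
            # All slots are full: decrement every counter by 1,
            # evict any counter that reaches 0, and skip this item.
            for tracked in list(counters):
                counters[tracked] -= 1
                if counters[tracked] == 0:
                    del counters[tracked]
    return counters
```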
Let's get into the example:
Example:
Suppose you want to extract the most frequent hashtags from Twitter, and we have only 3 counters available. This means we can store only 3 different/unique hashtags at a time. (Remember, the algorithm does not store the entire data set in memory.)
Our goal is to find the 3 most frequent hashtags in the stream.
Here is the stream of the first 10 hashtags (m = 10):
#music, #health, #health, #movie, #health, #movie, #health, #sport, #movie, #food
Again, we have only 3 counters available. In practice, the number of counters is much smaller than the total number of distinct items in the stream.
The first hashtag comes in (#music). We put it in the first slot and increment its count to 1.
The second hashtag comes in (#health). We add it to the second slot (counter 2) and increment its count to 1.
The third hashtag streams in (#health). Because it matches one of the counters, we increment the second counter to 2.
Here is what it looks like so far:
Counter 1: #music (1)
Counter 2: #health (2)
Counter 3: empty
The 4th hashtag streams in (#movie). It is new, and there is still space, so we add it to the third slot and increment its count to 1.
Here is our updated table:
Counter 1: #music (1)
Counter 2: #health (2)
Counter 3: #movie (1)
The 5th hashtag comes in (#health). We increment the second counter by 1 (#health is now at 3).
The 6th hashtag comes in (#movie). We increment the third counter by 1 (#movie is now at 2).
The 7th hashtag streams in (#health). We increment the second counter by 1 (#health is now at 4).
The 8th hashtag streams in (#sport). Now all the slots are full; there is nowhere to store the #sport hashtag.
In this case, we first look for any counter with a count of 0. (Note that there is none.)
Since there is none, we decrement every counter by 1 and then move on to the next item in the stream. In effect, we skip the #sport hashtag after reducing every counter by one.
Here is the updated table:
Counter 1: #music (0)
Counter 2: #health (3)
Counter 3: #movie (1)
Or you can look at it like this:
Counter 1: empty
Counter 2: #health (3)
Counter 3: #movie (1)
We have effectively removed the #music hashtag; its counter hit 0, so its slot is free for the next new item.
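In code, this is the final branch of the earlier sketch. Here is that one step in isolation, starting from the state just before #sport arrives:

```python
counters = {"#music": 1, "#health": 4, "#movie": 2}  # state before #sport

# #sport is not tracked and all 3 slots are full,
# so every counter is decremented and zeros are evicted.
for tag in list(counters):
    counters[tag] -= 1
    if counters[tag] == 0:
        del counters[tag]

print(counters)  # {'#health': 3, '#movie': 1}
```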
Then we have the 9th hashtag (#movie). As before, we increment counter 3 by 1 (#movie is now at 2).
And finally, the last hashtag (#food) comes in. Since counter 1 is now empty, we store #food there and set its count to 1.
That leaves us with the 3 most frequent hashtags and their (approximate) counts: #health (3), #movie (2), and #food (1).
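If you run the misra_gries sketch from earlier on this exact stream, it produces the same result:

```python
stream = ["#music", "#health", "#health", "#movie", "#health",
          "#movie", "#health", "#sport", "#movie", "#food"]

print(misra_gries(stream, k=3))
# {'#health': 3, '#movie': 2, '#food': 1}
```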
Now, you might remember that I mentioned this algorithm has some drawbacks, and you may have caught one by now: whenever a new hashtag arrives and there is no space to store it, we decrement every counter by 1. As a result, the final counts do not reflect the true frequencies of the items in the stream; they are always underestimates (in our example, #health really appeared 4 times and #movie 3 times). The error is bounded, though: with k counters and a stream of m items, each count is too low by at most m/(k+1), because every decrement step effectively discards k+1 occurrences (the k tracked items plus the new one).
That is the biggest drawback.
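To see the undercounting concretely, we can compare the sketch's estimates against exact counts (this assumes the misra_gries function defined earlier):

```python
from collections import Counter

stream = ["#music", "#health", "#health", "#movie", "#health",
          "#movie", "#health", "#sport", "#movie", "#food"]

true_counts = Counter(stream)          # exact counts, for comparison only
estimates = misra_gries(stream, k=3)   # approximate counts, 3 counters

for tag, est in estimates.items():
    print(f"{tag}: estimated {est}, true {true_counts[tag]}")
# #health: estimated 3, true 4
# #movie: estimated 2, true 3
# #food: estimated 1, true 1
```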
Thank you for reading!
Abby Nyakara :)