Consider that you are assigned to conduct a wildlife census in a pristine rainforest. You take a picture of an animal every time you see one. The entire number of photos taken by your digital camera will be tracked, but you are only concerned with the quantity of unique creatures, or those that you haven’t yet counted. How can I obtain that number the best? The obvious solution, according to Illinois Institute of Technology computer scientist Lance Fortnow, is to keep track of every animal you have already seen and compare it to the list whenever you encounter a new one. However, he continued, there are more ingenious methods to go about it, since the obvious strategy is far from simple when dealing with thousands of entries.
Things worsen. What would happen if Facebook wanted to keep track of how many unique users signed in every day, even if some of them did so from different devices and at different times? We are currently comparing each new login to a potentially billion-strong list.
Computer scientists recently published a paper outlining a novel approach that only requires memorizing a limited number of entries in order to approximate the number of distinct entries in a lengthy list. Any list where the elements are added one at a time, such as words in a speech, things on a conveyor belt, or cars on the freeway, can be used with this algorithm.
The distinct elements problem, which computer scientists have been attempting to solve for more than 40 years, is finally on the verge of being resolved thanks to the breakthrough work of Sourav Chakraborty of the Indian Statistical Institute, Vinodchandran Variyam of the University of Nebraska, Lincoln, and Kuldeep Meel of the University of Toronto. This algorithm is known as the CVM algorithm. It requests an efficient method of counting the number of elements in a stream, which may exceed the amount of memory that is available, and then estimating the number of unique elements.
“The new algorithm is remarkably straightforward and simple to apply,” University of Massachusetts Amherst’s Andrew McGregor stated. “I wouldn’t be shocked if this ended up being the standard method used to approach the [distinct elements] problem in practice.”
Consider that you are listening to the audio version of Hamlet to understand both the issue and the way the CVM algorithm resolves it. The play contains 30,557 words. What number of them are unique? You might utilize the pause button frequently while listening to the play, list every word alphabetically in a notebook, and skip over terms that you already have on your list to find out. All you have to do is count the amount of words on the list when you get to the finish. This method works, however it uses about the same amount of RAM as there are unique words.
Typically, millions of objects are to be tracked in data-streaming scenarios. Variyam stated, “You might not want to store everything.” And the CVM algorithm can provide a simpler solution in this situation. He claimed that depending on randomness is the key.
Let’s go back to Hamlet, but this time there are only 100 words available in your working memory, which is like a whiteboard. As soon as the play begins, you record the first 100 words you hear, again ignoring repetitions. Press pause once the area is full, then flip a coin for each word. If the term is heads, it remains on the list; if it is tails, it is removed. Following this first round, you should have roughly fifty different words remaining.
You now proceed with what the group refers to as Round 1. Continue reading Hamlet, including new vocabulary as you go. Toss a coin once again if you come across a term that is already on your list. If the result is tails, remove the word; else, it remains on the list. Continue doing this until the whiteboard is filled with 100 words. Then, depending on the results of 100 coin flips, eliminate roughly half again at random. That brings an end to Round 1.
Proceed to Round 2 after that. Proceed as in Round 1, but this time, we’ll make it more difficult to stay on topic. Flip the coin once more when you reach a word that appears twice. Tails, and just like before, you erase it. However, you will flip the coin again if it comes up heads. Don’t spread the word unless you receive a second heads. After the board is full, based on 100 coin flips, there is one more purge of roughly half the words before the round concludes.
To keep a word in the third round, you must get three heads in a row. You will need four heads in a row in the fourth round. And so forth.
Associated:
Researchers Approach New Speed Limit for Seminal Problem; Scientists Find the Ideal Balance of Data Storage and Time; Mathematicians Find the Ideal Multiplying Formula
You will eventually reach Hamlet’s conclusion in the kth round. The goal of the exercise has been to make sure that each word has the same chance of existing based on your random selections: half a thousand. To estimate the number of separate words, you can divide 61 by the likelihood, 1/26, if, for example, you have 61 words on your list at the end of Hamlet and the process took six rounds. In this case, the number of distinct words is 3,904. It is simple to understand how this process operates: Let’s say you have one hundred coins, and you start by flipping each one separately, keeping just the heads. You will eventually have about fifty coins, and one can estimate that there were originally roughly one hundred coins by dividing that amount by the chance, or ½.)
Mathematically, Chakraborty, Variyam, and Meel demonstrated that this technique’s accuracy increases with memory size. There are exactly 3,967 unique words in Hamlet. (They made a count.) After five runs, the average estimate in experiments with a 100-word memory was 3,955 words. By using a 1,000-word memory, the average increased to 3,964. Variyam responded, “Of course we can get 100% accuracy if the [memory] is so big that it fits all the words.”
Harvard University’s William Kuszmaul commented, “This is a great example of how, even for very basic and well-studied problems, there are sometimes very simple but non-obvious solutions still waiting to be discovered.”