Answer from [this link][2]
From [Wikipedia][1]:
> Bloom filters have a strong space advantage over other data structures for representing sets, such as self-balancing binary search trees, tries, hash tables, or simple arrays or linked lists of the entries. Most of these require storing at least the data items themselves, which can require anywhere from a small number of bits, for small integers, to an arbitrary number of bits, such as for strings (tries are an exception, since they can share storage between elements with equal prefixes). Linked structures incur an additional linear space overhead for pointers. A Bloom filter with 1% error and an optimal value of *k*, on the other hand, requires only about 9.6 bits per element, regardless of the size of the elements. This advantage comes partly from its compactness, inherited from arrays, and partly from its probabilistic nature. If a 1% false positive rate seems too high, each time we add about 4.8 bits per element we decrease it by ten times.
Pretty clear to me.
A Bloom filter doesn't store the elements themselves; that is the crucial point. You don't use a Bloom filter to test whether an element is present; you use it to test whether an element is certainly **not** present, since it guarantees no false negatives. This lets you skip extra work for elements that don't exist in the set (such as disk I/O to look them up).
All of this comes in significantly less space than something like a hash table (which, for a large data set, is likely to live partially on disk). You may use a Bloom filter *in conjunction* with a structure like a hash table, once you're certain the element has a chance of being present.
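To make that concrete, here is a minimal sketch in Python. The class name, the salted-SHA-256 hashing scheme, and the bit-array layout are illustrative choices, not a canonical implementation:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter sketch: m bits, k hash functions."""

    def __init__(self, m, k):
        self.m = m
        self.k = k
        self.bits = bytearray(m // 8 + 1)   # m bits, all initially 0

    def _positions(self, item):
        # Derive k bit positions by salting one strong hash with an index.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h, "big") % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        # False -> definitely not in the set (no false negatives).
        # True  -> probably in the set (false positives possible).
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))
```

Note that `add` only ever sets bits and `might_contain` only reads them; no element is stored anywhere, which is exactly why the size of the elements doesn't matter.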
So an example usage pattern might be:
You've got a lot of data, on disk. You decide on the error bound you want (e.g. 1%); together with the expected number of elements *n*, that prescribes the number of bits *m*. Then the optimal *k* is determined (from the formula given in the article). You populate your filter from this disk-bound data once.
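The sizing works out like this (the element count here is an assumed example; the formulas are the standard ones from the Wikipedia article):

```python
import math

n = 1_000_000   # expected number of elements (illustrative)
p = 0.01        # target false-positive rate: the 1% above

m = math.ceil(-n * math.log(p) / math.log(2) ** 2)  # optimal bit count
k = round((m / n) * math.log(2))                    # optimal hash count

print(m / n)    # ~9.59 bits per element, the "about 9.6" from the quote
print(k)        # 7 hash functions
# Each extra factor of 10 in accuracy costs ln(10) / ln(2)^2 ~ 4.79 more
# bits per element (the "about 4.8 bits" figure from the quote).
```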
Now you have the filter in RAM. When you need to process some element, you query your filter to see if it stands a chance of existing in your data set. If it doesn't, no extra work is done: no disk reads, etc. (which you would have to pay for with a hash table or a tree).
Otherwise, if the filter says "Yes, it's in there", there's a 1% chance that it's wrong, so you do the necessary work to find out. 99% of the time, it really *will* be there, so the work was not for naught.
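Sketched end to end, reusing the `BloomFilter` class and the *m* and *k* from above (`load_keys_from_disk` and `fetch_record_from_disk` are hypothetical stand-ins for your actual storage layer):

```python
bf = BloomFilter(m, k)
for key in load_keys_from_disk():    # one-time pass over the on-disk data
    bf.add(key)

def lookup(key):
    if not bf.might_contain(key):
        return None                  # definitely absent: zero disk reads
    # "Maybe present": at a 1% false-positive rate, ~99% of the lookups
    # that reach this point find a real record, so the read is rarely wasted.
    return fetch_record_from_disk(key)
```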
[1]: http://en.wikipedia.org/wiki/Bloom_filter
[2]: https://stackoverflow.com/a/4282445/10997525