Stream processing:
Data Stream Mining:
A data stream is a continuous, fast-changing, ordered sequence of data transmitted at very high speed. It is an ordered sequence of items over a given interval. Data sent from the sender's side appears almost immediately in the stream at the receiver's side. Streaming does not mean downloading the data or storing it on storage devices.
Sources of Data Stream:
There are many sources of data streams; a few widely used ones are listed below:
· Internet traffic
· Sensor data
· Real-time ATM transaction
· Live event data
· Call records
· Satellite data
· Audio listening
· Watching videos
· Real-time surveillance systems
· Online transactions
What are Data Streams in Data Mining?
Data stream mining is the process of extracting knowledge and valuable insights from a continuous stream of data using stream processing software. It can be considered a subset of the more general fields of machine learning, knowledge discovery, and data mining. In data stream mining, large amounts of data must be analyzed in real time. The knowledge extracted in data stream mining is represented in the form of models and patterns over infinite streams of information.
Characteristics of Data Stream in Data Mining
Data Stream in Data Mining should have the following characteristics:
· Continuous Stream of Data: The data stream is an infinite continuous stream resulting in big data. In data streaming, multiple data streams are passed simultaneously.
· Time Sensitive: Data streams are time-sensitive, and their elements carry timestamps. A data stream is relevant only for a certain period; after that, it loses its significance.
· Data Volatility: No data is stored in data streaming because it is volatile. Once the data mining and analysis are done, the information is summarized or discarded.
· Concept Drifting: Data streams are very unpredictable. The data changes or evolves with time; in this dynamic world, nothing is constant.
Data streams are generated by various data stream generators. Data mining techniques are then applied to extract knowledge and patterns from the streams. These techniques therefore need to work online, in a single pass, over multi-dimensional and multi-level data streams.
Data stream concepts:
Introduction to stream concepts:
A data stream is a real-time, continuous, ordered (implicitly by arrival time or explicitly by timestamp) sequence of items. It is infeasible to control the order in which items arrive, nor is it feasible to store the stream locally in its entirety.
Data streams involve enormous volumes of data, with items arriving at a high rate.
Types of Data Streams :
· Data stream –
A data stream is a (possibly unbounded) sequence of tuples. Each tuple comprises a set of attributes, similar to a row in a database table.
· Transactional data stream –
It is a log of interactions between entities, for example:
1. Credit card – purchases by consumers from merchants
2. Telecommunications – phone calls by callers to the dialed parties
3. Web – accesses by clients of information at servers
· Measurement data streams –
1. Sensor Networks – physical phenomena, e.g., road traffic
2. IP Network – traffic at router interfaces
3. Earth climate – temperature, humidity level at weather stations
Examples of Stream Sources-
1. Sensor Data –
In navigation systems, sensor data is used. Imagine a temperature sensor bobbing about in the ocean, sending back to the base station a reading of the surface temperature each hour. The data generated by this sensor is a stream of real numbers. One such sensor produces very little data, but with many of them (say, a million sensors each reporting every tenth of a second) roughly 3.5 terabytes arrive every day, and we need to think about what can be kept in working storage and what can only be archived.
2. Image Data –
Satellites frequently send down to earth streams containing many terabytes of images per day. Surveillance cameras generate images with lower resolution than satellites, but there can be numerous of them, each producing a stream of images at intervals of one second.
3. Internet and Web Traffic –
A switching node in the middle of the internet receives streams of IP packets from many inputs and routes them to its outputs. Websites receive streams of heterogeneous types; for example, Google receives several hundred million search queries per day.
Characteristics of Data Streams :
1. Large volumes of continuous data, possibly infinite.
2. Constantly changing, requiring fast, real-time responses.
3. The data stream model captures well many of today's data processing needs.
4. Random access is expensive; single-scan (one-pass) algorithms are required.
5. Only a summary of the data seen so far is stored.
6. Most stream data are at a low level of abstraction or multidimensional in nature, and need multi-level and multidimensional processing.
Applications of Data Streams :
1. Fraud detection
2. Real-time trading
3. Customer analytics
4. Monitoring and reporting on internal IT systems
Advantages of Data Streams :
· Helps increase sales
· Helps in detecting errors and fraud
· Helps in minimizing costs
· Provides the detail needed to react swiftly to risk
Disadvantages of Data Streams :
· Lack of data security in the cloud
· Dependence on the cloud provider (vendor lock-in)
· Off-premises storage of data introduces the possibility of disconnection
STREAM DATA MODEL AND ARCHITECTURE:
Data arrives in a stream or streams, and if it is not processed immediately or stored, then it is lost forever. The data arrives so rapidly that it is not feasible to store it all in active storage (i.e., in a conventional database), and then interact with it at the time of our choosing.
Any number of streams can enter the system. Each stream can provide elements at its own schedule; they need not have the same data rates or data types, and the time between elements of one stream need not be uniform. Streams may be archived in a large archival store, but it is not possible to answer queries from the archival store. There is also a working store, into which summaries or parts of streams may be placed, and which can be used for answering queries. The working store might be disk, or it might be main memory, depending on how fast we need to process queries. It is of sufficiently limited capacity that it cannot store all the data from all the streams.
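To make the model concrete, here is a minimal, illustrative Python sketch (not from the text) of this architecture: elements from any number of streams arrive, raw data goes to an archival store that is not used for answering queries, and a bounded working store keeps recent elements or summaries for queries. All class and method names are hypothetical.

from collections import deque

class StreamProcessor:
    def __init__(self, window_size=1000):
        self.archive = []                                 # stands in for a large archival store
        self.working_store = deque(maxlen=window_size)    # bounded working store

    def on_element(self, stream_id, element):
        self.archive.append((stream_id, element))         # archived, but not queried
        self.working_store.append((stream_id, element))   # window/summary used for queries

    def query_recent_count(self, stream_id):
        # answered from the working store only
        return sum(1 for sid, _ in self.working_store if sid == stream_id)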
Examples of Stream Sources:
Ø Sensor Data: Imagine a temperature sensor bobbing about in the ocean, sending back to a base station a reading of the surface temperature each hour. Data produced by this sensor is a stream of real numbers. The data rate of a single such sensor is very low.
Ø Image Data: Satellites often send down to earth streams consisting of many terabytes of images per day. Surveillance cameras produce images with lower resolution than satellites.
Ø Internet and Web Traffic: Switching nodes in the Internet receive streams of IP packets from many inputs and route them to their outputs. Google receives several hundred million search queries per day.
Stream Queries:
Standing queries - permanently executing queries that produce outputs at appropriate times.
Ad-hoc queries - questions asked once about the current state of a stream or streams.
If we want the facility to ask a wide variety of ad-hoc queries, a common approach is to store a sliding window of each stream in the working store.
A sliding window can be the most recent n elements of a stream, for some n, or it can be all the elements that arrived within the last t time units, e.g., one day.
If we regard each stream element as a tuple, we can treat the window as a relation and query it with any SQL query.
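As an illustration, the following small Python sketch keeps a sliding window of the most recent n elements in the working store and answers an ad-hoc query over it (here an average, the analogue of SELECT AVG(value) over the window in SQL). The function names and the choice n = 5 are only for illustration.

from collections import deque

N = 5                      # keep the most recent N elements of the stream
window = deque(maxlen=N)   # sliding window held in the working store

def on_arrival(element):
    window.append(element)         # the oldest element drops out automatically

def adhoc_average():
    # an ad-hoc query over the current window contents
    return sum(window) / len(window) if window else None

for x in [3, 1, 4, 1, 5, 9, 2, 6]:
    on_arrival(x)
print(adhoc_average())     # average of the last 5 elements: (1+5+9+2+6)/5 = 4.6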
Issues in Stream Processing:
We must process elements in real time, or we lose the opportunity to process them at all without accessing the archival storage.
The stream-processing algorithm is executed in main memory, without access to secondary storage or with only rare accesses to secondary storage.
Many problems about streaming data would be easy to solve if we had enough memory. Generalizations about stream algorithms:
It is often more efficient to get an approximate answer to our problem than an exact solution.
Hashing - introduces useful randomness into the algorithm's behavior, in order to produce an approximate answer that is very close to the true result.
Stream Computing:
A high-performance computer system that analyzes multiple data streams from many sources. Stream computing means pulling in streams of data, processing them, and streaming the results back out as a single flow.
It uses software algorithms that analyze the data in real time as it streams in, improving speed and accuracy in data handling and analysis.
Stream computing delivers real-time analytic processing on constantly changing data in motion.
It allows capturing and analyzing all data, all the time, just in time. Stream computing analyzes data before you store it.
Analyze data that is in motion (Velocity); process any type of data (Variety).
Streams are designed to scale to process any amount of data, from terabytes to zettabytes per day.
Suppose we have a large set S of keys
● We want to filter a stream <key, data> to let pass only the elements for which key ∊ S
● Example: the key is an e-mail address, and we have a total of |S| = 10^9 allowed e-mail addresses
● Naïve solution? A hash table won't work; it is too big!
Bloom Filter (1-bit case):
Given a set of keys S
Create a bit array B[ ] of n bits
– Initialize to all 0s
Pick a hash function h with range [0,n)
– For each member s ∈ S
● Hash to one of n buckets
● Set that bit to 1, i.e., B[h(s)] ← 1
For each element a of the stream
Output a if and only if B[h(a)] == 1
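A minimal Python sketch of this 1-bit (single hash function) filter might look as follows; the bit-array size and the use of SHA-1 as the hash function are illustrative choices, not part of the algorithm itself.

import hashlib

n = 1 << 20                      # number of bits in the array (illustrative size)
B = bytearray(n // 8)            # bit array, initialized to all 0s

def h(key):
    # one hash function with range [0, n)
    return int.from_bytes(hashlib.sha1(key.encode()).digest(), "big") % n

def add(key):
    i = h(key)
    B[i // 8] |= 1 << (i % 8)    # set bit B[h(key)] to 1

def maybe_contains(key):
    i = h(key)
    return bool(B[i // 8] & (1 << (i % 8)))

for s in ["alice@example.com", "bob@example.com"]:   # the allowed set S
    add(s)

# filter the stream: output an element only if its bit is set
print(maybe_contains("alice@example.com"))   # True
print(maybe_contains("eve@example.com"))     # False (or rarely True on a collision)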
Bloom Filter is an approximate filter.
Can it output an element with a key not in S? Yes, due to hash collisions: h(x) = h(y) can happen even when x ≠ y.
Can it fail to output an element with a key in S? No, because h(x) is always the same for a given x, so every key in S has its bit set.
Bloom filters are therefore permissive (not strict).
A bloom filter is:
– An array of n bits, initialized as 0
– A collection of hash functions h1, h2, …, hk
– A set S of m key values
The purpose of the Bloom filter is to allow all stream items whose key is in S, while rejecting most items whose key is not in S.
Bloom filter initialization
For all positions i in [0, n-1]
B[i] ← 0
For all keys K in S:
For every hash function hi in h1, h2, …, hk: B[hi(K)] ← 1
Bloom filter usage
For each input element <key, data>
allow ← TRUE
For every hash function hi in h1, h2, …, hk: allow ← allow ∧ (B[hi(key)] == 1)
Output the element if allow == TRUE
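The initialization and usage steps above could be implemented roughly as follows in Python. Simulating the k hash functions h1, …, hk by salting a single cryptographic hash is an implementation choice made here for brevity; any k independent hash functions with range [0, n) would do.

import hashlib

class BloomFilter:
    def __init__(self, n_bits, k):
        self.n = n_bits
        self.k = k
        self.bits = bytearray(n_bits // 8 + 1)    # B[i] <- 0 for all positions

    def _positions(self, key):
        # k hash functions h1..hk, simulated by salting one hash (illustrative choice)
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest, "big") % self.n

    def add(self, key):
        for p in self._positions(key):            # B[hi(K)] <- 1 for every hi
            self.bits[p // 8] |= 1 << (p % 8)

    def allow(self, key):
        # allow <- AND over all hi of (B[hi(key)] == 1)
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(key))

bf = BloomFilter(n_bits=8_000, k=6)
for key in ["alice@example.com", "bob@example.com"]:   # keys in S
    bf.add(key)
print(bf.allow("alice@example.com"))   # True: keys in S always pass
print(bf.allow("eve@example.com"))     # usually False; True only on a false positive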
Characteristics of Bloom Filters
Bloom filters are lax (not strict) and let some extra items pass (false positives)
– May require a second-level check to make the filter strict, for instance storing the output in disk files and then checking against hash tables (slower)
Implementations can be very fast
E.g., use hardware words for the bit table
Preliminaries for the analysis: targets and darts
Suppose we throw y darts at x targets.
– All darts will hit one of the targets.
(Figure: example with y = 4 darts and x = 4 targets.)
How many distinct targets can we expect to hit at least once?
– The probability that a given dart hits a specific target is 1/x.
– The probability that a given dart does not hit a specific target is 1 - 1/x.
– The probability that none of the y darts hits a specific target is (1 - 1/x)^y = ((1 - 1/x)^x)^(y/x).
– Using (1 - ε)^(1/ε) ≈ 1/e for small ε: if x is large, 1/x is small, and the probability that none of the y darts hits a specific target is approximately (1/e)^(y/x) = e^(-y/x).
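A quick Monte-Carlo check (not part of the original analysis) of the approximation e^(-y/x) for the probability that a specific target is missed by all y darts; the sizes chosen here are arbitrary.

import math, random

x, y, trials = 100, 400, 5_000          # illustrative sizes, so y/x = 4
missed = 0
for _ in range(trials):
    # throw y darts; check whether target 0 is never hit
    if all(random.randrange(x) != 0 for _ in range(y)):
        missed += 1

print(missed / trials)          # empirical probability that target 0 is missed
print(math.exp(-y / x))         # approximation e^(-y/x) from the analysis above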
Analysis of the 1-bit Bloom Filter
● Each element of the set S is a dart: y = |S|.
● Each bit in the array is a target: x = n.
● Suppose y = |S| = 10^9 (1 G) and x = n = 8 × 10^9 (8 G).
● The probability that a given bit is not set to 1 (no dart hits that target) is (1/e)^(y/x) = (1/e)^(1/8).
● The probability that a given bit is set to 1 is 1 - (1/e)^(1/8) ≈ 0.1175.
● The expected number of bits set to 1 is therefore 11.75% of the 8 × 10^9 bits.
● About 12% of the bits are set to one in this Bloom filter; this is also the false-hit probability in this case.
General case
● |S| = m keys, the array has n bits
● k hash functions
● Targets: x = n, darts: y = km
● The probability that a bit remains 0 is (1/e)^(km/n) = e^(-km/n)
● False positive rate with k hash functions: (1 - e^(-km/n))^k
– This is the probability that all of the k checked bits are set to 1
● Example: picking k = n/m makes e^(-km/n) = 1/e ≈ 37%, i.e., about 37% of the bits remain 0
Analysis of a 2-bit Bloom Filter
● Suppose |S| = 10^9 (1 G) and n = 8 × 10^9 (8 G)
● Suppose we use two hash functions, so y = 2 × 10^9 darts and x = 8 × 10^9 targets, giving y/x = 1/4
● The probability that a given bit is NOT set to 1 (no dart hits that target) is (1/e)^(y/x) = (1/e)^(1/4)
● The probability that a bit is set to 1 is 1 - (1/e)^(1/4)
● The probability that both checked bits are set to 1 is (1 - (1/e)^(1/4))^2 ≈ 0.049
● We have a false-hit probability of about 5% with two hash functions, while the probability was about 12% with only one
How many hash functions to use?
Too few: the test is too unspecific. Too many: the bit table becomes too crowded.
m = 1 billion, n = 8 billion
False positive rate with k hash functions: (1 - e^(-km/n))^k
k = 1: (1 - e^(-1/8))^1 = 0.1175
k = 2: (1 - e^(-2/8))^2 = (1 - e^(-1/4))^2 ≈ 0.049
● What happens as we keep increasing k?
– The “optimal” value of k is (n/m) ln(2)
– In our case: optimal k = 8 ln(2) = 5.54 ≈ 6
– Error at k = 6: (1 - e^(-6/8))^6 ≈ 0.0216
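The numbers above can be reproduced with a few lines of Python using the formula (1 - e^(-km/n))^k and the optimal k = (n/m) ln 2:

import math

m = 1_000_000_000        # number of keys |S|
n = 8_000_000_000        # number of bits in the array

def false_positive_rate(k):
    return (1 - math.exp(-k * m / n)) ** k

for k in (1, 2, 6):
    print(k, round(false_positive_rate(k), 4))   # 0.1175, 0.0489, 0.0216

k_opt = (n / m) * math.log(2)
print(k_opt)             # about 5.54, so use k = 6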
Counting Distinct Elements in a stream:
Counting distinct elements is a problem that frequently arises in distributed systems. In general, the size of the set under consideration (which we will henceforth call the universe) is enormous. For example, if we build a system to identify denial-of-service attacks, the set could consist of all IPv4 and IPv6 addresses. Another common use case is to count the number of unique visitors on popular websites like Twitter or Facebook.
An obvious approach, if the number of elements is not very large, would be to maintain a set. When a new element arrives, we check whether the set contains it; if not, we add it to the set. The size of the set gives the number of distinct elements. However, if the number of elements is vast, or we are maintaining counts for multiple streams, it would be infeasible to keep the set in memory. Storing the data on disk is an option only if we are interested in offline computation using batch processing frameworks like MapReduce.
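The naive exact approach described above is just a set in code; it is accurate, but its memory grows with the number of distinct elements:

def count_distinct_exact(stream):
    seen = set()               # grows with the number of distinct elements
    for element in stream:
        seen.add(element)      # added only if not already present
    return len(seen)

print(count_distinct_exact(["a", "b", "a", "c", "b"]))   # 3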
Like the previous algorithms we looked at, the algorithms for counting distinct elements are also approximate, with an error threshold that can be tweaked by changing the algorithm's parameters.
Flajolet-Martin Algorithm
The first algorithm for counting distinct elements is the Flajolet-Martin algorithm, named after its creators. The Flajolet-Martin algorithm is a single-pass algorithm. If there are m distinct elements in a universe comprising n elements, the algorithm runs in O(n) time and O(log m) space. The following steps define the algorithm.
· First, we pick a hash function h that takes stream elements as input and outputs a bit string. The length of the bit strings is large enough that the range of the hash function is much larger than the size of the universe; we require at least log n bits if there are n elements in the universe.
· r(a) is used to denote the number of trailing zeros in the binary representation of h(a) for an element a in the stream.
· R denotes the maximum value of r seen in the stream so far.
· The estimate of the number of distinct elements in the stream is (2^R).
To intuitively understand why the algorithm works consider the following.
The probability that h(a) ends in at least i zeros is exactly 2^(-i). For example, for i = 0, the probability is 1 that the tail has at least 0 zeros. For i = 1, the probability is 1/2 that the last bit is zero; for i = 2, the probability is 1/4 that the last two bits are zeros, and so on. The probability of the rightmost set bit being at a given position drops by a factor of 1/2 with every position from the least significant bit to the most significant bit.
This probability should become 0 when the bit position R >> log m, while it should be non-zero when R <= log m. Hence, if we find the rightmost unset bit position R, i.e., the position at which this probability has effectively dropped to 0, we can say that the number of unique elements is approximately 2^R.
The Flajolet-Martin algorithm uses a multiplicative hash function to transform the non-uniform input space into a uniform distribution. The general form of the hash function is
(ax + b) mod c, where a and b are odd numbers and c is the length of the hash range.
The Flajolet-Martin algorithm is sensitive to the hash function used, and results vary widely based on the data set and the hash function. Hence there are better algorithms that utilize more than one hash function. These algorithms use the average and median values to reduce skew and increase the predictability of the result.
Flajolet-Martin Pseudocode and Explanation
1. L = 64 (size of the bitset), B = bitset of size L
2. hash_func(x) = (ax + b) mod 2^L
3. for each item x in the stream
· y = hash_func(x)
· r = get_rightmost_set_bit(y)
· set_bit(B, r)
4. R = get_rightmost_unset_bit(B)
5. return 2 ^ R
We define a hash range, big enough to hold the maximum number of possible unique values, something as big as 2 ^ 64. Every stream element is passed through a hash function that transforms the elements into a uniform distribution.
For each hash value, we find the position of the rightmost set bit and mark the corresponding position in the bit set to 1. Once all elements are processed, the bit vector has 1s at all positions corresponding to the rightmost set bit of some element in the stream. Now we find R, the position of the rightmost 0 in this bit vector. This position R corresponds to a rightmost set bit that we have not seen while processing the elements, i.e., a position whose probability of occurring is effectively 0, and it lets us approximate the cardinality of unique elements as 2 ^ R.
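A runnable Python sketch of the bitmap variant described in the pseudocode above is given below. The hash constants a and b and the use of Python's built-in hash are illustrative choices; as noted earlier, practical implementations combine several such estimators (e.g., averaging over multiple hash functions) to reduce variance.

import random

L = 64                                     # size of the bitset
a = random.randrange(1, 1 << L, 2)         # odd multiplier (illustrative hash constants)
b = random.randrange(1, 1 << L, 2)

def hash_func(x):
    return (a * hash(x) + b) % (1 << L)    # (ax + b) mod 2^L

def rightmost_set_bit(y):
    # position of the least significant 1 bit (number of trailing zeros)
    return (y & -y).bit_length() - 1 if y else L - 1

def flajolet_martin(stream):
    B = 0                                  # bitset of size L, stored in an int
    for x in stream:
        r = rightmost_set_bit(hash_func(x))
        B |= 1 << r                        # set_bit(B, r)
    R = 0
    while B & (1 << R):                    # find the rightmost unset bit of B
        R += 1
    return 2 ** R                          # estimate of the number of distinct elements

stream = [f"user{i % 500}" for i in range(10_000)]   # 500 distinct elements
print(flajolet_martin(stream))             # rough estimate, typically within a small factor of 500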
