Real-time filtering of data streams
Within this research area, centralized systems for filtering data streams will be investigated: systems that, in real-time, find and forward to consumers only the interesting data objects from streams arriving from several different sources. In addition, new distributed algorithms will be developed that can filter objects from different data streams while taking into account their volume, velocity, variety and veracity. The basic idea of distributed stream filtering is to divide the processing logic into several independent components distributed across a computer cluster. For distributed filtering, open-source platforms that support distributed processing of data streams (such as Apache Spark and Apache Flink) will be explored. The ultimate goal of this research area is the development of a distributed data stream filtering system able to filter several different data streams simultaneously and in real-time.
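The partitioning idea described above can be illustrated with a minimal sequential sketch (the function names and the toy "interesting objects" predicate are ours, not part of any particular platform; in a real deployment each worker would run on a separate cluster node):

```python
def partition(stream, n_workers, key=hash):
    """Route each incoming object to one of n_workers partitions by key hash."""
    buckets = [[] for _ in range(n_workers)]
    for obj in stream:
        buckets[key(obj) % n_workers].append(obj)
    return buckets

def filter_worker(bucket, predicate):
    """Independent filtering component: keeps only 'interesting' objects."""
    return [obj for obj in bucket if predicate(obj)]

def distributed_filter(stream, predicate, n_workers=4):
    buckets = partition(stream, n_workers)
    # Each worker filters its partition independently; here run sequentially.
    results = [filter_worker(b, predicate) for b in buckets]
    # Merge the per-worker results for downstream consumers.
    return [obj for part in results for obj in part]

# Toy example: forward only readings above a threshold.
readings = [3, 87, 12, 95, 40, 99]
alerts = distributed_filter(readings, lambda x: x > 80)
```

Platforms such as Spark and Flink apply the same divide-route-filter-merge pattern, but with fault tolerance and true parallelism across nodes.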
Detection of critical events in data streams
Detecting anomalies and outliers in data streams is an important research area for many industrial applications: by monitoring sensor data streams in real-time, malfunctions can be recognized instantly and major damage can be prevented by reacting to them in a timely manner. Within this research area, methods for detecting anomalies and outliers in data streams will be investigated, with special emphasis on the scalability of their application in a distributed computing cluster environment.
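As a simple illustration of real-time outlier detection on a sensor stream, the following sketch flags readings that deviate strongly from a sliding window of recent values (a standard rolling z-score heuristic; the class name, window size and threshold are illustrative assumptions, not a method prescribed by this project):

```python
from collections import deque
from math import sqrt

class StreamingZScoreDetector:
    """Flags a reading as anomalous if it lies more than `threshold`
    standard deviations from the mean of a sliding window of recent values."""

    def __init__(self, window=50, threshold=3.0):
        self.window = deque(maxlen=window)
        self.threshold = threshold

    def update(self, x):
        flagged = False
        if len(self.window) >= 2:
            n = len(self.window)
            mean = sum(self.window) / n
            std = sqrt(sum((v - mean) ** 2 for v in self.window) / n)
            if std > 0 and abs(x - mean) / std > self.threshold:
                flagged = True
        self.window.append(x)  # the new reading joins the window either way
        return flagged
```

In a distributed setting, one such detector instance would typically run per sensor partition, so the state stays local and the method scales with the number of cluster nodes.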
Knowledge discovery in data streams
An important element of processing large amounts of data is the need for exploratory and predictive analytics of data streams using statistical methods, visualization and machine learning techniques. Forecasting over real-time data streams is a major scientific challenge due to the complexity of the algorithms, which must execute efficiently on limited computing resources. Large-scale stream processing and machine learning have been further fostered in recent years by the development of a large number of open-source solutions, most notably the projects of the Apache Software Foundation (Hadoop, Spark, Kafka, Flink, Cassandra, etc.) and a growing number of free and open-source libraries for statistical data processing and machine learning (Python, R). Within this research area, solutions for exploratory and predictive analysis of data streams will be developed, and predictive models will be implemented for forecasting time series within data streams. Special attention will be given to the development of scalable distributed solutions for analysis and forecasting on a computer cluster.
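A minimal example of time-series forecasting over a stream under tight resource constraints is simple exponential smoothing, which keeps only a single state value per series and updates it incrementally as elements arrive (the class below is our illustrative sketch, not one of the project's planned models):

```python
class ExponentialSmoothing:
    """One-step-ahead forecaster updated incrementally per stream element.

    Keeps O(1) state per time series, which is what makes it attractive
    when many streams must be forecast on limited computing resources.
    """

    def __init__(self, alpha=0.3):
        self.alpha = alpha   # smoothing factor in (0, 1]
        self.level = None    # current smoothed level

    def update(self, x):
        if self.level is None:
            self.level = x   # initialize from the first observation
        else:
            self.level = self.alpha * x + (1 - self.alpha) * self.level
        return self.level    # forecast for the next observation
```

More expressive models (e.g. with trend and seasonality terms) follow the same incremental-update pattern, which is what allows them to be sharded across a cluster, one forecaster instance per series partition.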
Updating machine learning models by analyzing data streams in real-time
In practice, model learning is done periodically on historical data. An example is a recommender system that uses collaborative filtering methods to discover objects interesting to a user. Such models are updated once, or at most a few times, a day, but they could potentially be updated in real-time if the data streams carrying user actions were processed for this purpose. Achieving real-time model updates is a scientific challenge due to the time and space complexity of collaborative filtering methods; at the same time, satisfactory performance of such methods can only be achieved through distributed processing of the incoming data streams. This research area will investigate methods for updating machine learning models in real-time based on distributed processing of input data streams. When developing these methods, special attention will be given to the scalability of the solutions in a computer cluster.
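To make the idea of real-time model updates concrete, the following sketch shows an incremental matrix-factorization recommender that applies one stochastic-gradient step per incoming (user, item, rating) event, instead of retraining in daily batches (this is a generic online-CF technique we use for illustration; hyperparameters and names are assumptions):

```python
import random

class IncrementalMF:
    """Matrix-factorization recommender whose latent factors are updated
    online, with one SGD step per incoming (user, item, rating) event."""

    def __init__(self, k=8, lr=0.05, reg=0.02, seed=0):
        self.k, self.lr, self.reg = k, lr, reg
        self.rng = random.Random(seed)
        self.users, self.items = {}, {}  # id -> latent factor vector

    def _vec(self):
        return [self.rng.uniform(-0.1, 0.1) for _ in range(self.k)]

    def predict(self, user, item):
        p = self.users.setdefault(user, self._vec())
        q = self.items.setdefault(item, self._vec())
        return sum(pi * qi for pi, qi in zip(p, q))

    def update(self, user, item, rating):
        """Process one stream event: nudge both factor vectors toward the
        observed rating with an L2-regularized gradient step."""
        err = rating - self.predict(user, item)
        p, q = self.users[user], self.items[item]
        for f in range(self.k):
            pf, qf = p[f], q[f]
            p[f] += self.lr * (err * qf - self.reg * pf)
            q[f] += self.lr * (err * pf - self.reg * qf)
        return err
```

In a distributed setting, the main difficulty is that concurrent updates touch shared factor vectors, which is exactly where partitioning the event stream by user or item across cluster nodes becomes important.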
Processing of semantic data streams
Many data stream sources produce data objects in the form of subject-predicate-object triples in accordance with the RDF (Resource Description Framework) specification. An example is sensor data sources, which often publish sensor readings as RDF triples referring to a corresponding ontology. The processing of such semantic data streams, also called RDF data streams, differs significantly from the processing of ordinary data streams because special semantic reasoners are needed to discover knowledge in them. As existing semantic reasoners are primarily intended for processing static data, they need to be adapted to work with dynamic data, i.e. semantic data streams. This is a very complex problem for current reasoners, whose reasoning has high time complexity even for static data. For this reason, a whole new research area dealing with reasoning over semantic data streams has emerged recently. Within this research area, methods for processing semantic data streams will be investigated, with special emphasis on the possibility of distributed processing in a computer cluster.
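The most basic operation on an RDF data stream is matching incoming triples against a triple pattern with wildcards, analogous to a single SPARQL triple pattern (the sketch below is ours; the `sosa:` terms are drawn from the W3C SOSA sensor ontology for flavor, and full stream reasoning is of course far more involved than this matching step):

```python
from typing import NamedTuple

class Triple(NamedTuple):
    subject: str
    predicate: str
    object: str

def matches(triple, pattern):
    """A pattern is an (s, p, o) tuple where None acts as a wildcard."""
    return all(slot is None or slot == value
               for slot, value in zip(pattern, triple))

def filter_rdf_stream(stream, pattern):
    """Yield only the triples in the stream that match the pattern."""
    for t in stream:
        if matches(t, pattern):
            yield t

# Toy sensor stream: select all measurement-result triples.
stream = [
    Triple("sensor:42", "sosa:observes", "prop:Temperature"),
    Triple("sensor:42", "sosa:hasResult", '"21.5"'),
    Triple("sensor:17", "sosa:hasResult", '"0.9"'),
]
results = list(filter_rdf_stream(stream, (None, "sosa:hasResult", None)))
```

A stream reasoner layers inference on top of such matching (e.g. deriving implicit triples from ontology axioms), and it is this reasoning step, not the matching, that dominates the time complexity discussed above.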