Funding year 2018 / Scholarship Call #13 / Project ID: 3793 / Project: Data Management Strategies for Near Real-Time Edge Analytics
This blog post introduces the methodology and environment for implementing some of the proposed edge data management strategies, aiming to demonstrate their practical applicability alongside their theoretical contribution.
The strategies covered in the blog posts on approximate data analytics, edge data recovery, and elastic storage services are designed to work on distributed and decentralized systems, such as networks of edge nodes. Running data processing on distributed and decentralized nodes can involve multiple applications whose structure and logic are separated from the data they use.
Separation of data and applications
Nowadays, many applications, such as speech recognition and object classification, separate core application logic from the data and its location. It is therefore both important and challenging to place application instances with respect to data location in near real-time.
If we consider data location and whether resource requirements (e.g., a latency bound) are satisfied, there are four possible cases from the perspective of an individual edge/cloud node:
- lack of resources, and necessary data are not present;
- lack of resources, while necessary data are present;
- enough resources, but necessary data are not present;
- enough resources, while necessary data are present.
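The four cases above can be sketched as a small classification function. This is only a minimal illustration, not part of EDMFrame: the names `NodeState`, `NodeCase`, and `classify` are hypothetical, and "enough resources" is reduced to a single latency bound for simplicity.

```python
from dataclasses import dataclass
from enum import Enum


class NodeCase(Enum):
    """The four cases seen from one edge/cloud node."""
    NO_RESOURCES_NO_DATA = 1
    NO_RESOURCES_HAS_DATA = 2
    HAS_RESOURCES_NO_DATA = 3
    HAS_RESOURCES_HAS_DATA = 4


@dataclass(frozen=True)
class NodeState:
    # Hypothetical description of a node; field names are assumptions.
    name: str
    latency_ms: float          # latency the node can currently offer
    local_datasets: frozenset  # names of datasets stored on the node


def classify(node, max_latency_ms, required_datasets):
    """Map a node to one of the four cases.

    Assumption: the only resource requirement checked is a latency bound.
    """
    enough_resources = node.latency_ms <= max_latency_ms
    data_present = set(required_datasets) <= set(node.local_datasets)

    if enough_resources and data_present:
        return NodeCase.HAS_RESOURCES_HAS_DATA
    if enough_resources:
        return NodeCase.HAS_RESOURCES_NO_DATA
    if data_present:
        return NodeCase.NO_RESOURCES_HAS_DATA
    return NodeCase.NO_RESOURCES_NO_DATA


if __name__ == "__main__":
    node = NodeState("edge-1", latency_ms=20.0,
                     local_datasets=frozenset({"frames"}))
    print(classify(node, max_latency_ms=50.0, required_datasets={"frames"}))
```

Only the last case allows the application to run on the node as it is; in the other three, either the data, the application instance, or both would have to be moved.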
Furthermore, data-intensive applications require customized data collection and dynamic data processing pipelines across multiple nodes, and there is often a need for aggregated analytics. This makes it challenging to orchestrate and manage application services and data locations on demand, especially in scenarios in which applications rely completely on edge computing capabilities rather than consulting cloud services, e.g., at remote edge sites that have no Internet connection, or where the connection to the cloud is intermittent or the network is congested.
What are the advantages of edge data processing?
- Data protection. Many IoT applications deal with sensitive data. By avoiding the transfer of data across the network over multiple hops, sensitive data can be kept at its source.
- Bandwidth usage and cost of data transfer. Transferring all data to a centralized data center for analytics can be too expensive. However, in some cases, data have to be transferred to more capable (cloud) nodes to satisfy certain conditions (e.g., resources, accuracy).
- Avoidance of network bottlenecks. Increased latency due to network bottlenecks in centralized cloud computing can affect (near) real-time decisions in edge systems.
Application containers and the role of the orchestrator
These IoT applications can be packaged into containers that run across different hosts, i.e., edge/cloud nodes. An orchestrator plays the main role in managing these containers. Kubernetes is currently a widely used open-source orchestrator that automates application deployment. I plan to use Kubernetes as a platform for container orchestration, aiming to realize and support some important data management strategies in the proposed EDMFrame.
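As a rough illustration of how a containerized application could be tied to particular nodes, the sketch below builds a Kubernetes Deployment manifest as a Python dictionary and prints it as JSON. It is not the EDMFrame implementation: the helper `edge_deployment`, the image name, and the node labels are hypothetical, and it assumes the edge nodes have already been labelled accordingly. The manifest fields themselves (`apps/v1` Deployment, `nodeSelector`) are standard Kubernetes.

```python
import json


def edge_deployment(name, image, node_labels, replicas=1):
    """Build a Deployment manifest whose pods run only on nodes
    carrying all of the given labels (via nodeSelector)."""
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": {"app": name}},
            "template": {
                "metadata": {"labels": {"app": name}},
                "spec": {
                    # Restrict scheduling to nodes with these labels.
                    "nodeSelector": dict(node_labels),
                    "containers": [{"name": name, "image": image}],
                },
            },
        },
    }


if __name__ == "__main__":
    manifest = edge_deployment(
        name="object-classifier",
        # Hypothetical image and labels, for illustration only.
        image="example.org/object-classifier:latest",
        node_labels={"example.org/tier": "edge",
                     "example.org/dataset": "camera-frames"},
    )
    print(json.dumps(manifest, indent=2))
```

The printed JSON could be saved to a file and applied with `kubectl apply -f`. Encoding data locality as a node label is just one possible design choice; it is static, so a strategy that reacts to data moving between nodes would need something more dynamic.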
Consequently, future contributions should enable data management strategies to be deployed across multiple sites, including edge nodes (Raspberry Pis, edge gateways) and cloud nodes (geographically distributed and distant servers). The next challenge is how application placement can account for the varying locality of the data needed for edge analytics.