
Real-Time Monitoring with Spark and Kafka, Data Processing & SageMaker Integration

Draft: Reatimemonitor

Qian Wang requested to merge reatimemonitor into main

Real-Time Data Streaming:

Use Kafka to stream incoming flow data (e.g., network traffic, sensor data) into Spark for real-time monitoring.

Data Preprocessing:

The Spark stream handles the necessary preprocessing steps:

- Centering the data (e.g., standardization or normalization).
- Handling missing values with fillna or similar techniques.

SageMaker Integration:

Once the data is processed, it is sent to an AWS SageMaker endpoint so the model can make real-time predictions on the incoming data. The prediction results are then written back into Hive for historical storage.

Data Storage:

Hive acts as the data warehouse for the historical prediction results. HBase stores the real-time predictions and serves as a fast, low-latency storage layer that feeds real-time data to the front-end dashboards.

Model Training:

Whenever the Hive data warehouse grows past a configured threshold, model retraining is triggered through SageMaker pipelines. The XXL-Job framework periodically checks the size of the data warehouse and starts the SageMaker pipeline to retrain the model on the new data.

Dashboard Integration:

The real-time predictions stored in HBase provide immediate updates to the front-end dashboard, enabling live data visualization for end users.
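The streaming and preprocessing steps above could be sketched roughly as follows. This is a minimal PySpark sketch, not the final implementation: the topic name, schema, bootstrap servers, and fill defaults are all placeholder assumptions. The pure helpers mirror what `fillna` and centering do on the Spark side.

```python
import math

# Pure helpers mirroring the Spark-side preprocessing (fillna + centering).

def fill_missing(record, defaults):
    """Replace None fields with per-column defaults (what DataFrame.fillna does)."""
    return {k: (defaults.get(k) if v is None else v) for k, v in record.items()}

def standardize(values):
    """Center and scale a batch of readings to zero mean / unit variance."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    std = math.sqrt(var) or 1.0  # guard against a constant batch
    return [(v - mean) / std for v in values]

def start_stream(bootstrap_servers="localhost:9092", topic="flow-events"):
    """Hypothetical Kafka -> Spark wiring; needs a live cluster and broker."""
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, from_json
    from pyspark.sql.types import DoubleType, StringType, StructField, StructType

    spark = SparkSession.builder.appName("realtime-monitor").getOrCreate()
    schema = StructType([
        StructField("sensor_id", StringType()),
        StructField("value", DoubleType()),
    ])
    raw = (spark.readStream.format("kafka")
           .option("kafka.bootstrap.servers", bootstrap_servers)
           .option("subscribe", topic)
           .load())
    events = (raw.select(from_json(col("value").cast("string"), schema).alias("e"))
                 .select("e.*"))
    # Missing values: same effect as fill_missing above, applied column-wise.
    return events.na.fill({"value": 0.0})
```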
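The SageMaker integration step could look like the sketch below, using the standard `sagemaker-runtime` `invoke_endpoint` call from boto3. The endpoint name, region, and CSV content type are assumptions; the actual serialization has to match whatever the deployed model expects.

```python
def to_csv_payload(rows):
    """Serialize feature rows as a text/csv request body."""
    return "\n".join(",".join(str(x) for x in row) for row in rows)

def predict(rows, endpoint_name="flow-anomaly-endpoint", region="us-east-1"):
    """Hypothetical real-time inference call; endpoint name is a placeholder."""
    import boto3
    client = boto3.client("sagemaker-runtime", region_name=region)
    resp = client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="text/csv",
        Body=to_csv_payload(rows),
    )
    return resp["Body"].read().decode("utf-8")
```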
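The two-tier storage (HBase for low-latency reads, Hive for history) might be wired as below. The table names, column family, and the happybase client are assumptions; the reversed-timestamp row key is one common HBase idiom for making the newest predictions sort first in a scan.

```python
def hbase_row_key(sensor_id, epoch_ms, max_ms=10**13):
    """Row key with a reversed timestamp so newer predictions sort first."""
    return f"{sensor_id}:{max_ms - epoch_ms:013d}"

def store_prediction(sensor_id, epoch_ms, score, hbase_host="localhost"):
    """Hypothetical HBase write (happybase); table/column names are placeholders."""
    import happybase
    conn = happybase.Connection(hbase_host)
    table = conn.table("rt_predictions")
    table.put(hbase_row_key(sensor_id, epoch_ms).encode(),
              {b"p:score": str(score).encode()})
    conn.close()

def archive_to_hive(predictions_df):
    """Hypothetical Hive append for historical storage (takes a Spark DataFrame)."""
    predictions_df.write.mode("append").saveAsTable("warehouse.predictions")
```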
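The threshold-triggered retraining could be sketched as an XXL-Job handler body: compare the warehouse row count against the count at the last training run, and start the SageMaker pipeline when the growth threshold is crossed. The pipeline name and region are placeholders; `start_pipeline_execution` is the standard boto3 SageMaker call.

```python
def should_retrain(current_rows, rows_at_last_training, threshold):
    """Trigger retraining once the warehouse has grown by `threshold` rows."""
    return current_rows - rows_at_last_training >= threshold

def trigger_retraining(pipeline_name="flow-model-retrain", region="us-east-1"):
    """Hypothetical job body: kick off the SageMaker pipeline for retraining."""
    import boto3
    sm = boto3.client("sagemaker", region_name=region)
    resp = sm.start_pipeline_execution(PipelineName=pipeline_name)
    return resp["PipelineExecutionArn"]
```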
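The dashboard's read path from HBase might look like the sketch below, assuming the reversed-timestamp row keys from the storage step so a plain scan returns newest-first. The table name, column family, and happybase client are placeholder assumptions.

```python
def decode_row(row_key, cells):
    """Turn raw HBase bytes into a JSON-friendly dict for the dashboard API."""
    sensor_id = row_key.decode().split(":", 1)[0]
    return {"sensor_id": sensor_id,
            "score": float(cells[b"p:score"].decode())}

def latest_predictions(limit=50, hbase_host="localhost"):
    """Hypothetical scan; reversed-timestamp keys make a plain scan newest-first."""
    import happybase
    conn = happybase.Connection(hbase_host)
    rows = conn.table("rt_predictions").scan(limit=limit)
    return [decode_row(k, v) for k, v in rows]
```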
