berttraing

Qian Wang requested to merge berttraing into main

Goal of the Changes: HTTP Dataset 2010: The HTTP Dataset 2010 will be used for training. It consists of labeled web traffic (HTTP requests and logs) for classification tasks such as identifying potential anomalies or malicious activity.

BERT for Feature Extraction: BERT (or another transformer-based model) will be used as the upstream model for feature extraction. This allows the model to capture contextual relationships in the text data and generate rich, accurate feature representations.
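A minimal sketch of the upstream feature extractor, using a tiny randomly initialized BERT so it runs without downloading weights; in the real pipeline a pretrained checkpoint such as `bert-base-uncased` and its tokenizer would be loaded instead (all sizes below are illustrative assumptions):

```python
import torch
from transformers import BertConfig, BertModel

# Tiny, randomly initialized BERT stands in for a pretrained checkpoint
# (real training would use BertModel.from_pretrained("bert-base-uncased")).
config = BertConfig(
    vocab_size=1000, hidden_size=128, num_hidden_layers=2,
    num_attention_heads=2, intermediate_size=256,
)
model = BertModel(config)
model.eval()

# Fake batch of 2 tokenized HTTP requests, 16 tokens each
# (a real run would produce input_ids with a BertTokenizer).
input_ids = torch.randint(0, config.vocab_size, (2, 16))
attention_mask = torch.ones_like(input_ids)

with torch.no_grad():
    out = model(input_ids=input_ids, attention_mask=attention_mask)

token_features = out.last_hidden_state  # (batch, seq_len, hidden): per-token features
cls_features = token_features[:, 0]     # [CLS] vector: one feature vector per request
print(token_features.shape, cls_features.shape)
```

The per-token features feed sequence-aware heads (CNN/RNN), while the [CLS] vector suits a plain softmax classifier.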

Downstream Classifiers: After feature extraction, the following models will be used for classification:

CNN: Convolutional Neural Networks will be used to capture local patterns in the text.

Softmax: Softmax regression will be employed as a simple classifier for multi-class classification.

RNN/LSTM/GRU: Recurrent Neural Networks (RNN), Long Short-Term Memory (LSTM), and Gated Recurrent Units (GRU) will be explored for sequence-based tasks, especially useful for handling long dependencies in the text data.

Hyperparameter Tuning: Using SageMaker Model Tuning (specifically Hyperparameter Optimization Jobs), the best combination of models (BERT + CNN/Softmax/RNN/LSTM/GRU) and hyperparameters will be found. This step is crucial for improving the performance of the models.
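The downstream heads could be sketched in PyTorch roughly as follows; layer sizes and class count are illustrative assumptions, and each head consumes BERT token features of shape (batch, seq_len, hidden) and emits class logits for softmax/cross-entropy:

```python
import torch
import torch.nn as nn

class CNNHead(nn.Module):
    """1-D convolution over token features to capture local patterns."""
    def __init__(self, hidden=128, classes=2):
        super().__init__()
        self.conv = nn.Conv1d(hidden, 64, kernel_size=3, padding=1)
        self.fc = nn.Linear(64, classes)

    def forward(self, x):                     # x: (B, T, H)
        x = torch.relu(self.conv(x.transpose(1, 2)))  # (B, 64, T)
        x = torch.amax(x, dim=2)              # global max-pool over tokens
        return self.fc(x)                     # (B, classes) logits

class SoftmaxHead(nn.Module):
    """Softmax regression: a single linear layer on the [CLS] vector."""
    def __init__(self, hidden=128, classes=2):
        super().__init__()
        self.fc = nn.Linear(hidden, classes)

    def forward(self, x):                     # x: (B, T, H)
        return self.fc(x[:, 0])               # logits; softmax applied in the loss

class RecurrentHead(nn.Module):
    """RNN/LSTM/GRU over token features for long-range dependencies."""
    def __init__(self, hidden=128, classes=2, cell="lstm"):
        super().__init__()
        rnn_cls = {"rnn": nn.RNN, "lstm": nn.LSTM, "gru": nn.GRU}[cell]
        self.rnn = rnn_cls(hidden, 64, batch_first=True)
        self.fc = nn.Linear(64, classes)

    def forward(self, x):                     # x: (B, T, H)
        out, _ = self.rnn(x)                  # (B, T, 64)
        return self.fc(out[:, -1])            # last time step -> logits

feats = torch.randn(2, 16, 128)              # stand-in for BERT token features
cnn_logits = CNNHead()(feats)
soft_logits = SoftmaxHead()(feats)
gru_logits = RecurrentHead(cell="gru")(feats)
print(cnn_logits.shape, soft_logits.shape, gru_logits.shape)
```

Training would typically pair any of these heads with `nn.CrossEntropyLoss`, which applies the softmax internally.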

SageMaker Pipeline: The entire pipeline, from data preprocessing to model training and tuning, will be deployed on Amazon SageMaker. SageMaker will also handle model deployment, ensuring that the best-tuned model can be easily integrated into production environments.
