ctgantraining (!8) · Merge requests · Qian Wang / talktiveproject

Qian Wang requested to merge ctgantraining into main Feb 09, 2025

IDSC 2017 Dataset:

The IDSC 2017 dataset will be used as the base for the model. It provides a variety of features and labels that can be used to train the CTGAN model.

CTGAN for Data Augmentation:

CTGAN will be the main generative model, creating synthetic data from the data available in the IDSC 2017 dataset. This augments the dataset, especially where it is imbalanced or lacks sufficient coverage of certain classes or features.

Data Preprocessing (Centering):

Preprocessing will include centering the data: subtracting each feature's mean so that the feature has zero mean, which can help many machine learning models converge.

Model Training and Hyperparameter Optimization:

We will train the CTGAN model on the preprocessed data and check that the synthetic data it generates captures the distribution of the original dataset. SageMaker Model Tuning will be used to tune the CTGAN model's hyperparameters and find the best-performing configuration. This step is critical for improving the quality of the generated synthetic data and its impact on downstream tasks.

SageMaker Pipeline:

The entire workflow, including data preprocessing, model training, and hyperparameter tuning, will be automated and integrated into a SageMaker pipeline for continuous training and retraining.
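
As an illustration of the centering step named under Data Preprocessing, here is a minimal sketch in plain Python. It is not the code in this branch: the helper names `fit_column_means` and `center` and the sample values are hypothetical, and it assumes the features are already numeric. It assumes the means are computed on the training rows only and then reused for any other rows.

```python
from statistics import fmean


def fit_column_means(rows):
    """Return the per-column mean of a list of equal-length numeric rows."""
    columns = list(zip(*rows))
    return [fmean(col) for col in columns]


def center(rows, means):
    """Subtract the given per-column means from every row."""
    return [[value - mean for value, mean in zip(row, means)] for row in rows]


# Hypothetical numeric features (two columns), not taken from the real dataset.
train_rows = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
new_rows = [[4.0, 40.0]]

train_means = fit_column_means(train_rows)      # computed on training rows only
centered_train = center(train_rows, train_means)
centered_new = center(new_rows, train_means)    # reuse the training means
```

If the generative model is trained on centered features, the stored means would also be needed to shift sampled rows back to the original scale.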
