
Synthetic Datasets in ML-Based Intrusion Detection

This blog examines the understated yet crucial role of synthetic datasets in ML-based Intrusion Detection, highlighting key considerations in their implementation. We also explore the limitations of prevalent IDS datasets and introduce Abluva's Enhanced IDS synthetic datasets as a measured solution.


Why Synthetic Data Matters for ML-based IDS


Synthetic data generation offers several advantages for IDS training:


Cost-effectiveness
Creating real-world attack data for training can be resource-intensive and time-consuming. Synthetic data provides a cost-effective alternative, enabling the generation of large datasets with specific attack types and variations.


Data enrichment
Synthetic data can augment existing datasets, filling in gaps and adding diversity to address specific needs and enhance model generalizability.
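
As a deliberately minimal illustration, the sketch below merges a batch of synthetic records into an existing flow dataset with pandas. The file names and the added "source" column are hypothetical, and the synthetic frame is assumed to already share the real frame's schema and label encoding.

```python
import pandas as pd

# Hypothetical file names; substitute your own real and synthetic CSVs.
real_df = pd.read_csv("real_flows.csv")
synthetic_df = pd.read_csv("synthetic_flows.csv")

# Tag the origin of each record so later experiments can compare models
# trained with and without the synthetic portion.
real_df["source"] = "real"
synthetic_df["source"] = "synthetic"

# The synthetic frame must match the real frame's columns and label
# encoding before the two are combined.
augmented_df = pd.concat([real_df, synthetic_df], ignore_index=True)

# Shuffle so synthetic rows are not clustered at the end of the file.
augmented_df = augmented_df.sample(frac=1.0, random_state=42).reset_index(drop=True)
augmented_df.to_csv("augmented_flows.csv", index=False)
```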


Emerging threats
Staying ahead of evolving cyber threats is crucial. Synthetic data allows for the creation of simulated attacks based on emerging and hypothetical scenarios, preparing models for future threats.


Important Considerations for Working with Synthetic Data


While synthetic data offers immense potential, several critical aspects need careful attention when creating such datasets.


Data Quality
The realism and accuracy of the generated data are paramount. Ensure your generation model uses high-quality real-world data as a reference to capture realistic patterns and distributions.
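
One lightweight sanity check is to compare the per-feature distributions of the synthetic data against the real reference data. The sketch below runs a two-sample Kolmogorov-Smirnov test from SciPy over each numeric column; the file names are hypothetical and the flagging threshold is purely illustrative.

```python
import pandas as pd
from scipy.stats import ks_2samp

# Hypothetical frames: both files share the same feature columns.
real_df = pd.read_csv("real_flows.csv")
synthetic_df = pd.read_csv("synthetic_flows.csv")

numeric_cols = real_df.select_dtypes(include="number").columns

# A per-feature KS test flags columns whose synthetic distribution drifts
# far from the real one (large statistic, small p-value). The 0.1 cutoff
# is only an illustrative starting point.
for col in numeric_cols:
    stat, p_value = ks_2samp(real_df[col].dropna(), synthetic_df[col].dropna())
    flag = "CHECK" if stat > 0.1 else "ok"
    print(f"{col:30s} KS={stat:.3f} p={p_value:.3g} {flag}")
```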


Data Balance
Avoid imbalanced datasets where certain attack types are overrepresented while others are rare. This can lead to models biased towards detecting prevalent attacks and neglecting less frequent but equally dangerous ones.
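
A simple first step is to measure the class distribution and estimate how many synthetic samples each minority class would need. The sketch below assumes a hypothetical CSV with a "Label" column and uses an illustrative target ratio; the right balance depends on the deployment context.

```python
import pandas as pd

# Hypothetical dataset with a "Label" column naming the traffic class.
df = pd.read_csv("augmented_flows.csv")

class_counts = df["Label"].value_counts()
print(class_counts)

# One simple target: bring every minority class up to a fixed fraction of
# the majority class. The 0.25 ratio is illustrative, not prescriptive.
target = int(class_counts.max() * 0.25)
needed = (target - class_counts).clip(lower=0)

print("Synthetic samples to generate per class:")
print(needed[needed > 0])
```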


Relevance/Adaptability to emerging threats
Synthetic data needs to evolve alongside real-world threats. Continuously update your generation process and data sources to reflect the latest cyber threat landscape.


Overcoming Overfitting
Synthetic data generated from limited training data can lead to overfitting, where the model performs well on the specific data it was trained on but fails to generalize to new scenarios. Use diverse training data and validation techniques to mitigate this risk.
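
One practical way to surface this is to evaluate only on held-out real traffic, keeping synthetic rows strictly in the training set. The sketch below reuses the hypothetical "source" and "Label" columns from the augmentation example above, assumes the remaining features are numeric, and uses a generic random forest purely as a stand-in classifier.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

df = pd.read_csv("augmented_flows.csv")  # hypothetical augmented file

# Synthetic rows never enter the test set, so the evaluation reflects
# generalization to genuine traffic only.
real_df = df[df["source"] == "real"]
synthetic_df = df[df["source"] == "synthetic"]

real_train, real_test = train_test_split(
    real_df, test_size=0.3, stratify=real_df["Label"], random_state=42
)

train_df = pd.concat([real_train, synthetic_df], ignore_index=True)
feature_cols = [c for c in df.columns if c not in ("Label", "source")]

model = RandomForestClassifier(n_estimators=200, random_state=42)
model.fit(train_df[feature_cols], train_df["Label"])

# A large gap between training performance and these held-out real-traffic
# scores is the overfitting signal this section warns about.
print(classification_report(real_test["Label"], model.predict(real_test[feature_cols])))
```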


Limitations of Popular IDS Datasets


Several factors can limit the effectiveness of IDS datasets for training and testing intrusion detection systems. Here's a breakdown of the limitations of several widely used datasets:


CSE-CIC-IDS2018

  • Issue: Label inconsistencies and inaccuracies, impacting model training.
  • Challenge: Imbalanced classes, favoring certain attack types.
  • Limitation: Limited attack diversity, affecting model generalizability.

CIC-IDS-2017

  • Issue: Attack mimicry with emulated attacks.
  • Challenge: Focus on static packet-level features.
  • Limitation: Lack of temporal event relationships.

UNSW-NB15

  • Issue: Domain-dependent features.
  • Challenge: Artificially generated attack scenarios.
  • Limitation: Static labels for evolving attacks.

NSL KDD

  • Issue: Outdated dataset not reflecting current attack techniques.
  • Challenge: Redundant and irrelevant features.
  • Limitation: Class imbalance with normal traffic.

General Concerns across Datasets


Static Network Environment
Most datasets capture traffic from a single network setup, limiting generalizability to diverse network configurations.


Privacy Concerns
Sharing network traffic data can raise privacy concerns, requiring careful anonymization strategies.


Lack of Ground Truth
Verifying the accuracy and completeness of attack labels can be challenging, impacting model training effectiveness.


Abluva's Enhanced IDS Synthetic Data Sets


Abluva is committed to providing high-quality synthetic data for IDS training. Our team has developed a new GAN-based model called BLENDER-GAN, which generates attack and benign classes that better reflect real-world scenarios. The following datasets have been enhanced using this model. (BLENDER-GAN will be published shortly; a generic conditional-GAN sketch after the dataset list illustrates the overall approach.)


  • CSE-CIC-IDS 2018 V3: 100,000 additional data points for the "Comb" class enhance this attack-focused dataset, improving model training efficacy.
  • NSL KDD V2: 41,200 synthetic data points for the "Comb" class, generated with Abluva's Blender GAN, significantly increase attack diversity and model generalizability.
  • UNSW NB 15 V3: a new "Comb" class with 15,000 data points mimicking real-world attack characteristics has been added, enriching the dataset for more robust model training.
  • CIC-IDS 2017 V2: the dataset has been amplified with an additional "Comb" class of 172,800 synthetic data points, making it a more comprehensive and versatile resource for attack detection research.

You can read more about these datasets on our Synthetic Datasets page.
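
For readers curious about what GAN-based tabular synthesis looks like in code: since BLENDER-GAN has not yet been published, the sketch below is emphatically not its architecture. It is a minimal, generic conditional GAN for scaled flow features in PyTorch, with an assumed feature width, class count, and hyperparameters, shown only to convey the general idea of class-conditioned generation.

```python
import torch
import torch.nn as nn

# NOTE: Blender-GAN has not been published yet, so this is NOT its
# architecture. It is only a generic conditional GAN for tabular flow
# features, included to illustrate class-conditioned synthesis in general.

NUM_FEATURES = 40   # assumed width of the scaled feature vector
NUM_CLASSES = 5     # assumed number of classes (benign + attack types)
LATENT_DIM = 64     # illustrative noise dimension


class Generator(nn.Module):
    def __init__(self):
        super().__init__()
        self.label_embed = nn.Embedding(NUM_CLASSES, NUM_CLASSES)
        self.net = nn.Sequential(
            nn.Linear(LATENT_DIM + NUM_CLASSES, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, NUM_FEATURES),
        )

    def forward(self, noise, labels):
        return self.net(torch.cat([noise, self.label_embed(labels)], dim=1))


class Discriminator(nn.Module):
    def __init__(self):
        super().__init__()
        self.label_embed = nn.Embedding(NUM_CLASSES, NUM_CLASSES)
        self.net = nn.Sequential(
            nn.Linear(NUM_FEATURES + NUM_CLASSES, 128), nn.ReLU(),
            nn.Linear(128, 1),  # real-vs-fake logit
        )

    def forward(self, rows, labels):
        return self.net(torch.cat([rows, self.label_embed(labels)], dim=1))


def train_step(gen, disc, real_x, real_y, g_opt, d_opt, loss_fn):
    batch = real_x.size(0)
    noise = torch.randn(batch, LATENT_DIM)
    fake_y = torch.randint(0, NUM_CLASSES, (batch,))
    fake_x = gen(noise, fake_y)

    # Discriminator: push real rows toward 1 and generated rows toward 0.
    d_opt.zero_grad()
    d_loss = loss_fn(disc(real_x, real_y), torch.ones(batch, 1)) \
        + loss_fn(disc(fake_x.detach(), fake_y), torch.zeros(batch, 1))
    d_loss.backward()
    d_opt.step()

    # Generator: try to make the discriminator score fakes as real.
    g_opt.zero_grad()
    g_loss = loss_fn(disc(fake_x, fake_y), torch.ones(batch, 1))
    g_loss.backward()
    g_opt.step()
    return d_loss.item(), g_loss.item()


# Typical setup (illustrative hyperparameters):
gen, disc = Generator(), Discriminator()
g_opt = torch.optim.Adam(gen.parameters(), lr=2e-4)
d_opt = torch.optim.Adam(disc.parameters(), lr=2e-4)
loss_fn = nn.BCEWithLogitsLoss()
```

In practice, generated rows would be inverse-scaled back to the original feature ranges, checked against the real data (for example with the distribution comparison shown earlier), and only then merged into the target dataset.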


By leveraging the power of synthetic data and addressing its key considerations, we can build robust and resilient intrusion detection systems, securing our digital future in the face of ever-evolving cyber threats.