
Synthetic Datasets in ML-Based Intrusion Detection

This blog examines the understated yet crucial role of synthetic datasets in ML-based Intrusion Detection, highlighting key considerations in their implementation. We also explore the limitations of prevalent IDS datasets and introduce Abluva's Enhanced IDS synthetic datasets as a measured solution.


Why Synthetic Data Matters for ML-based IDS


Synthetic data generation offers several advantages for IDS training:


Cost-effectiveness
Creating real-world attack data for training can be resource-intensive and time-consuming. Synthetic data provides a cost-effective alternative, enabling the generation of large datasets with specific attack types and variations.


Data enrichment
Synthetic data can augment existing datasets, filling in gaps and adding diversity to address specific needs and enhance model generalizability.
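
As a deliberately minimal illustration, the sketch below merges a batch of synthetic records into an existing flow dataset with pandas. The file names and the added "source" column are hypothetical, and the synthetic frame is assumed to already share the real frame's schema and label encoding.

```python
import pandas as pd

# Hypothetical file names; substitute your own real and synthetic CSVs.
real_df = pd.read_csv("real_flows.csv")
synthetic_df = pd.read_csv("synthetic_flows.csv")

# Tag the origin of each record so later experiments can compare models
# trained with and without the synthetic portion.
real_df["source"] = "real"
synthetic_df["source"] = "synthetic"

# The synthetic frame must match the real frame's columns and label
# encoding before the two are combined.
augmented_df = pd.concat([real_df, synthetic_df], ignore_index=True)

# Shuffle so synthetic rows are not clustered at the end of the file.
augmented_df = augmented_df.sample(frac=1.0, random_state=42).reset_index(drop=True)
augmented_df.to_csv("augmented_flows.csv", index=False)
```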


Emerging threats
Staying ahead of evolving cyber threats is crucial. Synthetic data allows for the creation of simulated attacks based on emerging and hypothetical scenarios, preparing models for future threats.


Important Considerations for Working with Synthetic Data


While synthetic data offers immense potential, several critical aspects need careful attention when creating such datasets.


Data Quality
The realism and accuracy of the generated data are paramount. Ensure your generation model uses high-quality real-world data as a reference to capture realistic patterns and distributions.
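
One lightweight sanity check is to compare the per-feature distributions of the synthetic data against the real reference data. The sketch below runs a two-sample Kolmogorov-Smirnov test from SciPy over each numeric column; the file names are hypothetical and the flagging threshold is purely illustrative.

```python
import pandas as pd
from scipy.stats import ks_2samp

# Hypothetical frames: both files share the same feature columns.
real_df = pd.read_csv("real_flows.csv")
synthetic_df = pd.read_csv("synthetic_flows.csv")

numeric_cols = real_df.select_dtypes(include="number").columns

# A per-feature KS test flags columns whose synthetic distribution drifts
# far from the real one (large statistic, small p-value). The 0.1 cutoff
# is only an illustrative starting point.
for col in numeric_cols:
    stat, p_value = ks_2samp(real_df[col].dropna(), synthetic_df[col].dropna())
    flag = "CHECK" if stat > 0.1 else "ok"
    print(f"{col:30s} KS={stat:.3f} p={p_value:.3g} {flag}")
```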


Data Balance
Avoid imbalanced datasets where certain attack types are overrepresented while others are rare. This can lead to models biased towards detecting prevalent attacks and neglecting less frequent but equally dangerous ones.
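
A simple first step is to measure the class distribution and estimate how many synthetic samples each minority class would need. The sketch below assumes a hypothetical CSV with a "Label" column and uses an illustrative target ratio; the right balance depends on the deployment context.

```python
import pandas as pd

# Hypothetical dataset with a "Label" column naming the traffic class.
df = pd.read_csv("augmented_flows.csv")

class_counts = df["Label"].value_counts()
print(class_counts)

# One simple target: bring every minority class up to a fixed fraction of
# the majority class. The 0.25 ratio is illustrative, not prescriptive.
target = int(class_counts.max() * 0.25)
needed = (target - class_counts).clip(lower=0)

print("Synthetic samples to generate per class:")
print(needed[needed > 0])
```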


Relevance/Adaptability to emerging threats
Synthetic data needs to evolve alongside real-world threats. Continuously update your generation process and data sources to reflect the latest cyber threat landscape.


Overcoming Overfitting
Synthetic data generated from limited training data can lead to overfitting, where the model performs well on the specific data it was trained on but fails to generalize to new scenarios. Use diverse training data and validation techniques to mitigate this risk.
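
One practical way to surface this is to evaluate only on held-out real traffic, keeping synthetic rows strictly in the training set. The sketch below reuses the hypothetical "source" and "Label" columns from the augmentation example above, assumes the remaining features are numeric, and uses a generic random forest purely as a stand-in classifier.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

df = pd.read_csv("augmented_flows.csv")  # hypothetical augmented file

# Synthetic rows never enter the test set, so the evaluation reflects
# generalization to genuine traffic only.
real_df = df[df["source"] == "real"]
synthetic_df = df[df["source"] == "synthetic"]

real_train, real_test = train_test_split(
    real_df, test_size=0.3, stratify=real_df["Label"], random_state=42
)

train_df = pd.concat([real_train, synthetic_df], ignore_index=True)
feature_cols = [c for c in df.columns if c not in ("Label", "source")]

model = RandomForestClassifier(n_estimators=200, random_state=42)
model.fit(train_df[feature_cols], train_df["Label"])

# A large gap between training performance and these held-out real-traffic
# scores is the overfitting signal this section warns about.
print(classification_report(real_test["Label"], model.predict(real_test[feature_cols])))
```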


Limitations of Popular IDS Datasets


Several factors can limit the effectiveness of IDS datasets for training and testing intrusion detection systems. Here's a breakdown of the limitations of several widely used datasets:


CSE-CIC-IDS2018

  • Issue: Label inconsistencies and inaccuracies, impacting model training.
  • Challenge: Imbalanced classes, favoring certain attack types.
  • Limitation: Limited attack diversity, affecting model generalizability.

CIC-IDS-2017

  • Issue: Attack mimicry with emulated attacks.
  • Challenge: Focus on static packet-level features.
  • Limitation: Lack of temporal event relationships.

UNSW-NB15

  • Issue: Domain-dependent features.
  • Challenge: Artificially generated attack scenarios.
  • Limitation: Static labels for evolving attacks.

NSL KDD

  • Issue: Outdated dataset not reflecting current attack techniques.
  • Challenge: Redundant and irrelevant features.
  • Limitation: Class imbalance with normal traffic.

General Concerns across Datasets


Static Network Environment
Most datasets capture traffic from a single network setup, limiting generalizability to diverse network configurations.


Privacy Concerns
Sharing network traffic data can raise privacy concerns, requiring careful anonymization strategies.


Lack of Ground Truth
Verifying the accuracy and completeness of attack labels can be challenging, impacting model training effectiveness.


Abluva's Enhanced IDS Synthetic Data Sets


Abluva is committed to providing high-quality synthetic data for IDS training. Our team has developed a new GAN-based model called BLENDER-GAN, which generates attack and benign classes that better reflect real-world scenarios. The following datasets have been enhanced using this model. (BLENDER-GAN will be published shortly; a generic conditional-GAN sketch after the dataset list illustrates the overall approach.)


  • CSE-CIC-IDS 2018 V3: 100,000 additional data points for the "Comb" class enhance this attack-focused dataset, improving model training efficacy.
  • NSL KDD V2: 41,200 synthetic data points for the "Comb" class, generated with Abluva's Blender GAN, significantly increase attack diversity and model generalizability.
  • UNSW NB 15 V3: a new "Comb" class with 15,000 data points mimicking real-world attack characteristics has been added, enriching the dataset for more robust model training.
  • CIC-IDS 2017 V2: the dataset has been amplified with an additional "Comb" class of 172,800 synthetic data points, making it a more comprehensive and versatile resource for attack detection research.

You can read more about these datasets on our Synthetic Datasets page.
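
For readers curious about what GAN-based tabular synthesis looks like in code: since BLENDER-GAN has not yet been published, the sketch below is emphatically not its architecture. It is a minimal, generic conditional GAN for scaled flow features in PyTorch, with an assumed feature width, class count, and hyperparameters, shown only to convey the general idea of class-conditioned generation.

```python
import torch
import torch.nn as nn

# NOTE: Blender-GAN has not been published yet, so this is NOT its
# architecture. It is only a generic conditional GAN for tabular flow
# features, included to illustrate class-conditioned synthesis in general.

NUM_FEATURES = 40   # assumed width of the scaled feature vector
NUM_CLASSES = 5     # assumed number of classes (benign + attack types)
LATENT_DIM = 64     # illustrative noise dimension


class Generator(nn.Module):
    def __init__(self):
        super().__init__()
        self.label_embed = nn.Embedding(NUM_CLASSES, NUM_CLASSES)
        self.net = nn.Sequential(
            nn.Linear(LATENT_DIM + NUM_CLASSES, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, NUM_FEATURES),
        )

    def forward(self, noise, labels):
        return self.net(torch.cat([noise, self.label_embed(labels)], dim=1))


class Discriminator(nn.Module):
    def __init__(self):
        super().__init__()
        self.label_embed = nn.Embedding(NUM_CLASSES, NUM_CLASSES)
        self.net = nn.Sequential(
            nn.Linear(NUM_FEATURES + NUM_CLASSES, 128), nn.ReLU(),
            nn.Linear(128, 1),  # real-vs-fake logit
        )

    def forward(self, rows, labels):
        return self.net(torch.cat([rows, self.label_embed(labels)], dim=1))


def train_step(gen, disc, real_x, real_y, g_opt, d_opt, loss_fn):
    batch = real_x.size(0)
    noise = torch.randn(batch, LATENT_DIM)
    fake_y = torch.randint(0, NUM_CLASSES, (batch,))
    fake_x = gen(noise, fake_y)

    # Discriminator: push real rows toward 1 and generated rows toward 0.
    d_opt.zero_grad()
    d_loss = loss_fn(disc(real_x, real_y), torch.ones(batch, 1)) \
        + loss_fn(disc(fake_x.detach(), fake_y), torch.zeros(batch, 1))
    d_loss.backward()
    d_opt.step()

    # Generator: try to make the discriminator score fakes as real.
    g_opt.zero_grad()
    g_loss = loss_fn(disc(fake_x, fake_y), torch.ones(batch, 1))
    g_loss.backward()
    g_opt.step()
    return d_loss.item(), g_loss.item()


# Typical setup (illustrative hyperparameters):
gen, disc = Generator(), Discriminator()
g_opt = torch.optim.Adam(gen.parameters(), lr=2e-4)
d_opt = torch.optim.Adam(disc.parameters(), lr=2e-4)
loss_fn = nn.BCEWithLogitsLoss()
```

In practice, generated rows would be inverse-scaled back to the original feature ranges, checked against the real data (for example with the distribution comparison shown earlier), and only then merged into the target dataset.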


By leveraging the power of synthetic data and addressing its key considerations, we can build robust and resilient intrusion detection systems, securing our digital future in the face of ever-evolving cyber threats.