Cybersecurity · Machine Learning

ML-Driven Intrusion Detection System

A machine learning IDS trained on real-world network traffic datasets with SHAP-based explainability for security analyst interpretation.

Status Completed · 2025

Type Personal Project

Platform Python / CLI

Team Solo

Python XGBoost SHAP Scikit-learn CIC-IDS2017 UNSW-NB15 Pandas

Overview

A machine learning–based Intrusion Detection System trained to identify malicious network traffic across multiple attack categories. Built on two benchmark datasets and designed with explainability in mind so security analysts can understand why a detection was made.

How It Works

Datasets

Trained and evaluated on two widely used IDS benchmark datasets: CIC-IDS2017 (Canadian Institute for Cybersecurity) and UNSW-NB15 (University of New South Wales). Together they cover a wide range of attack types, including DoS, DDoS, brute force, web attacks, infiltration, and botnets.

Feature Engineering Pipeline

Built a full preprocessing pipeline: feature extraction from raw network flow data, handling of severe class imbalance using SMOTE and undersampling techniques, normalization, and feature selection based on importance scores.
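The resampling and normalization steps can be sketched roughly as follows. This is a minimal illustration on synthetic data, not the project's actual pipeline: the helper `smote_like`, the class ratios, and the target counts (1000 benign, 500 attack) are all assumptions made for the example. The interpolation follows the SMOTE idea of creating synthetic minority rows between a real row and one of its nearest minority neighbours; in practice a library such as imbalanced-learn would normally be used instead of a hand-rolled version.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler


def smote_like(X_min, n_new, k=5, seed=0):
    """Hypothetical helper: build n_new synthetic minority rows by
    interpolating between a row and one of its k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    # Column 0 is normally the query row itself, so drop it.
    neighbours = nn.kneighbors(X_min, return_distance=False)[:, 1:]
    base = rng.integers(0, len(X_min), size=n_new)
    picked = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))  # interpolation factor in [0, 1)
    return X_min[base] + gap * (X_min[picked] - X_min[base])


# Synthetic stand-in for flow features: about 3% attack rows (label 1).
X, y = make_classification(n_samples=5000, n_features=10, weights=[0.97, 0.03],
                           flip_y=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# Fit the scaler on training rows only so test statistics do not leak in.
scaler = StandardScaler().fit(X_train)
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)

# Resample the training split only; the test split keeps its real class ratio.
rng = np.random.default_rng(0)
benign_idx = np.flatnonzero(y_train == 0)
attack_idx = np.flatnonzero(y_train == 1)
keep_benign = rng.choice(benign_idx, size=1000, replace=False)  # undersample
X_att = X_train_s[attack_idx]
X_syn = smote_like(X_att, n_new=500 - len(attack_idx))          # oversample

X_bal = np.vstack([X_train_s[keep_benign], X_att, X_syn])
y_bal = np.concatenate([np.zeros(1000, dtype=int), np.ones(500, dtype=int)])
print(X_bal.shape, int(y_bal.sum()))
```

Resampling only the training split matters for an IDS: evaluating on resampled data would hide the false positive rate the model produces at the real benign-to-attack ratio.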

Detection Models

Compared multiple approaches: XGBoost (supervised, best overall performance), Random Forest (ensemble baseline), and anomaly-based detection using Isolation Forest. Evaluated tradeoffs in precision, recall, and false positive rate relevant to real-world IDS deployment.
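A comparison along these lines might look like the sketch below. It runs on synthetic data and makes several assumptions: scikit-learn's `GradientBoostingClassifier` stands in for XGBoost so the example needs only one library, the 5% attack ratio and the `contamination` value are arbitrary, and `ids_metrics` is a hypothetical helper. The numbers it prints say nothing about the project's actual results.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier, IsolationForest,
                              RandomForestClassifier)
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split


def ids_metrics(y_true, y_pred):
    """Hypothetical helper: precision, recall, and false positive rate
    with attack = 1 and benign = 0."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return {
        "precision": float(tp / (tp + fp)) if tp + fp else 0.0,
        "recall": float(tp / (tp + fn)) if tp + fn else 0.0,
        "fpr": float(fp / (fp + tn)) if fp + tn else 0.0,
    }


# Synthetic stand-in for labelled flows, about 5% attacks.
X, y = make_classification(n_samples=4000, n_features=12, n_informative=6,
                           weights=[0.95, 0.05], flip_y=0, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=1)

results = {}

# Supervised models learn from labelled attack and benign flows.
supervised = {
    "gradient_boosting": GradientBoostingClassifier(random_state=1),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=1),
}
for name, model in supervised.items():
    model.fit(X_tr, y_tr)
    results[name] = ids_metrics(y_te, model.predict(X_te))

# Isolation Forest is unsupervised: fit on benign flows, then flag outliers.
iso = IsolationForest(contamination=0.05, random_state=1).fit(X_tr[y_tr == 0])
iso_pred = (iso.predict(X_te) == -1).astype(int)  # predict() gives -1 for anomalies
results["isolation_forest"] = ids_metrics(y_te, iso_pred)

for name, m in results.items():
    print(f"{name:18s} precision={m['precision']:.3f} "
          f"recall={m['recall']:.3f} fpr={m['fpr']:.3f}")
```

Reporting the false positive rate next to precision and recall is the useful part for deployment: even a small rate, multiplied by the volume of benign flows on a real network, determines how many alerts an analyst has to triage.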

SHAP Explainability

Applied SHAP (SHapley Additive exPlanations) to identify which network features drive attack classifications. This makes the model interpretable: for any flagged flow, a security analyst can see which features pushed the prediction toward an attack label and by how much, not just that the flow was flagged.
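To show what a per-flow attribution means, the sketch below computes exact Shapley values by brute force for a tiny model with four features. It is an illustration of the underlying idea, not the project's code and not the SHAP library itself, which uses much faster algorithms for tree models. The feature names, the synthetic data, and the helper `shapley_values` are all hypothetical.

```python
from itertools import combinations
from math import factorial

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier


def shapley_values(predict, x, background):
    """Exact Shapley values for one row x. Features outside a coalition
    are filled in from the background rows and averaged out."""
    n = x.shape[0]

    def value(subset):
        idx = np.array(subset, dtype=int)
        rows = background.copy()
        rows[:, idx] = x[idx]  # coalition features take the values from x
        return predict(rows).mean()

    phi = np.zeros(n)
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                phi[i] += weight * (value(subset + (i,)) - value(subset))
    return phi


# Hypothetical feature names; the data is synthetic.
features = ["flow_duration", "fwd_packet_count", "bytes_per_second", "syn_flag_count"]
X, y = make_classification(n_samples=600, n_features=4, n_informative=3,
                           n_redundant=1, random_state=0)
model = RandomForestClassifier(n_estimators=30, random_state=0).fit(X, y)


def predict(rows):
    return model.predict_proba(rows)[:, 1]  # probability of the attack class


background = X[:100]
x = X[200]
phi = shapley_values(predict, x, background)
base = predict(background).mean()
score = predict(x[None, :])[0]

# Print features from most to least influential for this one flow.
for name, contribution in sorted(zip(features, phi), key=lambda p: -abs(p[1])):
    print(f"{name:18s} {contribution:+.4f}")
print(f"base {base:.4f} + contributions {phi.sum():+.4f} = score {score:.4f}")
```

The last line shows the additive property that makes these values readable: the contributions sum to the difference between this flow's score and the average score over the background rows.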

Technical Challenges

Class imbalance was the biggest challenge: attack traffic is a small fraction of normal traffic in both datasets, and some attack classes are rarer still. On a heavily skewed split, a naive model can reach 99% accuracy by predicting everything as benign while detecting no attacks at all. Getting meaningful precision and recall on rare attack classes required a careful resampling strategy and threshold tuning.
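Both points can be reproduced on synthetic data. The sketch below is illustrative only and rests on assumptions: a 1% attack ratio, a logistic regression as the scoring model, and a 90% recall target chosen arbitrarily. It first measures the always-benign baseline, then uses the precision-recall curve to pick the highest decision threshold that still meets the recall target.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, precision_recall_curve,
                             precision_score, recall_score)
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 1% of rows are attacks (label 1).
X, y = make_classification(n_samples=20000, n_features=10, n_informative=5,
                           weights=[0.99, 0.01], flip_y=0, class_sep=1.5,
                           random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# Baseline: label every flow benign. Accuracy looks excellent, recall is zero.
always_benign = np.zeros_like(y_val)
baseline_accuracy = accuracy_score(y_val, always_benign)
baseline_recall = recall_score(y_val, always_benign)

# Score validation flows, then tune the threshold instead of using 0.5.
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
scores = model.predict_proba(X_val)[:, 1]
precision, recall, thresholds = precision_recall_curve(y_val, scores)

target_recall = 0.90  # assumed operating requirement
meets_target = recall[:-1] >= target_recall  # recall[i] pairs with thresholds[i]
threshold = thresholds[meets_target][-1]     # thresholds ascend, so take the last

tuned_pred = (scores >= threshold).astype(int)
tuned_recall = recall_score(y_val, tuned_pred)
tuned_precision = precision_score(y_val, tuned_pred)

print(f"always-benign: accuracy={baseline_accuracy:.3f} recall={baseline_recall:.3f}")
print(f"threshold={threshold:.3f}: precision={tuned_precision:.3f} "
      f"recall={tuned_recall:.3f}")
```

The threshold here is chosen on a validation split; the resulting operating point would then need to be confirmed on a separate test set that played no part in the tuning.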

Results / Status

A functional IDS pipeline with explainable detections, demonstrating both supervised and anomaly-based detection approaches and their real-world tradeoffs. The SHAP analysis revealed the top network features most predictive of each attack category.
