Intrusion Detection Experiment (DDoS Detection)
This notebook explores machine learning-based intrusion detection using the CICIDS2017 dataset.
Workflow
- Introduction
- Data loading
- Data preprocessing
- Model training
- Evaluation
- Feature importance analysis
- Visualization
- Error analysis
Dataset
CICIDS2017 (Canadian Institute for Cybersecurity)
A flow-based network traffic dataset containing benign and attack traffic.
Each record represents a network flow with 78 features plus a class label (79 columns in total).
Dataset size: 225,745 flows
Columns: 79 (78 features + Label)
Classes: BENIGN, DDoS
Result
Accuracy: 1.00 (FP=0, FN=4)
The model achieved near-perfect accuracy, suggesting that DDoS traffic in this dataset has highly distinctive characteristics compared to benign traffic.
# ==============================
# IMPORTS
# ==============================
import os
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
import seaborn as sns
# ==============================
# CONFIG
# ==============================
SEED = 42
DATA_PATH = "Friday-WorkingHours-Afternoon-DDos.pcap_ISCX.csv"
1. Data loading
We load the CICIDS2017 dataset and strip leading and trailing whitespace from the column names.
# ==============================
# LOAD DATA
# ==============================
df = pd.read_csv(DATA_PATH)
df.columns = df.columns.str.strip()
df.head()
| | Destination Port | Flow Duration | Total Fwd Packets | Total Backward Packets | Total Length of Fwd Packets | Total Length of Bwd Packets | Fwd Packet Length Max | Fwd Packet Length Min | Fwd Packet Length Mean | Fwd Packet Length Std | ... | min_seg_size_forward | Active Mean | Active Std | Active Max | Active Min | Idle Mean | Idle Std | Idle Max | Idle Min | Label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 54865 | 3 | 2 | 0 | 12 | 0 | 6 | 6 | 6.0 | 0.0 | ... | 20 | 0.0 | 0.0 | 0 | 0 | 0.0 | 0.0 | 0 | 0 | BENIGN |
| 1 | 55054 | 109 | 1 | 1 | 6 | 6 | 6 | 6 | 6.0 | 0.0 | ... | 20 | 0.0 | 0.0 | 0 | 0 | 0.0 | 0.0 | 0 | 0 | BENIGN |
| 2 | 55055 | 52 | 1 | 1 | 6 | 6 | 6 | 6 | 6.0 | 0.0 | ... | 20 | 0.0 | 0.0 | 0 | 0 | 0.0 | 0.0 | 0 | 0 | BENIGN |
| 3 | 46236 | 34 | 1 | 1 | 6 | 6 | 6 | 6 | 6.0 | 0.0 | ... | 20 | 0.0 | 0.0 | 0 | 0 | 0.0 | 0.0 | 0 | 0 | BENIGN |
| 4 | 54863 | 3 | 2 | 0 | 12 | 0 | 6 | 6 | 6.0 | 0.0 | ... | 20 | 0.0 | 0.0 | 0 | 0 | 0.0 | 0.0 | 0 | 0 | BENIGN |
5 rows × 79 columns
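The effect of `str.strip()` on the column names can be seen on a small stand-in frame. This is only a sketch: the padded header names below are illustrative assumptions, not copied from the CSV file.

```python
import pandas as pd

# Toy frame with padded header names (illustrative, not read from the real CSV).
toy = pd.DataFrame({
    " Destination Port": [80],
    " Flow Duration": [134],
    " Label": ["BENIGN"],
})

# Before stripping, the clean name is not a valid column key.
has_label_before = "Label" in toy.columns

# Same call as in the notebook: remove leading/trailing whitespace from every name.
toy.columns = toy.columns.str.strip()
has_label_after = "Label" in toy.columns

print(has_label_before, has_label_after)
print(list(toy.columns))
```

Without this step, a later lookup such as `df["Label"]` would raise a `KeyError` if the header were padded.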
df["Label"].value_counts()
Label
DDoS      128027
BENIGN     97718
Name: count, dtype: int64
DDoS = Distributed Denial of Service
BENIGN = benign traffic
2. Data preprocessing
Map the string labels to integers so the classifier can use them as a binary target:
BENIGN → 0
DDoS → 1
# ==============================
# PREPROCESSING
# ==============================
df["Label"] = df["Label"].str.strip()
df["Label"] = df["Label"].map({
"BENIGN":0,
"DDoS":1
})
Replace infinite values with NaN, then drop every row that contains a missing value
df.replace([np.inf, -np.inf], np.nan, inplace=True)
df.dropna(inplace=True)
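It can be useful to know how many rows this cleaning step removes. The sketch below repeats the same two operations on a toy frame and counts the dropped rows; the column name `Flow Bytes/s` and all values are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Toy flows: a rate feature can become infinite when the duration is zero
# (assumed example, not taken from the dataset).
toy = pd.DataFrame({
    "Flow Duration": [134, 0, 52, 6],
    "Flow Bytes/s": [89552.2, np.inf, np.nan, 2000000.0],
    "Label": [0, 1, 0, 1],
})

rows_before = len(toy)

# Same two steps as the notebook, written without inplace=True.
cleaned = toy.replace([np.inf, -np.inf], np.nan).dropna()

rows_dropped = rows_before - len(cleaned)
print(f"dropped {rows_dropped} of {rows_before} rows")
print(cleaned["Label"].value_counts())
```

Checking the label counts after cleaning shows whether the dropped rows come mostly from one class.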
Separate the features from the labels
X = df.drop("Label", axis=1)
y = df["Label"]
3. Random Forest model training
# ==============================
# TRAIN MODEL
# ==============================
X_train, X_test, y_train, y_test = train_test_split(
X, y,
test_size=0.2,
random_state=SEED
)
model = RandomForestClassifier(
n_estimators=200,
random_state=SEED,
n_jobs=-1
)
model.fit(X_train, y_train)
RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=42)
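The split above is random but not stratified. An optional variation, not used in this notebook, is to pass `stratify=y` so that both partitions keep the same class ratio. The sketch below shows this on synthetic data; the array shapes and the 57% attack share are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

SEED = 42
rng = np.random.default_rng(SEED)

# Synthetic stand-in: 1000 flows, 5 features, roughly 57% labelled as attack (1).
X = rng.normal(size=(1000, 5))
y = (rng.random(1000) < 0.57).astype(int)

# stratify=y keeps the class ratio the same in both partitions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=SEED, stratify=y
)

print("overall attack share:", y.mean())
print("train attack share:  ", y_train.mean())
print("test attack share:   ", y_test.mean())
```

With two large and fairly balanced classes the difference is small, but stratification makes the evaluation less dependent on the random seed.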
4. Evaluation of detection performance
# ==============================
# EVALUATION
# ==============================
pred = model.predict(X_test)
print(classification_report(y_test, pred))
print(confusion_matrix(y_test, pred))
from sklearn.metrics import roc_auc_score
prob = model.predict_proba(X_test)[:,1]
auc = roc_auc_score(y_test, prob)
print("ROC-AUC:", auc)
precision recall f1-score support
0 1.00 1.00 1.00 19419
1 1.00 1.00 1.00 25724
accuracy 1.00 45143
macro avg 1.00 1.00 1.00 45143
weighted avg 1.00 1.00 1.00 45143
[[19419 0]
[ 4 25720]]
ROC-AUC: 0.9999999619645781
Note:
Near-perfect accuracy can occur on CICIDS2017 DDoS because attack traffic has highly distinctive flow-level patterns;
results may not generalize to stealthier/low-rate attacks.
The ROC-AUC score provides an additional evaluation metric for classification performance. It measures how well the model distinguishes between benign and attack traffic.
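Because the classification report rounds to two decimals, every metric appears as 1.00. The unrounded values can be recomputed from the confusion matrix printed above. A minimal sketch, using only those four counts:

```python
# Counts copied from the confusion matrix printed above
# (rows = true class, columns = predicted class; 0 = BENIGN, 1 = DDoS).
tn, fp = 19419, 0
fn, tp = 4, 25720

total = tn + fp + fn + tp

accuracy = (tp + tn) / total
precision = tp / (tp + fp)          # share of predicted attacks that are real
recall = tp / (tp + fn)             # share of real attacks that were detected
f1 = 2 * precision * recall / (precision + recall)

print(f"samples  : {total}")
print(f"accuracy : {accuracy:.6f}")
print(f"precision: {precision:.6f}")
print(f"recall   : {recall:.6f}")
print(f"f1       : {f1:.6f}")
```

The four false negatives are the only errors, so precision is exactly 1 while recall and accuracy fall slightly below it.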
5. Feature importance analysis
Feature importance indicates which network flow features contribute most to the model's decision.
# ==============================
# FEATURE IMPORTANCE
# ==============================
importances = pd.Series(
model.feature_importances_,
index=X.columns
).sort_values(ascending=False)
importances.head(15)
Avg Fwd Segment Size           0.079347
Fwd Packet Length Mean         0.075830
Fwd Packet Length Max          0.075561
Init_Win_bytes_forward         0.061795
act_data_pkt_fwd               0.051683
Subflow Fwd Bytes              0.049415
Bwd Packet Length Min          0.044783
Total Length of Fwd Packets    0.040985
Subflow Fwd Packets            0.039623
Fwd Header Length.1            0.035786
Fwd IAT Std                    0.035205
Fwd IAT Total                  0.035072
Fwd Packet Length Std          0.033320
Destination Port               0.031995
Fwd Header Length              0.029577
dtype: float64
importances.head(15).plot(kind="barh")
plt.gca().invert_yaxis()
plt.title("Top Features for DDoS Detection")
plt.show()
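The values from `feature_importances_` are impurity-based and computed on the training data, so closely related columns tend to share credit. Several top features here describe forward packet sizes, and the `.1` suffix on `Fwd Header Length.1` is how pandas marks a repeated column name, so that column is most likely a copy of `Fwd Header Length`. Permutation importance on held-out data is a complementary check that this notebook does not run. The sketch below compares the two on synthetic data; the feature names `f0` to `f7` are hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

SEED = 42

# Synthetic stand-in for the flow features (names f0..f7 are hypothetical).
X, y = make_classification(
    n_samples=600, n_features=8, n_informative=3, n_redundant=2, random_state=SEED
)
cols = [f"f{i}" for i in range(X.shape[1])]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=SEED
)

model = RandomForestClassifier(n_estimators=50, random_state=SEED)
model.fit(X_train, y_train)

# Impurity-based importances: derived from the training data, they sum to 1.
impurity = pd.Series(model.feature_importances_, index=cols)

# Permutation importances: mean drop in test accuracy when one column is shuffled.
perm = permutation_importance(model, X_test, y_test, n_repeats=5, random_state=SEED)
permuted = pd.Series(perm.importances_mean, index=cols)

table = pd.DataFrame({"impurity": impurity, "permutation": permuted})
print(table.sort_values("permutation", ascending=False))
```

Features that rank high in both columns are the safest candidates to call important.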
6. PCA visualization of traffic patterns
To better understand the structure of the traffic data, I project the high-dimensional features into two dimensions using PCA.
# ==============================
# PCA VISUALIZATION
# ==============================
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
Project the scaled features onto the first two principal components
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
Wrap the projection in a DataFrame together with the labels
pca_df = pd.DataFrame(X_pca, columns=["PC1", "PC2"])
pca_df["Label"] = y.values
Scatter plot of all flows
plt.figure(figsize=(8,6))
sns.scatterplot(
data=pca_df,
x="PC1",
y="PC2",
hue="Label",
alpha=0.5
)
plt.title("PCA Visualization of Network Traffic")
plt.show()
Plot a random sample of 2,000 flows per class to reduce overplotting
benign = pca_df[pca_df["Label"]==0].sample(2000, random_state=SEED)
ddos = pca_df[pca_df["Label"]==1].sample(2000, random_state=SEED)
plot_df = pd.concat([benign, ddos])
sns.scatterplot(
data=plot_df,
x="PC1",
y="PC2",
hue="Label"
)
<Axes: xlabel='PC1', ylabel='PC2'>
pca.explained_variance_ratio_, pca.explained_variance_ratio_.sum()
(array([0.21412279, 0.1471762 ]), np.float64(0.3612989973920781))
PC1 = 0.214
PC2 = 0.147
SUM = 0.361
The first two components explain only about 36% of the total variance, so the projection does not capture the full structure of the data. It is still sufficient to visualize the general separation between benign and DDoS traffic in two dimensions.
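To judge how much structure a 2D projection leaves out, one can fit PCA with all components and look at the cumulative explained variance. This is a sketch on synthetic data, not on the notebook's features; the 90% target and the latent-factor setup are assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Synthetic stand-in: 500 samples, 20 features built from 6 latent factors plus noise.
latent = rng.normal(size=(500, 6))
mixing = rng.normal(size=(6, 20))
X = latent @ mixing + 0.3 * rng.normal(size=(500, 20))

X_scaled = StandardScaler().fit_transform(X)

# Fit all components, then accumulate the explained variance ratios.
pca_full = PCA().fit(X_scaled)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components whose cumulative share reaches 90%.
n_for_90 = int(np.searchsorted(cumulative, 0.90) + 1)

print("share explained by first two components:", cumulative[1])
print("components needed for 90%:", n_for_90)
```

Applied to `X_scaled` from this notebook, the same steps would show how many components are needed beyond the two that are plotted.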
7. Analysis of misclassified attacks
To understand the limitations of the model, we inspect the misclassified attack samples (false negatives).
# ==============================
# ERROR ANALYSIS
# ==============================
pred_series = pd.Series(pred, index=X_test.index)
fn_index = X_test[(y_test == 1) & (pred_series == 0)].index
fn_cases = df.loc[fn_index]
fn_cases
| | Destination Port | Flow Duration | Total Fwd Packets | Total Backward Packets | Total Length of Fwd Packets | Total Length of Bwd Packets | Fwd Packet Length Max | Fwd Packet Length Min | Fwd Packet Length Mean | Fwd Packet Length Std | ... | min_seg_size_forward | Active Mean | Active Std | Active Max | Active Min | Idle Mean | Idle Std | Idle Max | Idle Min | Label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 101887 | 80 | 134 | 1 | 1 | 6 | 6 | 6 | 6 | 6.0 | 0.0 | ... | 20 | 0.0 | 0.0 | 0 | 0 | 0.0 | 0.0 | 0 | 0 | 1 |
| 82257 | 80 | 6 | 1 | 1 | 6 | 6 | 6 | 6 | 6.0 | 0.0 | ... | 20 | 0.0 | 0.0 | 0 | 0 | 0.0 | 0.0 | 0 | 0 | 1 |
| 45376 | 80 | 1663726 | 2 | 0 | 12 | 0 | 6 | 6 | 6.0 | 0.0 | ... | 20 | 0.0 | 0.0 | 0 | 0 | 0.0 | 0.0 | 0 | 0 | 1 |
| 196996 | 80 | 1974 | 3 | 0 | 18 | 0 | 6 | 6 | 6.0 | 0.0 | ... | 20 | 0.0 | 0.0 | 0 | 0 | 0.0 | 0.0 | 0 | 0 | 1 |
4 rows × 79 columns
fn_cases.mean()
df[df["Label"]==1].mean()
Destination Port 8.122740e+01
Flow Duration 1.695586e+07
Total Fwd Packets 4.472494e+00
Total Backward Packets 3.255856e+00
Total Length of Fwd Packets 3.190900e+01
...
Idle Mean 1.198550e+07
Idle Std 4.481584e+06
Idle Max 1.515447e+07
Idle Min 8.816552e+06
Label 1.000000e+00
Length: 79, dtype: float64
attack_mean = df[df["Label"]==1].mean()
comparison = pd.DataFrame({
"missed_attack": fn_cases.mean(),
"normal_attack": attack_mean
})
comparison
| | missed_attack | normal_attack |
|---|---|---|
| Destination Port | 80.00 | 8.122740e+01 |
| Flow Duration | 416460.00 | 1.695586e+07 |
| Total Fwd Packets | 1.75 | 4.472494e+00 |
| Total Backward Packets | 0.50 | 3.255856e+00 |
| Total Length of Fwd Packets | 10.50 | 3.190900e+01 |
| ... | ... | ... |
| Idle Mean | 0.00 | 1.198550e+07 |
| Idle Std | 0.00 | 4.481584e+06 |
| Idle Max | 0.00 | 1.515447e+07 |
| Idle Min | 0.00 | 8.816552e+06 |
| Label | 1.00 | 1.000000e+00 |
79 rows × 2 columns
Result: Accuracy 1.00, FN=4, FP=0 (Random Forest, CICIDS2017 Friday DDoS).
The four misclassified attacks have fewer packets (on average 1.75 forward packets vs. 4.47) and much shorter flows (mean Flow Duration of 416,460 vs. about 16,956,000) than typical DDoS traffic. This suggests that low-intensity attack flows can resemble benign traffic, which makes them harder to detect.
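The gap between missed and typical attacks is easier to read as a ratio. The sketch below rebuilds a small part of the comparison table from the values displayed above and adds a `ratio` column; only the rows visible in the output are included, and the `ratio` column is an addition for illustration.

```python
import pandas as pd

# Values copied from the comparison table above (displayed rows only).
comparison = pd.DataFrame(
    {
        "missed_attack": [416460.00, 1.75, 0.50, 10.50],
        "normal_attack": [1.695586e07, 4.472494, 3.255856, 31.909],
    },
    index=[
        "Flow Duration",
        "Total Fwd Packets",
        "Total Backward Packets",
        "Total Length of Fwd Packets",
    ],
)

# A ratio below 1 means the missed flows are smaller than the average attack flow.
comparison["ratio"] = comparison["missed_attack"] / comparison["normal_attack"]

print(comparison.sort_values("ratio"))
```

On these rows, the mean flow duration of the missed attacks is about 2.5% of the attack average, which is the largest relative difference among the displayed features.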