Dataset Preprocessing and Imbalance
Created: January 24, 2025
Examining dataset characteristics
I noticed an "Unnamed: 0" column and missing values in some columns. Progress is steady as I begin to address data preprocessing and visualization needs.
Identifying gaps
I found several columns with a single missing value each, including "Flow ID" and packet-related columns, as well as multiple problem rows in "CWE Flag Count" caused by its object dtype.
Investigation and adjustments
I'm piecing together insights on missing values in various columns and examining anomalies like negative values in "Flow Duration" and unusually high values in "Protocol." Adjustments will follow.
Refining the dataset
I’m refining the dataset by removing empty or duplicate columns, converting object types to numeric, and addressing missing or negative values. Finally, I’ll normalize numeric columns and visualize 'Label' distribution.
Refining and visualizing
I’m refining the dataset by removing artifacts like "Unnamed: 0," renaming columns, and managing missing values. Finally, I’ll visualize 'Label' distribution with a bar plot using Seaborn or Matplotlib.
Thought about dataset analysis and preprocessing for 10 seconds
Below is an example of how you can perform a thorough preprocessing pass on your dataset and visualize the class imbalance. The code is commented step by step, explaining what is being done and why. Feel free to adjust it to the particularities of your data and project requirements.
From your output, here are some key observations:

Dimensions:
The dataset includes an Unnamed: 0 column that looks like an extra index (often created when saving or merging datasets). This can typically be dropped if it has no analytical value.

Missing Values:
Flow ID has 1 missing value, Bwd Packets/s has 1 missing value, and columns such as CWE Flag Count or Down/Up Ratio may have a couple of missing or invalid values and are stored as object instead of numeric.

Data Types:
Most columns are float64 or int64. A few are object but actually contain numeric data (e.g., CWE Flag Count, Down/Up Ratio, Fwd Avg Bytes/Bulk). We should convert them to numeric types. Flow ID is an ID-like column; flow IDs, IP addresses, and timestamps are usually not used directly as numeric features for machine learning, so they can be dropped unless you need them for a specific purpose.

Possible Anomalies:
Flow Duration has min = -1. You may need to remove or investigate negative or otherwise impossible values. Flow Bytes/s and similarly derived rates can also take extreme values, which can happen if the duration is near zero.

Label Distribution:
The Label column has 4 unique values. From the descriptive statistics, the classes include Android_Adware, Benign, etc., and Android_Adware has the highest frequency (147443).

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# ----------------------------------------------------------------------------------
# 1. LOAD DATA
#    Replace 'your_dataset.csv' with your actual file path if needed.
# ----------------------------------------------------------------------------------
df = pd.read_csv('your_dataset.csv', low_memory=False)

# ----------------------------------------------------------------------------------
# 2. BASIC CLEANING
# ----------------------------------------------------------------------------------
# (A) Drop or rename the 'Unnamed: 0' column if it is just an extra index.
if 'Unnamed: 0' in df.columns:
    df.drop(columns=['Unnamed: 0'], inplace=True)

# (B) Strip any leading/trailing spaces in column names (some columns have spaces).
df.columns = [col.strip() for col in df.columns]

# (C) Check for duplicates and remove them, if it makes sense for your domain.
df.drop_duplicates(inplace=True)

# (D) Convert columns that should be numeric but are read as object.
#     Identify which columns are supposed to be numeric. Below is an example list
#     based on your DataFrame info. Adjust as needed.
columns_to_convert = [
    'CWE Flag Count', 'Down/Up Ratio',
    'Fwd Avg Bytes/Bulk', 'Fwd Avg Packets/Bulk', 'Fwd Avg Bulk Rate',
    'Bwd Avg Bytes/Bulk', 'Bwd Avg Packets/Bulk', 'Bwd Avg Bulk Rate'
]
for col in columns_to_convert:
    if col in df.columns:
        df[col] = pd.to_numeric(df[col], errors='coerce')  # errors='coerce' turns invalid parsing into NaN

# (E) Re-check for missing values after conversions.
missing_values = df.isnull().sum()
print("Missing values after type conversions:\n", missing_values)

# (F) Handle missing values.
#     Given that the dataset is quite large and only a few rows have NaNs,
#     a common approach is to drop rows with any missing values. Alternatively,
#     you can fill them with means/medians if that makes sense for your domain.
df.dropna(inplace=True)

# (G) Remove or fix invalid numeric values (e.g., negative flow durations).
#     For instance, if Flow Duration < 0 is invalid, drop those rows:
if 'Flow Duration' in df.columns:
    df = df[df['Flow Duration'] >= 0]

# ----------------------------------------------------------------------------------
# 3. FEATURE SELECTION (OPTIONAL)
#    Often we drop columns that are IDs, IP addresses, or timestamps that aren't
#    meant to be used as numeric features (they won't help classification/regression
#    models in typical flow-based analyses). Keep them if you want time-based splits.
# ----------------------------------------------------------------------------------
columns_to_drop = ['Flow ID', 'Source IP', 'Destination IP', 'Timestamp']
for col in columns_to_drop:
    if col in df.columns:
        df.drop(columns=[col], inplace=True)

# ----------------------------------------------------------------------------------
# 4. NORMALIZATION / SCALING
#    Typically you separate out your target variable (e.g., 'Label'),
#    then apply scaling to the numeric features only.
# ----------------------------------------------------------------------------------
if 'Label' in df.columns:
    labels = df['Label']
    df.drop(columns=['Label'], inplace=True)
else:
    labels = None

# Example: Using MinMaxScaler OR StandardScaler on numeric features
numeric_cols = df.select_dtypes(include=[np.number]).columns

# You can choose either scaler depending on your approach:
# scaler = MinMaxScaler()   # Scales each feature to [0, 1]
scaler = StandardScaler()   # Transforms each feature to have 0 mean, unit variance
df[numeric_cols] = scaler.fit_transform(df[numeric_cols])

# (Re)Attach the label column at the end.
df['Label'] = labels

# ----------------------------------------------------------------------------------
# 5. CLASS IMBALANCE VISUALIZATION
#    Plot a bar chart of the counts for each label.
# ----------------------------------------------------------------------------------
plt.figure(figsize=(6, 4))
sns.countplot(data=df, x='Label', order=df['Label'].value_counts().index)
plt.title("Class Distribution (Label)")
plt.xlabel("Label")
plt.ylabel("Count")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

# Alternatively, just using pandas:
# df['Label'].value_counts().plot(kind='bar')
# plt.show()

# Now you have a cleaned, scaled dataframe, and a visualization of class imbalance.
# Print final info
print("\nDataFrame shape after preprocessing:", df.shape)
print("\nData Sample:")
print(df.head())
```
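One thing the block above does not handle explicitly: rate columns such as Flow Bytes/s (and Flow Packets/s, which is assumed here and skipped if absent) can hold infinite or extremely large values when Flow Duration is near zero, as noted in the observations. A minimal sketch of how you might neutralize those values alongside the dropna step:

```python
import numpy as np

# Assumption: rate-like columns may hold +/-inf when Flow Duration is near zero.
# Convert infinities to NaN so they can be dropped (or imputed) like other gaps.
rate_cols = [c for c in ['Flow Bytes/s', 'Flow Packets/s'] if c in df.columns]
if rate_cols:
    df[rate_cols] = df[rate_cols].replace([np.inf, -np.inf], np.nan)
    print("Rows with infinite rate values:", df[rate_cols].isna().any(axis=1).sum())
    df.dropna(subset=rate_cols, inplace=True)
```

The remaining paragraphs walk through the rationale for the main steps above.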
Drop Unnamed: 0:
This column often appears if you saved a DataFrame with index=True previously. If it holds no valuable information, remove it.
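If you also control how the cleaned data gets written back to disk, you can avoid creating this column in the first place. A small sketch, continuing from the df above ('cleaned_dataset.csv' is just a placeholder path):

```python
# Saving without the index avoids creating a stray "Unnamed: 0" on the next read.
df.to_csv('cleaned_dataset.csv', index=False)

# If a file already contains a saved index, it can also be absorbed at load time:
# df = pd.read_csv('your_dataset.csv', index_col=0)
```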
Trim Column Names:
Some columns had leading/trailing spaces (e.g., " Source IP"). This can cause confusion when referencing columns.
Drop Duplicates:
Ensures each row is unique if you do not want repeated flows. Decide if duplicates are truly duplicates or different segments of the same flow.
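Before dropping anything, it can help to see how many duplicates exist and whether they are exact copies or repeats of the same flow identifiers. A hedged sketch, meant to run before the ID columns are dropped (the identifier names are assumed from the output above):

```python
# How many rows are exact copies of another row?
print("Exact duplicate rows:", df.duplicated().sum())

# Duplicates judged only on flow-identifying columns may be later segments
# of the same flow rather than true copies.
id_cols = [c for c in ['Flow ID', 'Timestamp'] if c in df.columns]
if id_cols:
    print("Rows repeating the same flow identifiers:",
          df.duplicated(subset=id_cols).sum())
```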
Convert object -> float:
Columns like CWE Flag Count and Down/Up Ratio were read as object. Converting them to numeric ensures you can do mathematical operations or scaling on them. errors='coerce' will turn any non-parsable values into NaN, which you can handle later.
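If you want to know what errors='coerce' is about to discard, you can peek at the offending raw strings before the conversion step runs. A minimal sketch (using two of the columns from the conversion list above):

```python
import pandas as pd

# Show the raw strings that would become NaN under errors='coerce'.
for col in ['CWE Flag Count', 'Down/Up Ratio']:
    if col in df.columns and df[col].dtype == object:
        parsed = pd.to_numeric(df[col], errors='coerce')
        bad_values = df.loc[parsed.isna() & df[col].notna(), col].unique()
        print(f"{col}: unparsable values -> {bad_values[:10]}")
```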
Handle Missing Values:
A small number of missing values can simply be dropped in a large dataset, unless there is a domain reason to fill them. Alternatively, you can fill with mean/median.
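If you prefer imputation over dropping, a minimal median-fill sketch for the numeric columns (assuming the few NaNs are random noise rather than meaningful signals) could look like this:

```python
import numpy as np

# Fill each numeric column's NaNs with that column's median.
numeric_cols = df.select_dtypes(include=[np.number]).columns
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median())
```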
Remove Invalid Values:
Negative Flow Duration is likely invalid. Dropping those rows or setting them to zero depends on domain knowledge.
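It is also worth counting how many rows each sanity rule would remove before applying it, so a single rule does not silently discard a large slice of the data. For instance:

```python
# Report how many rows the negative-duration rule removes before filtering.
if 'Flow Duration' in df.columns:
    n_negative = int((df['Flow Duration'] < 0).sum())
    print(f"Negative Flow Duration rows: {n_negative} "
          f"({n_negative / len(df):.4%} of the data)")
    df = df[df['Flow Duration'] >= 0]
```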
Feature Selection:
Columns like Flow ID, Timestamp, or IP addresses are rarely used directly as features in typical classification tasks. Removing them can reduce noise.
Scaling:
StandardScaler transforms each numeric feature to zero mean and unit variance, while MinMaxScaler rescales each feature into the [0,1] range. Either works here; pick the one that suits your downstream models.
Class Imbalance Visualization:
countplot from Seaborn or a simple Pandas value_counts().plot(kind='bar') helps confirm whether you have a skewed distribution among classes (e.g., most samples might be Android_Adware, with fewer from other malware families or Benign traffic); a short sketch for quantifying the skew follows below.

With this outline, you should have a cleaner, more consistent dataset ready for machine learning pipelines, and a clear visualization of class distribution. Adjust specific steps (columns to drop, threshold for missing values, normalization approach) to match your project's needs.
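If the plot confirms a heavy skew (e.g., Android_Adware dominating), a small follow-up is to quantify the imbalance and derive class weights for later modeling. This is a sketch rather than part of the pipeline above, and it assumes scikit-learn is available:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Share of each class, most common first.
print(df['Label'].value_counts(normalize=True))

# 'balanced' weights are inversely proportional to class frequency; many
# scikit-learn estimators accept them through their class_weight parameter.
classes = np.unique(df['Label'])
weights = compute_class_weight(class_weight='balanced', classes=classes, y=df['Label'])
print(dict(zip(classes, weights)))
```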