Dataset Preprocessing and Imbalance
作成日: 2025年1月24日
作成日: 2025年1月24日
Analyze CODE that i used for my dataset and OUTPUT from CODE. By analyzing everything. Write me code for Preprocessing steps: cleaning, normalization, etc. and Visualize class imbalance (e.g., using a bar plot). DEEPLY ANALYZE EVERYTHING AND DO Preprocessing steps: cleaning, normalization, etc. and Visualize class imbalance
CODE : {
import pandas as pd
csv_file_path = 'your_dataset.csv'
df = pd.read_csv(csv_file_path, low_memory=False) # low_memory=False helps with large files
print("Dataset Dimensions (rows, columns):", df.shape)
print("\nColumn Names:\n", df.columns.tolist())
print("\nColumn Data Types:\n", df.dtypes)
print("\nDataFrame Info:")
df.info()
print("\nDescriptive Statistics (including object columns):")
print(df.describe(include='all'))
missing_values = df.isnull().sum()
print("\nNumber of Missing Values per Column:\n", missing_values)
print("\nFirst 5 Rows of the Dataset:")
print(df.head())
print("\nLast 5 Rows of the Dataset:")
print(df.tail())
unique_counts = df.nunique()
print("\nNumber of Unique Values per Column:\n", unique_counts)
}
OUTPUT : {
Dataset Dimensions (rows, columns): (355630, 86)
Column Names:
['Unnamed: 0', 'Flow ID', ' Source IP', ' Source Port', ' Destination IP', ' Destination Port', ' Protocol', ' Timestamp', ' Flow Duration', ' Total Fwd Packets', ' Total Backward Packets', 'Total Length of Fwd Packets', ' Total Length of Bwd Packets', ' Fwd Packet Length Max', ' Fwd Packet Length Min', ' Fwd Packet Length Mean', ' Fwd Packet Length Std', 'Bwd Packet Length Max', ' Bwd Packet Length Min', ' Bwd Packet Length Mean', ' Bwd Packet Length Std', 'Flow Bytes/s', ' Flow Packets/s', ' Flow IAT Mean', ' Flow IAT Std', ' Flow IAT Max', ' Flow IAT Min', 'Fwd IAT Total', ' Fwd IAT Mean', ' Fwd IAT Std', ' Fwd IAT Max', ' Fwd IAT Min', 'Bwd IAT Total', ' Bwd IAT Mean', ' Bwd IAT Std', ' Bwd IAT Max', ' Bwd IAT Min', 'Fwd PSH Flags', ' Bwd PSH Flags', ' Fwd URG Flags', ' Bwd URG Flags', ' Fwd Header Length', ' Bwd Header Length', 'Fwd Packets/s', ' Bwd Packets/s', ' Min Packet Length', ' Max Packet Length', ' Packet Length Mean', ' Packet Length Std', ' Packet Length Variance', 'FIN Flag Count', ' SYN Flag Count', ' RST Flag Count', ' PSH Flag Count', ' ACK Flag Count', ' URG Flag Count', ' CWE Flag Count', ' ECE Flag Count', ' Down/Up Ratio', ' Average Packet Size', ' Avg Fwd Segment Size', ' Avg Bwd Segment Size', ' Fwd Header Length.1', 'Fwd Avg Bytes/Bulk', ' Fwd Avg Packets/Bulk', ' Fwd Avg Bulk Rate', ' Bwd Avg Bytes/Bulk', ' Bwd Avg Packets/Bulk', 'Bwd Avg Bulk Rate', 'Subflow Fwd Packets', ' Subflow Fwd Bytes', ' Subflow Bwd Packets', ' Subflow Bwd Bytes', 'Init_Win_bytes_forward', ' Init_Win_bytes_backward', ' act_data_pkt_fwd', ' min_seg_size_forward', 'Active Mean', ' Active Std', ' Active Max', ' Active Min', 'Idle Mean', ' Idle Std', ' Idle Max', ' Idle Min', 'Label']
Column Data Types:
Unnamed: 0           int64
Flow ID             object
Source IP          object
Source Port         int64
Destination IP     object
...
Idle Mean          float64
Idle Std          float64
Idle Max          float64
Idle Min          float64
Label               object
Length: 86, dtype: object
DataFrame Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 355630 entries, 0 to 355629
Data columns (total 86 columns):
0   Unnamed: 0                    355630 non-null  int64
1   Flow ID                       355629 non-null  object
2    Source IP                    355630 non-null  object
3    Source Port                  355630 non-null  int64
4    Destination IP               355630 non-null  object
5    Destination Port             355630 non-null  float64
6    Protocol                     355630 non-null  float64
7    Timestamp                    355630 non-null  object
8    Flow Duration                355630 non-null  int64
9    Total Fwd Packets            355630 non-null  int64
10   Total Backward Packets       355630 non-null  int64
11  Total Length of Fwd Packets   355630 non-null  float64
12   Total Length of Bwd Packets  355630 non-null  float64
13   Fwd Packet Length Max        355630 non-null  float64
14   Fwd Packet Length Min        355630 non-null  float64
15   Fwd Packet Length Mean       355630 non-null  float64
16   Fwd Packet Length Std        355630 non-null  float64
17  Bwd Packet Length Max         355630 non-null  float64
18   Bwd Packet Length Min        355630 non-null  float64
19   Bwd Packet Length Mean       355630 non-null  float64
20   Bwd Packet Length Std        355630 non-null  float64
21  Flow Bytes/s                  355630 non-null  float64
22   Flow Packets/s               355630 non-null  float64
23   Flow IAT Mean                355630 non-null  float64
24   Flow IAT Std                 355630 non-null  float64
25   Flow IAT Max                 355630 non-null  float64
26   Flow IAT Min                 355630 non-null  float64
27  Fwd IAT Total                 355630 non-null  float64
28   Fwd IAT Mean                 355630 non-null  float64
29   Fwd IAT Std                  355630 non-null  float64
30   Fwd IAT Max                  355630 non-null  float64
31   Fwd IAT Min                  355630 non-null  float64
32  Bwd IAT Total                 355630 non-null  float64
33   Bwd IAT Mean                 355630 non-null  float64
34   Bwd IAT Std                  355630 non-null  float64
35   Bwd IAT Max                  355630 non-null  float64
36   Bwd IAT Min                  355630 non-null  float64
37  Fwd PSH Flags                 355630 non-null  float64
38   Bwd PSH Flags                355630 non-null  float64
39   Fwd URG Flags                355630 non-null  int64
40   Bwd URG Flags                355630 non-null  int64
41   Fwd Header Length            355630 non-null  int64
42   Bwd Header Length            355630 non-null  int64
43  Fwd Packets/s                 355630 non-null  float64
44   Bwd Packets/s                355629 non-null  float64
45   Min Packet Length            355629 non-null  float64
46   Max Packet Length            355629 non-null  float64
47   Packet Length Mean           355629 non-null  float64
48   Packet Length Std            355629 non-null  float64
49   Packet Length Variance       355629 non-null  float64
50  FIN Flag Count                355629 non-null  float64
51   SYN Flag Count               355629 non-null  float64
52   RST Flag Count               355629 non-null  float64
53   PSH Flag Count               355629 non-null  float64
54   ACK Flag Count               355629 non-null  float64
55   URG Flag Count               355629 non-null  float64
56   CWE Flag Count               355629 non-null  object
57   ECE Flag Count               355628 non-null  float64
58   Down/Up Ratio                355628 non-null  object
59   Average Packet Size          355627 non-null  float64
60   Avg Fwd Segment Size         355627 non-null  float64
61   Avg Bwd Segment Size         355627 non-null  float64
62   Fwd Header Length.1          355627 non-null  float64
63  Fwd Avg Bytes/Bulk            355627 non-null  object
64   Fwd Avg Packets/Bulk         355626 non-null  float64
65   Fwd Avg Bulk Rate            355626 non-null  float64
66   Bwd Avg Bytes/Bulk           355626 non-null  float64
67   Bwd Avg Packets/Bulk         355626 non-null  float64
68  Bwd Avg Bulk Rate             355626 non-null  float64
69  Subflow Fwd Packets           355626 non-null  float64
70   Subflow Fwd Bytes            355626 non-null  float64
71   Subflow Bwd Packets          355626 non-null  float64
72   Subflow Bwd Bytes            355626 non-null  float64
73  Init_Win_bytes_forward        355626 non-null  float64
74   Init_Win_bytes_backward      355626 non-null  float64
75   act_data_pkt_fwd             355626 non-null  float64
76   min_seg_size_forward         355626 non-null  float64
77  Active Mean                   355626 non-null  float64
78   Active Std                   355626 non-null  float64
79   Active Max                   355626 non-null  float64
80   Active Min                   355626 non-null  float64
81  Idle Mean                     355626 non-null  float64
82   Idle Std                     355626 non-null  float64
83   Idle Max                     355626 non-null  float64
84   Idle Min                     355626 non-null  float64
85  Label                         355630 non-null  object
dtypes: float64(69), int64(9), object(8)
memory usage: 233.3+ MB
Descriptive Statistics (including object columns):
Unnamed: 0                Flow ID    Source IP   Source Port  
count   355630.000000                 355629       355630  3.556300e+05
unique            NaN                 176955         2536           NaN
top               NaN  8.0.6.4-8.6.0.1-0-0-0  10.42.0.211           NaN
freq              NaN                   2097       178753           NaN
mean      1945.719231                    NaN          NaN  3.914129e+04
std       1840.477646                    NaN          NaN  1.794632e+04
min          0.000000                    NaN          NaN  0.000000e+00
25%        640.000000                    NaN          NaN  3.455900e+04
50%       1568.000000                    NaN          NaN  4.322600e+04
75%       2718.000000                    NaN          NaN  5.228300e+04
max      14157.000000                    NaN          NaN  1.045267e+06
    Destination IP   Destination Port       Protocol            Timestamp  \
count           355630       3.556300e+05  355630.000000               355630
unique            4746                NaN            NaN                72218
top          10.42.0.1                NaN            NaN  30/06/2017 01:14:36
freq             67213                NaN            NaN                  135
mean               NaN       5.202541e+03       9.765190                  NaN
std                NaN       1.452135e+04     750.541237                  NaN
min                NaN       0.000000e+00       0.000000                  NaN
25%                NaN       8.000000e+01       6.000000                  NaN
50%                NaN       4.430000e+02       6.000000                  NaN
75%                NaN       4.430000e+02       6.000000                  NaN
max                NaN       1.328159e+06  442719.666667                  NaN
     Flow Duration   Total Fwd Packets  ...   min_seg_size_forward  \
count     3.556300e+05       355630.000000  ...           3.556260e+05
unique             NaN                 NaN  ...                    NaN
top                NaN                 NaN  ...                    NaN
freq               NaN                 NaN  ...                    NaN
mean      1.092975e+07            7.433718  ...          -2.831850e+04
std       2.180861e+07           81.015658  ...           5.543875e+06
min      -1.000000e+00            0.000000  ...          -1.395063e+09
25%       4.888600e+04            1.000000  ...           2.000000e+01
50%       5.602255e+05            2.000000  ...           3.200000e+01
75%       1.076907e+07            5.000000  ...           3.200000e+01
max       1.199999e+08        27165.000000  ...           1.705248e+08
     Active Mean    Active Std    Active Max    Active Min     Idle Mean  \
count   3.556260e+05  3.556260e+05  3.556260e+05  3.556260e+05  3.556260e+05
unique           NaN           NaN           NaN           NaN           NaN
top              NaN           NaN           NaN           NaN           NaN
freq             NaN           NaN           NaN           NaN           NaN
mean    1.628143e+05  2.195225e+04  1.851564e+05  1.456534e+05  4.025724e+06
std     8.817765e+05  2.860244e+05  1.019477e+06  8.423659e+05  1.441518e+07
min     0.000000e+00  0.000000e+00  0.000000e+00  0.000000e+00  0.000000e+00
25%     0.000000e+00  0.000000e+00  0.000000e+00  0.000000e+00  0.000000e+00
50%     0.000000e+00  0.000000e+00  0.000000e+00  0.000000e+00  0.000000e+00
75%     0.000000e+00  0.000000e+00  0.000000e+00  0.000000e+00  0.000000e+00
max     9.718488e+07  4.020815e+07  9.718488e+07  9.718488e+07  1.199514e+08
        Idle Std      Idle Max      Idle Min           Label  
count   3.556260e+05  3.556260e+05  3.556260e+05          355630
unique           NaN           NaN           NaN               4
top              NaN           NaN           NaN  Android_Adware
freq             NaN           NaN           NaN          147443
mean    3.182527e+05  4.304876e+06  3.779037e+06             NaN
std     2.882901e+06  1.514008e+07  1.407042e+07             NaN
min     0.000000e+00  0.000000e+00  0.000000e+00             NaN
25%     0.000000e+00  0.000000e+00  0.000000e+00             NaN
50%     0.000000e+00  0.000000e+00  0.000000e+00             NaN
75%     0.000000e+00  0.000000e+00  0.000000e+00             NaN
max     7.737303e+07  1.199514e+08  1.199514e+08             NaN
[11 rows x 86 columns]
Number of Missing Values per Column:
Unnamed: 0         0
Flow ID            1
Source IP         0
Source Port       0
Destination IP    0
..
Idle Mean          4
Idle Std          4
Idle Max          4
Idle Min          4
Label              0
Length: 86, dtype: int64
First 5 Rows of the Dataset:
Unnamed: 0                                Flow ID    Source IP  
0           0  172.217.6.202-10.42.0.211-443-50004-6  10.42.0.211
1           1  172.217.6.202-10.42.0.211-443-35455-6  10.42.0.211
2           2  131.253.61.68-10.42.0.211-443-51775-6  10.42.0.211
3           3  131.253.61.68-10.42.0.211-443-51775-6  10.42.0.211
4           4  131.253.61.68-10.42.0.211-443-51776-6  10.42.0.211
Source Port  Destination IP   Destination Port   Protocol  \
0         50004   172.217.6.202              443.0        6.0
1         35455   172.217.6.202              443.0        6.0
2         51775   131.253.61.68              443.0        6.0
3         51775   131.253.61.68              443.0        6.0
4         51776   131.253.61.68              443.0        6.0
         Timestamp   Flow Duration   Total Fwd Packets  ...  \
0  13/06/2017 11:52:39           37027                   1  ...
1  13/06/2017 11:52:39           36653                   1  ...
2  13/06/2017 11:52:42          534099                   8  ...
3  13/06/2017 11:52:43            9309                   3  ...
4  13/06/2017 11:52:42        19890496                   8  ...
min_seg_size_forward  Active Mean   Active Std   Active Max   Active Min  \
0                   32.0          0.0          0.0          0.0          0.0
1                   32.0          0.0          0.0          0.0          0.0
2                   20.0          0.0          0.0          0.0          0.0
3                   20.0          0.0          0.0          0.0          0.0
4                   20.0          0.0          0.0          0.0          0.0
Idle Mean   Idle Std   Idle Max   Idle Min           Label
0        0.0        0.0        0.0        0.0  Android_Adware
1        0.0        0.0        0.0        0.0  Android_Adware
2        0.0        0.0        0.0        0.0  Android_Adware
3        0.0        0.0        0.0        0.0  Android_Adware
4        0.0        0.0        0.0        0.0  Android_Adware
[5 rows x 86 columns]
Last 5 Rows of the Dataset:
Unnamed: 0                                  Flow ID     Source IP  
355625         405      172.217.7.14-10.42.0.211-80-38405-6  172.217.7.14
355626         406         10.42.0.211-10.42.0.1-7632-53-17   10.42.0.211
355627         407  10.42.0.211-104.192.110.245-45970-443-6   10.42.0.211
355628         408        10.42.0.211-10.42.0.1-51982-53-17   10.42.0.211
355629         409         10.42.0.211-10.42.0.1-9320-53-17   10.42.0.211
     Source Port   Destination IP   Destination Port   Protocol  \
355625            80      10.42.0.211            38405.0        6.0
355626          7632        10.42.0.1               53.0       17.0
355627         45970  104.192.110.245              443.0        6.0
355628         51982        10.42.0.1               53.0       17.0
355629          9320        10.42.0.1               53.0       17.0
              Timestamp   Flow Duration   Total Fwd Packets  ...  \
355625  17/06/2017 01:29:11          126711                   1  ...
355626  17/06/2017 01:30:33           48012                   1  ...
355627  17/06/2017 01:29:45        20028018                  11  ...
355628  17/06/2017 01:29:45          347926                   1  ...
355629  17/06/2017 01:30:33          125473                   1  ...
     min_seg_size_forward  Active Mean   Active Std   Active Max  \
355625                   32.0          0.0          0.0          0.0
355626                   32.0          0.0          0.0          0.0
355627                   20.0     367528.0          0.0     367528.0
355628                   20.0          0.0          0.0          0.0
355629                   32.0          0.0          0.0          0.0
     Active Min   Idle Mean   Idle Std    Idle Max    Idle Min   Label  
355625          0.0         0.0        0.0         0.0         0.0  Benign
355626          0.0         0.0        0.0         0.0         0.0  Benign
355627     367528.0  19660490.0        0.0  19660490.0  19660490.0  Benign
355628          0.0         0.0        0.0         0.0         0.0  Benign
355629          0.0         0.0        0.0         0.0         0.0  Benign
[5 rows x 86 columns]
Number of Unique Values per Column:
Unnamed: 0          14158
Flow ID            176955
Source IP           2536
Source Port        51343
Destination IP      4746
...
Idle Mean           39823
Idle Std            8887
Idle Max           39789
Idle Min           39784
Label                   4
Length: 86, dtype: int64
}
Below is an example of how you can perform a deep preprocessing of your dataset and visualize class imbalance. The code is commented step-by-step, explaining what is being done and why. Feel free to adjust it to the particularities of your data and project requirements.
From your output, here are some key observations:
Dimensions:
Unnamed: 0 column that looks like an extra index (often created when saving or merging datasets). This can typically be dropped if it has no analytical value.Missing Values:
Flow ID has 1 missing value,Bwd Packets/s has 1 missing value,CWE Flag Count or Down/Up Ratio may have a couple of missing or invalid values and are stored as object instead of numeric.Data Types:
float64 or int64.object but actually contain numeric data (e.g., CWE Flag Count, Down/Up Ratio, Fwd Avg Bytes/Bulk). We should convert them to numeric types.Flow ID is an ID-like column. Often, flow IDs, IP addresses, or timestamps are not directly used as numeric features for machine learning. They can be dropped unless you need them for a specific purpose.Possible Anomalies:
Flow Duration has min = -1). You may need to remove or investigate negative or otherwise impossible values.Flow Bytes/s or similarly derived rates that can happen if the duration is near zero.Label Distribution:
Label column has 4 unique values. From the descriptive statistics:
Android_Adware, Benign, etc.Android_Adware has the highest frequency (147443).pythonimport pandas as pd import numpy as np import matplotlib.pyplot as plt import seaborn as sns from sklearn.preprocessing import MinMaxScaler, StandardScaler # ---------------------------------------------------------------------------------- # 1. LOAD DATA # Replace 'your_dataset.csv' with your actual file path if needed. # ---------------------------------------------------------------------------------- df = pd.read_csv('your_dataset.csv', low_memory=False) # ---------------------------------------------------------------------------------- # 2. BASIC CLEANING # ---------------------------------------------------------------------------------- # (A) Drop or rename the 'Unnamed: 0' column if it is just an extra index. if 'Unnamed: 0' in df.columns: df.drop(columns=['Unnamed: 0'], inplace=True) # (B) Strip any leading/trailing spaces in column names (some columns have spaces). df.columns = [col.strip() for col in df.columns] # (C) Check for duplicates and remove them, if it makes sense for your domain. df.drop_duplicates(inplace=True) # (D) Convert columns that should be numeric but are read as object. # Identify which columns are supposed to be numeric. Below is an example list # based on your DataFrame info. Adjust as needed. columns_to_convert = [ 'CWE Flag Count', 'Down/Up Ratio', 'Fwd Avg Bytes/Bulk', 'Fwd Avg Packets/Bulk', 'Fwd Avg Bulk Rate', 'Bwd Avg Bytes/Bulk', 'Bwd Avg Packets/Bulk', 'Bwd Avg Bulk Rate' ] for col in columns_to_convert: if col in df.columns: df[col] = pd.to_numeric(df[col], errors='coerce') # errors='coerce' turns invalid parsing into NaN # (E) Re-check for missing values after conversions missing_values = df.isnull().sum() print("Missing values after type conversions:\n", missing_values) # (F) Handle missing values. # Given that the dataset is quite large and only a few rows have NaNs, # a common approach is to drop rows with any missing values. Alternatively, # you can fill them with means/medians if that makes sense for your domain. df.dropna(inplace=True) # (G) Remove or fix invalid numeric values (e.g., negative flow durations). # For instance, if Flow Duration < 0 is invalid, drop those rows: if 'Flow Duration' in df.columns: df = df[df['Flow Duration'] >= 0] # ---------------------------------------------------------------------------------- # 3. FEATURE SELECTION (OPTIONAL) # ---------------------------------------------------------------------------------- # Often we drop columns that are IDs, IP addresses, or Timestamps that aren’t # meant to be used as numeric features (they won’t help classification/regression # models in typical flow-based analyses). Keep them if you want time-based splits. columns_to_drop = ['Flow ID', 'Source IP', 'Destination IP', 'Timestamp'] for col in columns_to_drop: if col in df.columns: df.drop(columns=[col], inplace=True) # ---------------------------------------------------------------------------------- # 4. NORMALIZATION / SCALING # ---------------------------------------------------------------------------------- # Typically you separate out your target variable (e.g., 'Label'). # Then apply scaling to numeric features only. if 'Label' in df.columns: labels = df['Label'] df.drop(columns=['Label'], inplace=True) else: labels = None # Example: Using MinMaxScaler OR StandardScaler on numeric features numeric_cols = df.select_dtypes(include=[np.number]).columns # You can choose either scaler depending on your approach: # scaler = MinMaxScaler() # Scales each feature to [0, 1] scaler = StandardScaler() # Transforms each feature to have 0 mean, unit variance df[numeric_cols] = scaler.fit_transform(df[numeric_cols]) # (Re)Attach the label column at the end df['Label'] = labels # ---------------------------------------------------------------------------------- # 5. CLASS IMBALANCE VISUALIZATION # ---------------------------------------------------------------------------------- # Plot a bar chart of the counts for each label. plt.figure(figsize=(6,4)) sns.countplot(data=df, x='Label', order=df['Label'].value_counts().index) plt.title("Class Distribution (Label)") plt.xlabel("Label") plt.ylabel("Count") plt.xticks(rotation=45) plt.tight_layout() plt.show() # Alternatively, just using pandas: # df['Label'].value_counts().plot(kind='bar') # plt.show() # Now you have a cleaned, scaled dataframe, and a visualization of class imbalance. # ---------------------------------------------------------------------------------- # Print final info print("\nDataFrame shape after preprocessing:", df.shape) print("\nData Sample:") print(df.head())
Drop Unnamed: 0:
This column often appears if you saved a DataFrame with index=True previously. If it holds no valuable information, remove it.
Trim Column Names:
Some columns had leading/trailing spaces (e.g., " Source IP"). This can cause confusion when referencing columns.
Drop Duplicates:
Ensures each row is unique if you do not want repeated flows. Decide if duplicates are truly duplicates or different segments of the same flow.
Convert object -> float:
Columns like CWE Flag Count, Down/Up Ratio were read as object. Converting them to numeric ensures you can do mathematical operations or scaling on them. errors='coerce' will turn any non-parsable values into NaN, which you can handle later.
Handle Missing Values:
A small number of missing values can simply be dropped in a large dataset, unless there is a domain reason to fill them. Alternatively, you can fill with mean/median.
Remove Invalid Values:
Negative Flow Duration is likely invalid. Dropping or setting them to zero depends on domain knowledge.
Feature Selection:
Some columns like Flow ID, Timestamp, or IP addresses are rarely used directly as features in typical classification tasks. Removing them can reduce noise.
Scaling:
[0,1] range.Class Imbalance Visualization:
countplot from Seaborn or a simple Pandas value_counts().plot(kind='bar') helps confirm if you have a skewed distribution among classes (e.g., most samples might be Android_Adware, fewer from other malware families, or Benign traffic).With this outline, you should have a cleaner, more consistent dataset ready for machine learning pipelines, and a clear visualization of class distribution. Adjust specific steps (columns to drop, threshold for missing values, normalization approach) to match your project’s needs.