Malware Profiling and Classification using Machine Learning Algorithms

Authors

Rathee, Hanisha

Issue Date

2024

Degree

MSc in Data Analytics

Publisher

Dublin Business School

Abstract

This study, "Malware Profiling and Classification using Machine Learning Algorithms", compares several machine learning models for malware detection and profiling, with the aim of strengthening cybersecurity. Three algorithms, a Support Vector Machine (SVM), a Random Forest, and an Autoencoder, were trained on historical malware data to identify and classify complex malware threats. They were selected for their record in pattern recognition and anomaly detection on large datasets. The data was preprocessed to ensure accuracy and relevance before the models were trained to recognize malware patterns. Each model was evaluated using a 70% training and 30% validation split, and performance was assessed using accuracy, precision, recall, F1 score, the ROC curve, and AUC. The SVM model separated safe from dangerous software well, with an AUC of 0.99, while the Random Forest and Autoencoder models both reached the ideal AUC of 1.0, indicating very few false positives, which is crucial in malware detection systems. The Random Forest model achieved a near-perfect accuracy of 0.9999, and its precision of 1.0 suggests that it predicted no false positives. Its recall of 0.9998 shows that it recognized almost all real malware instances, and its F1 score of 0.9999 confirmed balanced predictive performance. The Autoencoder model had an accuracy of 0.9959, a precision of 0.9955, and an F1 score that matched the Random Forest. Its recall of 0.9962 indicated that it was somewhat better at detecting true positives than the Random Forest model. Its AUC was 1.0, and its ROC curve showed that it separated the two classes with confidence. A comparison of ROC curves showed that the Random Forest and Autoencoder models outperformed the SVM.
The ROC curve shows how effectively a binary classifier distinguishes malware from benign software: the more closely the curve follows the left-hand and top boundaries of the ROC space, the more accurate the model. Both the Random Forest and Autoencoder models show excellent classifier behavior, with ROC curves that reach into the top-left corner. All models examined were effective; however, the Autoencoder model was somewhat better, making it the preferred technique for malware classification and profiling.
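
As an illustration of the evaluation metrics named in the abstract, the following is a minimal Python sketch of how accuracy, precision, recall, F1 score, and AUC can be computed for a binary malware classifier. It is not the thesis code: the function names, the 0.5 decision threshold, and the toy labels and scores are all assumptions made for this example, with 1 standing for malware and 0 for benign software.

```python
from typing import Dict, Sequence, Tuple


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[int, int, int, int]:
    """Return (tp, fp, tn, fn) for binary labels, where 1 = malware and 0 = benign."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn


def classification_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Compute accuracy, precision, recall and F1 from the confusion counts."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    # Precision of 1.0 means no benign file was flagged as malware (no false positives).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall of 1.0 means no malware sample was missed (no false negatives).
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a random malware sample receives a higher
    score than a random benign sample (ties count as one half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one sample of each class")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data, invented for illustration only (not results from the study).
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.4, 0.3, 0.2, 0.6, 0.7, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed threshold of 0.5

print(classification_metrics(y_true, y_pred))
print(roc_auc(y_true, scores))
```

The pairwise form of AUC used here is equivalent to the area under the ROC curve and makes the interpretation explicit: an AUC of 1.0 means every malware sample was scored above every benign sample. It is quadratic in the number of samples, so a library routine would be the better choice for datasets of realistic size.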