Detection of Hindi Spam Emails Using NLP

Authors

Parshuram Dhamale, Shraddha

Issue Date

2023

Degree

MSc in Business Analytics

Publisher

Dublin Business School

Rights

Abstract

In modern times, the business and education sectors rely on email for collaboration and interaction. Email is a fast and easy means of both brief and extended communication. As email grows into an effective way of exchanging information, it also attracts unsolicited bulk messages, or spam. Such emails harvest sensitive personal or business information and promote pornographic material or marketing services. Because the Hindi and English languages are so dissimilar, detecting spam emails in Hindi is challenging. Spam detection tactics are broadly characterized as context-based or non-context-based. In this paper, we reviewed and assessed a range of prior research. The findings of previous research assist in the development of spam detection algorithms for a variety of platforms, including social media, email, and text messaging. This project aims to increase the precision and efficacy of spam identification in order to improve user experiences, defend users from potential threats or malicious activities, and keep online communication channels safe. Over the past five years, researchers have widely employed Natural Language Processing (NLP) techniques to detect spam emails in the English language. These methods analyze the textual content of emails in order to identify features that discriminate between legitimate and spam messages. The aim of this research is to develop an efficient system for identifying and filtering spam emails in Hindi using similar techniques. A reliable Hindi spam detection system is needed, and the limited research available for the Hindi language was a major challenge in this field. We propose a system that reliably detects Hindi spam emails using NLP. We analyzed and studied multiple machine learning techniques, namely Logistic Regression, Random Forest, Decision Tree, Naive Bayes, and Support Vector Classifier, ultimately choosing logistic regression to construct the system. The system achieves an average accuracy of 97.72% under K-fold cross-validation.
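The evaluation idea named in the abstract, logistic regression scored by K-fold cross-validation, can be illustrated with a minimal sketch. This is not the thesis implementation: the whitespace tokenizer, the bag-of-words features, the plain gradient-descent training loop, the helper names (`tokenize`, `train_logreg`, `predict`, `k_fold_accuracy`), and the ten toy Hindi messages are all assumptions made for illustration, and the accuracy it yields says nothing about the reported 97.72%.

```python
import math
import random
from collections import Counter

def tokenize(text):
    # Assumption: simple whitespace tokenization, which also works for Devanagari.
    return text.split()

def train_logreg(docs, labels, epochs=200, lr=0.5):
    """Fit a bag-of-words logistic regression with stochastic gradient descent."""
    weights, bias = Counter(), 0.0
    for _ in range(epochs):
        for tokens, y in zip(docs, labels):
            z = bias + sum(weights[t] for t in tokens)
            z = max(-30.0, min(30.0, z))          # keep exp() numerically safe
            p = 1.0 / (1.0 + math.exp(-z))        # predicted spam probability
            err = y - p                           # gradient of the log-likelihood
            bias += lr * err
            for t in tokens:
                weights[t] += lr * err
    return weights, bias

def predict(model, tokens):
    weights, bias = model
    score = bias + sum(weights.get(t, 0.0) for t in tokens)
    return 1 if score > 0 else 0                  # 1 = spam, 0 = legitimate

def k_fold_accuracy(texts, labels, k=5, seed=0):
    """Average held-out accuracy over k folds (requires k <= len(texts))."""
    docs = [tokenize(t) for t in texts]
    idx = list(range(len(docs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    scores = []
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        model = train_logreg([docs[i] for i in train], [labels[i] for i in train])
        correct = sum(predict(model, docs[i]) == labels[i] for i in fold)
        scores.append(correct / len(fold))
    return sum(scores) / k

# Hypothetical toy corpus, invented for this sketch (1 = spam, 0 = legitimate).
TEXTS = [
    "मुफ्त इनाम जीतें अभी क्लिक करें",
    "लॉटरी जीतें मुफ्त पैसा पाएं",
    "अभी क्लिक करें और इनाम पाएं",
    "मुफ्त ऑफर सीमित समय अभी खरीदें",
    "बधाई आपने लॉटरी जीती पैसा पाएं",
    "कल बैठक सुबह दस बजे है",
    "कृपया रिपोर्ट आज शाम तक भेजें",
    "बैठक का समय बदल गया है",
    "परियोजना की रिपोर्ट संलग्न है",
    "कल कक्षा दस बजे शुरू होगी",
]
LABELS = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

if __name__ == "__main__":
    print("mean accuracy over 5 folds:", k_fold_accuracy(TEXTS, LABELS, k=5))
```

Each message is held out exactly once, so the averaged score reflects performance on text the model did not see during training, which is the reason K-fold averaging is preferred over a single train/test split on small corpora.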