Machine learning approach for identifying suspicious uniform resource locators (URLs) on Reddit social network
Publisher
Routledge Taylor and Francis Group
Description
The value of the Internet for real-time information sharing is hard to overstate. Its benefits, however, are being
seriously undermined by phishing, which is widespread across cyberspace and frustrates the efforts of the Global Cyber
Alliance, an agency with the singular purpose of reducing cyber risk. Various researchers have proposed solutions to
phishing, but these solutions are considered inefficient and unreliable, as evident in the conflicting claims by their
authors. Against this backdrop, this work attempts to find the best approach to identifying suspicious uniform resource
locators (URLs) on the Reddit social network. It addresses two major questions. First, how can suspicious URLs be
identified on the Reddit social network with machine learning techniques? Second, how can internet users be
safeguarded from unreliable and fake URLs on the Reddit social network? Six machine learning algorithms (AdaBoost,
Gradient Boost, Random Forest, Linear SVM, Decision Tree, and Naïve Bayes) were trained using features obtained from
the Reddit social network and additional processing. A total of 532,403 posts were analyzed, of which only 87,083 were
considered suitable for training the models. The best-performing algorithm was AdaBoost, with an accuracy of 95.5% and
a precision of 97.57%.
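
For readers unfamiliar with the two metrics reported above, the following is a minimal sketch of how accuracy and precision are conventionally computed for a binary classifier. It is not the authors' evaluation code. The function names, the label encoding (1 = suspicious, 0 = benign), and the toy labels are assumptions made purely for illustration and are not drawn from the study's data.

```python
from typing import Dict, Sequence


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, int]:
    """Count true/false positives and negatives.

    Assumed encoding (hypothetical): 1 = suspicious URL, 0 = benign URL.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for truth, pred in zip(y_true, y_pred):
        if pred == 1 and truth == 1:
            counts["tp"] += 1
        elif pred == 1 and truth == 0:
            counts["fp"] += 1
        elif pred == 0 and truth == 0:
            counts["tn"] += 1
        else:
            counts["fn"] += 1
    return counts


def accuracy(counts: Dict[str, int]) -> float:
    """Fraction of all predictions that are correct: (TP + TN) / total."""
    total = sum(counts.values())
    return (counts["tp"] + counts["tn"]) / total if total else 0.0


def precision(counts: Dict[str, int]) -> float:
    """Fraction of URLs flagged as suspicious that truly are: TP / (TP + FP)."""
    flagged = counts["tp"] + counts["fp"]
    return counts["tp"] / flagged if flagged else 0.0


# Toy labels invented for illustration only (not the study's data).
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]

counts = confusion_counts(y_true, y_pred)
print(counts, accuracy(counts), precision(counts))
```

Under these definitions, a high precision means that few benign URLs are wrongly flagged as suspicious, while accuracy summarizes correctness over both classes.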
Keywords
QA Mathematics, QA75 Electronic computers. Computer science