International Research Journal of Engineering and Technology (IRJET) Volume: 09 Issue: 06 | Jun 2022
www.irjet.net
e-ISSN: 2395-0056 p-ISSN: 2395-0072
Hate Speech Identification Using Machine Learning
Nikhilraj Gadekar1, Mario Pinto2
1Student, Information Technology Department, Goa College of Engineering, India
2Assistant Professor, Information Technology Department, Goa College of Engineering, India
---------------------------------------------------------------------***---------------------------------------------------------------------
Abstract - With the rise of social media, people have the liberty to express themselves. Many users misuse this liberty, and abuse and hate spread across social media platforms in the form of comments, blogs, etc. Machine learning is widely used in various fields of research, and we intend to use it to automate the detection of hate speech. In this paper, we use subjectivity analysis and semantic features to create a lexicon, which is then used to build a classifier that identifies hate speech.
Key Words: Hate Speech, Hostile, Subjectivity Analysis, Lexicon, Machine Learning, Cyber-bullying
1. INTRODUCTION
With a population of 1.40 billion people that continues to grow steadily, India has the second-largest population in the world. India also has one of the fastest-growing economies in the world, which makes it a very important market for social media companies.
People in India spend about 2.36 hours per day on social media on average. The country’s internet penetration rate stands at 47%, and it has 467 million social media users as of 2022, a number that is expected to grow.
The internet is making the world smaller, but the positives come with negatives. One of the major drawbacks of social media is hate speech, which also goes by names such as abusive writing and cyberbullying. The National Crime Records Bureau figures show a 36% increase in cyberstalking and cyberbullying cases in India since the pandemic. Around 9.2% of 630 adolescents surveyed in the Delhi-National Capital Region had experienced cyberbullying, and half of them had not reported it to teachers, guardians, or the social media companies concerned, according to a recent study by Child Rights and You, a non-governmental organization. Identifying hate speech manually is slow, tedious, and labor-intensive, so automatic hate speech detection becomes very important. Although Hindi is the third most spoken language in the world and has a significant presence on social media platforms, we could not find much work on automatically detecting hate speech in Hindi.
Twitter is the 3rd most widely used social media platform in India, with 295.44 million active users. Political and social issues discussed on Twitter are heavily polarizing: because people are strongly attached to their political ideology, the threads discussing these issues tend to contain a lot of hate.
The dataset used in this paper consists of tweets that are mostly political and social. It is split into two major categories, hostile and non-hostile, and the hostile category is further divided into fake, defamation, hate, and offensive.
In our research, we propose a model that classifies tweets as fake, defamation, hate, offensive, or non-hostile. We use a rule-based approach with three major steps: (I) subjectivity analysis, (II) building a hate speech lexicon, and (III) identifying theme-based nouns.
The rest of the paper is organized as follows. Sections 2 and 3 describe related work and the methodology. Section 4 discusses the dataset, and Section 5 presents the experimental results. Section 6 concludes the paper.
2. RELATED WORKS
A lot of work on hate speech detection on social media platforms has been done for English and many other Western languages, but very little has been done for low-resource languages like Hindi, which is the 3rd most spoken language in the world. The authors of [1] collected 197,566 comments from four social media platforms (YouTube, Reddit, Wikipedia, and Twitter), with 80% of the comments labeled as non-hateful and the remaining 20% labeled as hateful. They used the classification algorithms Logistic Regression (LR), Naïve Bayes (NB), Support Vector Machines (SVM), Extreme Gradient Boosting (XGBoost), and Feed-Forward Neural Network (FNN) in combination with several feature engineering techniques (simple features, BOW, TF-IDF, Word2Vec, and BERT), and found that XGBoost with BERT gives the best accuracy. The authors of [2] collected a Bangla-language dataset from Facebook and labeled the data as Hate or Non-Hate. They used TF-IDF to extract features and SVM and Naïve Bayes for classification, achieving accuracies of 70% and 72% respectively. The research in [3] presents an approach to detect offense in memes using Natural Language
© 2022, IRJET | Impact Factor value: 7.529 | ISO 9001:2008 Certified Journal | Page 3398