IMPACT OF THE SYNTACTIC DEPENDENCIES IN THE SENTENCES ON THE QUALITY OF THE IDENTIFICATION OF THE TOXIC COMMENTS IN THE SOCIAL NETWORKS

  • Serhiy Shtovba Vinnytsia National Technical University
  • Olena Shtovba Vinnytsia National Technical University
  • Olexandr Yahymovych Vinnytsia National Technical University
  • Mykola Petrychko Vinnytsia National Technical University
Keywords: text mining, natural language processing, syntactic dependencies, toxic comments, social network, identification, machine learning, feature selection

Abstract

Social networks often become a medium for threats, insults, and other forms of cyberbullying. Because a huge number of people use online social networks, there is a need to automate the protection of users from anti-social behavior. One important task in this area is the identification of toxic comments, that is, comments that contain threats, insults, obscenities, etc. Bag-of-words and bag-of-symbols statistics are the typical features for toxic comment identification. This article studies the effect of syntactic dependencies in sentences on the quality of identification of toxic comments in social networks. The syntactic dependencies considered are relationships with proper nouns, personal pronouns, possessive pronouns, etc. In total, 20 syntactic features of sentences were tested. The article shows that 3 additional specific features significantly improve the quality of toxic comment identification. These three features are: the number of dependencies with proper nouns in the singular, the number of dependencies that contain bad words, and the number of dependencies between personal pronouns and bad words. The experiments are based on data from the Kaggle competition "Toxic Comment Classification Challenge". The original Kaggle task of categorizing toxic comments was modified into a classification task with two alternatives: a neutral comment and a toxic comment. For our experiments, the original dataset of 159751 comments was reduced to 106590 comments because of problems with the automatic (human-free) extraction of the syntactic features. The toxic comment rate in the modified dataset is 12.8%. Because the dataset is unbalanced, the quality metric is the mean of the error rates for each type of misclassification. A decision tree is used as the classifier. The decision trees were synthesized for two splitting rules: the Gini index and the entropy criterion.
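The quality metric and the two splitting rules named in the abstract can be illustrated with a minimal sketch. It assumes scikit-learn and purely synthetic data; the function name `mean_class_error`, the three feature columns, the tree depth, and the train/test split are hypothetical choices for illustration and are not the authors' code, features, or dataset.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier


def mean_class_error(y_true, y_pred):
    """Mean of the two per-class error rates (1 = toxic, 0 = neutral)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    toxic = y_true == 1
    neutral = y_true == 0
    miss_rate = np.mean(y_pred[toxic] != 1)           # toxic labeled as neutral
    false_alarm_rate = np.mean(y_pred[neutral] != 0)  # neutral labeled as toxic
    return 0.5 * (miss_rate + false_alarm_rate)


# Synthetic stand-in data (assumption): roughly 12.8% toxic comments and
# three count features loosely mimicking the ones listed in the abstract.
rng = np.random.default_rng(0)
n = 2000
y = (rng.random(n) < 0.128).astype(int)
X = np.column_stack([
    rng.poisson(1.0, n),         # hypothetical: dependencies with singular proper nouns
    rng.poisson(0.2 + 2.0 * y),  # hypothetical: dependencies containing bad words
    rng.poisson(0.1 + 1.0 * y),  # hypothetical: personal pronoun / bad word dependencies
])

# Simple holdout split (assumption): first 1500 rows for training, rest for testing.
X_train, X_test = X[:1500], X[1500:]
y_train, y_test = y[:1500], y[1500:]

# Fit one tree per splitting rule and score each with the mean class error.
errors = {}
for criterion in ("gini", "entropy"):
    tree = DecisionTreeClassifier(criterion=criterion, max_depth=5, random_state=0)
    tree.fit(X_train, y_train)
    errors[criterion] = mean_class_error(y_test, tree.predict(X_test))
    print(criterion, round(errors[criterion], 3))
```

The choice of metric matters for unbalanced data: a classifier that labels every comment as neutral would have an ordinary error rate of only 12.8% on such a dataset, but its mean class error is 0.5, since it misses every toxic comment.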

Author Biographies

Serhiy Shtovba, Vinnytsia National Technical University

Dr. Sc. (Eng.), Professor, Professor with the Computer Control Systems Department

Olena Shtovba, Vinnytsia National Technical University

Associate Professor, PhD, Associate Professor with the Department of Management, Marketing and Economics

Olexandr Yahymovych, Vinnytsia National Technical University

PhD student, Automation and Intelligent Information Technology

Mykola Petrychko, Vinnytsia National Technical University

Student, Department of Computer Systems and Automation

Published
2019-11-27
How to Cite
[1]
S. Shtovba, O. Shtovba, O. Yahymovych, and M. Petrychko, “IMPACT OF THE SYNTACTIC DEPENDENCIES IN THE SENTENCES ON THE QUALITY OF THE IDENTIFICATION OF THE TOXIC COMMENTS IN THE SOCIAL NETWORKS”, SWVNTU, no. 4, Nov. 2019.
Section
Information Technologies and Computer Engineering