E-Locus - Institutional Repository of the University of Crete

Doctoral theses

Identifier

000463783

Title

Deception detection from text in a multilingual and multicultural context

Alternative Title

Ανίχνευση εξαπάτησης από κείμενο σε πολυγλωσσικό και πολυπολιτισμικό πλαίσιο

Author

Παπαντωνίου, Αικατερίνη Χ.

Thesis advisor

Πλεξουσάκης, Δημήτριος

Reviewer

Φλουρής, Γεώργιος
Τζίτζικας, Γιάννης
Αργυρός, Αντώνιος
Κομοδάκης, Νίκος
Ανδρουτσόπουλος, Ίων
Σταματάτος, Ευστάθιος

Abstract

Automatic deception detection is a crucial task with many applications in both direct physical and computer-mediated human communication. In this thesis, we focus on automatic deception detection in text across cultures and in different languages. In this context, we view culture through the prism of the individualism/collectivism dimension and approximate culture by using country as a proxy. Starting from recent conclusions drawn from social psychology, we explore whether differences in the usage of specific linguistic deception cues across cultures can be confirmed and attributed to cultural norms with respect to the individualism/collectivism divide. In addition, we investigate whether a universal feature set for cross-cultural text deception detection tasks exists. To these ends, we performed a thorough statistical analysis (Mann-Whitney tests and multiple logistic regression) over eleven datasets in five languages (English, Dutch, Russian, Spanish and Romanian) from six countries (United States of America, Belgium, India, Russia, Mexico and Romania). The analysis showed the absence of a universal feature set, as well as the volatility and sensitivity of the deception cues even across domains and genres within the same culture/language. Furthermore, the analysis revealed some differences in deception cues across cultures and languages (e.g., in the expression of sentiment) and, at the same time, the cross-cultural validity of others. To evaluate the predictive power of different feature sets and approaches, we created culture/language-aware classifiers by experimenting with a wide range of n-gram features from several levels of linguistic analysis (namely phonology, morphology and syntax), other linguistic cues such as word and phoneme counts and pronoun use, and token embeddings. We also experimented with combinations of these features, using the aforementioned datasets for training and testing.
We applied two classification methods, namely logistic regression and fine-tuned BERT models, both monolingual and cross-lingual. Overall, fine-tuning BERT outperforms the other approaches, but interestingly there are cases in which combining BERT embeddings with linguistic features is beneficial. The experimentation with multilingual embeddings, as a case of zero-shot transfer learning, also showed promising results. We introduce a new dataset of April Fools’ Day articles in the Greek language. To the best of our knowledge, this is the first publicly available deception dataset for Greek. The conclusions drawn from a similar analysis of this dataset, and from a comparison with an English April Fools’ Day dataset, mainly align with the results of the first part of the thesis. Lastly, we focus on how well various automatic deception detection models generalize to unseen distributions and domains. Using a rich set of diverse testing data in English and Spanish, we explore the performance gap between cue-based models, BERT-type models, and their combination. Generalization techniques from the literature are also considered in an effort to enhance the generalization capabilities of the models. Transformer-based approaches overall outperform cue-only approaches, but both the infusion of explicit deception cues and the generalization techniques are beneficial.
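As an illustration of the kind of per-cue comparison the abstract mentions, the following is a minimal sketch of a two-sided Mann-Whitney U test, written in plain Python with a normal approximation, tie correction, and no continuity correction. It is not the thesis's actual analysis code: the function name `mann_whitney_u`, the choice of cue (first-person pronoun rate per document), and the sample values are all hypothetical and chosen only for illustration.

```python
from math import erfc, sqrt


def mann_whitney_u(xs, ys):
    """Return (U statistic of xs, two-sided p-value).

    The p-value uses the normal approximation with a tie correction,
    which is only reasonable for moderately large samples.
    """
    n1, n2 = len(xs), len(ys)
    n = n1 + n2
    # Pool both samples, remembering which group each value came from.
    pooled = sorted([(v, 0) for v in xs] + [(v, 1) for v in ys])

    rank_sum_x = 0.0
    tie_term = 0.0
    i = 0
    while i < n:
        # Find the run of tied values starting at position i.
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # mean of ranks i+1 .. j
        t = j - i
        tie_term += t ** 3 - t
        rank_sum_x += avg_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j

    u1 = rank_sum_x - n1 * (n1 + 1) / 2.0
    mean_u = n1 * n2 / 2.0
    var_u = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return u1, 1.0  # all values identical: no evidence of a difference
    z = (u1 - mean_u) / sqrt(var_u)
    return u1, erfc(abs(z) / sqrt(2))  # two-sided tail of the standard normal


# Hypothetical cue values: first-person pronoun rate per document.
truthful = [0.061, 0.048, 0.055, 0.070, 0.052, 0.066]
deceptive = [0.031, 0.044, 0.029, 0.050, 0.038, 0.035]
u, p = mann_whitney_u(truthful, deceptive)
print(f"U = {u:.1f}, p = {p:.4f}")
```

In practice a library routine such as `scipy.stats.mannwhitneyu` would normally be preferred over a hand-written version; the sketch only makes the rank-based logic of the test explicit.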

Language

English

Subject

Culture

Machine learning

NLP

Επεξεργασία φυσικής γλώσσας

Κουλτούρα

Μηχανική μάθηση

Issue date

2024-03-22

Collection