Abstract
Automatic deception detection is a crucial task that has many applications both in direct
physical and in computer-mediated human communication. In this thesis, we focus on automatic deception detection in text across cultures and in different languages. In this context,
we view culture through the prism of the individualism/collectivism dimension and we approximate culture by using country as a proxy. Having as a starting point recent conclusions
drawn from the social psychology discipline, we explore if differences in the usage of specific
linguistic deception cues across cultures can be confirmed and attributed to cultural norms
with respect to the individualism/collectivism divide. In addition, we investigate if a universal feature set for cross-cultural text deception detection tasks exists. For these goals,
we performed a thorough statistical analysis (Mann-Whitney tests and Multiple Logistic Regression) over eleven datasets from five languages (English, Dutch, Russian, Spanish and
Romanian), from six countries (United States of America, Belgium, India, Russia, Mexico
and Romania). The analysis showed the absence of a universal feature set and also the
volatility and sensitivity of the deception cues even across domains and genres in the same
culture/language. Furthermore, the analysis revealed some differences in deception cues
across cultures and languages (e.g., in the expression of sentiment) and, at the same time, the
cross-cultural validity of some others.
To evaluate the predictive power of different feature sets and approaches, we created
culture/language-aware classifiers by experimenting with a wide range of n-gram features
from several levels of linguistic analysis, namely phonology, morphology and syntax, other
linguistic cues like word and phoneme counts, pronoun use, etc., and token embeddings. We
also experimented with the combination of these features while the aforementioned datasets
were employed for training/testing. We applied two classification methods, namely logistic
regression and fine-tuned BERT models, both monolingual and cross-lingual. Overall, the
fine-tuning of the BERT model outperforms the other approaches but, interestingly, there are
cases in which the combination of BERT embeddings with linguistic features is beneficial. The
experimentation with multilingual embeddings, as a case of zero-shot transfer learning, also
showed promising results.
We introduce a new dataset in the context of April Fools’ Day articles for the Greek
language. To the best of our knowledge, this is the first publicly available deception dataset
for Greek. The conclusions drawn from an analysis similar to the above, and from a comparison with
an English April Fools’ Day dataset, mainly align with the results of the first part of the
thesis.
Lastly, we focus on how well various automatic deception detection models can generalize
to unseen distributions and domains. Using a rich set of diverse testing data in English and in
Spanish, we explore the performance gap between cue-based models and BERT-type models
and their combination. Generalization techniques from the literature are also considered
in an effort to enhance the generalization capabilities of the models. Transformer-based
approaches overall outperform cue-only-based approaches, but both the infusion of explicit
cues of deception and the generalization techniques are beneficial.