
30.10.2009

A journal article supported by MADSPAM 2.0

The article:

T. Lavergne, T. Urvoy, F. Yvon, Filtering artificial texts with statistical machine learning techniques (pdf)

will soon be published in the journal "Language Resources and Evaluation (LRE), special issue on Plagiarism and Authorship Analysis".

Abstract: Fake content is flourishing on the Internet, ranging from basic random word salads
to web scraping. Most of this fake content is generated for the purpose of nourishing
fake web sites aimed at biasing search engine indexes: at the scale of a search engine, using
automatically generated texts render such sites harder to detect than using copies of existing
pages. In this paper, we present three methods aimed at distinguishing natural texts from
artificially generated ones: the first method uses basic lexicometric features, the second one
uses standard language models and the third one is based on a relative entropy measure
which captures short range dependencies between words. Our experiments show that
lexicometric features and language models are efficient to detect most generated texts, but fail to
detect texts that are generated with high order Markov models. By comparison our relative
entropy scoring algorithm, especially when trained on a large corpus, allows to detect these
“hard” text generators with a high degree of accuracy.
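
To give a rough idea of what scoring a text with a "standard language model" can look like, here is a minimal Python sketch. It is not the authors' implementation: the toy corpus, the bigram order, the add-one smoothing and the function names `train_bigram` and `cross_entropy` are all assumptions made for illustration. The idea is simply that a word salad tends to receive a higher per-word cross-entropy than a natural sentence under a model trained on natural text.

```python
import math
from collections import Counter


def train_bigram(tokens):
    """Count bigrams and their left contexts in a list of tokens (toy model)."""
    contexts = Counter(tokens[:-1])             # how often each word is followed by something
    bigrams = Counter(zip(tokens, tokens[1:]))  # how often each adjacent pair occurs
    vocab_size = len(set(tokens))
    return contexts, bigrams, vocab_size


def cross_entropy(tokens, model):
    """Average negative log2 probability per bigram, with add-one smoothing."""
    contexts, bigrams, vocab_size = model
    v = vocab_size + 1  # one extra slot reserved for unseen words (assumption)
    pairs = list(zip(tokens, tokens[1:]))
    total = 0.0
    for prev, word in pairs:
        p = (bigrams[(prev, word)] + 1) / (contexts[prev] + v)
        total -= math.log2(p)
    return total / len(pairs)


# Hypothetical toy corpus standing in for a large collection of natural text.
corpus = ("the cat sat on the mat . the dog sat on the rug . "
          "the cat saw the dog .").split()
model = train_bigram(corpus)

natural = "the dog sat on the mat .".split()
salad = "mat the on . dog sat the".split()  # same words, shuffled

print("natural:", cross_entropy(natural, model))
print("salad:  ", cross_entropy(salad, model))
```

On this toy data the shuffled sequence scores higher (worse) than the natural one. As the abstract notes, this kind of scoring is expected to break down against text produced by high-order Markov generators, which is the case the relative entropy method is meant to address.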
