30.10.2009

A journal article supported by MADSPAM 2.0

The article:

T. Lavergne, T. Urvoy, F. Yvon, Filtering artificial texts with statistical machine learning techniques (pdf)

It will soon be published in the journal "Language Resources and Evaluation (LRE), special issue on Plagiarism and Authorship Analysis".

Abstract: Fake content is flourishing on the Internet, ranging from basic random word salads
to web scraping. Most of this fake content is generated for the purpose of nourishing
fake web sites aimed at biasing search engine indexes: at the scale of a search engine, using
automatically generated texts renders such sites harder to detect than using copies of existing
pages. In this paper, we present three methods aimed at distinguishing natural texts from
artificially generated ones: the first method uses basic lexicometric features, the second one
uses standard language models and the third one is based on a relative entropy measure
which captures short-range dependencies between words. Our experiments show that lexicometric
features and language models are effective at detecting most generated texts, but fail to
detect texts that are generated with high-order Markov models. By comparison, our relative
entropy scoring algorithm, especially when trained on a large corpus, makes it possible to detect
these "hard" text generators with a high degree of accuracy.
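
To give a rough idea of what scoring a text with a language model can look like, here is a minimal sketch. It is not the authors' implementation: it assumes a toy bigram model with add-one smoothing, and the names train_bigram and cross_entropy, as well as the example sentences, are hypothetical and chosen for illustration only. A text whose word order resembles the reference corpus should receive a lower cross-entropy than a word salad built from the same words.

```python
import math
from collections import Counter


def train_bigram(tokens):
    """Count unigrams and bigrams in a list of tokens (hypothetical helper)."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def cross_entropy(tokens, unigrams, bigrams):
    """Average negative log2 probability per bigram, with add-one smoothing.

    This is an illustrative assumption, not the scoring used in the paper.
    """
    vocab_size = len(unigrams) + 1  # one extra slot for unseen words
    pairs = list(zip(tokens, tokens[1:]))
    if not pairs:
        raise ValueError("need at least two tokens")
    total = 0.0
    for prev, word in pairs:
        p = (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)
        total -= math.log2(p)
    return total / len(pairs)


# Toy reference corpus standing in for a large collection of natural text.
reference = "the cat sat on the mat and the dog sat on the rug".split()
unigrams, bigrams = train_bigram(reference)

natural = "the dog sat on the mat".split()
salad = "mat the on dog sat the".split()  # same words, shuffled

natural_score = cross_entropy(natural, unigrams, bigrams)
salad_score = cross_entropy(salad, unigrams, bigrams)
print(natural_score < salad_score)
```

As the abstract notes, a score of this kind is expected to struggle with texts produced by high-order Markov generators, since such texts reproduce the local word statistics that the model checks.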

05.10.2009

ANR STIC Colloquium

The ANR STIC colloquium will take place in Paris, at La Villette, from 5 to 7 January 2010.


Autumn meeting at LIP6

MADSPAM meeting on Tuesday 20 October, from 2:00 pm to 4:30 pm, at LIP6.

Agenda:

- 2010 deliverables

- review of 12 October

Because of a strike at the SNCF, we will make do with a conference call (cf. e-mail).