
RJCRI 06 : Unnatural Language Detection

In the context of web search engines, the escalation between ranking techniques and spamdexing techniques has led to the appearance of faked content in web pages. While random sequences of keywords are easily detectable, web pages produced by dedicated content generators are much more difficult to detect. Motivated by search engine applications, we focus on the problem of automatic unnatural language detection. We study both the syntactic and the semantic aspects of this problem, and for each of them we present probabilistic and symbolic approaches.

@InProceedings{lavergne2006rjcri,
	author    = {Lavergne, Thomas},
	title     = {Unnatural Language Detection},
	booktitle = {Young Scientists' Conference on Information Retrieval ({RJCRI}'06)},
	year      = {2006},
	pages     = {383--388},
	month     = {3},
	location  = {Lyon, France},
	url       = {http://www.irit.fr/ARIA/2006/383.pdf}
}

AIRWeb 06 : Tracking Web Spam with Hidden Style Similarity

Automatically generated content is ubiquitous in the web: dynamic sites built using the three-tier paradigm are good examples (e.g. commercial sites, blogs and other sites powered by web authoring software), as well as less legitimate spamdexing attempts (e.g. link farms, faked directories).
Those pages built using the same generating method (template or script) share a common "look and feel" that is not easily detected by common text classification methods, but is more related to stylometry.
In this paper, we present a (hidden) style similarity measure based on extra-textual features in HTML source code. We also describe a method to cluster a large collection of documents according to this measure. Since the clustering algorithm is based on fingerprints, we also briefly review fingerprinting.
By conveniently sorting the generated clusters, one can efficiently track back instances of a particular automatic content generation method among web pages collected using a crawler. This is particularly useful to detect pages across different sites sharing the same design, which is often a good hint of either a spamdexing attempt or mirrored content.

@InProceedings{urvoy2006airweb,
	author    = {Urvoy, Tanguy and Lavergne, Thomas and Filoche, Pascal},
	title     = {Tracking Web Spam with Hidden Style Similarity},
	booktitle = {International Workshop on Adversarial Information Retrieval on the Web ({AIRW}eb'06)},
	year      = {2006},
	pages     = {25--31},
	month     = {8},
	location  = {Seattle, Washington, {USA}},
	url       = {http://airweb.cse.lehigh.edu/2006/urvoy.pdf}
}

ACM Tweb : Tracking Web Spam with HTML Style Similarities

Automatically generated content is ubiquitous in the web: dynamic sites built using the three-tier paradigm are good examples (e.g., commercial sites, blogs and other sites edited using web authoring software), as well as less legitimate spamdexing attempts (e.g., link farms, faked directories).
Those pages built using the same generating method (template or script) share a common look and feel that is not easily detected by common text classification methods, but is more related to stylometry.
In this work we study and compare several HTML style similarity measures based on both textual and extra-textual features in HTML source code. We also propose a flexible algorithm to cluster a large collection of documents according to these measures. Since the proposed algorithm is based on locality sensitive hashing (LSH), we first review this technique.
We then describe how to use the HTML style similarity clusters to pinpoint dubious pages and enhance the quality of spam classifiers. We present an evaluation of our algorithm on the WEBSPAM-UK2006 dataset.
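
As an illustration of the LSH-based clustering idea mentioned above, here is a minimal sketch, not the authors' algorithm. It assumes a MinHash scheme over shingles of HTML tag names, with banding to form candidate groups; the feature choice and the helper names (`tag_shingles`, `minhash_signature`, `lsh_groups`) are hypothetical and chosen only for this example.

```python
import hashlib
import re


def tag_shingles(html, k=3):
    """Set of k-grams over the sequence of HTML tag names (page text is ignored)."""
    tags = [t.lower() for t in re.findall(r"<\s*(/?[a-zA-Z][a-zA-Z0-9]*)", html)]
    return {" ".join(tags[i:i + k]) for i in range(max(len(tags) - k + 1, 1))}


def _hash(seed, shingle):
    """Deterministic 64-bit hash of a shingle for a given seed (assumed hash family)."""
    digest = hashlib.sha1(f"{seed}:{shingle}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(shingles, num_hashes=32):
    """MinHash signature: the minimum hash value of the set under each seeded hash."""
    return tuple(min(_hash(seed, s) for s in shingles) for seed in range(num_hashes))


def lsh_groups(signatures, bands=8):
    """Group documents whose signatures agree on at least one full band."""
    rows = len(next(iter(signatures.values()))) // bands
    buckets = {}
    for doc_id, sig in signatures.items():
        for b in range(bands):
            key = (b, sig[b * rows:(b + 1) * rows])
            buckets.setdefault(key, set()).add(doc_id)
    return {frozenset(ids) for ids in buckets.values() if len(ids) > 1}


pages = {
    "a": "<html><body><div><p>Buy cheap pills</p></div></body></html>",
    "b": "<html><body><div><p>Best casino online</p></div></body></html>",
    "c": "<table><tr><td>x</td><td>y</td></tr></table>",
}
signatures = {name: minhash_signature(tag_shingles(html)) for name, html in pages.items()}
groups = lsh_groups(signatures)
```

In this toy run, pages `a` and `b` share the same tag skeleton despite different text, so they land in the same group, while `c` shares no tag shingle with them and stays alone.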

@article{urv08acmtweb,
        author    = {Urvoy, Tanguy and Chauveau, Emmanuel and Filoche, Pascal and Lavergne, Thomas},
        title     = {Tracking Web Spam with {HTML} Style Similarities},
        journal   = {{ACM} Trans. Web},
        volume    = {2},
        number    = {1},
        year      = {2008},
        issn      = {1559-1131},
        pages     = {1--28},
        doi       = {10.1145/1326561.1326564},
        publisher = {{ACM}},
        address   = {New York, {NY}, {USA}}
}

JADT 08 : Taxonomie de textes peu-naturels

In this paper, we define in a pragmatic way what a natural text is. Then, we present various types of unnatural texts, and more particularly the simplest generators, which are also the most widespread in spamdexing. Finally, we describe some statistical tests which allow a first filtering of unnatural texts.

@InProceedings{lavergne2008jadt,
        author    = {Lavergne, Thomas},
        title     = {Taxonomie de textes peu-naturels},
        booktitle = {Proceedings of the 9th International Conference on Textual Data Statistical Analysis},
        year      = {2008},
        pages     = {679--687},
        url       = {http://www.cavi.univ-paris3.fr/lexicometrica/jadt/jadt2008/pdf/lavergne.pdf}
}

PAN 08 : Detecting Fake Content with Relative Entropy Scoring

How can natural texts be distinguished from artificially generated ones? Fake content is commonly encountered on the Internet, ranging from web scraping to random word salads. Most of this fake content is generated for spam purposes. In this paper, we present two methods to deal with this problem. The first one uses classical language models, while the second one is a novel approach using short range information between words.

@InProceedings{lavergne2008pan,
        author    = {Lavergne, Thomas and Urvoy, Tanguy and Yvon, Fran\c{c}ois},
        title     = {Detecting Fake Content with Relative Entropy Scoring},
        booktitle = {International Workshop on Plagiarism Analysis, Authorship Identification,
                     and Near-Duplicate Detection ({PAN})},
        year      = {2008},
        url       = {http://sunsite.informatik.rwth-aachen.de/Publications/CEUR-WS/Vol-377/paper4.pdf}
}

PhD Thesis: Détection des textes non-naturels

@phdthesis{lavergne2009phd,
        author = {Lavergne, Thomas},
        title  = {D\'etection des textes non-naturels},
        school = {{ENST} Paris and Orange Labs},
        month  = {4},
        year   = {2009}
}

ACL-IJCNLP 09 : Introduction of a new paraphrase generation tool based on Monte-Carlo sampling

We propose a new method, specifically designed for paraphrase generation, based on Monte-Carlo sampling and show how this algorithm is suited to the task. Moreover, the basic algorithm presented here leaves many opportunities for future improvement. In particular, unlike Viterbi-based decoders, our algorithm does not constrain the scoring function. It is therefore possible to use global features in paraphrase scoring functions. This algorithm opens new prospects for paraphrase generation and other natural language processing applications such as statistical machine translation.

@InProceedings{chevelu2009acl,
        author    = {Chevelu, Jonathan and Lavergne, Thomas and Lepage, Yves and Moudenc, Thierry},
        title     = {Introduction of a new paraphrase generation tool based on Monte-Carlo sampling},
        booktitle = {Joint Conference of the 47th Annual Meeting of the Association for Computational
                     Linguistics and the 4th International Joint Conference on Natural Language
                     Processing (ACL-IJCNLP)},
        year      = {2009},
        pages     = {249--252},
        url       = {http://www.aclweb.org/anthology/P/P09/P09-2063.pdf}
}

PACLING 09 : Transformation rules and Monte-Carlo sampling: a different approach for statistical paraphrase generation

Paraphrase generation is often presented as a monolingual statistical machine translation problem. This approach cannot take advantage of the particularities of paraphrases, such as transforming only parts of a sentence. We propose a different paradigm for statistical paraphrase generation where a paraphrase is seen as the application of a set of transformation rules to a sentence. We propose a new method, adapted to this point of view, based on Monte-Carlo sampling and show how this algorithm is suited to paraphrase generation. Moreover, the basic algorithm presented here leaves many opportunities for future improvement. In particular, unlike Viterbi-based decoders, our algorithm does not constrain the scoring function. It is therefore possible to use global features in paraphrase scoring functions. This algorithm opens new prospects for paraphrase generation and other natural language processing applications such as statistical machine translation.

@InProceedings{chevelu2009pacling,
        author    = {Chevelu, Jonathan and Lavergne, Thomas and Lepage, Yves and Moudenc, Thierry},
        title     = {Transformation rules and Monte-Carlo sampling: a different approach for
                     statistical paraphrase generation},
        booktitle = {Conference of the Pacific Association for Computational Linguistics (PACLING)},
        year      = {2009},
        pages     = {230--235}
}

LRE: Filtering artificial texts with statistical machine learning techniques

Fake content is flourishing on the Internet, ranging from basic random word salads to web scraping. Most of this fake content is generated for the purpose of feeding fake web sites aimed at biasing search engine indexes: at the scale of a search engine, using automatically generated texts renders such sites harder to detect than using copies of existing pages. In this paper, we present three methods aimed at distinguishing natural texts from artificially generated ones: the first method uses basic lexicometric features, the second one uses standard language models and the third one is based on a relative entropy measure which captures short range dependencies between words. Our experiments show that lexicometric features and language models are effective at detecting most generated texts, but fail to detect texts that are generated with high order Markov models. By comparison, our relative entropy scoring algorithm, especially when trained on a large corpus, detects these “hard” text generators with a high degree of accuracy.
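
To give a concrete feel for relative entropy as a score, the following is a rough sketch and not the scoring algorithm of the paper. It assumes a plain Kullback-Leibler divergence between add-one smoothed word-bigram distributions of a test text and a reference text; the function names and the smoothing choice are assumptions made for this example.

```python
import math
from collections import Counter


def bigram_counts(text):
    """Count adjacent word pairs in a whitespace-tokenised, lower-cased text."""
    words = text.lower().split()
    return Counter(zip(words, words[1:]))


def relative_entropy(test_text, reference_text, alpha=1.0):
    """KL divergence D(P || Q) between smoothed bigram distributions.

    P is estimated from test_text and Q from reference_text. Additive
    smoothing (alpha) over the joint bigram vocabulary keeps Q non-zero.
    """
    p_counts = bigram_counts(test_text)
    q_counts = bigram_counts(reference_text)
    vocab = set(p_counts) | set(q_counts)
    p_total = sum(p_counts.values()) + alpha * len(vocab)
    q_total = sum(q_counts.values()) + alpha * len(vocab)
    score = 0.0
    for bigram in vocab:
        p = (p_counts[bigram] + alpha) / p_total
        q = (q_counts[bigram] + alpha) / q_total
        score += p * math.log(p / q)
    return score


reference = "the cat sat on the mat and the dog sat on the rug"
word_salad = "mat the on sat cat rug the and dog on the sat"
same_score = relative_entropy(reference, reference)
salad_score = relative_entropy(word_salad, reference)
```

A text whose bigram statistics match the reference scores zero, while the shuffled word salad, which keeps the same words but breaks short range dependencies, scores strictly higher.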

@Article{lavergne2010lre,
	author    = {Lavergne, Thomas and Urvoy, Tanguy and Yvon, Fran\c{c}ois},
	title     = {Filtering artificial texts with statistical machine learning techniques},
	journal   = {Language Resources and Evaluation},
	doi       = {10.1007/s10579-009-9113-0},
	publisher = {Springer}
}

Efficient Learning of Sparse Conditional Random Fields for Supervised Sequence Labelling

Conditional Random Fields (CRFs) constitute a popular and efficient approach for supervised sequence labelling. CRFs can cope with large description spaces and can integrate some form of structural dependency between labels. In this contribution, we address the issue of efficient feature selection for CRFs based on imposing sparsity through an L1 penalty. We first show how sparsity of the parameter set can be exploited to significantly speed up training and labelling. We then introduce coordinate descent parameter update schemes for CRFs with L1 regularization. We finally provide some empirical comparisons of the proposed approach with state-of-the-art CRF training strategies. In particular, it is shown that the proposed approach is able to take advantage of sparsity to speed up processing and hence potentially handle higher-dimensional models.
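
The way an L1 penalty produces exact zeros under coordinate descent can be sketched on a much simpler model. The example below is an assumption-laden stand-in: it uses a least-squares objective, 0.5 * sum((y - Xw)^2) + lam * ||w||_1, rather than the CRF likelihood of the paper, and the function names are hypothetical.

```python
def soft_threshold(z, lam):
    """Shrink z towards zero by lam; values within [-lam, lam] become exactly 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_iters=100):
    """Minimise 0.5 * sum((y - Xw)^2) + lam * ||w||_1 one coordinate at a time."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(n_iters):
        for j in range(d):
            rho = 0.0   # correlation of feature j with the residual that ignores w[j]
            norm = 0.0  # squared norm of feature column j
            for i in range(n):
                pred_without_j = sum(X[i][k] * w[k] for k in range(d) if k != j)
                rho += X[i][j] * (y[i] - pred_without_j)
                norm += X[i][j] ** 2
            w[j] = soft_threshold(rho, lam) / norm if norm > 0 else 0.0
    return w


# Toy data: feature 0 explains the target, feature 1 is nearly useless.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
y = [2.0, 0.1, 2.0, 0.1]
weights = lasso_coordinate_descent(X, y, lam=0.5)
```

On this toy data the weakly informative second feature receives a weight of exactly zero, which is the kind of sparsity that can then be exploited to skip work during training and labelling.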

(accepted)
@Article{sokolovska2010stsp,
	author    = {Sokolovska, Nataliya and Lavergne, Thomas and Capp\'e, Olivier and Yvon, Fran\c{c}ois},
	title     = {Efficient learning of sparse conditional random fields for supervised sequence labelling},
	journal   = {Journal of Selected Topics in Signal Processing},
	publisher = {{IEEE}},
}