
RJCRI 06 : Unnatural Language Detection

In the context of web search engines, the escalation between ranking techniques and spamdexing techniques has led to the appearance of fake content in web pages. While random sequences of keywords are easily detectable, web pages produced by dedicated content generators are much more difficult to detect. Motivated by search engine applications, we will focus on the problem of automatic unnatural language detection. We will study both the syntactic and semantic aspects of this problem, and for each of them we will present probabilistic and symbolic approaches.

@InProceedings{lav06rjcri,
	author    = {Thomas Lavergne},
	title     = {Unnatural Language Detection},
	booktitle = {RJCRI'06: Young Scientists' Conference on Information Retrieval},
	year      = {2006},
	pages     = {383--388},
	url       = {http://www.irit.fr/ARIA/2006/383.pdf}
}

AIRWeb 06 : Tracking Web Spam with Hidden Style Similarity

Automatically generated content is ubiquitous on the web: dynamic sites built using the three-tier paradigm are good examples (e.g. commercial sites, blogs and other sites powered by web authoring software), as well as less legitimate spamdexing attempts (e.g. link farms, faked directories).
Pages built using the same generating method (template or script) share a common "look and feel" that is not easily detected by common text classification methods, but is more related to stylometry.
In this paper, we present a (hidden) style similarity measure based on extra-textual features in HTML source code. We also describe a method to cluster a large collection of documents according to this measure. Since the clustering algorithm is based on fingerprints, we also briefly review fingerprinting.
By conveniently sorting the generated clusters, one can efficiently trace instances of a particular automatic content generation method among web pages collected using a crawler. This is particularly useful to detect pages across different sites sharing the same design, which is often a good hint of either a spamdexing attempt or mirrored content.

@InProceedings{urv06airweb,
	author    = {Tanguy Urvoy and Thomas Lavergne and Pascal Filoche},
	title     = {Tracking Web Spam with Hidden Style Similarity},
	booktitle = {International Workshop on Adversarial Information Retrieval on the Web (AIRWeb)},
	year      = {2006},
	url       = {http://airweb.cse.lehigh.edu/2006/urvoy.pdf}
}

ACM Tweb : Tracking Web Spam with HTML Style Similarities

Automatically generated content is ubiquitous in the web: dynamic sites built using the three-tier paradigm are good examples (e.g., commercial sites, blogs and other sites edited using web authoring software), as well as less legitimate spamdexing attempts (e.g., link farms, faked directories).
Those pages built using the same generating method (template or script) share a common look and feel that is not easily detected by common text classification methods, but is more related to stylometry.
In this work we study and compare several HTML style similarity measures based on both textual and extra-textual features in HTML source code. We also propose a flexible algorithm to cluster a large collection of documents according to these measures. Since the proposed algorithm is based on locality sensitive hashing (LSH), we first review this technique.
We then describe how to use the HTML style similarity clusters to pinpoint dubious pages and enhance the quality of spam classifiers. We present an evaluation of our algorithm on the WEBSPAM-UK2006 dataset.
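As a rough illustration of the locality sensitive hashing idea mentioned above, the following minimal sketch uses MinHash (one common LSH family) over sequences of HTML tag names. It is not the algorithm from the paper: the choice of tag k-grams as features, the function names `html_tag_shingles`, `minhash_signature` and `estimated_similarity`, and the parameters `k` and `num_hashes` are all assumptions made for the example.

```python
import hashlib
import re


def html_tag_shingles(html, k=3):
    """Return the set of k-grams over the sequence of HTML tag openers.

    Only markup is kept (e.g. "<div", "</div"); the visible text is ignored,
    so two pages built from the same template yield the same shingles.
    """
    tags = [t.lower() for t in re.findall(r"</?[a-zA-Z][a-zA-Z0-9]*", html)]
    return {tuple(tags[i:i + k]) for i in range(len(tags) - k + 1)}


def _hash64(seed, shingle):
    """Deterministic 64-bit hash of a shingle, parameterized by a seed."""
    data = repr((seed, shingle)).encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(shingles, num_hashes=64):
    """Keep the minimum hash value per seed; similar sets share many minima."""
    return [
        min((_hash64(seed, s) for s in shingles), default=0)
        for seed in range(num_hashes)
    ]


def estimated_similarity(sig_a, sig_b):
    """Fraction of matching positions, an estimate of the Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)
```

In this sketch, pages that share a template but differ in their text get identical signatures, while structurally unrelated pages get signatures that almost never agree. A full LSH scheme would additionally group signatures into buckets so that only likely-similar pages are compared.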

@article{urv08acmtweb,
        author    = {Tanguy Urvoy and Emmanuel Chauveau and Pascal Filoche and Thomas Lavergne},
        title     = {Tracking Web Spam with HTML Style Similarities},
        journal   = {ACM Trans. Web},
        volume    = {2},
        number    = {1},
        year      = {2008},
        issn      = {1559-1131},
        pages     = {1--28},
        doi       = {10.1145/1326561.1326564},
        publisher = {ACM},
        address   = {New York, NY, USA}
}

JADT 08 : Taxonomie de textes peu-naturels

In this paper, we give a pragmatic definition of what a natural text is. We then present various categories of unnatural texts, focusing on the simplest generation methods, which are also the most widespread in spamdexing. Finally, we propose some statistical tests that allow a first filtering of unnatural texts.

@InProceedings{lav08jadt,
        author    = {Thomas Lavergne},
        title     = {Taxonomie de textes peu-naturels},
        booktitle = {Proceedings of the 9th International Conference on Textual Data Statistical Analysis},
        year      = {2008},
        url       = {http://www.cavi.univ-paris3.fr/lexicometrica/jadt/jadt2008/pdf/lavergne.pdf},
        pages     = {679--687}
}

PAN 08 : Detecting Fake Content with Relative Entropy Scoring

To appear in the proceedings of the PAN'08 workshop.

How can natural texts be distinguished from artificially generated ones? Fake content is commonly encountered on the Internet, ranging from web scraping to random word salads. Most of this fake content is generated for spam purposes. In this paper, we present two methods to deal with this problem. The first one uses classical language models, while the second one is a novel approach using short-range information between words.
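To give a feel for the relative entropy named in the title, here is a minimal sketch that computes the Kullback-Leibler divergence between the unigram distribution of a text and that of a reference corpus. It is a generic illustration, not the scoring method of the paper: the unigram model, the add-one smoothing and the function name `relative_entropy` are assumptions made for the example.

```python
import math
from collections import Counter


def relative_entropy(text_tokens, reference_tokens):
    """KL divergence D(P || Q) in bits.

    P is the empirical unigram distribution of the text; Q is the unigram
    distribution of the reference, with add-one smoothing over the joint
    vocabulary so that Q(w) > 0 for every word seen in the text.
    """
    p_counts = Counter(text_tokens)
    q_counts = Counter(reference_tokens)
    vocab = set(p_counts) | set(q_counts)
    p_total = len(text_tokens)
    q_total = len(reference_tokens) + len(vocab)
    score = 0.0
    for word, count in p_counts.items():
        p = count / p_total
        q = (q_counts[word] + 1) / q_total
        score += p * math.log2(p / q)
    return score
```

Under these assumptions, a text whose word frequencies resemble the reference gets a score close to zero, while a keyword-stuffed text gets a noticeably higher one.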

@InProceedings{lav08pan,
        author    = {Thomas Lavergne and Tanguy Urvoy and Fran\c{c}ois Yvon},
        title     = {Detecting Fake Content with Relative Entropy Scoring},
        booktitle = {International Workshop on Plagiarism Analysis, Authorship Identification,
	             and Near-Duplicate Detection (PAN)},
        year      = {2008},
        url       = {http://}
}