The Template Detection and Content Extraction Benchmark Suite

Benchmarks

TECO (TEmplate detection and COntent extraction benchmark suite) is a benchmark suite designed specifically for template detection and content extraction. However, it can be used to test and evaluate any technique that is applied to webpages. It is composed of 150 real websites downloaded from the Internet. The collection is heterogeneous: it includes blogs, company websites, forums, personal websites, sports websites, newspapers, etc. Some of the websites are well known, like the BBC website or the FIFA website, and others are less well known, like personal blogs or small company websites.

The suite includes a gold standard that can be used for both template detection and content extraction. In every benchmark, each HTML element of the webpages is labelled to indicate whether it should be classified as main content or not, and whether it should be classified as template or not.
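As an illustration of how such a gold standard is typically used, the following minimal Python sketch compares the elements a tool classifies as main content with the elements labelled as main content, and reports precision, recall and F1. It does not read the TECO files: how the labels are encoded is not described here, so the element identifiers and the `score` function are hypothetical and only stand in for whatever representation the benchmarks actually use.

```python
from typing import Iterable, Tuple


def score(gold: Iterable[str], predicted: Iterable[str]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) of predicted elements against gold labels.

    Both arguments are collections of element identifiers. The identifier
    scheme is an assumption made for this sketch, not the TECO format.
    """
    gold_set, pred_set = set(gold), set(predicted)
    true_pos = len(gold_set & pred_set)  # elements correctly retrieved

    # Guard against empty sets to avoid division by zero.
    precision = true_pos / len(pred_set) if pred_set else 0.0
    recall = true_pos / len(gold_set) if gold_set else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Hypothetical identifiers for elements labelled as main content.
    gold_main = {"div#article", "p#1", "p#2"}
    # Hypothetical output of a content extraction tool.
    tool_main = {"p#1", "p#2", "div#footer"}
    print(score(gold_main, tool_main))
```

The same function can be applied to the template labels by passing the set of elements labelled as template and the set of elements a tool detects as template.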

Open source

The plugin is distributed as open source under the BSD license. Any redistribution of software that contains or makes use of this plugin must retain the same BSD license.

Feedback

We welcome any feedback from users of the plugin. Any contribution that can help us improve the usability or performance of the plugin is highly appreciated. Please send your feedback to jsilva@dsic.upv.es.

Made at the Universitat Politècnica de València (UPV)

This software has been designed and implemented in the computer science labs of the UPV.