Current milestone: texrex-neuedimensionen
Web document cleaning/crawl file processing tools
written by Roland Schäfer for the COW project.
texrex is free software for processing data files from crawls and turning them into a corpus of web documents.
Currently, it is limited to reading ARC files, but other input modules can be developed quickly.
It performs the following processing steps:
- read ARC files document by document
- filter perfect duplicates using a Bloom filter
- strip HTML, scripts, stylesheets
- extract meta information from crawl headers
- normalize encodings to UTF-8 (using ICU), optionally treating all ISO-8859-1 input as Win-1252
- convert all HTML entities to appropriate codepoints (including rogue Win-1252)
- detect, remove, and/or annotate boilerplate blocks using a Multi-Layer Perceptron trained on 38 features
(In our evaluations, this method makes well over 90% correct decisions and is thus far better than the previous state of the art. To be published.)
- assess the text quality of the documents by looking at the frequencies of short, frequent words (requires language-specific models)
- create w-shingling document fingerprints and filter near-duplicate documents
- perform in-document deduplication (remove repeated paragraphs, insert a backreference to the first copy)
- perform additional normalization (e.g., reduce diverse Unicode dashes and hyphens to the basic codepoint)
- write standard-compliant XML output
- add server IP geolocation meta information (country, region, city – currently based on GeoLite)
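The two document-level deduplication steps in the list above (a Bloom filter for perfect duplicates, w-shingling for near-duplicates) can be illustrated with a minimal sketch. This is not texrex code: texrex is written in FreePascal, and the example below uses Python only for brevity. The class and function names, the filter size, the shingle width w=5, and the resemblance threshold 0.8 are all assumptions chosen for illustration, as is the naive pairwise comparison of shingle sets.

```python
import hashlib
import re


class BloomFilter:
    """Fixed-size Bloom filter over byte strings (sizes are illustrative)."""

    def __init__(self, num_bits=1 << 20, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, data):
        # Derive several bit positions by salting one hash with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha1(bytes([i]) + data).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, data):
        for pos in self._positions(data):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, data):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(data))


def hash_tokens(tokens):
    """Hash one token window to a short hexadecimal fingerprint."""
    return hashlib.sha1(" ".join(tokens).encode("utf-8")).hexdigest()[:16]


def shingles(text, w=5):
    """Return the set of hashed w-token windows (shingles) of a text."""
    tokens = re.findall(r"\w+", text.lower())
    if not tokens:
        return set()
    if len(tokens) < w:
        return {hash_tokens(tokens)}
    return {hash_tokens(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}


def resemblance(a, b):
    """Jaccard coefficient of two shingle sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def deduplicate(docs, w=5, threshold=0.8):
    """Drop perfect duplicates first, then near-duplicates."""
    seen = BloomFilter()
    kept = []  # pairs of (document, shingle set)
    for doc in docs:
        raw = doc.encode("utf-8")
        if raw in seen:
            continue  # perfect duplicate (or a rare Bloom false positive)
        seen.add(raw)
        sh = shingles(doc, w)
        if any(resemblance(sh, other) >= threshold for _, other in kept):
            continue  # near-duplicate of a document already kept
        kept.append((doc, sh))
    return [doc for doc, _ in kept]
```

Calling `deduplicate` on a list of document strings returns the documents that survive both filters, in their original order. The pairwise comparison is quadratic in the number of kept documents, so it only suits small inputs; a tool working at crawl scale would need a more compact fingerprint comparison.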
Technologically, the main features of texrex are:
- written in FreePascal (Object FPC mode)
- licensed under LGPL (Pascal units) and GPL (Pascal programs), as well as the licenses used by ICU and FANN for the header translations of those libraries
- uses multi-threading for single-machine parallelization
- uses simple INI files to configure processing jobs for the main tool
- can be run in the background, using an included IPC client to control the process
- depends on only two additional libraries: ICU and FANN
New tools included since texrex-neuedimensionen (June 2014):
- HyDRA hard hyphenation remover
- rofl tool to fix run-together sentences
Last updated 2014-06-23.
This is an infrequently updated page.
Please go to the SourceForge page for up-to-date information.