Prepro2010 - Text Preprocessing Tool



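Prepro2010 is a text/HTML preprocessing tool written in C++ (Qt Development Frameworks). It covers tasks such as tokenization, sentence boundary detection, Part-of-Speech tagging, lemmatization, Named Entity Recognition, language detection, German compound noun decomposition, and TEI-P5 XML conversion.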

It is based on a re-implementation of the TnT-Tagger algorithm, with an embedded GATE-ANNIE module for Named Entity Recognition and an additionally developed Document Structure Analysis module for automatic TEI-P5 XML conversion.

Prepro2010 comes with three language models: German, English, and Latin. The German model is trained on the NEGRA corpus using the Stuttgart-Tübingen-Tagset (STTS). The English model is trained on the Penn Treebank. The Latin model is trained on the Perseus Latin Treebank.

The preprocessing tool can be applied directly with any of these three language models through three different interfaces: a socket API (PortSocket-API), a C++ binary, and a C++ shared library.



Representation

Each input file or string stream (UTF-8) is converted into a tokenized TEI-P5 representation, such as:

	...
	<w xml:id="xd1_wo1" type="#NN" subtype="#geoCity" lemma="München" function="dic">München</w>
	...

The attribute type holds the Part-of-Speech tag, subtype the Named Entity tag, and lemma the base form of the input wordform. The attribute function names the resource used for lemmatization: dic (dictionary) marks lemmata covered by the fullform lexicon with the assigned Part-of-Speech tag; pro (probabilistic) marks lemmata covered by the fullform lexicon, but not with the assigned Part-of-Speech tag; unk (unknown) marks wordforms that are not in the tagger's fullform lexicon and could not be lemmatized.
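To illustrate how these attributes can be read back from the output, here is a minimal C++ sketch that uses only the standard library. The helper names attributeValue and lemmaSource are hypothetical and not part of Prepro2010, and the naive string search stands in for a real XML parser.

```cpp
#include <cstddef>
#include <string>

// Returns the value of attribute `name` inside a single XML start tag,
// or an empty string if the attribute is absent. Naive string search:
// it assumes double-quoted values without escaped quotes.
std::string attributeValue(const std::string &element, const std::string &name) {
    // The leading space keeps "type" from matching inside "subtype".
    const std::string key = " " + name + "=\"";
    const std::size_t start = element.find(key);
    if (start == std::string::npos) return "";
    const std::size_t valueStart = start + key.size();
    const std::size_t valueEnd = element.find('"', valueStart);
    if (valueEnd == std::string::npos) return "";
    return element.substr(valueStart, valueEnd - valueStart);
}

// Maps the `function` attribute to a readable description of the lemma source.
std::string lemmaSource(const std::string &function) {
    if (function == "dic") return "lexicon entry with the assigned PoS tag";
    if (function == "pro") return "lexicon entry with a different PoS tag";
    if (function == "unk") return "not in the lexicon, not lemmatized";
    return "unrecognized value";
}
```

Applied to a <w> element, attributeValue(element, "type") yields the PoS tag including its leading #, and lemmaSource(attributeValue(element, "function")) explains where the lemma came from.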

Multiword units and noun segments within a sentence (<s>) are annotated using the segment tag (<seg>):

	<seg xml:id="xd1_segm_1">
		<w xml:id="xd1_wo1" type="#NE" subtype="#companyName" lemma="FC" function="pro">FC</w>
		<w xml:id="xd1_wo2" type="#NE" subtype="#companyName" lemma="Bayern" function="dic">Bayern</w>
		<w xml:id="xd1_wo3" type="#NE" subtype="#companyName" lemma="München" function="dic">München</w>
	</seg>

In addition, the tool generates frequency statistics for the elements occurring in the input text, such as:

  <sourceDesc>
	<p xml:id="NumberOfToken">161</p>
	<p xml:id="NumberOfUnknown">11</p>
	<p xml:id="NumberOfProbUnknown">11</p>
	<p xml:id="NumberOfWordforms">105</p>
	<p xml:id="NumberOfLemmata">101</p>
	<p xml:id="HapaxLegomena(Wordform)">81</p>
	<p xml:id="HapaxLegomena(Lemma)">78</p>
	<p xml:id="sentence">8</p>
	<p xml:id="paragraph">4</p>
	<p xml:id="div">1</p>
	<p xml:id="header">0</p>
	<p xml:id="table">0</p>
	<p xml:id="length">855</p>
	<p xml:id="namedEntity">12</p>
	<p xml:id="noun">38</p>
	<p xml:id="adjective">10</p>
	<p xml:id="verb">24</p>
	<p xml:id="punctation">20</p>
	<p xml:id="frequentNoun">Moratorium,Barack,Obama,New,Orleans,Golf,Unternehmen,Anwalt</p>
	<p xml:id="frequentPhrase">Präsident Barack Obama,Anwalt David Rosenblum,Deepwater Horizon</p>
  </sourceDesc>
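The token-level counters can be approximated as follows. This is a sketch under assumptions: the struct and function names are illustrative, and whether Prepro2010 counts punctuation or normalizes case before counting is not documented here.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

struct TokenStats {
    std::size_t tokens = 0;        // cf. NumberOfToken
    std::size_t wordforms = 0;     // cf. NumberOfWordforms (distinct tokens)
    std::size_t hapaxLegomena = 0; // cf. HapaxLegomena(Wordform): seen exactly once
};

// Counts tokens, distinct wordforms, and wordforms that occur only once.
TokenStats countTokens(const std::vector<std::string> &tokens) {
    std::map<std::string, std::size_t> frequency;
    for (const std::string &token : tokens) ++frequency[token];

    TokenStats stats;
    stats.tokens = tokens.size();
    stats.wordforms = frequency.size();
    for (const auto &entry : frequency) {
        if (entry.second == 1) ++stats.hapaxLegomena;
    }
    return stats;
}
```

The same counting applied to a vector of lemmata instead of wordforms would give the lemma-based counters.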

You can run a verbose mode (Add Probability in the web form, &extra=prob in the socket request, addAdditionalInfos=true in the C++ API) to get further probabilistic information about the tagging and lemmatization process: the PoS-to-lemma mapping within the lexicon and the PoS-to-wordform emission probability. Note that the interpretation group (<interpGrp>) follows directly after the <w> tag:

	...
	 <w xml:id="xd1_wo1" type="#NN" subtype="#geoCountry" lemma="Bayer" function="dic">Bayern</w>
         <interpGrp resp="intPg_2" type="PoS-Lemma">
               <interp xml:id="intP_2_1" type="NN">Bayer</interp>
               <interp xml:id="intP_2_2" type="NE">Bayern</interp>
               <interp xml:id="intP_2_4" type="NN">0.55546</interp>
               <interp xml:id="intP_2_5" type="NE">0.44454</interp>
         </interpGrp>
	...



Evaluation

The preprocessing speed depends on the ambiguity of the words (number of possible PoS tags) and on the number of unknown words (which require suffix/prefix analysis) in the text. On Linux, the tool currently processes 10,000 tokens in 1.5 seconds with all modules enabled (tagging, lemmatization, ...). We evaluated the preprocessing architecture on different standard corpora and treebanks:

	Preprocessing Step        Language   Parameter                  F1-Score      Corpus
	Lemmatization             de         888,573 word forms         .921          Negra Corpus
	PoS-Tagging               de/en      3,000 / 5,000 sentences    .975 / .956   Negra Corpus / Penn Treebank
	Language Identification   21 lang.   50 / 100 chars             .956 / .970   Wikipedia
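For reference, the F1-score is the harmonic mean of precision and recall. The sketch below computes it from raw counts; the function name is illustrative, and how Prepro2010's evaluation counts hits and misses per task is not specified here.

```cpp
#include <cstddef>

// F1 = 2 * precision * recall / (precision + recall).
// Returns 0.0 when precision or recall is undefined or both are zero.
double f1Score(std::size_t truePositives,
               std::size_t falsePositives,
               std::size_t falseNegatives) {
    const double tp = static_cast<double>(truePositives);
    const double predicted = tp + static_cast<double>(falsePositives);
    const double actual = tp + static_cast<double>(falseNegatives);
    if (predicted == 0.0 || actual == 0.0) return 0.0;

    const double precision = tp / predicted;
    const double recall = tp / actual;
    if (precision + recall == 0.0) return 0.0;
    return 2.0 * precision * recall / (precision + recall);
}
```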



TEI-Conversion


Language Detection

The following 70 languages are currently supported:
Albanian, Asturian, Azerbaijani, Basque, Belarusian, Bengali, Bishnupriya Manipuri, Bosnian, Breton, Bulgarian, Catalan, Cebuano, Chinese, Croatian, Czech, Danish, Dutch, English, Esperanto, Estonian, Finnish, French, Georgian, German, Greek, Hebrew, Hindi, Hungarian, Indonesian, Italian, Japanese, Javanese, Korean, Kurdish, Latin, Latvian, Lithuanian, Low German, Luxembourgish, Macedonian, Malay, Marathi, Neapolitan, Nepal Bhasa, Norwegian, Occitan, Old High German, Old Saxon, Persian, Polish, Portuguese, Romanian, Russian, Serbian, Serbo-Croatian, Sicilian, Slovak, Slovenian, Spanish, Sundanese, Swedish, Tagalog, Tamil, Thai, Turkish, Ukrainian, Vietnamese, Volapük, Walloon, Welsh



Compound Nouns

This application decomposes a compound noun into two or more words (German only) by means of a noun-based semantic relatedness measure.
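As a rough illustration of the first step of such a decomposition, the sketch below enumerates two-part split candidates whose parts are both found in a lexicon. It is not the tool's method: the semantic relatedness ranking is left out entirely, the function name and the toy lexicon are hypothetical, and the input is assumed to be lowercase ASCII.

```cpp
#include <cstddef>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Enumerates two-part splits of `compound` where both parts occur in `lexicon`.
// A linking "s" (Fugen-s) directly after the first part may be skipped.
// Ranking the candidates (e.g. by semantic relatedness) is not done here.
std::vector<std::pair<std::string, std::string>>
splitCandidates(const std::string &compound, const std::set<std::string> &lexicon) {
    std::vector<std::pair<std::string, std::string>> candidates;
    for (std::size_t cut = 1; cut < compound.size(); ++cut) {
        const std::string head = compound.substr(0, cut);
        if (lexicon.count(head) == 0) continue;

        const std::string tail = compound.substr(cut);
        if (lexicon.count(tail) > 0) candidates.emplace_back(head, tail);
        if (tail.size() > 1 && tail[0] == 's' && lexicon.count(tail.substr(1)) > 0) {
            candidates.emplace_back(head, tail.substr(1));
        }
    }
    return candidates;
}
```

With the toy lexicon {arbeit, amt, samt}, the input arbeitsamt yields both arbeit + samt and arbeit + amt. Choosing between such competing candidates is where a relatedness measure would come in.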



Preprocessing Architecture

  



Preprocessing Usage

You can use the preprocessing tool within three different interfaces:

PortSocket-API

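You can connect to the socket interface as follows (PHP socket example):

	// Open a TCP connection to the preprocessing server
	$fp = fsockopen("141.2.89.22", 6665);
	// Raw input text followed by the request parameters
	$inputPortStream = "$myRawData&action=teixml&lang=german";
	fwrite($fp, $inputPortStream);
	// Read the TEI-P5 XML response until the server closes the connection
	$teiP5Xml = stream_get_contents($fp);
	fclose($fp);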
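The request string used above can also be assembled in C++. The following is a minimal sketch: buildTeiRequest is a hypothetical helper, the position of the &extra=prob parameter within the request is an assumption, and any escaping of & characters inside the raw text is not covered because the protocol does not document it here.

```cpp
#include <string>

// Builds the request sent over the socket: raw text, then the parameters.
// `withProbabilities` appends "&extra=prob" (verbose mode); placing it at
// the end of the request is an assumption.
std::string buildTeiRequest(const std::string &rawText,
                            const std::string &language,
                            bool withProbabilities = false) {
    std::string request = rawText + "&action=teixml&lang=" + language;
    if (withProbabilities) request += "&extra=prob";
    return request;
}
```

The resulting string would then be written to a TCP socket connected to the server, for example with QTcpSocket in a Qt application.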


C++ Binary

You can try the BieleTagger as a binary application (the default model is German) by typing the following on the servers Varda/Hydra:

	./Prepro2010 germanInput.txt

To choose the language model, add one of the three model names (german|english|latin) as a second parameter:

	./Prepro2010 englishInput.txt english

To choose an output file, add the file name as a third parameter:

	./Prepro2010 englishInput.txt english englishOutput.tei
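The positional arguments shown above can be modelled as in the following sketch. It only mirrors the documented argument order and default; CliOptions and parseArguments are hypothetical names, and how the real binary reacts to invalid input is an assumption.

```cpp
#include <string>
#include <vector>

struct CliOptions {
    std::string inputFile;
    std::string model = "german"; // documented default model
    std::string outputFile;       // empty: no output file was given
    bool valid = false;
};

// `args` holds the command-line arguments without the program name:
// <inputFile> [german|english|latin] [outputFile]
CliOptions parseArguments(const std::vector<std::string> &args) {
    CliOptions options;
    if (args.empty() || args.size() > 3) return options;

    options.inputFile = args[0];
    if (args.size() >= 2) {
        const std::string &model = args[1];
        if (model != "german" && model != "english" && model != "latin") return options;
        options.model = model;
    }
    if (args.size() == 3) options.outputFile = args[2];
    options.valid = true;
    return options;
}
```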


C++ Shared Library

You can use the tool as a shared library (Prepro2010Lib) by adding the library paths, including the Actionizer header, and calling the Actionizer class:


	LD_LIBRARY_PATH +=/usr/local/Preprocessor2010/
	LD_LIBRARY_PATH +=/usr/local/tidy/


	#include <QtCore>
	#include "Actionizer/Actionizer.h"


	QString textContent = "text for tagging!";
	Actionizer Ac("english");
	QString resultTeiP5 = Ac.getTeiXmlByTextString(textContent);


C++ Shared Library Functions

Additional functions (Actionizer.h) are available:


	// Returns the TEI-XML representation
	QString               getTeiXmlByTextString(QString &inputText);

	// Returns the lemma of a word
	QString               getLemmaOfWord(QString &word);

	// Returns the PoS tag of a word
	QString               getWklOfWord(const QString &word);

	// Returns lemma [0] and PoS tag [1] of a word
	QVector<QString>      getLemmaAndWklOfWord(QString &word);

	// Returns the stem of a word
	QString               getStemOfWord(QString &word, QString &usedLanguage);

	// Tokenizes the input; each token becomes one entry
	QVector<QString>      getTokenizedVectorByTextString(QString &inputText);

	// Adds sentence boundaries (<S>,</S>)
	QVector<QString>      getSentenizedVectorByTokenizedVector(QVector<QString> &tokenizedVector);

	// Adds PoS tags to the tokenized and sentenized vector (<S>,</S> must already be assigned!)
	QVector<QString>      getTaggedVectorByTokenizedVector(QVector<QString> &sentenicedVector);

	// Adds lemma information to the tokenized and sentenized vector (<S>,</S> must already be assigned!)
	QVector<QString>      getLemmatizedVectorByTokenizedVector(QVector<QString> &sentenicedVector);

	// Returns the language name
	QString               detectLanguageByTextString(QString &inputText);

	// Returns all lemmata [0] and PoS tags [1] of a word in a vector
	QVector<QVector<QString> >      getLemmaPosCombinationsByWordString(QString &word);

	// Returns all entity classes of a word
	QVector<QString>      getEntityClassesByWordString(QString &word);
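The tagging and lemmatization functions expect the sentence markers <S> and </S> to be present in the token vector. The sketch below shows what such a vector could look like, using std::vector in place of QVector so that it compiles without Qt. The splitting rule is deliberately naive (sentence-final punctuation only) and is not the tool's sentence detection; the exact placement of the markers in Prepro2010's vectors is an assumption.

```cpp
#include <string>
#include <vector>

// Wraps each sentence in "<S>" ... "</S>" marker entries.
// A sentence ends at a ".", "!" or "?" token; trailing tokens without
// final punctuation are closed at the end of the input.
std::vector<std::string> addSentenceMarkers(const std::vector<std::string> &tokens) {
    std::vector<std::string> result;
    bool sentenceOpen = false;
    for (const std::string &token : tokens) {
        if (!sentenceOpen) {
            result.push_back("<S>");
            sentenceOpen = true;
        }
        result.push_back(token);
        if (token == "." || token == "!" || token == "?") {
            result.push_back("</S>");
            sentenceOpen = false;
        }
    }
    if (sentenceOpen) result.push_back("</S>");
    return result;
}
```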


		

C++ Shared Library Header (TEI)

	QString		getTeiXmlByTextString(QString &inputText,
					QString corpusName,
					QString corpusSource,
					QString textId,
					QString textType,
					bool isCorpus,
					bool addAdditionalInfos
					);



C++ Shared Library Individual Modules


	QString inputText;
	KnowledgeBase Kb;
	QVector <QString> tokenizedVector;


	// ************** Tokenization of Input-String and Document-Structure Analysis (tidy) ************** //
	Tokenizer To;
	To.getTokenizedInput(inputText,tokenizedVector, Kb);



	// ************** Sentence Detection following Kiss-Strunk  ************** //
	Sentencer Se;
	Se.getSentenicerInput(tokenizedVector, Kb);



	// ************** Tagging following TnT-Tagger HMM  ************** //
	QVector <QString> taggedVector;
	taggedVector.fill("",tokenizedVector.size());
	QVector <QString> lemmatizedVector = tokenizedVector;
	QVector <QString> entityVector;
	entityVector.fill("",tokenizedVector.size());
	QVector <QString> additionalInfoVector;
	additionalInfoVector.fill("",tokenizedVector.size());

	Tagger Tg(Kb);
	Tg.getTaggedInput(tokenizedVector,
				taggedVector,
				lemmatizedVector,
				entityVector,
				Kb,
				additionalInfoVector,
				addAdditionalInfos
				);


	// ************** TEI-P5 Conversion  ************** //
	Outizer Ot;
	QString teiXml = Ot.getTeiXml(tokenizedVector,
					taggedVector,
					lemmatizedVector,
					entityVector,
					additionalInfoVector,
					textCounter,
					textType,
					textId,
					corpusName,
					corpusSource,
					addAdditionalInfos
					);
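The modules above communicate through parallel vectors: entry i of the tag and lemma vectors describes token i. The following sketch shows how such vectors could be combined into <w> elements. It uses std::vector instead of QVector, omits the subtype and function attributes for brevity, and takes the xd1_wo id prefix from the examples in the Representation section; the function names are hypothetical and this is not the Outizer implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Escapes the characters that are special in XML text and attribute values.
std::string escapeXml(const std::string &text) {
    std::string out;
    for (char c : text) {
        switch (c) {
            case '&': out += "&amp;"; break;
            case '<': out += "&lt;"; break;
            case '>': out += "&gt;"; break;
            case '"': out += "&quot;"; break;
            default:  out += c;
        }
    }
    return out;
}

// Builds one <w> element per token from three parallel vectors.
std::vector<std::string> toWordElements(const std::vector<std::string> &tokens,
                                        const std::vector<std::string> &tags,
                                        const std::vector<std::string> &lemmas) {
    // The vectors are parallel, so they must have the same length.
    assert(tokens.size() == tags.size() && tokens.size() == lemmas.size());
    std::vector<std::string> elements;
    for (std::size_t i = 0; i < tokens.size(); ++i) {
        elements.push_back("<w xml:id=\"xd1_wo" + std::to_string(i + 1) +
                           "\" type=\"#" + escapeXml(tags[i]) +
                           "\" lemma=\"" + escapeXml(lemmas[i]) + "\">" +
                           escapeXml(tokens[i]) + "</w>");
    }
    return elements;
}
```

This also explains why the tagged, entity, and additional-info vectors are filled to tokenizedVector.size() before tagging: every module writes to the same index positions.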

	


C++ Train and Evaluate Model

Prepro2010 can be trained on different languages (see Actionizer.h). You need a tagged corpus/treebank in the format shown below: one wordform per line, with the wordform in the first column and the tag in the second. Sentence boundaries are marked by #BOS and #EOS. The output files of the training phase are written to a folder named after modelName.


	#BOS	\t	1
	In	\t	APPR
	nova	\t	ADJ
	fert	\t	V
	animus	\t	N
	mutatas	\t	V
	dicere	\t	V
	formas	\t	N
	corpora	\t	N
	#EOS	\t	2
	#BOS	\t	3
	...

Using this treebank format as input, you can train a model for any whitespace-separable language with the function below. Additionally, you can evaluate the performance of the model on an evaluation partition (same treebank format).

	Actionizer Ac;
	Ac.trainTagger("french", "tigerTreebank.tsv");

	bool              trainTagger(QString modelName, QString treebankFile);
	bool              evaluateTagger(QString treebankFile);
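Before training, it can be useful to check that a treebank file really follows the two-column format described above. The sketch below reads the format into sentences of (wordform, tag) pairs. It assumes the columns are separated by a single tab character and that the second column of the #BOS/#EOS lines can be ignored; readTreebank is a hypothetical helper and not part of Actionizer.h.

```cpp
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

using TaggedToken = std::pair<std::string, std::string>; // wordform, tag
using Sentence = std::vector<TaggedToken>;

// Reads "wordform<TAB>tag" lines grouped by #BOS / #EOS marker lines.
// Lines without a tab and tokens outside a #BOS ... #EOS span are skipped.
std::vector<Sentence> readTreebank(std::istream &in) {
    std::vector<Sentence> sentences;
    Sentence current;
    bool insideSentence = false;
    std::string line;
    while (std::getline(in, line)) {
        const std::size_t tab = line.find('\t');
        if (tab == std::string::npos) continue;
        const std::string wordform = line.substr(0, tab);
        const std::string tag = line.substr(tab + 1);

        if (wordform == "#BOS") {
            current.clear();
            insideSentence = true;
        } else if (wordform == "#EOS") {
            if (insideSentence) sentences.push_back(current);
            insideSentence = false;
        } else if (insideSentence) {
            current.emplace_back(wordform, tag);
        }
    }
    return sentences;
}
```

The function accepts any input stream, so it works with a std::ifstream opened on the treebank file as well as with a std::istringstream for quick checks.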

Copy the generated data into the resource folder of the library version (e.g. resources/french/tagger/). To extract a fullform lexicon from the database, use:


	bool              extractFullFormLexiconFromDB(QString host, QString dbName, QString user, QString pass, QString lexiconName);

	


TagSet



Contact

Prof. Dr. Alexander Mehler
Fachbereich für Informatik und Mathematik
Goethe-Universität Frankfurt am Main
Robert-Mayer-Straße 10
D-60054 Frankfurt am Main
Postfach / P.O. Box: 154
Email: meh...@em.uni-frankfurt.de
Web: http://www.hucompute.org/team/21


Team

Rüdiger Gleim
Fachbereich für Informatik und Mathematik
Goethe-Universität Frankfurt am Main
Robert-Mayer-Straße 10
D-60054 Frankfurt am Main
Postfach / P.O. Box: 154
Email: gle...@em.uni-frankfurt.de
Web: http://www.hucompute.org/team/29

Ulli Waltinger
University of Bielefeld
Faculty of Technology
Text Technology / Applied Computational Linguistics
ulli...@uni-bielefeld.de


References

Thorsten Brants (2000). TnT - A Statistical Part-of-Speech Tagger. In Proceedings of the Sixth Applied Natural Language Processing Conference (ANLP-2000), Seattle, WA.

H. Cunningham, D. Maynard, K. Bontcheva and V. Tablan (2002). GATE: A Framework and Graphical Development Environment for Robust NLP Tools and Applications. In Proceedings of the 40th Anniversary Meeting of the Association for Computational Linguistics (ACL'02), Philadelphia, July 2002.

Ulli Waltinger and Alexander Mehler (2009). The Feature Difference Coefficient: Classification by Means of Feature Distributions. In Proceedings of Text Mining Services (TMS), March 23-25, Leipzig, Germany.

Ulli Waltinger and Alexander Mehler (2008). Who is it? Context sensitive named entity and instance recognition by means of Wikipedia. In Proceedings of the 2008 IEEE/WIC/ACM International Conference on Web Intelligence (WI-2008).

Alexander Mehler, Rüdiger Gleim, Alexandra Ernst and Ulli Waltinger (2008). WikiDB: Building Interoperable Wiki-Based Knowledge Resources for Semantic Databases. In Sprache und Datenverarbeitung. International Journal for Language Data Processing.

Ulli Waltinger and Alexander Mehler (2009). Social Semantics And Its Evaluation By Means Of Semantic Relatedness And Open Topic Models. In Proceedings of the 2009 IEEE/WIC/ACM International Conference on Web Intelligence, Milan, Italy.

Ulli Waltinger, Irene Cramer and Tonio Wandmacher (2009). From Social Networks To Distributional Properties: A Comparative Study On Computing Semantic Relatedness. In Proceedings of the Annual Meeting of the Cognitive Science Society (CogSci 2009), Amsterdam, NL.



Last changed: 25 June 2010, Ulli Waltinger