Prepro2010 is a text/HTML preprocessing tool written in C++ (Qt Development Frameworks). It can be used for tasks such as tokenization, sentence boundary detection, part-of-speech tagging, lemmatization, named entity recognition, language identification, and conversion to TEI-P5 XML.
It is based on a re-implementation of the TnT-Tagger algorithm, with an embedded GATE-ANNIE module for Named Entity Recognition and an additionally developed Document Structure Analysis module for automatic TEI-P5 XML conversion.
Prepro2010 comes with three language models: German, English, and Latin. The German model is trained on the NEGRA corpus using the Stuttgart-Tübingen-Tagset (STTS). The English model is trained on the Penn Treebank. The Latin model is trained on the Perseus Latin Treebank.
The preprocessing tool can be applied directly with any of these three language models through three different interfaces: a socket interface, a binary (command-line) application, and a shared library.
Each input file or string stream (UTF-8) is converted into a tokenized TEI-P5 representation, such as:
... <w xml:id="xd1_wo1" type="#NN" subtype="#geoCity" lemma="München" function="dic">München</w> ...
The attribute type holds the Part-of-Speech tag, subtype the Named Entity tag, and lemma the base form of the input wordform. The attribute function denotes the resource used for lemmatization: dic (dictionary) marks lemmata covered by the fullform lexicon with the assigned Part-of-Speech tag; pro (probabilistic) marks lemmata covered by the fullform lexicon, but not with the assigned Part-of-Speech tag; unk (unknown) marks wordforms that are not in the tagger's fullform lexicon and could not be lemmatized.
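The three values of the function attribute can be read as a lookup cascade. The following is a minimal sketch of that reading, not the tool's actual code: the lexicon layout (wordform mapped to tag/lemma pairs) and the names Lexicon and lemmaSource are assumptions made for illustration.

```cpp
#include <map>
#include <string>

// Hypothetical full-form lexicon: wordform -> (PoS tag -> lemma).
using Lexicon = std::map<std::string, std::map<std::string, std::string>>;

// Returns "dic", "pro", or "unk" for a wordform and its assigned tag.
std::string lemmaSource(const Lexicon &lexicon,
                        const std::string &wordform,
                        const std::string &tag) {
    const auto entry = lexicon.find(wordform);
    if (entry == lexicon.end()) {
        return "unk";  // wordform is not in the lexicon at all
    }
    if (entry->second.count(tag) > 0) {
        return "dic";  // wordform is listed with the assigned tag
    }
    return "pro";      // wordform is listed, but only with other tags
}
```

With a lexicon that lists Bayern as NE and NN, lemmaSource returns dic for the tag NE and pro for any tag that is not listed.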
Multiword units and noun segments within a sentence (<s>) are annotated using the segment tag (<seg>):
<seg xml:id="xd1_segm_1">
<w xml:id="xd1_wo1" type="#NE" subtype="#companyName" lemma="FC" function="pro">FC</w>
<w xml:id="xd1_wo2" type="#NE" subtype="#companyName" lemma="Bayern" function="dic">Bayern</w>
<w xml:id="xd1_wo3" type="#NE" subtype="#companyName" lemma="München" function="dic">München</w>
</seg>
In addition, the tool generates frequency statistics on the elements occurring in the input text, such as:
<sourceDesc>
<p xml:id="NumberOfToken">161</p>
<p xml:id="NumberOfUnknown">11</p>
<p xml:id="NumberOfProbUnknown">11</p>
<p xml:id="NumberOfWordforms">105</p>
<p xml:id="NumberOfLemmata">101</p>
<p xml:id="HapaxLegomena(Wordform)">81</p>
<p xml:id="HapaxLegomena(Lemma)">78</p>
<p xml:id="sentence">8</p>
<p xml:id="paragraph">4</p>
<p xml:id="div">1</p>
<p xml:id="header">0</p>
<p xml:id="table">0</p>
<p xml:id="length">855</p>
<p xml:id="namedEntity">12</p>
<p xml:id="noun">38</p>
<p xml:id="adjective">10</p>
<p xml:id="verb">24</p>
<p xml:id="punctation">20</p>
<p xml:id="frequentNoun">Moratorium,Barack,Obama,New,Orleans,Golf,Unternehmen,Anwalt</p>
<p xml:id="frequentPhrase">Präsident Barack Obama,Anwalt David Rosenblum,Deepwater Horizon</p>
</sourceDesc>
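To make three of these counters concrete, here is a small sketch of how token, wordform, and hapax legomena counts could be derived from a token list. It assumes exact string comparison without case folding; the counting rules Prepro2010 actually applies are not documented here, and TextStats and countWordforms are illustrative names.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

struct TextStats {
    std::size_t tokens = 0;     // cf. NumberOfToken
    std::size_t wordforms = 0;  // cf. NumberOfWordforms (distinct tokens)
    std::size_t hapax = 0;      // cf. HapaxLegomena(Wordform)
};

// Counts tokens, distinct wordforms, and wordforms that occur exactly once.
TextStats countWordforms(const std::vector<std::string> &tokens) {
    std::map<std::string, std::size_t> frequency;
    for (const std::string &token : tokens) {
        ++frequency[token];
    }
    TextStats stats;
    stats.tokens = tokens.size();
    stats.wordforms = frequency.size();
    for (const auto &entry : frequency) {
        if (entry.second == 1) {
            ++stats.hapax;
        }
    }
    return stats;
}
```

The same counting applied to a lemma list instead of a token list would give the lemma-based counterparts.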
You can run a verbose mode (Add Probability / &extra=prob / addAdditionalInfos=true), in order to get further probabilistic information (PoS-to-Lemma within the lexicon; and PoS-to-Wordform emission probability) about the tagging and lemmatization process. Note that the interpretation follows directly after the <w> tag:
...
<w xml:id="xd1_wo1" type="#NN" subtype="#geoCountry" lemma="Bayer" function="dic">Bayern</w>
<interpGrp resp="intPg_2" type="PoS-Lemma">
<interp xml:id="intP_2_1" type="NN">Bayer</interp>
<interp xml:id="intP_2_2" type="NE">Bayern</interp>
<interp xml:id="intP_2_4" type="NN">0.55546</interp>
<interp xml:id="intP_2_5" type="NE">0.44454</interp>
</interpGrp>
...
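A client reading the verbose output may want to pair each candidate lemma with its probability and rank the candidates. The sketch below assumes the client has already extracted the interp entries into a list; the Interpretation struct and mostProbable are hypothetical helper names. Since the tagger is an HMM, its final decision also depends on the sentence context, so this ranking is only a reading aid.

```cpp
#include <string>
#include <vector>

// One candidate reading of a wordform, as listed in an interpGrp.
struct Interpretation {
    std::string tag;     // e.g. "NN"
    std::string lemma;   // e.g. "Bayer"
    double probability;  // e.g. 0.55546
};

// Returns the index of the most probable candidate, or -1 for an empty list.
int mostProbable(const std::vector<Interpretation> &candidates) {
    int best = -1;
    for (int i = 0; i < static_cast<int>(candidates.size()); ++i) {
        if (best < 0 || candidates[i].probability > candidates[best].probability) {
            best = i;
        }
    }
    return best;
}
```

For the example above (NN/Bayer with 0.55546 and NE/Bayern with 0.44454), the first candidate is ranked highest.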
The preprocessing speed depends on the ambiguity of the words (number of possible PoS tags) and the number of unknown words (suffix/prefix analysis) in the text. The tool currently processes 10,000 tokens in 1.5 seconds using all modules (tagging, lemmatization, ...) on Linux. We evaluated the preprocessing architecture on different standard corpora and treebanks:
| Preprocessing Step | Language | Parameter | F1-Score | Corpus |
| Lemmatization | de | 888,573 word forms | .921 | Negra Corpus |
| PoS-Tagging | de/en | 3,000 / 5,000 sentences | .975 / .956 | Negra Corpus / Penn Treebank |
| Language Identification | 21 lang. | 50 / 100 chars | .956 / .970 | Wikipedia |
Currently we support the following languages (70):
Albanian,
Old High German,
Old Saxon,
Azerbaijani,
Asturian,
Basque,
Bengali,
Bishnupriya Manipuri,
Bosnian,
Breton,
Bulgarian,
Cebuano,
Chinese,
German,
Danish,
English,
Esperanto,
Estonian,
Finnish,
French,
Georgian,
Greek,
Hebrew,
Hindi,
Indonesian,
Italian,
Japanese,
Javanese,
Catalan,
Korean,
Croatian,
Kurdish,
Latin,
Latvian,
Lithuanian,
Luxembourgish,
Malay,
Marathi,
Macedonian,
Neapolitan,
Nepal Bhasa,
Dutch,
Norwegian,
Occitan,
Persian,
Low German,
Polish,
Portuguese,
Romanian,
Russian,
Swedish,
Serbian,
Serbo-Croatian,
Sicilian,
Slovak,
Slovenian,
Spanish,
Sundanese,
Tagalog,
Tamil,
Thai,
Turkish,
Czech,
Ukrainian,
Hungarian,
Vietnamese,
Volapük,
Welsh,
Walloon,
Belarusian
This application decomposes compound nouns into two or more words (German only) by means of a noun-based semantic relatedness measure.
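The relatedness measure itself is not described here, so the sketch below swaps in a much simpler technique: it only enumerates two-part split candidates whose parts both appear in a noun list, optionally skipping a linking "s". It is meant to show where a ranking step such as semantic relatedness would plug in, not how Prepro2010 decomposes compounds. The noun list, capitalize, and splitCandidates are assumptions for illustration, and capitalization is handled for ASCII letters only.

```cpp
#include <cctype>
#include <cstddef>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Upper-cases the first letter (ASCII only) so that the tail of a compound
// can be looked up as a noun.
std::string capitalize(std::string word) {
    if (!word.empty()) {
        word[0] = static_cast<char>(
            std::toupper(static_cast<unsigned char>(word[0])));
    }
    return word;
}

// Returns all (head, tail) pairs where both parts are known nouns.
// A linking "s" between the parts is tolerated (e.g. Arbeit-s-zeit).
std::vector<std::pair<std::string, std::string>>
splitCandidates(const std::string &compound, const std::set<std::string> &nouns) {
    std::vector<std::pair<std::string, std::string>> candidates;
    for (std::size_t i = 1; i < compound.size(); ++i) {
        const std::string head = compound.substr(0, i);
        if (nouns.count(head) == 0) {
            continue;
        }
        const std::string tail = compound.substr(i);
        if (nouns.count(capitalize(tail)) > 0) {
            candidates.emplace_back(head, capitalize(tail));
        } else if (tail.size() > 1 && tail[0] == 's' &&
                   nouns.count(capitalize(tail.substr(1))) > 0) {
            candidates.emplace_back(head, capitalize(tail.substr(1)));
        }
    }
    return candidates;
}
```

When several candidates survive, a scoring function would be needed to choose among them; that is the role the semantic relatedness measure plays in the tool.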
You can use the preprocessing tool within three different interfaces:
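You can connect to the socket interface as follows (PHP socket example):
// Open the connection and keep the handle for the calls below
$fp = fsockopen("141.2.89.22", 6665);
// Raw text followed by the request parameters
$inputPortStream = "$myRawData&action=teixml&lang=german";
fwrite($fp, $inputPortStream);
// Read the TEI-P5 result and close the connection
$teiP5Xml = stream_get_contents($fp);
fclose($fp);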
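If you build the request from C++ instead, the string layout of the PHP example can be mirrored by a small helper. This is a sketch under assumptions: buildRequest is a hypothetical name, the position of the extra=prob parameter (the verbose mode) is a guess, and whether the raw text needs any escaping is not documented.

```cpp
#include <string>

// Builds a request in the layout used by the PHP example:
// raw text first, then the parameters appended with '&'.
std::string buildRequest(const std::string &rawText,
                         const std::string &language,
                         bool withProbabilities) {
    std::string request = rawText + "&action=teixml&lang=" + language;
    if (withProbabilities) {
        request += "&extra=prob";  // verbose mode; position is an assumption
    }
    return request;
}
```

The resulting string would then be written to a socket connected to the host and port shown above.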
You can check the functionality of the BieleTagger as a binary application (the default model is german) by typing the following on the servers Varda/Hydra:
./Prepro2010 germanInput.txt
To choose the language model, add one of the three model names (german|english|latin):
./Prepro2010 englishInput.txt english
To choose an output file, add the file name as a third parameter:
./Prepro2010 englishInput.txt english englishOutput.tei
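The positional arguments above can be summarized in a few lines. This is only a sketch of the documented calling convention, not the binary's actual argument handling; CliOptions and parseCli are illustrative names, and the argument list is assumed to exclude the program name.

```cpp
#include <string>
#include <vector>

struct CliOptions {
    std::string inputFile;
    std::string model = "german";  // documented default model
    std::string outputFile;        // empty if no output file was given
};

// Fills options from the positional arguments (program name excluded).
// Returns false if no input file was supplied.
bool parseCli(const std::vector<std::string> &args, CliOptions &options) {
    if (args.empty()) {
        return false;
    }
    options.inputFile = args[0];
    if (args.size() > 1) {
        options.model = args[1];
    }
    if (args.size() > 2) {
        options.outputFile = args[2];
    }
    return true;
}
```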
You can use the tool as a shared library (Prepro2010Lib) by:
LD_LIBRARY_PATH +=/usr/local/Preprocessor2010/
LD_LIBRARY_PATH +=/usr/local/tidy/
#include <QtCore>
#include "Actionizer/Actionizer.h"
QString textContent = "text for tagging!";
Actionizer Ac("english");
QString resultTeiP5 = Ac.getTeiXmlByTextString(textContent);
Additional functions (Actionizer.h) may be used by:
// Returns the TEI-XML representation
QString getTeiXmlByTextString(QString &inputText);
// Returns the lemma of a word
QString getLemmaOfWord(QString &word);
// Returns the PoS tag of a word
QString getWklOfWord(const QString &word);
// Returns lemma [0] and PoS tag [1] of a word
QVector<QString> getLemmaAndWklOfWord(QString &word);
// Returns the stem of a word
QString getStemOfWord(QString &word, QString &usedLanguage);
// Tokenizes the input, one token per entry
QVector<QString> getTokenizedVectorByTextString(QString &inputText);
// Adds sentence boundaries (<S>,</S>)
QVector<QString> getSentenizedVectorByTokenizedVector(QVector<QString> &tokenizedVector);
// Adds PoS tags to the tokenized and sentence-split vector (<S>,</S> must already be assigned!)
QVector<QString> getTaggedVectorByTokenizedVector(QVector<QString> &sentenicedVector);
// Adds lemma information to the tokenized and sentence-split vector (<S>,</S> must already be assigned!)
QVector<QString> getLemmatizedVectorByTokenizedVector(QVector<QString> &sentenicedVector);
// Returns the language name
QString detectLanguageByTextString(QString &inputText);
// Returns all lemmata [0] and PoS tags [1] of a word in a vector
QVector<QVector<QString>> getLemmaPosCombinationsByWordString(QString &word);
// Returns all entity classes of a word
QVector<QString> getEntityClassesByWordString(QString &word);
QString getTeiXmlByTextString(QString &inputText, QString corpusName, QString corpusSource, QString textId, QString textType, bool isCorpus, bool addAdditionalInfos );
QString inputText;
KnowledgeBase Kb;
QVector <QString> tokenizedVector;
// ************** Tokenization of Input-String and Document-Structure Analysis (tidy) ************** //
Tokenizer To;
To.getTokenizedInput(inputText,tokenizedVector, Kb);
// ************** Sentence Detection following Kiss-Strunk ************** //
Sentencer Se;
Se.getSentenicerInput(tokenizedVector, Kb);
// ************** Tagging following TnT-Tagger HMM ************** //
QVector <QString> taggedVector;
taggedVector.fill("",tokenizedVector.size());
QVector <QString> lemmatizedVector = tokenizedVector;
QVector <QString> entityVector;
entityVector.fill("",tokenizedVector.size());
QVector <QString> additionalInfoVector;
additionalInfoVector.fill("",tokenizedVector.size());
Tagger Tg(Kb);
Tg.getTaggedInput(tokenizedVector,
                  taggedVector,
                  lemmatizedVector,
                  entityVector,
                  Kb,
                  additionalInfoVector,
                  addAdditionalInfos
                  );
// ************** TEI-P5 Conversion ************** //
Outizer Ot;
QString teiXml = Ot.getTeiXml(tokenizedVector,
                              taggedVector,
                              lemmatizedVector,
                              entityVector,
                              additionalInfoVector,
                              textCounter,
                              textType,
                              textId,
                              corpusName,
                              corpusSource,
                              addAdditionalInfos
                              );
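The pipeline above passes parallel vectors (token, tag, lemma, entity) from step to step and serializes them at the end. As a rough illustration of that last step, the sketch below turns three parallel vectors into <w> elements using only the standard library. It is not the Outizer implementation: escapeXml and toWordElements are hypothetical names, the xd1_wo id prefix is copied from the examples above, and the subtype and function attributes are left out for brevity.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Escapes the characters that are special in XML text and attribute values.
std::string escapeXml(const std::string &text) {
    std::string out;
    for (char c : text) {
        switch (c) {
            case '&': out += "&amp;"; break;
            case '<': out += "&lt;"; break;
            case '>': out += "&gt;"; break;
            case '"': out += "&quot;"; break;
            default: out += c;
        }
    }
    return out;
}

// Builds one <w> element per token from the parallel vectors.
// Returns an empty string if the vectors differ in length.
std::string toWordElements(const std::vector<std::string> &tokens,
                           const std::vector<std::string> &tags,
                           const std::vector<std::string> &lemmata) {
    if (tags.size() != tokens.size() || lemmata.size() != tokens.size()) {
        return "";
    }
    std::string xml;
    for (std::size_t i = 0; i < tokens.size(); ++i) {
        xml += "<w xml:id=\"xd1_wo" + std::to_string(i + 1) + "\""
             + " type=\"#" + escapeXml(tags[i]) + "\""
             + " lemma=\"" + escapeXml(lemmata[i]) + "\">"
             + escapeXml(tokens[i]) + "</w>\n";
    }
    return xml;
}
```

Keeping the vectors index-aligned is what allows each module to add its own layer of annotation without touching the others.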
Prepro2010 is trainable on different languages (Actionizer.h). You need a tagged corpus/treebank in the format shown below: one wordform per line, with the wordform in the first column and the tag in the second. Sentence boundaries are marked by #BOS and #EOS. The output files of the training phase are written to a folder named modelName.
#BOS \t 1
In \t APPR
nova \t ADJ
fert \t V
animus \t N
mutatas \t V
dicere \t V
formas \t N
corpora \t N
#EOS \t 2
#BOS \t 3
...
Using this treebank format as input, you can train a model for any whitespace-separable language with the function below. Additionally, you can evaluate the performance of the model on an evaluation partition (same treebank format).
Actionizer Ac;
Ac.trainTagger("french", "tigerTreebank.tsv");
bool trainTagger(QString modelName, QString treebankFile);
bool evaluateTagger(QString treebankFile);
Copy the generated data into the resource folder of the library version (e.g. resources/french/tagger/). To extract a fullform lexicon from the database, use:
bool extractFullFormLexiconFromDB(QString host, QString dbName, QString user, QString pass, QString lexiconName);
Prof. Dr. Alexander Mehler
Fachbereich für Informatik und Mathematik
Goethe-Universität Frankfurt am Main
Robert-Mayer-Straße 10
D-60054 Frankfurt am Main
Postfach / P.O. Box: 154
Email: meh...@em.uni-frankfurt.de
Web: http://www.hucompute.org/team/21
Rüdiger Gleim
Fachbereich für Informatik und Mathematik
Goethe-Universität Frankfurt am Main
Robert-Mayer-Straße 10
D-60054 Frankfurt am Main
Postfach / P.O. Box: 154
Email: gle...@em.uni-frankfurt.de
Web: http://www.hucompute.org/team/29
Ulli Waltinger
University of Bielefeld
Faculty of Technology
Text Technology / Applied Computational Linguistics
ulli...@uni-bielefeld.de
Thorsten Brants (2000). TnT - A Statistical Part-of-Speech Tagger. In Proceedings of the Sixth Applied Natural Language Processing Conference (ANLP-2000), Seattle, WA.
H. Cunningham, D. Maynard, K. Bontcheva, and V. Tablan (2002). GATE: A Framework and Graphical Development Environment for Robust NLP Tools and Applications. In Proceedings of the 40th Anniversary Meeting of the Association for Computational Linguistics (ACL'02), Philadelphia, July 2002.
Ulli Waltinger and Alexander Mehler (2009). The Feature Difference Coefficient: Classification by Means of Feature Distributions. In Proceedings of Text Mining Services (TMS), March 23-25, 2009, Leipzig, Germany.
Ulli Waltinger and Alexander Mehler (2008). Who is it? Context sensitive named entity and instance recognition by means of Wikipedia. In Proceedings of the 2008 IEEE/WIC/ACM International Conference on Web Intelligence (WI-2008).
Alexander Mehler, Rüdiger Gleim, Alexandra Ernst, and Ulli Waltinger (2008). WikiDB: Building Interoperable Wiki-Based Knowledge Resources for Semantic Databases. In Sprache und Datenverarbeitung. International Journal for Language Data Processing.
Ulli Waltinger and Alexander Mehler (2009). Social Semantics And Its Evaluation By Means Of Semantic Relatedness And Open Topic Models. In Proceedings of the 2009 IEEE/WIC/ACM International Conference on Web Intelligence, Milan, Italy.
Ulli Waltinger, Irene Cramer, and Tonio Wandmacher (2009). From Social Networks To Distributional Properties: A Comparative Study On Computing Semantic Relatedness. In Proceedings of the Annual Meeting of the Cognitive Science Society (CogSci 2009), Amsterdam, NL.
Last changed: 25 June 2010, Ulli Waltinger