Primož Jakopin, Aleksandra Bizjak:

Part-of-speech tagging of Slovenian text

Abstract and extended summary of the article O strojno podprtem oblikoslovnem označevanju slovenskega besedila, published in Slavistična revija, Vol. 45/3-4, 1997, p. [513]-532.


Abstract

       The first POS tagger for texts in the Slovenian language is presented. It comprises a complete environment: the supporting software as well as the tagset, which is based on Slovenian grammar. Because the language is highly inflected, the tagset consists of 4.797 tags. A description of the tagger follows; it includes a two-step disambiguator. The first step relies on a database of previously processed sentences, which is searched for an unambiguously tagged immediate neighbourhood of the observed word. It is followed by a probabilistic step, in which the frequencies of tag n-tuples up to length 5 are taken into consideration.
       So far 330.000 words have been tagged: four novels and a one-month sample of the leading Slovenian newspaper Delo, a selection from which is available in electronic form on the Internet.

Summary

       Part-of-speech tagging has in the past year also spread into the domain of the so-called Central and Eastern European languages. It is the first step in text parsing and a prerequisite for further tasks such as quantitative linguistic analysis, machine translation, or the setting up of national text corpora accessible over the Internet (the BNC, British National Corpus, and the CNC, Czech National Corpus, for instance).
       As part of the preparations for founding the Slovenian National Corpus, the Slovenian POS tagger project was set in motion in September 1996 by the authors of this article: a lingware specialist from the Faculty of Arts in Ljubljana and a linguist from the Fran Ramovš Institute of the Slovenian Language at the Scientific Research Centre of the Slovenian Academy of Sciences and Arts.
       The tagger and the tagset were developed using the novel Pomladni dan by the 20th-century Slovenian writer Ciril Kosmač as a starting sample. Besides Slovenian grammar, several other sources were considered while assembling the tagset: the tagsets of the Brown corpus and the Penn Treebank corpus, and the tagset used in the MULTEXT/East project. The main criteria were the legibility of the tags and the minimal size required for disambiguation. The tags had to be short, derived from the Slovenian wording of linguistic terms, and self-explanatory to such an extent that they would be not only machine-readable but also acceptable to a human reader. The point is illustrated by the following sentence:

Seveda se  lahko motijo.  ("Certainly they may be wrong.")
Č      Gmp A     Gcp

The tag Č stands for particle, Gmp for the separate verbal morpheme, A for adverb and Gcp for verb, third person, plural.
       The tags for verbs are given in Table 1.

Table 1: VERB

Part of speech                        Type    Person   Gender   Number   Case   Example
main verb                             G       {a,b,c}  -        {e,d,p}  -      Gce (plava)
auxiliary verb to be in present       GP      {a,b,c}  -        {e,d,p}  -      GPce (je)
auxiliary verb to be in future        GFP     {a,b,c}  -        {e,d,p}  -      GFPcp (bodo)
negative form of the aux. verb to be  GZP     {a,b,c}  -        {e,d,p}  -      GZPae (nisem)
verb to be                            GO      {a,b,c}  -        {e,d,p}  -      GOae (sem)
negative form of the verb to be       GZO     {a,b,c}  -        {e,d,p}  -      GZOce (ni)
negative form of the verb to have     GZ      {a,b,c}  -        {e,d,p}  -      GZbe (nimaš)
imperative                            GV      {a,b,c}  -        {e,d,p}  -      GVbe (glej)
participle ending in -l               GL      -        {m,ž,s}  {e,d,p}  -      GLže (obrisala)
participle of the verb to be          GLB     -        {m,ž,s}  {e,d,p}  -      GLBme (bil)
participle ending in -n/-t            GN/GT   -        {m,ž,s}  {e,d,p}  {1,2}  GNme1 (rojen)
participle ending in -č/-ši           GČ/GŠI  -        -        -        -      (loveč)
infinitive                            GNE     -        -        -        -      GNE (povedati)
supine                                GNA     -        -        -        -      GNA (gledat)
conditional                           GBI     -        -        -        -      GBI (bi)
separate verbal morpheme              Gmp     -        -        -        -      Gmp (se)
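
The composition of the verb tags in Table 1 can be illustrated with a small decoder. This is a minimal Python sketch and not the authors' software: the function name decode_verb_tag and the dictionaries are hypothetical. The glosses for a, b, c (first, second, third person) and for e, p (singular, plural) are read off the examples in the text; d = dual, the gender glosses for m, ž, s, and the split of GN/GT and GČ/GŠI into separate codes are assumptions. Case values are kept as the numbers used in the table.

```python
# Attribute values as they appear in Table 1; the English glosses are assumptions.
PERSON = {"a": "1st person", "b": "2nd person", "c": "3rd person"}
NUMBER = {"e": "singular", "d": "dual", "p": "plural"}
GENDER = {"m": "masculine", "ž": "feminine", "s": "neuter"}
CASE = {"1": "case 1", "2": "case 2"}

PN = [PERSON, NUMBER]          # finite forms: person + number
GN = [GENDER, NUMBER]          # participles: gender + number

# Type code -> (description, attribute slots that follow the code).
VERB_TYPES = {
    "G": ("main verb", PN),
    "GP": ("auxiliary verb to be, present", PN),
    "GFP": ("auxiliary verb to be, future", PN),
    "GZP": ("negative auxiliary verb to be", PN),
    "GO": ("verb to be", PN),
    "GZO": ("negative verb to be", PN),
    "GZ": ("negative verb to have", PN),
    "GV": ("imperative", PN),
    "GL": ("participle in -l", GN),
    "GLB": ("participle of the verb to be", GN),
    "GN": ("participle in -n", GN + [CASE]),
    "GT": ("participle in -t", GN + [CASE]),
    "GČ": ("participle in -č", []),
    "GŠI": ("participle in -ši", []),
    "GNE": ("infinitive", []),
    "GNA": ("supine", []),
    "GBI": ("conditional", []),
    "Gmp": ("separate verbal morpheme", []),
}


def decode_verb_tag(tag):
    """Return (description, [attribute glosses]) for a verb tag from Table 1."""
    # Longer type codes are tried first; the attribute check rejects wrong splits.
    for code in sorted(VERB_TYPES, key=len, reverse=True):
        description, slots = VERB_TYPES[code]
        values = tag[len(code):]
        if (tag.startswith(code) and len(values) == len(slots)
                and all(v in slot for v, slot in zip(values, slots))):
            return description, [slot[v] for v, slot in zip(values, slots)]
    raise ValueError(f"not a verb tag from Table 1: {tag!r}")


print(decode_verb_tag("Gcp"))    # ('main verb', ['3rd person', 'plural'])
print(decode_verb_tag("GLže"))   # ('participle in -l', ['feminine', 'singular'])
```

Because every attribute is a single character taken from a small closed set, a tag can be read from left to right without separators, which is what keeps the tags short and legible.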

       The tagger, or rather the software supporting the tagging process, evolved from a simple aid for manual tagging into a two-step disambiguator by 1997. From the very beginning the tagger has been part of the text editor EVA and is therefore fully interactive. It is possible to tag just a single word or to proceed from a selected place in the text. The lexicon of words, their tags and tag frequencies is updated on the fly as a separate file, which can also be edited and adjusted when required. This file acts as a database from which various statistical tables are produced; they give the feedback needed for fine-tuning the tagger and the tagset.
       The disambiguator has two steps. The first is based on the history of tagged text: if a wordform in the lexicon has one and only one tag, that tag is taken as the right one; if not, the neighbourhood of the word is examined. If there exists a neighbourhood, 2 to 5 words long (including the word in question), that has a match in the history database with one and only one set of tags, the word is given its tag from this set; otherwise it is left untagged. The second step is probabilistic: the frequencies of all possible sets of tags for the immediate neighbourhood, this time without words (again 2 to 5 words deep), are considered. If there is one and only one such set for any of the possible neighbourhoods, the corresponding tag is given to the observed word; otherwise the word is again left untagged. Table 2 shows a sentence (by Ciril Kosmač) after the two phases of the disambiguator, as displayed on a computer screen.
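
The two steps described above can be sketched in Python. This is only an illustration under stated assumptions, not the EVA implementation: all function and variable names are hypothetical, longer neighbourhoods are tried before shorter ones, the second step requires the neighbouring words to be tagged already, and a tag n-tuple counts as "possible" when its frequency in the history is greater than zero. The toy history uses four fragments with tags as given in Table 15; the input fragments in the last two calls are made up for the example.

```python
from collections import Counter, defaultdict

MIN_N, MAX_N = 2, 5   # neighbourhood sizes named in the text


def windows(i, length):
    """Yield (start, end) for every window of 2 to 5 items that contains i."""
    for n in range(MAX_N, MIN_N - 1, -1):        # assumption: longest first
        for start in range(max(0, i - n + 1), min(i, length - n) + 1):
            yield start, start + n


def train(tagged_sentences):
    """Collect the lexicon, the word history and the tag n-tuple frequencies."""
    lexicon = defaultdict(Counter)   # wordform -> Counter of its tags
    word_hist = defaultdict(set)     # word n-tuple -> tag n-tuples seen with it
    tag_freq = Counter()             # tag n-tuple -> frequency
    for sent in tagged_sentences:
        words = tuple(w for w, _ in sent)
        tags = tuple(t for _, t in sent)
        for w, t in sent:
            lexicon[w][t] += 1
        for n in range(MIN_N, MAX_N + 1):
            for s in range(len(sent) - n + 1):
                word_hist[words[s:s + n]].add(tags[s:s + n])
                tag_freq[tags[s:s + n]] += 1
    return lexicon, word_hist, tag_freq


def step1(words, lexicon, word_hist):
    """History step: unique lexicon tag, else a uniquely tagged word window."""
    tags = [None] * len(words)
    for i, w in enumerate(words):
        known = lexicon.get(w, ())
        if len(known) == 1:
            tags[i] = next(iter(known))
            continue
        for s, e in windows(i, len(words)):
            seen = word_hist.get(tuple(words[s:e]), ())
            if len(seen) == 1:
                tags[i] = next(iter(seen))[i - s]
                break
    return tags


def step2(words, tags, lexicon, tag_freq):
    """Probabilistic step: keep a candidate only if it is the single one that
    completes an already observed tag n-tuple."""
    out = list(tags)
    for i, w in enumerate(words):
        if out[i] is not None:
            continue
        for s, e in windows(i, len(words)):
            ctx = out[s:e]
            if any(t is None for j, t in enumerate(ctx, s) if j != i):
                continue             # assumption: the neighbours must be tagged
            fits = [c for c in lexicon.get(w, ())
                    if tag_freq[tuple(ctx[:i - s] + [c] + ctx[i - s + 1:])] > 0]
            if len(fits) == 1:
                out[i] = fits[0]
                break
    return out


# Toy history: four fragments with tags as given in Table 15.
HISTORY = [
    [("se", "Gmp"), ("je", "GPce"), ("nekoč", "A"), ("namerilo", "GLse")],
    [("Kako", "ZV"), ("je", "GOce"), ("pri", "E5"), ("tebi", "ZObe5")],
    [("Tako", "ZKse1"), ("mi", "ZOae3"), ("je", "GPce")],
    [("Meni", "ZOae3"), ("se", "Gmp"), ("zdi", "Gce")],
]
lexicon, word_hist, tag_freq = train(HISTORY)


def disambiguate(words):
    return step2(words, step1(words, lexicon, word_hist), lexicon, tag_freq)


print(disambiguate(["se", "je", "nekoč"]))   # ['Gmp', 'GPce', 'A']
print(disambiguate(["Meni", "je"]))          # ['ZOae3', 'GPce']
print(disambiguate(["tebi", "je"]))          # ['ZObe5', None]
```

In the first call the ambiguous wordform "je" is resolved by the history step, because the word neighbourhood has been seen with a single set of tags. In the second call the word neighbourhood is new, but only one candidate tag completes a known tag pair, so the probabilistic step decides. In the third call neither step finds a unique answer and the word stays untagged, to be completed by hand.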

Table 2: Disambiguated sentence



An attempt to translate the sentence might produce the following: The spring has passed, the summer has passed, and the fall has come, the golden time of all golden, time of murmuring winds and ripe scents, time of huge clouds and unattainable horizons, time of sweet and gloomy unrest.
       In the sentence of Table 2, 25 of the 31 words are tagged, one of them incorrectly (24/31 = 77% hit rate).
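
The hit rate quoted for this sentence can be recomputed as follows. The sketch assumes, as the formula 24/31 suggests, that words left untagged count against the rate; the function name hit_rate is hypothetical.

```python
def hit_rate(tagged, errors, total):
    """Share of all words that were tagged correctly, in percent."""
    return 100.0 * (tagged - errors) / total


print(f"{hit_rate(25, 1, 31):.1f}%")   # 77.4%
```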
       So far 330.000 words have been tagged, manually completed and verified: four novels and a one-month sample of the leading Slovenian newspaper Delo, a selection from which (about 5% of the full text) is available in electronic form on the Internet. In Table 3 the samples are listed in chronological order. Some other data about the first three samples (the distribution of part-of-speech groups, for instance) are shown in the paper.

Table 3: The samples

Sample                            Size     Sentences  Words
1. Ciril Kosmač: Pomladni dan     176 pp.  5.922      61.565
2. Plato: Država                  317 pp.  7.323      93.430
3. Newspaper Delo (Internet)      43 days  2.956      53.895
4. Ciril Kosmač: Prazna ptičnica  139 pp.  2.079      26.656
5. George Orwell: 1984            229 pp.  6.684      90.760
Total                             -        24.964     326.306

       In the field of POS tagging of texts in the Slovenian language, the first steps have been made. A proven tagset, a disambiguator with a hit rate of 80%, and the supporting software have been established, the latter two incorporated into the authors' own text editor for effective and comfortable use. A database of tagged samples (330.000 words) is now complemented by a lexicon of 3.300.000 wordforms with POS tags, based on 93.000 lemmas from the Dictionary of the Slovenian Literary Language.
       The wish list for the future includes a bigger database (1 million tagged words) and a better disambiguator, with a hit rate of 90% or more.



Appendix: Remaining tables from the article:

Table 5: NOUNS

Part of speech                       Type  Gender   Number   Case           Example
common noun                          S     {m,ž,s}  {e,d,p}  {1,2,3,4,5,6}  Sme1 (dan)
gerund                               SG    {m,ž,s}  {e,d,p}  {1,2,3,4,5,6}  SGse2 (spoznanja)
proper noun of persons               IO    {m,ž,s}  {e,d,p}  {1,2,3,4,5,6}  IOme1 (Martin)
proper noun of inhabitants           IP    {m,ž,s}  {e,d,p}  {1,2,3,4,5,6}  IPme2 (Čeha)
divine names                         IV    {m,ž,s}  {e,d,p}  {1,2,3,4,5,6}  IVme3 (Bogu)
names of animals                     IŽ    {m,ž,s}  {e,d,p}  {1,2,3,4,5,6}  IŽže3 (Liski)
proper noun of places                IZ    {m,ž,s}  {e,d,p}  {1,2,3,4,5,6}  IZme2 (Črnomlja)
proper noun of mythological places   IM    {m,ž,s}  {e,d,p}  {1,2,3,4,5,6}  IMme5 (Hadu)
other names (institutions, books..)  IS    {m,ž,s}  {e,d,p}  {1,2,3,4,5,6}  ISže2 (Iliade)

Table 7: ADJECTIVES

Part of speech                                       Type    Gender   Number   Case           Degree    Definiteness  Example
adjective                                            P       {m,ž,s}  {e,d,p}  {1,2,3,4,5,6}  {0,j,jj}  {0,i}         Pme1i (pomladni)
participle ending in -l                              PL      {m,ž,s}  {e,d,p}  {1,2,3,4,5,6}  -         {0,i}         PLmp4 (uspele)
participle ending in -n/-t                           PN/PT   {m,ž,s}  {e,d,p}  {1,2,3,4,5,6}  -         {0,i}         PNme4i (zgrešeni)
participle ending in -č/-ši                          PČ/PŠI  {m,ž,s}  {e,d,p}  {1,2,3,4,5,6}  -         {0,i}         PČže2 (cvetoče)
predicative adjective                                PD      {m,ž,s}  {e,d,p}  -              {0,j,jj}  -             PDme (rad)
possessive adjectives from proper names of persons   PIO     {m,ž,s}  {e,d,p}  {1,2,3,4,5,6}  -         -             PIOžp4 (Andrejeve)
poss. adj. from proper names of inhabitants          PIP     {m,ž,s}  {e,d,p}  {1,2,3,4,5,6}  -         -             PIPme1 (Brikin)
poss. adj. from divine names                         PIV     {m,ž,s}  {e,d,p}  {1,2,3,4,5,6}  -         -             PIVse4 (Kronovo)
poss. adj. from proper names of places               PIZ     {m,ž,s}  {e,d,p}  {1,2,3,4,5,6}  -         -             PIZže5 (Krimski)
poss. adj. from proper names of mythological places  PIM     {m,ž,s}  {e,d,p}  {1,2,3,4,5,6}  -         -             PIMme5 (Hadovem)
poss. adj. from other names                          PIS     {m,ž,s}  {e,d,p}  {1,2,3,4,5,6}  -         -             PISže5 (Mohorjevi)

Table 8: PRONOUNS

Part of speech             Type  Attributes: Person, Gender, Number, Gender, Number, Case  Example
personal pronoun           ZO    {a,b,c}{m,ž,s,0}{e,d,p}{1,2,3,4,5,6}                      ZOcme5 (njem)
personal reflexive pro.    ZOP   {2,3,4,5,6}                                               ZOP2 (sebe)
possessive pronoun         ZSV   {a,b,c}{m,ž,s,0}{e,d,p}{m,ž,s}{e,d,p}{1,2,3,4,5,6}        ZSVaeme2 (mojega)
possessive reflexive pro.  ZSVP  {m,ž,s}{e,d,p}{2,3,4,5,6}                                 ZSVPme6 (svojim)
interrogative pro.         ZV    {m,ž,s}{e,d,p}{1,2,3,4,5,6}                               ZVse1 (kaj)
relative pronoun           ZR    {m,ž,s}{e,d,p}{1,2,3,4,5,6}                               ZRme2 (kakršnega)
negative pronoun           ZNI   {m,ž,s}{e,d,p}{1,2,3,4,5,6}                               ZRme2 (kakršnega)
indefinite pronoun         ZPO   {m,ž,s}{e,d,p}{1,2,3,4,5,6}                               ZPOme1 (kdo)
relative indefinite pro.   ZRPO  {m,ž,s}{e,d,p}{1,2,3,4,5,6}                               ZRPOme6 (komerkoli)
definite pronoun           ZNE   {m,ž,s}{e,d,p}{1,2,3,4,5,6}                               ZNEse1 (nekaj)
demonstrative pro.         ZD    {m,ž,s}{e,d,p}{1,2,3,4,5,6}                               ZDme5 (drugem)
general pronoun            ZT    {m,ž,s}{e,d,p}{1,2,3,4,5,6}                               ZTse1 (vse)
identity pronoun           ZI    {m,ž,s}{e,d,p}{1,2,3,4,5,6}                               ZIme4 (isti)
multitude pronoun          ZM    {m,ž,s}{e,d,p}{1,2,3,4,5,6}                               ZMse3 (marsičemu)
demonstrative pronoun      ZK    {m,ž,s}{e,d,p}{1,2,3,4,5,6}                               ZKmp1 (ti)
emphasized pronoun         ZPU   {m,ž,s}{e,d,p}{1,2,3,4,5,6}                               ZPUže1 (sama)
conjunctional pronoun      ZVR   -                                                         ZVR (ki)

Table 9: NUMERALS

Part of speech      Type  Gender   Number   Case           Example
cardinal numeral    ŠG    {m,ž,s}  {e,d,p}  {1,2,3,4,5,6}  ŠGže3 (petintridesetim)
ordinal numeral     ŠV    {m,ž,s}  {e,d,p}  {1,2,3,4,5,6}  ŠVme2 (prvega)
separating numeral  ŠL    {m,ž,s}  {e,d,p}  {1,2,3,4,5,6}  ŠLse6 (dvojim)
multiple numeral    ŠM    {m,ž,s}  {e,d,p}  {1,2,3,4,5,6}  ŠMme5 (trojnih)
quantifiers         ŠNE   -        -        -              ŠNE (nekaj)
number              Š     -        -        -              Š (77.500,00.-)

Table 10: OTHER MORPHOLOGICAL CATEGORIES

Part of speech          Type  Relationship  Degree    Case         Example
adverb                  A     -             {0,j,jj}  -            A (resnično)
particle                Č     -             -         -            Č (kar)
negative particle       ČZ    -             -         -            ČZ (ne)
conjunctional particle  ČV    -             -         -            ČV (ali)
preposition             E     -             -         {2,3,4,5,6}  E2 (iz)
conjunction             V     {pr,po}       -         -            Vpr (in)
interjection            M     -             -         -            M (oh)

Table 11: ABBREVIATIONS

Part of speech           Type  Gender     Number     Case             Example
lower case abbreviation  K     -          -          -                K (št.)
upper case abbreviation  KI    {0,m,ž,s}  {0,e,d,p}  {0,1,2,3,4,5,6}  KI (ŠTUNFF)
www                      KURL  -          -          -                KURL (http://www.delo.si)

Table 12: POS TAGS BY MORPHOLOGICAL CATEGORIES

Category       Pomladni dan  Država  Delo    Total
common nouns   10.173        17.488  14.078  41.739
proper nouns   1.206         668     4.269   6.143
verbs          20.269        21.498  9.297   51.064
adjectives     4.306         7.114   6.008   17.428
pronouns       7.065         14.899  2.736   24.700
numerals       399           1.363   3.188   4.950
adverbs        3.625         5.254   1.908   10.787
particles      3.590         5.536   1.557   10.683
prepositions   5.269         8.215   6.038   19.522
conjunctions   5.377         11.368  3.398   20.143
interjections  274           18      -       292
abbreviations  12            9       1.418   1.439

Table 13: POS TAGS BY MORPHOLOGICAL CATEGORIES (%)

Category       Pomladni dan  Država  Delo    Total
common nouns   16.52         18.72   26.12   19.98
proper nouns   1.96          0.71    7.92    2.94
verbs          32.92         23.02   17.25   24.46
adjectives     6.99          7.61    11.15   8.34
pronouns       11.48         15.95   5.08    11.82
numerals       0.65          1.46    5.92    2.37
adverbs        5.89          5.62    3.54    5.16
particles      5.83          5.93    2.89    5.11
prepositions   8.56          8.79    11.20   9.35
conjunctions   8.73          12.17   6.30    9.64
interjections  0.45          -       -       0.14
abbreviations  -             -       2.63    0.69
All together                                 100.00
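
The percentages of Table 13 follow from the counts of Table 12: each count is divided by the total of its column, and the last column by the total over all three samples. The short Python sketch below recomputes them. The names COUNTS and shares are hypothetical, the dash for interjections in Delo is taken as zero, and the recomputed values may differ from the printed ones in the last digit.

```python
# Tag counts per category for Pomladni dan, Država and Delo (Table 12).
COUNTS = {
    "common nouns": (10173, 17488, 14078),
    "proper nouns": (1206, 668, 4269),
    "verbs": (20269, 21498, 9297),
    "adjectives": (4306, 7114, 6008),
    "pronouns": (7065, 14899, 2736),
    "numerals": (399, 1363, 3188),
    "adverbs": (3625, 5254, 1908),
    "particles": (3590, 5536, 1557),
    "prepositions": (5269, 8215, 6038),
    "conjunctions": (5377, 11368, 3398),
    "interjections": (274, 18, 0),     # assumption: "-" means zero
    "abbreviations": (12, 9, 1418),
}


def shares(counts):
    """Return {category: [share per sample..., share of the whole]} in percent."""
    samples = len(next(iter(counts.values())))
    col_totals = [sum(row[i] for row in counts.values()) for i in range(samples)]
    grand_total = sum(col_totals)
    table = {}
    for name, row in counts.items():
        per_sample = [round(100 * c / t, 2) for c, t in zip(row, col_totals)]
        table[name] = per_sample + [round(100 * sum(row) / grand_total, 2)]
    return table


print(shares(COUNTS)["common nouns"])   # [16.52, 18.72, 26.12, 19.98]
```

The column totals obtained this way (61.565, 93.430 and 53.895) equal the word counts of the first three samples in Table 3.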

Table 14: THE WORDS BEGINNING IN žive- IN THE SET OF WORDS AND TAGS

žive·Gcp,1;Pže2,5;Pžp1,4;Pžp4,2;Smp4,2
živega·Pme2,2;Pse2,1;Sse2,3
živel·GLme,13
živela·GLže,12
živele·GLžp,2
živeli·GLmp,4
živem·Pme5,1;Pse5,2
živemu·Pme3,1;Sse3,1
živeti·GNE,18
živeče·PČmp4,1
živečih·PČmp2,1;Smp2,1
živečimi·PČsp6,2
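
The entries of Table 14 appear to follow the pattern wordform, separator, then a list of tag and frequency pairs divided by semicolons. The Python sketch below reads such lines under that assumption; the function name parse_entry is hypothetical and the real file format of the editor may differ in details.

```python
SEP = "·"   # separator between wordform and tag list, as printed in Table 14


def parse_entry(line):
    """Split 'wordform·tag,freq;tag,freq' into (wordform, {tag: freq})."""
    word, _, rest = line.partition(SEP)
    tags = {}
    for item in rest.split(";"):
        tag, _, freq = item.rpartition(",")
        tags[tag] = int(freq)
    return word, tags


entries = dict(parse_entry(line) for line in [
    "žive·Gcp,1;Pže2,5;Pžp1,4;Pžp4,2;Smp4,2",
    "živel·GLme,13",
    "živem·Pme5,1;Pse5,2",
])

for word, tags in entries.items():
    status = "unambiguous" if len(tags) == 1 else "ambiguous"
    print(word, status, max(tags, key=tags.get))
# žive ambiguous Pže2
# živel unambiguous GLme
# živem ambiguous Pse5
```

A wordform with a single tag, such as živel, is the case that the first disambiguation step settles directly from the lexicon; the others need their neighbourhood to be examined.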

Table 15: An example of tagged text (Plato's Republic, Book I)

3. Po resnici in  odkrito, pri Zevsu, ti    bom   povedal svoje
Š  E5 Sže5    Vpr A        E5  IVme5  ZObe3 GFPae GLme    ZSVPse4
mnenje, dragi Sokrat. Pogosto se  sestajamo možje istih let  in
Sse4    Pme1  IOme1   A       Gmp Gap       Smp1  ZIsp2 Ssp2 Vpr
potrjujemo pravilnost starega pregovora. 4 Ko  se  pogovarjamo,
Gap        Sže4       Pme2    Sme2       Š Vpo Gmp Gap
skoraj vsi   tarnajo in  se  z  otožnostjo spominjajo mladostnih
A      ZTmp1 Gcp     Vpr Gmp E6 Sže6       Gcp        Pžp2
radosti, ljubezni, pitja in  gostij ter vsega drugega, kar je   s
Sžp2     Sžp2      Sse2  Vpr Sžp2   Vpr ZTse2 ZDse2    ZVR GOce E6
tem   v  zvezi; pri tem   so   nejevoljni, kakor da  bi  bili  oropani
ZKse6 E5 Sže5   E5  ZKse5 GPcp Pmp1        Vpo   Vpo GBI GLBmp GNmp1
velikih stvari in  bi  bili  nekoč imenitno živeli, zdaj pa  le še
Pžp2    Sžp2   Vpr GBI GLBmp A     A        GLmp    A    Vpr Č  Č
životarili. Nekateri se  tudi pritožujejo, da  svojci z  njimi  -
GLmp        ZNEmp1   Gmp Č    Gcp          Vpo Smp1   E6 ZOcmp6
ker so   stari - grdo ravnajo, in  pri tem   ubirajo žalostinke o
Vpo GPcp Pmp1    A    Gcp      Vpr E5  ZKse5 Gcp     Sžp4       E5
nadlogah, ki  jih    je   kriva starost. Meni  se  zdi, Sokrat, da  ti
Sžp5      ZVR ZOcžp2 GPce Pže1  Sže1     ZOae3 Gmp Gce  IOme1   Vpo ZKmp1
ne obtožujejo pravega krivca, kajti ko  bi  tega  bila  kriva
ČZ Gcp        Pme2    Sme2    Vpr   Vpo GBI ZKse2 GLBže Pže1
starost, bi  imeli jaz   in  vsi   drugi moji     starostni vrstniki iste
Sže1     GBI GLmp  ZOae1 Vpr ZTmp1 ZDmp1 ZSVaemp1 Pmp1      Smp1     ZIžp4
težave. Naletel pa sem  že na mnoge, ki  se  ne počutijo slabo,
Sžp4    GLme    Č  GPae Č  E4 Pmp4   ZVR Gmp ČZ Gcp      A
tako se  je   nekoč namerilo, da  sem  bil   ravno pri pesniku
ZK   Gmp GPce A     GLse      Vpo GOae GLBme A     E5  Sme5
Sofoklu, ko  ga     je   nekdo  vprašal: "Kako je   pri tebi, Sofokles, z
IOme5    Vpo ZOcme4 GPce ZNEme1 GLme     ZV    GOce E5  ZObe5 IOme1     E6
ljubeznijo? Ali še lahko občuješ z  žensko?" - "Molči, človek!"
Sže6        Č   Č  A     Gbe     E6 Sže6       GVbe    Sme1
je   odvrnil pesnik. "Vesel sem, da  je   to    za mano. Tako  mi    je,
GPce GLme    Sme1    Pme1   GPae Vpo GOce ZKse1 E6 ZOae6 ZKse1 ZOae3 GPce
kakor da  bi  pobegnil divjemu, pobesnelemu gospodarju." Te
Vpo   Vpo GBI GLme     Pme3     PLme3       Sme3         ZKžp4
besede so   mi    že takrat ugajale in  nič manj mi    ne ugajajo danes.
Sžp4   GPcp ZOae3 Č  A      GLžp    Vpr ZNI Aj   ZOae3 ČZ Gcp     A



Page first posted by P. Jakopin on 17 November 1998. Renewed 1 June 2001. Date of last change: 3 June.

URL: http://bos.zrc-sazu.si/bibliografija/POS_tagging.html