Primož Jakopin, Aleksandra Bizjak:

Part-of-speech tagging of Slovenian text

Abstract and extended summary of the article O strojno podprtem oblikoslovnem označevanju slovenskega besedila, published in Slavistična revija, Vol. 45/3-4, 1997, p. [513]-532.


Abstract

       The first part-of-speech (POS) tagger for Slovenian texts is presented. It comprises a complete environment: the supporting software as well as a tagset based on Slovenian grammar. Because the language is highly inflected, the tagset consists of 4,797 tags. A description of the tagger follows; it includes a two-step disambiguator. The first step relies on a database of previously processed sentences, which is searched for an unambiguously tagged immediate neighbourhood of the observed word. The second step is a probabilistic tagger that takes into account the frequencies of tag n-tuples of up to five tags.
       So far 330,000 words have been tagged: four novels and a one-month sample of the leading Slovenian newspaper Delo, a selection of which is available in electronic form on the Internet.

Summary

       In the past year part-of-speech tagging has also spread to the languages of Central and Eastern Europe. It is the first step in text parsing and a prerequisite for further quantitative linguistic analysis and for tasks such as machine translation or the construction of national text corpora accessible over the Internet, for instance the British National Corpus (BNC) and the Czech National Corpus (CNC).
       As part of the preparations for founding the Slovenian National Corpus, the Slovenian POS tagger project was launched in September 1996 by the authors of this article: a lingware specialist from the Faculty of Arts in Ljubljana and a linguist from the Fran Ramovš Institute of the Slovenian Language at the Scientific Research Centre of the Slovenian Academy of Sciences and Arts.
       The tagger and the tagset were developed using the novel Pomladni dan by the 20th-century Slovenian writer Ciril Kosmač as a starting sample. Besides Slovenian grammar, several other sources were consulted while assembling the tagset: the tagsets of the Brown corpus and the Penn Treebank corpus, and the tagset used in the MULTEXT-East project. The main criteria were the legibility of the tags and the minimal size required for disambiguation. The tags had to be short, derived from the Slovenian wording of linguistic terms, and self-explanatory enough to be not only machine-readable but also acceptable to a human reader. The following sentence illustrates the point:

Seveda se  lahko motijo. ("Certainly they may be wrong.")
Č      Gmp A     Gcp

The tag Č stands for a particle, Gmp for a separate verbal morpheme, A for an adverb, and Gcp for a verb in the third person plural.
       The tags for verbs are given in Table 1.

Table 1: VERB

Part of speech                        Type    Person   Gender   Number   Case   Example
main verb                             G       {a,b,c}           {e,d,p}         Gce (plava)
auxiliary verb to be in present       GP      {a,b,c}           {e,d,p}         GPce (je)
auxiliary verb to be in future        GFP     {a,b,c}           {e,d,p}         GFPcp (bodo)
negative form of the aux. verb to be  GZP     {a,b,c}           {e,d,p}         GZPae (nisem)
verb to be                            GO      {a,b,c}           {e,d,p}         GOae (sem)
negative form of the verb to be       GZO     {a,b,c}           {e,d,p}         GZOce (ni)
negative form of the verb to have     GZ      {a,b,c}           {e,d,p}         GZbe (nimaš)
imperative                            GV      {a,b,c}           {e,d,p}         GVbe (glej)
participle ending in -l               GL               {m,ž,s}  {e,d,p}         GLže (obrisala)
participle of the verb to be          GLB              {m,ž,s}  {e,d,p}         GLBme (bil)
participle ending in -n/-t            GN/GT            {m,ž,s}  {e,d,p}  {1,2}  GNme1 (rojen)
participle ending in -č/-ši           GČ/GŠI                                    (loveč)
infinitive                            GNE                                       GNE (povedati)
supine                                GNA                                       GNA (gledat)
conditional                           GBI                                       GBI (bi)
separate verbal morpheme              Gmp                                       Gmp (se)
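
To illustrate how the finite-verb tags of Table 1 are composed, the following minimal Python sketch splits such a tag into its type, person and number. The reading of the person codes a, b, c as first, second, third and of the number codes e, d, p as singular, dual, plural is an assumption inferred from the examples (Gcp is glossed above as third person plural); the function name and data layout are hypothetical and not part of the original software.

```python
# Finite-verb type prefixes taken from Table 1.
VERB_TYPES = {
    "GFP": "auxiliary verb to be in future",
    "GZP": "negative form of the aux. verb to be",
    "GZO": "negative form of the verb to be",
    "GP": "auxiliary verb to be in present",
    "GO": "verb to be",
    "GZ": "negative form of the verb to have",
    "GV": "imperative",
    "G": "main verb",
}
# Assumed readings of the one-letter codes (inferred from the examples).
PERSON = {"a": "first", "b": "second", "c": "third"}
NUMBER = {"e": "singular", "d": "dual", "p": "plural"}


def decode_finite_verb_tag(tag):
    """Return type, person and number of a finite-verb tag, or None."""
    # Try the longest prefixes first so that GFP is not read as G + "FP".
    for prefix in sorted(VERB_TYPES, key=len, reverse=True):
        rest = tag[len(prefix):]
        if (tag.startswith(prefix) and len(rest) == 2
                and rest[0] in PERSON and rest[1] in NUMBER):
            return {
                "type": VERB_TYPES[prefix],
                "person": PERSON[rest[0]],
                "number": NUMBER[rest[1]],
            }
    return None  # not a finite-verb tag (e.g. Gmp, GNE, participles)


print(decode_finite_verb_tag("Gcp"))
print(decode_finite_verb_tag("GZPae"))
```

Tags that do not follow the type + person + number pattern, such as Gmp or GNE, are deliberately left undecoded by this sketch.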

       The tagger, or rather the software supporting the tagging process, evolved from a simple aid for manual tagging into a two-step disambiguator in 1997. From the very beginning the tagger has been part of the text editor EVA and is therefore fully interactive. It is possible to tag a single word or to proceed from a selected place in the text. The lexicon of words, their tags and tag frequencies is updated on the fly in a separate file, which can also be edited and adjusted when required. This file acts as a database from which various statistical tables are produced; these give the feedback needed for fine-tuning the tagger and the tagset.
       The disambiguator has two steps. The first is based on the history of tagged text: if a wordform has one and only one tag in the lexicon, that tag is taken as the right one; if not, the neighbourhood of the word is examined. If a neighbourhood of 2 to 5 words (including the word in question) has a match in the history database with one and only one set of tags, the word is given its tag from this set; otherwise it is left untagged. The second step is probabilistic: it considers the frequencies of all possible sets of tags for the immediate neighbourhood, this time without the words (again 2 to 5 words deep). If there is one and only one such set for any of the possible neighbourhoods, the corresponding tag is given to the observed word; otherwise the word is again left untagged. Table 2 shows a sentence by Ciril Kosmač after the two phases of the disambiguator, as displayed on a computer screen.

Table 2: Disambiguated sentence



An attempt to translate the sentence might produce the following: The spring has passed, the summer has passed, and the fall has come, the golden time of all golden, time of murmuring winds and ripe scents, time of huge clouds and unattainable horizons, time of sweet and gloomy unrest.
       In the sentence of Table 2, 25 of the 31 words are tagged, one of them incorrectly (24/31 = 77% hit rate).
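
The two disambiguation steps described above can be sketched roughly as follows. This is a simplified Python illustration, not the code of the EVA editor: all function names are hypothetical, the order in which neighbourhoods are tried (longest first) is an assumption, and a set of tags counts as "possible" here simply when it has been seen at least once. The training data are short fragments from the tagged sample in Table 15; the two-word input "ko ti" is constructed for the demonstration.

```python
from collections import defaultdict
from itertools import product


def windows(i, length, n):
    """Start offsets of all n-word windows that contain position i."""
    return range(max(0, i - n + 1), min(i, length - n) + 1)


def build_tables(tagged_sentences):
    """Collect the lexicon, the word-window history and tag-tuple counts."""
    lexicon = defaultdict(set)     # wordform -> tags seen with it
    history = defaultdict(set)     # word n-tuple -> tag n-tuples seen with it
    tag_counts = defaultdict(int)  # tag n-tuple -> frequency
    for sentence in tagged_sentences:
        words = [w for w, _ in sentence]
        tags = [t for _, t in sentence]
        for w, t in sentence:
            lexicon[w].add(t)
        for n in range(2, 6):  # neighbourhoods of 2 to 5 words
            for s in range(len(words) - n + 1):
                history[tuple(words[s:s + n])].add(tuple(tags[s:s + n]))
                tag_counts[tuple(tags[s:s + n])] += 1
    return lexicon, history, tag_counts


def step_one(words, i, lexicon, history):
    """History step: unique lexicon tag, else a uniquely tagged word window."""
    known = lexicon.get(words[i], set())
    if len(known) == 1:
        return next(iter(known))
    for n in range(5, 1, -1):  # assumption: longest window first
        for s in windows(i, len(words), n):
            seen = history.get(tuple(words[s:s + n]), set())
            if len(seen) == 1:
                return next(iter(seen))[i - s]
    return None


def step_two(words, i, lexicon, tag_counts):
    """Tag-tuple step: accept a window with exactly one attested tag tuple."""
    for n in range(5, 1, -1):
        for s in windows(i, len(words), n):
            options = [lexicon.get(w, set()) for w in words[s:s + n]]
            if not all(options):  # some word in the window is unknown
                continue
            attested = [c for c in product(*options) if tag_counts.get(c, 0) > 0]
            if len(attested) == 1:
                return attested[0][i - s]
    return None


def disambiguate(words, i, lexicon, history, tag_counts):
    """Return a tag for words[i], or None if the word stays untagged."""
    return (step_one(words, i, lexicon, history)
            or step_two(words, i, lexicon, tag_counts))


training = [
    [("pri", "E5"), ("Zevsu", "IVme5"), ("ti", "ZObe3"), ("bom", "GFPae"), ("povedal", "GLme")],
    [("Sokrat", "IOme1"), ("da", "Vpo"), ("ti", "ZKmp1"), ("ne", "ČZ")],
    [("ko", "Vpo"), ("bi", "GBI")],
]
lexicon, history, tag_counts = build_tables(training)
print(disambiguate(["pri", "Zevsu", "ti", "bom"], 2, lexicon, history, tag_counts))
print(disambiguate(["ko", "ti"], 1, lexicon, history, tag_counts))
```

In the first call the ambiguous wordform "ti" is resolved by the history step, because the four-word neighbourhood has been seen with a single set of tags. In the second call the word pair is new, so the decision falls to the tag-tuple step.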
       So far 330,000 words have been tagged, manually completed and verified: four novels and a one-month sample of the leading Slovenian newspaper Delo, a selection of which (about 5% of the full text) is available in electronic form on the Internet. Table 3 lists the samples in chronological order. Further data about the first three samples (the distribution of part-of-speech groups, for instance) are given in the paper.

Table 3: The samples

Sample name                         Size       Sentences     Words
1. Ciril Kosmač: Pomladni dan       176 pp.        5,922    61,565
2. Platon: Država                   317 pp.        7,323    93,430
3. Newspaper Delo (Internet)        43 days        2,956    53,895
4. Ciril Kosmač: Prazna ptičnica    139 pp.        2,079    26,656
5. George Orwell: 1984              229 pp.        6,684    90,760
Total                                             24,964   326,306

       The first steps have been made in the POS tagging of Slovenian texts. A proven tagset, a disambiguator with a hit rate of 80%, and the supporting software are in place, the latter two built into the authors' own text editor for effective and comfortable use. The database of tagged samples (330,000 words) is now complemented by a lexicon of 3,300,000 wordforms with POS tags, based on 93,000 lemmas from the Dictionary of the Slovenian Literary Language.
       The wish list for the future includes a larger database (1 million tagged words) and a better disambiguator, with a hit rate of 90% or more.



Appendix: Remaining tables from the article:

TABLE 5: NOUNS

Type Gender Number Case Example
common noun S {m,ž,s} {e,d,p} {1,2,3,4,5,6} Sme1 (dan)
gerund SG {m,ž,s} {e,d,p} {1,2,3,4,5,6} SGse2 (spoznanja)
proper noun of persons IO {m,ž,s} {e,d,p} {1,2,3,4,5,6} IOme1 (Martin)
proper noun of inhabitants IP {m,ž,s} {e,d,p} {1,2,3,4,5,6} IPme2 (Čeha)
divine names IV {m,ž,s} {e,d,p} {1,2,3,4,5,6} IVme3 (Bogu)
names of animals IŽ {m,ž,s} {e,d,p} {1,2,3,4,5,6} IŽže3 (Liski)
proper noun of places IZ {m,ž,s} {e,d,p} {1,2,3,4,5,6} IZme2 (Črnomlja)
proper noun of mythological places IM {m,ž,s} {e,d,p} {1,2,3,4,5,6} IMme5 (Hadu)
other names (institutions, books, etc.) IS {m,ž,s} {e,d,p} {1,2,3,4,5,6} ISže2 (Iliade)

TABLE 7: ADJECTIVES

Type Gender Number Case Degree Definiteness Example
adjective P {m,ž,s}{e,d,p}{1,2,3,4,5,6}{/,j,jj} {/,i} Pme1i (pomladni)
participle ending in -l PL {m,ž,s}{e,d,p}{1,2,3,4,5,6} {/,i} PLmp4 (uspele)
participle ending in -n/-t PN/PT {m,ž,s}{e,d,p}{1,2,3,4,5,6} {/,i} PNme4i (zgrešeni)
participle ending in -č/-ši PČ/PŠI {m,ž,s}{e,d,p}{1,2,3,4,5,6} {/,i} PČže2 (cvetoče)
predicative adjective PD {m,ž,s}{e,d,p} {/,j,jj} PDme (rad)
possessive adjectives from proper names of persons PIO {m,ž,s}{e,d,p}{1,2,3,4,5,6} PIOžp4 (Andrejeve)
poss. adj. from proper names of inhabitants PIP {m,ž,s}{e,d,p}{1,2,3,4,5,6} PIPme1 (Brikin)
poss. adj. from divine names PIV {m,ž,s}{e,d,p}{1,2,3,4,5,6} PIVse4 (Kronovo)
poss. adj. from proper names of places PIZ {m,ž,s}{e,d,p}{1,2,3,4,5,6} PIZže5 (Krimski)
poss. adj. from proper names of mythological places PIM {m,ž,s}{e,d,p}{1,2,3,4,5,6} PIMme5 (Hadovem)
poss. adj. from other names PIS {m,ž,s}{e,d,p}{1,2,3,4,5,6} PISže5 (Mohorjevi)

TABLE 8: PRONOUNS

Type Person Gender Number Gender Number Case Example
personal pronoun ZO {a,b,c}{m,ž,s,/}{e,d,p} {1,2,3,4,5,6} ZOcme5 (njem)
personal reflexive pro. ZOP {2,3,4,5,6} ZOP2 (sebe)
possessive pronoun ZSV {a,b,c}{m,ž,s,/}{e,d,p}{m,ž,s}{e,d,p}{1,2,3,4,5,6} ZSVaeme2 (mojega)
possessive reflexive pro. ZSVP {m,ž,s}{e,d,p}{2,3,4,5,6} ZSVPme6 (svojim)
interrogative pro. ZV {m,ž,s}{e,d,p}{1,2,3,4,5,6} ZVse1 (kaj)
relative pronoun ZR {m,ž,s}{e,d,p}{1,2,3,4,5,6} ZRme2 (kakršnega)
negative pronoun ZNI {m,ž,s}{e,d,p}{1,2,3,4,5,6} ZRme2 (kakršnega)
indefinite pronoun ZPO {m,ž,s}{e,d,p}{1,2,3,4,5,6} ZPOme1 (kdo)
relative indefinite pro. ZRPO {m,ž,s}{e,d,p}{1,2,3,4,5,6} ZRPOme6 (komerkoli)
definite pronoun ZNE {m,ž,s}{e,d,p}{1,2,3,4,5,6} ZNEse1 (nekaj)
demonstrative pro. ZD {m,ž,s}{e,d,p}{1,2,3,4,5,6} ZDme5 (drugem)
general pronoun ZT {m,ž,s}{e,d,p}{1,2,3,4,5,6} ZTse1 (vse)
identity pronoun ZI {m,ž,s}{e,d,p}{1,2,3,4,5,6} ZIme4 (isti)
multitude pronoun ZM {m,ž,s}{e,d,p}{1,2,3,4,5,6} ZMse3 (marsičemu)
demonstrative pronoun ZK {m,ž,s}{e,d,p}{1,2,3,4,5,6} ZKmp1 (ti)
emphasized pronoun ZPU {m,ž,s}{e,d,p}{1,2,3,4,5,6} ZPUže1 (sama)
conjunctional pronoun ZVR ZVR (ki)

TABLE 9: NUMERALS

Type Gender Number Case Example
cardinal numeral ŠG {m,ž,s}{e,d,p} {1,2,3,4,5,6} ŠGže3 (petintridesetim)
ordinal numeral ŠV {m,ž,s}{e,d,p} {1,2,3,4,5,6} ŠVme2 (prvega)
separating numeral ŠL {m,ž,s}{e,d,p} {1,2,3,4,5,6} ŠLse6 (dvojim)
multiple numeral ŠM {m,ž,s}{e,d,p} {1,2,3,4,5,6} ŠMme5 (trojnih)
quantifiers ŠNE ŠNE (nekaj)
number Š Š (77.500,00.-)

TABLE 10: OTHER MORPHOLOGICAL CATEGORIES

Type Relationship Degree Case Example
adverb A {/,j,jj} A (resnično)
particle Č Č (kar)
negative particle ČZ ČZ (ne)
conjunctional particle ČV ČV (ali)
preposition E {2,3,4,5,6} E2 (iz)
conjunction V {pr,po} Vpr (in)
interjection M M (oh)

TABLE 11: ABBREVIATIONS

Type Gender Number Case Example
lower case abbreviation K K (št.)
upper case abbreviation KI {/,m,ž,s}{/,e,d,p} {/,1,2,3,4,5,6} KI (ŠTUNFF)
www KURL KURL (http://www.delo.si)

TABLE 12: POS TAGS BY MORPHOLOGICAL CATEGORIES

                 Pomladni dan    Država      Delo     Total
common nouns           10,173    17,488    14,078    41,739
proper nouns            1,206       668     4,269     6,143
verbs                  20,269    21,498     9,297    51,064
adjectives              4,306     7,114     6,008    17,428
pronouns                7,065    14,899     2,736    24,700
numerals                  399     1,363     3,188     4,950
adverbs                 3,625     5,254     1,908    10,787
particles               3,590     5,536     1,557    10,683
prepositions            5,269     8,215     6,038    19,522
conjunctions            5,377    11,368     3,398    20,143
interjections             274        18         -       292
abbreviations              12         9     1,418     1,439

TABLE 13: POS TAGS BY MORPHOLOGICAL CATEGORIES (%)

                 Pomladni dan    Država      Delo     Total
common nouns            16.52     18.72     26.12     19.98
proper nouns             1.96      0.71      7.92      2.94
verbs                   32.92     23.02     17.25     24.46
adjectives               6.99      7.61     11.15      8.34
pronouns                11.48     15.95      5.08     11.82
numerals                 0.65      1.46      5.92      2.37
adverbs                  5.89      5.62      3.54      5.16
particles                5.83      5.93      2.89      5.11
prepositions             8.56      8.79     11.20      9.35
conjunctions             8.73     12.17      6.30      9.64
interjections            0.45         -         -      0.14
abbreviations               -         -      2.63      0.69
All together                                         100.00
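
The percentages above follow from dividing each category count by the number of words in the sample. As a small worked check, the Python sketch below recomputes three values for Pomladni dan, taking the sample size of 61,565 words from Table 3 and the category counts from Table 12; the helper name is hypothetical.

```python
# Number of words in the sample "Pomladni dan" (Table 3).
TOTAL_WORDS = 61565

# Three of the category counts for the same sample (Table 12).
counts = {"common nouns": 10173, "proper nouns": 1206, "verbs": 20269}


def share(count, total):
    """Percentage of `total` represented by `count`, rounded to two decimals."""
    return round(100 * count / total, 2)


shares = {name: share(n, TOTAL_WORDS) for name, n in counts.items()}
print(shares)
```

The results, 16.52, 1.96 and 32.92, agree with the first column of the table.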

TABLE 14: THE WORDS BEGINNING IN žive- IN THE SET OF WORDS AND TAGS

žive ~ Gcp,1;Pže2,5;Pžp1,4;Pžp4,2;Smp4,2
živega ~ Pme2,2;Pse2,1;Sse2,3
živel ~ GLme,13
živela ~ GLže,12
živele ~ GLžp,2
živeli ~ GLmp,4
živem ~ Pme5,1;Pse5,2
živemu ~ Pme3,1;Sse3,1
živeti ~ GNE,18
živeče ~ PČmp4,1
živečih ~ PČmp2,1;Smp2,1
živečimi ~ PČsp6,2
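
Entries of this kind are easy to read by program. The following Python sketch parses one line into the wordform and a mapping from tag to frequency. It assumes that each item after the tilde is a tag followed by a comma and its frequency, which matches the description of the lexicon as holding words, their tags and tag frequencies; the function name is hypothetical.

```python
def parse_lexicon_line(line):
    """Split 'word ~ tag,freq;tag,freq;...' into (word, {tag: freq})."""
    word, _, rest = line.partition("~")
    tags = {}
    for item in rest.strip().split(";"):
        # The frequency follows the last comma of each item.
        tag, _, freq = item.rpartition(",")
        tags[tag] = int(freq)
    return word.strip(), tags


word, tags = parse_lexicon_line("živega ~ Pme2,2;Pse2,1;Sse2,3")
print(word, tags)

# The most frequent tag of an ambiguous wordform.
print(max(tags, key=tags.get))
```

For the wordform živega the most frequent tag in this entry is Sse2, with three occurrences.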

TABLE 15: An Example of Tagged Text (Plato's Republic, Book I)

3. Po resnici in odkrito, pri Zevsu, ti bom povedal svoje
Š E5 Sže5 Vpr A E5 IVme5 ZObe3 GFPae GLme ZSVPse4
mnenje, dragi Sokrat. Pogosto se sestajamo možje istih let in
Sse4 Pme1 IOme1 A Gmp Gap Smp1 ZIsp2 Ssp2 Vpr
potrjujemo pravilnost starega pregovora. 4. Ko se pogovarjamo,
Gap Sže4 Pme2 Sme2 Š Vpo Gmp Gap
skoraj vsi tarnajo in se z otožnostjo spominjajo mladostnih
A ZTmp1 Gcp Vpr Gmp E6 Sže6 Gcp Pžp2
radosti, ljubezni, pitja in gostij ter vsega drugega, kar je s
Sžp2 Sžp2 Sse2 Vpr Sžp2 Vpr ZTse2 ZDse2 ZVR GOce E6
tem v zvezi; pri tem so nejevoljni, kakor da bi bili oropani
ZKse6 E5 Sže5 E5 ZKse5 GPcp Pmp1 Vpo Vpo GBI GLBmp GNmp1
velikih stvari in bi bili nekoč imenitno živeli, zdaj pa le še
Pžp2 Sžp2 Vpr GBI GLBmp A A GLmp A Vpr Č Č
životarili. Nekateri se tudi pritožujejo, da svojci z njimi -
GLmp ZNEmp1 Gmp Č Gcp Vpo Smp1 E6 ZOcmp6
ker so stari - grdo ravnajo, in pri tem ubirajo žalostinke o
Vpo GPcp Pmp1 A Gcp Vpr E5 ZKse5 Gcp Sžp4 E5
nadlogah, ki jih je kriva starost. Meni se zdi, Sokrat, da ti
Sžp5 ZVR ZOcžp2 GPce Pže1 Sže1 ZOae3 Gmp Gce IOme1 Vpo ZKmp1
ne obtožujejo pravega krivca, kajti ko bi tega bila kriva
ČZ Gcp Pme2 Sme2 Vpr Vpo GBI ZKse2 GLBže Pže1
starost, bi imeli jaz in vsi drugi moji starostni vrstniki iste
Sže1 GBI GLmp ZOae1 Vpr ZTmp1 ZDmp1 ZSVaemp1 Pmp1 Smp1 ZIžp4
težave. Naletel pa sem že na mnoge, ki se ne počutijo slabo,
Sžp4 GLme Č GPae Č E4 Pmp4 ZVR Gmp ČZ Gcp A
tako se je nekoč namerilo, da sem bil ravno pri pesniku
ZK Gmp GPce A GLse Vpo GOae GLBme A E5 Sme5
Sofoklu, ko ga je nekdo vprašal: "Kako je pri tebi, Sofokles, z
IOme5 Vpo ZOcme4 GPce ZNEme1 GLme ZV GOce E5 ZObe5 IOme1 E6
ljubeznijo? Ali še lahko občuješ z žensko?" - "Molči, človek!"
Sže6 Č Č A Gbe E6 Sže6 GVbe Sme1
je odvrnil pesnik. "Vesel sem, da je to za mano. Tako mi je,
GPce GLme Sme1 Pme1 GPae Vpo GOce ZKse1 E6 ZOae6 ZKse1 ZOae3 GPce
kakor da bi pobegnil divjemu, pobesnelemu gospodarju." Te
Vpo Vpo GBI GLme Pme3 PLme3 Sme3 ZKžp4
besede so mi že takrat ugajale in nič manj mi ne ugajajo danes.
Sžp4 GPcp ZOae3 Č A GLžp Vpr ZNI Aj ZOae3 ČZ Gcp A
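
In the sample above every text line is followed by a line holding its tags, one per word, while punctuation carries no tag. A minimal Python sketch of how such a pair of lines might be joined into (word, tag) pairs is given below. The function name is hypothetical, and the rule that a token without any letter or digit (a lone dash, for instance) has no tag is an assumption read off the sample.

```python
def pair_words_with_tags(text_line, tag_line):
    """Pair the words of a text line with the tags on the line below it."""
    # Drop surrounding punctuation, then skip tokens with no letter or digit.
    words = [tok.strip('.,;:!?"') for tok in text_line.split()]
    words = [w for w in words if any(ch.isalnum() for ch in w)]
    tags = tag_line.split()
    if len(words) != len(tags):
        raise ValueError("number of words and tags differs")
    return list(zip(words, tags))


pairs = pair_words_with_tags(
    'je odvrnil pesnik. "Vesel sem, da je to za mano. Tako mi je,',
    "GPce GLme Sme1 Pme1 GPae Vpo GOce ZKse1 E6 ZOae6 ZKse1 ZOae3 GPce",
)
for word, tag in pairs[:4]:
    print(word, tag)
```

The length check guards against lines whose words and tags have drifted apart, which would otherwise silently shift every following tag.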