P. Jakopin, A. Bizjak: Part-of-speech tagging of Slovenian text

Primož Jakopin, Aleksandra Bizjak:

Part-of-speech tagging of Slovenian text

Abstract and extended summary of the article O strojno podprtem oblikoslovnem označevanju slovenskega besedila, published in Slavistična revija, Vol. 45/3-4, 1997, p. [513]-532.

Abstract

       The first POS tagger for texts in the Slovenian language is presented. It comprises the complete environment: the supporting software as well as the tagset, which is based on Slovenian grammar. Because the language is highly inflected, the tagset consists of 4,797 tags. A description of the tagger follows; it includes a two-step disambiguator. The first step is based on a database of previously processed sentences, which is searched for an unambiguously tagged immediate neighbourhood of the observed word. It is followed by a probabilistic tagger, in which frequencies of tag n-tuples up to level 5 are taken into consideration.
       So far 330,000 words have been tagged: four novels and a one-month sample of the leading Slovenian newspaper Delo, a selection from which is available in electronic form on the Internet.

Summary

       Part-of-speech tagging has in the past year also spread into the domain of the so-called Central & Eastern European languages. It is the first step in text parsing and a prerequisite for further quantitative linguistic analysis and for tasks such as machine translation or the setting up of national text corpora accessible over the Internet, for instance the BNC (British National Corpus) and the CNC (Czech National Corpus).
       As part of the preparations for founding the Slovenian National Corpus, the Slovenian POS tagger project was set in motion in September 1996 by the authors of this article: a lingware specialist from the Faculty of Arts in Ljubljana and a linguist from the Fran Ramovš Institute of the Slovenian Language at the Scientific Research Centre of the Slovenian Academy of Sciences and Arts.
       The tagger and the tagset were developed using the novel Pomladni dan by the 20th-century Slovenian writer Ciril Kosmač as a starting sample. Besides Slovenian grammar, some other sources were considered while assembling the tagset: the tagsets of the Brown corpus and the Penn Treebank corpus, and the tagset used in the MULTEXT/East project. The main criteria were the legibility of the tags and the minimal size required for disambiguation. The tags had to be short, derived from the Slovenian wording of linguistic terms, and self-explanatory to such an extent that they would be not only machine-readable but also acceptable to a human reader. The following sentence illustrates the point:


Seveda	se	lahko	motijo.	("Certainly they may be wrong.")
Č	Gmp	A	Gcp

The tag Č stands for particle, Gmp for separate verbal morpheme, A for adverb and Gcp for verb, third person, plural.
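The layout used above, a line of words followed by a line of tags in tab-separated columns, is easy to read mechanically. A minimal sketch in Python, assuming exactly one tag per word; the function name `pair_tokens` is illustrative and not part of the authors' software:

```python
def pair_tokens(word_line: str, tag_line: str) -> list[tuple[str, str]]:
    """Pair each tab-separated word with the tag in the same column."""
    words = word_line.split("\t")
    tags = tag_line.split("\t")
    if len(words) != len(tags):
        # A mismatch means a column was lost; refuse to guess the alignment.
        raise ValueError(f"{len(words)} words but {len(tags)} tags")
    return list(zip(words, tags))


pairs = pair_tokens("Seveda\tse\tlahko\tmotijo.", "Č\tGmp\tA\tGcp")
for word, tag in pairs:
    print(f"{word}/{tag}")
```

Raising an error on a length mismatch is a deliberate choice in this sketch: a silently shifted column would attach every later tag to the wrong word.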
The tags for verbs are given in Table 1.

Table 1: VERB


Part of speech	Type	Person	Gender	Number	Case	Example
main verb	G	{a,b,c}		{e,d,p}		Gce	(plava)
auxiliary verb to be in present	GP	{a,b,c}		{e,d,p}		GPce	(je)
auxiliary verb to be in future	GFP	{a,b,c}		{e,d,p}		GFPcp	(bodo)
negative form of the aux. verb to be	GZP	{a,b,c}		{e,d,p}		GZPae	(nisem)
verb to be	GO	{a,b,c}		{e,d,p}		GOae	(sem)
negative form of the verb to be	GZO	{a,b,c}		{e,d,p}		GZOce	(ni)
negative form of the verb to have	GZ	{a,b,c}		{e,d,p}		GZbe	(nimaš)
imperative	GV	{a,b,c}		{e,d,p}		GVbe	(glej)
participle ending in -l	GL		{m,ž,s}	{e,d,p}		GLže	(obrisala)
participle of the verb to be	GLB		{m,ž,s}	{e,d,p}		GLBme	(bil)
participle ending in -n/-t	GN/GT		{m,ž,s}	{e,d,p}	{1,2}	GNme1	(rojen)
participle ending in -č/-ši	GČ/GŠI					GČ	(loveč)
infinitive	GNE					GNE	(povedati)
supine	GNA					GNA	(gledat)
conditional	GBI					GBI	(bi)
separate verbal morpheme	Gmp					Gmp	(se)

The tagger, or more precisely the software supporting the tagging process, evolved from a simple aid for manual tagging into a two-step disambiguator by 1997. From the very beginning it has been part of the text editor EVA and is therefore fully interactive: it is possible to tag a single word or to proceed from a selected place in the text. The lexicon of words, their tags and tag frequencies is updated on the fly in a separate file, which can also be edited and adjusted when required. This file acts as a database from which various statistical tables are produced; these give the feedback needed for fine-tuning the tagger and the tagset.
The disambiguator has two steps. The first is based on the history of tagged text: if a wordform in the lexicon has one and only one tag, that tag is taken as the right one; if not, the neighbourhood of the word is examined. If a neighbourhood 2 to 5 words long (including the word in question) has a match in the history database with one and only one set of tags, the word is given its tag from this set; otherwise it is left untagged. The second step is probabilistic: it considers the frequencies of all possible sets of tags for the immediate neighbourhood, this time without the words (again 2 to 5 words deep). If there is one and only one such set for any of the possible neighbourhoods, the corresponding tag is given to the observed word; otherwise the word is again left untagged. Table 2 shows a sentence (by Ciril Kosmač) after the two phases of the disambiguator, as displayed on a computer screen.

Table 2: Disambiguated sentence

An attempt to translate the sentence might produce the following: The spring has passed, the summer has passed, and the fall has come, the golden time of all golden, time of murmuring winds and ripe scents, time of huge clouds and unattainable horizons, time of sweet and gloomy unrest.
In the sentence of Table 2, 25 of the 31 words are tagged, with 1 error (24/31 = 77% hit rate).
So far 330,000 words have been tagged, manually completed and verified: four novels and a one-month sample of the leading Slovenian newspaper Delo, a selection from which (about 5% of the full text) is available in electronic form on the Internet. Table 3 lists the samples in chronological order. Some further data on the first three samples (the distribution of part-of-speech groups, for instance) are given in the paper.
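The arithmetic behind the 77% figure can be spelled out in a few lines of Python. The hit rate is computed as in the text (correct tags over all words); the labels "coverage" and "precision" are added here for illustration and are not terms used in the article.

```python
tagged, errors, total = 25, 1, 31  # figures quoted for the sentence in Table 2

coverage = tagged / total                # share of words that received a tag
precision = (tagged - errors) / tagged   # share of assigned tags that are correct
hit_rate = (tagged - errors) / total     # correct tags over all words

print(f"coverage {coverage:.0%}, precision {precision:.0%}, hit rate {hit_rate:.0%}")
# coverage 81%, precision 96%, hit rate 77%
```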

Table 3: The samples


	Sample name	Size	Sentences	Words
1.	Ciril Kosmač: Pomladni dan	176 pp.	5,922	61,565
2.	Platon: Država	317 pp.	7,323	93,430
3.	Newspaper Delo (Internet)	43 days	2,956	53,895
4.	Ciril Kosmač: Prazna ptičnica	139 pp.	2,079	26,656
5.	George Orwell: 1984	229 pp.	6,684	90,760
	Total		24,964	326,306

In the field of POS tagging of texts in the Slovenian language, the first steps have been made. A proven tagset, a disambiguator with a hit rate of 80%, and the supporting software, both incorporated into the text editor EVA for effective and comfortable use, have been established. A database of tagged samples (330,000 words) is now complemented by a lexicon of 3,300,000 wordforms with POS tags, based on 93,000 lemmas from the Dictionary of the Slovenian Literary Language.
The wish list for the future includes a bigger database (1 million tagged words) and a better disambiguator, with a hit rate of 90% or more.

Appendix: Remaining tables from the article:

TABLE 5: NOUNS


	Type	Gender	Number	Case	Example
common noun	S	{m,ž,s}	{e,d,p}	{1,2,3,4,5,6}	Sme1	(dan)
gerund	SG	{m,ž,s}	{e,d,p}	{1,2,3,4,5,6}	SGse2	(spoznanja)
proper noun of persons	IO	{m,ž,s}	{e,d,p}	{1,2,3,4,5,6}	IOme1	(Martin)
proper noun of inhabitants	IP	{m,ž,s}	{e,d,p}	{1,2,3,4,5,6}	IPme2	(Čeha)
divine names	IV	{m,ž,s}	{e,d,p}	{1,2,3,4,5,6}	IVme3	(Bogu)
names of animals	IŽ	{m,ž,s}	{e,d,p}	{1,2,3,4,5,6}	IŽže3	(Liski)
proper noun of places	IZ	{m,ž,s}	{e,d,p}	{1,2,3,4,5,6}	IZme2	(Črnomlja)
proper noun of mythological places	IM	{m,ž,s}	{e,d,p}	{1,2,3,4,5,6}	IMme5	(Hadu)
other names (institutions, books, etc.)	IS	{m,ž,s}	{e,d,p}	{1,2,3,4,5,6}	ISže2	(Iliade)
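Because every noun tag is a type prefix followed by one character each for gender, number and case, a tag can be decoded by simple string slicing. The Python sketch below is illustrative only: the function name is hypothetical, and the English glosses for the gender and number letters and for the case digits are assumptions based on standard Slovenian grammatical terminology, not definitions taken from the article.

```python
NOUN_TYPES = {
    "S": "common noun", "SG": "gerund", "IO": "proper noun of persons",
    "IP": "proper noun of inhabitants", "IV": "divine name",
    "IŽ": "name of an animal", "IZ": "proper noun of places",
    "IM": "proper noun of mythological places", "IS": "other name",
}
# Assumed glosses (standard Slovenian grammar terms and case order).
GENDER = {"m": "masculine", "ž": "feminine", "s": "neuter"}
NUMBER = {"e": "singular", "d": "dual", "p": "plural"}
CASE = {"1": "nominative", "2": "genitive", "3": "dative",
        "4": "accusative", "5": "locative", "6": "instrumental"}


def decode_noun_tag(tag):
    """Return the features encoded in a Table 5 tag, or None if it does not fit."""
    for prefix in sorted(NOUN_TYPES, key=len, reverse=True):  # try "SG" before "S"
        rest = tag[len(prefix):]
        if tag.startswith(prefix) and len(rest) == 3:
            gender, number, case = rest
            if gender in GENDER and number in NUMBER and case in CASE:
                return {"type": NOUN_TYPES[prefix], "gender": GENDER[gender],
                        "number": NUMBER[number], "case": CASE[case]}
    return None


print(decode_noun_tag("Sme1"))   # the tag given for "dan"
print(decode_noun_tag("IŽže3"))  # the tag given for "Liski"
```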

TABLE 7: ADJECTIVES


	Type	Gender	Number	Case	Degree	Definiteness	Example
adjective	P	{m,ž,s}	{e,d,p}	{1,2,3,4,5,6}	{/,j,jj}	{/,i}	Pme1i	(pomladni)
participle ending in -l	PL	{m,ž,s}	{e,d,p}	{1,2,3,4,5,6}		{/,i}	PLmp4	(uspele)
participle ending in -n/-t	PN/PT	{m,ž,s}	{e,d,p}	{1,2,3,4,5,6}		{/,i}	PNme4i	(zgrešeni)
participle ending in -č/-ši	PČ/PŠI	{m,ž,s}	{e,d,p}	{1,2,3,4,5,6}		{/,i}	PČže2	(cvetoče)
predicative adjective	PD	{m,ž,s}	{e,d,p}		{/,j,jj}		PDme	(rad)
possessive adjectives from proper names of persons	PIO	{m,ž,s}	{e,d,p}	{1,2,3,4,5,6}			PIOžp4	(Andrejeve)
poss. adj. from proper names of inhabitants	PIP	{m,ž,s}	{e,d,p}	{1,2,3,4,5,6}			PIPme1	(Brikin)
poss. adj. from divine names	PIV	{m,ž,s}	{e,d,p}	{1,2,3,4,5,6}			PIVse4	(Kronovo)
poss. adj. from proper names of places	PIZ	{m,ž,s}	{e,d,p}	{1,2,3,4,5,6}			PIZže5	(Krimski)
poss. adj. from proper names of mythological places	PIM	{m,ž,s}	{e,d,p}	{1,2,3,4,5,6}			PIMme5	(Hadovem)
poss. adj. from other names	PIS	{m,ž,s}	{e,d,p}	{1,2,3,4,5,6}			PISže5	(Mohorjevi)
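The feature columns also show why the tagset runs into the thousands: the plain adjective row alone spans every combination of gender, number, case, degree and definiteness. The Python sketch below enumerates those combinations as an upper bound. It assumes that "/" means "no marker" and that features are concatenated in column order (consistent with the example Pme1i); whether every combination actually occurs in the authors' tagset is not stated in the article.

```python
from itertools import product

GENDER = ["m", "ž", "s"]
NUMBER = ["e", "d", "p"]
CASE = ["1", "2", "3", "4", "5", "6"]
DEGREE = ["", "j", "jj"]     # "/" in the table is taken to mean no marker
DEFINITENESS = ["", "i"]     # likewise

# Every combination for the plain adjective type "P".
p_tags = {"P" + "".join(parts)
          for parts in product(GENDER, NUMBER, CASE, DEGREE, DEFINITENESS)}

print(len(p_tags))        # 324 = 3 * 3 * 6 * 3 * 2
print("Pme1i" in p_tags)  # True
```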

TABLE 8: PRONOUNS


	Type	Person	Gender	Number	Gender	Number	Case	Example
personal pronoun	ZO	{a,b,c}	{m,ž,s,/}	{e,d,p}			{1,2,3,4,5,6}	ZOcme5	(njem)
personal reflexive pro.	ZOP						{2,3,4,5,6}	ZOP2	(sebe)
possessive pronoun	ZSV	{a,b,c}	{m,ž,s,/}	{e,d,p}	{m,ž,s}	{e,d,p}	{1,2,3,4,5,6}	ZSVaeme2	(mojega)
possessive reflexive pro.	ZSVP				{m,ž,s}	{e,d,p}	{2,3,4,5,6}	ZSVPme6	(svojim)
interrogative pro.	ZV				{m,ž,s}	{e,d,p}	{1,2,3,4,5,6}	ZVse1	(kaj)
relative pronoun	ZR				{m,ž,s}	{e,d,p}	{1,2,3,4,5,6}	ZRme2	(kakršnega)
negative pronoun	ZNI				{m,ž,s}	{e,d,p}	{1,2,3,4,5,6}	ZRme2	(kakršnega)
indefinite pronoun	ZPO				{m,ž,s}	{e,d,p}	{1,2,3,4,5,6}	ZPOme1	(kdo)
relative indefinite pro.	ZRPO				{m,ž,s}	{e,d,p}	{1,2,3,4,5,6}	ZRPOme6	(komerkoli)
definite pronoun	ZNE				{m,ž,s}	{e,d,p}	{1,2,3,4,5,6}	ZNEse1	(nekaj)
demonstrative pro.	ZD				{m,ž,s}	{e,d,p}	{1,2,3,4,5,6}	ZDme5	(drugem)
general pronoun	ZT				{m,ž,s}	{e,d,p}	{1,2,3,4,5,6}	ZTse1	(vse)
identity pronoun	ZI				{m,ž,s}	{e,d,p}	{1,2,3,4,5,6}	ZIme4	(isti)
multitude pronoun	ZM				{m,ž,s}	{e,d,p}	{1,2,3,4,5,6}	ZMse3	(marsičemu)
demonstrative pronoun	ZK				{m,ž,s}	{e,d,p}	{1,2,3,4,5,6}	ZKmp1	(ti)
emphasized pronoun	ZPU				{m,ž,s}	{e,d,p}	{1,2,3,4,5,6}	ZPUže1	(sama)
conjunctional pronoun	ZVR							ZVR	(ki)

TABLE 9: NUMERALS


	Type	Gender	Number	Case	Example
cardinal numeral	ŠG	{m,ž,s}	{e,d,p}	{1,2,3,4,5,6}	ŠGže3	(petintridesetim)
ordinal numeral	ŠV	{m,ž,s}	{e,d,p}	{1,2,3,4,5,6}	ŠVme2	(prvega)
separating numeral	ŠL	{m,ž,s}	{e,d,p}	{1,2,3,4,5,6}	ŠLse6	(dvojim)
multiple numeral	ŠM	{m,ž,s}	{e,d,p}	{1,2,3,4,5,6}	ŠMme5	(trojnih)
quantifiers	ŠNE				ŠNE	(nekaj)
number	Š				Š	(77.500,00.-)

TABLE 10: OTHER MORPHOLOGICAL CATEGORIES


	Type	Relationship	Degree	Case	Example
adverb	A		{/,j,jj}		A	(resnično)

particle	Č				Č	(kar)
negative particle	ČZ				ČZ	(ne)
conjunctional particle	ČV				ČV	(ali)

preposition	E			{2,3,4,5,6}	E2	(iz)
conjunction	V	{pr,po}			Vpr	(in)
interjection	M				M	(oh)

TABLE 11: ABBREVIATIONS


	Type	Gender	Number	Case	Example
lower case abbreviation	K				K	(št.)
upper case abbreviation	KI	{/,m,ž,s}	{/,e,d,p}	{/,1,2,3,4,5,6}	KI	(ŠTUNFF)
www	KURL				KURL	(http://www.delo.si)

TABLE 12: POS TAGS BY MORPHOLOGICAL CATEGORIES


	Pomladni dan	Država	Delo	Total
common nouns	10,173	17,488	14,078	41,739
proper nouns	1,206	668	4,269	6,143
verbs	20,269	21,498	9,297	51,064
adjectives	4,306	7,114	6,008	17,428
pronouns	7,065	14,899	2,736	24,700
numerals	399	1,363	3,188	4,950
adverbs	3,625	5,254	1,908	10,787
particles	3,590	5,536	1,557	10,683
prepositions	5,269	8,215	6,038	19,522
conjunctions	5,377	11,368	3,398	20,143
interjections	274	18	-	292
abbreviations	12	9	1,418	1,439

TABLE 13: POS TAGS BY MORPHOLOGICAL CATEGORIES (%)


	Pomladni dan	Država	Delo	Total
common nouns	16.52	18.72	26.12	19.98
proper nouns	1.96	0.71	7.92	2.94
verbs	32.92	23.02	17.25	24.46
adjectives	6.99	7.61	11.15	8.34
pronouns	11.48	15.95	5.08	11.82
numerals	0.65	1.46	5.92	2.37
adverbs	5.89	5.62	3.54	5.16
particles	5.83	5.93	2.89	5.11
prepositions	8.56	8.79	11.20	9.35
conjunctions	8.73	12.17	6.30	9.64
interjections	0.45	-	-	0.14
abbreviations	-	-	2.63	0.69
Total				100.00
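The percentages in Table 13 follow from the counts in Table 12 by dividing each cell by its column total. A short Python check, with the counts copied from Table 12 and the dash for interjections in Delo read as zero (an assumption); the names `COUNTS` and `shares` are illustrative:

```python
COUNTS = {  # Table 12: category -> (Pomladni dan, Država, Delo)
    "common nouns": (10173, 17488, 14078),
    "proper nouns": (1206, 668, 4269),
    "verbs": (20269, 21498, 9297),
    "adjectives": (4306, 7114, 6008),
    "pronouns": (7065, 14899, 2736),
    "numerals": (399, 1363, 3188),
    "adverbs": (3625, 5254, 1908),
    "particles": (3590, 5536, 1557),
    "prepositions": (5269, 8215, 6038),
    "conjunctions": (5377, 11368, 3398),
    "interjections": (274, 18, 0),  # "-" in the table read as 0
    "abbreviations": (12, 9, 1418),
}

column_totals = [sum(row[i] for row in COUNTS.values()) for i in range(3)]
grand_total = sum(column_totals)


def shares(category):
    """Percentages per sample, followed by the share over all three samples."""
    row = COUNTS[category]
    per_sample = [round(100 * c / t, 2) for c, t in zip(row, column_totals)]
    return per_sample + [round(100 * sum(row) / grand_total, 2)]


print(column_totals)           # [61565, 93430, 53895]
print(shares("common nouns"))  # [16.52, 18.72, 26.12, 19.98]
```

The column totals also agree with the word counts given for the first three samples in Table 3.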

TABLE 14: THE WORDS BEGINNING WITH žive- IN THE SET OF WORDS AND TAGS


žive ~ Gcp,1;Pže2,5;Pžp1,4;Pžp4,2;Smp4,2
živega ~ Pme2,2;Pse2,1;Sse2,3
živel ~ GLme,13
živela ~ GLže,12
živele ~ GLžp,2
živeli ~ GLmp,4
živem ~ Pme5,1;Pse5,2
živemu ~ Pme3,1;Sse3,1
živeti ~ GNE,18
živeče ~ PČmp4,1
živečih ~ PČmp2,1;Smp2,1
živečimi ~ PČsp6,2
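Each line of this listing can be split into a wordform and its tags with a few string operations. The Python sketch below assumes that the number after each comma is the frequency with which the tag was assigned to the wordform, which matches the description of the lexicon as holding words, tags and tag frequencies; the function name is hypothetical.

```python
def parse_lexicon_line(line):
    """Split 'wordform ~ tag,freq;tag,freq' into (wordform, {tag: freq})."""
    wordform, _, entries = line.partition(" ~ ")
    tags = {}
    for entry in entries.split(";"):
        tag, _, freq = entry.rpartition(",")
        tags[tag] = int(freq)
    return wordform, tags


word, tags = parse_lexicon_line("žive ~ Gcp,1;Pže2,5;Pžp1,4;Pžp4,2;Smp4,2")
print(word, tags)
print(max(tags, key=tags.get))  # most frequent tag for this wordform: Pže2
```

A wordform with a single entry, such as živeti, is exactly the case in which the first disambiguation step can assign the tag without looking at the neighbourhood.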

TABLE 15: An Example of Tagged Text (Plato's Republic, Book I)

	3.	Po	resnici	in	odkrito,	pri	Zevsu,	ti	bom	povedal	svoje
	Š	E5	Sže5	Vpr	A	E5	IVme5	ZObe3	GFPae	GLme	ZSVPse4

mnenje,	dragi	Sokrat.	Pogosto	se	sestajamo	možje	istih	let	in
Sse4	Pme1	IOme1	A	Gmp	Gap	Smp1	ZIsp2	Ssp2	Vpr

potrjujemo	pravilnost	starega	pregovora.	4.	Ko	se	pogovarjamo,
Gap	Sže4	Pme2	Sme2	Š	Vpo	Gmp	Gap

skoraj	vsi	tarnajo	in	se	z	otožnostjo	spominjajo	mladostnih
A	ZTmp1	Gcp	Vpr	Gmp	E6	Sže6	Gcp	Pžp2

radosti,	ljubezni,	pitja	in	gostij	ter	vsega	drugega,	kar	je	s
Sžp2	Sžp2	Sse2	Vpr	Sžp2	Vpr	ZTse2	ZDse2	ZVR	GOce	E6

tem	v	zvezi;	pri	tem	so	nejevoljni,	kakor	da	bi	bili	oropani
ZKse6	E5	Sže5	E5	ZKse5	GPcp	Pmp1	Vpo	Vpo	GBI	GLBmp	GNmp1

velikih	stvari	in	bi	bili	nekoč	imenitno	živeli,	zdaj	pa	le	še
Pžp2	Sžp2	Vpr	GBI	GLBmp	A	A	GLmp	A	Vpr	Č	Č

životarili.	Nekateri	se	tudi	pritožujejo,	da	svojci	z	njimi -
GLmp	ZNEmp1	Gmp	Č	Gcp	Vpo	Smp1	E6	ZOcmp6

ker	so	stari -	grdo	ravnajo,	in	pri	tem	ubirajo	žalostinke	o
Vpo	GPcp	Pmp1	A	Gcp	Vpr	E5	ZKse5	Gcp	Sžp4	E5

nadlogah,	ki	jih	je	kriva	starost.	Meni	se	zdi,	Sokrat,	da	ti
Sžp5	ZVR	ZOcžp2	GPce	Pže1	Sže1	ZOae3	Gmp	Gce	IOme1	Vpo	ZKmp1

ne	obtožujejo	pravega	krivca,	kajti	ko	bi	tega	bila	kriva
ČZ	Gcp	Pme2	Sme2	Vpr	Vpo	GBI	ZKse2	GLBže	Pže1

starost,	bi	imeli	jaz	in	vsi	drugi	moji	starostni	vrstniki	iste
Sže1	GBI	GLmp	ZOae1	Vpr	ZTmp1	ZDmp1	ZSVaemp1	Pmp1	Smp1	ZIžp4

težave.	Naletel	pa	sem	že	na	mnoge,	ki	se	ne	počutijo	slabo,
Sžp4	GLme	Č	GPae	Č	E4	Pmp4	ZVR	Gmp	ČZ	Gcp	A

tako	se	je	nekoč	namerilo,	da	sem	bil	ravno	pri	pesniku
ZK	Gmp	GPce	A	GLse	Vpo	GOae	GLBme	A	E5	Sme5

Sofoklu,	ko	ga	je	nekdo	vprašal:	"Kako	je	pri	tebi,	Sofokles,	z
IOme5	Vpo	ZOcme4	GPce	ZNEme1	GLme	ZV	GOce	E5	ZObe5	IOme1	E6

ljubeznijo?	Ali	še	lahko	občuješ	z	žensko?" -	"Molči,	človek!"
Sže6	Č	Č	A	Gbe	E6	Sže6	GVbe	Sme1

[...]	odvrnil	pesnik. "	Vesel	sem,	[...]	[...]	[...]	mano.	Tako	[...]	je,
GPce	GLme	Sme1	Pme1	GPae	Vpo	GOce	ZKse1	ZOae6	ZKse1	ZOae3	GPce

kakor	da	bi	pobegnil	divjemu,	pobesnelemu	gospodarju. "	Te
Vpo	Vpo	GBI	GLme	Pme3	PLme3	Sme3	ZKžp4

besede	[...]	[...]	že	takrat	ugajale	[...]	nič	manj	[...]	[...]	ugajajo	danes.
Sžp4	GPcp	ZOae3	[...]	[...]	GLžp	Vpr	ZNI	[...]	ZOae3	ČZ	Gcp	[...]