nova beseda
The Corpus
Nova beseda is a collection of texts intended primarily to meet the
lexicographic needs of the Institute and, through free web access, to
serve other research and educational purposes in the field of
Slovenian. It is another step towards a Slovenian National Corpus, an
ideal collection of electronic texts for the research and educational
community at large, which would comprise all newer Slovenian texts as
well as all older ones preserved to this day. Such a project has been
technically feasible for quite some time (see
The feasibility of a complete text corpus).
The current corpus began in 1999 with the web presentation of an
electronic collection of Slovenian fiction, 3 million words in all,
gathered and processed during the preparation of the doctoral thesis
Upper Bound of Entropy in Slovenian Literary Texts, in the field
of information theory. Because the printed thesis was limited to 200
pages, the thesis material had to be presented on the web, and a
concordancer had to be added to make it useful for linguistic research
and lexicographic purposes at the Institute, where the majority of the
texts had been processed. In spring 2000 the corpus was enlarged to
28 million words through the addition of texts from the DELO daily
newspaper, 1998 - 2000; at the same time the search engine was adapted
to suit the larger collection. The first text cleanup was completed,
which paved the way for the use of a wordform dictionary. Without such
a cleanup, a cluster of errors appears around every more frequent word
and its inflected forms. Such clusters blur the dictionary considerably
and make it much harder to use (see chapter 3.3 Error problem of the
thesis; alas, in Slovenian).
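For readers unfamiliar with the term, a concordancer lists every occurrence of a word together with its surrounding context. The following is a minimal sketch in Python, not the search engine used by Nova beseda; the function name kwic, the width parameter and the sample sentences are assumptions made for illustration only.

```python
import re

def kwic(text, keyword, width=30):
    """Return one keyword-in-context line per occurrence of keyword."""
    lines = []
    # Match the keyword as a whole word, ignoring case.
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    for m in pattern.finditer(text):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        # Right-align the left context so the keywords line up in a column.
        lines.append(f"{left:>{width}} [{m.group(0)}] {right}")
    return lines

# Invented sample sentences, not taken from the corpus.
sample = "Nova beseda je zbirka besedil. Vsaka beseda ima več oblik."
for line in kwic(sample, "beseda", width=15):
    print(line)
```

Because the match is on the exact word, the inflected form "besedil" is not found; finding all forms of a word takes something like the wordform dictionary mentioned above.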
In May 2000 the corpus was transferred from the server of the
Faculty of Arts (University of Ljubljana) to the server of the
Institute and was given a new name, friendlier to local users. Instead
of CORTES (an acronym of CORpus of TExts in Slovenian) it is now called
Nova beseda ("New word" in Slovenian). In summer 2000 the corpus grew
to 48 million words, mainly through the addition of new DELO newspaper
texts.
In the years that followed, the size and diversity of the corpus
gradually increased, reaching 162 million words and 4,158 texts in late
spring 2005. The corpus consists of 6 main parts:
part D - Delo Slovenian daily, 1998 - 2005, 2,310 texts, 120 million words;
part G - Slovenian National Assembly session transcripts, 1996 - 2004,
711 texts of formal speech, 20 million words;
part A - fiction in Slovenian, 778 texts, including the complete works of
the prominent writers Drago Jančar, Ciril Kosmač and Ivan Cankar, 12 million words;
part P - Monitor computer magazine 1999 - 2004 and Viva healthy living
magazine, 78 texts, 6 million words;
part B - non-fiction in Slovenian, 251 texts, 2 million words;
part C - scientific and technical publications, 26 texts, 2 million words.
All the texts are in relatively good condition: they have been marked up
to the sentence level, and a reasonable attempt has been made to correct
typographical and other errors. Most texts come from the last decade.
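As a quick consistency check, the part sizes listed above can be added up. The sketch below uses only the figures given in the text; the variable names are illustrative. The word counts, rounded to millions, sum to the stated 162 million, while the listed text counts sum to 4,154, four fewer than the stated total of 4,158; the source does not explain the difference.

```python
# Part sizes as listed in the text (late spring 2005).
# Word counts are in millions, rounded as in the source.
parts = {
    "D": {"texts": 2310, "mil_words": 120},  # Delo daily
    "G": {"texts": 711, "mil_words": 20},    # National Assembly transcripts
    "A": {"texts": 778, "mil_words": 12},    # fiction
    "P": {"texts": 78, "mil_words": 6},      # Monitor and Viva magazines
    "B": {"texts": 251, "mil_words": 2},     # non-fiction
    "C": {"texts": 26, "mil_words": 2},      # scientific and technical
}

total_words = sum(p["mil_words"] for p in parts.values())
total_texts = sum(p["texts"] for p in parts.values())

print(f"total: {total_words} million words in {total_texts} listed texts")

# Share of each part in the whole corpus, by word count.
for name, p in parts.items():
    share = 100 * p["mil_words"] / total_words
    print(f"part {name}: {share:.1f}% of words")
```

By word count the newspaper part D makes up roughly three quarters of the corpus, which is worth keeping in mind when reading frequency figures.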
Contributors
The corpus has been made possible by the
Delo newspaper company, the Slovenian National Assembly, the late Mr. Franko Luin,
Mr. Drago Jančar, Monitor computer magazine, Viva healthy living magazine,
Didakta publishing house, Mihelač publishing house, Mr. Samo Kuščer, Mrs.
Aleksandra Rekar and many others who have kindly contributed texts; many
thanks to all.
The following people took part in setting up the core corpus (A, 3 million words):
Varja Cvetko Orešnik, Aleksandra Bizjak, Lučka Uršič and Karmen Nemec
of the Fran Ramovš Institute of Slovenian Language (ISJ);
Miran Hladnik, Igor Grdina, Matjaž Rebolj and Marina
Zorman of the Faculty of Arts, University of Ljubljana; and
Zlatka Rabzelj of the Jože Mazovec public library in Ljubljana.
Valuable contributions were also made by the late Franc Jakopin,
Klaus Detlef Olof, Melita Ambrožič, Jure Dimec and Tomaž
Erjavec.
At particular stages of the preparation and cleanup of the remaining part
of the corpus, Helena Dobrovoljc, Aleksandra Bizjak,
Birte Loenneker and Lučka Uršič of ISJ took part, as did
Cvetka Bajec, Andreja Musar and Primož Murn and several other
students of the Faculty of Arts, especially in transferring the
works of Ivan Cankar into digital format.
Terms of Use
Most texts in the corpus remain the property of their copyright owners;
their use through the corpus concordancer and the word and multi-word unit
search engine is permitted only for research and educational purposes.
Software and Web Interface
The software used for the preparation of the texts and for their web
presentation is the local word processor EVA and its Internet version,
NEVA, respectively.
Page posted on May 2, 2000; date of last change: March 10, 2008.
URL: http://bos.zrc-sazu.si/a_about.html
Hosted by the Fran Ramovš Institute of Slovenian Language,
Scientific Research Centre of the Slovenian Academy of Sciences and Arts.