BESEDA - Text Corpus at ISJ ZRC SAZU

Fran Ramovš Institute of Slovenian Language ZRC SAZU
Corpus Laboratory

nova beseda

The Corpus

Nova beseda is a collection of texts intended primarily to meet the lexicographic needs of the Institute but, through free web access, also to serve other research and educational purposes in the field of Slovenian. It is another step on the way to a Slovenian National Corpus, an ideal collection of electronic texts for the research and educational community at large, which would consist of all newer Slovenian texts as well as all older ones preserved to this day. Such a project has been technically feasible for quite some time (see The feasibility of a complete text corpus).

The current corpus started in 1999 with the web presentation of an electronic collection of Slovenian fiction, 3 million words in all, gathered and processed during the preparation of the doctoral thesis Upper Bound of Entropy in Slovenian Literary Texts, in the field of information theory. Because the printed thesis was limited to 200 pages, the thesis material had to be presented on the web, and a concordancer was added to make it useful for linguistic research and lexicographic work at the Institute, where the majority of the texts had been processed.

In the spring of 2000 the corpus was enlarged to 28 million words through the addition of texts from the DELO daily newspaper, 1998 - 2000; the search engine was adapted at the same time to suit the larger collection. The first text cleanup was also completed, which paved the way for the use of a wordform dictionary. Without such a cleanup, a cluster of erroneous forms appears around every more frequent word and its inflected forms. Such clusters blur the dictionary to a large extent and make it much more difficult to use (see chapter 3.3, Error problem, of the thesis; unfortunately available only in Slovenian).

In May 2000 the corpus was moved from the server of the Faculty of Arts (University of Ljubljana) to the server of the Institute and was given a new name, friendlier to local users: instead of CORTES (an acronym of CORpus of TExts in Slovenian) it is now called Nova beseda (Slovenian for "New word"). In the summer of 2000 the corpus grew to 48 million words, mainly through the addition of new DELO newspaper texts.

In the years that followed, the size and diversity of the corpus were gradually increased, reaching 162 million words and 4,158 texts in the late spring of 2005. The corpus consists of 6 main parts:

Part D - Delo, Slovenian daily newspaper, 1998 - 2005: 2,310 texts, 120 million words
Part G - Slovenian National Assembly session transcripts (formal speech), 1996 - 2004: 711 texts, 20 million words
Part A - fiction in Slovenian, including the complete works of the prominent writers Drago Jančar, Ciril Kosmač and Ivan Cankar: 778 texts, 12 million words
Part P - Monitor computer magazine, 1999 - 2004, and Viva healthy living magazine: 78 texts, 6 million words
Part B - non-fiction in Slovenian: 251 texts, 2 million words
Part C - scientific and technical publications: 26 texts, 2 million words

All the texts are in relatively good condition: they have been marked up to the sentence level, and a reasonable attempt has been made to clean up typographical and other errors. Most texts come from the last decade.
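The composition figures above can be tallied with a short script. The following is only an illustrative sketch: the dictionary layout and the function names are assumptions made for this example, not part of the corpus software. The word counts add up to the stated 162 million; the per-part text counts as listed add up to 4,154, four fewer than the stated total of 4,158, a difference the description does not explain.

```python
# Corpus composition as listed in the description (late spring 2005).
# The data layout and all names below are hypothetical, chosen for illustration.
PARTS = {
    "D": {"label": "Delo daily newspaper", "texts": 2310, "mil_words": 120},
    "G": {"label": "National Assembly transcripts", "texts": 711, "mil_words": 20},
    "A": {"label": "fiction", "texts": 778, "mil_words": 12},
    "P": {"label": "Monitor and Viva magazines", "texts": 78, "mil_words": 6},
    "B": {"label": "non-fiction", "texts": 251, "mil_words": 2},
    "C": {"label": "scientific and technical publications", "texts": 26, "mil_words": 2},
}


def totals(parts):
    """Return (number of texts, millions of words) summed over all parts."""
    texts = sum(p["texts"] for p in parts.values())
    words = sum(p["mil_words"] for p in parts.values())
    return texts, words


def word_shares(parts):
    """Return each part's share of the total word count, in percent."""
    _, words = totals(parts)
    return {key: 100.0 * p["mil_words"] / words for key, p in parts.items()}


if __name__ == "__main__":
    texts, words = totals(PARTS)
    print(f"texts: {texts}, words: {words} million")
    for key, share in word_shares(PARTS).items():
        print(f"part {key} ({PARTS[key]['label']}): {share:.1f}% of words")
```

By word count, the newspaper part D makes up roughly three quarters of the corpus, which is worth keeping in mind when interpreting frequencies obtained from the whole collection.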

Contributors

The corpus has been made possible by the Delo newspaper company, the Slovenian National Assembly, the late Mr. Franko Luin, Mr. Drago Jančar, the Monitor computer magazine, the Viva healthy living magazine, the Didakta publishing house, the Mihelač publishing house, Mr. Samo Kuščer, Mrs. Aleksandra Rekar and many others who have kindly contributed texts; many thanks to all.

Varja Cvetko Orešnik, Aleksandra Bizjak, Lučka Uršič and Karmen Nemec of the Fran Ramovš Institute of Slovenian Language (ISJ) took part in setting up the core corpus (A, 3 million words), as did Miran Hladnik, Igor Grdina, Matjaž Rebolj and Marina Zorman from the Faculty of Arts, University of Ljubljana, and Zlatka Rabzelj from the Jože Mazovec public library in Ljubljana. Valuable contributions were also made by the late Franc Jakopin, Klaus Detlef Olof, Melita Ambrožič, Jure Dimec and Tomaž Erjavec.

Helena Dobrovoljc, Aleksandra Bizjak, Birte Loenneker and Lučka Uršič from ISJ took part in particular stages of the preparation and cleanup of the remaining part of the corpus, as did Cvetka Bajec, Andreja Musar and Primož Murn, as well as several other students of the Faculty of Arts, especially in transferring the works of Ivan Cankar into digital format.

Terms of Use

Most texts in the corpus are the property of their copyright owners; use through the corpus concordancer and the word and multi-word unit search engine is permitted only for research and educational purposes.

Software and Web Interface

The texts were prepared with the local word processor EVA and are presented on the web with its Internet version, NEVA.

Page posted on May 2, 2000; date of last change: March 10, 2008.

URL: http://bos.zrc-sazu.si/a_about.html

Hosted by the Fran Ramovš Institute of Slovenian Language, Scientific Research Centre of the Slovenian Academy of Sciences and Arts.