The LC-STAR Catalan lexicon was created within the scope of the LC-STAR project (IST 2001-32216) which was sponsored by the European Commission and the Spanish Government.
The lexicon comprises more than 100,000 entries. Entries were generated from: more than 45,000 common-word entries selected from a corpus of more than 37 million words distributed across appropriate domains; a set of more than 45,000 names, including person names, family names, cities, streets, companies and brand names; and a list of 5,000 Specific Application Words translated from English terms defined by the LC-STAR consortium.
Production was performed at the Technologies and Applications of Language and Speech Center (TALP) of the Universitat Politècnica de Catalunya (UPC) (Spain). The owner of the database is UPC.
For each entry group, i.e. for each wordform, we provide the lemma(s) it derives from and the POS(s) it belongs to. For each POS we mark, where applicable, its attributes (e.g. number, person, case, tense, mood, voice, degree, type).
The list of POS is:
NOM - Common and proper nouns
ADJ - Descriptive/qualificative adjective
DET - Determinative adjective or determiner
NUM - Numeral adjective or numerals
VER - Verb
AUX - Auxiliary verb
PRO - Pronoun
ART - Article
ADV - Adverbs and adverbial phrases
CON - Conjunctions and conjunctive phrases
ADP - Adpositions and prepositional phrases
INT - Interjection
PAR - Particles and clitics
PRE - Predicative
ONO - Onomatopoeia words
MEW - Measure words
AUW - Auxiliary words
IDI - Idiom
PUN - Punctuation marks
ABB - Abbreviations
COMPOUND TAGS
In order to generate the common word list, a corpus with six different domains was defined and further subcategorized into subdomains. The domains are: Sports/Games, News, Finance, Culture/Entertainment, Consumer Information, and Personal Communications. The size of the corpus was 20,204,086 words. The word list contained 54,627 words, with a self coverage of 98.52%.
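Self coverage here is the fraction of corpus tokens covered by the selected word list. The computation can be sketched as follows; the toy corpus and the top-frequency word list below are made up for illustration and do not come from the LC-STAR corpus:

```python
from collections import Counter

# Toy corpus of 9 tokens (made-up data, for illustration only).
corpus = "el el el gos gos corre gat dorm salta".split()

counts = Counter(corpus)
# Word list = the 2 most frequent wordforms in this toy corpus.
wordlist = {w for w, _ in counts.most_common(2)}

# Self coverage: tokens covered by the word list / total tokens.
covered = sum(c for w, c in counts.items() if w in wordlist)
coverage = covered / len(corpus)
print(f"{coverage:.2%}")  # 5 of 9 tokens covered
```

With a real corpus, the word list is grown until the coverage target (here, 98.52%) is reached.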
The Catalan proper names domain consists of 46,228 entries, divided into three subdomains: first and last names; place names (city names, geographic names, major capitals, important and well-known cities, important and well-known national cultural and historic places, national street names, countries); and organizations (profit and non-profit organisations, national and international companies, brand names).
The lexicon includes special application words. The special application word list contains 7,498 entries and was designed to contain words from common domains useful for all applications (numbers and digits, abbreviations, and the global domains: measures, abbreviations, special signs, domestic equipment, health, greetings), as well as specific vocabulary for applications controlled by voice (information retrieval, control of consumer devices, traveling, etc.).
An XML-based mark-up language was chosen to represent the linguistic information in a formal, unambiguous and easily readable manner, so that the information can be processed by as many parties as possible. Any XML version 1.0 compliant parser can be used to parse the lexica.
A formally specified grammar (Document Type Definition, or DTD), containing all the linguistic information described so far, allows the XML-based lexica to be validated automatically. The file LXCN.DTD contains the DTD implementing the linguistic information.
For each entry, the property “orthography”, defined as the correct way of writing a given wordform (or, when more than one spelling is acceptable, its most common spelling), is the key to all of the phonetic, morphological and grammatical information for that entry. This is coded via entry groups. Therefore, words that can be associated with, for example, multiple parts of speech (POS) nevertheless have a unique entry.
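Since any XML 1.0 parser can process the lexica, an entry group can be read with a standard library parser. The fragment below is a toy sketch of this structure; the element and attribute names are assumptions made for illustration, as the actual names are fixed by LXCN.DTD:

```python
import xml.etree.ElementTree as ET

# Hypothetical entry group: one orthography keying two POS readings.
# Tag/attribute names are illustrative, not taken from LXCN.DTD.
ENTRY_XML = (
    '<entrygroup orthography="casa">'
    '<entry><pos>NOM</pos><phonetic>" k a - z a</phonetic></entry>'
    '<entry><pos>VER</pos><phonetic>" k a - z a</phonetic></entry>'
    '</entrygroup>'
)

root = ET.fromstring(ENTRY_XML)
orthography = root.get("orthography")  # the key for the whole group
readings = [(e.findtext("pos"), e.findtext("phonetic"))
            for e in root.findall("entry")]
print(orthography, readings)
```

The single `orthography` attribute keys both readings, mirroring how one wordform with multiple POS still has a unique entry.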
The determination of the “correct” way of writing of a given wordform is highly language-dependent. When more than one “correct” spelling exists, additional spellings might optionally be coded in the language-dependent section of each entry.
Multiple-token entries are allowed only in the proper names section of the lexica, and in those cases where the individual components of a phrase never occur on their own, are meaningless, or take a very different meaning when considered in isolation, e.g. ipso facto. This may also be true for a limited set of adverbial, conjunctive or prepositional phrases. Phrases are coded by replacing blanks with underscores (e.g. New_York, Stratford_on_Avon, ipso_facto).
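The underscore coding of multi-token entries amounts to a simple string transformation; the helper below is hypothetical, not part of the lexicon tooling:

```python
def code_phrase(phrase: str) -> str:
    """Code a multi-token entry by replacing blanks with underscores."""
    return "_".join(phrase.split())

print(code_phrase("New York"))    # New_York
print(code_phrase("ipso facto"))  # ipso_facto
```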
Numeric entries (e.g. 1994, 17.5, XVIII, 2°) are removed from wordlists, as well as punctuation marks.
Acronyms playing the role of proper names are listed in the proper name section of the lexica (e.g. MIT, IBM, UN, EU). Other acronyms that are relevant to applications are listed in the application wordlist section (e.g. GmbH, SpA). Acronyms are not expanded. Acronym spellings are coded with or without dots according to which form is more common in the language.
Abbreviations are moved into the application wordlist section of the lexica. Multiple expansions are allowed (i.e. several full spellings may be associated with one abbreviation).
Casing is treated on a language-dependent basis. Catalan distinguishes proper names by capitalization.
The character set is ISO 8859-1.
Phonetic transcription information is coded for each entry. We refer to “pronunciation” as the way the word is spoken in isolation: i.e. no assimilation processes within sentences are taken into account. However, when deemed appropriate for a given language, assimilation processes can be treated on a language-dependent basis and sets of rules may be provided accordingly.
We use the SAMPA phonetic alphabet with a stress marker (") and a syllable boundary marker (-). Multiple pronunciations are coded if they are in common use. Phonemes are separated from adjacent phonemes and from syllable boundary markers by a space. Pause markers are allowed for some long entries (e.g. company names consisting of many words).
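Reading such a transcription string can be sketched as follows, assuming exactly the conventions above: every symbol is space-separated, "-" is a syllable boundary, and the stress marker applies to the following syllable. The example transcription is illustrative, not taken from the lexicon:

```python
def syllabify(transcription: str):
    """Split a space-separated SAMPA string into syllables and
    return the index of the stressed syllable (or None)."""
    syllables, current, stressed = [], [], None
    for symbol in transcription.split():
        if symbol == "-":              # syllable boundary
            syllables.append(current)
            current = []
        elif symbol == '"':            # stress marks the next syllable
            stressed = len(syllables)
        else:                          # an ordinary phoneme symbol
            current.append(symbol)
    syllables.append(current)
    return syllables, stressed

syls, stressed = syllabify('B @ r - s @ - " l o - n @')
print(syls, stressed)  # four syllables, stress on the third (index 2)
```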
Foreign words are transcribed phonetically using the symbol set used for regular words in each language. Approximations to foreign sounds and representations of foreign sounds are documented in the language phoneme set. In the common word list, foreign words are tagged with the language they come from via the standard XML attribute “xml:lang”.
The common words part of the lexicon was transcribed automatically. Foreign words and names were fully checked manually.
Electronic dictionaries were used to make spellings uniform. Phonetic transcription was based on an in-house Catalan grapheme-to-phoneme transcriber. Transcription rules and exceptions were discussed with the expert linguists who transcribed the names and foreign words. Tagging was done automatically and fully supervised manually. Clitics were handled manually. Ten percent of all manual work was double-checked every week; if discrepancies between linguists were detected, the work was redone. The full lexicon and documentation were validated by SPEX.
This database is commercially available.