Semantic Model: Supporting Multilingual Description

HOPE aims to provide an international audience with access to materials written in a variety of languages and supplied by social history institutions from across Europe, via, ''inter alia'', broad thematic cross-domain portals like Europeana. This endeavor raises the question of multilingual access on two levels. First, HOPE is a pan-European consortium; eight institutional languages are represented, and this is reflected in the metadata being provided. The broad challenge remains to overcome the linguistic 'cacophony' in order to allow materials to be searched as a corpus.

Second, and perhaps less obviously, the HOPE Social History Resource is itself multilingual; collections are in many of the major European languages and sometimes in more than one. More to the point, several HOPE institutions are proposing to supply textual collections in a language that differs from the cataloging language. A few have attempted to provide partial or full metadata in the language 'native' to the material; most have not. Currently, at least seven collections proposed for submission from three different content providers have metadata in a language different from that of the content. And while the practice of providing descriptions solely in the language of cataloging may have sufficed in the context of institutional websites and catalogs, it becomes problematic as material is disseminated through large-scale discovery services like Europeana. This issue is further complicated by the varied approaches of domain-specific metadata and cataloging standards. The library domain in particular has a long history of capturing original language metadata 'as inscribed' on the title pages and credits of published/produced materials. The more specific challenge, then, is to overcome the very real barriers to access when users are confronted with descriptive metadata in one (often not understood) language about material in another (often understood) language. Though multilingual access has not been the primary focus of HOPE, the project has attempted to confront both the broad and specific challenges to some degree.

In an ideal world, a user would be able to perform searches in a language of his or her choice; the system would then retrieve all relevant results in all the languages of the metadata and/or the digital objects (for full-text searches), and the user would receive a translation of the results in the language(s) desired. Although multilingual access to digital resources is an active topic of research (note, among others, MultiMatch and EuropeanaConnect, both European Commission sponsored projects treating the issue), as it currently stands the vision above remains a utopia. The key problem in automating multilingual access to digital resources remains the semantic specificity of each language, compounded by the specificity of the knowledge domain (in our case, social history) in which content providers and users operate when creating or searching for information. Since these semantic difficulties are specific to each language, they are multiplied in a search system that deals with multilingual searching (metadata or full-text digital objects in multiple languages).

Given the current state of research, HOPE has been compelled to adopt a more pragmatic approach to the challenges presented. First and foremost, it has provided the semantic structure to capture data already created by content providers, including elements to record the various uses of language for a given resource and elements to accommodate the multilingual/multiscript metadata currently available. Provisions to capture both language information and textual description in multiple languages or scripts should provide a basis for further work on the topic.

HOPE Data Model Language Elements

The HOPE Content Providers Survey showed that most content providers currently record the language of the described material for text-based resources as well as the language of cataloging. In response, the HOPE data model includes several elements for recording language information. All of these elements record the information supplied by the content provider both as a string and as a normalized element attribute. HOPE uses the ISO 639-3 data value standard for language codes. The language elements specified in the HOPE data model are:

''Language Digital Content (0,*):'' This element records one or more languages used in or by the original resource represented by the digital object. This element is optional since not all HOPE resources are language based, but it is recommended for resources which have textual content. Depending on the domain profile employed, HOPE content providers may map to this element from one or several available fields supported in the domain standards or supply it separately.
''Language (0,*):'' Like the above, this element, when available, captures metadata on languages used in or by the original resource as already recorded in the descriptive metadata on the resource. The semantics of this element correspond with the Language Digital Content element, but provide it as a free-text field to capture any legacy data about the language of the original resource supplied by the content provider.
Currently, the element is only represented in two of the five domain profiles. The labels for the Language elements differ depending on the profile used: Language of the Described Material (Archival Profile), Original Language (Audiovisual Profile), Language Used (Audiovisual Profile), Sub-Title Language (Audiovisual Profile). At the current time, HOPE does not support the MARC 546 Language Note field nor the DC Language field for mapping a free-text language statement.
''Language Metadata (1,1):'' This element indicates the primary cataloging language of the metadata describing the original resource. This element is mandatory and can have only one occurrence. Secondary languages, used for translation of specific values, are recorded using the language attribute discussed below.
''Europeana Language (1,1):'' This element indicates one or more official languages associated with the country in which the content provider is located. The element is mandatory, since it is required by Europeana, but is otherwise of no added value within HOPE.
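The pairing of a provider-supplied string with a normalized attribute can be illustrated with a minimal Python sketch. The class name, function name, and lookup table below are hypothetical and are not part of the HOPE schema; the table covers only a handful of languages, where a real implementation would draw on the full ISO 639-3 code list.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical lookup table for illustration only; a real mapping would
# cover the complete ISO 639-3 code list and more label variants.
ISO_639_3 = {
    "german": "deu", "deutsch": "deu",
    "english": "eng",
    "dutch": "nld", "nederlands": "nld",
    "french": "fra",
    "hungarian": "hun",
}


@dataclass
class LanguageValue:
    label: str            # string exactly as supplied by the content provider
    code: Optional[str]   # normalized ISO 639-3 code, or None if not recognized


def normalize_language(label: str) -> LanguageValue:
    """Keep the supplied string and attach a normalized code when one is known."""
    code = ISO_639_3.get(label.strip().lower())
    return LanguageValue(label=label, code=code)
```

Keeping the original string alongside the code means that legacy values which cannot be normalized are still preserved rather than discarded.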

Support for Multilingual Description

The HOPE Content Providers Survey revealed that ten content providers out of twelve describe their collections in a single cataloging language: 83.5 percent of the metadata records supplied to HOPE were identified as unilingual descriptions. Almost half (47.5 percent) of all the unilingual metadata records were in German, followed at some distance by English (14.2 percent) and Dutch (9.6 percent). About 3.5 percent of the unilingual metadata records were in Italian, Portuguese, and Finnish, while 1.6 percent and 0.4 percent of these records were in French and Hungarian, respectively. Yet these figures should be treated with a certain amount of caution given the varied approaches of domain standards. Library cataloging standards in particular require title information to be recorded in the original language, i.e. 'as inscribed' on the title page or in credits; translations (presumably in the language of cataloging) may also be recorded, but these are optional. Thus many so-called unilingual library records identified in HOPE may actually be 'mixed records', with title information in the language of the material and the remainder of the information in the cataloging language.

The remaining 16.5 percent of the descriptions were identified as bilingual, the vast majority being in Dutch-English (15.5 percent) with French-German and Hungarian-English records amounting to less than 1 percent each of the bilingual descriptions. Of these descriptions, most are only partially bilingual. Such records include a translation or transliteration of the title only, which is entered in a note field and indexed for searching. We might assume that such practices closely followed the library practice and additional fields contain metadata in the language of cataloging while primary titles are recorded in the original language. Only one content provider claimed to support elements for translated values, including 'translated titles' and 'translated descriptions' of archival, library, and audiovisual metadata.

In order to accommodate current practices and to encourage content providers to record multilingual metadata when possible, HOPE made two provisions in its data model. First, it included elements already existing in the domain standards for recording translated data. Two domain profiles currently contain Title elements dedicated to translated titles with the following domain-specific labels: Parallel Title (Library Profile), Translated Title (Library Profile), Translated Sub-Title (Library Profile), Translated Title (Audiovisual Profile). The Audiovisual Profile also has a Description element with the domain-specific label: Translated Synopsis. These domain-specific labels enable a distinction to be made between the translated and original values for these elements. As noted, the primary values can generally be inferred to be in the language of the material. MARC has a dedicated subfield to record the language attributes of translated titles which can be mapped to a language attribute in HOPE. (Of course for parallel titles or titles of multilingual works, determining the language of each value may prove trickier.) EN15907 also includes language attributes for the title and description elements.

HOPE also supports the supply of full multilingual descriptive metadata by making most descriptive metadata elements in the HOPE data model repeatable as well as providing them with a language attribute. The Language Attribute is one in a set of several attributes available for descriptive metadata elements and is used to indicate the language of a value not recorded in the primary language specified in the Language Metadata element. The Language Attribute is given as a normalized value. Currently few content providers are taking advantage of this option, but HOPE nevertheless encourages the supply of robust multilingual metadata.
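As a rough illustration of how repeatable elements with a language attribute might be consumed, the Python sketch below selects the occurrence that best matches a user's preferred language. The names `ElementValue` and `pick_value` are hypothetical, and the rule that a missing attribute means the primary cataloging language is an assumption drawn from the description above, not a documented HOPE behavior.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ElementValue:
    text: str
    lang: Optional[str] = None  # ISO 639-3 code; None = primary cataloging language (assumed)


def value_language(value: ElementValue, language_metadata: str) -> str:
    """Resolve the language of a value, falling back to the Language Metadata element."""
    return value.lang or language_metadata


def pick_value(values: List[ElementValue], preferred: str,
               language_metadata: str) -> Optional[ElementValue]:
    """Return the occurrence in the preferred language, else the primary-language one."""
    for v in values:
        if value_language(v, language_metadata) == preferred:
            return v
    for v in values:
        if value_language(v, language_metadata) == language_metadata:
            return v
    return values[0] if values else None


titles = [
    ElementValue("Geschiedenis van de arbeidersbeweging"),      # primary language (nld)
    ElementValue("History of the labour movement", lang="eng"), # translated occurrence
]
print(pick_value(titles, preferred="eng", language_metadata="nld").text)
```

The fallback order (preferred language, then cataloging language, then first occurrence) is one possible design choice; a portal could equally show all occurrences side by side.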

Support for Varied Scripts and Transliterated Values

The HOPE content providers must be able to export all descriptive metadata records encoded in UNICODE UTF-8. UNICODE UTF-8 supports all European character sets as well as more complex Asian and Middle Eastern scripts. Currently, eight content providers, representing 70.1 percent of the metadata records, use solely UNICODE UTF-8. The remaining content providers use UTF-8 and ISO 8859-1 alone or in combination; both can easily be converted to UNICODE UTF-8. Nevertheless, even in an all UNICODE environment there are still obstacles related to the use of non-Latin characters and diacritics that hinder the interoperability of metadata and thus the accessibility of resources.
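The conversion from ISO 8859-1 to UTF-8 mentioned above can be sketched in a few lines of Python. This is a heuristic sketch and not a description of HOPE's actual ingest tooling: the function names are hypothetical, and where a content provider declares the encoding of an export, that declaration should be preferred over guessing.

```python
def decode_metadata(raw: bytes) -> str:
    """Decode bytes as UTF-8, falling back to ISO 8859-1.

    UTF-8 must be tried first: ISO 8859-1 assigns a character to every
    byte value, so decoding with it never fails and would silently
    garble genuine UTF-8 input.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("iso-8859-1")


def to_utf8(raw: bytes) -> bytes:
    """Return the same text re-encoded as UTF-8."""
    return decode_metadata(raw).encode("utf-8")


latin1_bytes = "Müller".encode("iso-8859-1")
print(to_utf8(latin1_bytes).decode("utf-8"))
```

Input that is already valid UTF-8 passes through unchanged, so the same function can be applied to mixed exports.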

In the HOPE Content Providers Survey, only 1.5 percent of the metadata records slated for supply to HOPE contained metadata in a script other than Latin. These records came from two separate content providers, and the script in both cases was Russian Cyrillic, which both content providers generally transliterate as well. Another institution indicated that it transliterates from Russian Cyrillic, Chinese, Hebrew, Bengali, and Urdu scripts. Of these, transliterations from Russian Cyrillic and Chinese were the most common. A fourth institution transliterated metadata from Arabic script. In sum, at the time of the survey, content providers generally had original script and transliterated metadata together, or transliterated metadata only, for any given element. Few cases were identified of non-Latin script elements existing without transliteration, most likely due to the prescriptions of domain standards or the exigencies of indexing systems.

All other issues aside, transliteration is required to properly index metadata in a variety of alphabets. HOPE therefore requires that content providers perform transliteration of metadata values which are in a non-Latin script to the Latin script for all indexed fields. As domain standards do not include dedicated elements for original script/transliterated data, HOPE does not include domain specific labels for these values. Instead, when content providers choose to store and map both original alphabet and transliterated values, the transliterated value can be recorded as an additional occurrence of the metadata element. To support the recording of original script values, HOPE provides a Script Attribute for most descriptive metadata elements. The Script Attribute is given as a normalized value and is generally used in conjunction with the Language Attribute. Though this approach does not forestall all possible indexing problems, it does stave off the worst.
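The practice of recording a transliteration as an additional occurrence of the same element, with a Script Attribute on each value, might look like the following Python sketch. Everything here is illustrative: the class and function names are hypothetical, the script values use ISO 15924 codes as an assumption about what a normalized script value could be, and the character table is a simplified mapping for Russian Cyrillic that does not follow any particular published transliteration standard.

```python
import unicodedata
from dataclasses import dataclass
from typing import List, Optional

# Simplified, illustrative mapping; NOT a published transliteration standard.
CYRILLIC_TO_LATIN = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "е": "e", "ё": "e",
    "ж": "zh", "з": "z", "и": "i", "й": "i", "к": "k", "л": "l", "м": "m",
    "н": "n", "о": "o", "п": "p", "р": "r", "с": "s", "т": "t", "у": "u",
    "ф": "f", "х": "kh", "ц": "ts", "ч": "ch", "ш": "sh", "щ": "shch",
    "ъ": "", "ы": "y", "ь": "", "э": "e", "ю": "iu", "я": "ia",
}


@dataclass
class ElementValue:
    text: str
    lang: Optional[str] = None    # ISO 639-3 code
    script: Optional[str] = None  # ISO 15924 code (assumed convention)


def detect_script(text: str) -> Optional[str]:
    """Guess the script from the first alphabetic character (Cyrillic or Latin only)."""
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name.startswith("CYRILLIC"):
                return "Cyrl"
            if name.startswith("LATIN"):
                return "Latn"
            return None
    return None


def transliterate(text: str) -> str:
    """Replace Cyrillic letters using the table above; leave other characters as-is."""
    out = []
    for ch in text:
        latin = CYRILLIC_TO_LATIN.get(ch.lower())
        if latin is None:
            out.append(ch)
        else:
            out.append(latin.capitalize() if ch.isupper() else latin)
    return "".join(out)


def add_transliterations(values: List[ElementValue]) -> List[ElementValue]:
    """Keep each original value and append a Latin-script occurrence for Cyrillic ones."""
    result = []
    for v in values:
        script = v.script or detect_script(v.text)
        result.append(ElementValue(v.text, v.lang, script))
        if script == "Cyrl":
            result.append(ElementValue(transliterate(v.text), v.lang, "Latn"))
    return result
```

For a title supplied as `ElementValue("Правда", lang="rus")`, this yields two occurrences: the original tagged `Cyrl` and a transliterated one tagged `Latn`, so that an index restricted to Latin characters can still match the record.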

HOPE Multilingual Support as Best Practice

HOPE offers a broad and uniform support for multilingual and multiscript metadata. Through the above mechanisms HOPE has attempted to capture metadata related to language use as well as multilingual/script values already created by HOPE content providers. In so doing, HOPE has created in its data model the potential to capture and exploit robust multilingual/multiscript metadata should the HOPE Best Practice Network choose to take advantage of it.

Nevertheless, the current system has several flaws. The first, mentioned elsewhere, is the inconsistent application of the Descriptive Unit Language element across the domains. The lacuna is particularly notable in the case of the library profile. The second is that restrictions in the domain profile sometimes preclude multilingual options. For instance, in both the Dublin Core and visual profiles the Title and Description elements are not repeatable elements. In broader terms, HOPE's domain-specific multilingual support (including elements and even attributes from the various domains) and HOPE's underlying multilingual support (as represented in its data model) are not clearly distinguished. In fact, the underlying support for multilingual and multiscript metadata is broader than it appears through the lens of the domains. This can be addressed in large part through the clear documentation of the options available with practical guidance for implementing these options through the mapping process.

Next, we will describe HOPE's initial attempts to harmonize multilingual metadata through specific best practices on translation and transliteration. (See: Semantic Harmonization, section on [[Semantic_Harmonization#HarmonizingMultilingualScriptDescriptions|Harmonizing Multilingual/Script Descriptions]].) The project has also supported the further enhancement of submitted metadata in order to bridge the gaps in linguistic and professional practice. Both harmonization of existing metadata and enrichment of the HOPE Social History Resource are intended to facilitate searching across languages and domains.

Related Resources

HOPE: Heritage of the People's Europe. "Section: Data Model." ''The HOPE Common Metadata Structure, including Harmonisation Specifications''. May 2011. (http://www.peoplesheritage.eu/pdf/D2_2_Metadata%20Structure.pdf)

HOPE: Heritage of the People's Europe. "Section: XML Schema." ''The HOPE Common Metadata Structure, including Harmonisation Specifications''. May 2011. (http://www.peoplesheritage.eu/pdf/D2_2_Metadata%20Structure.pdf)

HOPE: Heritage of the People's Europe. "Appendix B: HOPE Audiovisual Profile (prototype)." ''The HOPE Common Metadata Structure, including Harmonisation Specifications''. May 2011.

HOPE: Heritage of the People's Europe. ''HOPE Archival Profile Mapping Table, Version 1.5''. August 2012.

HOPE: Heritage of the People's Europe. ''HOPE Dublin Core Profile Mapping Table, Version 1.5''. August 2012.

HOPE: Heritage of the People's Europe. ''HOPE Library Profile Mapping Table, Version 1.5''. August 2012.

HOPE: Heritage of the People's Europe. ''HOPE Visual Profile Mapping Table, Version 1.5''. August 2012.

''ISO (International Organization for Standardization), Language codes - ISO 639'' (https://www.iso.org/iso-639-language-codes.html)