Case Study: Open Society Archives (OSA) Manages Data about Objects
Since opening their doors in 1995, the Open Society Archives (OSA) at the Central European University in Budapest (Hungary) have collected and made openly accessible material on communism, the Cold War, and their aftermath and on human rights worldwide. OSA hold approximately 7,000 linear meters of archival records (including audiovisual content) and a library collection which comprises more than 6,500 dailies, journals, and informal press titles. OSA have approximately 30 staff members with a small dedicated ICT unit, which works in coordination with the university's main ICT unit.
Prior to the HOPE project, OSA depended on separate in-house developed solutions to catalog their archives, library, and film library collections and make them available on the website. Separate databases were likewise created for each digitized collection. For OSA, the HOPE project was primarily a means to improve their internal systems and practices; in particular, they were eager to introduce a robust digital object repository. OSA opted to join the HOPE PID Service and the HOPE Shared Object Repository (SOR), with the aim of integrating both systems closely into their envisioned repository.
It was clear from the outset of the HOPE project that a digital object repository was not only desirable but also necessary in order to meet HOPE requirements. The existing system was fragmented with a separate data structure for each digital project and little or no administrative metadata kept on the digitized content. The storage of digital objects was also idiosyncratic, with masters generally stored on tapes and [[Glossary#Derivative|derivatives]] distributed over several servers or stored directly in the website.
OSA's first step was to develop a common metadata schema, which included both descriptive metadata (for archival items, library items, and digital collections) and technical metadata on the related digital files. The schema was based on known standards, primarily Encoded Archival Description (EAD), MARC21, Dublin Core (specifically, the Dublin Core Collections Application Profile), and PREMIS. For archival description, OSA opted to define their own metadata elements based on EAD rather than to directly use the schema; MARC21, Dublin Core, and PREMIS elements were directly incorporated.
A new architecture was introduced based on:
- ''Fedora Commons:'' low-level metadata storage;
- ''Apache Solr:'' search engine;
- ''Drupal:'' website content management system and content display user interface.
Fedora is designed around “compound digital objects”, whereby one or more “content items” are aggregated into the same digital object. Content items can be of any format and can either be stored locally in the repository or stored externally and referenced by the digital object. Each content item in Fedora is represented by a datastream. OSA found Fedora to have “excellent flexibility”, allowing them to design objects according to institutional needs. In the end, OSA decided on an atomistic “everything is an object” approach and defined objects in the following way:
''Collection objects'' include a descriptive metadata datastream for a single collection.
''Item objects'' include:
- Descriptive metadata datastream for a single item (multi-page document, single-page document, film, etc.);
- Relationship metadata on the parent collection;
- METS metadata datastream to relate file objects to items.
''File objects'' include:
- File (master, derivative, or other), as externally managed content on OSA's file server or in the HOPE Shared Object Repository;
- Technical metadata, as an externally managed XML file on OSA's file server.
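The three object types above can be pictured as a small data model. The following is a minimal sketch only, not OSA's Fedora content model: the class names, field names, and datastream labels are hypothetical, and it assumes that descriptive metadata is held in the repository while files and technical metadata are referenced externally, as described above.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Datastream:
    """One content item of an object (labels are illustrative, not Fedora's)."""
    label: str
    location: Optional[str] = None  # set when the content is externally managed

    @property
    def external(self) -> bool:
        return self.location is not None


@dataclass
class CollectionObject:
    object_id: str
    descriptive: Datastream          # descriptive metadata for one collection


@dataclass
class FileObject:
    object_id: str
    content: Datastream              # master, derivative, or other file
    technical: Datastream            # technical metadata XML file


@dataclass
class ItemObject:
    object_id: str
    descriptive: Datastream          # descriptive metadata for one item
    parent_collection: str           # relationship metadata: parent collection id
    files: List[FileObject] = field(default_factory=list)

    def structural_map(self) -> List[str]:
        """File object ids in order, as a METS datastream would relate them."""
        return [f.object_id for f in self.files]


# Hypothetical example values; the paths are placeholders.
collection = CollectionObject("osa:collection-1", Datastream("DESCRIPTIVE"))
page = FileObject(
    "osa:item-1_m_001",
    content=Datastream("FILE", location="file:///masters/item-1_m_001.tif"),
    technical=Datastream("TECHMD", location="file:///masters/item-1_m_001.xml"),
)
item = ItemObject("osa:item-1", Datastream("DESCRIPTIVE"),
                  parent_collection=collection.object_id, files=[page])
```

Keeping every file as its own object, rather than as a datastream of the item, is what the text calls the atomistic approach: the item only points to its files through the structural map.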
Each collection and item object will have a unique local identifier in the form of “osa:” followed by a 32-character hexadecimal GUID. Each file object will have a local identifier consisting of the identifier of the item to which it belongs, followed by a quality suffix and sequential page information. These local identifiers will form the root of the OSA PIDs stored by the HOPE PID Service.
Example: The following shows the local identifier and PID proposed for a sample Fedora item (which encompasses the descriptive and structural metadata for a single digital object).
- Item Fedora ID: osa:3b34347820ef45f0bfd87239c096c5a1
- Item PID: hdl:10891/osa:3b34347820ef45f0bfd87239c096c5a1
- Item PID URL: http://hdl.handle.net/10891/osa:3b34347820ef45f0bfd87239c096c5a1
Example: The following shows the local identifier and file name proposed for the master of the first page of the above compound object item.
- Master Fedora ID: osa:3b34347820ef45f0bfd87239c096c5a1_m_001
- Master file name: 3b34347820ef45f0bfd87239c096c5a1_m_001.tif
Example: The following shows the PID for the master and derivatives 2 and 3 when the master is submitted to the Shared Object Repository. In this case, derivatives are generated and stored by the Shared Object Repository and the URI forms of PIDs include Handle location attributes generated by the HOPE PID Service.
- Master PID: hdl:10891/osa:3b34347820ef45f0bfd87239c096c5a1_m_001
- Master PID URL: http://hdl.handle.net/10891/osa:3b34347820ef45f0bfd87239c096c5a1_m_001?…
- PDF PID: hdl:10891/osa:3b34347820ef45f0bfd87239c096c5a1_m_001
- PDF PID URL: http://hdl.handle.net/10891/osa:3b34347820ef45f0bfd87239c096c5a1_m_001?…
- Thumbnail PID: hdl:10891/osa:3b34347820ef45f0bfd87239c096c5a1_m_001
- Thumbnail PID URL: http://hdl.handle.net/10891/osa:3b34347820ef45f0bfd87239c096c5a1_m_001?…
Example: The following shows the PID for the master as well as the local identifiers, file names, and PIDs for derivatives 2 and 3 when these are generated and stored locally.
- Master PID: hdl:10891/osa:3b34347820ef45f0bfd87239c096c5a1_m_001
- Master PID URL: http://hdl.handle.net/10891/osa:3b34347820ef45f0bfd87239c096c5a1_m_001
- Derivative PDF Fedora ID: osa:3b34347820ef45f0bfd87239c096c5a1_l_000
- Derivative PDF file name: 3b34347820ef45f0bfd87239c096c5a1_l_000.pdf
- Derivative PDF PID: hdl:10891/osa:3b34347820ef45f0bfd87239c096c5a1_l_000
- Derivative PDF PID URL: http://hdl.handle.net/10891/osa:3b34347820ef45f0bfd87239c096c5a1_l_000
- Derivative thumbnail Fedora ID: osa:3b34347820ef45f0bfd87239c096c5a1_t_001
- Derivative thumbnail file name: 3b34347820ef45f0bfd87239c096c5a1_t_001.jpg
- Derivative thumbnail PID: hdl:10891/osa:3b34347820ef45f0bfd87239c096c5a1_t_001
- Derivative thumbnail PID URL: http://hdl.handle.net/10891/osa:3b34347820ef45f0bfd87239c096c5a1_t_001
(File names follow the same convention as Fedora file identifiers.)
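The convention illustrated by these examples can be expressed in a few lines of code. This is a sketch under stated assumptions rather than OSA's implementation: the function names are hypothetical, `uuid4` is only one possible source for the GUID, and the quality suffix is treated as a single lowercase letter and the page counter as three digits because that is what the examples show.

```python
import re
import uuid
from typing import Tuple

HANDLE_PREFIX = "10891"                 # Handle prefix used in the examples
RESOLVER = "http://hdl.handle.net/"     # resolver used in the example URLs
FILE_ID = re.compile(r"^(osa:[0-9a-f]{32})_([a-z])_(\d{3})$")


def new_item_id() -> str:
    """Return 'osa:' plus a 32-character hex GUID (uuid4 is an assumption)."""
    return "osa:" + uuid.uuid4().hex


def file_id(item_id: str, quality: str, page: int) -> str:
    """Append the quality suffix and a zero-padded page number to an item id."""
    return f"{item_id}_{quality}_{page:03d}"


def file_name(fid: str, extension: str) -> str:
    """File names repeat the identifier without the 'osa:' namespace."""
    return fid.split(":", 1)[1] + "." + extension


def pid(local_id: str) -> str:
    return f"hdl:{HANDLE_PREFIX}/{local_id}"


def pid_url(local_id: str) -> str:
    return f"{RESOLVER}{HANDLE_PREFIX}/{local_id}"


def parse_file_id(fid: str) -> Tuple[str, str, int]:
    """Split a file identifier into (item id, quality suffix, page number)."""
    match = FILE_ID.match(fid)
    if match is None:
        raise ValueError(f"not a file identifier: {fid!r}")
    item_id, quality, page = match.groups()
    return item_id, quality, int(page)


item = "osa:3b34347820ef45f0bfd87239c096c5a1"
master = file_id(item, "m", 1)
print(master)                    # osa:3b34347820ef45f0bfd87239c096c5a1_m_001
print(file_name(master, "tif"))  # 3b34347820ef45f0bfd87239c096c5a1_m_001.tif
print(pid_url(master))
```

Because the file identifier embeds the item identifier, the parent item can be recovered from any file name or PID without a lookup, which is the sense in which the syntax reflects the relationship of entities.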
The identifier convention went through several iterations as OSA adapted and defined its workflow and data structure. The final convention was determined in a meeting involving professional colleagues, ICT, and management, which also treated archival reference codes and library special collection call numbers. The need to create persistent and globally unique PIDs drove this process. The syntax thus arrived at for local identifiers, PIDs, and file names is robust and also reflects the relationship of entities in OSA's system. On the other hand, identifiers and file names exceed recommended lengths, which may hinder internal administration and possibly external use. The inclusion of an institutional acronym could also prove a problem over the long term.
In addition to local identifiers and PIDs, OSA plan to capture and store a range of technical metadata, including provenance information. As mentioned, OSA use PREMIS as a base schema but include extensions to other format-specific schemas, e.g. NISO MIX, videoMD, and audioMD; PREMIS accommodates such extensions. To automatically capture as much of this metadata as possible, OSA tested several available solutions across various content formats, comparing the results to their technical metadata requirements. In the end, OSA opted for two solutions: JHOVE for documents and images, and MediaInfo for audio and video files. These were chosen based on "the completeness of results and their active development status". Both programs generate XML files as output. These technical metadata files will be named in accordance with master copies and stored in the same directory structure on a local file server. To facilitate automatic metadata capture, OSA plan to develop several small applications. The first will check the directory structure to locate any master files without an accompanying technical metadata XML file; if any are found, the application will check the MIME type and trigger the generation of metadata using either JHOVE or MediaInfo. The second will create a datastream pointing to the technical metadata XML file, as part of the Fedora file object creation process. Fedora will be able to pull in the content of the technical metadata file upon request.
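The first of these small applications could look roughly like the sketch below. It is an illustration, not OSA's code: it assumes that masters are recognizable by the `_m_` quality suffix in their file name and that the technical metadata file shares the master's name with an `.xml` extension. It only reports which tool would be triggered; invoking JHOVE or MediaInfo is left out, since the exact command lines would depend on the local installation.

```python
import mimetypes
from pathlib import Path
from typing import Iterator, Tuple


def choose_tool(path: Path) -> str:
    """Pick the extraction tool from the MIME type guessed from the file name."""
    mime, _ = mimetypes.guess_type(path.name)
    if mime is None:
        return "unknown"
    if mime.startswith(("audio/", "video/")):
        return "mediainfo"
    return "jhove"  # documents and images


def masters_missing_metadata(root: Path) -> Iterator[Tuple[Path, str]]:
    """Yield (master file, tool) for each master without a metadata XML file."""
    for path in sorted(root.rglob("*")):
        if not path.is_file() or path.suffix.lower() == ".xml":
            continue
        if "_m_" not in path.stem:      # assumption: masters carry the "_m_" suffix
            continue
        sidecar = path.with_suffix(".xml")  # assumption: same name, .xml extension
        if not sidecar.exists():
            yield path, choose_tool(path)


if __name__ == "__main__":
    for master, tool in masters_missing_metadata(Path(".")):
        print(f"{master}: generate technical metadata with {tool}")
```

Running the scan repeatedly is harmless, because a master drops out of the result as soon as its metadata file exists.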
Several values cannot be generated automatically during this process, and OSA are still looking for possible solutions to this problem. These include metadata on the hardware and software used to create the files as well as the creation date, format registry information, and original file names. OSA plan to look into possibilities for embedding more metadata into files at the point of digitization, using metadata schemas such as Exif or XMP. They may also consider creating a collection-level technical metadata datastream to store global values such as the names of service providers or format registry information as well as external links to scanning logs and file naming tables for the whole collection. They point out that since this information is primarily for long-term preservation, it is currently unnecessary to store it at a high level of granularity.
Beyond this, OSA still plan to develop modules for Rights and Events management, also based on the PREMIS model, though extended to meet local needs. Events will allow OSA to monitor new file creation and deletion, fixity checks, and other information related to the object's history after submission to the repository. Rights are more problematic. OSA are currently looking into methods which will allow them to manage the rights for all their holdings, analog and digital, across the entire archival workflow.
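To make the notion of an event concrete, a fixity check might be recorded along the following lines. This is a hypothetical sketch, since the Events module was still only planned: the dictionary keys borrow the names of PREMIS semantic units (eventType, eventDateTime, eventOutcome, eventOutcomeDetail, linkingObjectIdentifier), while the choice of SHA-256 and the flat dictionary layout are assumptions.

```python
import hashlib
from datetime import datetime, timezone
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Compute the SHA-256 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def fixity_event(object_id: str, path: Path, expected: str) -> dict:
    """Compare the file's current digest with the stored one and describe the result."""
    actual = sha256_of(path)
    return {
        "eventType": "fixity check",
        "eventDateTime": datetime.now(timezone.utc).isoformat(),
        "eventOutcome": "pass" if actual == expected else "fail",
        "eventOutcomeDetail": actual,
        "linkingObjectIdentifier": object_id,
    }
```

A failed outcome would signal that the stored file no longer matches the digest recorded at ingest and needs to be restored from another copy.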
OSA were deeply involved in repository best practice work and decided to use their own in-house development as a test-bed for practices they were advocating. Thus far, OSA have implemented many HOPE best practice recommendations into their data structure and system and have developed several solutions to generate, store, and manage administrative metadata and PIDs. Nevertheless, OSA are still at the beginning of a long process, and the real test will come as they attempt to manage objects through the entire process of digitization, ingest, storage, harvesting, and migration.