Title: Fishing in the Data Ocean
Team: Academia Sinica Center for Digital Cultures (ASCDC)
The Taiwan Digital Archives Union Catalogue (TaiUC) is an online portal, similar to Europeana, with over 5 million digital objects. Over the past 15 years it has collected digitized objects from more than 100 libraries, archives, museums, academic institutions, and government agencies across Taiwan, such as the National Central Library, Academia Historica, and the National Palace Museum. The collection includes books, newspapers, artworks, photos, specimens, and sounds. Most of the metadata and content are in Chinese and oriented toward Asian culture. The Academia Sinica Center for Digital Cultures (ASCDC) is now in charge of its sustainable operation. This presentation reports how we adopt the Linked Open Data (LOD) approach to publish these structured data, using the Fish Dataset from the first stage of our LOD project as an example, so that the metadata and their digital objects are connected with related resources around the world.
Data Sources and the Challenges
Taiwan and its surrounding islands, situated in East Asia at the northwestern edge of the Pacific, possess a great variety of terrains, forests, agricultural products, and marine ecologies. In the 19th century, a succession of adventurers from Western countries came to Taiwan (Formosa is the historical Portuguese name for the island of Taiwan) to collect animal and plant specimens for research. For instance, Robert Swinhoe (1836-1877), an English biologist, was the first to collect specimens on the main island of Taiwan and to organize them systematically on a large scale, thus pioneering research on Taiwan’s natural history. These adventurers collected and recorded species in Taiwan and brought them back to their home countries. The materials were then kept in different institutions and became first-hand information about Formosa’s natural environment for Europeans.
The 40 thousand specimens in the Fish Dataset of TaiUC, originally from four research museums and universities, form the first phase of the LOD project, which links them with other relevant biodiversity information and external authority vocabularies on the web. By transforming the metadata into Linked Open Data and publishing it, we will be able to access the precious digital heritage scattered around the world and retrace the journeys and achievements of biodiversity investigations in Taiwan.
However, we face the following challenges: (1) How can we recover the original rich contextual information of the Fish Collection through the LOD task? (2) The 40 thousand metadata records in the TaiUC come from various institutions, each with its own metadata format, and were mapped into the TaiUC in Dublin Core format. How can we accommodate and reconcile the variety of attributes and data values? (3) The synonyms of species are recorded in each metadata record and have not been linked to a LOD-based vocabulary of species names. How can we resolve this issue? (4) There are strong relationships among the 40 thousand digital items, but they have not been established yet. For instance, there are independent records for X-ray pictures, otolith images, and specimens concerning the same fish, and no mechanism has been built to aggregate these related records or to uncover their detailed relations. How can we make this work?
The existing 40,000 items of the fish collection in TaiUC are primarily specimen pictures, otolith images, X-ray images, and hand drawings. On average, each record is composed of 24 fields of biological data: scientific name, vernacular name, taxonomy, identification number, type, characteristic, habitats, distribution, fishery use, collect method, collect location, longitude, latitude, depth, specimen length, specimen weight, collection date, collector, publisher, reference, source, language, rights holder, and image rights.
To reconcile the data sources in the TaiUC while keeping the original contextual information of the fish collection, we first analyze the original XML files of the respective projects and develop a dedicated triple parser program using the Apache Jena framework. In the design we adopt “dcterms” as the generic vocabulary, together with 7 specific vocabularies: TaxonConcept (txn), Darwin Core (dwc), Open Vocabulary (ov), WGS84 (geo), Sampling Features (sam), Schema.org (schema), and Creative Commons (cc). In total, 29 property types are reused, which allows the rich and specific information to be presented thoroughly, so that the context of the data can be interpreted as originally intended.
Furthermore, 7 external authority vocabularies, such as AAT, DBpedia, DWC Terms, GeoNames, VIAF, and ASCDC, are added to the data model to enrich the original data. We adopt VIAF for people and institutions; GeoNames for geographical information; AAT and DWC Terms for data types; and the Catalogue of Life in Taiwan (TaiCOL) for scientific names. Finally, the triple-parsed metadata XML files are converted into RDF format.
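The conversion of an XML record into triples described above can be illustrated with a minimal sketch. The project itself used a parser built on Apache Jena; the Python below is not that parser, only a small stand-in using the standard library. The XML element names, the record id attribute, the base URI, and the field-to-property table are all hypothetical, and the sample record is made up for illustration. Only the namespace URIs for dcterms, Darwin Core, and WGS84 are the published ones.

```python
import xml.etree.ElementTree as ET

# Published namespace URIs for three of the reused vocabularies.
PREFIXES = {
    "dcterms": "http://purl.org/dc/terms/",
    "dwc": "http://rs.tdwg.org/dwc/terms/",
    "geo": "http://www.w3.org/2003/01/geo/wgs84_pos#",
}

# Hypothetical mapping from source XML element names to reused properties.
FIELD_TO_PROPERTY = {
    "title": "dcterms:title",
    "scientificName": "dwc:scientificName",
    "vernacularName": "dwc:vernacularName",
    "latitude": "geo:lat",
    "longitude": "geo:long",
}

# Placeholder subject namespace, not the real TaiUC URI space.
BASE = "http://example.org/fish/"


def expand(curie):
    """Expand a prefixed name such as 'dwc:scientificName' to a full URI."""
    prefix, local = curie.split(":", 1)
    return PREFIXES[prefix] + local


def escape(value):
    """Escape backslashes, quotes and newlines for an N-Triples literal."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def record_to_ntriples(xml_text):
    """Turn one XML record into N-Triples lines, skipping unmapped or empty fields."""
    root = ET.fromstring(xml_text)
    subject = "<" + BASE + root.attrib["id"] + ">"
    lines = []
    for child in root:
        prop = FIELD_TO_PROPERTY.get(child.tag)
        text = (child.text or "").strip()
        if prop is None or not text:
            continue
        lines.append(f'{subject} <{expand(prop)}> "{escape(text)}" .')
    return lines


# Made-up sample record; 'collector' is deliberately left unmapped.
SAMPLE = """<record id="F0001">
  <scientificName>Carassius auratus</scientificName>
  <latitude>25.05</latitude>
  <collector>unknown</collector>
</record>"""

for line in record_to_ntriples(SAMPLE):
    print(line)
```

Keeping the mapping in a table rather than in code makes it easier to see which source fields are covered by which vocabulary, and which fields are still dropped.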
While Chinese is the primary language of the TaiUC, non-Chinese users should have no problem using our dataset, because we reused commonly adopted external authority resources such as AAT, VIAF, and GeoNames when converting the original XML files into the LOD dataset. In particular, AAT is a multilingual LOD dataset: it may describe the same concept with terms, including non-preferred terms, in different languages, including Chinese, so language should not be a barrier to interpreting the data.
In addition to data reconciliation, other challenges for this project are dealing with synonymous names in the original metadata, such as vernacular names and scientific names, and determining the taxonomic hierarchy, such as family and genus. Such information can change when biological identification methods change: with the shift from morphological identification to gene identification based on modern molecular technology, both the species classification system and the phylogeny have changed dramatically. Scientific names of Taiwanese species can be found on the Catalogue of Life in Taiwan (TaiCOL) website (taicol.tw), which is openly licensed and maintained by ASCDC, but this resource is not a LOD dataset. This gap requires us to concurrently convert the biological information stored in TaiCOL into a LOD-based dataset. For now, we have converted the data under Animalia into RDF format, so that all TaiUC information concerning taxon concepts can be interpreted through the TaiCOL LOD dataset.
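One way to picture the synonym problem is as a lookup from every recorded name, scientific or vernacular, to a single accepted taxon. The sketch below is only an illustration under stated assumptions: the taxon identifier, the two tables, and the function names are hypothetical, and a real implementation would draw the name list from the TaiCOL data rather than from a hand-written dictionary.

```python
import re

# Hypothetical accepted-name table: taxon identifier -> accepted scientific name.
ACCEPTED = {"taxon:0001": "Carassius auratus"}

# Hypothetical synonym index: normalized name -> taxon identifier.
# It holds the accepted name, an older scientific name, and a vernacular name.
SYNONYMS = {
    "carassius auratus": "taxon:0001",
    "cyprinus auratus": "taxon:0001",
    "goldfish": "taxon:0001",
}


def normalize(name):
    """Lower-case the name and collapse runs of whitespace."""
    return re.sub(r"\s+", " ", name).strip().lower()


def resolve(name):
    """Return (taxon_id, accepted_name), or None when the name is unknown."""
    taxon_id = SYNONYMS.get(normalize(name))
    if taxon_id is None:
        return None
    return taxon_id, ACCEPTED[taxon_id]


for raw in ["Cyprinus  auratus", "Goldfish", "Unknownus fishus"]:
    print(raw, "->", resolve(raw))
```

Names that do not resolve are returned as None rather than guessed, so they can be reviewed by hand when the taxonomy changes.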
We have published TaiUC’s data as LOD on a platform (data.ascdc.tw). Anyone who wishes to download digital files from this platform is free to do so. The collection is released under CC0, although the digital images still carry their own Creative Commons terms. In the search results, the structured information of the fish collection is shown through specific properties, and the data can be downloaded from the SPARQL endpoint. We also use the LodLive tool to visualize the LOD and help users understand the data model and its content clearly.
Views on the future 1: connect to the Europeana by the LOD task
We now use the search API service released by Europeana to identify scientific names, and we use the property “schema:isRelatedTo” to record links between scientific names in Europeana and the TaiUC. However, a scientific name alone is not specific enough: we have found more than 100 crucian carps in Europeana.
How do we find the Europeana items related to a given TaiUC fish? Can we simply match on location? Yes, but currently Europeana uses “Taiwan, Formosa” as the location name, whereas TaiUC uses a specific place such as the Keelung River. We have obtained more than 3,000 related items from Europeana. Therefore, we first need to build a connection with Europeana by adding “Taiwan” to our vocabularies. Second, we reuse the property owl:sameAs to link “Taiwan” with “Taiwan, Formosa”.
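The two steps above can be sketched as follows. This is a hedged illustration, not the project's implementation: all identifiers are hypothetical placeholders, the sample records are made up, and the BROADER table (which says that the local place lies within Taiwan) is an assumption added here so that a record located at the Keelung River can reach the country-level term.

```python
# Hypothetical owl:sameAs statements between the two vocabularies.
SAME_AS = [("taiuc:Taiwan", "europeana:Taiwan_Formosa")]

# Assumed containment: local place in TaiUC -> broader place added to the vocabulary.
BROADER = {"taiuc:Keelung_River": "taiuc:Taiwan"}


def equivalents(place):
    """Return all identifiers reachable from `place` via sameAs (symmetric, transitive)."""
    seen = {place}
    frontier = [place]
    while frontier:
        current = frontier.pop()
        for a, b in SAME_AS:
            for src, dst in ((a, b), (b, a)):
                if src == current and dst not in seen:
                    seen.add(dst)
                    frontier.append(dst)
    return seen


def candidate_places(place):
    """The place, its broader places, and everything declared sameAs any of them."""
    result = set()
    while place is not None:
        result |= equivalents(place)
        place = BROADER.get(place)
    return result


def related(taiuc_record, europeana_items):
    """Ids of items with the same scientific name and a compatible location."""
    places = candidate_places(taiuc_record["place"])
    return [
        item["id"]
        for item in europeana_items
        if item["name"] == taiuc_record["name"] and item["place"] in places
    ]


# Made-up sample data for illustration only.
TAIUC = {"id": "T1", "name": "Carassius auratus", "place": "taiuc:Keelung_River"}
ITEMS = [
    {"id": "E1", "name": "Carassius auratus", "place": "europeana:Taiwan_Formosa"},
    {"id": "E2", "name": "Carassius auratus", "place": "europeana:Japan"},
    {"id": "E3", "name": "Another name", "place": "europeana:Taiwan_Formosa"},
]

print(related(TAIUC, ITEMS))
```

Matching on both name and location narrows the candidates, but the result is still a set of related items rather than a guaranteed identity, which is consistent with linking them through a relatedness property instead of owl:sameAs.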
Views on the future 2: Global cooperation for biological LOD datasets
Once the fish collection is published as a LOD dataset, the TaiUC datasets can link up with the Europeana LOD dataset. The linkage with Europeana may allow us to explore archives scattered all over the world and trace 200 years of records of native species. However, more detailed information is kept at international LAMs (libraries, archives, and museums) whose original data have not yet been converted into LOD format. We further plan to work with the LAM community to enrich TaiUC content and to enhance linkage and access among international biodiversity datasets.
Views on the future 3: Integration of related data
We will also adopt the EDM (Europeana Data Model) to develop a series of connections among the records kept at TaiUC. For example, when the information of a certain fish is displayed, all X-ray pictures, otolith images, and other available information concerning the same fish will also appear, with their relations indicated. Eventually, a piece of data will no longer be limited to one record; it will be presented in a meaningful manner with its overall relations and context.
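The intended behaviour can be pictured as grouping records that describe the same fish. The sketch below is an assumption-laden illustration and does not use EDM classes or properties: it presumes that related records share a specimen identifier, and the field names (specimen_id, kind) and the sample records are hypothetical.

```python
from collections import defaultdict

# Made-up sample records; in practice these would come from the converted metadata.
RECORDS = [
    {"id": "R1", "specimen_id": "S100", "kind": "specimen picture"},
    {"id": "R2", "specimen_id": "S100", "kind": "X-ray image"},
    {"id": "R3", "specimen_id": "S100", "kind": "otolith image"},
    {"id": "R4", "specimen_id": "S200", "kind": "specimen picture"},
]


def aggregate(records):
    """Group records by the specimen they describe, preserving input order."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec["specimen_id"]].append(rec)
    return dict(groups)


def related_views(record_id, records):
    """Return (id, kind) pairs for the other records about the same fish."""
    for members in aggregate(records).values():
        if any(m["id"] == record_id for m in members):
            return [(m["id"], m["kind"]) for m in members if m["id"] != record_id]
    return []


print(related_views("R1", RECORDS))
```

When a display page for R1 is built, the returned pairs give both the linked records and a label for the relation, so the X-ray and otolith images can be shown alongside the specimen picture.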
Views on the future 4: Digital exhibition
In the future, we intend to apply our results to digital exhibitions. The initial application started in 2016 on the digital museum exhibition system established by the ASCDC. We hope that TaiUC resources will be CC-licensed for public access. Furthermore, we plan to present the TaiUC LOD dataset by subject, with visualization tools, so that users can easily select subjects, download collections, and upload data through a digital museum exhibition interface, where the system will automatically display images based on the semantic metadata.
We believe that when these goals are ultimately achieved, we will be able to enhance information retrieval of digital heritage and further help scholars, researchers, the general public, and of course machines to consume data in innovative ways.
Team members: Sophy Shu-Jiun Chen, Wei Cheng, Lu-Yen Lu, Sheng Ting, Hsiang-An Wang
Country: Taiwan (R.O.C.)