Language missing in the OAI-DDI #7388

Closed
bappun opened this issue Nov 4, 2020 · 8 comments · Fixed by #7958

Comments


bappun commented Nov 4, 2020

Most of our datasets are described as "French" datasets in the metadata. For example: https://data.sciencespo.fr/dataset.xhtml?persistentId=doi:10.21410/7E4/00LYOG (detailed metadata are embedded in the dataset here).

We imported a DDI file where the French language is set at a file level also at a study level:

<?xml version="1.0" encoding="UTF-8"?>
<codeBook xml:lang="fr" xsi:schemaLocation="ddi:codebook:2_5 https://www.ddialliance.org/Specification/DDI-Codebook/2.5/XMLSchema/codebook.xsd" version="2.5" ID="fr.cdsp.ddi.OV70"
	xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
	xmlns="ddi:codebook:2_5">
<stdyDscr>
	<citation>
		<titlStmt>
			<titl xml:lang="fr">L'ouvrier français en 1970</titl>
			<parTitl xml:lang="en">The French Working Class in 1970</parTitl>
			<IDNo agency="CDSP">fr.cdsp.ddi.OV70</IDNo>
			<IDNo agency="DataCite">doi:10.21410/7E4/00LYOG</IDNo>
		</titlStmt>

But this information is lost at study level, either when harvesting in oai-ddi or when downloading metadata as DDI, for example:

<OAI-PMH
	xmlns="http://www.openarchives.org/OAI/2.0/"
	xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/ http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd">
	<responseDate>2020-11-04T13:29:10Z</responseDate>
	<request verb="ListRecords" metadataPrefix="oai_ddi" set="CDSP">https://data.sciencespo.fr/oai</request>
	<ListRecords>
		<record>
			<header>
				<identifier>doi:10.21410/7E4/00LYOG</identifier>
				<datestamp>2020-08-22T00:00:07Z</datestamp>
				<setSpec>CDSP</setSpec>
				<setSpec>ALL_SCPO</setSpec>
			</header>
			<metadata>
				<codeBook
					xmlns="ddi:codebook:2_5"
					xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="ddi:codebook:2_5 https://ddialliance.org/Specification/DDI-Codebook/2.5/XMLSchema/codebook.xsd" version="2.5">
					<docDscr>
						<citation>
							<titlStmt>
								<titl>L'ouvrier français en 1970</titl>
								<IDNo agency="DOI">doi:10.21410/7E4/00LYOG</IDNo>
							</titlStmt>
							<distStmt>
								<distrbtr source="archive">data.sciencespo</distrbtr>
								<distDate>2020-05-05</distDate>
							</distStmt>
							<verStmt source="DVN">
								<version date="2020-05-13" type="RELEASED">2</version>
							</verStmt>
							<biblCit>Adam, Gérard; Bon, Frédéric; Capdevielle, Jacques; Mouriaux, René; Lavau, Georges, 2020, "L'ouvrier français en 1970", https://doi.org/10.21410/7E4/00LYOG, data.sciencespo, V2</biblCit>
						</citation>
					</docDscr>

However, it is kept in oai-dc harvesting: <dcterms:language>French</dcterms:language>

Maybe these are two different issues: one about the import, which does not save the xml:lang attribute, and another about the dataset language, which is not added to the OAI-DDI.
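For anyone who wants to reproduce the comparison, here is a minimal Python sketch (standard library only, not Dataverse code) that lists every element carrying an xml:lang attribute. The two sample strings are trimmed copies of the excerpts above, closed so that they parse; the function name `collect_languages` is made up for this illustration.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Trimmed copy of the imported DDI (closing tags added so it parses).
IMPORTED = """<codeBook xml:lang="fr" version="2.5" xmlns="ddi:codebook:2_5">
  <stdyDscr><citation><titlStmt>
    <titl xml:lang="fr">L'ouvrier français en 1970</titl>
    <parTitl xml:lang="en">The French Working Class in 1970</parTitl>
  </titlStmt></citation></stdyDscr>
</codeBook>"""

# Trimmed copy of the codeBook returned by the oai_ddi harvest.
EXPORTED = """<codeBook xmlns="ddi:codebook:2_5" version="2.5">
  <docDscr><citation><titlStmt>
    <titl>L'ouvrier français en 1970</titl>
  </titlStmt></citation></docDscr>
</codeBook>"""


def collect_languages(xml_text):
    """Return {local element name: xml:lang value} for elements that declare a language.

    Keyed by local name for brevity, so repeated element names would overwrite
    each other; that is fine for this small sample.
    """
    found = {}
    for elem in ET.fromstring(xml_text).iter():
        lang = elem.get(XML_LANG)
        if lang is not None:
            local_name = elem.tag.split("}", 1)[-1]  # drop the "{namespace}" prefix
            found[local_name] = lang
    return found


print(collect_languages(IMPORTED))
print(collect_languages(EXPORTED))
```

The first call reports a language for codeBook, titl and parTitl; the second reports none, which is the loss described in this comment.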

Member

pdurbin commented Nov 4, 2020

It sounds like a bug with import, which is strange because as far as I know import uses the one and only DDI parsing code in the system. @bappun is this something you or someone on your team might be able to make a pull request for? I know @JingMa87 has also been working in related code recently.

Contributor

jggautier commented Nov 4, 2020

Hi @bappun. I think there are multiple issues to tease out here.

When you wrote that you "imported a DDI file where the French language is set at a file level also at a study level", could you write more about what you mean by "a file level"? Is that referring to metadata of data files within a dataset?

Author

bappun commented Nov 4, 2020

Hi all!

@pdurbin We are currently in contact with Danny about how we can contribute to some issues (that are sometimes linked). We should have more information soon! :)

@jggautier I'm sorry for the "file level", I meant the root of the xml file (<codeBook xml:lang="fr").

Contributor

jggautier commented Nov 23, 2020

Can't believe it's been 19 days! Sorry for this late reply.

Information loss
Dataverse takes imported DDI xml, maps the elements it can to fields in its metadata blocks, and re-creates the XML file when it exports that metadata. If Dataverse isn't able to map elements in the DDI xml to fields in its metadatablocks, it drops that information. Dataverse has no field for expressing the language of the metadata document (the metadata as a whole) or metadata entered in particular metadata fields, so that information is lost on import.

There are a few github issues about this, e.g. one about indicating the language of the metadata record as a whole (#4632) and one about indicating the language of metadata entered in specific fields (#4633). These issues follow the convention that when Dataverse imports metadata files, like the DDI xml you imported, it tries to map information in those files to fields in its own metadata model (the fields in its metadatablocks), and ignores information it can't map. Once that information is mapped to the Dataverse installation's fields, that metadata is also editable within the Dataverse repository.

What hasn't been discussed in any open GitHub issues I could find is the possibility of Dataverse somehow retaining the information that it can't map to fields in its metadatablocks, instead of ignoring it; including it in its index so that it's searchable; and adding it to the metadata exports. But if the metadata isn't mapped to Dataverse's fields, it won't show up in the UI and is effectively not editable by depositors/curators (at least not through the Dataverse UI).

  • I think this wouldn't be okay if you'd like to let people edit the metadata of datasets created by importing DDI xml files.
  • But if Dataverse installations could retain and index more of the metadata it harvests from other sources, which should not be editable from the installation doing the harvesting, this could improve the discovery of harvested metadata.
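To make the "retain instead of ignore" idea concrete, here is a rough Python sketch. It is not how Dataverse's importer works: `FIELD_MAP`, `import_ddi` and the field name `title` are hypothetical, and the sample is a trimmed copy of the XML from the first comment. Mapped values go into editable fields; everything else, including xml:lang attributes, is kept in a side list that could be indexed and re-exported.

```python
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical mapping from DDI element names to repository field names.
FIELD_MAP = {"titl": "title"}

SAMPLE = """<codeBook xml:lang="fr" version="2.5" xmlns="ddi:codebook:2_5">
  <stdyDscr><citation><titlStmt>
    <titl xml:lang="fr">L'ouvrier français en 1970</titl>
    <parTitl xml:lang="en">The French Working Class in 1970</parTitl>
  </titlStmt></citation></stdyDscr>
</codeBook>"""


def import_ddi(xml_text, field_map=FIELD_MAP):
    """Split a DDI document into mapped fields and retained, unmapped information."""
    mapped, unmapped = {}, []
    for elem in ET.fromstring(xml_text).iter():  # document order, root first
        name = elem.tag.split("}", 1)[-1]
        text = (elem.text or "").strip()
        if text and name in field_map:
            mapped.setdefault(field_map[name], []).append(text)
        elif text:
            unmapped.append((name, text))  # no field for this element: keep it anyway
        lang = elem.get(XML_LANG)
        if lang is not None:
            unmapped.append((name + "/@xml:lang", lang))  # no field for languages either
    return mapped, unmapped


mapped, unmapped = import_ddi(SAMPLE)
print(mapped)
print(unmapped)
```

In this sketch the retained list is read-only by construction, which matches the harvesting case in the second bullet better than the editable-import case in the first.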

Language of data versus language of metadata
What you see returned in the oai_dc harvesting is the language chosen for the "Language" field in the "Citation" metadatablock. For example, https://doi.org/10.21410/7E4/00LYOG is one of the datasets in that oai-pmh feed, and in that dataset's third version, depositors/curators used Dataverse's "Language" metadata field to add "French". Dataverse's "Language" metadata field is defined as the "language of the dataset." I've always taken this to mean the language of the data files themselves, as opposed to the language of the metadata document (e.g. <codeBook xml:lang="fr" as you mentioned).

And like you already wrote, Dataverse isn't mapping what's chosen for the "Language" metadata field, like the "French" value chosen in https://doi.org/10.21410/7E4/00LYOG, to any elements or attributes in the DDI xml that it exports. I'm not sure that DDI Codebook has one element to describe the language of all of a dataset's data files.

Maybe we would add a lang attribute to the metadata of each of a dataset's data files? For example:

  • For ingested files: <fileDscr ID="f4395" xml:lang="fr" URI="http://dataverse.icrisat.org/api/access/datafile/4395">
  • For non-ingested files: <otherMat ID="f4186257" xml:lang="fr" URI="https://dataverse.harvard.edu/api/access/datafile/4186257" level="datafile">
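A small Python sketch of that idea, assuming the exporter already has the language code in hand. The helper `tag_file_languages` is hypothetical, and the sample document simply combines the two example elements above into one codeBook so that there is something to run against.

```python
import xml.etree.ElementTree as ET

DDI_NS = "ddi:codebook:2_5"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# The two example elements above, placed in one document for illustration only.
EXPORT = """<codeBook xmlns="ddi:codebook:2_5" version="2.5">
  <fileDscr ID="f4395" URI="http://dataverse.icrisat.org/api/access/datafile/4395"/>
  <otherMat ID="f4186257" URI="https://dataverse.harvard.edu/api/access/datafile/4186257" level="datafile"/>
</codeBook>"""


def tag_file_languages(root, lang_code):
    """Set xml:lang on every fileDscr and otherMat element; return how many were tagged."""
    tagged = 0
    for name in ("fileDscr", "otherMat"):
        for elem in root.iter("{%s}%s" % (DDI_NS, name)):
            elem.set(XML_LANG, lang_code)
            tagged += 1
    return tagged


root = ET.fromstring(EXPORT)
tagged = tag_file_languages(root, "fr")
print(tagged)
print(ET.tostring(root, encoding="unicode"))
```

This only covers the case where all files share one language; per-file languages would need a per-file value to be stored somewhere first.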

Author

bappun commented Feb 15, 2021

At Sciences Po we only have 1 language per dataset. So we are wondering if it would be possible to add the xml:lang attribute to the <codeBook> element in the OAI using the metadata language field value.

For the datafiles I do not have more information as we only have 1 language per DDI.
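As a rough illustration of the request, here is a Python sketch that turns a language name such as "French" into a code and writes it onto the root element. The lookup table and the function `set_codebook_language` are assumptions made for this example; a real export would need a complete name-to-code table and is not shown here.

```python
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Illustrative subset only (assumption): language field values mapped to language codes.
LANGUAGE_CODES = {"French": "fr", "English": "en"}


def set_codebook_language(codebook, language_name):
    """Set xml:lang on the codeBook root if the language name is known.

    Returns True when the attribute was written, False when the name is unknown,
    in which case the element is left untouched.
    """
    code = LANGUAGE_CODES.get(language_name)
    if code is None:
        return False
    codebook.set(XML_LANG, code)
    return True


codebook = ET.fromstring('<codeBook xmlns="ddi:codebook:2_5" version="2.5"/>')
applied = set_codebook_language(codebook, "French")
print(applied, codebook.get(XML_LANG))
```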

Contributor

jggautier commented Feb 16, 2021

Thanks @bappun.

In the Sciences Po repository, when a depositor chooses French in the Language field of the dataset, what is the depositor saying?

  1. this data is in French
  2. this metadata is in French
  3. both the data and the metadata are in French

I took a look at a few datasets in the Sciences Po repository but could not tell what the depositors intended since "French" is chosen in the Language field, plus the metadata and the data files of the datasets both contain text in the French language.

@gmi-cdsp

Thanks @jggautier !
When we say that a dataset is in one language (French, for example), this means that both the data and the metadata are in French.
This sample is in English
This one in French

(there can still be some inconsistencies and no data file is present ... at the moment)

Contributor

jggautier commented Apr 29, 2021

Thanks. That's very helpful, and apologies again for the delay. I hope this illustration more clearly communicates my understanding of the XML you pasted in your first comment and clarifies the different needs:

[Image: illustration annotating the DDI XML from the first comment]

I interpret that XML (in red) as saying that:

  • The main language of the entire DDI Codebook metadata is French
  • The text entered into the dataset title field is in French
  • The text entered in the Parallel Title field (DDI Codebook's parTitl element) is the English translation of the dataset's title

This xml does not tell me the language of the data files.

If Sciences Po needs to specify only (1) the language of the entire metadata document, (2) the dataset title in a second language, and (3) the language of the data files, here are some proposals I hope will help get your repository to a great solution:

Proposal 1:

Metadata form changes

  • Add a field in the Citation metadatablock for entering the main language of the dataset metadata, what I'll call "Metadata language" field
  • Add a field in the Citation metadatablock for entering the dataset title in another language and specifying what language that is (what DDI Codebook calls "Parallel Title")
  • Change the field name and tooltip (in the UI) of the Citation metadatablock's current "Language" field to differentiate it from the new "Metadata language" field and to make it more obvious that the current "Language" field only specifies the language of the data files

Metadata mapping changes

  • Make sure that when the Dataverse software imports DDI Codebook xml files:
    • The language specified in <codeBook xml:lang="language code"... is entered into the new field for specifying the main language of the dataset metadata
    • The translated title and the translation language specified in <parTitl xml:lang="language code"> is entered into the new Parallel Title field (which might be labelled something else, like "Translated Title")
  • Make sure that when exporting DDI Codebook xml files:
    • What's entered in the Citation metadatablock's current "Language" field is included in the xml exports
    • The values for the two new fields (that is, the "Metadata language" and "Parallel Title" fields) are included in the xml exports
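The import and export mapping in Proposal 1 can be sketched as a round trip in Python. Everything here is hypothetical: the field names `metadataLanguage` and `parallelTitle`, the helper functions, and the dictionary standing in for the Citation metadatablock. The sample is a trimmed copy of the XML from the first comment.

```python
import xml.etree.ElementTree as ET

DDI_NS = "ddi:codebook:2_5"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

SAMPLE = """<codeBook xml:lang="fr" version="2.5" xmlns="ddi:codebook:2_5">
  <stdyDscr><citation><titlStmt>
    <titl xml:lang="fr">L'ouvrier français en 1970</titl>
    <parTitl xml:lang="en">The French Working Class in 1970</parTitl>
  </titlStmt></citation></stdyDscr>
</codeBook>"""


def q(name):
    """Qualify a DDI element name with the codebook namespace."""
    return "{%s}%s" % (DDI_NS, name)


def import_fields(xml_text):
    """Map codeBook xml:lang, titl and parTitl to hypothetical repository fields."""
    root = ET.fromstring(xml_text)
    fields = {
        "metadataLanguage": root.get(XML_LANG),
        "title": root.findtext(".//" + q("titl")),
        "parallelTitle": None,
    }
    par = root.find(".//" + q("parTitl"))
    if par is not None:
        fields["parallelTitle"] = {"value": par.text, "language": par.get(XML_LANG)}
    return fields


def export_fields(fields):
    """Write the same fields back out as a minimal DDI Codebook document."""
    root = ET.Element(q("codeBook"), {"version": "2.5"})
    if fields.get("metadataLanguage"):
        root.set(XML_LANG, fields["metadataLanguage"])
    citation = ET.SubElement(ET.SubElement(root, q("stdyDscr")), q("citation"))
    stmt = ET.SubElement(citation, q("titlStmt"))
    ET.SubElement(stmt, q("titl")).text = fields["title"]
    par = fields.get("parallelTitle")
    if par:
        elem = ET.SubElement(stmt, q("parTitl"))
        elem.text = par["value"]
        if par.get("language"):
            elem.set(XML_LANG, par["language"])
    return ET.tostring(root, encoding="unicode")


fields = import_fields(SAMPLE)
round_tripped = import_fields(export_fields(fields))
print(fields == round_tripped)
```

The sketch leaves out the current "Language" field on purpose, since which DDI element or attribute should carry the language of the data files is still an open question in this thread.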

Proposal 2:

Metadata form changes

  • Add a field in the Citation metadatablock for entering the main language of both the dataset metadata and the data files of that dataset. I'll call it the "Metadata and files language" field for now.
    • I would argue that this new field would make the Citation metadatablock's current "Language" field unnecessary, and at the very least, repositories that plan to use this new "Metadata and files language" field should be encouraged to hide/remove the Citation metadatablock's current "Language" field.
  • Add a field in the Citation metadatablock for entering the dataset title in another language and specifying what language that is (what DDI Codebook calls "Parallel Title")

Metadata mapping changes

  • When the Dataverse software imports DDI Codebook xml files:
    • Determine how to map certain values in the imported xml files to that new "Metadata and files language" field.
  • Make sure that when exporting DDI Codebook xml files:
    • What's entered into this new "Metadata and files language" field is expressed in the several elements of the DDI Codebook xml export
    • The value for the "Parallel Title" field is included in the DDI Codebook xml exports

Proposal 3:

Consider the Sciences Po repository's Citation metadatablock as a fork of the Dataverse software's Citation metadatablock, so that Sciences Po can redefine (or broaden the definition of) the current Citation metadatablock "Language" field. In the Sciences Po repository, that "Language" field would specify the language of both the metadata and the data files.

The repository would also need to fork the code it uses for importing and exporting DDI Codebook XML in ways that maintain the meaning of the metadata it imports from other repositories (if it does or plans to do that) and maintain the meaning of its metadata when it's imported by other repositories.

I'm only including proposal 3 to be divergent in my thinking and encourage more divergent thinking about solutions to this issue. I hope Sciences Po doesn't need to fork their code. I think that other proposals could include addressing the more robust and granular needs described in #4633.

I also think that this issue is another example of how more flexible metadata handling in the Dataverse software can benefit Dataverse repositories. Similar to how the customization of controlled vocabularies is being made more flexible (and the customization of metadata fields can be made more flexible), so can metadata mapping when importing and exporting metadata. But I'm assuming that making metadata mapping more flexible will take more time than you and your colleagues have for this task of reducing information loss when importing and exporting your repository's dataset metadata.

I can include illustrations (like mockups) for the first two proposals if that would help make sure that we all understand them. Looking forward to hearing what you think!
