18 December 2023

Why we should publish our data as Creative Commons Zero (CC0)

By data we mean specimen, observation or checklist datasets published as Darwin Core Archives and any derivates. To keep the discussion focused, this does not include pictures or software code.

What we hope to achieve

One license for the whole Canadensys community (easier for aggregation and it sends a strong message as a community)
An existing license (we don’t want to write our own legal documents)
An open license (allowing our data to be really used)
A clear license (so users can focus on doing great research with the data, instead of figuring out the fine print)
Giving credit where credit is due

Our recommendation

cc-zero We recommend Canadensys participants to publish their data under the Creative Commons Zero (CC0) license. With CC0 you waive any copyright you might have over the data(set) and dedicate it to the public domain. Users can copy, use, modify and distribute the data without asking your permission. You cannot be held liable for any (mis)use of the data either.

CC0 is recommended for data and databases and is used by hundreds of organizations. It is especially recommended for scientific data, and thus encouraged by Pensoft (see their guidelines for biodiversity data papers) and Nature (see this opinion piece). Although CC0 doesn’t legally require users of the data to cite the source, it does not take away the moral obligation to give attribution as is common in scientific research (more about that below).

Why would I waive my copyright?

For starters, there’s very little copyright to be had in our data, datasets and databases. Copyright only applies to creative content and 99% of our data are facts, which cannot be copyrighted. We do hold copyright over some text in remarks fields, the data format or database model we chose/created, and pictures. If we consider a Darwin Core Archive (which is how we are publishing our data) the creative content is even further reduced: the data format is a standard and we only provide a link to the pictures, not the picture itself.

Figuring out where the facts stop and where the (copyrightable) creative content begins can already be difficult for the content owner, so imagine what a legal nightmare it can become for the user. On top of that different rules are used in different countries. Publishing our data under CC0 removes any ambiguity and red tape. We waive any copyright we might have had over the creative content and our data gets the legal status of public domain. It can no longer be copyrighted by anyone.

Can’t we use another license?

Let’s go over the options. Keep in mind that these licenses only apply to the creative aspect of the dataset, not the facts. But as pointed out above, figuring this out can be difficult or impossible for the user. So much in fact, that the user might decide not to use your data at all, especially if they think they might not meet the conditions of the license.

All rights reserved

Conclusion: Not good.

Open Data Commons Public Domain Dedication and License (PDDL)

There are no restrictions on how to use the data. This license is very similar to CC0

Conclusion: Perfect, in fact this license was a precursor of CC0, but… it is less well known and maybe not as legally thorough as CC0. CC0 made a huge effort to cover legislation in almost all countries and the Creative Commons community is working hard to improve this even further. Therefore, if you have to choose, CC0 is probably the best bet.

Creative Commons Attribution-NoDerivs (CC BY-ND)

The user cannot build upon the data(set), which is what most data use involves.

Conclusion: Not good.

Creative Commons Attribution-NonCommercial (CC BY-NC)

The user cannot use the data(set) for commercial purposes. This seems fine from an academic viewpoint, but the license is a lot more restrictive than intuitively thought. See: Hagedorn, G. et al. ZooKeys 150 (2011). Creative Commons licenses and the non-commercial condition: Implications for the re-use of biodiversity information.

Conclusion: Not good.

Creative Commons Attribution-ShareAlike (CC BY-SA) or Open Data Commons Open Database License (ODbL)

The user has to share any work based upon the data(set) under the same or similar license to this one.

Conclusion: Good, but… this can lead to some problems for an aggregator like Canadensys or GBIF: if they are mixing and merging data with different “SA” licenses, which one do they choose? They might be incompatible.

Creative Commons Attribution (CC BY) or Open Data Commons Attribution License (ODC-By)

The user has to attribute the data(set) in the manner specified by the owner. This condition is also present in the three licenses above.

Conclusion: Good, but… this can lead to impractical “attribution stacking”. If an aggregator or a user of that aggregator is using and integrating different datasets provided under a BY license, they legally have to cite the owner for each and every one of them in the manner specified by those owners (again, for the potential creative content in the data).

But giving credit/attribution is a good thing!

Absolutely, but legally enforcing it can lead to the opposite affect: a user might decide not to use the data out of fear of not completely complying with the license (see paragraph above). As hinted at the beginning of this post, CC0 removes the drastic legally enforceable requirement to give attribution, but it does not remove the moral obligation to give attribution. In fact, this has been the common practice in scientific research for many decades: legally, you don’t have to cite someone’s research/data you’re using, but not doing so won’t make you very popular either.

To encourage users to give credit where credit is due, we propose to create Canadensys norms. Norms are not a legal document, but a “code of conduct” where we declare how we would like users to use, share and cite our data, and how they can participate. We can explain how one could cite an individual specimen, a collection, a dataset or an aggregated “Canadensys” download. We can point out that our data is constantly being corrected or added to, so it is useful to keep coming back to the original repository and not to a secondary repository that may not have been updated. In addition to that, we can build tools to monitor downloads or automatically compile an adequate citation. And with data papers - which drafts can be automatically generated from IPT - dataset are really brought into the realm of traditional publishing and the associated scientific recognition.

All to say that are mechanisms where both users and data owners benefit, without the legal burden. It guarantees that our data can be used now and in the future. We hope you’ll join us in our recommendation.

A huge thanks to the Gregor Hagedorn and the authors of the Pensoft guidelines for data papers.