Why we should publish our data under Creative Commons Zero (CC0)

With the first datasets published and more coming soon, the question arises of which license we – the Canadensys community and the individual collections – will publish our data under. Dealing with the legal stuff can be tedious, which is why we have looked into this issue with the Canadensys Steering Committee & Science and Technology Advisory Board before opening the discussion to the whole community.

By data we mean specimen, observation or checklist datasets published as a Darwin Core Archive and any derivatives. To keep the discussion focused, this does not include pictures or software code.

2012.01.30 – Update to post: technically CC0 is not a license, but a waiver (see comment below).

What we hope to achieve

  1. One license for the whole Canadensys community, which is easier for aggregation and sends a strong message as one community.
  2. An existing license, because we don’t want to write our own legal documents.
  3. An open license, allowing our data to be really used.
  4. A clear license, so users can focus on doing great research with the data, instead of figuring out the fine print.
  5. Giving credit where credit is due.

Our recommendation

We recommend that Canadensys participants publish their data under Creative Commons Zero (CC0). With CC0 you waive any copyright you might have over the data(set) and dedicate it to the public domain. Users can copy, use, modify and distribute the data without asking your permission. You cannot be held liable for any (mis)use of the data either.

CC0 is recommended for data and databases and is used by hundreds of organizations. It is especially recommended for scientific data and thus encouraged by Pensoft (see their guidelines for biodiversity data papers) and Nature (see this opinion piece). Although CC0 doesn’t legally require users of the data to cite the source, it does not take away the moral responsibility to give attribution, as is common in scientific research (more about that below).

Why would I waive my copyright?

For starters, there’s very little copyright to be had in our data, datasets and databases. Copyright only applies to creative content and 99% of our data are facts, which cannot be copyrighted. We do hold copyright over some text in remarks fields, the data format or database model we chose/created, and pictures. If we consider a Darwin Core Archive (which is how we are publishing our data) the creative content is even further reduced: the data format is a standard and we only provide a link to pictures, not the pictures themselves.

Figuring out where the facts stop and where the (copyrightable) creative content begins can already be difficult for the content owner, so imagine what a legal nightmare it can become for the user. On top of that, different rules apply in different countries. Publishing our data under CC0 removes any ambiguity and red tape. We waive any copyright we might have had over the creative content and our data gets the legal status of public domain. It can no longer be copyrighted by anyone.

Can’t we use another license?

Let’s go over the options. Keep in mind that these licenses only apply to the creative aspect of the dataset, not the facts. But as pointed out above, figuring this out can be difficult or impossible for the user. So much so in fact, that the user may decide not to use the data at all, especially if they think they might not meet the conditions of the license.

All rights reserved

The user cannot use the data(set) without the permission of the owner.

Conclusion: Not good.

Open Data Commons Public Domain Dedication and License (PDDL)

There are no restrictions on how to use the data. This license is very similar to CC0.

Conclusion: Perfect, in fact this license was a precursor of CC0, but… it is less well known and maybe not as legally thorough as CC0. CC0 made a huge effort to cover legislation in almost all countries and the Creative Commons community is working hard to improve this even further. Therefore, if you have to choose, CC0 is probably better.

Creative Commons Attribution-NoDerivs (CC BY-ND)

The user cannot build upon the data(set), which is what most data use involves.

Conclusion: Not good, and sadly used by theplantlist.org. Roderic Page pointed this out by showing what cool things he can NOT do with the data.

Creative Commons Attribution-NonCommercial (CC BY-NC)

The user cannot use the data(set) for commercial purposes. This seems fine from an academic viewpoint, but the license is a lot more restrictive than one would intuitively think. See: Hagedorn, G. et al. (2011). Creative Commons licenses and the non-commercial condition: Implications for the re-use of biodiversity information. ZooKeys 150.

Conclusion: Not good.

Creative Commons Attribution-ShareAlike (CC BY-SA) or Open Data Commons Open Database License (ODbL)

The user has to share any work based upon the data(set) under a license that is identical or similar to the one used.

Conclusion: Good, but… this can lead to some problems for an aggregator like Canadensys or GBIF: if they are mixing and merging data with different SA licenses, which one do they choose? They might be incompatible.

Creative Commons Attribution (CC BY) or Open Data Commons Attribution License (ODC-By)

The user has to attribute the data(set) in the manner specified by the owner. This condition is also present in the three licenses above.

Conclusion: Good, but… this can lead to impractical “attribution stacking”. If an aggregator or a user of that aggregator is using and integrating different datasets provided under a BY license, they legally have to cite the owner for each and every one of those in the manner specified by these owners (again, for the potential creative content in the data). See point 5.3 at the bottom of this Creative Commons page for a better explanation and this blog post for an example.
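
To make the stacking concrete, here is a minimal Python sketch. The record structure, dataset names and attribution wordings are all hypothetical and are not taken from any real Canadensys or GBIF download. The only point is that every distinct BY-licensed source in an aggregated download adds one more attribution statement that the user has to reproduce exactly as its owner specified.

```python
# Hypothetical aggregated download: each record remembers its source dataset.
records = [
    {"id": "rec-1", "dataset": "herbarium-a"},
    {"id": "rec-2", "dataset": "insects-b"},
    {"id": "rec-3", "dataset": "herbarium-a"},
    {"id": "rec-4", "dataset": "fungi-c"},
]

# Hypothetical attribution statements, each worded differently by its owner.
attributions = {
    "herbarium-a": "Data provided by Herbarium A (cite as: Herbarium A, 2012).",
    "insects-b": "Source: Insect Collection B. Please link to the collection homepage.",
    "fungi-c": "Fungi C dataset, curated by the Fungi C team.",
}


def required_attributions(records, attributions):
    """Return one attribution statement per distinct source dataset,
    in order of first appearance among the records."""
    needed = {}
    for record in records:
        dataset = record["dataset"]
        # Keep the first statement seen for each dataset; skip repeats.
        needed.setdefault(dataset, attributions[dataset])
    return list(needed.values())


for statement in required_attributions(records, attributions):
    print(statement)
```

With four records from three datasets, the user already carries three differently worded obligations. A real aggregated download can draw on many more sources, and any derived product inherits all of them.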

But giving credit is a good thing!

Absolutely, but legally enforcing it can lead to the opposite effect: a user may decide not to use the data out of fear of not completely complying with the license (see paragraph above). As hinted at the beginning of this post, CC0 removes the drastic legally enforceable requirement to give attribution, but it does not remove the moral obligation to give attribution. In fact, this has been the common practice in scientific research for many decades: legally, you don’t have to cite the research/data you’re using, but not doing so could be considered plagiarism, which would compromise your reputation and the credibility of your work.

To encourage users to give credit where credit is due, we propose to create Canadensys norms. Norms are not a legal document (see an example here), but a “code of conduct” where we declare how we would like users to use, share and cite our data, and how they can participate. We can explain how one could cite an individual specimen, a collection, a dataset or an aggregated “Canadensys” download. We can point out that our data are constantly being corrected or added to, so it is useful to keep coming back to the original repository and not to a secondary repository that may not have been updated. In addition to that, we can build tools to monitor downloads or automatically create an adequate citation. And with the arrival of data papers – drafts of which can now be automatically generated from IPT – data(sets) are really brought into the realm of traditional publishing and the associated scientific recognition.
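
As an illustration of what such a citation tool could look like, here is a minimal Python sketch. The function name, the metadata fields and the citation layout are assumptions made for this example; they are not an agreed Canadensys norm or an existing IPT feature.

```python
from datetime import date


def format_citation(publisher, title, version, url, accessed):
    """Build a human-readable citation string for a dataset download.

    The layout is illustrative only; actual norms would define the format.
    """
    return (
        f"{publisher}. {title}. Version {version}. "
        f"{url} (accessed on {accessed.isoformat()})."
    )


# Hypothetical dataset metadata, for illustration only.
citation = format_citation(
    publisher="Example Herbarium",
    title="Example vascular plant specimens",
    version="1.0",
    url="https://example.org/dataset/plants",
    accessed=date(2012, 1, 30),
)
print(citation)
```

Including the version and the access date reflects the point above: the data keep changing, so a citation is most useful when it says which state of the original repository was used.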


All this to say that there are mechanisms where both users and data owners can benefit, without the legal burden. CC0 + norms guarantees that our data can be used now and in the future. I for one will update the license for our Université de Montréal Biodiversity Centre datasets. We hope you will join us!

Thanks to Gregor Hagedorn for his valuable advice on all the intricacies of data licensing.

  • Felix Sperling

    I think this is an excellent summary, and the section on “But giving credit is a good thing” is especially good. I’m ‘on board’ with CC0.
      –  Felix Sperling, 

  • timothy vollmer

    This is a fantastic post! One nit is that CC0 is technically not a license, but a waiver. Thanks. 

    Creative Commons

    • http://www.linkedin.com/in/peterdesmet Peter Desmet

      Thanks a lot! I was actually wondering today why CC0 is not listed under licenses on http://www.creativecommons.org. This explains why.
      Guess I’ll have to update my post and title. :-)

      • http://www.linkedin.com/in/peterdesmet Peter Desmet

        And now updated. This kind of feedback is really useful.

  • MArk Markcost

    One of the best summaries of the pros and cons of licences I have read.

  • Gustavo Olivares

    I see the argument for CC-0 but I do prefer CC-BY, particularly for “datasets”. It is more than “attachment” to the data: if you’re doing a meta-analysis and don’t cite it (and presumably don’t give ways of getting it), then how are the reviewers (or the readers) supposed to check that?

    In my view, attribution is more about traceability, as it allows the reader of a derivative work to go back to the source and check that the data say what the paper claims they say. Who hasn’t encountered a statement in an article that points to a reference, only to find that it was not in fact the original reference and that actually the original author meant something different?

    I agree that the “BY” part puts some costs on the user but they are not “unreasonable” costs, particularly when weighed against the traceability of the data.

    • David Shorthouse

      Thanks for the comment, Gustavo. I agree that the “BY” at first seems like an attractive option for the data producer. But does it necessarily allow for traceability/verification for the consumer of a data product in a manner similar to cited literature? And, what if the consumer is building a dynamic, web-based service that gleans portions of a dataset into a new, value-added product?

      Although DataCite, http://www.datacite.org/ is making great strides toward standard ways to cite data, I haven’t yet seen widespread, cross-domain uptake. The social/administrative/technical infrastructure for data has not yet matured as it has with the publications industry. All this to say that leaving it up to the producer to specify how they require their data be cited (the “BY” in a very real legal sense) doesn’t necessarily confer traceability.

      If the data product is a static reconstitution of other static data outputs, I suppose the legal requirement for the consumer to cite his/her sources is not very difficult to manage. But, data need not be static of course. Much of it is born digital these days and remains digital throughout its lifecycle. And, quite often a consumer needs to update fields in a dataset with fields from another dataset (e.g. georeferencing locality information, disambiguating scientific names, etc.). The resultant stacking of citations could very well force a consumer into the difficult position of abandoning a perfectly good dataset; the burden of managing record-level citations to fulfill the legal requirement is very challenging.

      • Gustavo Olivares

        Thanks for responding! (I only saw the date of the post after I submitted the comment … I just saw a link to this from a page on G+)

        You’re right that traceability is not automatic just because I set citation requirements, and you’re right to point out the work of DataCite, but the fact that a standard “data citation” is not here yet doesn’t mean that we should not cite data. To me, what it highlights is that “data” is not well covered by any of the existing copyright/licensing standards, and that’s why there are discussions around CC-0 or CC-BY (and others).

        Maybe I am biased by my background in earth and engineering sciences, because I always deal with static(ish) data that get merged/analysed/updated within the scientific literature. Citation/attribution of the work is therefore paramount: if my conclusions depend on others’ data, I need to point to those data, otherwise my conclusions can’t be tested/challenged. That is as much traceability as it is reproducibility.

        In any case, what I don’t agree with is the “nuisance” argument that citation stacking is difficult. I grant that it may not be simple to manage complex data aggregations, but what was unmanageable 50 years ago I now carry around in my pocket, fully indexed! So my recommendation is not to remove the citation requirement but to work towards making those attributions easier to work with, by supporting initiatives like DataCite, ORCID and Creative Commons to find the best framework for data that promotes knowledge development and sharing without risking the integrity of that knowledge.

  • http://twitter.com/charlesroper charlesroper

    I’d be interested to see a revision of this article in light of CC 4.0 and, in particular, sui generis database rights which apply throughout the EU. More info: http://wiki.creativecommons.org/Data#How_does_the_treatment_of_sui_generis_database_rights_vary_in_prior_versions_of_CC_licenses.3F

    • http://peterdesmet.com/ Peter Desmet

      Hi Charles,

      It is true that CC 4.0 now also covers sui generis database rights, in addition to copyright (which is a good thing). So, CC 4.0 “license terms and conditions now apply to the database structure (its selection and arrangement, to the extent copyrightable), its contents (if copyrightable), and in those instances where the database maker has sui generis database rights then the rights that are granted those makers.” (from: http://wiki.creativecommons.org/Data#When_a_CC_license_is_applied_to_a_database.2C_what_is_being_licensed.3F).

      The above article already mentions that biodiversity data are facts (and thus not copyrightable), but sui generis database rights don’t apply either, for one because the structure of the data (a Darwin Core archive) is standardized. A good explanation regarding CC 4.0 and their (non) effect on biodiversity data can be found in the last question of this interview: http://www.biomedcentral.com/biome/donat-agosti-on-big-data-copyright-and-attribution-in-taxonomy/

      VertNet has recently also recommended CC0 for data, with some substantial background information: http://www.vertnet.org/resources/datalicensingguide.html

      I hope this clarifies some things.

      • http://twitter.com/charlesroper charlesroper

        Many thanks Peter. I came across this article while considering the GBIF consultation. I’m not sure if sui generis applies to Canadensys participants or not given you are based outside the EU, but certainly within the EU sui generis provides database rights where there has been “qualitatively and/or quantitatively a substantial investment in either the obtaining, verification or presentation of the contents” regardless of the originality of the content or structure.

        The VertNet article only seems to consider copyright and not database rights – it would be good to see sui generis considered there too.

        I’m not sure how the Donat Agosti interview explains how CC 4.0 is ineffective in regards to biodiversity data. The question there discusses the problem of attribution specifically rather than the applicability of sui generis more generally.

        I found this article and discussion by way of the GBIF consultation and my context is within a Local Record Centre within the UK’s NBN. We’d like to be able to share data we have made a substantial investment in obtaining, verifying and presenting but if CC0 is mandatory, that excludes our participation and much of the rest of the NBN here in Britain.

        • http://peterdesmet.com/ Peter Desmet

          Hi Charles, I have since moved back to Belgium, where I also have to deal with database rights. :-)

          I’m currently in conversation with people from Plazi to understand what these cover exactly. So far I understand that these only apply to databases (i.e. a substantial investment in the presentation) and thus not data or standardized datasets, and only protect private investments.

          Regarding your local record centre and NBN, what do you want to be protected against? As there is practically no legal protection (which is just communicated and clarified to the user via CC0), what technological/social solutions would help? I think this is the discussion everyone should be having, as these are the only tools we have.

          • http://twitter.com/charlesroper charlesroper

            There’s a good evaluation report from the EU that gives a clear background on the history and rationale for sui generis database rights along with pros, cons and case studies.


            Worth reading. Section 2.2 in particular. According to the word of Directive 96/9/EC, sui generis provides rights to “prevent extraction and/or re-utilization of the whole or of a substantial part evaluated qualitatively and/or quantitatively, of the contents of that database.” (emphasis mine)

            In the UK we have a network of local environmental record centres (LRCs) which work with the local volunteer community (among others) to obtain, verify, disseminate and present biodiversity data (species and biotope records, mainly). LRCs are generally small, non-profit organisations partially funded by partner organisations who need the data in one place in a usable format, and partially through income from “commercial” data requests; i.e., requests for data that come from for-profit entities, which is primarily the highly commercial development and ecological consultancy sectors. Data is generally, but not always, supplied free for non-profit or scientific research purposes. In effect, it’s similar to CC BY-NC.

            CC0 is incompatible with this scenario. Loss of commercial income would mean many, if not most, LRCs would become unsustainable and their collapse would mean a significant reduction of records available to the NBN (which is the UK node of GBIF). Not good. What’s more, the data is mostly contributed by an army of volunteers – Britain’s prolific amateur naturalists – who are driven by conservation. A common proviso – a social norm, if you like – attached to the contribution of data is that it is not “misused” or commercially exploited and we, as the compiler and manager of their data, have a moral obligation to respect those wishes.

            So CC0 is a good solution where it is possible to use. But where it is not, the other standardised, machine-readable CC licenses seem like a good compromise. I do believe in encouraging contributors to contribute under a CC0 waiver, but what if they do not wish to, as is usually the case? What if they wish to contribute while retaining their rights? It seems as if the problems originate from the use of non-standard, non-machine-readable licenses. Aren’t the machine-readable licenses of CC – and in particular CC 4.0 – supposed to help tackle these issues?

          • http://peterdesmet.com/ Peter Desmet

            Hi Charles,

            Do you mind if I copy this conversation as an issue to a paper we plan to write regarding CC0 for occurrence data and solicit feedback from some colleagues who know more about this than I do? They should be able to answer your question (which I think is shared with some other GBIF publishers as well) in more depth.

          • http://twitter.com/charlesroper charlesroper

            Yes, please do, many thanks! :-)
