Opening a Darwin Core Archive with Excel

One of the editors of VASCAN asked me for an Excel file with all the vernacular names in our database, so she could check them for typos. As the database administrator, I could have exported the names from our MySQL database, but I chose to approach the request from a user’s perspective instead. All VASCAN data are available as a Darwin Core Archive, so the question became: How do I open a Darwin Core Archive with Excel? This post explains one approach.

Downloading and opening the Darwin Core Archive

You can download the entire VASCAN dataset as a Darwin Core Archive from the VASCAN website or our IPT portal. Both link to the same file. You can also generate your own custom subset with the checklist builder.

I download the file and unzip it.

Understanding the archive

open-dwca-with-excel-zip

To understand the structure of the Darwin Core Archive, I need to take a look at the meta.xml file, shown here without clutter:

<?xml version="1.0" encoding="utf-8"?>
<archive ... >
 <core ... rowType="http://rs.tdwg.org/dwc/terms/Taxon">
  <files><location>taxon.txt</location></files>
  ...
 </core>
 <extension ... rowType="http://rs.gbif.org/terms/1.0/Distribution">
  <files><location>distribution.txt</location></files>
  ...
 </extension>
 <extension ... rowType="http://rs.gbif.org/terms/1.0/VernacularName">
  <files><location>vernacularname.txt</location></files>
  ...
 </extension>
 <extension ... rowType="http://rs.gbif.org/terms/1.0/Description">
  <files><location>description.txt</location></files>
  ...
 </extension>
</archive>

I see two interesting files:

A Darwin Core Extension allows to store one-to-many relationships. One record in the core can link to many records in the extension. In this example, one taxon can link to many vernacular names (e.g. in different languages, preferred vs. alternative names, etc.). Just like cores, Darwin Core extensions have to be registered, so others can understand and use them. The vernacular name extension was created by GBIF and is registered here.

open-dwca-with-excel-star-schema

I decide to use both files, as I’m also interested in the taxa the vernacular names apply to.

Excel and UTF-8 files

Like all Darwin Core text files, taxon.txt and vernacularname.txt are encoded as UTF-8, which allows the storage of special characters and accents. In this case, this is useful for French vernacular names like “Fougères”.

Sadly though, Excel doesn’t open UTF-8 files correctly by default. I’m using Microsoft Excel 2008 for Mac, and all the following techniques do not work:

open-dwca-with-excel-wizard

In the Text Import Wizard I can only choose Macintosh, Windows (ANSI) and DOS or OS/2 (PC-8) as the file origin, non of which will shows accents correctly, so my data in Excel looks like this:

open-dwca-with-excel-utf8-error

In this example my data are stored as tab-separated values (tsv), but the character encoding problem is the same for comma-separated values (csv).

If you know for a fact that the dataset you’re working with doesn’t contain accents or special characters, you can of course use the Text Import Wizard. It has some advantages over the workaround method.

Workaround

The workaround I always use, is to open my UTF-8 file in a decent text editor, like Smultron for Mac (do not use TextEdit) or Notepad++ for PC (both free) and then copy and paste the data in Excel.

open-dwca-with-excel-smultron

  1. Open the text file in a text editor
  2. Make sure the editor displays the text correctly
  3. Select all the data (cmd/ctrl+A)
  4. Copy all the data (cmd/ctrl+C)
  5. Open an empty spreadsheet in Excel
  6. Choose: Format > Cells… > Set category to “Text”
  7. Paste everything (cmd/ctrl+V)
  8. Choose: Format > Cells… > Set category to “General”
  9. Save the file as .xlsx.

In step 6 you force Excel to interpret all the data as text, which avoids annoying date transformations like “2011-03-20″ to “03/20/2011″. Numbers will be interpreted as text too though, so sorting might not work as expected. You can always transform numbers back by using =VALUE(). In step 8 we revert back to “general”, so long cell values will not be shown as “######”. If you do not have dates in your dataset, you could skip step 6 and 8.

Step 9: If you work with an older version of Excel (97-2004) and/or you save your file as .xls, your file will be limited to 65.536 rows.

This workaround method only works for tab-separated values, not for comma-separated values. It is also important to know that copying large chunks of data might slow down or crash your computer. Make sure to save often.

open-dwca-with-excel-result

Final thoughts

A lot of people (myself included) use Excel daily: it’s a powerful and user friendly program to manipulate data, but it’s lousy at importing UTF-8 encoded data. I hope this post can help you avoid some of the common pitfalls. If you know an alternative method, please let us know in the comments.

My two text files are now correctly imported in Excel as two sheets/tabs: one for vernacular names and one for taxa. I will explain how you can link both together in another blog post.