Thursday, May 26, 2016

Notes on current and future projects

I'll be taking a break shortly, so I thought I'd try to gather some thoughts on a few projects/ideas that I'm working on. These are essentially extended notes to myself to jog my memory when I return to these topics.

BOLD data into GBIF

Following on from work on getting mosquito data into GBIF I've been looking at DNA barcoding data. BOLD data is mostly absent from GBIF. The publicly available data can be downloaded, and is in a form that could be easily ingested by GBIF. One problem is that the data is incomplete, and sometimes out of date. BOLD's data dumps and BOLD's API use different formats (sigh), and the API returns additional data, such as images of voucher specimens. Most records in the data dumps are not identified to species, so they will have limited utility for most GBIF users.

One approach would be to take the data dumps as the basic data, then use the API to enhance that data, such as adding image links. If the API returns a species-level identification for a barcode then that could be added as an identification using the corresponding Darwin Core extension. In this way we could treat the data as an evolving entity, which it is, as our knowledge of it improves. For a related example see Leafcutter bee new to science with specimen data on Canadensys, where Canadensys records two different identifications of some bee specimens, as research showed that some specimens represented a new species.
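That merge step could be sketched as follows. The field names here (`species_name`, `image_urls`, `identifications`) are illustrative assumptions, not BOLD's actual dump schema or the Darwin Core extension's real terms:

```python
def enrich_record(dump_record, api_record):
    """Merge an API record into a data-dump record, treating the dump
    as the base data and the API as a source of enhancements.
    Field names are illustrative, not BOLD's actual schema."""
    record = dict(dump_record)  # don't mutate the original

    # Add image links if the API has them and the dump does not
    if api_record.get("image_urls") and not record.get("image_urls"):
        record["image_urls"] = list(api_record["image_urls"])

    # If the API now has a species-level identification, record it as a
    # new identification (in the spirit of the Darwin Core Identification
    # extension) rather than silently overwriting the original data.
    api_name = api_record.get("species_name")
    if api_name and api_name != record.get("species_name"):
        record.setdefault("identifications", []).append(
            {"scientificName": api_name, "source": "BOLD API"}
        )
    return record

dump = {"processid": "ABC123", "species_name": ""}
api = {"processid": "ABC123", "species_name": "Aedes aegypti",
       "image_urls": ["http://example.org/img1.jpg"]}
merged = enrich_record(dump, api)
```

Keeping the identification as an appended record, rather than overwriting the dump's value, is what lets the dataset behave as an "evolving entity".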

This work reflects my concern that GBIF is missing a lot of data outside its normal sources. The mechanism for getting data into GBIF is pretty bureaucratic and could do with reforming (or, at least, the provision of other ways to add data).

BOLD by itself

I've touched on this before (Notes on next steps for the million DNA barcodes map); I'd really like to do something better with the way we display and interact with DNA barcode data. This will need some thought on calculating and visualising massive phylogenies, and spatial queries that return subtrees. I can't help thinking that there's scope for some very cool things in this area. If nothing else, we can do interesting things without getting involved in some of the pain of taxonomic names.
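One part of "spatial queries that return subtrees" can be sketched quite simply: given a tree (represented here, purely for illustration, as a child-to-parent map) and the set of barcode leaves returned by a spatial query, extract the minimal subtree connecting them:

```python
def spanning_subtree(parent, leaves):
    """Given a tree as a child -> parent map and a set of leaves
    (e.g. barcodes returned by a spatial query), return the nodes on
    the minimal subtree connecting those leaves, plus their most
    recent common ancestor (MRCA)."""
    # Root-ward path from each leaf
    paths = []
    for leaf in leaves:
        path = [leaf]
        while path[-1] in parent:
            path.append(parent[path[-1]])
        paths.append(path)

    # Nodes shared by all paths; the MRCA is the deepest of them,
    # i.e. the shared node closest to the start of any one path.
    common = set(paths[0])
    for p in paths[1:]:
        common &= set(p)
    mrca = min(common, key=paths[0].index)

    # Keep each path only up to (and including) the MRCA
    nodes = set()
    for p in paths:
        for n in p:
            nodes.add(n)
            if n == mrca:
                break
    return nodes, mrca

# Toy tree: ((A,B)x,C)y and D hang off the root
parent = {"A": "x", "B": "x", "x": "y", "C": "y", "y": "root", "D": "root"}
nodes, mrca = spanning_subtree(parent, {"A", "B", "C"})
```

For a million-barcode tree this naive version would need indexing (e.g. precomputed node depths), but the query shape is the same.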

Big trees

Viewing big trees is still something of an obsession. I still think this hasn't been solved in a way that helps us learn about the tree and the entities in that tree. I currently think that a core problem to solve is how to cluster or "fold" a tree in a sensible way to highlight the major groups. I did something rather crude here, other approaches include "Constructing Overview + Detail Dendrogram-Matrix Views" (doi:10.1109/TVCG.2009.130, PDF here).
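One crude folding criterion is simply size: collapse any subtree with at most some number of leaves into a single summary node, so only the major groups stay expanded. A minimal sketch, assuming a hypothetical `(label, children)` tuple representation for trees:

```python
def fold(tree, max_leaves):
    """Collapse ("fold") any subtree with at most max_leaves leaves
    into a single summary node. Trees are (label, children) tuples;
    a leaf has an empty child list. Returns (folded_tree, leaf_count)."""
    label, children = tree
    if not children:
        return tree, 1

    folded_children = []
    total = 0
    for child in children:
        folded_child, n = fold(child, max_leaves)
        folded_children.append(folded_child)
        total += n

    if total <= max_leaves:
        # Replace the whole subtree with one summary node
        return ("%s (%d leaves)" % (label, total), []), total
    return (label, folded_children), total

tree = ("root", [("clade1", [("a", []), ("b", [])]),
                 ("clade2", [("c", []), ("d", []), ("e", [])])])
folded, n_leaves = fold(tree, 2)
```

A size threshold is only one possible criterion; folding on branch lengths, support values, or named higher taxa would do a better job of highlighting the "major groups", but the recursive collapse-and-summarise structure is the same.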

Graph databases and the biodiversity knowledge graph

I'm working on a project to build a "biodiversity knowledge graph" (see doi:10.3897/rio.2.e8767). In some ways this is recreating my entry in Elsevier's Grand Challenge "Knowledge Enhancement in the Life Sciences" (see also hdl:10101/npre.2009.3173.1 and doi:10.1016/j.websem.2010.03.004).

Currently I'm playing with Neo4J to build the graph from JSON documents stored in CouchDB. Learning Neo4J is taking a little time, especially as I'm creating nodes and edges on the fly and want to avoid creating more than one node for the same thing. In a world of multiple identifiers this gets tricky, but I think there's a reasonable way to do this (see the graph gist Handling multiple identifiers). Since I'm harvesting data I'm effectively building a web crawler, so I need to think about queues, and ways to ensure that data added at different times gets properly linked.
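The core of the multiple-identifiers problem can be sketched as a union-find over identifier strings: each "these two identifiers are the same thing" assertion merges two clusters, and the graph builder creates one node per cluster rather than one per identifier. This is plain Python illustrating the idea, not actual Neo4J/Cypher code (the identifiers below are made up):

```python
class IdentifierClusters:
    """Union-find over identifiers: assertions that two identifiers
    refer to the same thing merge their clusters, so a graph builder
    can create exactly one node per cluster."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        """Return the canonical representative of x's cluster."""
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def same_as(self, a, b):
        """Record that identifiers a and b denote the same thing."""
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self.parent[root_b] = root_a

ids = IdentifierClusters()
ids.same_as("doi:10.1234/abc", "pmid:111")      # same paper, two identifiers
ids.same_as("pmid:111", "handle:10101/xyz")     # learned later, still merges
```

Crucially, the merge works even when the two assertions arrive at different times during crawling, which is exactly the "data added at different times gets properly linked" problem.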

Wikipedia and wikidata

I occasionally play with Wikipedia and Wikidata, although this is often an exercise in frustration as editing Wikipedia tends to result in edit wars ("we don't do things that way"). Communities tend to be conservative. I'll write up some notes about ways Wikipedia and Wikidata can be useful, especially in the context of the Biodiversity Heritage Library (see also Possible project: mapping authors to Wikipedia entries using lists of published works).

All the names

The database of all taxonomic names remains as elusive as ever -- our field should be deeply embarrassed by this, it's just ridiculous.

My own efforts in this area involve (a) obtaining lists of names, by whatever means available, and (b) augmenting them to include links to the primary literature. I've made some of this work publicly accessible (e.g., BioNames). I'd like all name databases to make their data open, but most are resistant to the idea (some aggressively so).

One approach to this is to simply ignore the whimpering and make the data available. Another is to consider recreating the data. We have name-finding algorithms, and more of the literature is becoming available, either completely open (e.g., BHL) or accessible to mining (see Content Mine). At some point we will be able to recreate the taxonomic name databases from scratch, making the obstinate data providers no longer relevant.

First descriptions

Names, by themselves, are not terribly useful. But the information that hangs off them is. It occurs to me that projects like BioNames (and other things I've been working on, such as IPNI names) aren't directly tackling this. Yes, it's nice to have a bibliographic citation/identifier for the original description of a name (or subsequent name changes), but what we'd really like is to be able to (a) see that description and (b) have it accessible to machines.

So one thing I plan to add to BioNames is to automate going from a name to the page with the actual description, and display that information. For many names BioNames knows the page number of the description, and hence its location within the publication. So we simply need to pull out that page (allowing for edge cases where the mapping between digital and physical pages might not be straightforward) and display it (together with its text). If we have XML we can also try and locate the descriptions within the text (for some experiments using XSLT see https://github.com/rdmpage/ipni-names/tree/master/pmc). There's lots of scope for simple text mining here, such as extracting specimen codes (such as type specimens) and associated taxonomic names (synonyms, ecologically associated organisms, etc.).
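The digital-to-physical page mapping could be sketched like this, assuming a hypothetical pagination table of (image index, printed page label) pairs for a scanned publication. The point is that the mapping is a lookup, not a fixed offset, because scans include covers, plates, and unnumbered front matter:

```python
def page_to_image_index(cited_page, page_map):
    """Map a printed page number (as cited in a name's description)
    to the index of the corresponding scanned page image.
    page_map is a list of (image_index, printed_page_label) pairs;
    labels are strings because pages may be roman-numbered, plates,
    or unnumbered."""
    for image_index, label in page_map:
        if label == str(cited_page):
            return image_index
    return None  # page not found in this scan

# Hypothetical scan: cover and roman-numbered front matter precede
# page 1, and a plate is bound between printed pages 2 and 3.
page_map = [(0, "cover"), (1, "i"), (2, "1"),
            (3, "2"), (4, "plate"), (5, "3")]
```

Given such a table, "BioNames knows the description is on page 3" becomes "fetch scan image 5 and its OCR text", which is the step that makes the description itself visible and machine-accessible.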

Dockerize all the things!

Lots of scope to explore using containers to provide services. Docker Hub provides Elastic Search and Neo4J, and Bitnami can run Elastic Search and CouchDB on Google's cloud. Hence we can play with various tools without having to install them. The other side is creating containers to package various tools (Global Names is doing this), or using containers to package up particular datasets and the tools needed to explore them. So much to learn in this area.