Comments on Software Discovery Index Report

The NIH has produced a report on the requirements for a Software Discovery Index. Below you’ll find my comments posted there.

I think this report reflects both demand for and consensus around the issues of software identification and use in biomedical science. As a creator and user of scientific software both installable and accessible through Web APIs, from my perspective anything that helps bring transparency and visibility to what scientific software devs do is fantastic.

I would like to reiterate the comments that we need to build on what software developers already use and what is already in the wild, and to call out two main points.

Package management

There was some mention of package managers, but I would suggest making this a stronger focal point, as much of the metadata is already in these management systems. Examples include:

My suggestion would be that a simple metadata element could be suggested that would allow software developers to advertise that their work should be indexed by the Software Discovery Index. Then the process would involve just indexing the existing package managers already in use. (Note, this also works for VMs. See

For those who choose not to use a package repository, I would suggest leveraging that already has a metadata description for software ( This metadata description can already be recognized by search engines.
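As a rough illustration of the opt-in idea for package managers, here is a minimal Python sketch. It assumes package metadata has already been fetched into plain dictionaries with "name", "version", "summary" and "keywords" fields. The opt-in keyword itself is hypothetical; neither the report nor any package manager defines one.

```python
from __future__ import annotations

from typing import Iterable

# Hypothetical opt-in marker; no such keyword is defined by the report
# or by any existing package manager.
OPT_IN_KEYWORD = "software-discovery-index"


def opted_in(metadata: dict) -> bool:
    """Return True if a package's metadata carries the opt-in keyword."""
    keywords = metadata.get("keywords", [])
    if isinstance(keywords, str):
        # Some package formats store keywords as one comma- or
        # space-separated string rather than a list.
        keywords = keywords.replace(",", " ").split()
    return OPT_IN_KEYWORD in {k.strip().lower() for k in keywords}


def index_records(packages: Iterable[dict]) -> list[dict]:
    """Keep a few descriptive fields for each opted-in package."""
    return [
        {
            "name": p["name"],
            "version": p["version"],
            "summary": p.get("summary", ""),
        }
        for p in packages
        if opted_in(p)
    ]
```

The point of the sketch is that the index does no curation of its own: it only filters and copies metadata that developers already maintain for their package manager.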

Software vs Software Paper

One thing that I didn’t think came through strongly enough in the report was the distinction between the software and a potential scientific paper describing the software. Many developers of scientific code also publish papers that tease out important aspects of, e.g., the design, applicability or vision of the software. In most cases, these authors would like citations to accrue to that one paper. See for example Scikit-learn [1], the many pages you find on software sites with a “please cite” section [2], or the idea of software articles [3, 4].

This differs from the idea that the author of a publication should reference the specific software version that was used. My suggestion would be to decouple these tasks and actually suggest that both are done. This could be done by allowing author-side minting of DOIs for particular software versions [5] and suggesting that authors also include a reference to the designated software paper. The Software Discovery Index could facilitate this process by suggesting the appropriate reference pair. This is only one idea, but I would suggest that the difference between software and software paper should be considered in any future developments.

This could also address some concerns with referencing APIs.
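To make the "reference pair" idea a bit more concrete, here is a small Python sketch of what an index entry could hand back to an author. The class name, field names and citation format are all assumptions made for illustration; the report does not specify any of them.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ReferencePair:
    """Hypothetical record pairing a versioned software DOI with its paper."""

    software_name: str
    version: str
    version_doi: str     # identifier minted for the exact version used
    paper_citation: str  # the designated software paper, as a formatted string

    def as_citations(self) -> list[str]:
        """Return both references: the exact version and the software paper."""
        return [
            f"{self.software_name}, version {self.version}. "
            f"doi:{self.version_doi}",
            self.paper_citation,
        ]
```

An author would then cite both entries: the first records exactly which version was run, while the second lets citations accrue to the paper the developers designated.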


In summary, I would emphasize the need to reuse existing infrastructures for software development, publication of scientific results, and search. There is no need to “roll your own”. I think some useful, simple and practical guidelines would actually go a long way in helping the scientific software ecosystem.

[1] Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, Jake Vanderplas, Alexandre Passos, David Cournapeau, Matthieu Brucher, Matthieu Perrot, and Édouard Duchesnay. 2011. Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res. 12 (November 2011), 2825-2830.
[5] e.g.

Filed under: academia Tagged: NIH, package management, software publications
Source: Think Links

Posted in Paul Groth, Staff Blogs
