development log, 2017.01.27

2016 was a hell of a year, for many reasons (good, not so good, and everywhere in between). I’m hoping to share more about last year in a retrospective in the next week, but as part of the new year we are now living in, I wanted to start doing a weekly-ish series on what I’ve been up to on development fronts.

I’m now employed by the Beckstein Lab as a research software engineer, and with this role comes a shift in my focus to software development. Over the course of my PhD I built a few software libraries and contributed to at least a couple others. My efforts over the next few months (it’s a short contract) are geared almost exclusively toward finishing up loose ends on many of these projects, plus at least one brand new one. These projects form the core of our software stack, so getting them on a solid footing before I leave is high-priority.

As the first entry in this series, I think it’s worthwhile to share the current status of these projects (from my perspective), and where I’m hoping/expecting them to go. The opportunity to have a focus on software in an academic setting is not common, so I really hope we can come out on the other end with some really interesting developments. :)

datreant

Of all my software projects, datreant is the library I poured the most effort into during my PhD, and it forms the core of my software environment. Since I work with simulation data that can’t easily be shoehorned into a database solution, this library gives a database-ish abstract interface to the filesystem, making it possible to work with my data at a high level while still being able to drill down to any level of detail. This interface is also used heavily for my workflow automation, since it makes writing code that refers to filesystem components (files, directories) in an object-oriented way totally cake.
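To make the idea of a database-ish layer over the filesystem concrete, here is a minimal sketch using only the Python standard library. It is not datreant's actual API: the `Record` class, the `record.json` file name, and the `discover` helper are all hypothetical names chosen for illustration.

```python
import json
import tempfile
from pathlib import Path

STATE_FILE = "record.json"  # hypothetical metadata file name, not datreant's


class Record:
    """A directory plus a small JSON file holding tags (illustration only)."""

    def __init__(self, path):
        self.path = Path(path)
        self.path.mkdir(parents=True, exist_ok=True)
        self._state = self.path / STATE_FILE
        if not self._state.exists():
            self._state.write_text(json.dumps({"tags": []}))

    @property
    def tags(self):
        return set(json.loads(self._state.read_text())["tags"])

    def add_tags(self, *tags):
        merged = sorted(self.tags | set(tags))
        self._state.write_text(json.dumps({"tags": merged}))


def discover(root, tag):
    """Return every Record under root that carries the given tag."""
    found = []
    for state in sorted(Path(root).rglob(STATE_FILE)):
        record = Record(state.parent)
        if tag in record.tags:
            found.append(record)
    return found


def demo():
    """Build two records in a temporary directory and query them by tag."""
    with tempfile.TemporaryDirectory() as root:
        Record(Path(root) / "sim_a").add_tags("equilibrated", "production")
        Record(Path(root) / "sim_b").add_tags("equilibrated")
        return [r.path.name for r in discover(root, "production")]
```

The point of the sketch is that the metadata lives next to the data, so a query is just a walk over the directory tree, and any record can still be opened as an ordinary directory.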

This library is pretty stable in terms of its overall behavior, though the issue tracker has a lot of things that are wide open and need to be resolved. Some of these are relatively easy, but others will take some more thought. My hope is to get at least one, if not two, releases out before I’m finished here. I’ll probably continue to work on datreant after my time in the lab is up, however. It’s a bit unique, and I think has a lot of potential use for a lot of people. More on that soon.

MDSynthesis

datreant started life as MDSynthesis, a library I developed to maintain my sanity in the face of increasing volume and variety of simulation data I was collecting. Eventually the molecular dynamics-agnostic pieces of MDSynthesis were stripped out and became datreant, leaving MDSynthesis itself as a pretty light library by comparison. This library is how I use datreant in practice, since it gives a lot of convenient functionality for working with MD data.

MDSynthesis isn’t a large codebase, and I’d like to keep it that way. Over the past few months I’ve been leaning more toward making datreant, in particular datreant.core, the recommended library for doing all things datreant-like, and leaving MDSynthesis in maintenance mode. The trouble with the library is that by its nature it tends to accumulate a lot of ad hoc components, and it also has to track the development of both datreant and MDAnalysis, which makes it a bit of a pain to maintain. In a world of limited time and energy, I’m inclined to recommend that users use datreant and roll their own conveniences for working with their data, even if they’re doing MD. On the other hand, MDSynthesis remains a nice little library with incredible utility.

Future uncertain, but for now it persists. If anyone is interested in taking on more of a maintenance role for this library, get in touch.

MDAnalysis

MDAnalysis has a strong developer base, and my contributions over the last year have been mostly in design discussions and as co-developer of the shiny, new topology system. There’s a lot of cool stuff coming through the pipe in the next release, and I’m not concerned about the library’s future.

Over the next few months I’m hoping we can get pickling finally working for all objects so that parallelization can work more generally and as users expect, not just in a shared-memory environment as it does currently. Since the new topology system encapsulates things much more cleanly in the Universe object, this should be very doable, but we just have yet to settle on a specific choice of solution. I’m hoping to scrape together a chunk of time to put forth a working proposal; I already have a prototype of a piece of this from a long time ago, but it needs to be picked up again.
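For context on why pickling matters here: process-based parallelism in Python typically ships objects to workers by pickling them, and an object that holds an open file handle cannot be pickled by default. The sketch below shows one common pattern for handling that, which is to drop the handle when pickling and reopen the file when unpickling. It is not the MDAnalysis design (which, as noted, is not settled); `FrameReader` is a hypothetical stand-in that treats each line of a text file as a frame.

```python
import os
import pickle
import tempfile


class FrameReader:
    """Hypothetical reader that holds an open file handle (illustration only)."""

    def __init__(self, filename):
        self.filename = filename
        self._file = open(filename, "r")

    def read_frame(self, index):
        # Treat each line as one "frame"; rewind and scan to the requested line.
        self._file.seek(0)
        for i, line in enumerate(self._file):
            if i == index:
                return line.strip()
        raise IndexError(index)

    def close(self):
        self._file.close()

    def __getstate__(self):
        # Open file handles cannot be pickled, so drop the handle and keep the path.
        state = self.__dict__.copy()
        del state["_file"]
        return state

    def __setstate__(self, state):
        # Reopen the file on the receiving side (e.g. in a worker process).
        self.__dict__.update(state)
        self._file = open(self.filename, "r")


def roundtrip_demo():
    """Pickle a reader, restore it, and read a frame from the copy."""
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
        tmp.write("frame0\nframe1\nframe2\n")
    reader = FrameReader(tmp.name)
    try:
        clone = pickle.loads(pickle.dumps(reader))
        try:
            return clone.read_frame(1)
        finally:
            clone.close()
    finally:
        reader.close()
        os.remove(tmp.name)
```

This pattern assumes the file is reachable at the same path wherever the object is unpickled, which holds on a shared filesystem but not necessarily elsewhere.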

MDAnalysis is a great library with a great existing team, and so I’m likely to put my larger efforts elsewhere over the next few months. I’m still going to be involved in discussions, though, since I think I’m recognized among the group as someone who thinks a lot about the implications of design choices, duplication of effort and interfaces, etc.

mdworks

mdworks is a prototype Python module with building blocks for building workflow graphs using Fireworks, an automated workflow engine we’ve started using in the lab. It’s very early development, and mostly functions as a code dump of things I currently use, but the things currently present are fairly general-purpose.
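As a rough illustration of what a workflow graph is, the sketch below describes a small set of tasks and their dependencies and runs them in a valid order. It uses only the standard library's `graphlib` (Python 3.9+) and is not the Fireworks or mdworks API; the task names and the `run` helper are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each task maps to the set of tasks it depends on (all names hypothetical).
WORKFLOW = {
    "stage_files": set(),
    "run_simulation": {"stage_files"},
    "analyze": {"run_simulation"},
    "archive": {"run_simulation"},
    "report": {"analyze", "archive"},
}


def execution_order(graph):
    """Return one valid order in which the tasks could be executed."""
    return list(TopologicalSorter(graph).static_order())


def run(graph, actions):
    """Run each task's action in dependency order and collect a log."""
    log = []
    for task in execution_order(graph):
        log.append(actions[task]())
    return log


def demo():
    # Stand-in actions; a real engine would launch jobs instead.
    actions = {name: (lambda n=name: f"done: {n}") for name in WORKFLOW}
    return run(WORKFLOW, actions)
```

A real workflow engine adds what this sketch leaves out: persistent state, remote execution, and recovery from failed tasks.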

The trouble with using automated workflows is that they require not only code to generate the graphs, but also the infrastructure to serve and execute them, so it’s nontrivial to get up and running. One of my tasks over the next few months is to get mdworks into a state that’s fairly usable by other members in the lab (which means clear documentation, understandable design choices, functional pieces) and also provide what’s needed to get up and running infrastructure-wise (the hard part).

I strongly believe using automated workflows is the way forward in this field, and if we can achieve our goals here it can pay dividends well into the future. We can ask more complicated questions much more easily when the logistics of computing are largely fire-and-forget, and it makes expensive human time available for doing other things.

alchemlyb

The final software project of note that I’m putting effort into is alchemlyb. This library aims to make doing so-called alchemical free energy calculations easier for the average computationalist, providing machinery for parsing data, preprocessing it, estimating free energies, and checking convergence and robustness. It comes as a response to the rather wild-west nature of doing these calculations once you have data to do them with, as most every lab has their own in-house code to do these calculations. This is a lot of duplication of effort, and it’s prone to error. Existing solutions that are publicly available, such as alchemical-analysis.py, although perhaps correct and useful, are often monolithic and inflexible. These also generally don’t scale beyond single-machine use, so growing datasets become a problem.
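To give a flavor of the "estimating free energies" step, here is a minimal sketch of one standard estimator, thermodynamic integration, which integrates the ensemble-averaged dH/dλ over the coupling parameter λ. This is not alchemlyb's API; the function name and the input numbers are made up for illustration, and the trapezoid rule is just one simple choice of quadrature.

```python
def ti_free_energy(lambdas, mean_dhdl):
    """Estimate a free energy difference by thermodynamic integration.

    Integrates the ensemble-averaged dH/dlambda over lambda with the
    trapezoid rule. Units follow whatever units mean_dhdl is given in.
    """
    if len(lambdas) != len(mean_dhdl):
        raise ValueError("lambdas and mean_dhdl must have the same length")
    if len(lambdas) < 2:
        raise ValueError("need at least two lambda windows")
    total = 0.0
    points = list(zip(lambdas, mean_dhdl))
    for (l0, g0), (l1, g1) in zip(points, points[1:]):
        if l1 <= l0:
            raise ValueError("lambdas must be strictly increasing")
        total += 0.5 * (g0 + g1) * (l1 - l0)
    return total


# Made-up numbers purely for illustration, not real simulation output.
example_lambdas = [0.0, 0.5, 1.0]
example_dhdl = [0.0, 1.0, 2.0]
```

Keeping an estimator as a small function over plain arrays, separate from parsing and preprocessing, is the kind of composability that a monolithic script makes difficult.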

The development of this library is an informal collaboration between several folks in the field, with me loosely heading up the project (since I’m paid to do software, not strictly science). Because it is pretty well-defined from the start (these kinds of calculations have the benefit of years of experience behind them), I think we can stand this library up within the next few months and push out at least a couple of releases. Because of the community interest I’m not worried about long-term maintenance, but there’s a lot of work to be done yet. This next week we’re laying out issues for all the major components so discussion and development can begin in earnest.

This is an exciting effort, the results of which should be a firm software foundation going forward for the field. I’m a strong believer in the idea that in order for us to focus our efforts on new, harder things, we have to make established techniques routine and easy. This is a specific effort to make that so for alchemical free energy calculations.

Check out this gist if you’re interested in the design ideas behind this library.

Let’s get to work

As this list makes obvious, there is no shortage of things to do, and I’m realistic enough to know that we probably can’t achieve everything. We will try, though, with the aim of inching forward to stability. There are relatively few people in the lab who’ve taken an interest in software engineering as I have, so the stack needs to be stable enough to survive without another software-focused person around to keep things going. Whether that works out, only time will tell.

— david
