SciPy 2015: Introducing MDSynthesis

I’ve already announced as much on Twitter, but I’m excited to present a poster talk this year at SciPy. It’ll be my first time attending, and I’m looking forward to meeting others who are passionate about advancing science through software, particularly with Python.

I’ll be presenting a software package I’ve been working on called MDSynthesis, which has vastly improved the way I do science with molecular dynamics (MD) simulations. MDSynthesis addresses an important bottleneck in MD research: going from raw simulation data (perhaps many terabytes, spread over tens to hundreds of simulations) to information that allows us to answer biophysical questions. I’ll explain…

One of the obstacles to using modern data science tools like pandas to analyze MD data is the multitude of formats the MD ecosystem trades in. CHARMM and NAMD use DCD files, AMBER uses a NetCDF-derived format, and GROMACS uses an XDR format; all told, there are at least 13 different formats used for storage of MD trajectory data, each with unique strengths and limitations. MDAnalysis is a Python package that provides a common interface to many of these formats, turning trajectory data into NumPy arrays that can be handled with the full power of the Python universe.

But the diversity in trajectory formats isn’t the only obstacle to distilling information from MD data; another is the variety of inputs available for building any particular simulation system. For example, when simulating a single protein, I have many choices to make: force field, starting conformation, protonation states, solvent, ions, temperature and pressure algorithms, and the list goes on. The picture becomes more complicated when one wants to run different types of MD, as there is also a wide variety of enhanced-sampling methods available.

And that’s still not all: trajectory data can take a while to churn through to extract the measures we are interested in, depending on the measure and on the number and length of the trajectories. It’s therefore useful to store intermediate data so we can explore it interactively without recomputing it each time.

Managing this complexity is burdensome, and frankly, boring. I’d rather spend my limited time and energy doing science than managing my ever-growing collection of data. Furthermore, I want quick, specific, and easy access to the data I have so that I can begin answering questions.

MDSynthesis has done this for me. The basic idea behind the package is to provide persistent objects, called containers, that serve as data storage units. One such container is the Sim object. A Sim can store any number of MDAnalysis Universe definitions (topologies + trajectories), along with atom selections for later use. Sims store their state directly to disk in a thin HDF5 database (using PyTables), so the same Sim instance can be recalled later, or at the same time in another Python session. Most importantly, Sims provide an interface for storing pandas and NumPy data structures in HDF5 format with no fuss, and for recalling them just as easily. Almost any other Python data structure can be stored the same way; the container will pickle what it can’t serialize to HDF5.
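To make the idea of a persistent container concrete, here is a minimal sketch using only the standard library. It is not the MDSynthesis API: `ToyContainer`, `add_tag`, `store`, and `retrieve` are hypothetical names, and a JSON state file plus pickle files stand in for the HDF5 database the package actually uses. The point is only that the object’s state lives on disk, so a second handle on the same directory sees the same thing.

```python
import json
import pickle
import tempfile
from pathlib import Path


class ToyContainer:
    """Hypothetical stand-in for a persistent container (not the MDSynthesis API)."""

    def __init__(self, location):
        # The directory *is* the container; creating the object twice with
        # the same location gives two handles on the same state.
        self.location = Path(location)
        self.location.mkdir(parents=True, exist_ok=True)
        self._state_file = self.location / "state.json"
        if not self._state_file.exists():
            self._write_state({"tags": []})

    def _read_state(self):
        return json.loads(self._state_file.read_text())

    def _write_state(self, state):
        self._state_file.write_text(json.dumps(state))

    @property
    def tags(self):
        # Always read from disk so changes made elsewhere are visible.
        return self._read_state()["tags"]

    def add_tag(self, tag):
        state = self._read_state()
        if tag not in state["tags"]:
            state["tags"].append(tag)
        self._write_state(state)

    def store(self, key, data):
        # Assumption: pickle everything; the real package prefers HDF5
        # and only falls back to pickle.
        with open(self.location / f"{key}.pkl", "wb") as fh:
            pickle.dump(data, fh)

    def retrieve(self, key):
        with open(self.location / f"{key}.pkl", "rb") as fh:
            return pickle.load(fh)


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        first = ToyContainer(Path(tmp) / "run1")
        first.add_tag("equilibrated")
        first.store("rmsd", [0.0, 1.2, 1.9])

        # A second handle on the same directory recalls the same state.
        second = ToyContainer(Path(tmp) / "run1")
        return second.tags, second.retrieve("rmsd")


if __name__ == "__main__":
    print(demo())
```

Reading the state file on every access is slow but keeps the sketch honest about where the state lives; it does not attempt the file locking that safe simultaneous access from several sessions would need.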

Beyond Sims, there are also Groups, which can store Sims and other Groups as members for easy recall of whole ensembles of containers and easy aggregation of their stored data.
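As a rough illustration of what aggregation across members means, here is a small in-memory sketch. The names `ToySim`, `ToyGroup`, `iter_sims`, and `aggregate` are hypothetical and chosen for this example; they are not taken from MDSynthesis, and nothing here touches disk.

```python
class ToySim:
    """Hypothetical container holding named datasets (in memory only)."""

    def __init__(self, name):
        self.name = name
        self.data = {}


class ToyGroup:
    """Hypothetical group whose members are ToySims or other ToyGroups."""

    def __init__(self, name, members=()):
        self.name = name
        self.members = list(members)

    def iter_sims(self):
        # Walk the membership tree, descending into nested groups.
        for member in self.members:
            if isinstance(member, ToyGroup):
                yield from member.iter_sims()
            else:
                yield member

    def aggregate(self, key):
        # Collect one dataset from every sim that has it, keyed by sim name.
        return {
            sim.name: sim.data[key]
            for sim in self.iter_sims()
            if key in sim.data
        }


def demo():
    apo = ToySim("apo")
    apo.data["rmsd"] = [0.0, 1.1]
    holo = ToySim("holo")
    holo.data["rmsd"] = [0.0, 0.7]
    mutant = ToySim("mutant")  # no rmsd stored yet, so it is skipped

    wild_type = ToyGroup("wild_type", [apo, holo])
    everything = ToyGroup("everything", [wild_type, mutant])
    return everything.aggregate("rmsd")


if __name__ == "__main__":
    print(demo())
```

Keying the result by member name keeps each dataset traceable to the simulation it came from, which matters once the results are combined into a single table.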

Those are the basic elements; more details can be found in the docs. We made an alpha release of the package last week. It is already usable for daily work, but the project is still very young. What’s particularly exciting for me is that development of the package has already fed back into development of MDAnalysis, with even more performance and persistence functionality on the way.

If you find this software useful, let me know! If it’s missing something that it sorely needs, feel free to submit an issue and we’ll get cracking on it. Pull requests are also welcome!

— david
