Today was the first day of the Moore-Sloan Data Science Environments annual summit, held this year in Cle Elum, Washington. We had talks about activities going on at UW; many of the most interesting to me were around reproducibility and open science. For example, there were discussions of reproducibility badges, where projects can be rated on a range of criteria and given a score. The idea is to make reproducibility a competitive challenge among researchers. A recurring theme in these discussions was that it isn't cheap to work in a fully reproducible way. That said, there are also huge advantages, not just to science, but also to the individual, as I have commented in this space before. It is easy to forget that when CampHogg first went fully open, we did so because it made it easier for us to find our own code. That sounds stupid, but it's really true that it is much easier to find your three-year-old code on the web than on your legacy computer.
Ethics came up multiple times at the meeting. Ethical training and a foregrounding of ethical issues in data science are shared goals in this group. I wonder, however, if we got really specific and technical, whether we would agree on what it means to be ethical with data. Sometimes the most informative and efficient data-science methods to (say) improve fairness in the distribution of services could easily conflict with concerns about privacy, for example. That said, this is all the more reason that we should encourage ethical discussions in the data science community, and also encourage those discussions to be specific and technical.