This week I’m attending a workshop on *model-based clustering* at the University of Bologna, which is proving rather interesting, not least because of the huge diversity of applications motivating the methodological developments being presented. For instance, Adrian Raftery has told us about his team’s work on developing the “scanBMA” package for inferring the structure of gene regulatory networks: a problem requiring non-trivial prior specification and fast model selection within a set of >2^80 possible models. Likewise, Charles Bouveyron has given us a fascinating account of his random subgraph model for ecclesiastical networks in Merovingian Gaul (supposedly the period of ~480-614 AD, just after Asterix & Obelix), for which a variational Bayes solution was ultimately required. For the budding young scientist finding it difficult to pin their interests down to one particular field, I can safely recommend statistics as the right discipline to go into!
BTW I have an idea for a relatively quick & easy project (appropriate to e.g. an honours student or first-year PhD project) with good prospects for publication & citations, if any of my observational astronomy OR statistics colleagues know of someone interested. It goes like this. One of the most popular datasets for mixture modelling in statistics is the so-called “galaxy dataset” of Postman, Huchra and Geller (1986; hereafter PHG86), typically cited via its first statistical use in Roeder (1990). This dataset consists of the measured recession velocities of 82 galaxies in a small region of sky, and is of ongoing statistical interest because it appears to have anywhere between three and nine(!) distinct “components” in redshift space, depending *very much* on how you try to do your mixture modelling.
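To give a flavour of why the answer depends on the method, here is a minimal sketch of just one of the many possible recipes: fit 1D Gaussian mixtures by EM for a range of component counts and compare them by BIC. Everything in it is an assumption made for illustration. The velocities are synthetic (three made-up clumps, in units of 1000 km/s), not the real galaxy dataset, and the quantile initialisation, variance floor and choice of BIC are arbitrary; changing any of them can change the preferred number of components.

```python
import math
import random


def log_normal_pdf(v, mean, var):
    """Log density of a 1D Gaussian with the given mean and variance."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (v - mean) ** 2 / var)


def log_sum_exp(values):
    """Numerically stable log(sum(exp(values)))."""
    top = max(values)
    return top + math.log(sum(math.exp(u - top) for u in values))


def point_log_terms(v, weights, means, variances):
    """Per-component log(weight * density) for a single data point."""
    return [math.log(w) + log_normal_pdf(v, m, s2)
            for w, m, s2 in zip(weights, means, variances)]


def fit_gmm_1d(x, k, n_iter=200, var_floor=0.01):
    """Fit a k-component 1D Gaussian mixture by EM.

    Returns (weights, means, variances, log_likelihood).
    """
    n = len(x)
    xs = sorted(x)
    # Initialise means at evenly spaced quantiles (an arbitrary choice).
    means = [xs[int((j + 0.5) * n / k)] for j in range(k)]
    grand_mean = sum(x) / n
    grand_var = sum((v - grand_mean) ** 2 for v in x) / n
    variances = [grand_var] * k
    weights = [1.0 / k] * k
    for _ in range(n_iter):
        # E-step: responsibilities of each component for each point.
        resp = []
        for v in x:
            logs = point_log_terms(v, weights, means, variances)
            norm = log_sum_exp(logs)
            resp.append([math.exp(u - norm) for u in logs])
        # M-step: update weights, means and variances.
        for j in range(k):
            nj = sum(r[j] for r in resp) + 1e-10  # guard against empty components
            weights[j] = nj / n
            means[j] = sum(r[j] * v for r, v in zip(resp, x)) / nj
            var = sum(r[j] * (v - means[j]) ** 2 for r, v in zip(resp, x)) / nj
            # The floor stops a component collapsing onto a single point.
            variances[j] = max(var, var_floor)
    loglik = sum(log_sum_exp(point_log_terms(v, weights, means, variances))
                 for v in x)
    return weights, means, variances, loglik


def bic(loglik, k, n):
    """BIC for a k-component 1D mixture with 3k - 1 free parameters."""
    return -2.0 * loglik + (3 * k - 1) * math.log(n)


# Synthetic stand-in data: 82 made-up "velocities" in units of 1000 km/s.
random.seed(1)
velocities = ([random.gauss(10.0, 0.5) for _ in range(25)]
              + [random.gauss(21.0, 2.0) for _ in range(35)]
              + [random.gauss(33.0, 1.0) for _ in range(22)])

bic_by_k = {}
for k in range(1, 6):
    _, _, _, ll = fit_gmm_1d(velocities, k)
    bic_by_k[k] = bic(ll, k, len(velocities))
    print(f"k={k}: BIC={bic_by_k[k]:.1f}")

best_k = min(bic_by_k, key=bic_by_k.get)
print("BIC-preferred number of components:", best_k)
```

Swapping BIC for another criterion, dropping the variance floor, or going fully Bayesian with a prior on the number of components are exactly the kinds of choices that push the answer around on the real data.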
Since the SDSS, though, we now have redshifts for something like 300-400 extra galaxies in the same field(s) and redshift interval. So the project would be to extract these carefully from the SDSS dataset (e.g. not messing up the B1950/J2000 difference and respecting SDSS quality flags), with a redshift selection function to match that of PHG86, and to make the new redshift dataset “officially” available as an electronic file via publication in a suitable statistical journal. The main paper would require a description of how the catalogue was assembled, astronomical arguments for why (and perhaps why not) these new galaxies can be considered to have come from the same parent distribution as the original galaxy dataset, plus a first go at mixture modelling via one of the standard methods (with comparison to results for the PHG86 sample).
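For concreteness, the extraction step might look something like the sketch below. This is only an outline under loudly stated assumptions: the `SpecRecord` fields are hypothetical stand-ins for whatever columns the real catalogue query returns, a warning flag of zero is simply assumed to mean a clean redshift, the sky box and velocity limits are illustrative placeholders rather than the actual PHG86 selection function, and the coordinates are assumed to be already on the J2000 system (the epoch conversion is better left to a dedicated astronomy library than hand-rolled).

```python
from dataclasses import dataclass

C_KM_S = 299_792.458  # speed of light in km/s


@dataclass
class SpecRecord:
    """Hypothetical spectroscopic catalogue row (field names are illustrative)."""
    ra_deg: float    # right ascension in degrees, assumed already J2000
    dec_deg: float   # declination in degrees, assumed already J2000
    z: float         # measured redshift
    z_warning: int   # quality flag; 0 is assumed to mean "no known problems"


def select_velocities(records, ra_range, dec_range, cz_range):
    """Return recession velocities cz (km/s) for records passing all cuts.

    The cuts are: clean quality flag, inside the sky box, inside the
    velocity interval. RA wrap-around at 0/360 degrees is not handled.
    """
    selected = []
    for rec in records:
        if rec.z_warning != 0:
            continue  # respect the quality flag
        if not (ra_range[0] <= rec.ra_deg <= ra_range[1]):
            continue
        if not (dec_range[0] <= rec.dec_deg <= dec_range[1]):
            continue
        cz = C_KM_S * rec.z  # low-redshift approximation v = cz
        if cz_range[0] <= cz <= cz_range[1]:
            selected.append(cz)
    return selected


# Made-up rows purely to exercise the cuts; none of these are real galaxies.
rows = [
    SpecRecord(195.0, 28.0, 0.05, 0),  # passes every cut
    SpecRecord(195.0, 28.0, 0.05, 4),  # flagged redshift, dropped
    SpecRecord(195.0, 28.0, 0.20, 0),  # outside the velocity interval
    SpecRecord(210.0, 28.0, 0.05, 0),  # outside the sky box
]

# Placeholder limits, NOT the real PHG86 field or selection function.
kept = select_velocities(rows,
                         ra_range=(190.0, 200.0),
                         dec_range=(25.0, 30.0),
                         cz_range=(5_000.0, 40_000.0))
print(len(kept), "of", len(rows), "rows kept")
```

The real work, of course, is in justifying the actual limits and flags used in place of these placeholders, which is what the catalogue-description part of the paper would need to document.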