Think I nailed the outlier rejection stuff.
For each tile I keep extra rows of outlier data covering the area just above and just below the tile. If a contribution's center falls outside the current tile, I do the lookup in these "overlap rows", or add the outlier to them. Since contributions which straddle tile boundaries are added to both tiles, the data ends up duplicated for the overlapping areas between tiles. Thus a thread can safely access all the data it needs for one tile.
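A rough sketch of the idea, in Python for brevity. This is not the renderer's code: `Tile`, `touches`, `cell` and `add_contribution` are made-up names, tiles are assumed to be horizontal bands of rows, and the overlap is assumed to be at least as large as the contribution radius.

```python
from dataclasses import dataclass, field


@dataclass
class Tile:
    y0: int        # first image row owned by this tile
    y1: int        # one past the last owned row
    width: int     # image width in pixels
    overlap: int   # extra rows stored above and below the tile
    rows: list = field(init=False)

    def __post_init__(self):
        # Owned rows plus the overlap rows on both sides.
        n = (self.y1 - self.y0) + 2 * self.overlap
        self.rows = [[[] for _ in range(self.width)] for _ in range(n)]

    def touches(self, y, radius):
        # True if a contribution centered on row y reaches into the owned rows.
        return y + radius >= self.y0 and y - radius < self.y1

    def cell(self, x, y):
        # Map an image row to local storage; overlap rows sit before/after owned rows.
        r = y - self.y0 + self.overlap
        assert 0 <= r < len(self.rows), "row outside this tile's storage"
        return self.rows[r][x]


def add_contribution(tiles, x, y, radius, value):
    # A contribution that straddles a boundary is stored in every tile it
    # touches, so each tile holds its own private copy of the shared area.
    for t in tiles:
        assert radius <= t.overlap, "assumption: overlap covers the radius"
        if t.touches(y, radius):
            t.cell(x, y).append(value)
```

With two tiles covering rows 0-3 and 4-7, a contribution centered on row 4 with radius 1 lands in both: in the overlap rows of the first tile and in the owned rows of the second. A thread working on either tile then only ever reads and writes that tile's own storage.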
I'll need to test it some more, but initial results look good. Using k=10 with a metal cube scene, I went from 30% CPU and 110 kS/s to >98% CPU and 270 kS/s. Results were almost identical for hybrid with this simple scene, as the high k made the outlier rejection the rendering bottleneck.
May contain traces of nuts.