Well, I guess this time around it's all about maximizing CPU usage. In previous versions I never ran 1 service per core, because segmenting a video into 32 pieces (16 instances x 2) just meant 16 processes running at 100% each, vs. 6 pieces (3 instances at 533%) or 8 pieces (4 instances at 400%). Either way the total is the same, so all you can really do is maximize CPU. They could maybe have squeezed in one more instance, but each enabled instance means more RAM and another process requiring I/O. In the old version, on a Mac Pro with 24 logical cores, I found 5 or 6 instances to be faster than enabling more than 6. A single instance by itself would use 400-600% CPU.
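
To make the arithmetic above concrete, here is a rough sketch. It assumes a 16-logical-core machine (1600% total CPU, inferred from the 16 x 100% figure) and 2 segments per instance (from "16 instances x 2"). The function name `plan` and its parameters are just for illustration and don't correspond to any real setting in the software.

```python
def plan(instances, total_cpu_percent=1600, segments_per_instance=2):
    """Estimate segment count and per-instance CPU share.

    Assumes the total CPU budget is split evenly across instances,
    which is an idealization; real usage also depends on RAM and I/O.
    """
    return {
        "segments": instances * segments_per_instance,
        "cpu_per_instance": round(total_cpu_percent / instances),
    }

for n in (16, 4, 3):
    print(n, "instances:", plan(n))
```

In every case the total comes out to about 1600%, which is the point: more instances doesn't buy more CPU, it just cuts the video into more pieces.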