
Speed up Qmaster's "Merging distributed QuickTime segments" process?

I'm finding that Qmaster's "Merging distributed QuickTime segments" process takes as long as, or LONGER than, the time it takes to generate the encode itself.

I found this related discussion but no resolution: http://discussions.apple.com/thread.jspa?threadID=1107862&tstart=56

There must be some un-optimized or rate-limiting bit of code in qmaster... I can't explain why the "Status: Processing: Merging distributed QuickTime segments" takes so long. We have a Compressor v3 setup with 24-36 virtual cores depending on config. The thing blazes through the encode process, doing an SD movie in about 4-6 minutes. It's beautiful how fast it runs, but when it gets back to merging segments, the disk activity does not exceed 15-20MB/sec read/write.

I tested my disks on concurrent read/write (this is a simple RAID0 array) and was seeing 80-90MB/sec concurrent read and write on the same volume. There's no knob, switch, or settings file parameter from what I can tell. It's incredibly frustrating for the disk operation to take LONGER than the encode.
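For anyone wanting to reproduce this kind of baseline measurement, here's a rough sketch of a sequential-throughput check using dd. The test file path is a placeholder; point it at the volume holding the qmaster temp area:

```shell
# Rough sequential-throughput check for the Qmaster scratch volume.
# TESTFILE is a placeholder -- point it at the RAID0 array being tested.
TESTFILE="${TESTFILE:-/tmp/ddtest.bin}"

# Write test: 256 MB of zeros (bs=1048576 is 1 MB, spelled out so it
# works with both BSD and GNU dd).
time dd if=/dev/zero of="$TESTFILE" bs=1048576 count=256

# Flush the filesystem cache if possible so the read test hits the disk
# rather than RAM (purge is available on Mac OS X).
sudo -n purge 2>/dev/null || true

# Read test: stream the file back out and note the elapsed time.
time dd if="$TESTFILE" of=/dev/null bs=1048576

rm -f "$TESTFILE"
```

Divide 256 MB by each elapsed time for an approximate MB/sec figure; run both tests simultaneously in two terminals to approximate the concurrent read/write case described above.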

Has anyone been able to overcome this problem? I am contemplating SSDs for the qmaster temp, but I don't think that would speed anything up. Does anyone know if this also exists in Qmaster 3.5?

Xserve, Mac OS X (10.6.4), OSX Server 10.6.5, Qmaster v3/Compressor 3.0.5

Posted on Jan 13, 2011 12:03 PM

11 replies

Jan 13, 2011 7:57 PM in response to davelindsay

The more instances you have running, the more segments will need to be combined once the encoding is complete. The only way to resolve this is to do some experiments with different numbers of instances to find a balance between encoding time and merging time.

This thread has some interesting statistics:
http://discussions.apple.com/thread.jspa?messageID=12710420

It would seem that Qmaster is most efficient with instances set to around half the number of cores and that a cluster is more beneficial on long files than short ones.

Jan 14, 2011 6:05 AM in response to davelindsay

It also depends on your network environment.

If you are sharing all the segments over Ethernet, that's going to be a big part of the bottleneck, as Ethernet has its own fairly significant overhead.

Xsan solves a lot of this problem. You don't need to copy the segments because every node can access them directly at Fibre Channel speeds.

Jan 14, 2011 6:44 AM in response to davelindsay

Hi all, thanks for the replies thus far.

Jon Chappell > Compressor Repair is my best friend! Appreciate the links to the other forum with the benchmark results. I'm going to re-create those tests and see how my results compare.

All nodes are connected via 2Gb ethernet (bonded), so network speeds/transfers between cluster storage, source, and encoders are excellent. During encoding, each node is pulling data in over the network from the centralized storage server at well over 100MB/sec and placing the resulting segment on the qmaster temp area on the controller node.

The bottleneck is what happens when the controller (an 8 core 2.66GHz Xserve with 2TB raid0'd drives) stitches the segments back together again. It's painfully slow, around 20MB/sec, when I know the disks are capable of about 3-4x that. During this process, no data is being pulled over the network from the encoder nodes, as they've already dumped their segments in the cluster storage node's temp area. It's all local disk I/O during the stitch operation.

Perhaps there's some single-threadedness in that process that prevents qmaster from harnessing all the resources available. All machines in the cluster are identical, so I'm convinced it's something in qmaster's design that may not be user serviceable.

Jan 14, 2011 7:13 AM in response to davelindsay

Most of the Compressor/Qmaster code is single-threaded, and it wouldn't surprise me at all if stitching needed to happen in a single thread.

Are you reading & writing back to the same disk? I'm not sure if you can separate those, but that would double the bottleneck.

If speed is really important to you, consider Episode from Telestream. It has high performance for its stitching but it's thousands of dollars instead of free.

Jan 14, 2011 9:17 AM in response to davelindsay

davelindsay wrote:
Perhaps there's some single-threadedness in that process that prevents qmaster from harnessing all the resources available. All machines in the cluster are identical, so I'm convinced it's something in qmaster's design that may not be user serviceable.


It's not Qmaster, it's QuickTime. QuickTime was written long before multi-core CPUs came along so it doesn't work well with threads. Apple half-heartedly hacked it to make a few functions threaded but it really needs a top-to-bottom rewrite.

Jan 14, 2011 3:19 PM in response to Jon Chappell

Just had a bit of a breakthrough after testing various setups all afternoon. I noticed that when encoding on the controller node alone (no encode-only service nodes in the cluster), the stitching operation ran at maximum disk speed (80MB/sec concurrent read/write). I then added ONE VC instance from an encode node to the cluster (8 VCs on the controller, 1 VC on the encode node) and watched the stitching speed drop back to 20MB/sec. Oddly, there was 20MB/sec of network activity from the encode node throughout the entire stitch operation after the encode. That was strange, because the encode node should encode directly to the controller's cluster storage.

My thought is that it's not QuickTime or Qmaster but possibly nfsd. I looked in the Server Admin settings for NFS, increased the NFS thread count from 20 to 80, restarted the NFS service, and submitted a job again. The results were MUCH more favorable: stitching on the controller ran at 80MB/sec with zero network activity on the encode node during stitching. This is very promising. I'm going to continue testing and then add a 2nd dedicated encode node to see if the issue is fully resolved.

Current take-away: increase the NFS thread count in Server Admin.
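For reference, the same setting can also be persisted in /etc/nfs.conf on 10.6 rather than through Server Admin. This is a sketch based on the nfs.conf(5) key for server threads; verify the key name with `man nfs.conf` on your own system:

```
# /etc/nfs.conf -- raise the number of nfsd server threads.
# 80 matched the value that fixed the stitching slowdown above;
# Server Admin's default here was 20.
nfs.server.nthreads = 80
```

After editing, restart the NFS service (via Server Admin, as above) so the new thread count takes effect.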

...To be continued.

Jan 27, 2011 9:16 AM in response to davelindsay

Excellent investigation davelindsay!

I'm not clear on how the Qmaster system makes use of NFS-based shares. The Distributed Setup Guide is actually quite weak on its coverage of computer-to-computer file sharing [setup, options, caveats]. It really only mentions creating AFP shares.

I know that the host I have config'ed as the cluster controller does export an NFS share to its "work area" (defaults to /var/spool/qmaster/... )

Thx for the insights.
-Rick

Feb 9, 2011 11:29 AM in response to davelindsay

After weeks of use I'm really not convinced Qmaster is all that it could/should be. Too many unexplained 'Service Down' messages on some of the instances within the cluster, and sometimes the cluster itself goes missing and needs to be rebooted and 'rebuilt.' One of my favorite scenarios is when Qmaster begins the process of merging the distributed segments and utilizes 7 of 8 VCs on the controller and 1 of 8 on a networked node, for no logical reason. All of the media segments are on the controller. Sending data back to the node isn't going to help at all! This results in unnecessary network traffic to/from the network node and an overall slowdown for the entire merge job.

It's a shame, really. It could be a very powerful tool. I'm now leaning towards allocating jobs directly to quick clusters of one machine each in a round-robin approach, with SSDs for the qmaster temp directory so that the file merge doesn't suffer from read/write contention on the same disk.
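The round-robin allocation described above could be scripted. This is only a sketch: the cluster names and movie files are placeholders, and the commented-out Compressor flags are assumptions about the Compressor 3 command-line interface that should be verified with `Compressor -help` before use:

```shell
#!/bin/sh
# Round-robin job allocation across single-machine quick clusters (sketch).
# CLUSTERS and the job filenames are placeholders for this example.
CLUSTERS="node1 node2 node3"
n=$(echo $CLUSTERS | wc -w)

i=0
for job in movie1.mov movie2.mov movie3.mov movie4.mov; do
    # Select the (i mod n)-th cluster name from the list.
    set -- $CLUSTERS
    shift $(( i % n ))
    cluster=$1
    i=$(( i + 1 ))

    echo "submitting $job to cluster $cluster"
    # Actual submission would go here; flags are hypothetical -- check
    # your install's Compressor command-line help for the real ones:
    # Compressor -clustername "$cluster" -jobpath "$job" \
    #            -settingpath preset.setting -destinationpath /output
done
```

Each machine stitches only its own jobs locally, so no merge traffic crosses the network at all.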
