Sorry it took a while to reply. Hurricane Ida and its aftermath have kind of put me behind on ... everything.
You can build auto ducking for a project in Motion. The method is not that obvious, but not really all that difficult either.
Add both audio tracks to a Motion project and adjust the project length to fit.
In the canvas, draw a rectangle. Doesn't matter what size.
Add a Poke filter. Set the Mix to 0 (you're not really using it for anything but its position).
Open the Center parameter's disclosure triangle and set the Poke Center.Y to 1.0 (this is its *starting* reference point).
To the Poke > Center.Y parameter (only), right click and Add Parameter Behavior > Audio.
To the Source Audio, add the **Voice track**. Let the Audio behavior finish updating "keyframes" (the wait cursor will let you know when it's done).
Set the Floor parameter to 0.2. This will help reduce ambient noises like mouse clicks and such.
Set the Peaks to Smooth (this is very important — you don't want the audio levels switching instantly!)
Set the Apply Mode to Add.
Set the Scale to -50 (this is flexible and can be experimented with... later).
To the Poke > Center.Y parameter, add a Clamp behavior. These values will actually control the volume levels of the music track. For the Min, set -0.925 (to start; experiment!). For the Max, set -0.5. (Most music tracks are "saturated" and this will bring the loudness down to about -6dB peak, where it should be, but again, this is a starting position and can be experimented with.)
To the Poke > Center.Y parameter, right click and Add Parameter Behavior > Average. Set the Window Size to about 1/2 second. If your project is 30p, set it to 15. If it's 60p, set it to 30, etc. You can adjust / fine tune this value later.
Click the Speaker icon in the Timeline window to show the Audio Timeline. Select the Music Track.
In the Inspector, click the Audio Track panel.
Right click on Level and Add Parameter Behavior > Link.
Select the Behaviors tab in the Inspector and to the Source Object, add your Rectangle.
For Source Parameter choose Filters > Poke > Center > Y.
For the Apply Mode, choose Add to source.
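For the Average step above, the Window Size works out to about half a second of frames. A tiny Python sketch of that arithmetic, in case your frame rate isn't 30 or 60 (the function name is made up for illustration, and rounding to a whole number of frames is my assumption):

```python
def average_window_frames(fps, seconds=0.5):
    """Number of frames covering `seconds` of timeline at `fps` (at least 1)."""
    # Assumes the Window Size is counted in whole frames, as the
    # 30p -> 15 and 60p -> 30 examples suggest.
    return max(1, round(fps * seconds))


print(average_window_frames(30))  # 15
print(average_window_frames(60))  # 30
print(average_window_frames(24))  # 12
```

Treat the result as a starting point and fine tune by ear.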
The theory is: the Poke position acts as a level control. It starts at 1 (full volume), and every time a vocal sound is made, that level amount is subtracted. The resulting value is then applied to the music track, which reduces the volume of the music. A balance between the two tracks can be found by using the Clamp values (Min and Max). The Scale of the Audio behavior can be used to speed up the transition between the two states and, as stated, Average helps smooth out all the "micro-peaks".
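If it helps to see that theory as plain arithmetic, here is a rough Python sketch of the same chain: Floor, then Scale with Add, then Clamp, then Average. This is only a model of the idea, not how Motion computes anything internally. The function names are made up, and three things are assumptions on my part: Floor is treated as a hard gate, Average is treated as a trailing window, and the music Level is treated as a linear gain where 1.0 is unity (which is at least consistent with the Max of -0.5 landing at about -6dB).

```python
import math


def duck_control(voice_levels, floor=0.2, scale=-50.0, start=1.0,
                 clamp_min=-0.925, clamp_max=-0.5, window=15):
    """Per-frame control value, mimicking the Poke Center.Y behavior stack."""
    clamped = []
    for level in voice_levels:
        # Floor: ignore quiet input such as mouse clicks (assumed hard gate).
        gated = level if level >= floor else 0.0
        # Audio behavior, Apply Mode "Add": the scaled level is added to the start value.
        value = start + scale * gated
        # Clamp: keep the control between Min and Max.
        clamped.append(min(max(value, clamp_min), clamp_max))

    # Average: smooth over the last `window` frames (assumed trailing window).
    smoothed = []
    for i in range(len(clamped)):
        chunk = clamped[max(0, i - window + 1):i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


def music_gain_db(control, base_level=1.0):
    """Music gain in dB after the Link adds `control` to the Level (assumed linear)."""
    return 20.0 * math.log10(base_level + control)


# One second of silence followed by one second of speech at 30 fps.
voice = [0.0] * 30 + [0.5] * 30
control = duck_control(voice, window=15)

print(round(music_gain_db(control[0]), 1))   # -6.0 (no voice: sits at the Clamp Max)
print(round(music_gain_db(control[-1]), 1))  # -22.5 (voice present: sits at the Clamp Min)
```

In between those two frames the control value slides from -0.5 to -0.925 over the length of the averaging window, which is the smoothing you hear as the music ducks.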
If I got all these steps right, your music track should now be ducked by the vocal track, with smoothing.
Is it worth the effort? Probably not. It takes Motion an inordinate amount of time to render out the final audio. This 1:33 demo took about 45 minutes to render:
https://fcpxtemplates.com/wp-content/uploads/2021/09/audio-ducking-experiment2-demoExport2.mp3
but the results were fairly decent!