So it’s just the size? Not the quality?
It's both. Any time you scale digital assets up, you degrade the image. It's unavoidable. You won't notice small amounts, but a 1080 frame has 2.25 times as many pixels as a 720 frame (2,073,600 versus 921,600), so more than half of the pixels in the upscaled result have to be interpolated from the 720 footage. That's scaling to 150% in each dimension, which will be noticeable.
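If you want to check the numbers yourself, here is a quick sketch of the arithmetic. It assumes the standard 16:9 frame sizes (1280x720 and 1920x1080); if your footage uses different dimensions, swap them in.

```python
# Assumed frame sizes: the usual 16:9 dimensions for "720" and "1080".
w720, h720 = 1280, 720
w1080, h1080 = 1920, 1080

pixels_720 = w720 * h720      # pixels the 720 source actually contains
pixels_1080 = w1080 * h1080   # pixels a 1080 frame needs

linear_scale = h1080 / h720              # scale factor per dimension
pixel_ratio = pixels_1080 / pixels_720   # how many times more pixels are needed
# Share of the 1080 pixel count that exceeds what the 720 source supplies.
interpolated_share = 1 - pixels_720 / pixels_1080

print(f"720 frame:  {pixels_720:,} pixels")
print(f"1080 frame: {pixels_1080:,} pixels")
print(f"linear scale: {linear_scale:.0%}")
print(f"pixel ratio: {pixel_ratio:.2f}x")
print(f"share that must be interpolated: {interpolated_share:.1%}")
```

The ratio is the part that matters: the scaler has to make up every pixel beyond what the source supplies, and no interpolation method can recover detail that was never recorded.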
If you're editing these videos individually, then work on them at their native size. If you're mixing the sources, then it would be better to build the composite video at 720. Scaling down isn't anywhere near as bad as scaling up. Scaling the 1080 clips down will soften them a bit, but you can add a sharpening filter to those clips to compensate.
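As a rough sketch of that rule, the snippet below picks the smallest source height as the composite size so that nothing ever gets scaled up. The function name and the clip names are made up for illustration; they don't come from any editing tool.

```python
def composite_plan(clip_heights):
    """Pick the smallest source height as the target so no clip is upscaled.

    clip_heights maps a clip name to its frame height in pixels.
    Returns the target height and the scale factor each clip needs.
    """
    target = min(clip_heights.values())
    scales = {name: target / height for name, height in clip_heights.items()}
    return target, scales


# Hypothetical clips: one 720 source and one 1080 source.
clips = {"phone.mp4": 720, "camera.mp4": 1080}
target, scales = composite_plan(clips)

for name, scale in scales.items():
    action = "leave as is" if scale == 1 else "scale down, then sharpen lightly"
    print(f"{name}: x{scale:.3f} ({action})")
```

Every scale factor comes out at 1 or below, which is the point: the only resampling that happens is downscaling.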
DON'T go overboard with the sharpening. Here's an example of what your sharpened clips should not look like.
The main thing to watch for is that you should never, ever see halos on anything. Nothing in the real world has halos. Even the original is too sharp: you can see a halo between the top right of the object and the sky. The sharpened image above is already way overdone.
Your enemy in unsharp masking is the radius.
Original:
High sharpening with a low radius sharpens the image without creating a halo.
Same amount of sharpening with too much radius.
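To see why the radius is the culprit, here is a minimal one-dimensional sketch of unsharp masking on a hard edge. It is only an illustration under a few assumptions: it uses a simple box blur where real sharpening filters typically use a Gaussian, the function names are my own, and the values are left unclipped so the overshoot stays visible.

```python
def box_blur(signal, radius):
    """Blur a 1D signal with a (2*radius + 1)-wide box, clamping at the ends."""
    n = len(signal)
    out = []
    for i in range(n):
        window = [signal[min(max(j, 0), n - 1)]
                  for j in range(i - radius, i + radius + 1)]
        out.append(sum(window) / len(window))
    return out


def unsharp_mask(signal, radius, amount):
    """sharpened = original + amount * (original - blurred)."""
    blurred = box_blur(signal, radius)
    return [s + amount * (s - b) for s, b in zip(signal, blurred)]


def halo_width(original, sharpened, tol=1e-9):
    """Count samples pushed outside the original brightness range."""
    lo, hi = min(original), max(original)
    return sum(1 for v in sharpened if v < lo - tol or v > hi + tol)


# A hard edge: dark on the left, bright on the right (think object against sky).
edge = [0.2] * 20 + [0.8] * 20

tight = unsharp_mask(edge, radius=1, amount=1.5)
wide = unsharp_mask(edge, radius=6, amount=1.5)

print("halo width, radius 1:", halo_width(edge, tight))
print("halo width, radius 6:", halo_width(edge, wide))
```

With the same amount, the larger radius pushes many more samples on both sides of the edge outside the original brightness range. That band of too-dark and too-bright pixels is the halo; a small radius keeps it narrow enough to read as a crisp edge.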
Format is rather irrelevant. You have to bring in what you have, and both are already compressed. When you export your composite video, you can choose .mp4, which is still a compressed video format. The already compressed footage then gets compressed again, which means at least a bit more detail loss.