There are three issues that, for me, are almost always at fault when I run into CPU load problems:
Issue one: a small number of high-intensity, expensive UGens are at fault. Reverb units are a good example of this - and many UGens in the sc3-plugins pack are not well optimized and can be resource-hungry as well. Sometimes I can switch to a lower-intensity UGen that does a similar thing. For example, I make heavy use of the SoftClipAmp8 UGen for waveshaping - this can be a CPU hog, but there are less CPU-intensive versions (SoftClipAmp and SoftClipAmp4) that rarely change the sound in a way I can hear.
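To make the "swap for something cheaper" idea concrete, here's a rough A/B sketch using two standard reverb UGens, GVerb and FreeVerb, as stand-ins (the SoftClipAmp UGens may not be installed on every system). Watch the server's CPU readout (s.avgCPU, or the status bar in the IDE) while each one runs:
(
SynthDef(\verbHeavy, { |out = 0|
    var dry = Decay2.ar(Impulse.ar(1), 0.01, 0.2) * PinkNoise.ar(0.3);
    Out.ar(out, GVerb.ar(dry, roomsize: 80, revtime: 5)); // denser reverb algorithm, more CPU
}).add;
SynthDef(\verbLight, { |out = 0|
    var dry = Decay2.ar(Impulse.ar(1), 0.01, 0.2) * PinkNoise.ar(0.3);
    Out.ar(out, FreeVerb.ar(dry, mix: 0.5, room: 0.8) ! 2); // simpler algorithm, cheaper
}).add;
)
// x = Synth(\verbHeavy); s.avgCPU; x.free;
// x = Synth(\verbLight); s.avgCPU; x.free;
The exact numbers will vary by machine - the point is just to compare the two while everything else stays the same.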
Issue two: a SynthDef has been constructed with a large number of duplicated or redundant UGens. Sometimes this happens because you haven't planned your graph well, and sometimes because you're processing multiple channels without realizing it. Here are some examples:
sig = 10.collect { SinOsc.ar(1000.rand) }; // this creates 10 channels
sig = LPF.ar(sig, 500); // by multichannel expansion, this is also 10 channels, and thus 10 LPFs
sig = Mix(sig);
In this case, I'm using a low pass filter on 10 separate channels and then mixing them down. The result would be acoustically identical if I applied the LPF to the mixed result instead, but it would use 1 LPF rather than 10. I've been able to cut a Synth's CPU usage in half by refactoring to avoid cases like these.
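Here's what that refactor looks like for the example above - mix first, then filter once:
sig = 10.collect { SinOsc.ar(1000.rand) }; // still 10 oscillator channels
sig = Mix(sig); // mix down to 1 channel first
sig = LPF.ar(sig, 500); // now only a single LPF runs
Here's a second example, where the extra channels sneak in less obviously: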
sig = 10.collect {
    SinOsc.ar(1000.rand)
    * LFPulse.ar([10, 12, 20])
};
sig = Mix(sig);
sig = LPF.ar(sig, 500);
In this case, I intended my sig to be 10 channels, as before. But I used an array argument for my LFPulse, which means each of those 10 signals is itself expanded to 3 channels by multichannel expansion. My Mix then mixes down to 3 channels rather than the single channel I was expecting, and anything downstream (like that LPF) runs 3 times.
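One possible fix, if the intent really was 10 mono signals feeding one LPF, is to collapse the three pulse rates inside each oscillator (here by summing them) so each element of sig stays a single channel:
sig = 10.collect {
    SinOsc.ar(1000.rand)
    * LFPulse.ar([10, 12, 20]).sum // sum the 3 pulse channels down to 1
};
sig = Mix(sig); // 10 channels mix down to 1
sig = LPF.ar(sig, 500); // a single LPF, as intended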
Both of these scenarios are very easy to hit through bad planning or simple oversight, and the problems compound in a complex SynthDef - you can end up running processes on MANY channels when you're expecting just stereo. Multichannel expansion can be difficult to reason about sometimes. The easiest way to triage is to use ".postln" inside your SynthDef, which will tell you how many channels a particular signal has at that point in the graph. If you see something radically unexpected (why do I have 20 channels here?), work your way backwards and either figure out your mistake or add a Mix to bring the channel count back down.
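To make the .postln trick concrete: before the SynthDef is sent to the server, a multichannel signal is just an Array of UGens, so posting it (or, more compactly, its size) inside the SynthDef function prints the channel count when you .add the def. A quick sketch using the example above (the def name is just a placeholder):
SynthDef(\channelCheck, {
    var sig = 10.collect {
        SinOsc.ar(1000.rand) * LFPulse.ar([10, 12, 20])
    };
    sig.size.postln;    // -> 10 (the outer array)
    sig[0].size.postln; // -> 3 (each element has expanded to 3 channels)
    sig = Mix(sig);
    sig.size.postln;    // -> 3 (Mix only collapsed the outer layer)
    Out.ar(0, Mix(sig) * 0.01);
}).add;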
Issue three: over-use of audio rate UGens. I often build my SynthDefs using almost exclusively .ar, generally because it's easier than planning .ar versus .kr signal chains while I'm working. But it's incredibly important to remember: with the default block size of 64, an .ar UGen calculates 64 values for every 1 value a .kr UGen calculates. Audio rate UGens are often better optimized, but at least in theory an audio rate UGen can be 64x more expensive than its control rate counterpart. Modulating filter parameters with audio rate signals is an especially bad (and very common) case:
LPF.ar(sig, SinOsc.ar(1).range(100, 1000)) // audio rate modulator: 64 cutoff values per control period
LPF.ar(sig, SinOsc.kr(1).range(100, 1000)) // control rate modulator: 1 cutoff value per control period
These should sound identical, but one will use radically more CPU than the other. Look through your signal chain, and convert any slow-moving signals (envelopes, low-speed oscillators, etc) to .kr - you should start to see the CPU usage improve.
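As a sketch of what that conversion looks like in a complete SynthDef (the names here are placeholders): the audio path stays at .ar, while the envelope and the filter LFO run at .kr:
SynthDef(\krControls, { |out = 0, freq = 200, gate = 1|
    var env = EnvGen.kr(Env.adsr(0.01, 0.3, 0.5, 1), gate, doneAction: 2); // control rate envelope
    var cutoff = SinOsc.kr(0.5).range(200, 2000); // a slow LFO only needs control rate
    var sig = Saw.ar(freq) * env * 0.2; // the audible signal path itself stays .ar
    Out.ar(out, LPF.ar(sig, cutoff) ! 2);
}).add;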
Tom is right about Supernova - but splitting your SynthDefs up (into multiple Synths running in a ParGroup) to make use of multiple Supernova threads can be as much work as any of the above techniques. There is no free ride - at the end of the day, you've got to look at your code, understand what it's doing, and make some choices about how you can do it differently.
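For reference, the Supernova route looks roughly like this - independent synths go into a ParGroup so the server can process them on parallel threads (deciding what is actually safe and sensible to parallelize is the real work):
// Server.supernova; // switch to the supernova server program before booting
// s.boot;
~par = ParGroup.new;
8.do { Synth(\default, [\freq, 200 + 800.0.rand], ~par) }; // these nodes may be processed in parallel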
And while you're refactoring - make sure you save a copy of your original, in case you end up making things worse (or - to be optimistic - better!) than they were originally.
Good luck!
Scott