Usage

You’ve probably seen this on some internet prompts:

SET CLIP SKIP TO 2 !!!

The syntax for this in our software is:

/render /clipskip:2  and then your brilliant prompt, etc.

What is Clip Skip?

Clip Skip is a feature that literally skips part of the prompt-interpretation stage of image generation, leading to slightly different results. This means the image renders slightly faster, too.

But why would anyone want to skip part of the process?

A typical Stable Diffusion 1.5 base model interprets your prompt through 12 CLIP layers, as in levels of refinement. The early layers are very broad, and as we arrive at the later layers, the interpretation becomes more clear and specific. In the case of some base models, especially those trained on Danbooru tags, trial and error showed that the last, hyper-specific layers added unwanted noise. You can literally skip those layers, saving some GPU time and getting better art.
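If you're curious what this looks like under the hood, here is a minimal sketch using the Hugging Face transformers library rather than our /render backend. The model ID and variable names are illustrative; this is roughly how tools that support clip skip pick an earlier CLIP layer, not our exact implementation.

import torch
from transformers import CLIPTokenizer, CLIPTextModel

# SD 1.5's text encoder is OpenAI's CLIP ViT-L/14, with 12 transformer layers.
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

tokens = tokenizer(
    "a cow standing in a field",
    padding="max_length",
    max_length=tokenizer.model_max_length,
    return_tensors="pt",
)

with torch.no_grad():
    output = text_encoder(tokens.input_ids, output_hidden_states=True)

# hidden_states holds the token embeddings plus the output of each of the
# 12 layers. Clip skip 1 means "use the final layer" (the normal default);
# clip skip 2 means "stop one layer early", and so on.
clip_skip = 2
hidden = output.hidden_states[-clip_skip]

# Implementations typically re-apply the encoder's final layer norm to the
# earlier hidden state before handing it to the diffusion model.
conditioning = text_encoder.text_model.final_layer_norm(hidden)
print(conditioning.shape)  # torch.Size([1, 77, 768])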

Explain it to me like I’m 5

User 5KilosOfCheese offers this brilliant analogy:

If you want a picture of “a cow” you might not care about the sub-categories of “cow” the text model might have. Especially since these can have varying degrees of quality. So if you want “a cow”, you might not want “an Aberdeen Angus bull”.

You can imagine CLIP skip to basically be a setting for “how accurate you want the text model to be”. You can test it out and see that each CLIP stage adds more definition, in the description sense. So if you have a detailed prompt about a young man standing in a field, with lower CLIP stages you’d get a picture of “a man standing”, then deeper “a young man standing”, “a young man standing in a field”… etc. CLIP skip really becomes good when you use models that are structured in a special way, where a “1girl” tag can break down to many sub-tags that connect to that one major tag.

Do I need it?

It’s a minor optimization, only recommended for hardcore, quality-obsessed enthusiasts. If you’re working on anime or semi-realism, it’s worth a try.

Limitations

How many to skip

Generally speaking, skip 1 or 2 layers and things can turn out pretty good. Go beyond 2 and things start looking like images generated with low guidance.

Inconsistent compatibility

Clip Skip has become one of those “wear your seatbelts” kind of safe defaults, where many people just prefer to set it and forget it. This isn’t wise. The feature can also produce unpredictable results when combined with other technologies, such as LoRAs and Textual Inversions: missing layers where layers are expected can make the image worse, or do nothing at all.

Newer models unaffected

Stable Diffusion 2.1 models use a different CLIP technology (OpenCLIP), so Clip Skip makes no difference with them. Remembering which models are and aren’t SD 1.5 based can be tricky, though, even for us. For example, you would think that Realistic Vision 2.0 isn’t SD 1.5 based, as it kicks the realism tail of the stock 2.1 model, but it is actually a byproduct of Stable Diffusion 1.5. Yep.

It’s faster to just try it than to research it. Just make sure you also lock seed, guidance, concept, and sampler to accurately compare the differences, as in the sketch below.
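Here is what that comparison workflow might look like as code, sketched with the diffusers library rather than our /render command. The model ID, prompt, and seed are just examples, and note that recent versions of diffusers count the number of layers skipped, so its clip_skip=1 corresponds to the common “clip skip 2” (penultimate layer) convention used above.

import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

prompt = "young man standing in a field, detailed, anime style"

# Sweep clip skip 1-3 while pinning everything else: same seed, same
# guidance, same prompt, same sampler (the pipeline's default scheduler).
for ui_clip_skip in (1, 2, 3):
    # Re-seeding each pass keeps the starting noise identical, so the
    # only variable is the clip skip setting.
    generator = torch.Generator("cuda").manual_seed(123456)
    image = pipe(
        prompt,
        # diffusers convention: None = final layer, 1 = penultimate, etc.
        clip_skip=None if ui_clip_skip == 1 else ui_clip_skip - 1,
        guidance_scale=7.5,
        num_inference_steps=30,
        generator=generator,
    ).images[0]
    image.save(f"clipskip_{ui_clip_skip}.png")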