What is Clip Skip and what does it do in Stable Diffusion?

Updated: Clip Skip currently requires /parser:new, as we have a new text parser for handling weights. This syntax is temporary while we debug the new weights system.

Usage

You’ve probably seen this in some prompts around the internet:

SET CLIP SKIP TO 2 !!!

The syntax for this in our software is:

/render /parser:new /clipskip:2  and then your brilliant prompt, etc.

What are the number ranges?

Integers only; decimal values are rounded down (truncated). Use whole numbers.

You can input /parser:new /clipskip:1.9, but it will behave the same as /parser:new /clipskip:1. The result only changes when you reach the next whole number, such as /parser:new /clipskip:2.

In terms of range, this can go up to /parser:new /clipskip:13, but honestly, above /clipskip:5 the results go deep into wacky territory.
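If you are curious how that rounding behaves, here is a tiny Python sketch of the rule described above. It is an illustration only, not our actual parser code, and the clamping to 13 is an assumption based on the stated upper limit:

    import math

    def normalize_clipskip(value: float) -> int:
        # Illustrative only: decimals are truncated (floored), and we assume
        # the value is clamped to the documented range of 1..13.
        return max(1, min(13, math.floor(value)))

    print(normalize_clipskip(1.9))  # 1  -- same result as /clipskip:1
    print(normalize_clipskip(2.0))  # 2  -- crossing a whole number changes the result
    print(normalize_clipskip(20))   # 13 -- assumed upper clamp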

What is Clip Skip?

Clip Skip is a feature that skips the last part of the prompt-interpretation (CLIP) stage of image generation, leading to slightly different results. The image renders slightly faster, too.

But why would anyone want to skip part of the process?

In a typical Stable Diffusion 1.5 base model, your prompt is processed by a CLIP text encoder made up of 12 “clip” layers, as in levels of refinement. The early layers are very broad, while the later layers become more clear and specific. In the case of some base models, especially those trained on Danbooru tags, trial and error showed that the last, most specific layers added unwanted noise. You can literally skip those layers, saving some GPU time and getting better art.
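For the technically curious, here is a minimal sketch of what skipping CLIP layers looks like outside of our software, using the open-source Hugging Face transformers library. The model ID, the final layer norm step, and the “count back from the last layer” convention are assumptions based on common community implementations, so treat this as an illustration rather than a description of our backend:

    import torch
    from transformers import CLIPTokenizer, CLIPTextModel

    # SD 1.5 uses a CLIP ViT-L/14 text encoder (12 layers, 768-dim embeddings).
    model_id = "openai/clip-vit-large-patch14"
    tokenizer = CLIPTokenizer.from_pretrained(model_id)
    text_encoder = CLIPTextModel.from_pretrained(model_id)

    prompt = "a young man standing in a field"
    tokens = tokenizer(prompt, padding="max_length", truncation=True,
                       max_length=tokenizer.model_max_length, return_tensors="pt")

    clip_skip = 2  # 1 = use the final layer (no skip), 2 = stop one layer early, etc.

    with torch.no_grad():
        out = text_encoder(**tokens, output_hidden_states=True)
        # hidden_states[-1] is the last encoder layer; counting back clip_skip
        # layers picks the earlier, "broader" representation the U-Net would see.
        hidden = out.hidden_states[-clip_skip]
        # Most implementations still apply the final layer norm afterwards.
        prompt_embeds = text_encoder.text_model.final_layer_norm(hidden)

    print(prompt_embeds.shape)  # torch.Size([1, 77, 768])

Each extra step back gives the image model a broader, less specific reading of the prompt, which is exactly the “skip the sub-categories” effect described in the analogy below.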

Explain it to me like I’m 5

User 5KilosOfCheese offers this brilliant analogy:

If you want a picture of “a cow”, you might not care about the sub-categories of “cow” the text model might have, especially since these can have varying degrees of quality. So if you want “a cow”, you might not want “an Aberdeen Angus bull”. (The full post is at the bottom of this page.)

You can imagine CLIP skip as basically a setting for “how accurate you want the text model to be”. You can test it out and see that each clip stage has more definition in the descriptive sense. So if you have a detailed prompt about a young man standing in a field, with lower clip stages you’d get a picture of “a man standing”, then deeper, “young man standing”, “young man standing in a forest”… etc. CLIP skip really becomes good when you use models that are structured in a special way, where a “1girl” tag can break down into many sub-tags that connect to that one major tag.

Do I need it?

It’s a minor optimization only recommended for hardcore, quality-obsessed enthusiasts. If you’re working on anime or semi-realism, it’s worth a try.

Limitations

How many to skip

Generally speaking, a clip skip of 1 or 2 can turn out pretty well. Go above 2 and things start looking like images rendered with low guidance.

Inconsistent compatibility

Clip Skip has become one of those “wear your seat belt” kinds of safe defaults, where many people just prefer to set it and forget it. This isn’t wise. The feature can also produce unpredictable results when combined with other technologies, such as LoRAs and Textual Inversions, so keep that in mind. Missing layers where layers are expected can make the image worse, or do nothing at all.

Newer models unaffected

Stable Diffusion 2.1 models use a different CLIP technology (OpenCLIP), so Clip Skip makes no difference there. Remembering which models are and aren’t SD 1.5-based can be tricky, though, even for us. For example, you would think that Realistic Vision 2.0 isn’t SD 1.5-based since it beats the stock 2.1 model at realism, but it is actually derived from Stable Diffusion 1.5. Yep.

It’s often faster to simply try it and compare than to research a model’s lineage. Just make sure you lock the seed, guidance, concept, and sampler so you can accurately compare the differences.
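If you prefer to run that comparison on your own hardware, here is a hedged sketch using the open-source diffusers library (not our /render backend). Recent diffusers releases accept a clip_skip argument on the pipeline call, though its numbering may be offset by one from our /clipskip values, so double-check the library’s documentation; the model ID and file names below are illustrative:

    import torch
    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
    ).to("cuda")

    prompt = "a young man standing in a field, golden hour, detailed"

    # Lock everything except clip skip: same seed, guidance, steps, and sampler.
    for skip in (1, 2, 3):
        generator = torch.Generator("cuda").manual_seed(12345)
        image = pipe(
            prompt,
            guidance_scale=7.5,
            num_inference_steps=30,
            generator=generator,
            clip_skip=skip,  # the only variable in this comparison
        ).images[0]
        image.save(f"clipskip_{skip}.png")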

Additional information from the Telegram chats and GitHub that our users found helpful:

MrBreakIt M !!, [6/3/2023 3:15 PM]
The CLIP model (the text embedding present in 1.x models) has a structure that is composed of layers. Each layer is more specific than the last. For example, if layer 1 is “person”, then layer 2 could be “male” and “female”; then if you go down the path of “male”, layer 3 could be man, boy, lad, father, grandpa… etc. Note this is not exactly how the CLIP model is structured, but it works for the sake of example.
The 1.5 model, for example, is 12 ranks deep, where the 12th layer is the last layer of text embedding. Each layer is a matrix of some size, and each layer has additional matrices under it. So a 4×4 first layer has four 4×4 matrices under it… so on and so forth. So the text space is dimensionally f— huge.
Now why would you want to stop earlier in the CLIP layers? Well, if you want a picture of “a cow” you might not care about the sub-categories of “cow” the text model might have, especially since these can have varying degrees of quality. So if you want “a cow” you might not want “an Aberdeen Angus bull”.
You can imagine CLIP skip as basically a setting for “how accurate you want the text model to be”. You can test it out, with an XY script for example. You can see that each clip stage has more definition in the descriptive sense. So if you have a detailed prompt about a young man standing in a field, with lower clip stages you’d get a picture of “a man standing”, then deeper, “young man standing”, “young man standing in a forest”… etc.
CLIP skip really becomes good when you use models that are structured in a special way, like Booru models, where a “1girl” tag can break down into many sub-tags that connect to that one major tag. Whether you get any use out of clip skip is really just trial and error.
Now keep in mind that CLIP skip only works in models that use CLIP and/or are based on models that use CLIP, as in 1.x models and their derivatives. 2.0 models and their derivatives do not interact with CLIP because they use OpenCLIP.
Great explanation. How does this differ from CFG, though? Or is CFG a bit more of a basic version of this?
No. CFG is entirely different. Imagine that the process of making the image is a journey. The prompt is instructions for where to go, but not how to get there. The task of the AI is basically to do this simple piece of maths: prompt – latent interrogation ≈ 0. CFG is basically a multiplier of the prompt. So (prompt × cfg) – interrogation tokens ≈ 0. This is horribly simplified, but enough for the sake of giving you the idea, so you can research further.
Now, let’s imagine a theoretical (or practical, if you adjust the settings of your SD) 0 CFG. This would be the AI generating whatever it needs to turn the random Gaussian noise into something that approximates 0 when it is interrogated. Remember that the AI will ALWAYS find something when it looks at a picture or noise; that is what that portion of the AI does, it can’t say “I see nothing”. As we know, if you multiply something by 0, you get 0. So (prompt × 0) – interrogation ≈ 0 turns into interrogation ≈ 0. On the other extreme, the theoretical max CFG would just mean that the AI will ONLY and ONLY find whatever it finds behind those tokens in the prompt. This is why you get very extreme distortions and contrast: you are basically extracting the purest representation of what the AI has for those tokens in your prompt. They are just the averaged total of a pattern, calculated during the training process.
So CFG is really just you telling the AI how specific it should be with the prompt. Imagine that we have an arbitrary scale of CFG: 1. this universe, 2. our galaxy cluster, 3. our galaxy, 4. our part of the galaxy, 5. our solar system, 6. Earth, 7. one side of the globe, 8. a continent, 9. a country, 10. a region of a country, 11. a city, 12. part of the city, 13. a specific street, 14. a specific address on the street, 15. a specific building at that address, 16. a specific floor of the building, 17. a specific apartment on that floor, 18. a specific room in that apartment, 19. a specific shelf in that room, 20. a specific box on that shelf… etc. As you can see, at some point things just get unnecessarily specific. This is why CFG is also called the “creativity slider”, defining how creative the AI can be; basically, how much other stuff it can go and find to fill in the picture. The AI has a limited capacity for “understanding”, and the broader the term, the more value it has. Face is a very broad term; eyes, nose, and mouth are less broad, however they are connected to face.
So CFG has nothing to do with the structure of the model, as in whether it uses clip layers or not. It is only about how specifically the AI navigates in said model, not how the model is navigated in. In the case of the absolute theoretical maximum CFG, if you prompted “boy’s face” you’d get basically whatever is the most fundamental definition of that. I have tested this myself, going to 100 CFG with lots of steps (thousands).
At some point, things just break down into geometric shapes and symmetrical things. Think of it like this: what is “face”, fundamentally? Not talking about a human face specifically. It is just a plane of a geometric shape.
As I explored the higher scales, a face became a sort of diamond shape, in peach colour, with black holes in which there were white circles, between them a sort of a triangle, and under that a few lines. The AI was not “wrong” as it made this. Those are basically the fundamental patterns of a human face. However, it did not fetch things like skin texture, hair, depth… etc. It got EXACTLY what it needed to meet the requirements of the prompt.
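For readers who want the precise arithmetic behind the CFG discussion above, this is the standard classifier-free guidance mixing step as it appears in most open-source samplers. It is a simplified sketch with illustrative variable names, not our exact implementation:

    # At every denoising step the model predicts noise twice:
    # once with the prompt and once with an empty ("unconditional") prompt.
    # noise_uncond: prediction without the prompt
    # noise_text:   prediction with the prompt
    # cfg:          the guidance scale
    def apply_cfg(noise_uncond, noise_text, cfg):
        # cfg = 0 ignores the prompt entirely; a very large cfg pushes the
        # result toward the "purest" representation of the prompt tokens.
        return noise_uncond + cfg * (noise_text - noise_uncond)

Clip Skip changes which CLIP layer produces the prompt embedding before any of this happens; CFG only changes how hard that embedding is pushed during denoising, which is why the two settings are independent.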