Learning ControlNet

ControlNet is a tool well known within the generative AI community. It is a framework that can be incorporated into an image creation program to "direct" the generation of output images.

The Functionality of ControlNet

What makes ControlNet uniquely effective is its ability to add new parameters to the image creation process by extracting new data from the input images. Each mode is designed to seek out specific visual data and comes equipped with a preprocessor that readies any reference image for inference.

Simplifying ControlNet: A Video Game Analogy

Think of ControlNet as a powerful equipment upgrade in an adventure RPG. Without ControlNet, our game character (the user) can only interact with the game environment (the output images) in basic ways, much like a character without any special abilities or items.

When a reference image is provided without ControlNet, the AI's interpretation varies depending on how much influence you give the image. Imagine it like adjusting the game's difficulty level. Lowering the influence is like increasing the difficulty: the AI abstracts more, like a game that offers broader, less specific clues. It bases its output on general elements such as the color palette and shapes in the image, so it takes more skill to communicate the details you want it to draw from the reference.

Raising the img2img influence is like lowering the difficulty level of the game. The AI's output will closely resemble the image you've provided. This can be very powerful when you simply want to alter the style of an input image. It can also be very limiting, depending on a user's goals.
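One way to picture the influence setting is as a choice of how much of the denoising process is re-run on the reference image. The sketch below is a minimal illustration under stated assumptions, not the program's actual implementation: `denoising_schedule` is a hypothetical helper, and it assumes influence is the inverse of the "strength" knob found in common img2img pipelines.

```python
def denoising_schedule(total_steps, influence):
    """Return (steps_skipped, steps_run) for a hypothetical img2img run.

    Assumption: influence is a value in [0, 1] and acts as the inverse of
    an img2img "strength" setting. High influence keeps more of the
    reference image by running fewer denoising steps on it.
    """
    if not 0.0 <= influence <= 1.0:
        raise ValueError("influence must be between 0 and 1")
    strength = 1.0 - influence
    steps_run = min(int(round(total_steps * strength)), total_steps)
    return total_steps - steps_run, steps_run


# Low influence: most steps are re-run, so the output drifts from the reference.
print(denoising_schedule(50, 0.2))
# High influence: few steps are re-run, so the output stays close to the reference.
print(denoising_schedule(50, 0.8))
```

In this toy model, an influence of 1.0 runs no denoising steps at all, which corresponds to returning the reference image nearly unchanged.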

Previously, this way of interpreting reference images made it challenging to reproduce specific aspects like poses, line work, shapes, and depth within an image. Using the earlier metaphor, it is much like how a basic game character build might lead to a player struggling with advanced challenges.

Leveling Up with ControlNet

But now, enter ControlNet, our game-changing equipment upgrade. It introduces preprocessors, which convert images to ‘mode maps’. Each preprocessor detects specific features of an image (depth, edges, poses, etc.). This information is turned into a mode map, which provides control over the output. In the game analogy, mode maps are like different types of game maps, each revealing specific details of the game world.

This feature allows users to exert more precision. Once the preprocessed image runs through the selected mode, which has been trained to identify very specific information from a particular kind of mode map, the model is able to produce far more nuanced outputs. It's like having an advanced power-up that lets your character interact with the game world in ways that were previously impossible, offering an enhanced and more controlled image generation experience.
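The flow described here (reference image, then preprocessor, then mode map, then model) can be sketched as a small lookup from mode name to preprocessor. Everything in this block is hypothetical and for illustration only: the two preprocessors are trivial stand-ins rather than real detectors, and `prepare_mode_map` and its `mode_mapping` flag are assumed names, not the program's API.

```python
from typing import Callable, Dict, List

Image = List[List[float]]  # toy grayscale image, values in [0, 1]


def bright_outline(image: Image) -> Image:
    """Stand-in preprocessor (hypothetical): keep only pixels above 0.5."""
    return [[1.0 if px > 0.5 else 0.0 for px in row] for row in image]


def inverted(image: Image) -> Image:
    """Stand-in preprocessor (hypothetical): flip dark and bright values."""
    return [[1.0 - px for px in row] for row in image]


# Hypothetical registry: each mode name maps to its own preprocessor.
PREPROCESSORS: Dict[str, Callable[[Image], Image]] = {
    "structure": bright_outline,
    "depth": inverted,
}


def prepare_mode_map(image: Image, mode: str, mode_mapping: bool = True) -> Image:
    """Return the map that would be handed to the selected mode's model."""
    if not mode_mapping:
        # The upload is treated as an already-prepared map: skip preprocessing.
        return image
    return PREPROCESSORS[mode](image)


reference = [[0.2, 0.9], [0.6, 0.1]]
print(prepare_mode_map(reference, "structure"))
print(prepare_mode_map(reference, "depth", mode_mapping=False))
```

The point of the design is that the model never sees the raw reference image in a given mode, only the map its preprocessor produced.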

Mapping Examples

These are examples of the preprocessed images that the model uses as reference in the image generation process and not a reflection of the final output.

To make the ControlNet process easier to visualize, we’ve created a grid of images and run it through our basic modes. Our basic modes consist of Structure, Pose, Depth, Lines, and Segmentation. You will see additional modes such as City and Interior - these are ‘Advanced’ modes which use a mix of models and preprocessors.

Sample 1

For this grid we’ve picked a varied range of images to give better insight into the variety of possible outputs.

Structure (Canny)

Structure Mode is known in the community as a Canny Map. This preprocessor leverages a “Canny Edge Detector,” which is a well-established edge detection technique in computer vision. Compared with the other modes, Structure Mode retains more details from the original image.

Sample

As you can see, Structure picks up a significant number of details.
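To give a feel for why an edge detector picks up so much detail, here is a minimal pure-Python sketch of two stages of Canny-style detection: computing Sobel gradients and applying a double threshold. It is a simplification for illustration, not the preprocessor itself. A full Canny detector also applies Gaussian smoothing, non-maximum suppression, and hysteresis edge tracking, all omitted here. The function name `edge_map` and the threshold values are assumptions.

```python
import math

SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]


def edge_map(image, low, high):
    """Label each interior pixel 0 (no edge), 1 (weak edge) or 2 (strong edge).

    Border pixels are left at 0 because the 3x3 window does not fit there.
    """
    height, width = len(image), len(image[0])
    out = [[0] * width for _ in range(height)]
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            gx = gy = 0.0
            for dy in range(3):
                for dx in range(3):
                    px = image[y + dy - 1][x + dx - 1]
                    gx += SOBEL_X[dy][dx] * px
                    gy += SOBEL_Y[dy][dx] * px
            magnitude = math.hypot(gx, gy)
            if magnitude >= high:
                out[y][x] = 2
            elif magnitude >= low:
                out[y][x] = 1
    return out


# Toy 5x5 image: dark on the left, bright on the right.
image = [[0.0, 0.0, 1.0, 1.0, 1.0] for _ in range(5)]
for row in edge_map(image, low=1.0, high=3.0):
    print(row)
```

Only the pixels next to the dark-to-bright boundary are marked as edges. Any sharp change in brightness triggers a response, which is consistent with Structure maps carrying a lot of fine detail.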

Depth Mode

Depth Mode generates a depth map during the preprocessing stage. The depth map captures how near or far each part of the reference image appears, and this information is then interpreted by the Depth Mode model in conjunction with a custom model.

Sample

The sketch of a house was very obviously 2D, and as you can see the Depth Mode has not produced as much information for it.
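The effect of a flat input can be illustrated with a toy depth map. The sketch below converts a grid of estimated distances into grayscale values; the distances themselves would come from a depth estimation model, which is not shown. The function name `depth_to_grayscale` is hypothetical, and the "nearer is brighter" convention is an assumption about how the map is rendered.

```python
def depth_to_grayscale(depth):
    """Map distances to 0-255 so that nearer surfaces come out brighter.

    Assumption: `depth` is a grid of estimated distances, one per pixel.
    """
    values = [d for row in depth for d in row]
    nearest, farthest = min(values), max(values)
    span = farthest - nearest
    if span == 0:
        # No depth variation at all: the map is a single uniform value.
        return [[0 for _ in row] for row in depth]
    return [[round(255 * (farthest - d) / span) for d in row] for row in depth]


scene = [[1.0, 2.0], [4.0, 5.0]]         # objects at varied distances
flat_sketch = [[2.0, 2.0], [2.0, 2.0]]   # everything at the same distance

print(depth_to_grayscale(scene))
print(depth_to_grayscale(flat_sketch))
```

The varied scene produces a map with a full range of values, while the flat input collapses to one value and gives the model almost nothing to work with.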

Pose Mode

As you can see from the images below, Pose mode draws the most complete information from realistic to semi-realistic images of people. That limitation applies only to the input images: Pose mode can still translate the detected poses to output characters in cartoon, anime, or other non-realistic styles. Note: Pose mode typically does not rescale its poses, so for smaller characters it is recommended to use images of people with proportions similar to those of your ideal output.

Sample

As you can see in this image, Pose mode recognizes a face only in the realistic human image. The cartoon character's body is also recognized, but its head and face are not, because they are not drawn in a realistic or semi-realistic style. For more examples of good reference images, see our deep dive article on Pose mode.

Edges

The Edges model and preprocessor identify straight lines and corners in a given image. This means that images with very few straight lines will give very little information to Edges mode. Edges is ideal for generating structures and other images where straight linework is of the utmost importance.

Sample

As you can see in this sample, the images with the most distinct straight lines and corners have the most complete mapping.

Segmentation

Segmentation mode creates hard, distinct segments of color to identify the general shape of the main objects it can detect in a reference image. Because only this segmented information is retained from the reference, the output takes on much of the style of the model being used.
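A segmentation map of this kind can be thought of as a grid of class labels rendered with one flat color per class. The following sketch shows only that rendering step; detecting the objects is the job of a segmentation model and is not shown. The label ids, class names, and palette colors are all hypothetical.

```python
# Hypothetical palette: one flat RGB color per detected class.
PALETTE = {
    0: (0, 0, 0),        # background
    1: (200, 40, 40),    # e.g. "building"
    2: (40, 160, 220),   # e.g. "sky"
}


def colorize(labels):
    """Turn a grid of class ids into a grid of flat RGB colors."""
    return [[PALETTE[label] for label in row] for row in labels]


labels = [
    [2, 2, 2],
    [1, 1, 2],
    [1, 1, 0],
]
for row in colorize(labels):
    print(row)
```

Every pixel in a segment gets exactly the same color, so texture, shading, and fine detail from the reference image are discarded and only the shapes survive.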

Sample

Line Art

Line Art mode identifies the natural edges in an image, first converting them into a Line Art mode map and then coloring them in based on the style of the model being used. It pays attention to shadow, color, and line art. This mode works both with images that are meant to look like traditional line art and with full color images, as the preprocessor creates a map like the one you will see below.

Sample

Normal Map Mode

Normal Map mode operates similarly to Depth Mode. The main difference is that Normal Map mode also brings in additional textural elements it perceives from the reference image. Users who work with normal maps can directly upload an image of their normal map and turn off mode mapping if they prefer.
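For readers unfamiliar with normal maps, the sketch below shows one common way they relate to depth: a surface normal can be estimated at each pixel from how quickly the depth changes around it, then encoded as an RGB color. This is a generic illustration using finite differences, not a description of the preprocessor used here, and the function names are hypothetical.

```python
import math


def normals_from_depth(depth):
    """Estimate a unit surface normal per pixel from a depth (height) grid.

    Uses central differences, clamped at the image border.
    """
    height, width = len(depth), len(depth[0])
    normals = []
    for y in range(height):
        row = []
        for x in range(width):
            dzdx = (depth[y][min(x + 1, width - 1)] - depth[y][max(x - 1, 0)]) / 2.0
            dzdy = (depth[min(y + 1, height - 1)][x] - depth[max(y - 1, 0)][x]) / 2.0
            nx, ny, nz = -dzdx, -dzdy, 1.0
            length = math.sqrt(nx * nx + ny * ny + nz * nz)
            row.append((nx / length, ny / length, nz / length))
        normals.append(row)
    return normals


def encode_rgb(normal):
    """Map each normal component from [-1, 1] to an integer in [0, 255]."""
    return tuple(int((c + 1.0) * 127.5 + 0.5) for c in normal)


flat = [[1.0, 1.0, 1.0] for _ in range(3)]   # flat surface
ramp = [[0.0, 1.0, 2.0] for _ in range(3)]   # surface sloping along x

print(encode_rgb(normals_from_depth(flat)[1][1]))
print(encode_rgb(normals_from_depth(ramp)[1][1]))
```

A flat surface encodes to the familiar pale blue of normal maps, while a slope shifts the color. Because the color depends on local slope rather than absolute distance, small surface variations show up clearly.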

Sample

Scribble Mode

Scribble is exactly what it sounds like: it takes rough sketches and uses their visual information to create more complex outputs based on your model style. Scribble is best used on simple drawings. However, it is entirely possible to input any image into Scribble Mode, and it will be preprocessed with varying potential outputs.