Understanding ComfyUI Model Types: A Complete Guide

Introduction

If you’ve ever opened ComfyUI’s models folder and wondered what all those subfolders are for, you’re not alone. ComfyUI is a powerful, node-based interface for AI image generation — and it uses a surprisingly large number of specialized model files, each with a distinct job.

Unlike simpler one-click generators, ComfyUI gives you direct access to every component in the image generation pipeline. That flexibility is what makes it a favorite among AI companion platforms and creators who need fine-grained control over their visuals. But it also means there are a lot of moving parts to understand.

This guide walks through every model type you’ll encounter in ComfyUI, explains what each one does, and shows how they work together to produce the kind of high-quality, consistent visuals that power modern AI companion platforms.

Core Generation Models

These are the foundational models — the ones that actually generate images from text prompts.

Checkpoints

Checkpoints are the main model files and the starting point for any ComfyUI workflow. A checkpoint is a single file that bundles three components together: the UNet (the denoising network that actually generates images), a text encoder (which interprets your prompts), and a VAE (which converts between pixel space and the latent space where generation happens).

SD 1.5 and SDXL checkpoints typically range from 2–7 GB depending on architecture and precision; newer architectures such as Flux can be considerably larger
Common architectures include SD 1.5, SDXL, and newer models like Flux
AI companion platforms often use specialized checkpoints — for example, Nova 3DCG XL is an SDXL checkpoint fine-tuned for 3DCG-style character rendering. To learn more about how 3DCG checkpoints power AI companion visuals, see our guide to AI companion visual technology.
You’d encounter these when setting up almost any generation workflow, since the checkpoint is usually the first model you load

Think of the checkpoint as the artist’s complete skill set. Everything else in this guide either refines that skill set or adds tools to the artist’s workspace.
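To make the “three components in one file” idea concrete, here is a minimal sketch that sorts tensor names by component. It relies on the key prefixes used by Stable Diffusion style single-file checkpoints; the sample key list is illustrative only, and the helper names are made up for this example.

```python
from collections import Counter

# Key prefixes found in Stable Diffusion style single-file checkpoints.
# SD 1.5 keeps its text encoder under "cond_stage_model.", SDXL under "conditioner.".
COMPONENT_PREFIXES = {
    "model.diffusion_model.": "unet",
    "first_stage_model.": "vae",
    "cond_stage_model.": "text_encoder",
    "conditioner.": "text_encoder",
}


def classify_key(key: str) -> str:
    """Return the component a tensor name belongs to, or 'other'."""
    for prefix, component in COMPONENT_PREFIXES.items():
        if key.startswith(prefix):
            return component
    return "other"


def summarize_components(keys) -> Counter:
    """Count how many tensors belong to each bundled component."""
    return Counter(classify_key(k) for k in keys)


# Illustrative tensor names only; a real checkpoint holds thousands of keys.
sample_keys = [
    "model.diffusion_model.input_blocks.0.0.weight",
    "model.diffusion_model.out.2.bias",
    "first_stage_model.decoder.conv_in.weight",
    "cond_stage_model.transformer.text_model.embeddings.token_embedding.weight",
    "model_ema.decay",
]

summary = summarize_components(sample_keys)
print(dict(summary))
```

With a real file you would feed in the actual tensor names instead of the sample list. A checkpoint that reports zero VAE or text encoder tensors is a hint that it expects those components to be loaded separately.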

Diffusion Models

The diffusion models folder (sometimes labeled diffusion_models or unet) holds standalone denoising networks: a UNet in SD 1.5 and SDXL, or a diffusion transformer in newer architectures like Flux and SD3. This is the core neural network that transforms noise into images, separated from the other checkpoint components.

Used when loading model components individually rather than as a merged checkpoint
The trend in newer architectures like Flux and SD3 is toward component-based loading, where you pick your own text encoder and VAE rather than relying on whatever was bundled into a checkpoint
Gives you more flexibility to mix and match components for different results

You’d encounter these when working with newer model architectures or when a workflow specifically calls for separate component loading.

VAE

A VAE (Variational Autoencoder) handles the translation between pixel space — the actual images you see — and latent space, the compressed mathematical representation where the diffusion model does its work. It encodes images down into latent representations and decodes them back into viewable images.

Different VAEs can noticeably affect color accuracy, saturation, and fine detail quality
Most checkpoints include a built-in VAE, but you can swap in a standalone one for better results
Common standalone VAEs include those optimized for SDXL or for more accurate color reproduction
You’d encounter this when images look washed out or have color shifts — swapping the VAE is often the fix
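The size of the latent space is easy to work out by hand. The sketch below assumes the SD 1.5 and SDXL convention of 4 latent channels and 8× spatial downsampling; other architectures differ (Flux, for example, uses 16 latent channels), so both values are parameters. The function names are made up for this example.

```python
def latent_shape(width: int, height: int, channels: int = 4, downscale: int = 8):
    """Return (channels, latent_height, latent_width) for a given image size."""
    if width % downscale or height % downscale:
        raise ValueError(f"width and height must be multiples of {downscale}")
    return (channels, height // downscale, width // downscale)


def compression_ratio(width: int, height: int, channels: int = 4, downscale: int = 8) -> float:
    """How many RGB pixel values are represented by one latent value."""
    c, h, w = latent_shape(width, height, channels, downscale)
    return (3 * width * height) / (c * h * w)


print(latent_shape(1024, 1024))       # (4, 128, 128)
print(compression_ratio(1024, 1024))  # 48.0
```

That roughly 48× reduction is why the diffusion model can work so much faster in latent space, and also why the VAE’s decoding quality has such a visible effect on fine detail and color.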

Text Encoders

Text encoders are the models that convert your written prompts into numerical embeddings that the diffusion model can understand. The quality of your text encoder directly affects how well the model interprets and follows your prompts.

SD 1.5 uses a single CLIP text encoder
SDXL uses dual text encoders — OpenCLIP and CLIP working together for richer prompt understanding
Newer architectures like Flux pair a CLIP encoder with a T5-based encoder, which handles longer and more complex prompts
Companion platforms rely on strong text encoders to accurately translate descriptions of expressions, poses, and settings into matching visuals

You’d encounter standalone text encoders when using component-based loading or when upgrading the text understanding capabilities of your pipeline.

Prompt and Style Enhancement Models

These models modify or enhance what the core generation models produce, without replacing them.

Embeddings (Textual Inversions)

Embeddings, also known as textual inversions, are small files that add new concepts to the text encoder’s vocabulary without changing its weights. Each one maps a special trigger word to a learned vector representation, so when you include that trigger word in your prompt, the model “understands” the concept the embedding encodes.

Typically range from a few kilobytes to a few hundred kilobytes, far smaller than any other model type in this guide
Commonly used as negative embeddings to avoid common artifacts like bad hands, distorted faces, or blurry output
Style embeddings can encode a consistent aesthetic so you don’t need to write lengthy style descriptions in every prompt
You’d encounter these when fine-tuning prompt quality or trying to achieve a specific look consistently. Our embedding optimization guide covers how to use negative embeddings to dramatically improve generation quality.
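In ComfyUI, an embedding is referenced directly in the prompt text as embedding:filename, optionally wrapped in the usual (term:weight) syntax. The small helper below is a sketch of how a negative prompt might be assembled; the embedding names are hypothetical placeholders, not real files.

```python
def embedding_token(name: str, weight: float = 1.0) -> str:
    """Format a prompt token that references an embedding file by name."""
    token = f"embedding:{name}"
    if weight != 1.0:
        # Same (term:weight) emphasis syntax used for ordinary prompt words.
        return f"({token}:{weight:g})"
    return token


def build_negative_prompt(terms, embeddings) -> str:
    """Join plain negative terms with (name, weight) embedding references."""
    parts = list(terms)
    parts += [embedding_token(name, weight) for name, weight in embeddings]
    return ", ".join(parts)


# "bad_hands" and "low_quality_v2" are hypothetical embedding file names.
negative = build_negative_prompt(
    ["blurry"],
    [("bad_hands", 1.0), ("low_quality_v2", 1.2)],
)
print(negative)
```

The resulting string goes into the text field of the negative prompt node like any other prompt.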

LoRAs (Low-Rank Adaptations)

LoRAs are small adapter files that modify a model’s behavior without replacing the base checkpoint. They work by adding learned low-rank adjustments to the model’s existing weights, most commonly in the attention layers. The technique is remarkably efficient: files are typically only 10–200 MB, yet they produce meaningful changes.

Character consistency: A LoRA trained on a specific character’s appearance helps them look the same from one generation to the next. This is how companion platforms maintain a consistent visual identity for each character.
Style transfer — LoRAs can shift the entire aesthetic of a model’s output, from warm cinematic lighting to cel-shaded illustration.
Concept learning — Teach the model new subjects, objects, or visual concepts it wasn’t trained on.
Multiple LoRAs can be stacked simultaneously, each with adjustable weight to control its influence. For practical recipes on combining LoRAs for unique character styles, see our LoRA combination recipes guide.

LoRAs are one of the most important technologies in the AI companion space. For a deeper look at how companion platforms use them for character identity, see our guide to AI companion visual technology.
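The “low-rank” part is what keeps LoRA files small. Instead of storing a full replacement for a weight matrix, a LoRA stores two thin matrices whose product is added to the original weights, scaled by a strength value. Here is a toy sketch in plain Python with tiny matrices; the function names are made up for this example, and real implementations also fold an alpha/rank scale factor into the strength.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def apply_lora(weight, down, up, strength=1.0):
    """Return weight + strength * (up @ down), leaving the inputs untouched."""
    delta = matmul(up, down)
    return [
        [w + strength * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


def lora_param_count(out_dim: int, in_dim: int, rank: int) -> int:
    """Values stored by a rank-`rank` adapter for one out_dim x in_dim layer."""
    return rank * (in_dim + out_dim)


base = [[1.0, 0.0], [0.0, 1.0]]

# Two toy rank-1 adapters: `up` is out_dim x rank, `down` is rank x in_dim.
up_a, down_a = [[1.0], [2.0]], [[0.5, 1.0]]
up_b, down_b = [[0.0], [1.0]], [[1.0, 0.0]]

# Stacking: apply one adapter after another, each with its own strength.
stacked = apply_lora(apply_lora(base, down_a, up_a, 0.5), down_b, up_b, 1.0)
print(stacked)  # [[1.25, 0.5], [1.5, 2.0]]

# A 768 x 768 layer stores 589,824 values; a rank-8 adapter stores 12,288.
print(768 * 768, lora_param_count(768, 768, 8))
```

Because each adapter is just an additive delta, stacking several LoRAs amounts to summing their scaled deltas, which is why each one can have its own weight.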

Hypernetworks

Hypernetworks are an older fine-tuning approach that modifies the model’s cross-attention layers through small auxiliary networks. They served a similar purpose to LoRAs — style transfer and concept learning — but have been largely superseded.

Less flexible and generally lower quality than LoRAs
Still supported in ComfyUI for backward compatibility
You might encounter them when using older model resources or community-shared assets from the early Stable Diffusion era

Style Models

Style models are dedicated model files designed specifically for applying artistic styles to generated images. Unlike LoRAs, which are general-purpose adapters, style models are purpose-built for style transfer tasks.

Used with specialized style-transfer nodes in ComfyUI
Apply a consistent artistic treatment across different subjects and compositions
Useful when you want a specific artistic look that goes beyond what prompt engineering or LoRAs can achieve

Image Enhancement and Upscaling

These models improve image quality after the initial generation, particularly resolution and detail.

Upscale Models

Upscale models are traditional (non-diffusion) super-resolution networks that increase an image’s pixel dimensions. Models like RealESRGAN, SwinIR, and ESPCN take a generated image and scale it up — typically 2× or 4× — while adding realistic detail.

Work on the final decoded image (pixel space), not in latent space
Fast and predictable: they preserve the image’s content and composition, reconstructing sharper texture rather than inventing new elements
Different upscale models are optimized for different content types: some excel at photorealism, others at illustrated or anime-style content
You’d encounter these in almost any production workflow, since most generation happens at lower resolutions for speed

Latent Upscale Models

Latent upscale models work in latent space before the image is decoded by the VAE. This means the diffusion model can add new detail and coherence during the upscaling process, rather than simply interpolating pixels.

Often produces more coherent results than pixel-space upscaling for AI-generated content
The diffusion model “imagines” new detail that’s consistent with the existing content
Takes longer than traditional upscaling but can produce noticeably better results
Particularly useful for companion avatars where facial detail and skin texture quality matter

VAE Approx

VAE approx models are lightweight, approximate VAE decoders used for fast previews during the generation process. Rather than running the full VAE decode on every denoising step — which would be slow — these give you a rough visual preview of what’s being generated.

Let you see a blurry but recognizable preview of the image as it forms
Not used for final output — the full VAE handles the final decode
Helpful for quickly checking whether a generation is going in the right direction before it finishes

Conditioning and Control Models

These models give you spatial and structural control over what gets generated and where.

ControlNet

ControlNet models add spatial conditioning to the generation process, allowing you to guide the structure of an image using reference inputs like pose skeletons, depth maps, edge detection maps, or segmentation masks.

Pose control: Provide a skeleton pose and the model generates a character that closely follows it. Companion platforms use this to generate consistent character poses across different scenes.
Depth control — Supply a depth map to define the spatial layout of a scene.
Edge and line control — Use detected edges or line art as a structural guide for the generation.
Expression control — Some ControlNet models can guide facial expressions, which is particularly valuable for companion platforms that generate emotional responses.

ControlNet is one of the most impactful technologies for companion platforms, enabling the kind of dynamic, context-aware visuals that make interactions feel responsive and alive. For a hands-on walkthrough of ControlNet workflows, see our ControlNet fundamentals guide.

CLIP Vision

CLIP Vision (clip_vision) is the image-understanding component of the CLIP model. While the text encoder side of CLIP converts words into embeddings, CLIP Vision converts images into embeddings — allowing the generation pipeline to “see” and reference images.

Powers IP-Adapter workflows, where a reference image guides the style or subject of new generations
Enables image-to-image style transfer without needing a LoRA
Used for visual similarity search and reference-based generation
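You’d encounter this when using features that take a reference image as conditioning, such as IP-Adapter or style models; plain image-to-image workflows encode the input with the VAE instead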

GLIGEN

GLIGEN is a grounded generation model that allows you to place specific subjects at defined positions within an image using bounding boxes. Instead of hoping the model puts your subject in the right place, you tell it exactly where things should go.

Define bounding boxes with associated text descriptions for precise spatial control
Useful for scene composition where multiple elements need specific placement
Companion platforms could use this for generating scenes with precise character positioning — placing a companion at a café table or standing in a doorway
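A grounded layout is essentially a list of text-plus-box pairs. The sketch below shows one hypothetical way to represent such boxes in pixels and convert them to normalized corner coordinates; the class and function names are illustrative and are not part of ComfyUI or GLIGEN.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GroundedBox:
    """A text description tied to a pixel-space rectangle on the canvas."""
    text: str
    x: int
    y: int
    width: int
    height: int


def normalize(box: GroundedBox, canvas_w: int, canvas_h: int):
    """Return (x_min, y_min, x_max, y_max) as fractions of the canvas size."""
    if box.x < 0 or box.y < 0:
        raise ValueError("box starts outside the canvas")
    if box.x + box.width > canvas_w or box.y + box.height > canvas_h:
        raise ValueError("box extends past the canvas edge")
    return (
        box.x / canvas_w,
        box.y / canvas_h,
        (box.x + box.width) / canvas_w,
        (box.y + box.height) / canvas_h,
    )


layout = [
    GroundedBox("a woman seated at a café table", 256, 128, 512, 768),
    GroundedBox("a cup of coffee", 640, 704, 128, 128),
]

for box in layout:
    print(box.text, normalize(box, 1024, 1024))
```

Validating boxes against the canvas up front catches layout mistakes before a generation is spent on them.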

Specialized and Advanced Models

These models serve more niche or emerging use cases within the ComfyUI ecosystem.

PhotoMaker

PhotoMaker is a model designed for generating consistent character portraits from reference photos. Given one or more reference images of a person, it can generate new images that maintain the subject’s identity across different poses, expressions, and settings.

Particularly relevant for companion platforms that let users customize their companion’s appearance based on uploaded reference images
Preserves identity without requiring LoRA training; this is faster to set up but sometimes less precise than a trained LoRA
Works well for portrait and upper-body compositions

Classifiers

Classifiers are models that analyze generated images rather than creating them. They evaluate images for specific properties and return scores or categories.

Safety filters — Detect and flag content that violates platform policies
Quality scoring — Rate images on technical quality metrics like sharpness, composition, and artifact presence
Content classification — Categorize images by type, mood, or content
Companion platforms rely on classifiers for automated content moderation and quality assurance, ensuring every image that reaches users meets their standards

Model Patches

Model patches (model_patches) are modular modifications that alter specific behaviors in the inference pipeline. Rather than being separate model files, they function as plug-in adjustments to how the diffusion process runs.

FreeU — Adjusts the model’s feature maps during inference to improve quality without retraining
Self-Attention Guidance (SAG) — Enhances image detail and coherence by modifying the self-attention mechanism
Applied dynamically during generation and can be toggled on or off per workflow
Think of these as runtime tweaks to the generation process — small adjustments that can noticeably improve output quality

Diffusers

Diffusers refers to models stored in Hugging Face’s Diffusers format — a directory structure with separate files for each component rather than a single merged checkpoint file.

ComfyUI can load these directly, which is useful for newer models that are only distributed in this format
The Diffusers format is becoming increasingly common as the ecosystem moves toward component-based architectures
You’d encounter this when downloading newer or experimental models from Hugging Face
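A Diffusers-format model is recognizable by a model_index.json file at the top of the directory, with one subfolder per component (for example unet, vae, and text_encoder). The following sketch checks for that layout and lists the components actually present on disk. The helper names are made up for this example, and the demo builds a throwaway directory rather than touching a real model.

```python
import json
import tempfile
from pathlib import Path


def is_diffusers_layout(path) -> bool:
    """A Diffusers model directory has a model_index.json at its root."""
    return (Path(path) / "model_index.json").is_file()


def list_components(path):
    """Return component names from model_index.json that exist as subfolders."""
    root = Path(path)
    index = json.loads((root / "model_index.json").read_text(encoding="utf-8"))
    # Keys starting with "_" are metadata such as "_class_name".
    return sorted(
        name for name in index
        if not name.startswith("_") and (root / name).is_dir()
    )


# Demo on a temporary directory that mimics the layout (no real weights).
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for sub in ("unet", "vae", "text_encoder"):
        (root / sub).mkdir()
    fake_index = {
        "_class_name": "StableDiffusionPipeline",
        "unet": ["diffusers", "UNet2DConditionModel"],
        "vae": ["diffusers", "AutoencoderKL"],
        "text_encoder": ["transformers", "CLIPTextModel"],
        "safety_checker": [None, None],  # listed but not shipped
    }
    (root / "model_index.json").write_text(json.dumps(fake_index), encoding="utf-8")
    detected = is_diffusers_layout(root)
    components = list_components(root)

print(detected, components)
```

A single .safetensors or .ckpt file, by contrast, has no such index and belongs in the checkpoints folder.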

Audio and Infrastructure

These are less common model folders that serve specialized or infrastructure purposes.

Audio Encoders

Audio encoders (audio_encoders) encode audio input for multimodal or audio-conditioned generation workflows.

Used in emerging pipelines that generate images conditioned on audio or music
Enable audio-reactive visualizations and music-to-image generation
Still a niche use case, but growing as multimodal AI capabilities expand

Download Model Base

The download model base (download_model_base) folder is not a model type in the traditional sense. It serves as a base configuration and storage path for ComfyUI’s model download manager.

Used by custom nodes that automatically download required models on first use
Helps manage model dependencies without manual downloading
You’d encounter this folder when using nodes that handle their own model management

How These Models Work Together

Understanding individual model types is useful, but the real picture emerges when you see how they combine in a typical workflow. Here’s a simplified example — generating a consistent AI companion portrait:

Load the foundation — A checkpoint (like Nova 3DCG XL) or separate diffusion model, text encoder, and VAE provide the base generation capability.
Apply character identity — A LoRA trained on the companion’s specific appearance ensures they look like themselves. Negative embeddings help avoid common artifacts.
Add spatial guidance: A ControlNet model with a pose reference ensures the companion has the right pose and expression for the conversation context.
Generate the image — The diffusion model runs through its denoising steps, guided by all the conditioning from text prompts, LoRAs, and ControlNets.
Enhance the output — An upscale model increases the resolution, and classifiers verify the image meets quality and safety standards before it’s shown to the user.
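The steps above can be sketched as a ComfyUI workflow in API format, where each node has a class_type and its inputs either hold literal values or point at another node’s output as [node_id, output_index]. This is a simplified illustration rather than a production workflow: every file name is a hypothetical placeholder, the sampler settings are arbitrary, and the classifier step is left out because it usually depends on custom nodes.

```python
import json

workflow = {
    # 1. Load the foundation (outputs: 0 = model, 1 = clip, 2 = vae).
    "1": {"class_type": "CheckpointLoaderSimple",
          "inputs": {"ckpt_name": "example_sdxl.safetensors"}},
    # 2. Apply character identity with a LoRA and a negative embedding.
    "2": {"class_type": "LoraLoader",
          "inputs": {"model": ["1", 0], "clip": ["1", 1],
                     "lora_name": "example_character.safetensors",
                     "strength_model": 0.8, "strength_clip": 0.8}},
    "3": {"class_type": "CLIPTextEncode",
          "inputs": {"clip": ["2", 1], "text": "portrait, soft smile, cafe interior"}},
    "4": {"class_type": "CLIPTextEncode",
          "inputs": {"clip": ["2", 1], "text": "blurry, embedding:example_negative"}},
    # 3. Add spatial guidance from a pose reference image.
    "5": {"class_type": "ControlNetLoader",
          "inputs": {"control_net_name": "example_openpose.safetensors"}},
    "6": {"class_type": "LoadImage", "inputs": {"image": "pose_reference.png"}},
    "7": {"class_type": "ControlNetApply",
          "inputs": {"conditioning": ["3", 0], "control_net": ["5", 0],
                     "image": ["6", 0], "strength": 0.9}},
    # 4. Generate the image in latent space, then decode it.
    "8": {"class_type": "EmptyLatentImage",
          "inputs": {"width": 1024, "height": 1024, "batch_size": 1}},
    "9": {"class_type": "KSampler",
          "inputs": {"model": ["2", 0], "positive": ["7", 0], "negative": ["4", 0],
                     "latent_image": ["8", 0], "seed": 42, "steps": 25, "cfg": 6.0,
                     "sampler_name": "euler", "scheduler": "normal", "denoise": 1.0}},
    "10": {"class_type": "VAEDecode", "inputs": {"samples": ["9", 0], "vae": ["1", 2]}},
    # 5. Enhance the output with a pixel-space upscale model.
    "11": {"class_type": "UpscaleModelLoader", "inputs": {"model_name": "example_4x.pth"}},
    "12": {"class_type": "ImageUpscaleWithModel",
           "inputs": {"upscale_model": ["11", 0], "image": ["10", 0]}},
    "13": {"class_type": "SaveImage",
           "inputs": {"images": ["12", 0], "filename_prefix": "companion"}},
}


def dangling_links(graph):
    """Return (node_id, input_name, target_id) for links to missing nodes."""
    missing = []
    for node_id, node in graph.items():
        for name, value in node["inputs"].items():
            if isinstance(value, list) and len(value) == 2 and value[0] not in graph:
                missing.append((node_id, name, value[0]))
    return missing


print(dangling_links(workflow))  # [] when every link resolves

# The JSON body a client would send to a running ComfyUI server's /prompt endpoint.
payload = json.dumps({"prompt": workflow})
```

Reading the links makes the data flow visible: the LoRA sits between the checkpoint and everything that uses the model or text encoder, the ControlNet modifies only the positive conditioning, and the upscale model touches the image only after the VAE has decoded it.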

Each model type plays a specific role, and ComfyUI’s node-based interface lets you wire them together in exactly the combination you need. This modularity is what gives companion platforms the flexibility to create diverse, high-quality visuals tailored to their specific aesthetic and quality requirements.

Conclusion

The ComfyUI ecosystem can feel overwhelming at first, with around twenty model types spread across a maze of subfolders. But each one exists for a reason, and understanding what they do helps you appreciate the sophisticated technology behind the AI companion visuals you interact with every day.

Whether you’re a creator building your own ComfyUI workflows or simply curious about how your favorite companion platform generates such compelling visuals, knowing these building blocks gives you a deeper understanding of what’s possible. The modularity of the system — the ability to swap checkpoints, stack LoRAs, add ControlNet guidance, and fine-tune every step of the pipeline — is what enables the diverse, high-quality imagery across different companion platforms.

As AI generation technology continues to advance, expect these model types to evolve as well. New architectures will emerge, existing categories will merge or split, and the line between generation and real-time rendering will continue to blur. But the fundamental principle will remain the same: specialized components working together to create something greater than any single model could achieve alone.

This guide was compiled from research into the ComfyUI model ecosystem and its applications in AI companion visual generation.