Mastering Flux AI with NF4: Speed and Quality Improvements

Overview

Hey there! So, you’ve got your hands on Flux AI, an amazing image generation tool by Black Forest Labs, huh? It’s pretty rad, right? But to truly unleash its power, especially with those nifty Flux checkpoints, you gotta know how to tweak it right. Let’s dive into how you can use different Flux checkpoints and get the best performance out of ‘em!

Supported Flux Checkpoints

1. Available Checkpoints

Looking for raw Flux or GGUF? Check out this post.

2. Why NF4?

  • Speed: For 6GB/8GB/12GB GPUs, NF4 can be 1.3x to 4x faster than FP8.
  • Size: NF4 weights are about half the size of FP8.
  • Accuracy: NF4 stores weights in small blocks, each with its own scale, so it often matches or beats FP8 in numerical precision and dynamic range.
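The size claim is easy to sanity-check with back-of-the-envelope arithmetic. Here's a sketch; the 64-element block size, the 4-byte per-block scale, and the ~12 billion parameter count for Flux-dev are assumptions based on common NF4 implementations, not exact figures:

```python
def checkpoint_bytes(n_params: int, bits: int, block: int = 64, scale_bytes: int = 4) -> int:
    """Approximate storage for n_params weights at the given bit width.

    Sub-8-bit formats store one scale per `block` weights; FP8 stores
    weights directly (per-tensor scale overhead ignored for simplicity).
    """
    data = n_params * bits // 8
    scales = (n_params // block) * scale_bytes if bits < 8 else 0
    return data + scales

# Flux-dev has roughly 12 billion parameters (approximate figure).
n = 12_000_000_000
fp8 = checkpoint_bytes(n, 8)
nf4 = checkpoint_bytes(n, 4)
print(f"FP8 ~ {fp8 / 1e9:.1f} GB, NF4 ~ {nf4 / 1e9:.1f} GB")  # NF4 is about half
```

The per-block scales are why NF4 is a bit more than exactly half the FP8 size, and also why it keeps good precision: each block of 64 weights is normalized by its own absmax.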

Using Flux Checkpoints

1. Set Up Your GPU

  • CUDA Support: If your device supports a CUDA version newer than 11.7, go for NF4. Congrats, you only need flux1-dev-bnb-nf4.safetensors.
  • Older GPUs: If you have an older GPU like GTX 10XX/20XX, download the flux1-dev-fp8.safetensors.

2. Loading in the UI

  • In the UI, Forge gives an option to force the loading weight type.
  • Generally, set it to Auto to use the default precision in your downloaded checkpoint.

Tip: Don’t load FP8 checkpoint with the NF4 option!

Boosting Inference Speed

1. Default Settings

  • Forge’s presets are fast, but you can push the speed limit even further.
  • Example System: 8GB VRAM, 32GB CPU memory, and 16GB shared GPU memory.

2. Offloading and Swapping

  • If the model is larger than your GPU memory, split it: load part onto the GPU and the rest into a "swap" location, either CPU memory or shared GPU memory.
  • Shared memory can be ~15% faster but might crash on some devices.
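The split itself is simple arithmetic: put as much of the model as fits on the GPU, and swap the remainder. A minimal sketch (the function and numbers are illustrative, not Forge internals):

```python
def plan_split(model_mb: int, gpu_free_mb: int) -> tuple[int, int]:
    """Split a model between GPU memory and a swap location
    (CPU or shared GPU memory). Returns (mb_on_gpu, mb_in_swap)."""
    on_gpu = min(model_mb, gpu_free_mb)
    return on_gpu, model_mb - on_gpu

# A ~6.8 GB NF4 Flux model on an 8 GB card with ~5.5 GB actually free:
print(plan_split(6800, 5500))  # (5500, 1300)
```

The 1300 MB that doesn't fit is what gets streamed from the swap location during each step, which is why the choice of swap location (CPU vs. shared memory) affects speed.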

3. Tuning GPU Weights

  • Larger GPU weights = faster speed, but too large might cause crashes.
  • Smaller GPU weights = slower speed but possible to diffuse larger images.
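The trade-off in these two bullets is that whatever VRAM you don't give to weights is headroom for activations, and activations grow with image size. A toy heuristic showing the idea (the 1.5 GB headroom baseline at 1024x1024 is an illustrative guess, not Forge's actual formula):

```python
def suggest_gpu_weights_mb(vram_mb: int, width: int, height: int) -> int:
    """Suggest how many MB of weights to keep on the GPU, leaving
    headroom for activations that scales with the image area.
    The 1536 MB baseline at 1024x1024 is an illustrative assumption."""
    headroom_mb = int(1536 * (width * height) / (1024 * 1024))
    return max(vram_mb - headroom_mb, 0)

print(suggest_gpu_weights_mb(8192, 1024, 1024))  # 6656
```

Diffusing a larger image eats more headroom, so the suggested weight budget shrinks; that's exactly the "smaller GPU weights, larger images" trade-off above.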

Example Configurations

Example with Flux-dev

Using Flux-dev in diffusion:
- GPU Memory: 8GB
- CPU Memory: 32GB
- Shared GPU Memory: 16GB
- Time: 1.5 min

Example Prompts

Astronaut in a jungle, cold color palette, muted colors, very detailed, sharp focus.
Steps: 20, Sampler: Euler, Schedule type: Simple, CFG scale: 1, Distilled CFG Scale: 3.5, Seed: 12345, Size: 896x1152, Model: flux1-dev-bnb-nf4-v2
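That parameters line uses the comma-separated "Key: value" format that Forge (like A1111) writes into image metadata, so it's easy to read back programmatically. A small sketch (naive: it assumes no commas inside values):

```python
def parse_params(line: str) -> dict[str, str]:
    """Parse a comma-separated 'Key: value' generation-parameters line,
    like the ones Forge/A1111 embed in image metadata."""
    return {k.strip(): v.strip()
            for k, v in (part.split(":", 1) for part in line.split(","))}

params = parse_params("Steps: 20, Sampler: Euler, CFG scale: 1, "
                      "Distilled CFG Scale: 3.5, Seed: 12345, Size: 896x1152")
print(params["Sampler"])  # Euler
```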

FAQ

Which checkpoints should I use?

  • If your GPU supports newer CUDA versions (>11.7), use flux1-dev-bnb-nf4.safetensors for better speed and precision.
  • For older GPUs, stick with flux1-dev-fp8.safetensors.

How can I ensure my GPU is using the T5 text encoder?

  • The T5 text encoder may load in FP8 by default, which some setups can't handle. Make sure your setup can handle NF4 to get the best out of the T5 text encoder.

How can I swap parts between CPU and GPU?

  • Go to the settings and select a swap location. Shared memory tends to be faster, but check stability on your device first.

Can I use models like SDXL with NF4?

  • Sure! NF4 diffusion speeds up models like SDXL by around 35% on average, though outputs for a given seed won't exactly match the full-precision results.

Troubleshooting inpainting or img2img issues?

  • Make sure you’re on the latest version of Forge. Update it if necessary to resolve black image issues or missing outputs.

How to convert models to NF4?
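  • In practice people use tooling such as bitsandbytes' 4-bit quantization; conceptually, conversion re-quantizes each block of weights into 4-bit indices into 16 fixed levels plus one scale per block. A minimal pure-Python sketch of that idea (level values truncated from the QLoRA NormalFloat4 table; this illustrates the math, it is not Forge's converter):

```python
# Illustrative NF4 quantization. Each block of weights is normalized by its
# absmax, and every weight is mapped to the nearest of 16 fixed NF4 levels.
NF4_LEVELS = [
    -1.0, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911, 0.0,
    0.0796, 0.1609, 0.2461, 0.3379, 0.4407, 0.5626, 0.7230, 1.0,
]

def nf4_quantize(block: list[float]) -> tuple[list[int], float]:
    """Quantize one block of weights to 4-bit NF4 indices plus a scale."""
    absmax = max(abs(w) for w in block) or 1.0
    idx = [min(range(16), key=lambda i: abs(w / absmax - NF4_LEVELS[i]))
           for w in block]
    return idx, absmax

def nf4_dequantize(idx: list[int], absmax: float) -> list[float]:
    """Reconstruct approximate weights from indices and the block scale."""
    return [NF4_LEVELS[i] * absmax for i in idx]

idx, scale = nf4_quantize([0.5, -0.25, 1.0, 0.0])
print(nf4_dequantize(idx, scale))  # ~ [0.44, -0.28, 1.0, 0.0]
```

A real converter additionally packs two 4-bit indices into each byte, which is where the ~2x size saving over FP8 comes from.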

And there you go! With these tweaks, you should be all set to get the best out of your Flux AI checkpoints. Happy creating!