Mastering Flux AI with NF4: Speed and Quality Improvements
Overview
Hey there! So, you've got your hands on Flux AI, an amazing image generation tool by Black Forest Labs, huh? It's pretty rad, right? But to truly unleash its power, especially with those nifty Flux checkpoints, you gotta know how to tweak it right. Let's dive into how you can use different Flux checkpoints and get the best performance out of 'em!
Supported Flux Checkpoints
1. Available Checkpoints
- flux1-dev-bnb-nf4-v2.safetensors: Full flux-dev checkpoint with main model in NF4.
- Recommended: Download it from HuggingFace
- flux1-dev-fp8.safetensors: Full flux-dev checkpoint with main model in FP8.
Looking for raw Flux or GGUF? Check out this post.
2. Why NF4?
- Speed: For 6GB/8GB/12GB GPUs, NF4 can be 1.3x to 4x faster than FP8.
- Size: NF4 weights are about half the size of FP8.
- Accuracy: NF4 often beats FP8 in numerical precision and dynamic range.
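The "half the size" claim follows directly from the bit widths: NF4 stores roughly 4 bits per weight versus 8 for FP8. Here's a quick back-of-envelope sketch; the ~12B parameter count is an assumption for illustration, and NF4's small quantization-metadata overhead is ignored:

```python
def weight_size_gb(num_params: int, bits_per_param: float) -> float:
    """Approximate weight size in GiB for a given bit width."""
    return num_params * bits_per_param / 8 / 1024**3

params = 12_000_000_000  # assumed parameter count for the main model

fp8_gb = weight_size_gb(params, 8)  # 8 bits per weight
nf4_gb = weight_size_gb(params, 4)  # 4 bits per weight

print(f"FP8: {fp8_gb:.1f} GiB, NF4: {nf4_gb:.1f} GiB")  # NF4 ~ half of FP8
```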
Using Flux Checkpoints
1. Set Up Your GPU
- CUDA Support: If your device supports CUDA newer than 11.7, go for NF4. Congrats, you only need flux1-dev-bnb-nf4.safetensors.
- Older GPUs: If you have an older GPU like a GTX 10XX/20XX, download flux1-dev-fp8.safetensors.
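That decision rule is easy to express in code. This is a hypothetical helper, not part of Forge itself; the checkpoint names are the ones listed above:

```python
def pick_checkpoint(cuda_version: "tuple[int, int] | None") -> str:
    """Pick a Flux checkpoint: NF4 wants CUDA newer than 11.7;
    older GPUs (e.g. GTX 10XX/20XX) fall back to FP8."""
    if cuda_version is not None and cuda_version > (11, 7):
        return "flux1-dev-bnb-nf4.safetensors"
    return "flux1-dev-fp8.safetensors"

print(pick_checkpoint((12, 1)))  # flux1-dev-bnb-nf4.safetensors
print(pick_checkpoint((11, 4)))  # flux1-dev-fp8.safetensors
```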
2. Loading in the UI
- In the UI, Forge gives an option to force the loading weight type.
- Generally, set it to Auto to use the default precision in your downloaded checkpoint.
Tip: Don't load an FP8 checkpoint with the NF4 option!
Boosting Inference Speed
1. Default Settings
- Forge's presets are fast, but you can push the speed limit even further.
- Example System: 8GB VRAM, 32GB CPU memory, and 16GB shared GPU memory.
2. Offloading and Swapping
- If the model is larger than GPU memory, split it: load part onto the GPU and the rest into a "swap" location, either CPU or shared memory.
- Shared memory can be ~15% faster but might crash on some devices.
3. Tuning GPU Weights
- Larger GPU weights = faster speed, but too large might cause crashes.
- Smaller GPU weights = slower speed but possible to diffuse larger images.
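The trade-off in steps 2 and 3 boils down to a simple split: give the GPU a fixed weight budget and swap whatever doesn't fit. A minimal sketch (the helper and the numbers are illustrative, not Forge's actual implementation):

```python
def split_weights(model_gb: float, gpu_budget_gb: float) -> "tuple[float, float]":
    """Return (GB kept on the GPU, GB sent to the swap location)."""
    on_gpu = min(model_gb, gpu_budget_gb)
    return on_gpu, model_gb - on_gpu

# Example: ~6.5 GB quantized model, 5 GB reserved for weights on an 8 GB
# card; the remaining VRAM stays free for activations while diffusing.
on_gpu, swapped = split_weights(6.5, 5.0)
print(f"{on_gpu} GB on GPU, {swapped} GB swapped")  # 5.0 GB on GPU, 1.5 GB swapped
```

Raising the GPU weight budget means less swapping per step (faster), but leaves less room for activations, which is why too large a value can crash or prevent diffusing larger images.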
Example Configurations
Example with Flux-dev
Using Flux-dev in diffusion:
- GPU Memory: 8GB
- CPU Memory: 32GB
- Shared GPU Memory: 16GB
- Time: 1.5 min
Example Prompts
Astronaut in a jungle, cold color palette, muted colors, very detailed, sharp focus.
Steps: 20, Sampler: Euler, Schedule type: Simple, CFG scale: 1, Distilled CFG Scale: 3.5, Seed: 12345, Size: 896x1152, Model: flux1-dev-bnb-nf4-v2
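If you drive generation from a script rather than the UI, it can help to keep those same settings in one place. The key names below are illustrative, not an actual Forge API:

```python
# The example settings above, collected into a parameter dict
# (hypothetical structure; adapt to whatever pipeline you script against).
generation_params = {
    "prompt": ("Astronaut in a jungle, cold color palette, muted colors, "
               "very detailed, sharp focus."),
    "steps": 20,
    "sampler": "Euler",
    "schedule_type": "Simple",
    "cfg_scale": 1,             # CFG 1, per the settings above
    "distilled_cfg_scale": 3.5,
    "seed": 12345,
    "width": 896,
    "height": 1152,
    "model": "flux1-dev-bnb-nf4-v2",
}
print(generation_params["model"])
```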
FAQ
Which checkpoints should I use?
- If your GPU supports newer CUDA versions (>11.7), use flux1-dev-bnb-nf4.safetensors for better speed and precision.
- For older GPUs, stick with flux1-dev-fp8.safetensors.
How can I ensure my GPU is using the T5 text encoder?
- T5 might default to FP8, which could be incompatible. Ensure your setup can handle NF4 to get the best out of the T5 text encoder.
How can I swap parts between CPU and GPU?
- Go to settings and select swap locations. Shared memory tends to be faster but check stability first.
Can I use models like SDXL with NF4?
- Sure! Using NF4 diffusion speeds up models like SDXL by around 35% on average, though it doesn't exactly replicate seeds.
Troubleshooting inpainting or img2img issues?
- Make sure you're on the latest version of Forge. Update it if necessary to resolve black image issues or missing outputs.
How to convert models to NF4?
- Custom scripts or community-shared conversions might help; for example, there is an NF4 version of flux1-schnell: flux1-schnell-bnb-nf4.safetensors.
And there you go! With these tweaks, you should be all set to get the best out of your Flux AI checkpoints. Happy creating!