Stable Diffusion LoRA Training – Professional GPU Analysis

Posted on December 12, 2023 (updated December 19, 2023) by Jon Allman

Table of Contents

  • Introduction
  • Test Setup
  • Performance
  • VRAM Usage
  • Conclusion

Introduction

Training AI models takes a significant amount of time and can require hundreds or even thousands of graphics cards working together, often in a data center, to complete the task. As an alternative to training new models from scratch or fine-tuning all of the parameters of an existing model, LoRAs were introduced. LoRA stands for “Low-Rank Adaptation”: a method of fine-tuning a model through a much smaller set of parameters, without fundamentally changing the underlying model. This allows for fine-tuning with just a fraction of the resources required by traditional fine-tuning.
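
To make the idea concrete, below is a minimal PyTorch sketch of the low-rank update at the heart of LoRA. This is an illustrative toy, not kohya_ss’s actual implementation; the rank r corresponds to the “Network Dimension” setting discussed later in this article.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Toy LoRA wrapper: y = base(x) + (alpha/r) * x @ A^T @ B^T.
    Only the two small rank-r matrices A and B receive gradients."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 8.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # the original model weights stay frozen
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))  # zero-init: no change at step 0
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.lora_A.T @ self.lora_B.T)
```

Because only the two small matrices are trained, a rank-r adapter on a d×k weight matrix carries r×(d+k) parameters instead of d×k, which is also why LoRA files are so much smaller than full model checkpoints.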

Today, we will be exploring the performance of a variety of professional graphics cards when training LoRAs for use with Stable Diffusion. LoRAs are a popular way of guiding models like SD toward more specific and reliable outputs. For instance, instead of prompting for a “tank” and receiving whatever SD’s generic notion of a tank happens to be, you could include a LoRA trained on images of an “M1 Abrams” to produce images of tanks that reliably mimic the features of that specific vehicle.

However, depending on the size of your dataset, training a LoRA can still require a large amount of compute time, often hours or potentially even days. Because of this, if you want to train LoRAs or explore other methods of fine-tuning models as anything more than a hobby, getting the right GPU can end up saving you a lot of time and frustration.


In this article, we will be examining both the performance and VRAM requirements of training a standard LoRA model for SDXL within Ubuntu 22.04.3, using kohya_ss training scripts with bmaltais’s GUI. Although there are many more options available beyond standard LoRAs, such as LoHa, LoCon, iA3, etc., we’re more interested in measuring a performance baseline than in optimizing for file size, fidelity, or other factors. This also means that we won’t be focusing on the various settings that shape the final product but don’t affect performance during training, such as learning rates.

We will specifically be looking at the professional GPUs from NVIDIA (RTX) and AMD (Radeon PRO), as their large VRAM pools and high compute core counts make them ideal for professional AI workflows like this. We will, however, be testing consumer-grade GPUs in an upcoming article; although they often lack the VRAM capacity necessary for this type of training and aren’t intended to run at full load for extended periods, their relatively low cost makes them enticing for those who are just getting started in AI.

Test Setup

Threadripper PRO Test Platform

CPU: AMD Threadripper PRO 5995WX 64-Core
CPU Cooler: Noctua NH-U14S TR4-SP3 (AMD TR4)
Motherboard: ASUS Pro WS WRX80E-SAGE SE WIFI
BIOS Version: 1201
RAM: 8x Micron DDR4-3200 16GB ECC Reg. (128GB total)
GPUs:
AMD Radeon PRO W7900 (driver version 6.2.4-1683306.22.04)
NVIDIA RTX 6000 Ada
NVIDIA RTX A6000
NVIDIA RTX 5000 Ada
NVIDIA RTX A5000
(NVIDIA driver version 535.129.03)
PSU: Super Flower LEADEX Platinum 1600W
Storage: Samsung 980 Pro 2TB
OS: Ubuntu 22.04.3 LTS

Benchmark Software

Kohya’s GUI v22.2.1
Python: 3.10.6
SD Model: SDXL

AMD: PyTorch 2.1.1 + ROCm 5.6
NVIDIA: PyTorch 2.1.0 + CUDA 12.1

xFormers 0.0.22.post7
Optimizer: Adafactor
Arguments: scale_parameter=False, relative_step=False, warmup_init=False
Other arguments: network_train_unet_only, cache_text_encoder_outputs, cache_latents_to_disk

To see how various Pro-level cards perform, we decided to look at a number of the top GPUs from NVIDIA, as well as the AMD Radeon PRO W7900. NVIDIA is definitely the top choice for AI workloads at the moment (which is why we are testing multiple NVIDIA GPUs), but AMD has been doing a lot of work recently to catch up in this space. NVIDIA is still our go-to recommendation in most situations, but we like to include at least one AMD GPU when we can in order to keep tabs on their progress.

If you have an AMD GPU and you’re interested in utilizing ROCm for training LoRAs and other ML tasks, we recommend following AMD’s guide for ROCm installation to make sure that your system has ROCm installed and configured correctly.

After confirming that ROCm was installed correctly, we moved on to setting up two virtual environments (one AMD and one NVIDIA) for kohya_ss using conda, cloning the repository, and installing the requirements. By default, a CUDA-based version of PyTorch is installed, so for the AMD virtual environment, we uninstalled that version and installed the ROCm version instead.
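
A quick sanity check like the one below confirms which build each environment ended up with; ROCm builds of PyTorch expose the same torch.cuda API as CUDA builds, so it works on both sides.

```python
import torch

print(torch.__version__)               # e.g. "2.1.0+cu121" or "2.1.1+rocm5.6"
print(torch.cuda.is_available())       # True on both CUDA and ROCm builds
print(torch.cuda.get_device_name(0))   # name of the detected GPU
```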

Once the correct version of PyTorch is installed, we’re ready to configure and run kohya_ss!

We decided to train an SDXL LoRA on the base model provided by StabilityAI, using a set of thirteen photos of myself, each resized to 1024×1024 to match the default image size of SDXL. Since we’re not concerned with the quality of the output, we’re not using any captions or regularization images.
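
For reference, resizing a folder of photos for this kind of dataset takes only a few lines. The snippet below is an assumed workflow rather than the exact script we used; the folder name follows kohya_ss’s “<repeats>_<name>” convention, with the 40 repeats matching the step count described later.

```python
from pathlib import Path
from PIL import Image

src = Path("photos_raw")             # hypothetical source folder
dst = Path("train_data/40_subject")  # kohya_ss convention: "<repeats>_<name>"
dst.mkdir(parents=True, exist_ok=True)

for i, p in enumerate(sorted(src.glob("*.jpg"))):
    # Resize each photo to SDXL's default 1024x1024 resolution
    Image.open(p).convert("RGB").resize((1024, 1024)).save(dst / f"{i:03d}.png")
```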

Because we’re training on relatively large images, without the proper configuration, even GPUs with 48GB of VRAM can run out of memory during training. One of the most impactful settings to enable is some form of Cross-Attention, either SDPA or xFormers. We have tested the AMD cards with SDPA enabled, and the NVIDIA cards with both SPDA and xFormers. 

Following recommendations for SDXL training, we enabled the following settings: network_train_unet_only, cache_text_encoder_outputs, cache_latents_to_disk

Thankfully, these options not only save some VRAM but also improve training speed.

Gradient checkpointing can be used to significantly reduce VRAM usage, but it comes with a notable performance loss. However, none of the GPUs tested here required gradient checkpointing, as they all have more than enough VRAM.
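
For context, gradient checkpointing trades compute for memory: activations are discarded during the forward pass and recomputed during the backward pass. A minimal sketch of the mechanism in PyTorch:

```python
from torch.utils.checkpoint import checkpoint

def forward_with_checkpointing(blocks, x):
    # Activations inside each block are not kept in VRAM; they are
    # recomputed during backward, roughly doubling that forward compute.
    for block in blocks:
        x = checkpoint(block, x, use_reentrant=False)
    return x
```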

Because we found that the reported speeds and VRAM usage leveled out after a couple of minutes of training and additional epochs yielded identical results, we chose to test each GPU with 1 epoch of 40 steps per image, for a total of 520 steps using a batch size of 1.
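
For clarity, that step count is just the dataset size times the per-image repeats, divided by the batch size:

```python
images, repeats_per_image, epochs, batch_size = 13, 40, 1, 1
total_steps = (images * repeats_per_image * epochs) // batch_size
print(total_steps)  # 520
```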

Finally, we used the Adafactor optimizer with the following arguments: scale_parameter=False, relative_step=False, warmup_init=False
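
As a sketch of that configuration, here is how the same optimizer could be instantiated directly via the Adafactor implementation in HuggingFace transformers (which kohya_ss relies on). The parameter list and learning rate below are assumptions for illustration; learning rates are intentionally out of scope for this article.

```python
from transformers.optimization import Adafactor

optimizer = Adafactor(
    network_params,        # hypothetical: the LoRA network's trainable parameters
    lr=1e-4,               # assumed value, not part of the tested configuration
    scale_parameter=False,
    relative_step=False,
    warmup_init=False,
)
```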

Performance

[Chart: SDXL LoRA Training Performance - SDPA]
[Chart: SDXL LoRA Training Performance - xFormers]

By and large, we didn’t uncover any anomalies during the SDPA performance testing, except for the poor performance we found for the AMD Radeon PRO W7900 when using Network Dimension 1. Typically, we find Net Dim 1 produces the highest iterations per second, but for some reason, the W7900 struggled at this level.

Although xFormers is not available for AMD GPUs, we decided to test the NVIDIA side with xFormers (chart #2) and found some interesting results. Instead of performance decreasing as Network Dimension increases, the latest Ada generation cards get a slight performance bump at certain levels: the RTX 6000 Ada sped up at Net Dim 64 and the RTX 5000 Ada at Net Dim 128. However, overall performance for the Ada generation is better with SDPA than with xFormers, so there doesn’t seem to be much reason to use xFormers with these cards for LoRA training.

For the previous generation of NVIDIA cards, we found a very slight performance boost with xFormers over SDPA, so it may be worth sticking with xFormers on those cards.

The W7900’s performance trailed behind all of the NVIDIA cards tested here, with an average performance difference of about 55-65% compared to the 6000 Ada.

VRAM Usage

[Chart: SDXL LoRA Training VRAM Usage - SDPA]
[Chart: SDXL LoRA Training VRAM Usage by GPU - xFormers]

With a dataset of large images like these (1024×1024), VRAM usage is quite high. As expected, we found that VRAM use increased along with Network Dimension. There was not a significant difference in VRAM usage between SDPA and xFormers, but we did find that the AMD Radeon PRO W7900 used about 7GB more VRAM than the NVIDIA GPUs at every level.

On the NVIDIA side, it was pleasing to find that VRAM usage never exceeded 20GB, which is good news for those with 24GB GPUs. Based on this testing, it seems likely that NVIDIA GPUs with 24GB of VRAM will have no trouble training SDXL LoRAs without gradient checkpointing. However, if our upcoming consumer-level GPU testing shows that AMD’s higher VRAM usage holds true for those models as well, then higher Network Dimensions will likely require gradient checkpointing when training SDXL LoRAs, which will incur a performance penalty.
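
For readers who want to check their own headroom, one way to track peak VRAM use from within PyTorch is shown below; note that this counts only PyTorch’s own allocations, so a tool like nvidia-smi will report somewhat higher totals.

```python
import torch

torch.cuda.reset_peak_memory_stats()
# ... run some training steps ...
peak_gb = torch.cuda.max_memory_allocated() / 1024**3
print(f"Peak VRAM allocated by PyTorch: {peak_gb:.1f} GB")
```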

Conclusion

It’s been great to see the strides that AMD has made with the ROCm ecosystem, particularly in the second half of 2023, and it’s now easier than ever to utilize AMD GPUs for ML tasks. Although we still see NVIDIA holding the performance crown in this round of LoRA testing, it was refreshing to be able to compare the two manufacturers, and I’m confident that we’ll see performance improvements as ROCm support matures.

Any of the GPUs featured in this article are more than capable of training LoRAs, but if you regularly train LoRAs for Stable Diffusion, the performance offered by the Ada generation of NVIDIA GPUs could save you a significant amount of training time. These results show that the latest NVIDIA RTX 6000 Ada 48GB can be up to 24% faster than the previous-generation NVIDIA RTX A6000 48GB, or 55% faster than AMD’s current best offering, the Radeon PRO W7900 48GB.

While these Pro-level cards are what you should be using to maximize performance and flexibility if you do this type of work regularly, they are certainly very expensive and can be difficult to justify if you are just starting out. In an upcoming article, we will continue this testing with a number of more affordable consumer-level GPUs, so stay tuned for more LoRA training results!

If you are looking for a workstation for AI and Scientific Computing, you can visit our solutions pages to view our recommended workstations for various software packages, our custom configuration page, or contact one of our technology consultants for help configuring a workstation that meets the specific needs of your unique workflow.

Tags: AI, AMD, GPU, kohya_ss, LoRA, NVIDIA, RTX 5000 Ada, RTX 6000 Ada, RTX A5000, RTX A6000, SDXL, stable diffusion
