Beginner Training Guide
This guide assumes you already know how to download/clone repositories from GitHub and that you are familiar with Python and pip.
Choosing a Fork
The original BasicSR repository by Xinntao has a lot of issues and lacks many of the features and bug fixes that forks made by various community members include. Also, if you get the latest Xinntao version you will be training a new-arch model instead of an old-arch model, which means you will not be able to train at scales other than 4.
If you are reading this guide as a refresher and currently use BlueAmulet's fork, please consider switching to Victorca25's fork. It has many more features as well as everything important from BlueAmulet's. When using it, make sure to use the "master" branch, as it has all the recent features and bugfixes. If this is your first time reading this guide, I still highly recommend Victorca25's fork.
You can see a list of currently maintained forks here.
Every fork of BasicSR will require a few dependencies to be installed. You can install most of them at once by running this in the console:
pip install numpy opencv-python pyyaml tensorboardx
You also need to install PyTorch and torchvision, but which version you need depends on your system. If you have an NVIDIA graphics card, make sure to select the latest CUDA version from the list. If you don't, I don't recommend training in the first place. Just select whatever stable version is currently available, pick your OS, choose to install it through pip, then run the command it gives you.
Once this is done you should have all the required dependencies. Some forks may require others, but these should work for most of them.
Creating a Dataset
All BasicSR/ESRGAN models are trained using low-resolution images, often called LR for short or LQ (Low Quality), and high-resolution images, often called HR for short or GT (Ground Truth). For a 4x scale model, this means that your LR images will be 4x smaller in resolution than your HR images.
It is important to create the best dataset you can for your upscale task. Many pre-existing datasets exist, such as DF2K or Manga109, but a dataset can be anything. Your HRs could be high quality frames of a TV show, for example, with the LRs being the same images scaled down by 4 using a bicubic filter. This would then create a model that is good at upscaling small images that are visually similar to the LRs you created. The dataset is arguably the most important part of training a model. Without a good dataset, your model will not work well.
Examples of bad datasets
- Random images with no similarity to each other
- Images with lots of JPEG artifacts
- Low-resolution images
- A dataset with only 5 images
- A dataset where every image is almost exactly the same
Examples of good datasets
- 800 high-quality pictures of mountains
- 3,000 exported frames of a 1080p cartoon
- 500 images of cropped-out faces
A few things to note:
- Your HR images and LR images must have exactly matching names
- Your HR images must be exactly 4x (or whatever scale you are training) the resolution of your LRs. This means you must crop your HR images so that each dimension (width and height) is a multiple of your scale.
- The more images you have, the better the model will become (generally)
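The cropping and size rules above can be sketched in a few lines, assuming images are loaded as NumPy arrays (as OpenCV returns them); the helper names are hypothetical:

```python
import numpy as np

def crop_to_multiple(img: np.ndarray, scale: int = 4) -> np.ndarray:
    """Crop an HR image so its width and height are exact multiples of the scale."""
    h, w = img.shape[:2]
    return img[: h - h % scale, : w - w % scale]

def check_pair(hr: np.ndarray, lr: np.ndarray, scale: int = 4) -> bool:
    """True if the HR is exactly `scale` times the LR in both dimensions."""
    return hr.shape[0] == lr.shape[0] * scale and hr.shape[1] == lr.shape[1] * scale
```

Running `check_pair` over your whole dataset before training can save you from shape-mismatch errors partway through a long run.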
Once you have your dataset set up, you need to create a validation set. This can just be a few images taken from your LR and HR folders and placed into separate HR and LR directories. These images are just used as a reference to see how your model is doing during training. Note: These images will NOT affect training in any way.
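For example, a small stdlib-only script could move a handful of matching pairs into separate validation folders. The folder names, function name, and count are all arbitrary placeholders:

```python
import shutil
from pathlib import Path

def split_validation(hr_dir, lr_dir, val_hr_dir, val_lr_dir, count=5):
    """Move the first `count` HR/LR pairs into separate validation folders."""
    val_hr, val_lr = Path(val_hr_dir), Path(val_lr_dir)
    val_hr.mkdir(parents=True, exist_ok=True)
    val_lr.mkdir(parents=True, exist_ok=True)
    for hr_path in sorted(Path(hr_dir).iterdir())[:count]:
        # HR and LR filenames match exactly, so the same name works in both folders.
        lr_path = Path(lr_dir) / hr_path.name
        shutil.move(str(hr_path), str(val_hr / hr_path.name))
        shutil.move(str(lr_path), str(val_lr / hr_path.name))
```

A few images are enough; a large validation set only slows training down without improving the model.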
Modifying the Training Config
This section is based on victorca25's fork and explains how to modify the YAML training configs. The options you need to change are explained below.
First, you should know where the training configs are located: /codes/options/train/. There you will find train_template.yml. Before modifying this file, I recommend making a copy of it, just in case. There are only a few changes you need to make:
- name
- This will be your model name. Typically, these also include the scale. Example: 4xBox, 2xFaithful.
- scale
- This is the scale of your model. You should already know what this is if you have already created your dataset. Typically this is just 4.
- dataroot_HR
- This is the path to your dataset's HR folder
- dataroot_LR
- This is the path to your dataset's LR folder
- n_workers
- This is the number of threads that BasicSR will use. Typically this is just the number of cores your CPU has.
- batch_size
- This is the number of images that BasicSR will look at in each iteration. Typically this is set to the highest it will go before running out of VRAM.
- dataroot_HR (validation)
- This is the path to your dataset's validation HR folder
- dataroot_LR (validation)
- This is the path to your dataset's validation LR folder
- root
- The direct path to the directory of the repository you downloaded
- pretrain_model_G
- The model that your model will use as a sort of base to get started with. The ones included with BasicSR originally are RRDB_ESRGAN_x4.pth and RRDB_PSNR_x4.pth, but you can use any old-arch model.
- val_freq
- The frequency (in iterations) at which BasicSR will run ESRGAN on your validation LRs using the latest version of your model. Typically this is set to 5000.
- save_checkpoint_freq
- The frequency (in iterations) at which BasicSR will save your model.
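Put together, the relevant parts of a filled-in config might look like the fragment below. The key names follow the victorca25 train_template.yml, but the exact nesting can differ between forks, and every path and value here is a placeholder example:

```yaml
name: 4xBox                # model name, usually including the scale
scale: 4                   # training scale
datasets:
  train:
    dataroot_HR: /path/to/dataset/hr        # placeholder path
    dataroot_LR: /path/to/dataset/lr        # placeholder path
    n_workers: 8           # roughly the number of CPU cores
    batch_size: 8          # as high as your VRAM allows
  val:
    dataroot_HR: /path/to/dataset/val_hr
    dataroot_LR: /path/to/dataset/val_lr
path:
  root: /path/to/BasicSR
  pretrain_model_G: /path/to/RRDB_PSNR_x4.pth
train:
  val_freq: 5000           # validate every 5000 iterations
logger:
  save_checkpoint_freq: 5000
```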
To start training, open your command-line interface of choice, navigate to the /codes/ folder, and run python train.py -opt train_template.yml, replacing train_template.yml with whatever your training config is named. If all goes well, it should spit out a bunch of info and then start training from iteration 0, epoch 0. If you set everything up correctly, you should have a new folder in your experiments folder that is named after your model.
To pause the training, press Ctrl+C; it should save the latest state and model. If you spam it, press it at the wrong time, or have PowerShell's Quick Edit mode enabled, there is a chance it will not save correctly. In that case, you would fall back to the latest checkpoint, which is saved at the interval specified by save_checkpoint_freq. These files are saved in the experiments folder. To resume, edit the YAML training config, remove the # before resume_state, and point it to the .state file. Then simply run the same command from the previous paragraph.
Troubleshooting
CUDA out of memory
- This means you need to decrease your batch size. If you can't decrease your batch size any more, decrease your HR size.
- If you're getting this error during validation, it means your validation images are too large. Try cropping them or splitting them into multiple images.
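One way to split an oversized validation image into smaller pieces is simple array slicing. This is a sketch with NumPy; the function name and tile size are arbitrary, and keeping the tile size a multiple of your scale ensures HR and LR tiles still line up:

```python
import numpy as np

def split_tiles(img: np.ndarray, tile: int = 512) -> list:
    """Split an image array into tiles of at most `tile` x `tile` pixels."""
    tiles = []
    for y in range(0, img.shape[0], tile):
        for x in range(0, img.shape[1], tile):
            # Slicing past the edge is safe: the last row/column of tiles
            # is simply smaller than `tile` in that dimension.
            tiles.append(img[y:y + tile, x:x + tile])
    return tiles
```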
Module not found
- This means you did not install the required libraries through pip. Try again, or check whether your PATH is pointing to a different Python installation.
Could not broadcast shape ____ to shape ____
- This could mean a few things; most likely, your LR and HR sizes are mismatched. Make sure each HR dimension is a clean multiple of the corresponding LR dimension.