Beginner Training Guide

Getting Started

This guide assumes you already know how to download/clone repositories from GitHub and that you are familiar with Python and pip.

Choosing a Fork

The original BasicSR fork by Xinntao (https://github.com/xinntao/BasicSR/) has a lot of issues and in general lacks many of the features and bug fixes that various community forks have added. Also, if you use the latest Xinntao fork you will be training a new-arch model instead of an old-arch model, which means you will not be able to train at scales other than 4.

We highly recommend Victorca25's traiNNer (https://github.com/victorca25/traiNNer/tree/master) over existing forks of BasicSR. It has many extra features, including many additions that make it easier to use, along with all of the important features from past forks. If you are reading this guide as a refresher and currently use BlueAmulet's fork (https://github.com/BlueAmulet/BasicSR), please switch to traiNNer; BlueAmulet's fork is now unmaintained.

You can find a list of currently maintained forks on the Maintained BasicSR Forks page of this wiki.

Installing Dependencies

traiNNer requires a few dependencies to be installed. You can install most of them (including the optional ones) at once by running this in the console: pip install numpy opencv-python pyyaml lmdb scipy Pillow joblib tensorboardx. The pyyaml package covers the YAML configs used below; JSON configs need no extra package, since json is part of Python's standard library.

You also need to install pytorch and torchvision (https://pytorch.org/get-started/locally/), but which version you need depends on your system. If you have an NVIDIA graphics card, make sure to select the latest CUDA version from the list. If you don't have one, training is not a good idea in the first place; on a CPU it will be impractically slow. Select whatever stable version is currently available, pick your OS, choose to install it through pip, then run the command it gives you.
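
For example, at the time of writing, the generated command for a stable CUDA build looks something like pip install torch torchvision --index-url https://download.pytorch.org/whl/cu121. The exact package list and index URL depend on the options you pick, so always run the command the site itself generates.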

Once this is done, you should have all the required and optional dependencies. Other forks may require more, but these should work for traiNNer.

Creating a Dataset

All BasicSR/traiNNer/ESRGAN models are trained on pairs of low-resolution images, often called LR or LQ (Low Quality), and high-resolution images, often called HR or GT (Ground Truth). For a 4x scale model, this means your LR images will be 4x smaller in resolution than your HR images.

It is important to create the best dataset you can for your upscale task. Many pre-existing datasets exist, such as DF2K or Manga109, but a dataset can be anything. Your HRs could be high-quality frames of a TV show, for example, with the LRs being the same images scaled down by 4 using a bicubic filter. This would create a model that is good at upscaling small images that are visually similar to the LRs you created. The dataset is one of the most important parts of training a model; without a good dataset, your model will not work well.
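
As an illustration of that bicubic workflow, here is a minimal sketch using opencv-python (installed earlier). The hr and lr folder names, the scale, and the in-place crop are assumptions for the example, not anything traiNNer requires:

  import os
  import cv2  # opencv-python, installed earlier

  hr_dir, lr_dir, scale = "hr", "lr", 4  # example folder names and scale
  os.makedirs(lr_dir, exist_ok=True)

  for name in os.listdir(hr_dir):
      path = os.path.join(hr_dir, name)
      img = cv2.imread(path)
      if img is None:  # skip anything that is not an image
          continue
      h, w = img.shape[:2]
      # crop so width and height are clean multiples of the scale
      img = img[: h - h % scale, : w - w % scale]
      cv2.imwrite(path, img)  # overwrites the HR with the cropped version
      # bicubic downscale to create the matching LR, keeping the same name
      lr = cv2.resize(img, (img.shape[1] // scale, img.shape[0] // scale),
                      interpolation=cv2.INTER_CUBIC)
      cv2.imwrite(os.path.join(lr_dir, name), lr)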

Datasets don't have to be just downscaled images, though. You can use images with compression artifacts or noise as the LRs; this will train your model to remove those artifacts. traiNNer makes this very simple, and it is discussed further down in the guide.

Examples of bad datasets in general

  • Random images with no similarity to each other
  • A dataset with only 5 images
  • A dataset where every image is almost exactly the same

Examples of bad HR images

  • Images with lots of JPEG artifacts
  • Low-resolution images

Examples of good datasets

  • 800 high-quality pictures of mountains
  • 3,000 exported frames of a 1080p cartoon
  • 500 images of cropped-out faces

A few things to note (a quick sanity check for these is sketched after the list):

  • Your HR images and LR images must have exactly matching names
  • Your HR images must be exactly 4x (or whatever scale you are training) the resolution of your LRs. This means you must crop your HR images so that both dimensions (width and height) are multiples of 4. (traiNNer takes care of this)
  • The more images you have, the better the model will become (to an extent)
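
Here is a minimal sketch of such a sanity check, using Pillow from the dependency list; the hr/lr paths and the scale are example values:

  import os
  from PIL import Image  # Pillow, installed earlier

  hr_dir, lr_dir, scale = "hr", "lr", 4  # example paths and scale
  hr_names, lr_names = set(os.listdir(hr_dir)), set(os.listdir(lr_dir))
  # names present in one folder but not the other
  assert hr_names == lr_names, f"mismatched names: {hr_names ^ lr_names}"

  for name in sorted(hr_names):
      hr_w, hr_h = Image.open(os.path.join(hr_dir, name)).size
      lr_w, lr_h = Image.open(os.path.join(lr_dir, name)).size
      if (hr_w, hr_h) != (lr_w * scale, lr_h * scale):
          print(f"{name}: HR {hr_w}x{hr_h} is not exactly {scale}x LR {lr_w}x{lr_h}")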

Once you have your dataset set up, you may want to create a validation set. This can just be a few images taken from your LR and HR folders and placed into separate HR and LR directories. These images are just used as a reference to see how your model is doing during training. Note: These images will NOT affect training in any way.
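
Splitting off a validation set can be as simple as moving a few random pairs into separate folders. A sketch, with all folder names and the count as example values:

  import os
  import random
  import shutil

  hr_dir, lr_dir = "hr", "lr"          # training set (example paths)
  val_hr, val_lr = "val_hr", "val_lr"  # validation set (example paths)
  os.makedirs(val_hr, exist_ok=True)
  os.makedirs(val_lr, exist_ok=True)

  # move a handful of random pairs out of the training set
  for name in random.sample(sorted(os.listdir(hr_dir)), k=5):
      shutil.move(os.path.join(hr_dir, name), os.path.join(val_hr, name))
      shutil.move(os.path.join(lr_dir, name), os.path.join(val_lr, name))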

Configuring traiNNer

This configuration setup is based on Victorca25's traiNNer and explains how to modify the YAML training configs. The options that you need to change are explained below.

First, you should know where the training configs are located. You can find them in /codes/options/sr/, where you will find train_sr.yml. If you will be modifying this file, I recommend making a copy of it first, just in case. There are only a few changes you need to make (a sketch of how they fit together follows the list):

name
This will be your model name. Typically, these also include the scale. Example: 4xBox, 2xFaithful.
scale
This is the scale of your model. You should already know what this is if you have already created your dataset. Typically this is just 4.
dataroot_HR
This is the path to your dataset's HR folder
dataroot_LR
This is the path to your dataset's LR folder
n_workers
This is the number of threads that traiNNer will use. Typically this is just the number of cores your CPU has.
batch_size
This is the number of images that traiNNer will look at in each iteration. Typically this is set to the highest value that fits before running out of VRAM; larger batches tend to yield more stable training.
crop_size
This is the resolution that traiNNer will automatically crop your dataset to. This number may be lowered to reduce VRAM usage.
dataroot_HR (validation)
This is the path to your dataset's validation HR folder (not required)
dataroot_LR (validation)
This is the path to your dataset's validation LR folder (not required)
root
The direct path to the directory of the repository you downloaded
pretrain_model_G
The pretrained model that your model will start from. The ones included with BasicSR originally are RRDB_ESRGAN_x4.pth and RRDB_PSNR_x4.pth, but you can use any old-arch model.
val_freq
The frequency (in iterations) at which traiNNer will upscale your validation LRs using the latest version of your model. Typically this is set to 5000. (not required)
save_checkpoint_freq
The frequency (in iterations) at which traiNNer will save your model. You may want a lower value (more frequent saves) if you test the models yourself.
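
To show how these options relate, here is a rough sketch of where they sit inside train_sr.yml. Check the nesting and key names against the template file itself; all paths and values here are examples only:

  name: 4xBox
  scale: 4
  datasets:
    train:
      dataroot_HR: /path/to/dataset/hr
      dataroot_LR: /path/to/dataset/lr
      n_workers: 8
      batch_size: 8
      crop_size: 128
    val:  # optional validation set
      dataroot_HR: /path/to/dataset/val_hr
      dataroot_LR: /path/to/dataset/val_lr
  path:
    root: /path/to/traiNNer
    pretrain_model_G: ../experiments/pretrained_models/RRDB_PSNR_x4.pth
  train:
    val_freq: 5000
  logger:
    save_checkpoint_freq: 5000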

Training

To start training, open your command-line interface of choice, navigate to the /codes/ folder, and type python train.py -opt train_sr.yml, replacing train_sr.yml with whatever your training config is named. If all goes well, it will print a bunch of info and then start training from iteration 0, epoch 0. If you set everything up correctly, you should also have a new folder in your experiments folder that is named after your model.

To pause training, press CTRL+C. It should save the latest state and model. If you spam it, press it at the wrong time, or have PowerShell's QuickEdit mode enabled, there is a possibility it will not save properly. In that case you fall back to the latest resume state, which is saved at the interval specified by save_checkpoint_freq (e.g. 7200.state). These files are saved in the experiments folder. To resume, edit the training config, remove the # before resume_state, and point it at the .state file. Then simply run the same command from the previous paragraph.
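
For example, the uncommented line might look like this in the config (the model name, folder layout, and iteration number are examples):

  path:
    root: /path/to/traiNNer
    resume_state: ../experiments/4xBox/training_state/7200.state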

Common errors

CUDA out of memory
This means you need to decrease your batch size. If you can't decrease your batch size any more, decrease your crop_size.
  • If you're getting this error during validation, it means your validation images are too large. Try cropping them or splitting them into multiple images.
Module not found
This means you did not install the required libraries through pip. Try again, or check whether your PATH is pointing to a different Python installation than the one you installed them into.
Could not broadcast shape ____ to shape ____
This could mean a few things, but most likely your LR and HR sizes are mismatched. Make sure they are clean multiples of each other.