Image Upscaling Using Neural Networks

Do you remember those classic scenes from the CSI TV series, where a detective peers at a pixelated image from a surveillance camera and instructs the tech whiz to "zoom enhance"? After a few keystrokes, the blurry image transforms, revealing a perfectly clear license plate. We've all had a good laugh at that, dismissing it as pure Hollywood bullshit, right?

Well, as it turns out, image upscaling (turning low-resolution images into high-resolution versions) isn't as far-fetched as it seemed. Over the last decade, we've seen significant leaps in the capabilities of image upscaling, thanks to the power of neural networks. Now that "zoom enhance" command doesn't seem so laughable anymore!

Basic Image Upscaling

The most straightforward approach to upscaling an image is basic upsampling through interpolation. Interpolation expands the image's size by inserting additional pixels into the low-resolution image and computing their values from the existing ones. Bicubic interpolation, for example, uses a weighted average of adjacent pixels to produce each output pixel. It works on a 4x4 neighbourhood, so 16 neighbours contribute to every new pixel. It is a popular choice in image editing software such as Photoshop.

128 pixel image with jpeg compression artifacts	Resized to 1024 pixels with Bicubic Sharper in Photoshop

As you can see, upscaling through interpolation doesn't result in a clear image. This is because interpolation simply fills in new pixel values based on the closest known pixels. It doesn't introduce any new information into the image.
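To make the 4x4 neighbourhood concrete, here is a minimal pure-Python sketch of bicubic upsampling on a grayscale image stored as a list of rows. It assumes the Keys cubic kernel with a = -0.5, which is one common choice; Photoshop's "Bicubic Sharper" variant is not reproduced here, and the function names are made up for illustration.

```python
import math

def cubic_weight(t, a=-0.5):
    """Keys cubic convolution kernel (a = -0.5 is an assumed, common setting)."""
    t = abs(t)
    if t <= 1:
        return (a + 2) * t**3 - (a + 3) * t**2 + 1
    if t < 2:
        return a * t**3 - 5 * a * t**2 + 8 * a * t - 4 * a
    return 0.0

def bicubic_sample(img, x, y):
    """Sample the image at fractional position (x, y) from its 4x4 neighbourhood."""
    h, w = len(img), len(img[0])
    x0, y0 = math.floor(x), math.floor(y)
    total = 0.0
    for j in range(-1, 3):          # 4 rows of neighbours
        for i in range(-1, 3):      # 4 columns of neighbours
            sx, sy = x0 + i, y0 + j
            # Clamp to the image border so edge pixels still get 16 neighbours.
            px = min(max(sx, 0), w - 1)
            py = min(max(sy, 0), h - 1)
            total += img[py][px] * cubic_weight(x - sx) * cubic_weight(y - sy)
    return total

def upscale(img, factor):
    """Upscale by an integer factor, aligning pixel centres of input and output."""
    h, w = len(img), len(img[0])
    return [
        [bicubic_sample(img, (X + 0.5) / factor - 0.5, (Y + 0.5) / factor - 0.5)
         for X in range(w * factor)]
        for Y in range(h * factor)
    ]
```

Every output value is a fixed weighted sum of pixels that already exist, which is why no new detail can appear: a flat region stays flat, and a blurry edge stays blurry at a larger size.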

When it comes to enlarging an image significantly, say from 128px to 1024px, basic upsampling unfortunately falls short of expectations. We want the 'zoom and enhance' wizardry we're used to seeing in Hollywood movies!

Technical details: Neural Network Architectures in Image Upscaling

ESRGAN

ESRGAN (Enhanced Super-Resolution Generative Adversarial Networks) was a significant milestone for the image upscaling domain in 2019. It took the original SRGAN's concept and improved upon it by introducing an enhanced generator and a more robust discriminator, allowing it to generate high-resolution images from low-resolution inputs. The architecture improved the realism and quality of upscaled images, setting a new standard in super-resolution tasks.
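One concrete piece of that discriminator change, as described in the ESRGAN paper, is the relativistic average discriminator: instead of judging whether an image is real, it judges whether a real image looks more realistic than the average generated one. The sketch below shows only the loss arithmetic in plain Python. The score lists are stand-ins for raw discriminator outputs, and the function names are illustrative, not taken from any ESRGAN codebase.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def mean(values):
    return sum(values) / len(values)

def relativistic_probs(real_scores, fake_scores):
    """Probability that each real beats the average fake, and each fake beats the average real."""
    real_vs_fake = [sigmoid(r - mean(fake_scores)) for r in real_scores]
    fake_vs_real = [sigmoid(f - mean(real_scores)) for f in fake_scores]
    return real_vs_fake, fake_vs_real

def discriminator_loss(real_scores, fake_scores):
    """Low when real images score clearly above the generated ones."""
    rvf, fvr = relativistic_probs(real_scores, fake_scores)
    return -mean([math.log(p) for p in rvf]) - mean([math.log(1.0 - p) for p in fvr])

def generator_loss(real_scores, fake_scores):
    """Mirror image of the discriminator loss: low when fakes score above reals."""
    rvf, fvr = relativistic_probs(real_scores, fake_scores)
    return -mean([math.log(1.0 - p) for p in rvf]) - mean([math.log(p) for p in fvr])
```

Because the generator loss depends on both the real and the generated scores, the generator receives a training signal from real images as well, not just from its own outputs.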

GFPGAN

Building upon the concept of ESRGAN, GFPGAN was developed specifically for blind face restoration. This model leverages the rich facial priors encapsulated in a pretrained face GAN. The innovative channel-split spatial feature transform layers incorporated into the restoration process allowed the model to jointly restore facial details and enhance colors, achieving superior performance on both synthetic and real-world datasets.

RealESRGAN

RealESRGAN extended the powerful ESRGAN architecture to address complex real-world degradations. The model introduced a high-order degradation modeling process to simulate real-world degradations better and also tackled common ringing and overshoot artifacts. An innovative U-Net discriminator was employed to stabilize training dynamics and enhance the discriminator's capabilities, resulting in superior visual performance on real datasets.
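The idea of high-order degradation is to apply a basic degradation step more than once when synthesizing low-quality training inputs. The following is a toy sketch under simplifying assumptions: a 3x3 box blur, block-average downsampling, and Gaussian noise stand in for the richer set of blur kernels, resize modes, noise types, and JPEG compression used in practice. All names and parameter values are illustrative.

```python
import random

def box_blur(img):
    """3x3 box blur with clamped borders (stand-in for real blur kernels)."""
    h, w = len(img), len(img[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            acc = 0.0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    yy = min(max(y + dy, 0), h - 1)
                    xx = min(max(x + dx, 0), w - 1)
                    acc += img[yy][xx]
            row.append(acc / 9.0)
        out.append(row)
    return out

def downsample(img, factor=2):
    """Average non-overlapping factor x factor blocks."""
    h, w = len(img) // factor, len(img[0]) // factor
    return [
        [sum(img[y * factor + j][x * factor + i]
             for j in range(factor) for i in range(factor)) / factor**2
         for x in range(w)]
        for y in range(h)
    ]

def add_noise(img, sigma, rng):
    """Additive Gaussian noise, clipped to the 0..255 range."""
    return [[min(max(v + rng.gauss(0, sigma), 0.0), 255.0) for v in row] for row in img]

def degrade_once(img, rng, sigma=5.0):
    """One first-order degradation: blur, then downsample, then noise."""
    return add_noise(downsample(box_blur(img)), sigma, rng)

def high_order_degrade(img, order=2, seed=0):
    """Repeat the first-order degradation `order` times."""
    rng = random.Random(seed)
    for _ in range(order):
        img = degrade_once(img, rng)
    return img
```

With order=2, a 16x16 input ends up as a 4x4 image that has been blurred and corrupted twice, which is closer to a photo that was resized, recompressed, and re-shared several times than a single clean downscale would be.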

CodeFormer

Introduced in August 2022, CodeFormer improved blind face restoration using a learned discrete codebook prior. It shifted the task to a code prediction challenge, providing visual information to generate high-quality faces from degraded inputs. Utilizing a Transformer-based prediction network, CodeFormer modeled the global context of low-quality faces, producing natural approximations of target faces, irrespective of degradation severity. The model included a unique feature transformation module for adaptiveness to varying degradation, offering a balance between fidelity and quality. Outperforming previous models, CodeFormer's robustness was confirmed through experiments on synthetic and real-world datasets.
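To illustrate what a discrete codebook means here, the sketch below shows a generic vector-quantization lookup: each feature vector is replaced by an index into a small table of learned vectors, and decoding simply reads the table back. This is a simplification. In CodeFormer the indices for a degraded face are predicted by the Transformer rather than found by nearest-neighbour matching, and the tiny two-dimensional codebook below is invented for the example.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def quantize(features, codebook):
    """Return, for each feature vector, the index of the nearest codebook entry."""
    return [
        min(range(len(codebook)), key=lambda k: squared_distance(f, codebook[k]))
        for f in features
    ]

def decode(indices, codebook):
    """Look the indices back up; the output can only contain codebook entries."""
    return [codebook[k] for k in indices]

# Hypothetical 3-entry codebook of 2-D vectors, for illustration only.
codebook = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
noisy_features = [[0.1, -0.2], [4.0, 6.0], [0.9, 1.2]]
codes = quantize(noisy_features, codebook)
clean_features = decode(codes, codebook)
```

Since the decoder only ever sees entries from the codebook, the output is drawn from high-quality building blocks no matter how degraded the input is. That also hints at why the restored face can drift away from the original person.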

R-ESRGAN 4x+ & Stable Diffusion

It's readily apparent that the dress in the image has been enhanced with exceptional clarity through upscaling. This kind of detail far surpasses what can be obtained with a simple interpolation method.

However, it's also clear that while the facial blurriness has been considerably reduced, the resulting facial features seem distorted. Moreover, the face looks more like a different individual than the person in the original photo.

128 pixel image with jpeg compression artifacts	Resized to 1024 pixels with R-ESRGAN 4x+ & Stable Diffusion

CodeFormer & Stable Diffusion

In this instance, the face is fully restored, with blur and compression artifacts removed. However, CodeFormer applies this restoration only to the face and leaves the rest of the image blurred.

Yet again, the face doesn't quite match the original photo. While CodeFormer has created an aesthetically pleasing face with a few similar elements, it's not an accurate representation of the original person.

128 pixel image with jpeg compression artifacts	CodeFormer & Stable Diffusion

CodeFormer & R-ESRGAN 4x+ & ControlNet & Stable Diffusion

In this combination, both the face and the rest of the image are upscaled without any trace of blurriness. R-ESRGAN 4x+ first restores the entire image, after which CodeFormer makes another pass over the face. The result also exhibits the characteristic "dead fish eyes", a trait quite infamous in AI-generated human faces. This is caused by both eyes being generated independently, so they lack a sense of coherence.

128 pixel image with jpeg compression artifacts	CodeFormer & R-ESRGAN 4x+ & ControlNet & Stable Diffusion
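The two-pass order described above (whole image first, face second) can be sketched as a small function. This is only an outline of the control flow: upscale_fn and face_fn are hypothetical callables standing in for R-ESRGAN 4x+ and CodeFormer, the face box is assumed to be known already, and the face restorer is assumed to return a crop of the same size. The ControlNet and Stable Diffusion steps are not modelled.

```python
def two_pass_restore(image, upscale_fn, face_fn, face_box):
    """Upscale the full image, then restore only the face region and paste it back.

    image:    grayscale image as a list of rows
    face_box: (top, left, bottom, right) in upscaled coordinates, end-exclusive
    """
    upscaled = upscale_fn(image)                      # pass 1: whole image
    top, left, bottom, right = face_box
    face_crop = [row[left:right] for row in upscaled[top:bottom]]
    restored_face = face_fn(face_crop)                # pass 2: face only
    result = [list(row) for row in upscaled]          # copy before pasting
    for dy, row in enumerate(restored_face):
        result[top + dy][left:right] = row
    return result

# Toy stand-ins so the sketch runs without any models.
def nearest_2x(img):
    return [[v for v in row for _ in range(2)] for row in img for _ in range(2)]

def brighten(face):
    return [[v + 100 for v in row] for row in face]

demo = two_pass_restore([[1, 2], [3, 4]], nearest_2x, brighten, (0, 0, 2, 2))
```

Only the pixels inside the face box are touched by the second pass, which matches the behaviour seen earlier where the face restorer leaves everything outside the face as it found it.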

CodeFormer & R-ESRGAN 4x+ & ControlNet & Face Resemblance & Stable Diffusion

Boostpixels can combine everything into one cohesive process, adding face resemblance through a custom-trained model that handles facial resemblance in a distinct phase. The facial restoration is enhanced to produce results that are more aesthetically appealing and likely more aligned with the individual's self-perception.

As a point of reference, this operation takes approximately 18 seconds on an NVIDIA RTX 4090 GPU, which gives an idea of the computational power required for this task.

128 pixel image with jpeg compression artifacts	CodeFormer & R-ESRGAN 4x+ & ControlNet & Face Resemblance & Stable Diffusion

Quite an impressive result, considering how little data the input image contains.

Zoomed in to the 128 pixel image with jpeg compression artifacts	Zoomed in to the 1024 pixel upscaled image with CodeFormer & R-ESRGAN 4x+ & ControlNet & Face Resemblance & Stable Diffusion

Last words

Image upscaling has witnessed remarkable development over the last few years, transforming what seemed like Hollywood fiction into tangible reality. From the birth of ESRGAN in 2019 to specialised models of 2022 like CodeFormer, we've seen an extraordinary evolution.

As AI continues to progress, the implications for fields like forensics, medicine, and entertainment are enormous. It is reasonable to anticipate more specialized models aimed at specific use cases, such as upscaling cartoons, photos, human faces, and objects. Alongside this, advances can also be expected in more general methods, including training on larger datasets and better training strategies. Such evolution would not only increase the accuracy of these AI models but also broaden their applicability in real-world scenarios.