How does MidJourney create images in real time?
MidJourney probably integrates diffusion models with language models. The language model interprets the textual description, extracting key features and themes. This interpreted information then guides the diffusion process, ensuring that the generated image aligns with the textual description.

Understanding how diffusion models work
Artificial intelligence has witnessed a surge in capabilities in the past two years, particularly in the domain of image generation. One such intriguing development is the ability to generate images in real time based on textual descriptions. MidJourney, a platform that has garnered attention for this capability, leverages advanced diffusion models to achieve this feat.
But how does it work?
Before diving into MidJourney's technology, it's essential to understand the broader landscape of image generation. Traditionally, Generative Adversarial Networks (GANs) have been the go-to models for generating images. GANs consist of two neural networks – the generator and the discriminator – which are trained against each other. The generator creates images, while the discriminator evaluates them. Over time, the generator improves its output, aiming to produce images that the discriminator can't distinguish from real ones.
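To make that pairing concrete, here is a minimal PyTorch sketch of a generator and a discriminator. The layer sizes and the 28x28 image shape are illustrative choices for this sketch, not taken from any production system.

```python
import torch
import torch.nn as nn

# Generator: maps a random latent vector to a flattened 28x28 image.
generator = nn.Sequential(
    nn.Linear(100, 256), nn.ReLU(),
    nn.Linear(256, 28 * 28), nn.Tanh(),
)

# Discriminator: scores an image as real (close to 1) or generated (close to 0).
discriminator = nn.Sequential(
    nn.Linear(28 * 28, 256), nn.LeakyReLU(0.2),
    nn.Linear(256, 1), nn.Sigmoid(),
)

# One adversarial exchange: the generator produces fake images,
# and the discriminator judges how real they look.
noise = torch.randn(16, 100)             # batch of 16 random latent vectors
fake_images = generator(noise)
realism_scores = discriminator(fake_images)
print(realism_scores.shape)              # torch.Size([16, 1])
```

During training, the two networks are optimised with opposing objectives, which is exactly the instability the next paragraph alludes to.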
However, GANs, while powerful, have limitations. Training can be unstable, and they often require vast amounts of data and computational power.
Enter diffusion models, which offer a different approach to image generation.
Diffusion models, in the context of image generation, are a type of generative model that simulate a process of diffusion (or, more precisely, reverse diffusion) to produce images. Instead of the adversarial process seen in GANs, diffusion models operate by gradually transforming random noise into a coherent image over several steps.
The idea is akin to starting with a canvas of random static and progressively refining it until a full image emerges. This process is guided by a series of probabilistic steps, each refining the image further. The model learns the optimal transformations from training data, ensuring that the final images are coherent and aligned with the desired output.
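Conceptually, generation looks like the toy loop below. Here `denoise_step` is a hypothetical stand-in for the trained network that performs each refinement, so this sketch only shows the loop structure, not a real model.

```python
import torch

def denoise_step(image, step):
    # Stand-in for a trained network that predicts and removes a little
    # noise at each step; a real model would be a large neural network.
    return image * 0.98

# Start from pure random noise and refine it over many small steps.
image = torch.randn(3, 64, 64)      # a 3-channel, 64x64 "canvas" of noise
for step in reversed(range(50)):
    image = denoise_step(image, step)
# After the loop, a trained model would leave a coherent image here.
```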
For platforms like MidJourney, the challenge isn't just generating images but doing so based on textual descriptions. This requires the model to understand the semantics of the input text and translate that understanding into visual elements.
MidJourney's model is proprietary and not publicly documented, but to achieve this it probably integrates diffusion models with language models. The language model interprets the textual description, extracting key features and themes, and that interpretation then guides the diffusion process so the generated image aligns with the prompt.
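Because MidJourney's stack is closed, the clearest way to see this pairing is through an open-source analogue. The sketch below uses Hugging Face's diffusers library with a Stable Diffusion checkpoint; the model name, prompt, and step count are illustrative, and MidJourney's internals may differ considerably.

```python
# pip install diffusers transformers torch
import torch
from diffusers import StableDiffusionPipeline

# Load an open-source text-to-image diffusion pipeline. The text encoder
# and the diffusion model are bundled together, mirroring the pairing
# described above.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")  # a GPU makes generation practical

# The prompt is encoded by the language model inside the pipeline, and
# that embedding guides every denoising step.
image = pipe("a serene sunset over the ocean", num_inference_steps=30).images[0]
image.save("sunset.png")
```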
The process possibly begins with an initial noise tensor, essentially a random array of values that doesn't resemble any meaningful image. Think of this as a canvas filled with random splatters of paint.
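In code, that noisy "canvas" is simply a tensor of random values. The shape below (one image, four latent channels, 64x64) mirrors what latent diffusion models commonly use and is an assumption rather than a known MidJourney detail.

```python
import torch

# A random starting tensor: no structure, no meaning yet.
# Shape is illustrative: 1 image, 4 latent channels, 64x64 resolution.
latents = torch.randn(1, 4, 64, 64)
print(latents.mean().item(), latents.std().item())  # roughly 0 and 1
```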
Before the diffusion process starts, the system needs to understand the text prompt. A language model or a text encoder processes the prompt and converts it into a numerical representation known as an embedding (in practice, one or more fixed-size vectors). This embedding captures the semantic essence of the text and guides the diffusion process to ensure the final image aligns with the prompt.
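As an example of this step, the snippet below encodes a prompt with CLIP's text encoder from the transformers library. CLIP stands in here because MidJourney's actual encoder is not public, and the output is a set of per-token vectors rather than a single one.

```python
# pip install transformers torch
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32")

prompt = "a serene sunset over the ocean"
tokens = tokenizer(prompt, padding="max_length", return_tensors="pt")
embedding = text_encoder(**tokens).last_hidden_state

# One vector per token, each capturing part of the prompt's meaning.
print(embedding.shape)  # e.g. torch.Size([1, 77, 512])
```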
The core idea behind diffusion models is to simulate a process where noise is added or removed from the initial tensor in a controlled manner over several steps. The model is trained to understand how real images would evolve if noise were progressively added to them. During generation, this process is reversed, starting with noise and removing it step by step to approach a real image.
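The forward, noise-adding half of that process can be written down compactly under a standard DDPM-style schedule. The schedule values below are common open-source defaults, not MidJourney's.

```python
import torch

# Linear noise schedule over 1,000 steps (a common DDPM default).
betas = torch.linspace(1e-4, 0.02, 1000)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

def add_noise(clean_image, t):
    """Jump straight to noise level t: x_t = sqrt(a_bar)*x_0 + sqrt(1-a_bar)*eps."""
    noise = torch.randn_like(clean_image)
    a_bar = alphas_cumprod[t]
    return a_bar.sqrt() * clean_image + (1 - a_bar).sqrt() * noise

clean = torch.rand(3, 64, 64)          # pretend this is a training image
slightly_noisy = add_noise(clean, 50)  # early step: the image is still visible
mostly_noise = add_noise(clean, 900)   # late step: close to pure noise
```

The model is trained to undo exactly these corruptions, which is what makes the reverse, noise-removing direction possible at generation time.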
The text embedding described above plays a crucial role in guiding the diffusion process. At each step, the model not only considers the current state of the image tensor but also the text embedding. This ensures that the noise removal or addition aligns with the desired output as specified by the prompt.
For instance, if the prompt is about "a serene sunset over the ocean," the embedding guides the diffusion process to favour patterns and colours associated with sunsets and oceans. It isn't a one-shot operation. It involves multiple iterations, with each step refining the image further. At each iteration, the model:
- Evaluates the current state of the image tensor.
- Considers the guidance from the text embedding.
- Decides on the optimal noise addition or removal for that step.
This iterative process ensures that the image gradually transitions from random noise to a coherent representation, as the sketch below illustrates.
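In the toy version of that guided loop below, `predict_noise` is a hypothetical stand-in for the trained denoising network, and the mixing of conditional and unconditional predictions is classifier-free guidance, a technique common in open diffusion models; whether MidJourney guides its sampler exactly this way is not publicly known.

```python
import torch

def predict_noise(image, step, text_embedding):
    # Hypothetical stand-in for a trained denoiser that estimates the noise
    # present in `image` at this step, conditioned on the text embedding.
    return torch.zeros_like(image)

def denoise(image, steps, text_embedding, empty_embedding, guidance_scale=7.5):
    for step in reversed(range(steps)):
        # Evaluate the current state twice: with and without the prompt.
        cond = predict_noise(image, step, text_embedding)
        uncond = predict_noise(image, step, empty_embedding)
        # Classifier-free guidance: push the update towards the prompt.
        noise_estimate = uncond + guidance_scale * (cond - uncond)
        # Remove a fraction of the estimated noise for this step.
        image = image - noise_estimate / steps
    return image

latents = torch.randn(1, 4, 64, 64)
text_embedding = torch.randn(1, 77, 512)   # from the text encoder (illustrative)
empty_embedding = torch.zeros(1, 77, 512)  # embedding of an empty prompt
image = denoise(latents, steps=50,
                text_embedding=text_embedding, empty_embedding=empty_embedding)
```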
After several iterations, the noise tensor transforms into an image that aligns with the text prompt. The specifics of the noise removal at each step are governed by parameters learned during training, while the number of iterations is typically a sampling-time setting that trades speed against quality.
While computationally intensive, these models are optimised for real-time performance through techniques like model pruning, quantisation, and efficient neural architecture search, which reduce computational demands without significantly compromising quality.
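Of those techniques, dynamic quantisation is the simplest to show. The sketch below quantises the linear layers of a toy PyTorch model to 8-bit integers; it illustrates the general idea rather than MidJourney's actual optimisation pipeline.

```python
import torch
import torch.nn as nn

# A toy model standing in for part of an image-generation network.
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 512))

# Dynamic quantisation: the weights of Linear layers are stored as 8-bit
# integers, shrinking the model and speeding up CPU inference.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 512)
print(quantized(x).shape)  # torch.Size([1, 512])
```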
Specialised hardware such as GPUs and TPUs significantly speeds up the generation process. These components are designed for parallel processing, which is crucial for neural network computations.
Instead of relying solely on centralised servers, the workload can also be distributed through edge computing, where data is processed closer to the source. This reduces latency and leads to faster image generation.
However, while impressive, real-time image generation isn't without challenges.
Achieving high-quality images in real time is challenging. There might be scenarios where the model opts for speed over precision, leading to images that might not fully capture the nuances of the input text.
Textual descriptions can be ambiguous. For instance, the phrase "a large green field with a blue sky" can be visualised in numerous ways. Handling such ambiguities is a challenge for any text-to-image synthesis model.
Even with optimisations, real-time image generation is computationally intensive. This can lead to higher operational costs and might limit scalability.
The capabilities of platforms like MidJourney are just the tip of the iceberg. As diffusion models and other generative techniques evolve, we can expect even more accurate and faster image generation. Potential advancements might include better handling of ambiguous descriptions, integration of user feedback in real-time, and generation of animated sequences or videos based on textual narratives.
Only time will tell what the future looks like. As of today, MidJourney struggles with keeping characters consistent across images and with rendering text inside them, but it's only a matter of time before those gaps close.
Will it replace art? What happens when all art is AI generated and there are no more original data sets to learn from? Is copyright a genuine concern? Is human creativity at risk? Should art even be scalable?
There are a million questions that the future will answer. For now, image-generation models are here to stay. How they evolve will shape how we define art and visual design in the years to come.
About Me:
I write 'cos words can be fun. More about me here. Follow @hackrlife on X