Tuesday, November 14, 2023

ChatGPT Explains Why It's Such A Lousy Artist

I gave ChatGPT the following prompt.

Why does AI work so much better for writing fiction than it does for creating images? When you and I write our stories, poor results are easy to correct. When my prompts give me poor image results - too many people, improper facial expressions, people in the wrong locations, etc. - it can be nearly impossible to edit the prompts to get what I want.

Here's its reply.

The difference in performance between AI-generated fiction and AI-generated images largely stems from the inherent nature of text and visual content, as well as the complexity and flexibility of the respective AI models.

  1. Complexity and Specificity: Writing fiction involves creating narratives, which can be guided and adjusted with relative ease. Text-based AI can follow a narrative flow and maintain consistency in themes, character development, and story arcs. In contrast, image generation requires translating abstract concepts into precise visual details. While text can be ambiguous or suggestive, an image needs to be specific in its depiction. This specificity makes it challenging to accurately capture nuanced details, like facial expressions or exact placements, in response to a prompt.
  2. Editing and Revision: When writing fiction, it's often easier to make incremental edits and revisions. If a part of the story doesn't fit, it can be rephrased or rewritten. However, with images, even minor adjustments require a complete regeneration, which can unpredictably alter other aspects of the image. The generative process for images doesn't easily allow for 'editing' specific parts while keeping others intact.
  3. Interpretation and Ambiguity: Text allows for a certain level of interpretation and ambiguity, which can be a strength in storytelling. However, images are more literal and concrete. The AI has to make specific choices about colors, shapes, and spatial relationships, which can sometimes lead to unexpected or undesired results.
  4. Training and Model Limitations: AI models for text and image generation are trained differently. Text-based models like GPT are trained on vast amounts of diverse textual data, enabling them to understand and generate a wide range of writing styles and content. Image generation models, while also trained on large datasets, may encounter more limitations due to the diversity and complexity of visual data. Also, accurately interpreting and executing complex visual prompts is a significant challenge.
  5. User Expectations and Perceptions: When reading text, readers often use their imagination to fill in gaps or interpret descriptions. In contrast, images are immediately visually processed, and any discrepancies or errors are more directly apparent. Users might have a very specific visual outcome in mind, which can be hard to meet precisely through a text prompt.

In conclusion, the differences in the capabilities and outcomes of AI in fiction writing and image generation are rooted in the fundamental differences between text and visuals, as well as the current state and limitations of AI technology in interpreting and generating these different forms of content.

That's quite good.
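
Point 2 is the one that bites in practice. Inside ChatGPT there's no way to touch up one part of a picture; the only lever is rewording the prompt and regenerating everything. For comparison, here's a minimal sketch of the two operations using an open model — Stable Diffusion via the diffusers library, which is my stand-in for illustration, not whatever DALL-E does internally. The prompt, file names, and mask are all made up.

```python
# Sketch only: Stable Diffusion via Hugging Face diffusers, standing in for
# whatever the chat interface uses. Model IDs are the standard ones as of
# late 2023; runs on CPU (slowly) unless you add .to("cuda").
import torch
from PIL import Image
from diffusers import StableDiffusionPipeline, StableDiffusionInpaintPipeline

prompt = "two teenage brothers on a porch at sunset, one of them scowling"

# 1) Text-to-image: every tweak to the prompt regenerates the entire image.
# Fixing the seed makes a run repeatable, but any change in wording can still
# shuffle faces, poses, and placement, because nothing is edited "in place".
txt2img = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
seed = torch.Generator().manual_seed(42)
whole_image = txt2img(prompt, generator=seed).images[0]
whole_image.save("take_one.png")

# 2) Inpainting: keep the picture, redo only the masked region (white = redo).
# "take_one.png" and "scowl_mask.png" are hypothetical files for this sketch;
# the mask would be hand-drawn over the face you want changed.
inpaint = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting"
)
fixed = inpaint(
    prompt="a calm, friendly expression",
    image=Image.open("take_one.png"),
    mask_image=Image.open("scowl_mask.png"),
).images[0]
fixed.save("take_two.png")
```

That second step is exactly what the chat interface doesn't offer, which is why small fixes keep turning into full re-rolls.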

Meanwhile, I've been working on the characters from my Hardy Boys ripoff mystery-comedy. I rather like the Aussies, Malcolm, 17, and Angus, 15. Malcolm is too ripped for my tastes, but he has the menacing air I was hoping to get.

3 comments:

tim eisele said...

If we take the AI at its word, this does explain a lot. If that is in fact how the artbots are working (trying to generate the image as a whole instead of piecewise), I can see where it would have a lot of trouble adjusting itself to changing requests if it thinks that an image has to be a single unit.

It would also explain a lot of the divergence from human art, because humans don't necessarily work that way. We think in terms of layers (foreground versus background versus far distance) and discrete objects (the background is "a MOUNTAIN and some TREES and a LAKE illuminated by THE SETTING SUN", not "THE BACKGROUND IS A SCENIC SUNSET VISTA").

We may actually be getting to the point where programmers can debug software by just asking it what it thinks it is doing, like you just did here. If this trend continues, I can see programmers eventually becoming a lot like psychiatrists.


tim eisele said...

It's also interesting that this image actually has depth of field (the house behind them is defocused), while that wasn't happening in the earlier images you posted. So, there is some hope that the other things I have been complaining about will also get cleaned up over time.

K T Cat said...

Tim, it wants you to give it prompts specifying what is in the foreground, midground and background.

Then it totally ignores those specifications if there are more than 2 characters in the scene.

Again, I wish it had a syntax.
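
Purely for illustration, here's the kind of syntax I have in mind. Nothing like it exists in these tools; the helper and all the descriptions are invented, and in the end it just flattens into the plain-English prompt the model actually accepts.

```python
# Toy sketch of a layered prompt "syntax": name each layer explicitly,
# then flatten it into an ordinary text prompt. The function and the
# descriptions are hypothetical, loosely based on the characters above.
def layered_prompt(foreground, midground, background):
    """Compose a single prompt string from explicitly named layers."""
    return (
        f"Foreground: {foreground}. "
        f"Midground: {midground}. "
        f"Background: {background}."
    )

prompt = layered_prompt(
    foreground="Malcolm, 17, scowling with his arms crossed",
    midground="Angus, 15, grinning and leaning on the porch rail",
    background="their house at dusk, slightly out of focus",
)
print(prompt)
```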