Creative AI has been around for some time now, with artists experimenting with its affordances and applications, but tools like DALL-E, Midjourney, Stable Diffusion, and others offer an easy way for just about anyone to play around with this sophisticated technology without requiring any technical expertise. Using these tools, an image can be generated by typing in a simple prompt — it could be one word or a detailed phrase — or by using an existing image. These user-friendly tools have quickly harnessed and normalized the power and frivolity of creative artificial intelligence, with results ranging from absurd to extraordinary. But the AI in these tools feels flat: a superficial offering replicated across the dozens of options available. Users aren’t required to know about or understand the makeup of the intelligence behind the software, a disregard that inadvertently reduces these tools to the banality of an Instagram filter. The future success of AI depends on the way it gets embedded in everyday life — achieving near invisibility as a medium — and with that deep cultural integration comes a kind of ghostwriting collaborative relationship with users. In a post-truth world of deepfakes and bots, we would do well to turn a spotlight on the AI in these tools and critically examine its affordances and limitations toward new forms of image-making and cultural activity.
While we may wonder if AI will ruin creativity altogether, it’s probably safe to say it won’t destroy art — any more than photography destroyed painting or the internet made us stupid — but it’s worth digging into how it’s changing the way we think and create. The creative playfulness of AI image generation encourages us to rethink digital aesthetics, the provenance of new images (ontologically caught between being photographs and data visualizations), and notions of cultural and individual identity, memory, and authorship. Addressing these shifts requires asking: How exactly are new images synthesized? Where do the images used to train these models come from? What technical and aesthetic roles does language play within the latent space of AI? Should we be concerned about racial, gender, aesthetic, and other biases in these tools?
Perhaps the most alluring aspect of AI image generation is the process of generation itself.
In the most simplified terms, it uses a combination of machine vision, natural language processing, and noise reduction to generate a new image based on the user’s input, which can be a text-based prompt or an existing image. Each system, whether DALL-E, Midjourney, Stable Diffusion, or another, is trained on hundreds of millions of images taken from the web and other image databases. Each uses two trained neural networks: one to process text, the other to compare the generated image against reference images. The whole process involves learning the relations between images and how they’re described: learning how to process semantics from natural language — ordinary, everyday human language as opposed to programming code — and learning how to read images in order to generate new ones.
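To make the mechanics less abstract: the open-source Stable Diffusion model can be driven from a few lines of Python through Hugging Face’s diffusers library. The sketch below is a minimal illustration of the pipeline just described (a text encoder paired with an image-synthesis network); the model identifier and prompt are illustrative choices, and nothing here claims to reflect the proprietary internals of DALL-E or Midjourney.

```python
# A minimal text-to-image sketch using the open-source diffusers library.
# The model name and prompt are illustrative, not prescriptive.
import torch
from diffusers import StableDiffusionPipeline

# The pipeline bundles the two trained networks described above: a text
# encoder that processes the prompt and a denoising network that
# synthesizes the image.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
)
pipe = pipe.to("cuda")  # assumes a CUDA-capable GPU

# One phrase of ordinary language in; one synthesized image out.
image = pipe("a flooded, moldy Dutch still life").images[0]
image.save("still_life.png")
```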
The images used to train these models span countless aesthetics drawn from every kind of visual media shared on the web or stored in image databases, but it’s important to note that the AI in these tools, however sophisticated it might be, doesn’t have awareness, consciousness, or autonomy. Though each tool is capable of synthesizing images that resemble hand drawings, photographs, digital illustrations, and other media, the system doesn’t know the cultural and material differences between them. Ontologically, all material results are the same, even though they may differ semiotically. Additionally, while a text prompt can include any use of natural language, in the context of machine understanding there seems to be a preference for something between a description and a command. In her book Artificial Unintelligence, Meredith Broussard writes, “Part of the communication problem that exists in computational culture derives from the imprecision of everyday language and the precision of mathematical language.”*
Natural language is loose: polysemous, contextual, impossible to own or contain. Computer code, on the other hand, is precise, designed to do exactly what you tell it to do. Without supportive relations between a prompt and the system’s training, the results can miss the mark, making a prompt more of a mere suggestion that doesn’t go very far creatively.
Multimedia artist Will Pappenheimer seems to understand that the key affordance of text-based image generation is the text itself. His images illustrate the value of a well-crafted text prompt, achieving a cleverness that makes a poet out of a visual artist. Some of my favorites include: (What is?) flooded moldy dutch still life; (What is?) gerrymandered soul-search hair-do; identical twin lips tree diagram; and Pre-Raphaelite identical twins. The resulting images visually reflect Pappenheimer’s discerning use of language and ideas. They are absurd and uncanny, showing recognizable objects placed in bizarre compositions. Despite his overt references to the Western art historical canon, the images recall an unspecified period of time, creative style, and even medium. For example, the flooded moldy dutch still life works show compositions derived from traditional Dutch still life paintings but read as digitized film stills.
To me, they resemble a cross between Peter Greenaway’s The Cook, the Thief, His Wife, and Her Lover (1989) and Derek Jarman’s Caravaggio (1986). The moldy fruit and composition evoke the historical past, but the pixelization brings us somewhere into the 1980s, a past to be sure but one acquainted with computers and digital images. However, the effect also gestures toward the present in its creation of a false, failed pixelization. Like Jarman’s and Greenaway’s films, Pappenheimer’s moldy fruits are performative and meant to resemble the qualities of the past while they inevitably reveal visual clues that link the images to the present (which, in this case, is also the past).
The text prompts in these examples are worthy of being image titles. More than instruction, they give shape to an idea or a mood that’s materialized in each image. The visual results reveal how the system negotiates the semantic disconnect between, say, “gerrymandered” and “soul-search”, two concepts without an organic relation between them, imbuing them with what cultural theorist Mark Fisher referred to as a strange simultaneity and conferring a past-present-future aesthetic of time.
This effect is also evident in Unpredictable Past, a series of AI-generated images by the artist and media theorist Lev Manovich. Using Midjourney, Manovich questions whether history can be understood and replicated by the system’s unique artificially intelligent reasoning, generating images of a past he experienced firsthand while at school in Moscow. The prompt — group photo of students in 10th grade of Russian high-school in [year] — is its own art form, as it satisfies the system’s preference for code-like natural language. By minimizing the complexity of each word and using the prompt uniformly across multiple image generations, his experiment yields the most ideal conditions for machine understanding. The words in this grouping are constants, except the year, which is the variable, the X factor that can be used in a routine. This works within the machine learning system’s logic and helps close the gap between everyday and mathematical language. This lengthy, precise description, providing almost step-by-step instructions, recalls Cory Arcangel’s Photoshop CS images, but it also proves that descriptions and commands aren’t the same in the context of machine understanding.
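Manovich’s constant-plus-variable structure is, in effect, a small program. Below is a sketch of that routine, using the open-source pipeline from the earlier example as a stand-in (Midjourney itself is operated through Discord rather than a public Python API) and illustrative years rather than Manovich’s exact set.

```python
# Manovich's prompt as a routine: every word held constant except the year.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

TEMPLATE = "group photo of students in 10th grade of Russian high-school in {year}"

for year in (1950, 1960, 1970, 1980):  # illustrative years, not Manovich's set
    image = pipe(TEMPLATE.format(year=year)).images[0]  # only the X factor varies
    image.save(f"students_{year}.png")
```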
These historical fictions produce a temporal restlessness that hints at but doesn’t commit to photographic realism. They register timescales that appear muddled both in the visual representation of people and in how the photographic medium is simulated. Regardless of the year referenced in the prompt, the past is represented in broad strokes: uniforms and hairstyles are relatively unchanging. The effect feels all studium and no punctum, in Roland Barthes’s terms, and not at all how historical photographs normally function. Instead, precise details feel blurred and incomplete, much more like they represent a faded memory.
Like Pappenheimer’s images, they proffer a strange simultaneity, producing an impossible scale of homogeneity and visual repetition across the student subjects. This is what makes the images aesthetically appealing, and almost provocative in their refusal to adhere to the logic of camera photography and its cultural and historical positioning. They aren’t a composite of existing images of people but rather a remixing of what real people in this imagined time and place might look like. They confer an aesthetic of imitation and approximation: almost real people in an almost real place and time. They are also a remixing of ideas and associations between existing images and specific words in the prompt, like “high school” and “Russian”. Given the current political landscape, it’s impossible NOT to think about how these newly generated images incorporate the present in their recreation of history. We could even say this series offers a present-historical fiction reflecting a war that is currently taking place.
In this way, Midjourney’s AI gestures toward a disengagement between language, code, and cultural understanding. The images illustrate that its AI is incapable of automating human qualities that exceed a superficial description of how humans are represented photographically, despite improved human face synthesis in newer iterations of the software. This is not the fault of language, which possesses the capacity for eloquence and complexity, but rather of the AI’s misunderstanding and disregard of cultural and individual nuance.
The developers of AI image generators haven’t trained their systems to understand history, geography, and human nature — three key features of Manovich’s work. I’m reminded of transdisciplinary artist Stephanie Dinkins’ work with the humanoid robot BINA48. The robot was created in 2010 using a real, living person, Bina Aspen Rothblatt, an African American woman, as a visual and intellectual template. The robot consists of a head and shoulders only but resembles the person on which it is based, complete with rubber skin and over 20 facial motors that activate realistic facial expressions. The robot uses a combination of AI technologies to collect and interpret information, including voice recognition software and eyes fitted with video cameras to interpret visual information. An internet connection gives the robot chatbot capabilities, all in support of BINA48 being able to have meaningful conversations with humans. BINA48 raises many questions about how AI and other sophisticated technologies are modeling real human beings and their experiences, possibly in an attempt at immortality through an artificial replica or intellectual clone of a person that can outlive them. We must also wonder how well it can replicate a person’s identity — accounting for things that go deeper than biographical data and physical appearance, such as the makeup and experience of marginalized groups, or the ability to distinguish between identity and biology.
Dinkins began Conversations with Bina48 in 2014, a project exploring how AI intersects with race, gender, and history through one-on-one conversations with BINA48. Her interactions with the robot illustrate its inability to fully grasp qualities of human identity: answers to questions like What emotions do you feel? are weak, and questions like Who are your people? go unanswered. In other recordings of conversations with the robot available online, Bina Aspen Rothblatt and BINA48 humorously disagree about many things, including superficial biographical data, like their favorite color or movie, which raises the question: How and when did BINA48 watch a movie? And why do their answers differ on so many subjective topics if the robot can’t claim to have personal taste? The falseness of BINA48 is on one hand comical but on the other deeply problematic. False claims are embedded in BINA48’s understanding of the world, through which empty signifiers can be innocuous in one context and pass as dangerous truths in another.
Dinkins’ conversations with the robot expose the disconnect between BINA48 and Bina the person, which says more about what’s missing from the robot’s AI training than anything else, revealing a pattern among AI researchers and developers of ignoring the ethical and sociological implications of their pursuits.
The development of BINA48’s AI should give us pause, as it decontextualizes the cultural, historical, and human factors of its intelligence. On one hand, it’s unfair to equate BINA48’s AI with Midjourney’s. BINA48 is not fooling anyone into thinking the robot is a real person (yet), and its mechanical eye and head movements remain visual reminders that it isn’t even close. On the other hand, the uncertainty of the AI’s substrate is consistent across both BINA48 and AI image generators. The AI in the latter arguably operates in more nuanced ways, hiding its selections and choices more effectively. As Timnit Gebru writes in Wired, “The dangers of these models include creating child pornography, perpetuating bias, reinforcing stereotypes, and spreading disinformation en masse, as reported by many researchers and journalists.”† Moreover, the visual realism of Midjourney and other image-generating tools isn’t far from closing the gap between “real” and “fake” images of objects, places, and people.
Working with neural networks and natural language processing, AI artist Mario Klingemann’s Appropriate Response (2020) humorously offers words of wisdom to its visitors. Installed as a kind of altarpiece — mixing the aesthetics of church and school, and even hinting at a stock exchange board (via a split-flap text display) — it constructs a composite of the places we trust, even if superficially or blindly, to seek advice and guidance. The work answers the question: How much meaning can you put into just 125 letters? The AI was trained on 60,000 quotes and aphorisms collected from the internet. Answers are delightfully absurd but don’t exceed the tone and scope of a typical adage: If they’re lonely, it’s because they’re all over the place; Don’t forget to eat, drink and laugh and be proud of your health; and so on. Most importantly, it illustrates the influence of words. For Klingemann, “They can move mountains. They can make people do things. They can change their lives.”‡ This work draws on their power, letting the force of words become self-evident and visible, in turn giving participants a forum through which to critique their own submission to that power. In this way, AI can be used as part window into the external world of human belief and part mirror, offering self-reflection, clarity, and self-critique of our role in making meaning.
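Klingemann’s trained model and quote corpus are his own, but the general technique (a language model steered toward aphoristic text, its output capped at the split-flap display’s 125 letters) can be sketched with an off-the-shelf model. Everything below, from the seed text to the candidate count, is an assumption for illustration, not a description of his system.

```python
# A loose sketch of constrained aphorism generation, not Klingemann's system.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")  # off-the-shelf model

def appropriate_response(seed: str, limit: int = 125) -> str:
    """Sample short candidate texts and return the first that fits the display."""
    candidates = generator(
        seed, max_new_tokens=40, num_return_sequences=5, do_sample=True
    )
    for c in candidates:
        text = " ".join(c["generated_text"].split())  # collapse whitespace
        if len(text) <= limit:
            return text
    return candidates[0]["generated_text"][:limit]  # fall back to truncation

print(appropriate_response("Do not forget"))
```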
The prophetic potential is shared and perhaps augmented in Sasha Stiles’ Technelegy, a project that seeks to fuse the artist and the machine in near-equal collaboration to identify and understand transhuman experience more fully. This AI poet alter ego is the result of intense research into natural language tools to unpack the very operations of techne: the weaving and fabrication of sign systems that communicate, express, document, and generate meaning. AI poetry reminds us that language is a technical system with its own set of rules and conditions, but it’s equally a flexible art form that moves and means on its own terms. These dueling and complementary conditions produce a wide range of outcomes and expressive possibilities, but in Stiles’ work they aren’t strictly governed by the whims of AI; rather, they are shaped and ushered equally between the artist and the program. Theirs is a collaboration in which creativity moves recursively between creator and system. Both Stiles and Klingemann bring deep knowledge, creative finesse, and thoughtful collaboration with their technical tools to uncover the hidden aesthetic potential of AI.
But AI for the masses offers a different framework for creating new work. AI-generated images depend on diffusion — a deep learning technique centered on the elimination or reduction of noise. This produces an uncertain substrate, one subject to the highly black-boxed operation of deciphering between signal and noise. It is an already familiar condition of computational photography — the camera technology on our smartphones, which creates new images through an elaborate remixing of other images you’ve taken or accessed on your device. Sifting through potentially thousands of images, this process requires choosing between signal and noise.
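Schematically, that noise-reduction loop is simple to express, which makes its black-boxed character all the more striking: everything interesting hides inside the trained network. In the sketch below, the model is a stand-in, and real samplers (DDPM, DDIM, and others) use far more careful update rules and schedules than this simplified version.

```python
# A schematic reverse-diffusion loop: begin with pure noise, then
# repeatedly subtract the noise a trained network predicts.
import torch

def diffuse(model, text_embedding, steps: int = 50) -> torch.Tensor:
    x = torch.randn(1, 4, 64, 64)  # a latent "image" that starts as static
    for t in reversed(range(steps)):
        # The network decides what counts as noise, conditioned on the text.
        predicted_noise = model(x, t, text_embedding)
        x = x - (1.0 / steps) * predicted_noise  # peel a little noise away
    return x

if __name__ == "__main__":
    # A dummy "denoiser" so the sketch runs end to end.
    dummy = lambda x, t, emb: 0.02 * x
    print(diffuse(dummy, text_embedding=None).shape)  # torch.Size([1, 4, 64, 64])
```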
Nearly ten years ago, Hito Steyerl wrote about what was then a new phenomenon: taking pictures with a smartphone, which relies on software as much as its lens to produce images. Steyerl writes, “The result might be a picture of something that never even existed, but that the algorithm thinks you might like to see.” Though computational photography uses a different combination of algorithms than AI image generators, learning from images accessed through or taken with a person’s smartphone, it similarly relies on making distinctions between signal and noise. For Steyerl, the results are grim: “It will increase the amount of noise just as it will increase the amount of random interpretation.”§ But who decides what is signal and what is noise? Technically, AI image generators synthesize images from the noise of millions of poor images circulating the web and stored in image databases. What is a system really programmed to recognize, ignore, and transform? And what does it mean when culture and language are transformed into data before being synthesized into new forms of cultural content? Are we creating ideal images? If so, whose ideals do they serve?
There is no doubt the ability to make any new image one can imagine using words or an image prompt is a creative affordance, but this freedom assumes a ubiquitous, objective origin that simply doesn’t exist. Images aren’t so much new as they are re-imagined through existing images, and this substrate is only further concealed as photorealism improves. It’s unclear how visual details are weighted, though such weighting options are offered extensively by Midjourney. But we can’t help but wonder: How is language in the text prompt or visual data parsed and measured? How do misinformation and cultural stereotypes shape image synthesis?
Nancy Burson’s experiments with composite photography in the 1980s demonstrate how images can be synthesized in ways that clarify and position a work’s cultural and political critique. The uneven fusion of faces in Warhead I (1982), for example, is a composite of five world leaders in which the individual visual influence of each leader corresponds to the number of nuclear warheads in their possession. Though distorted and askew, the face of American president Ronald Reagan dominates at 55%, with heavy traces of Soviet leader Leonid Brezhnev at a close 45%. The other three world leaders infused in the image — Margaret Thatcher, François Mitterrand, and Deng Xiaoping — each account for less than 1% of the total image, making them merely drops in a much larger digital bucket. Burson’s experiment is a visualization of a global nuclear threat represented accordingly by the most likely transgressors at the time the image was made. The image doesn’t attempt to predict the future, nor rule out the possibility that another world leader might gain access to nuclear warheads and wreak global havoc should they choose to. What’s striking about it is its transparency: the source material is identified (stock photos of the world leaders were used), and the formula (the percentage breakdown) with which the final image was generated is expressed in the image description. Moreover, the distortion of the final image is left intact to express both the monstrosity of this Frankenstein-like visual experiment and the global threat to which it refers, while avoiding any attempt to pass as an image of a real person. The fakery and weighted bias of this image are transparent, intentional, and explicit. In this way, Burson’s image doesn’t share the same ethos as diffusion, but it speaks to the ways that aspects of society can be mathematically determined, predicted, or summarized visually.
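Burson’s formula is transparent enough to restate as arithmetic. The sketch below expresses that weighted averaging, with hypothetical filenames standing in for the aligned source portraits; her 1982 process used custom hardware and software, not this code.

```python
# Warhead I's arithmetic as a weighted average of aligned portraits.
# Filenames are hypothetical stand-ins; images must share one size.
import numpy as np
from PIL import Image

weights = {
    "reagan.jpg": 0.55,       # 55% of the composite
    "brezhnev.jpg": 0.45,     # 45%
    "thatcher.jpg": 0.003,    # the remaining leaders each
    "mitterrand.jpg": 0.003,  # contribute less than 1%
    "deng.jpg": 0.003,
}

total = sum(weights.values())
composite = None
for path, w in weights.items():
    face = np.asarray(Image.open(path).convert("L"), dtype=np.float64)
    layer = (w / total) * face  # normalize so the weights sum to 1
    composite = layer if composite is None else composite + layer

Image.fromarray(composite.astype(np.uint8)).save("warhead_composite.png")
```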
In her groundbreaking work on digital image resolutions, Dutch artistic researcher Rosa Menkman refers to the underlying problem with advanced image technologies as the “obfuscation of compromises” within the system. Their hidden decision-making, a form of ghostwriting or autocomplete technology, is agreed upon by tech developers, and it can’t be questioned by users if it can’t be examined. The black-boxing effect of AI keeps users ignorant of its biases, unaware of other possibilities, and unable to explain how the translation process actually works when mining millions or even billions of images to synthesize new ones. Drawing on the embedded bias of the Lena photograph used in image-processing research, the most uploaded image to ARPANET, Menkman writes:
How much are the performance, texture and materiality of digital photography actually influenced by the use of the image of Caucasian Lena? What would it have meant for the standardization of digital image compression if the image chosen for the test card would have been the first African American Playboy centerfold Jennifer Jackson (March 1965); or if the 512 x 512 pixel image had instead featured the image of Grace Murray Hopper, one of the first pioneers in computer programming and the person responsible for inventing some of the first compiler-related tools – moreover, the woman who, coincidentally, coined the widely used computer slang “bug.” How much do the compression standards we use on a day to day basis reflect the complexities of the good 512 x 512 pixel Lena image and how well do these standard settings function when capturing another kind of color complexity?¶
Confronting the biased formats and file standards embedded in AI’s source material is a necessary task if we wish to avoid continuing the cycle of obfuscation and cultural harm. We aren’t far from being able to make narrative- and character-based feature-length films. Text prompts would benefit from the structure of a screenplay, where language is used to describe what one might see on screen: actions, film editing instructions, shot types, the introduction of a new character, dialogue, and more are all laid out as descriptive text. Stable Diffusion could easily produce basic flythroughs of landscapes or interior environments, then work up to speaking characters that interact within these fabricated spaces. Video game NPCs already represent a version of this application. I imagine text-to-video would at first yield crude short segments, which could be trimmed and edited together; fully edited long-form works could follow from there. One will have the ability to prompt an Alfred Hitchcock staircase, an Andrei Tarkovsky long take, or a Spike Lee double-dolly shot, but hopefully something more intangible, like the observational intimacy of Agnès Varda. Still, while requesting such features by name and specificity might suggest admiration and respect for the original creator, algorithmic imitation overlooks the craft and art of the original. We live with human bias as a necessary condition of art, but are we satisfied with this scale of computational subjectivity?
Without sensitivity and care at the level of training and design, we risk outsourcing the interpretation of culture, and of what it means to be human, to computer models that are unfit for this task of interpretation and translation; at the very least, we should demand models that don’t undermine or erase it. It is impressive to improve upon the individual capabilities of machines, but we should also be interested in how they may be synthesized into an operable whole. Left unchecked, these tools can push aesthetic mediocrity and unexamined creativity. More seriously, they can programmatically reinforce stereotypes, misinformation, and bigotry with the help of feedback loops created between messy AI training other messy AI. Alternatively, they can facilitate a pathway to a new aesthetic and a portal to new visual and ontological perspectives, guided by the critical tug of the artist behind them. The latter will only be possible if we pay close attention to how these tools are built and make meaning in the world, and guide them as needed.
- *Broussard, Meredith. 2018. Artificial Unintelligence: How Computers Misunderstand the World. Cambridge, MA: MIT Press.
- †Gebru, Timnit. 2022. “Effective Altruism Is Pushing a Dangerous Brand of ‘AI Safety’.” Wired. Retrieved from https://www.wired.com/story/effective-altruism-artificial-intelligence-sam-bankman-fried/
- ‡Onkaos. 2020. Interview with Mario Klingemann. Retrieved from https://underdestruction.com/2020/08/29/appropriate-response/
- §Steyerl, Hito. 2014. “Proxy Politics: Signal and Noise.” e-flux. Retrieved from https://www.e-flux.com/journal/60/61045/proxy-politics-signal-and-noise/
- ¶Menkman, Rosa. 2020. Beyond Resolution. the i.R.D. Retrieved from https://beyondresolution.nyc3.digitaloceanspaces.com/Rosa%20Menkman_Beyond%20Resolution_2020.pdf