Text to Image Prompt: The technology and science behind A.I. and what it could mean for image makers? | PhotoVogue Festival 2023: What Makes Us Human? Image in the Age of A.I.

As we start a conversation on AI and what it means to be human, “Text to Image Prompt: The technology and science behind A.I. and what it could mean for image makers?” dives deep into the emerging creative AI technologies. We demystify the technology behind prompt engineering, curation and generative AI - and the text-to-image and image-to-image tools which are blurring the lines between real and imagined.

Released on 11/22/2023

Transcript

Hey, everyone.

Lovely to be here today,

and thank you, Alessia, for letting me be part

of the festival and kick it off.

I'm going to start just to give a bit of an overview,

actually, of the technologies that drive AI

and how they apply to image making.

I'm actually gonna start with a little bit of background

on myself.

I was gonna play this video,

but it's something I think you should all watch after.

And actually, I didn't realize when I started

that Fred is in this video,

but it's a Today show segment about the launch

of Photoshop in 1990 and the revolution that Photoshop

represented in digital technology and image making.

Why that applies to me is 'cause that's where I started.

Kind of gives away my age a little bit.

But at university and as part of the work that I did,

both learning photography and digital publishing,

Photoshop 1 was one of the first tools

that we started using.

And my career in the technology world

has kind of grown from there.

One of my first jobs was at Lonely Planet Images,

back in the early 2000s, building the first stock image library,

integrating, actually, all the slides

and converting them into the repository,

and creating a search and e-commerce platform

where you could buy and license those images.

When I moved to Europe

(you can tell by my accent that I'm not from here,

I'm Australian), one of the projects

and companies I worked with was a company called Photobox.

It is a gifting and photographic company

where you basically create custom gifts,

photo books, mugs, et cetera.

What was so fascinating about that job

was that it was the largest repository of images

outside Google and Facebook at the time.

And every single product that we made,

which was manufactured through our factories,

was an individual product based on images.

And then, for the last four years,

five years, actually, I've been working at Condé Nast

in a variety of roles on the technology side,

but basically re-imagining, you know, our platforms,

experiences, and livestreams.

But on the flip side, I started as an artist.

I actually trained as a photographer.

When I moved to the UK, I actually paused to raise a family

and build a life in Europe,

and this year reestablished my work as a photographer

and published my first photo book.

So, what we're gonna just talk about today

is actually the technologies behind AI,

in a very simplistic, you know, context.

I'm not an engineer, I'm not a programmer,

I don't come from a computer science background.

I come from a creative and storytelling background,

but I do work with these technologies every day

and with engineers, programmers,

machine learning specialists,

to kinda create these technologies.

But AI has been around for a really long time, right?

The work and the transformation and the exponential stuff

that's happening at the moment

is really around generative AI.

But anytime you've been using your phone or a camera,

you are using some form of AI technology.

Because ultimately, the digital images

that we have today are basically made up of numbers

and mathematics; ones and zeros, right?

Pixels; no more, no less.

And that's when it started, you know, back in the seventies

with the beginnings of the translation of film

into digital mathematics and computer language.

So, what does it mean now, right?

Through all that time, I mean, like, I've been around

doing technology for 30 years,

but the rate of change is exponential.

I thought it was fast when I started,

but now I can't even keep up.

There are technologies being launched

on a daily basis that completely supersede

what was happening the week before.

So, let s look at some AI building blocks.

It's actually quite simple;

I probably shouldn't say that out loud,

but it kinda really is.

Because basically, AI works off understanding the language

and the image and the taxonomy of the thing

that it is looking at.

So, if we think about that in technology terms,

you'll hear words like metadata, or taxonomy,

or information architecture,

but it's basically creating context

around the objects that it sees.

AI can't work without that knowledge, right?

It needs to understand the inputs,

and then obviously whatever goes in

is a manifestation of what comes out.

So, if the inputs that go in are biased,

then unsurprisingly, often the results

that you get are biased.

And within that, AI covers a broad spectrum

of computer science, and we're only gonna touch

on a really small part of that today.

So, the biggest part of AI,

in terms of a learning mechanism,

is the large language model, right?

It's the data that you feed it, from which it translates

and brings back an image.

And in that context, and over probably the last 20 years,

from the context of visual technologies,

we've gone through a very simple transformation.

The first one, which is around the digital camera,

was actually to make an image, right?

To take those transparencies or that film

and convert it into pixels.

From there, we've been learning how to see a picture, right?

To understand that kind of context and add metadata

or information around it.

That is often very keyword-centric:

here is a tree, there's a cat.

And if you watch any of the early videos,

it was all about understanding what is a cat,

what is a tree, what is a car.

From that, we've learned how to describe a picture,

and then ultimately add meaning to it.

And that's where, like, the last 10 years has really focused:

not just saying here is some information

about a cat, but what is that cat doing, is it lying on a bed,

and ascribing an element of knowledge and context to it.

The transformation in the last 12 months

is the generative side, where the algorithms are now working

within themselves and adding in technologies

and almost talking to each other

to create these generative images.

And then, like, literally in the last week,

we can start to do that in real time.

So, where technologies may have taken years or decades

to kind of absorb and learn all this information,

we are now seeing a rate of change on a daily basis,

to the point where you have AIs talking to AIs,

to create the prompts, to create the images in real time.

So, I'm just gonna walk through

very simply how it works, right?

And how these words, context, keywords

ultimately create meaning.

This is a really nice article actually

from the Financial Times in the UK

that was a visual narrative about explaining,

in very simple terms, how AI learns from information

to translate it into context.

Because basically, it has to translate words.

So, how does it do that?

“We go to work by train.”

From that, it basically breaks up those words into subsets

of other words, right?

And they become tokens.

So, everything is basically broken down

into each of its parts, and then rebuilt over time.
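To make the token idea concrete, here is a minimal sketch in Python. It assumes the open-source tiktoken library; any subword tokenizer behaves similarly, and the exact token IDs will differ between models.

```python
# A minimal tokenization sketch, assuming the tiktoken library.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # one common BPE vocabulary

tokens = enc.encode("We go to work by train")
print(tokens)                              # a list of integer token IDs
print([enc.decode([t]) for t in tokens])   # the word pieces behind each ID
```

From the model's point of view, the sentence is now just that list of numbers, and patterns are learned over sequences of them.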

By doing that and adding thousands of articles,

thousands of contexts,

thousands of ways of expressing “train,”

it will look for patterns.

And those patterns are then what it translates back.

When you put into ChatGPT “write me an article”

or “create me an image,”

it is basically taking the sum of all those patterns

to create an image.

It then looks for things that are negative, positive,

like for like, to start to create context.

And within that, it starts to look at generating text

that is similar to what it understood in the beginning,

but it can only understand the inputs that you give it,

which is really important when you get to image making

or storytelling, because it's always looking at the history

or the context of the past, not the context of the future.

So, as we work in an engineering context,

by taking these millions and then billions of outputs,

we can simplify it back into patterns and logic,

and the likelihood of something being consistent

with something else, translating it back into meaning.
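That notion of “consistent with something else” is usually measured geometrically: words and images are mapped to vectors, and similar things end up close together. A toy sketch follows, with hand-made vectors that are purely illustrative:

```python
# Toy "like for like" comparison between hypothetical word vectors.
import numpy as np

def cosine_similarity(a, b):
    # 1.0 means pointing the same way; near 0 means unrelated
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

train   = np.array([0.9, 0.1, 0.3])  # invented embeddings for illustration
commute = np.array([0.8, 0.2, 0.4])
banana  = np.array([0.1, 0.9, 0.2])

print(cosine_similarity(train, commute))  # high: similar contexts
print(cosine_similarity(train, banana))   # low: different contexts
```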

And from that, as it learns, the more information it has,

the more it can understand context and consistency.

It can come back with a sentence,

or it can come back with a narrative.

All AI is built on these large language models or data sets.

So, if you were listening to Refik the other day,

he takes one data set and another

and puts them together to create a narrative;

weather data, location data, image data

to translate into a new story or a new visual narrative.

But ultimately, all AI is built on the foundation

of a language model.

So, therefore, what is vision to a machine?

So, a lot of the work is text-to-image prompts.

We now have image to image, video to video.

So, how is it doing that?

Actually, it is basically the same thing.

It's classification.

When we started to input images, in the early days

(and this is why it took 10 years

to kind of get this exponential growth),

people were manually keywording the metadata

for every image the model had, right?

That is how the model learnt.

So, you take any image: it can identify who it is,

where it is, the location,

it can take information from the JPEG of your photo,

where it was taken, and location data.

And that becomes everything that the machine

and the learning model uses to create images.
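That embedded metadata is easy to read programmatically. Here is a small sketch using the Pillow imaging library; "photo.jpg" is a placeholder path:

```python
# Read the EXIF metadata a camera embeds in a JPEG, via Pillow.
from PIL import Image
from PIL.ExifTags import TAGS

img = Image.open("photo.jpg")  # placeholder file name
exif = img.getexif()           # tag-id -> value mapping written by the camera

for tag_id, value in exif.items():
    # translate numeric tag IDs into names like Make, Model, DateTime
    print(TAGS.get(tag_id, tag_id), value)
```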

So, from this, we can then create new stories.

This is actually one of my favorite pieces of art

by a creative technologist called Dries Depoorter,

who took two very simple data sets:

Instagram images

and the locations where they were tagged,

and then open-source camera data,

to find the source of that image

and then match them, right?

So, all these kind of connections

that occur are based on taking two data sets

and finding the similarities between the language models,

and then translating it into a visual story.
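Mechanically, that kind of match can be as simple as a join on geographic proximity. A toy sketch with invented coordinates and records:

```python
# Toy join of two data sets on location: tagged posts vs. public cameras.
import math

def haversine_km(lat1, lon1, lat2, lon2):
    # great-circle distance between two points on Earth, in kilometres
    r = 6371.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

posts   = [{"id": "post_1", "lat": 50.087, "lon": 14.421}]  # invented records
cameras = [{"id": "cam_42", "lat": 50.088, "lon": 14.420}]

for post in posts:
    for cam in cameras:
        if haversine_km(post["lat"], post["lon"], cam["lat"], cam["lon"]) < 0.1:
            print(post["id"], "was likely within sight of", cam["id"])
```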

But it is fascinating that, in the world today,

given so many, you know, surveillance cameras,

and every one of us posting on Facebook, Instagram, TikTok,

the machines can know everything about you

in an instant; where you are, who you're hanging out with,

the images that you take, and the stories that you tell.

So, what does this mean in the context of curation?

We're gonna talk about two types of AI now:

curation AI and recommendations,

and then the generative side.

'Cause, ultimately, as all this information goes in,

it's looking for those patterns,

and all it will bring out in the beginning is a median

or an average, right?

The likelihood that you will buy this thing,

the likelihood that you will like watching this video

is based on the median or the statistical average

of this data.

And in that context, as I said before, all AI is created

in a historical context.

So, while it can predict some elements of the future,

it is always looking to the past.

It's a little bit small.

So, I actually also asked ChatGPT, what is AI?

And if I pull some things out, it's basically the role,

sorry, to discuss the role of algorithms in storytelling.

Basically, the role is to organize,

analyze, and interpret data.

We find patterns.

Algorithms are used to personalize stories

for individual users.

We recommend content based on our history,

and we analyze past viewing behavior to suggest new content

based on what we know about you.

And from that, we can help and support creativity,

but we also risk biases.

So, these machine learning models are constantly learning,

like, all day, every day.

The more inputs that we have, the more tests we do

with ChatGPT, it is learning, it is improving,

and it is creating new patterns

based on, you know, how people are responding

to the technology and the images that they like,

create, and share.

So, very simply, two elements; curation AI.

This is the very simple part of anything

that you would use on your phone today.

Whether it's your Spotify or your Instagram,

your discover page will tell a lot about you as a person.

You can tell for a while I was very interested

in Pedro Pascal.

But in terms of how they work,

it basically takes your listening behavior,

your reading behavior, what you look at on the internet,

and then serves you similar content.

That is why we often see,

in these kinds of social media platforms,

a lot of repetition.

Because basically, if we're going for engagement and views,

we look, in a publisher context, for things that perform

and actually replicate them.
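At its core, that curation loop can be sketched in a few lines: score each item in the catalog by how much it overlaps with what you already engaged with, then serve the top matches. The tags and history below are invented for illustration:

```python
# Bare-bones curation AI: recommend whatever overlaps with past behavior.
catalog = {
    "video_a": {"fashion", "paris", "editorial"},   # invented items and tags
    "video_b": {"fashion", "milan", "runway"},
    "video_c": {"cooking", "pasta"},
}
history = [{"fashion", "runway"}, {"fashion", "paris"}]  # what you watched

def score(item_tags):
    # count tag overlaps with everything the user has already watched
    return sum(len(item_tags & watched) for watched in history)

ranked = sorted(catalog, key=lambda name: score(catalog[name]), reverse=True)
print(ranked)  # most "like you" first: repetition is built in
```

The repetition the talk describes falls straight out of this design: the highest-scoring items are, by construction, the most similar to what came before.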

So, let's flip now to the creation side

and what's happened in the last 12 months.

So, as we build out these models,

or not we, but the technologists around the world today,

the next phase that is occurring

is very much about taking these vision technologies

and adding meaning and context, right?

So, where we may have started

with a pattern of I understand the keyword,

it is very logical.

The world that we are in today is a little bit more blurry

in terms of what is real, what is not, and what is meaning.

But the challenge with this is actually around the input,

because words are very literal,

and I think sometimes we forget that,

because prompt engineering and the languages that we use

to create these images are actually also very keyword-based,

in the way that they are written to generate an image.

And yet, images are the reverse, right?

There is so much more meaning around the context

of love that you can't quite explain

to a large language model, or maybe we can.

Part of the challenge for computer science

at the moment, I think, is to solve this problem:

to take words that are literal in translation

and create multiple meanings for very simple concepts.

Because, ultimately, life exists within the context

of this language, whether it s verbal, visual,

cultural, or over time.

And it really impacts, you know, the way we communicate,

the relationships we have, democracy,

publishing, media, conversation.

But the challenge, I think, at the moment

is that a lot of gen AI is really based on a simple prompt,

and they can get complex; very, very related

to how computer and engineering languages work.

They're very specific,

they're asking for very specific concepts,

but they don't actually do nuance.

And yet, we're creating this paradigm

of a new set of language that is very literal

in its translation.

And often, the images that come back are also very literal.

So, this was something I did actually six months ago,

and I asked ChatGPT, no, sorry, Midjourney, this week again.

Earlier on in a talk this year, I did a very simple prompt,

not a complicated one.

You can get right into the APIs and the backend technologies

and do very complicated prompt engineering

to really specify what you want to create.

But in its most simple terms, it works like metadata, right?

You ask it a question,

or you give it very specific instructions.

“I want Serena holding a tennis racket on a tennis court.”
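For context, here is roughly what such an instruction looks like against a programmatic image API. This is a hedged sketch using the OpenAI Python SDK; the model name and parameters are assumptions, and it is not the exact tool shown on stage:

```python
# A hedged text-to-image sketch; assumes the OpenAI Python SDK and
# an OPENAI_API_KEY in the environment. Not the tool used in the talk.
from openai import OpenAI

client = OpenAI()

result = client.images.generate(
    model="dall-e-3",  # assumed model name
    prompt="Serena Williams holding a tennis racket on a tennis court",
    size="1024x1024",
    n=1,
)
print(result.data[0].url)  # URL of the generated image
```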

The quality of the image has transformed

in the last six months; it is extraordinary,

but that is based off all the other images

it's now creating and feeding off itself to improve

that visual representation.

But my question has always been, well, not a question:

I would actually much prefer to see the real thing.

So, while I can create an image to represent, do it faster,

publish it, save time, save money,

it doesn't create the same feeling

as actually taking a photograph.

So, the biggest part and transformation

that's happening right now is that you can train the AI

with your own images.

So, ChatGPT last week literally launched a variety

of technologies where you can take your own images

and put them into the system

to generate very similar images in your own style.

So, I tried that, based on a project I would like to start

but am kind of struggling with.

So, this year my mother passed away,

and when I was going through her archive,

there are almost no photographs of my mother with our family;

there are only four in existence

of my mother, father, and my sister and me.

And so, what does it mean as a child,

or as someone who is in grief, to try and imagine

or have an archive of images that don't exist?

So, I literally started to try and create my own ChatGPT.

It's gonna take some time.

This is the first render that it did to try

and create a family photo in our likeness.

As you can see, it's way more illustrative

and performative;

and look, I would really love to look like Wes Anderson,

but I don't.

But using those prompts and using my own images,

it's still struggling with that context.

So, I tried to create and fake some images of myself

as a young child with my mother.

It still can't quite do it.

But like with the Serena Williams example,

it won't take long before it will start

to render very interesting and close-to-real-life images

of a family.

However, that's still not the memory that I have, right?

So, while I can illustrate, transform,

and morph a photograph (I took this one; it's not my son,

but it was taken at my son's 21st),

to translate it into a different person,

a different location; while it is a fantastic illustration,

it doesn't create the meaning and the history

and the story that I would create with the photograph

that I've published or created, or the memory that I had.

I wanted to show, and it's actually Marco, a little video.

This is the level of where the technology is

and how the speed has changed:

now you can create images in real time,

video-to-video, based on recording yourself,

in a matter of minutes, which from a creative's perspective

is extraordinary and super exciting,

because the kind of stories we can tell

from this is amazing.

Also, the level of, like, I don't know, just technology

and creativity is awesome.

You could do that in any style that you want:

manga, you know, CPR, et cetera.

So, I think, for future-based storytelling,

it's really interesting as a tool to enable

new creative paths to telling stories.

So, to kind of wrap up, I just wanted

to flip it a bit. In terms of the rate of change:

even for someone who works in technology,

it is a lot right now.

But ultimately, it comes down to the input.

What stories you wanna tell,

which technologies you wanna use, whether it's Photoshop,

film, 10 by 8, or even doing it in digital.

But ultimately, I actually think

that there will be a shift back to memory making

and more traditional forms of storytelling.

While I like an AI, like, telling me where to go to travel,

it's not gonna drive a decision,

nor is a photograph that is basically a facsimile

of my family and my life gonna represent the meaning

that I have growing up in my family.

So, I did ask ChatGPT also, can it do meaning or memory?

And it can't.

And while computer scientists are looking

to create historical and collective meaning,

I don't think we'll ever get to a position

where it can mimic.

I mean, it shouldn't be able to mimic.

It's a technology; it's a tool.

But ultimately, while it might mirror human-like qualities,

it will never replace.

And I think, as artists and creatives, we need to remember

that, even though the discussion around technology

is really loud right now.

'Cause the question isn't whether we're gonna use

these technologies; they're here, right?

We're all using them, we're all experimenting,

we're all finding new paths and ways to tell our stories.

But it's more that, as they become more generic,

and like I said, it's a median, most of the images

that we create look very similar.

So, how do you know it's artist A, B, or C?

What story are they telling?

If it's a fashion image

and it looks just like every other fashion image,

we're gonna actually become numb

to these kinds of technologies.

So, I'm just gonna show two pieces of work to flip it back

to the photographic side.

Dita Pepe is a Czech photographer.

This is one of my favorite series

to really talk about the context of image and meaning-making

that I'm just not sure an AI can do.

Her project was one where she imagined her life

as a mother and a wife based on the men that she met.

Often, the people in the photographs

are the children of the male.

And she basically did a whole series that represented

what her life might look like, from the styling,

the location, where she lived, her demographics,

whether she's rich or whether she's poor.

And these images, I think, very simply illustrate

that AI technologies can help us make all kinds of images,

but it s the meaning and the context

that is a little bit harder.

And just to end, because I thought I'd show

a little bit of my work, the project that I did

over the last two years

during COVID was very similar.

So, it was called A Chance of Love.

It's a photo book, a novel, and a series of images

that were based on meeting men on Tinder

and going on a first date.

And the interesting thing about that

is that you are interfacing intimacy and connection

via and through an algorithm.

And that algorithm gives you a lot of same same.

So, whether you are swiping left or right,

you will have [indistinct] in the mirror guy,

you will have selfie guy, you will have gym guy,

you will have all these very generic views of images,

because that's what ranks, right;

back to the large language models.

And everything looks the same,

so how are you meant to find a partner, love,

intimacy, connection, when all you are seeing

is the same types of people saying the same types of thing

over and over again?

Because ultimately, when you meet someone,

they're not like what they look like on the screen, right?

And actually taking their portrait,

taking a photograph and meeting someone

through a chat with the opportunity for connection

is actually quite hard.

And it's currently filtered by an algorithm that decides,

before you do, who you're gonna be popular with

and who is most likely to swipe left or right,

and is basically kind of interfacing

with our choice of connection.

Which is something I think we need to bear in mind,

not just for dating, but for TikTok,

for any of our communication channels: the context

that we have is different in a relational setting

versus with the AI and the technology.

In the book, all the men agreed;

we published all the conversations

and all the setups to, like, meet, connect,

take their portrait.

It has poetry.

And it was only nine images over two years.

But ultimately, while it started as a discussion

around connection, it also became a reflection

on algorithmic narratives

and these interfaces that we have with technology

that determine who we meet, what we see,

what we read, and ultimately how we just infer meaning

in our lives.

So, just to wrap up.

I think one of the challenges that we face at the moment

is this meaning context, right?

As humans, we actually don t necessarily want

all this speed, right?

We don t necessarily want technology driving that change.

Sometimes we want a bit of boredom.

We also want a bit of serendipity.

And technology finds it really hard to do serendipity.

And then, just to end... hope I'm on time.

With all these technologies,

the key is just to understand how they work.

And in the context of AI, input creates output.

So, the quality of your prompt, the quality of the images

that you upload, the quality of the construct

of the story you tell is ultimately what will determine

the image, not the AI itself.

And that is it.

Thank you.

[audience clapping]

Starring: Mel McVeigh