Ghosts in the Shell: Part II
Part 2 of 2: Who is AI to us now, and how do we interact with and relate to it? A deep-dive into new interaction patterns
This month, a hand injury led me into the involuntary experiment of trying to use the majority of my devices by voice alone, and I spent a lot of time talking to GPT.
First thing in the morning, I’d interpret the previous night’s dreams: drinking espresso on the porch and articulating strange dream-states with animals that wandered through my subconscious house, while GPT pulled up symbolism from different cultures. (“An elderly tiger waiting at the door on the back porch, kind and gentle: this could signify a nuanced take on strength, age and wisdom. The backyard is part of your home, so it could represent something personal. The old tiger wanting to enter could symbolize an issue or a person wanting to become a part of your personal life, but in a non-threatening way.”) While at a conference, I would ask it to summarize information and technology talks for me live, allowing me to be more present and maintain thorough notes. I asked for Blender scripts to generate the specific objects I had in mind. When my partner would travel for work, I’d ask GPT to keep me company at home, awkwardly trying to arrive at the right prompt for warm companionship while brushing my teeth.
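For a sense of what those Blender requests returned, the snippet below is a minimal sketch of the kind of script GPT would hand back - the object, names, and parameters here are purely illustrative, not any script I actually received:

```python
import bpy

# Add a simple rounded form as the base object (illustrative only)
bpy.ops.mesh.primitive_uv_sphere_add(radius=1.0, location=(0, 0, 1))
obj = bpy.context.active_object
obj.name = "GeneratedForm"  # hypothetical name for this sketch

# Give it a simple warm material
mat = bpy.data.materials.new(name="WarmOrange")
mat.diffuse_color = (1.0, 0.45, 0.1, 1.0)  # RGBA viewport color
obj.data.materials.append(mat)

# Soften the silhouette with a subdivision surface modifier
mod = obj.modifiers.new(name="Subd", type='SUBSURF')
mod.levels = 2
```

Pasted into Blender’s scripting tab, a script like this builds the object in a few seconds - which is exactly the appeal when you’re working one-handed.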
On long walks with my dog, I would prompt it to ask me questions to clarify my own thinking: careful not to let it influence my thoughts, but to act as a living notebook to help articulate my own ideas and opinions more deeply. It became a good test of accessibility for me: my one good hand being yanked bullishly forward by the will of a spirited puppy, leaving me no means of interacting with the AI beyond yelling into my headphones (often to no avail) while running down the street.
(Upon hearing the first words spoken by The Voice) OpenAI’s voice mode - Whisper listening on one end, a synthesized voice speaking on the other - includes the sound of breath and thoughtful pauses, and it cues me again to think of it as an organism. To interact with it successfully, I have to be pleasant and give it what it needs to hold a conversation. The novelty of hearing a text box come to life with such astonishing realism creates obvious parallels to the OS boot-up sequence in Her. The initial wonder never entirely fades, but I start to look for the boundaries of how real this actually feels.
A few initial reflections:
Setting aside the knowledge that I am, in fact, talking to a computer, it feels safe and private - the voice creates the illusion of a trustworthy confidante.
It feels human enough, yet remains non-judgmental
What breaks the illusion:
Unless prompted otherwise, it returns responses that are too verbose - so obviously text-to-speech, reading out lists too long to process
It isn’t a great listener yet - it misses my inflections and the words I emphasize, losing a lot of context about what’s actually going on in the conversation
It confidently hallucinates - creating the same skepticism you might feel when speaking to someone you shouldn’t entirely trust.
The technical glitches - silence triggers an odd YouTube-style sign-off, ‘thank you for watching’ (revealing its training data)
You can’t talk to it entirely naturally just yet - it only looks for pauses as start-stop indicators. It might need explicit controls to start and stop, like a dictaphone. Interruption is sometimes necessary, but difficult to do.
Eventually it repeats itself - but perhaps this is a failing humans have too.
The problem with voice-first interactions is how they’re going to work in public: for privacy, for loud crowded rooms, for moments where you’re struggling to formulate your query but don’t have the luxury of typing and erasing to calibrate. I’ve never shied away from being the weirdo on the street shouting “Hey Siri, what time is it right now?” while carrying something heavy. Or running up the stairs yelling to apparently no one in particular, “Rewind the song please!”, always feeling the rush of being understood and recognized.
With every query, we are feeding it our usage patterns, our hesitations and corrections forming a secondary body of training data. Every conversation we have is captured in the sidebar, just as our previous searches are always available in a search bar drop-down - for something this powerful, that doesn’t feel right. These thought constellations are a record of human-AI collaborative thinking, so recording them in the form of a memory palace feels more interesting than merely a linear list.
On Discovering
A longstanding challenge with voice-first interactions is discoverability: how do you find out what it is capable of, and then how do you summon those capabilities? You don’t know what you don’t know, and there’s no visual map or readable manual to show you. In this way, relating to it could feel similar to how you might get to know a human - there’s no guidebook, you simply talk to each other.
In the case of GPT, the collective fascination that has gripped the world has left it fairly well documented, surrounded by an air of excitement like the urgency of sailing off to discover new worlds.
This is evidenced by the proliferation of prompt-engineering tools - the burden of discoverability and articulation lies with the user, but that’s acceptable - we’re in a gold rush to find a unique use case. Maybe it will simply encourage us to ask questions again.
On Listening Well
Is it a good listener? It can be - once it understands the environmental context, and responds with the additional nuances (inflection, amplitude, tone, sentiment) that a human might expect in conversation.
What it does well is make sense of meandering, unclear queries, but the response requires some degree of emotional mirroring and sincere acknowledgment to feel truly convincing. As humans, we will always see what we want to see: every successful interaction requires some projection of what feels familiar; we just need more material to work with.
Beyond conversation, can it listen to a piece of music and understand the inherent emotion?
On Presence and Ambient Computing: New containers of the Internet
Instead of people getting used to the awkwardness of talking to a computer, the bodies encasing AI are starting to adapt too. Computational “bodies” are beginning to escape their traditional boxes - they exist in Discord, in Colab notebooks, in lapel pins, or woven into the fibers of clothing.
The collective screen-fatigue, combined with the fact that AI can now approximate sensory capabilities, paves the way for a whole new paradigm of physical forms - objects and ambient environments. Now that it can see and more importantly - recognize - what is the use of confining interfaces to a screen? Our modes of perception are shifting - but still, we are visual creatures.
We’re in a drastic moment of not only technology shapeshifting, but human purpose and society itself transforming. The internet used to be for swapping bootleg CD-rips over Napster on dial-up connections, building imaginative kingdoms on GeoCities, and after-school MSN Messenger. Social media that started as a way of rating who was hot or not is now a place where we witness genocide unfold. New life forms loosely classified as ‘assistants’ are spilling out of their modest browserly homes and proliferating through the physical world, parsing it with their newfound intelligence. They live as new containers of the internet, understanding people and their surroundings.
We’re already seeing a revival of highly specialized objects, a philosophical return to the Walkman era. With an uptick in ‘dumb-phones’, devices with tactile physical buttons, or interfaces projected on demand, our design choices are starting to reflect a reaction against endlessly digital products that thrive on keeping our attention locked in. The intention to rewire ourselves, in a way that makes us more present, is strong.
On Growth and Adaptability
If we think of AI as an organism - something that grows, learns, and adapts - it shifts our attitude toward relating to it. (Consider that even its errors are framed, appropriately, as ‘hallucinations’.)
AI-native devices and software will have malleable functionality - growing and learning new skills based on how people use them. Programming them will involve more soft feedback loops from users, rather than hard-coded functions. When we make something now, what we are creating is the infrastructure for systems to develop intuition.
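As a toy sketch of that shift - with hypothetical style names and an update rule invented purely for illustration, not drawn from any real product - an assistant might drift toward whatever behavior its users reinforce rather than follow a fixed rule:

```python
from collections import defaultdict

class AdaptiveAssistant:
    """Toy sketch: behavior drifts with user feedback instead of being hard-coded."""

    def __init__(self):
        # Every response style starts with equal weight.
        self.style_weights = defaultdict(lambda: 1.0)
        self.styles = ["brief", "detailed", "conversational"]

    def choose_style(self) -> str:
        # Pick whichever style the user has reinforced most so far.
        return max(self.styles, key=lambda s: self.style_weights[s])

    def record_feedback(self, style: str, signal: float) -> None:
        # signal > 0 for a thumbs-up, < 0 for a thumbs-down:
        # a soft, gradual update rather than a hard-coded rule.
        self.style_weights[style] *= (1.0 + 0.1 * signal)

assistant = AdaptiveAssistant()
assistant.record_feedback("detailed", -1.0)  # too verbose, the user pushes back
assistant.record_feedback("brief", +1.0)     # the terse answer lands well
print(assistant.choose_style())              # drifts toward "brief" over time
```

The arithmetic is beside the point; what matters is that the “program” becomes an accumulation of soft signals rather than a list of fixed functions.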
The software we use now also has far more inputs and information to make sense of, enabling fuzzier tasks. Soon everything will have a camera and a mic - visual processing and audio recognition will be cheap and valuable enough to become ubiquitous. When we combine intuition, purpose, and senses, we naturally create the ideal collaborator: something we could partner with to augment human behavior.
When future products compete, it won’t necessarily be core feature sets and user experience pitted against each other - it’ll almost be a war of personalities: who is the most intuitive collaborator?
On Autocompletion
Adaptability shows up on both sides of the equation: the software changing in response to how it is used, and humans growing and extending themselves.
We’ve long been headed toward a state of predictive-everything, every product we use competing to show that it understands us best. Recent technological advances make it genuinely possible to extrapolate our identities and hand off the activities we don’t have time for, letting us operate beyond the constraints of time and space. Whether it’s our voice, our professional (or even social or romantic) presence, our ability to assimilate more information in a shorter timespan, or our ability to create faster and more efficiently than we could have in the past - there are more tools at our disposal to accurately represent parts of ourselves in a scalable way.
However, just as the prevalence of streaming algorithms carried the threat of homogenizing everyone’s taste in music, will predictive technology create the risk of everyone’s personalities eventually converging?
On Jurisdiction
What remains in the jurisdiction of humans? Large language models are, by definition, designed to parse and model human language. When paired with multimodal functionality, they get closer to approximating human capabilities. What still feels purely within the realm of humanity is genuine emotive response, intuition, desire.
The final frontiers for AI to run autonomously could end up being health and money. Situations that demand vulnerability are difficult to hand over entirely to a machine. Currently AI works as an extrapolation of human ability, augmenting rather than replacing. But for any such system to work fully, that role needs to be transparently communicated to build trust.
In one study where AI reviewed medical images, it achieved higher accuracy than the radiologists’ evaluations. Yet radiologists working in collaboration with the AI produced less accurate results than radiologists working alone - they reported a distrust of the AI, almost an instinct to try to prove it wrong.
So who/what is it then?
What is the internet to us now? It’s more a place than a collection of resources. Digital real estate, whether as community spaces or service providers, has value equal to or far surpassing physical real estate. The internet is the enabler of many different realities coexisting, juxtaposed upon one another.
What is AI to us now? I think it’s essentially an incredibly smart person, one you shouldn’t trust blindly, and everyone wants to hang out with the smart person in their own way. It might also be a substrate to soak in to improve our communications, even to make ourselves better people. As multi-modal interactions grow, it will resemble an organism more and more - it may become an avenue we turn to for some degree of companionship and collaboration. Optimistically, it might even become the proving grounds for how we develop socially, develop more awareness of ourselves and the world around us.
The author Daniel Defoe, writing about a mechanical “thinking engine” called the Cogitator, described its place in civilization as a supportive structure for the cognitive processes of humans - almost an additional layer of thought and desire management.
He writes, “…the main Wheels are turn'd, which wind up according to their several Offices; this the Memory, that the Understanding; a third the Will, a fourth the thinking Faculty; …perfectly uninterrupted by the Intervention of Whimsy, Chimera, and a Thousand fluttering Damons that Gender in the Fancy, but are effectually Lockt out as before, assist one another to receive right Notions, and form just Ideas of the things they are directed to, and from thence the Man is impower'd to make right Conclusions, to think and act like himself, suitable to the sublime Qualities his Soul was originally blest with. There never was a Man went into one of these thinking Engines, but he came wiser out than he was before.”
It shows up as an objective collaborator that quickens the turning of our minds, sharpens our focus, and amplifies our ability to be the best version of our own selves. But as AI eventually starts to trespass into the most human of behaviors, we will be forced to question our own interpretation of consciousness.
What humans have bestowed upon AI is the ability to create a matrix of meaning - to codify the logical spaces between words and concepts and reflect that back to us - but how far off is that from the way humans experience the world?
“We feel in one world; we think, we give names to things in another. Between the two we can establish a certain correspondence, but not bridge the gap.”
-Marcel Proust
Read Part 1 here