How to chat with images using AI and WebAR

Interact with famous personalities in real-time using advanced image recognition, OpenAI's Vision API, and lip-syncing tech. No pre-training needed—just point, capture, and converse. Built with Mattercraft for a seamless AR experience.

Combining AI with AR is proving to be one of the most impactful ways developers can take their AR content to the next level, balancing immersion with real-world knowledge and awareness.

Using Mattercraft's latest features, I wanted to run a quick experiment to see how the two could work together to enable everyday portraits and art to "speak" to users once scanned in WebAR.

In this demo, you can point your phone at any image of a famous person and instantly engage in a conversation with them. The system recognizes the image, uses text-to-speech, and even lip-syncs responses in multiple languages—all in near real-time. The best part? There's no need to train any image target or fine-tune a model beforehand.

Everything happens dynamically.

How it works

I've broken down the key elements below so you can re-create the experience in Mattercraft. If you have any questions about the individual steps, feel free to reach out to me on LinkedIn.

Image Recognition

The user takes a picture of a person, which is then sent to a Cloud Run function I developed on Google Cloud Platform (GCP). This function interfaces with an API I created around the Zappar image CLI tool. The API processes the image and returns an image target (.zpt), which is dynamically added to the scene.
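The post doesn't include the backend code, so here's a rough sketch of what such a Cloud Run function could look like, assuming a Node/Express handler that shells out to the ZapWorks CLI; the endpoint path, temp-file locations, and exact CLI invocation are all assumptions:

```javascript
// Hypothetical Cloud Run endpoint: receive a Base64 snapshot, train an
// image target with the ZapWorks CLI, and return the resulting .zpt.
import express from 'express';
import { execFile } from 'node:child_process';
import { writeFile, readFile } from 'node:fs/promises';

const app = express();
app.use(express.json({ limit: '10mb' }));

app.post('/api/recognize', async (req, res) => {
  try {
    // Strip the data-URL prefix and decode the Base64 JPEG to disk
    const base64 = req.body.image.split(',')[1];
    await writeFile('/tmp/capture.jpg', Buffer.from(base64, 'base64'));

    // Train a target from the snapshot; assumed to write /tmp/capture.zpt
    await new Promise((resolve, reject) =>
      execFile('zapworks', ['train', '/tmp/capture.jpg'], (err) =>
        err ? reject(err) : resolve()
      )
    );

    res.type('application/octet-stream').send(await readFile('/tmp/capture.zpt'));
  } catch (err) {
    console.error(err);
    res.status(500).json({ error: 'image target training failed' });
  }
});

app.listen(process.env.PORT || 8080);
```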

On the client, capturing the picture comes down to converting the camera canvas to a Base64 JPEG at 0.9 quality via captureToDataURL and posting it to the server, logging any failures with console.error().
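A minimal sketch of that step, with the standard canvas.toDataURL standing in for the captureToDataURL helper, and the hypothetical /api/recognize endpoint from above:

```javascript
// Grab the current camera frame and send it to the backend for training.
async function captureAndSend(canvas) {
  try {
    // Encode the frame as a Base64 JPEG at 0.9 quality
    const image = canvas.toDataURL('image/jpeg', 0.9);

    const response = await fetch('/api/recognize', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ image }),
    });
    if (!response.ok) throw new Error(`Server responded ${response.status}`);
    return await response.blob(); // the freshly trained .zpt target
  } catch (err) {
    console.error('Capture or upload failed:', err);
  }
}
```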

The returned target is loaded by creating a new ImageTracker with the contextManager, pointing its source at a blob URL and setting showPreviewDesignTime to false and maskObjectsBeneathSurface to true.
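Loading the target could then look like this; the property names mirror the original snippet, but the exact shape of Mattercraft's ImageTracker constructor is an assumption:

```javascript
// Wrap the returned .zpt in a blob URL and attach a new tracker to it.
const zpt = await captureAndSend(canvas);
const zptUrl = URL.createObjectURL(zpt);

const tracker = new ImageTracker(contextManager, {
  source: zptUrl,               // dynamically trained image target
  showPreviewDesignTime: false, // no design-time preview for a runtime target
  maskObjectsBeneathSurface: true,
});
```

Using the OpenAI Vision API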

Simultaneously, the picture is sent to OpenAI's Vision API (model GPT-4o). The base prompt is enhanced with a system prompt that instructs the model to role-play as the person in the image.

For example:

"You are a helpful assistant. This is a roleplaying game and people can ask questions about famous people and learn some history. You can also search the internet for information. People who use this application want to know more about you. Start with greeting  as the person in the provided image and say your first and last name. In general keep your answers short. Can you return in your first answer if this person is male or female. For example: Hi I'm Will Smith, how are you doing? [male]"

I was pleasantly surprised by how accurately the model could identify people.

 

Text-to-Speech & Lip Syncing

The response generated by the model is then sent to a Text-to-Speech API, along with the identified gender. The resulting audio, together with the image, is then processed by a Wav2Lip model I found on Replicate.com (https://replicate.com/devxpy/cog-wav2lip), which lip-syncs the audio to the image. The resulting video is dynamically added to the scene in Mattercraft.
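The post doesn't name the TTS service or pin a Replicate model version, so treat the service choice, the version placeholder, and the input field names below as assumptions. A rough sketch of both steps, reusing the openai client from earlier:

```javascript
import Replicate from 'replicate';

// Text-to-speech: OpenAI's speech endpoint is used purely as an example.
// The voice is picked from the [male]/[female] tag in the model's answer.
async function speak(text, gender) {
  const speech = await openai.audio.speech.create({
    model: 'tts-1',
    voice: gender === 'male' ? 'onyx' : 'nova', // assumed voice mapping
    input: text,
  });
  return Buffer.from(await speech.arrayBuffer()); // MP3 bytes to host
}

// Lip-sync via the Replicate model linked above. '<version-hash>' is a
// placeholder; 'face' and 'audio' follow the Wav2Lip model's input names.
const replicate = new Replicate({ auth: process.env.REPLICATE_API_TOKEN });

async function lipSync(imageUrl, audioUrl) {
  const output = await replicate.run(
    'devxpy/cog-wav2lip:<version-hash>',
    { input: { face: imageUrl, audio: audioUrl } }
  );
  return output; // URL of the lip-synced video to drop into the scene
}
```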


Conversation Flow

Users can continue asking questions, with each query sent to OpenAI along with the conversation history. The initial image is used only once for lip-syncing, ensuring seamless interaction.
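A sketch of that loop, reusing the openai client from the Vision step; the assumption that only the first turn carries the image is one reading of the flow above:

```javascript
// Keep the full message history and resend it on every turn.
const history = [{ role: 'system', content: systemPrompt }];

async function ask(question, imageDataUrl = null) {
  history.push({
    role: 'user',
    content: imageDataUrl
      ? [
          { type: 'text', text: question },
          { type: 'image_url', image_url: { url: imageDataUrl } }, // first turn only
        ]
      : question,
  });

  const completion = await openai.chat.completions.create({
    model: 'gpt-4o',
    messages: history,
  });
  const answer = completion.choices[0].message.content;
  history.push({ role: 'assistant', content: answer }); // remember for next turn
  return answer;
}
```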

 

UI Development

The user interface is built in Mattercraft, where I designed and styled multiple elements using CSS.

The element hierarchy looks like this:

Parent
├── Play
├── Subs
├── Button
├── Loading
│   ├── Spinner
│   └── Text
└── Message
    └── MessageButton

 

Conclusion

It was exciting to see how quickly I could prototype in Mattercraft. With the backend logic already in place, I was able to iterate rapidly on the frontend (AR) logic and assess its performance. This allowed for easy adjustments to prompts and results as needed.


Example response:

https://webxr.be/talk-with-image/api/video/66cb7b7b818df.mp4 (result from the lip-sync model)
https://webxr.be/talk-with-image/api/audio/66cb7b7b818df.mp3 (returned from the text-to-speech API)
https://webxr.be/talk-with-image/api/img/66cb7b7b818df.jpg (snapshot from the camera)




Where to find Stijn 

If you've got more questions for Stijn about the project or want to work with him, you can find him on LinkedIn or check out more of his creations over on X at @stspanho.