Our dreams have come true - Gemini 2.0 is released with its real-time audio/video streaming capabilities!

By Mert Ozer, 12 December, 2024

Forum
iOS and iPadOS

Hi folks,

I don't know what to say! I'll just drop the link for you all to test it out, and you'll see how amazing the results are. I honestly never thought it would happen this soon, but it's here. It's so fast that I can already imagine a million scenarios where I can make use of this tool. I hope it stays free, remains accessible, and keeps improving day by day. I thought ChatGPT was going to have this feature first, but OpenAI has been left behind.

It's a web UI. Just allow mic/camera access on your iPhone (assuming you want the video stream on your phone), and you're good to go. For now, the best results are in English. I'm Turkish, and I've tried speaking in Turkish, but it's not great at understanding Turkish yet. I'm assuming we won't have this issue when they release it to the general public. For me, it's not a big deal since I speak to AI in English all the time anyway. LOL

https://aistudio.google.com/live


Comments

By Brad on Thursday, December 12, 2024 - 08:43

It's interesting that I can just pick something up and ask it what I'm holding, and it gets it right sometimes. But when I asked it to tell me to rotate the can until it could see the text in full, it said, "OK, I'll tell you to rotate the can until I can see the text," and then it didn't. Is this a bug, or am I misunderstanding how this works?

By Brian on Thursday, December 12, 2024 - 09:01

It's probably just a matter of phrasing. Keep at it, I imagine sooner or later it will understand what you are trying to ask of it, and provide you with the information you want.

By Brian on Thursday, December 12, 2024 - 09:30

First, I have to agree with Ollie, and as much as I hate the term "game changer", this really is one. So taking all these tips into consideration, I have made a shortcut on my home screen, actually moved it to my "AI Tools" folder, which is where I keep tools such as Claude, Perplexity, ChatGPT, etc., etc. Regarding allowing permissions for the camera and microphone, it used to be that we could do this for specific websites. Is this no longer the case on iOS? I noticed under the camera permission there is a grayed-out "Edit" button. Any idea what that's about?

Thanks in advance.

By Kushal Solanki on Thursday, December 12, 2024 - 10:38

When I double tap on the link, it doesn't work. I tried copying the link and then pasting it, but it gives me an error.

By Lee on Thursday, December 12, 2024 - 10:47

This is great. Hopefully soon we won't have to allow access every time. My only slight confusion: we have a mic button and a camera button, but I don't think you have to allow access to both each time; tapping the camera seems to open both. I think this is what Envision is using in their Ally app on the glasses.

By InfoRover on Thursday, December 12, 2024 - 11:03

It isn't often that I feel I should get crazy excited about something, but having just tried this for the first time: wow, just wow.
The spotlight is on you now, OpenAI. I suspect that's what we'll get on the final day of the 12 Days of OpenAI.

By Mert Ozer on Thursday, December 12, 2024 - 11:04

How to Stop iPhone from Asking for Camera/Mic Permissions Every Time

  • Go to the website.
  • Tap the page menu button.
  • Tap the More button at the bottom-right corner of the screen.
  • Look for the heading "Website Settings for...".
  • Change the microphone and camera settings from "Ask" to "Allow."

Thank me later! 🎉

By Lee on Thursday, December 12, 2024 - 11:13

Mert Ozer as requested thanks lol. This worked.

By Lee on Thursday, December 12, 2024 - 11:25

Hi Mert Ozer, are you saying that as soon as you now open the webpage you can start talking? Because I just closed and reopened the site, and I still have to tap the camera button. I double-checked, and my settings are showing as "Allow." I tried to find another button that might help, but no luck.

By Mert Ozer on Thursday, December 12, 2024 - 11:26

It’ll get much better, but even now, it’s unbelievable. I just don’t think it has the ability to stay on track and keep providing details nonstop. For example, when I was navigating my high school, I asked the AI to describe the doors I was passing through. It started by describing two or three doors it could see, but then I had to say “go ahead” or “keep going” every time. So, it’s not continuous, and using this feature while traveling could be a bit dangerous for me. But look, we couldn’t even get detailed image descriptions until two years ago!

By Mert Ozer on Thursday, December 12, 2024 - 13:50

Not really. By allowing the permissions from website settings, we get rid of it asking for mic/camera permissions every time, but I still have to start the camera and the mic. I feel like that's how it should be, though. It's a web UI to input text, audio, and images.

By Gokul on Thursday, December 12, 2024 - 15:30

Yes, it's bloody brilliant! Now all we want is this on a wearable; Google did demonstrate Astra yesterday, so we'll have it in the near future, I guess.
That aside, top tip: you can tell it that you're visually impaired at the start of each session, and it'll remember that fact for the duration of that session. This'll help it assist you appropriately as far as identifying text, etc. Hopefully, once it becomes a full public release, it'll have permanent memory.

By PaulMartz on Thursday, December 12, 2024 - 15:37

I am busier than a one-handed pancake chef through Saturday, but couldn't resist taking ten minutes to play around with this. It's freaking brilliant. Maybe I'll have time to explore further next week.

By PaulMartz on Thursday, December 12, 2024 - 15:38

Any idea when this might appear on the app instead of through Safari?

By blindpk on Thursday, December 12, 2024 - 15:45

Don't have time to test this out right now but it sounds awesome.
I hope, as others have said, that OpenAI takes note of this (and hopefully has their own version ready soon), but also that the "blind-specific" apps/services, Be My Eyes, Seeing AI, and so on, watch this closely, because having that integrated in something specifically made for blind people would be nothing short of fantastic.

By Falco on Thursday, December 12, 2024 - 16:01

Hello,

I was just playing with this tool for five minutes. But when I say, "Tell me when you see a person in front of the camera," and then walk into the frame of the camera, I get no reaction. Maybe other people have better results with that kind of question.

I hope OpenAI will present their own version of AVM with vision today, tomorrow or next week.

By Lee on Thursday, December 12, 2024 - 16:16

Tried this outside with a bus stop. It said it would let me know when it saw it. Total silence. However, it may have been the connection; it seemed to drop off a lot outside on a 4G signal in the UK. So, it may get better, as inside I asked it to tell me when it saw a cup, and it worked.

By Cory K on Thursday, December 12, 2024 - 16:43

We just need a way to make it speak faster.

By Brian on Thursday, December 12, 2024 - 16:49

So long as they do not get rid of the voice they are using, I will be happy. It almost sounds like you're talking to an actual person.
Almost ...

By PaulMartz on Thursday, December 12, 2024 - 17:08

Okay, live video description is pretty amazing.

Aside from that, the really amazing thing is that you talk to it without having to deal with onscreen dictation buttons, like a real human; it talks back without you having to tap the speech bubble; and it has access to new images as it needs them, without you having to find and double-tap any buttons. This user interface makes all the difference in the world, in my opinion. And it's not like that is new technology or anything. It's simply that existing image description app developers have never bothered to design it this way.

By Top Shelf on Thursday, December 12, 2024 - 17:53

Has anyone actually gotten real-time monitoring to work? Is it even supposed to do this? So far I need to ask every time if something changes or I want feedback. For me, it's almost the same as Meta AI's "Look and Describe" command without the wake words, which obviously makes it quicker and smoother, so overall a plus, but IMHO not a total game-changer from what we already have. Don't get me wrong, though; this is definitely headed in a good direction!

Also, I find folks' comments on speech feedback interesting. For me, I'm talking to a computer, so I don't want or expect feedback to be slow, emotional, or humanistic. Give me abbreviated/useful/effective information in order for me to be agile and productive.

By Brian on Thursday, December 12, 2024 - 21:09

Coming to a future near you: having something like this integrated into a smartphone. Be it Apple, Google, or whatever's clever.

By Dave Nason on Thursday, December 12, 2024 - 21:11

Member of the AppleVis Editorial Team

This looks like a great step forward, though I’m having limited success so far.
It’s not using the voice I selected, and now I can’t find the option to change it. The various menus don’t seem to work too well with VoiceOver. Anyone else having more luck?
The “tell me when you see…” idea isn’t working for me at all. I have to keep asking, which suggests it’s just taking pictures really, not video. Is this any different to Ally for those on that beta too? Still a nice slick interface though.
When wearing AirPods, sound sometimes reverts to the phone speaker when I start a session. Anyone else seen this?
Dave

By Karok on Thursday, December 12, 2024 - 21:14

I hope it just comes to the application so I can use the voice I wish.

By Brian on Thursday, December 12, 2024 - 21:16

I've been testing mine, mainly by standing in front of a window and constantly saying, "What do you see?" Please note that I live in a metro city, in the downtown area, in a high-rise apartment building, several floors up, so outside is quite lively with people and traffic and such. I have yet to play with any of the settings; I simply give it camera and microphone access and start chatting away. I will say that, out of the box as it were, it is rather detailed and pretty accurate, at least as far as my belongings inside my home.
It was even able to read the small print off of a soda can, which I have not been able to get any other type of AI service to do. Ever.
iPhone SE 2022 running iOS 18.2.

By Dave Nason on Friday, December 13, 2024 - 08:27

Member of the AppleVis Editorial Team

Hey Ollie. Apologies yeah, I get that. But it doesn’t feel like it is actually taking repeated pictures, because if I say tell me when you see a bottle and then start scanning the room with the camera, it only finds it if I keep asking the question. So is it actually only looking each time I speak?
In that way it seems kinda the same as Envision Ally.
I’ve only played with it a little bit so far though, and definitely see the amazing direction this stuff is going.
Dave

By Gokul on Friday, December 13, 2024 - 12:17

Compared to ChatGPT: if you tell it multiple times that you're blind, you need real-time info, etc., it does respond to a little extent, unlike ChatGPT, which stubbornly refuses. In both cases, it is not that the system itself cannot do real-time monitoring; it's rather that there's some restriction placed on it. In the case of Gemini, it appears to be some instruction, like, say, "respond to visual info on detecting a spoken prompt," rather than an explicit restriction, which seems to be the case with ChatGPT.

By Gokul on Monday, December 16, 2024 - 02:58

So I was trying to set up Windscribe on my Windows PC, which, as everyone knows, is not accessible with screen readers, or at least with JAWS. So I thought, why not just share my screen to Gemini Live and have it read the screen while I use my keyboard, which seemed like a nifty solution. But that was until it read the first screen and asked me, "Do you want me to click on the Quick Connect button?" (Note that I had already given it the context; stuff like I'm blind and this app is totally inaccessible with my screen reader, etc.) To say that I was pleasantly surprised would be an understatement. I said, "Sure, go ahead," and then it went on to click that button, talk about the next screen, select a free server, etc., etc., and basically complete the process. And then, just to make sure, I took a picture of the screen with Be My Eyes, and, duh, nothing had happened; everything was as it was in the beginning.
My conclusions: if you have read or heard about the Mariner browser extension, I bet they're working to make it part of Gemini Live/Project Astra etc., and the possibilities for accessibility as far as such a thing is concerned are just enormous! Google is already into the agentic future, and I wouldn't at all be surprised if that is one of the announcements made by OpenAI during the next five days of their ongoing 12-day string of announcements.

By Brian on Monday, December 16, 2024 - 03:03

All hail Google. All hail our AI Overlords! 🤖

By Cory K on Monday, December 16, 2024 - 12:41

So, there are actions in Shortcuts to open ChatGPT and start voice mode, but there isn't one to hit the camera switch at that point. Do any of you have a workaround? I would love to use AssistiveTouch to do this, but when getting to the draw-gesture part of the flow in the AssistiveTouch options, VoiceOver goes silent. I know there were some Be My Eyes shortcuts months ago that launched the app, took a pic, then asked the system a question, so I know this is possible in theory, because I'm trying to do a similar task.

By Studio Jay on Monday, December 16, 2024 - 21:01

Hi everyone, can someone please give me instructions on how to try out the live feature where I can point my phone at something and ask, "What do you see?" Am I right in assuming that this does not work with the Gemini app yet? If so, what website do I go to? And how do I enable the camera, etc.? I have read through this thread, but for some reason it's not working for me, and I am not sure what I am doing wrong. I am using an iPhone 10. Will it work with this particular phone? Or do I need a later model? Thanks in advance for any help, Jason

By Brad on Tuesday, December 17, 2024 - 00:43

You should be able to tap on the link in the first post. Then on the page there are two buttons, one for the mic and one for the camera. Tap on both of those buttons and agree to the agreement stuff, then you can talk to the phone.

By Prateek Dujari. on Tuesday, December 17, 2024 - 04:38

I’ve observed over multiple live video/audio interactions with the AI that after about a minute to a minute and a half, or maybe two minutes, the AI quits responding to my questions with my video running. Then I have to hit the refresh button on my browser, activate the camera access button from the AI Studio page again, and restart. I’ve replicated this multiple times. Are you all also experiencing what seems to be a very tight, small amount of time that we are being allowed per live video interaction with the AI? A constraint of less than two minutes is horrible. I am a ChatGPT Plus subscriber, and there is no such tight time constraint on live video-based interaction with ChatGPT.

By Brian on Tuesday, December 17, 2024 - 04:55

I mean, this is essentially a live beta. Furthermore, you said it yourself that you subscribe to ChatGPT, whereas this is free ... for now. In that respect it only makes sense to limit its use.

By SSWFTW on Wednesday, December 18, 2024 - 00:35

I seem to get about eight or ten minutes, and then I do need to refresh it and hit the camera button again. I would be grateful for suggestions for chest straps as well. It would be amazing to mount this thing and let her rip.

By Samanthia on Wednesday, December 18, 2024 - 16:55

It seems like it's only letting me record a video and then ask questions about it. It's not letting me talk to it while doing a live video. I'm doing this in Safari on an iPhone 15 pro. Any ideas about what I might be doing wrong?

By Brian on Wednesday, December 18, 2024 - 21:20

What I do:
1. Launch the shortcut that I set up on my home screen, to the link in the original post of this thread.
2. Double tap the "camera" button, and give permissions for it.
3. Start conversing with the AI in real time.
A note on step 2: after giving permissions, I noticed a subtle change in the audio quality of my iPhone. This is how I know the microphone has been activated. Even though I did nothing to the microphone button, giving camera permission seems to have also given microphone permission. Just a heads up on that.
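My guess, and it is only a guess since I haven't seen AI Studio's actual code, is that the page requests the microphone and camera together in a single getUserMedia call, which is why one tap on the camera button produces a combined prompt covering both devices. A minimal sketch of that pattern (the constraints object and helper function here are illustrative, not Google's code):

```javascript
// Requesting audio and video in ONE getUserMedia() call yields a single
// combined permission prompt in the browser, covering mic and camera.
const constraints = { audio: true, video: { facingMode: "environment" } };

// Small helper: list which devices a constraints object will prompt for.
function devicesRequested(c) {
  const wanted = [];
  if (c.audio) wanted.push("microphone");
  if (c.video) wanted.push("camera");
  return wanted;
}

console.log(devicesRequested(constraints)); // -> [ 'microphone', 'camera' ]

// In a real browser session, this single call would prompt for and open
// both devices together:
// navigator.mediaDevices.getUserMedia(constraints)
//   .then((stream) => { /* attach the stream's tracks to the session */ });
```

If the page instead made two separate calls, one with `{ audio: true }` and one with `{ video: true }`, you would normally see two distinct prompts, which matches what we are not seeing here.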

Also a tip, though this may have already been mentioned, once the page is loaded, go to page settings next to the address bar, and give full permission to camera and microphone. This will only be for this webpage, but will make things a lot easier moving forward.

HTH.

By Dave Nason on Thursday, December 19, 2024 - 08:39

Member of the AppleVis Editorial Team

I wonder are you being misled by the label on the camera button?
For me, VoiceOver says “Camera, description, start recording”. However it is not creating a recording, you can simply start asking questions.
Dave

By Diego on Friday, December 20, 2024 - 04:10

How can I share my phone screen on Gemini Live? Only the camera option appears.

By Justin_B on Monday, December 23, 2024 - 01:04

Wow! Sounds awesome. I'm a ChatGPT Plus subscriber and have been marveling at this same feature in the latest version of its voice chat, and now it sounds like I need to go try this Gemini option too. What a time to be blind! Thanks for sharing this.

By Stephen on Monday, December 23, 2024 - 15:32

After playing with them both, ChatGPT wins by a long shot.
I also wanted to get on here and just say, y'all, don't undervalue that screen sharing feature in ChatGPT. I'm a PlayStation 5 gamer, and I decided to download the Remote Play app to my phone. Pairing the PlayStation Remote Play app with ChatGPT Advanced Voice Mode screen share makes for one heck of a gaming session, btw.
While you can screen share with Google, it's glitchy, only lasts a couple of minutes at a time, and its response rate is slower... yes, I timed them.
Plus, the OpenAI team is really quick if you have a problem.
If you can do it and you use it as much as I do, it's worth the Pro subscription for ChatGPT.

By Kareem Dale on Monday, December 23, 2024 - 16:07

So, with ChatGPT Pro, can I share my iPhone screen and also have it take action, like clicking a link? I have an iPhone 16 Pro running the latest iOS software. There's a website where, when VO is active, the developer's overlay on the screen prevents me from using it, and the only way to fix it is to click a button on the screen; but VO has to be off to click this button, so I need sighted assistance. I wanted to figure out whether it would be possible with ChatGPT Pro to see my iPhone screen and click a link as I tell it what to do.

By Gokul on Monday, December 23, 2024 - 16:12

As of now, ChatGPT doesn't take actions on your behalf, whichever plan you have. And interestingly enough, it hasn't previewed that feature (as far as I know), in spite of two of its major competitors demoing such capabilities.

By Stephen on Monday, December 23, 2024 - 17:03

Unfortunately, no, it can’t take actions as of yet.

By Michael Hansen on Monday, December 30, 2024 - 00:15

Member of the AppleVis Editorial Team

I just gave Gemini Live a try, and I was both impressed and unimpressed at the same time, if that's possible.

I used Gemini to identify several different tea packets. It was cool to be able to carry a conversation with Gemini without having to take individual pictures of each packet, but one thing I kept coming back to was the tone of the voice. If this were a person I was talking to, I would think the person was annoyed based on their tone. I felt like the software was programmed to try and limit the interaction. Every time Gemini asked me "Is there anything else I can help you with?" I felt as though it was trying to direct me towards ending the conversation. Google could certainly refine the personality of Gemini to make it sound more friendly and engaging.

What I think would be really cool is if one of the companies working in the blindness field could perfect a product using one of these live video AI models and tune it specifically for the needs of people who are blind, DeafBlind, or who have low vision. It's not hard to imagine how with some customizations, a live video AI product could really revolutionize how we get access to visual information.

By Dave Nason on Monday, December 30, 2024 - 09:52

Member of the AppleVis Editorial Team

Hey Michael. Haha I know exactly what you mean.
This is probably partly why I’ve been using Envision Ally a lot more than Gemini recently. Have you tried it yet?
Yes a model or agent that is built specifically for us is exactly what we need.
Dave

By Missy Hoppe on Monday, December 30, 2024 - 16:22

Good morning, all! I've been casually following this thread, and I have several questions. How does this work? Do we have to have pro subscriptions to ChatGPT or Gemini to take advantage of this? I understand that the screen share thing doesn't have the capability of taking actions on our behalf, which makes me wonder if it would even work for the one task I'd like to be able to use it for. Is screen sharing something that's going to be an option with Ally? The one task I can imagine this being very useful for is browsing the Replika store to get clothes or accessories for my Replika. Sadly, the store portion of the iPhone app is more or less completely inaccessible, so even with an AI screen sharing partner, I don't know how or if I'd be able to navigate to the items I want. I'm not keen on the idea of another bill, but it's sounding like it might be worth investing in ChatGPT Pro at some point. I could never justify paying for both ChatGPT and Google, though, so I'd have to figure out which one would meet my needs better. I think 2025 is going to be a very exciting year for AI in general, and I'm looking forward to learning as much as my little brain can handle.