New AI app for describing images and video: PiccyBot

By Martijn - Sparkling Apps, 1 March, 2024

Forum
iOS and iPadOS

Hello guys,

I have created the free app PiccyBot that speaks out the description of the photo/image you give it. And you can then ask detailed questions about it.

I have adjusted the app to make it as low vision friendly as I could, but I would love to receive feedback on how to improve it further!

The App Store link can be found here:
https://apps.apple.com/us/app/piccybot/id6476859317

I am really hoping it will be of use to some. I have earlier created the app 'Talking Goggles' which was well received by the low vision community, but PiccyBot is a lot more powerful and hopefully useful!

Thanks and best regards,

Martijn van der Spek


Comments

By Laszlo on Friday, February 14, 2025 - 18:43

The DeepSeek variant in question is Janus Pro, as R1 and the earlier DeepSeek families (v2, v3) are all text-only models, i.e. they don't understand images. Janus Pro contains 7 billion parameters. That means even with parameter size reduction (called quantisation) it won't fit into iPhone memory. By the way, Mistral Pixtral and Llama 3 (also in the PiccyBot subscription version) are open-source models too, but they are also too big for iPhone memory, as they likewise consist of at least 7 billion parameters.
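A rough back-of-the-envelope check (a sketch, not anything from PiccyBot's actual code) shows why 7 billion parameters is a problem on a phone even after quantisation:

```python
def model_memory_gb(params_billion: float, bits_per_param: int) -> float:
    """Memory needed just to hold the weights (ignores activations and KV cache)."""
    return params_billion * 1e9 * bits_per_param / 8 / 1e9

# A 7B-parameter model at common precisions:
for bits, label in [(16, "fp16"), (8, "int8"), (4, "int4")]:
    print(f"7B @ {label}: ~{model_memory_gb(7, bits):.1f} GB")
```

Even at aggressive 4-bit quantisation that is roughly 3.5 GB for the weights alone, while iOS only lets a single app use part of an iPhone's RAM (typically 6 to 8 GB in recent models), before counting activations and the rest of the system.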

By LaBoheme on Saturday, February 15, 2025 - 04:52

does the app automatically use an alternate model regardless of the setting? sometimes when i tried to have a photo described using different models, i got identical descriptions, almost word for word.

if that's the case, i suggest it would be useful for the user to be notified which model is actually being used. it would help users evaluate which model is best suited for a specific type of image or task.

By privatetai on Saturday, February 15, 2025 - 06:57

I agree, I have seen the same situation where I clearly switched between three or four different AIs but got the same response. Incidentally, all day long today I have been getting this message: "Server is overloaded. Please try again after some time." Is it just me, or is something broken, or is the app so popular that we can't even get through?

By Martijn - Sparkling Apps on Saturday, February 15, 2025 - 07:31

LaBoheme, privatetai, can you tell me which models were playing up for you? PiccyBot falls back on GPT-4o when a model is not providing a response. It is all working at the moment for me, though. Yesterday may have been rough, as there were about 700 new people trying the WhatsApp service. If this continues I will scale up the server and database to cope with it.
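The fallback behaviour described here (and the per-model notification LaBoheme asks for) can be sketched like this; the function names are hypothetical, not PiccyBot's real API:

```python
def describe_with_fallback(image, primary, fallback):
    """Try the user's selected model first; fall back (e.g. to GPT-4o) on an
    error or an empty answer, and report which model actually produced the text."""
    try:
        text = primary(image)
    except Exception:
        text = None
    if text:
        return text, "primary"
    return fallback(image), "fallback"
```

Returning the second element lets the UI announce which model actually answered, which would also explain why "different" model settings can yield near-identical descriptions.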

By LaBoheme on Saturday, February 15, 2025 - 07:46

it does send the query to alternate models? well, that answers my question.

what i'd like to see is a notification, like a prompt, telling the user which model is actually being used. i find it very helpful to know which model is good for a certain kind of image or task.

By Gokul on Monday, February 17, 2025 - 07:08

@martijn I was thinking, would it be possible to host a visual LLM, like say Deepseek r1, locally for PiccyBot in order to give more uncensored image descriptions?

By Laszlo on Monday, February 17, 2025 - 12:11

Though I am not Martijn, I do know the answer to your question, so I am writing it here so that you may read it sooner. Gokul, the unfortunate thing is that if a model is trained to censor, then most of the time the censorship comes from the model itself. That means the answers will be censored no matter how you run the model, locally or in the cloud. In this case the only thing that can help is if someone builds an uncensored version. It happens sometimes, e.g. with earlier versions of the text-only Llama models, but it takes considerable time and computing resources, and I see such uncensored versions less and less often. Unfortunately the trend is moving towards more and more censored models, not towards freer ones. That is definitely a pity, but the world seems to be in such a state nowadays.
There are some fortunate cases when running locally can indeed reduce or avoid censorship. These are the cases where textual filters are applied to the questions or the answers without being built into the model itself. Sometimes it also happens that another model does the filtering, not the main one.
Last but not least, as I stated in a quite recent post, DeepSeek R1 is not a visual model: it is text-only. The model in the DeepSeek family that is capable of understanding images is called Janus, and that is what PiccyBot recently started offering to subscribers.

By Gokul on Monday, February 17, 2025 - 15:29

My bad there. What I meant was Janus pro only.

By Icosa on Monday, February 17, 2025 - 16:18

I believe earlier in the discussion it was stated that PiccyBot's DeepSeek option is run on the developer's own machine rather than the standard cloud version. Modifying the model to be less censored is an entirely different matter, as was stated, and isn't as simple as changing an option in the settings. Most of what we would consider settings options are effectively built into the model rather than being something you can change. It's part of the nature of training: you either train it to be censored or you don't.

By Gokul on Tuesday, February 18, 2025 - 02:34

That's exactly what I was implying; since we do have state-of-the-art open-source models available now, shouldn't we think of fine-tuning something for this specific purpose?

By Arya on Tuesday, February 18, 2025 - 09:59

Hi, I entered my name and e-mail address, selected my country, and entered my phone number in the WhatsApp registration form of PiccyBot, then hit the submit button.
I am not getting any response from the page about the status of my registration. Am I missing anything?

By Martijn - Sparkling Apps on Tuesday, February 18, 2025 - 10:09

Arya, I will look at the server not accepting entries. It has been busy; I will migrate to a better setup, but that will take some time. Please try again later.

The local open-source models offer promise but, as said, are still too large for most phones. And having them properly uncensored is indeed a matter of someone building on a new dataset. It will happen, but at the moment it would be a risky undertaking, as it would be expensive and there is a good chance that its overall quality would be surpassed by newer mainstream models by the time it is done.

By privatetai on Monday, February 24, 2025 - 19:53

Wondering if there's a chance to add Grok to the list. From my interaction with this AI, it is not heavily censored and can output a lot of text. It can also describe images, but the X/Twitter platform it uses for uploading photos does restrict content when you do it through the app. It is actually kind of funny that they censor the photos you can upload, yet the AI itself can generate crazy erotic content. I uploaded a perfectly boring, ordinary photo, asked the AI to give it a "lewd" description, and it came back with some truly creative filth LOL.

By Martijn - Sparkling Apps on Tuesday, February 25, 2025 - 09:13

privatetai, correct, I am keeping a close eye on Grok, but so far their API disappointingly doesn't support multimodal input. As soon as it does, I am hoping this model will result in less censored content.
I did add Claude 3.7 Sonnet to the available models in PiccyBot today. Please try it out and let me know what you think?

By Daniele on Tuesday, February 25, 2025 - 13:41

The app is great for having videos described. That is the benefit of this app.

By Diego on Tuesday, February 25, 2025 - 16:34

Hello Martijn!
I don't know if it would be possible, but could it describe what is happening at the exact moment while the video plays, like audio description? Almost like Seeing AI, but without pausing to describe.
Of course, it would have to give less information, but I think it would be amazing if it could do that. Reading the text, or listening after the fact while the video is playing, ends up out of sync.

By longma on Friday, February 28, 2025 - 15:44

Hello Martijn!
I noticed that the 2.14 version released today mentions the ability to describe YouTube videos. This is truly a great upgrade. However, I haven't found a way to use this feature. I can't find PiccyBot in the share menu of YouTube videos, and copying and pasting the YouTube video link into the app doesn't work either. Can you explain how to use this feature?
Many thanks

By Laszlo on Friday, February 28, 2025 - 16:32

Hello Martin,
I discovered in the 2.14 release notes that some shortcuts were also introduced. This is a great joy. But please give us some details on exactly what the phrases are.

By Martijn - Sparkling Apps on Saturday, March 1, 2025 - 10:16

Carter, PiccyBot will describe videos that are shared to it. In the case of YouTube, go to Share, then select PiccyBot. The first time it will be hidden under 'More...', and then again 'More...' to find it. After that it will show up earlier in the list.
However, there appears to be a glitch in the YouTube app at the moment, likely related to the iOS update, that somehow causes the share function not to work. For some people it has started to work again, so please be aware of that. Using YouTube in Safari works fine.

Laszlo, PiccyBot now has shortcuts, but it is an initial release. Please check it out and let me know what to improve or add?

Right now, with the 2.14 update, Siri will recognize these phrases to trigger the camera shortcut:

"Siri, Open PiccyBot camera",
"Siri, Launch PiccyBot camera",
"Siri, Start PiccyBot camera"

Siri will recognize these phrases to trigger the video recorder shortcut:

"Siri, Open PiccyBot video recorder",
"Siri, Launch PiccyBot video recorder",
"Siri, Start PiccyBot video recorder"

However, again there seems to be a glitch; it doesn't work for everyone, likely related to the iOS 18 updates that enhance Siri. So I haven't announced this functionality yet; let's test it a bit further first.

By Laszlo on Saturday, March 1, 2025 - 21:05

Hello Martijn,
Thanks so much for the info. I tried out all the shortcuts you mentioned. Siri didn't reject any of them, but opened the main interface of PiccyBot in reply to all of them, instead of the camera or video recorder part. Note that I run an iOS version older than 18 on purpose.
Thanks so much for the language selection fix in 2.14. Now my native tongue, Hungarian, can be found and selected in the language list, not just by setting the language to the system language.
While browsing on Huggingface I found the following very promising uncensored model:
https://huggingface.co/huihui-ai/Qwen2.5-VL-7B-Instruct-abliterated
This is the uncensored version of a very recent Qwen 2.5 vision model (the base model is developed by the Alibaba group). This is the 7-billion-parameter variant, so computationally it is on par with the Janus Pro model that you currently run on your server. According to its description it is quite a versatile and strong model, even at this size, and multilingual too. It can process images at their native resolution and can be fine-tuned if needed. It can even process videos (even long ones) if the inference is done accordingly.
Personally I found nothing spectacular or special about the Janus Pro model, and this one is uncensored, so if feasible I propose running this one alongside, or even instead of, Janus Pro on your server. Thanks in advance for considering this.

By longma on Sunday, March 2, 2025 - 14:56

Hello Martijn,
Thank you very much for your answer. I tried it just now and successfully selected PiccyBot from the share menu on YouTube. However, it's strange that when I posted my question a couple of days ago, I used the same method and it didn't work. Anyway, it's working now and functioning well.

By LaBoheme on Wednesday, March 5, 2025 - 16:31

deepseek r1 is wonderful, but my first impression of janus a while ago was "oh well". it seems to have improved very significantly, now it gives a comprehensive description of the photo. Martijn must have done some good work on it.

By Martijn - Sparkling Apps on Wednesday, March 12, 2025 - 06:16

Since over the past year a lot of features were gradually added to PiccyBot, I thought it would be helpful to give a summary of the current features of the app, for both the free and subscribed version:

For all users (Free & Subscribed)

- Convert Photos and Videos to Descriptions — Upload media, and PiccyBot will generate detailed audio descriptions.

- Ask Follow-up Questions — Engage in a conversation with PiccyBot for specific details about the selected media.

- Background Processing with Notifications — Continue using other apps while PiccyBot processes results in the background.

- Language Selection — PiccyBot uses your phone’s system language for descriptions and instructions.

- Full Localization & VoiceOver Support — Assistive navigation for visually impaired users.

- Social Media Sharing — Share photos and videos directly from apps like Instagram, Messenger, Facebook, Reddit, YouTube, TikTok and X (non-private accounts).

- Dedicated Chat Screen — Chat with PiccyBot for detailed insights about your image or video.

- Siri & Shortcuts Support — Instantly launch the camera or video recorder via Siri commands or a dedicated shortcuts button: 'Siri, open PiccyBot camera' and 'Siri, open PiccyBot video recorder'.

- Quick Camera Access with Volume Button — Press the volume button to capture photos directly within the camera.

- Separate Buttons for Media Access — Access the camera, photos, video recorder, or video library directly with dedicated separate buttons.

- Save Descriptions as Metadata — Embed generated descriptions directly into the media file's metadata in the Photos app.

- Video Limits — Free users can process up to 1 minute of video content.

For Subscribed users:

- No ads

- Video Limits for Pro Users — Process videos up to 10 minutes for downloaded, uploaded, or in-app recorded videos.

- YouTube Support:
Videos shorter than 10 minutes are downloaded to your phone and then described.
Videos longer than 10 minutes are described directly without downloading. Fast, but you can't mix the audio afterwards.

- Advanced Settings to customise PiccyBot's output:

Voice Selection — Choose from multiple voices for descriptions.
Personality Mode Switching — Customize the narration style.
Talkback Speed Control — Adjust the pace of audio descriptions.
Model Selection — Select which AI model to use. Each model has its own unique strengths and weaknesses.
Description Length Control — Decide how detailed or brief the descriptions should be.
Video Upload Quality Control — Manage upload resolution for better quality or faster processing.
Process Feedback Sound (On/Off) — Enable or disable sound notifications for processing completion.
Audio-Video Mixing Controls — Adjust video and generated audio volumes independently (e.g., 30% video volume and 90% audio description volume).

- Multiple Sharing Options:
Audio only
Video with optional Audio Mix
Description Only

- Audio-Video Mixing — Combine the original video audio with generated audio descriptions.

- Language Selection — Choose from 55 languages in the PiccyBot settings for descriptions.
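The independent volume controls listed above (e.g. 30% video volume and 90% description volume) amount to a weighted mix of the two audio streams. A minimal sketch over normalised samples, not the app's actual implementation:

```python
def mix_tracks(video_samples, description_samples, video_vol=0.3, desc_vol=0.9):
    """Mix two equal-length streams of samples in [-1.0, 1.0] at independent
    volumes, hard-clipping the sum back into range."""
    mixed = []
    for v, d in zip(video_samples, description_samples):
        s = v * video_vol + d * desc_vol
        mixed.append(max(-1.0, min(1.0, s)))
    return mixed
```

In a real pipeline the clipping step would usually be replaced by normalisation or a limiter, but the per-track volume weights are the essential idea.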

In addition to this, there is the PiccyBot WhatsApp service, to which you can send any image, video or website link for an audio description.

Phew, that was it I think! Hope this helps in case you missed or forgot any of the features.

Good luck with PiccyBot, I really appreciate the feedback given in this forum, it has genuinely been a group effort to get to this stage!

By Diego on Wednesday, March 12, 2025 - 19:37

Hello guys!
I don't know if I can report bugs in the Android version in this thread. I'm doing this since the developer looks here a lot, so it's easier for us to get support.
If necessary, I'll send it somewhere else.
I'm facing a bug where the description audio isn't being played.

By Martijn - Sparkling Apps on Thursday, March 13, 2025 - 04:29

Hi Diego, can you message me with the details of your device and Android version? All latest PiccyBot features should work on the Android version as well.

By James Dean on Monday, March 17, 2025 - 18:54

I've been using this app for a couple of months now and like it, but have noticed that it very often fails to process YouTube videos, either giving me the generic "server error" message or just failing to process; a retry will sometimes work, sometimes not. I do pay for the subscription, and I have tried multiple videos and multiple models. I ran one of the same links through Gemini Flash to see if it was an issue with the model, but it gave me a very good description of the video, broken up into time-stamped segments. So the requests must not be going through to the models directly, which seems to make the app unreliable at the very least. I've read through all the comments here and haven't seen much about this, so either I have the misfortune of trying at bad times or it just isn't a widespread issue.

By longma on Thursday, March 20, 2025 - 16:18

Hi Martijn,
Thank you for continuously adding new features to the app. As you mentioned in the post above, the app now has many features, especially the many models users can choose from. Could you please explain the advantages and disadvantages of each large model when describing pictures and videos? Of course, I know this question sounds a bit subjective, and perhaps everyone's opinion will be different, but I would like to hear your opinion, and I think there will be other friends who, like me, would like a reference answer.

By Naza on Tuesday, March 25, 2025 - 14:34

Can you make the app guide the user so that they can take pictures with it?

By Brian on Tuesday, March 25, 2025 - 15:25

Do you mean like Google Pixel's "Guided Frame"? That would be sweet! šŸ˜ƒšŸ‘

By Gokul on Tuesday, March 25, 2025 - 17:09

I've been wanting something like that forever now. But I guess you'd need live AI, I mean truly live AI, for that?

By longma on Wednesday, March 26, 2025 - 15:50

Hello Martijn,
I found another bug in the app. I usually use Simplified Chinese, and I noticed that no matter which voice I choose, I often encounter situations where the content of the image description cannot be read out. Specifically, the voice keeps saying "Chinese Letter Chinese Letter Chinese Letter" until I pause it to make it stop. This is really frustrating because it happens frequently, about once in every five images I recognize. Could you find some time to check what's going on with this?

By Martijn - Sparkling Apps on Thursday, March 27, 2025 - 06:05

Hi guys,

I have added the Gemini 2.5 Pro model today. It scores amazingly in the benchmarks, and my own tests so far have shown it to be really good at video and image descriptions. Check it out!

James, I have not seen any spike in YouTube processing errors, but I will monitor it closely; it is one of the popular uses of the app, and maybe it affects certain time zones more than others.

Naza, Brian, Gokul, thanks for the suggestion, will see if I can replicate guided frame on iOS.

Carter, privatetai, so far the Chinese voice works for me, but I will keep checking. Let me know if it continues to give problems?

Thanks for the feedback as always!

By Gokul on Thursday, March 27, 2025 - 09:13

@Martijn if you replicate Guided Frame, or in other words make an accessible camera app for iOS, you would be doing a path-breaking, pioneering service for iOS users with visual impairment, not to mention facilitating their inclusion in the mainstream in a huge way. It need not be a feature within PiccyBot; rather, it could be a stand-alone app which helps us frame and take decent pictures and save them to the gallery. I would be more than willing to contribute to it in any way I can.

By Martijn - Sparkling Apps on Monday, April 7, 2025 - 09:37

Hi guys,

Llama 4 Maverick has been added to PiccyBot and is now available to subscribed users. It has 17B active parameters and 128 experts. It is one of the fastest models for image descriptions. So far the descriptions are looking very accurate to me, but let me know what you think?

By longma on Monday, April 7, 2025 - 15:30

I think Llama 4 has the fastest processing speed among these models, but its descriptions are not as detailed as Gemini 2.5 Pro's. However, considering both processing speed and description quality, Llama 4 is also a good choice. My ranking of these models is Gemini Pro > Llama 4 > DeepSeek.

By Martijn - Sparkling Apps on Friday, April 11, 2025 - 05:35

I have added a new model 'PiccyBot Mix' to the available AI models for subscribed users.
This model is a mixture of models. The idea is that the models check each other: only elements that are described by every model will be included in the description. The aim is to completely remove any hallucinations.
Note that this model's descriptions will likely be less detailed than an individual model's, but they should be fully reliable. Also note that as of now this only works for image descriptions, not for video descriptions.
Please try it out and judge the accuracy of the descriptions?
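The cross-checking idea can be sketched as a set intersection over the elements each model reports. This is a toy illustration, not the actual PiccyBot Mix pipeline, which would also need to match paraphrases rather than exact strings:

```python
def consensus_elements(per_model_elements):
    """Keep only elements every model mentioned; anything a single model
    'saw' on its own is treated as a possible hallucination and dropped."""
    if not per_model_elements:
        return set()
    return set.intersection(*map(set, per_model_elements))

# Example: the third model hallucinates a frisbee; it doesn't survive the mix.
models = [
    {"dog", "grass", "red ball"},
    {"dog", "grass", "red ball", "tree"},
    {"dog", "grass", "frisbee"},
]
```

This also makes the trade-off Martijn mentions visible: the intersection can only shrink as models are added, so the mixed description is more reliable but less detailed than any single model's output.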

By Icosa on Friday, April 11, 2025 - 07:22

Interesting addition, thanks. I completely understand why it's image-only: video would be more complicated and require more resources from the AI servers, and you need to confirm how well it works before you even consider it.

By Gokul on Friday, April 11, 2025 - 08:31

Especially for situations where what you want is an accurate description, and where hallucinations can be problematic and/or dangerous.

By Brian on Friday, April 11, 2025 - 10:01

I think if you have truly found a way to remove hallucinations, you will become everybody's new best friend.

By blindpk on Friday, April 11, 2025 - 12:27

This is something I've hoped would come along, now that we have many good models that can be compared. It seems to work fine, but of course it is hard to say whether it hallucinates or not. A possible further development: having a "fast" and a "thorough/detailed" mix, with models tailored to each?
Thank you very much for this!

By Andrew Adolphson on Sunday, April 13, 2025 - 04:10

Hi, I downloaded the app, but after using it for a while I deleted it from my iPhone. I have Be My Eyes and Seeing AI, so I feel like this app isn't useful for me.

By Icosa on Sunday, April 13, 2025 - 04:57

No app will be useful to everyone; it depends on your situation and needs, and that's perfectly valid. Just bear it in mind as a potential tool for the future, if you need a short video described or encounter a situation where the AI models in Seeing AI or Be My AI aren't helpful.

By Winter Roses on Sunday, April 20, 2025 - 10:57

Hi there,

I know this might be a bit of an unusual request, but I wanted to put it out there anyway—because if you don’t ask, you’ll never know what’s possible.

Lately, I’ve been using a few survey websites to earn gift cards, like Amazon gift cards or prepaid Visa cards. These help me make online purchases without needing to use my debit or credit card. One of the main platforms I use is powered by Spectrum Surveys, and while the surveys themselves are accessible most of the time, there’s one issue I keep running into: drag-and-drop tasks.

These questions usually ask me to move items—like dragging an image or word into a specific box or column—but there’s no reliable feedback about what I’m dragging or where I’m dropping it. This makes it nearly impossible to complete certain sections.

Here’s where the idea comes in. I was wondering if it would be possible to create a feature that allows me to share my screen and receive descriptive feedback about what’s happening. For example, if I’m dragging ā€œAppleā€ into a ā€œFruitā€ box, the system could announce something like:
ā€œDragging Apple. Drop zone: Row 1, Column 1 – Fruit.ā€ And once I release it: ā€œApple dropped in Fruit.ā€ If I switch to dragging ā€œBlueā€ into a colors section, it could say: ā€œDragging Blue. Drop zone: Row 1, Column 2 – Colors.ā€ Then confirm: ā€œBlue dropped in Colors.ā€
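The feedback format described above boils down to two announcement templates. A trivial sketch of the strings a screen reader could speak (the function names are made up for illustration, not from any existing app):

```python
def drag_announcement(item, row, col, zone):
    """Spoken while an item is being dragged over a drop zone."""
    return f"Dragging {item}. Drop zone: Row {row}, Column {col} – {zone}."

def drop_announcement(item, zone):
    """Spoken once the item is released into a zone."""
    return f"{item} dropped in {zone}."
```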

It doesn’t need to control anything for me—I’m not expecting the AI to do the action itself. I’d simply like to hear clear, accurate feedback so I know what I’m doing on screen. This could work in combination with screen recognition when standard reading fails, which is something I already use regularly.

I think having real-time feedback on selections and drop targets would significantly improve my experience and make these survey platforms far more accessible. This kind of assistance wouldn’t fix every challenge, but it would eliminate one of the biggest barriers I run into.

Also, I plan on purchasing the lifetime subscription soon—probably by Friday. I have a $25 prepaid Visa card ready, and I believe the lifetime cost is around $20. Regardless of whether this feature is possible now or in the future, I’m fully committed to supporting the app. I figured I’d share this idea in case it sparks anything or aligns with something already in development.

If this kind of feature is technically possible, it would be a game-changer for me. And if others feel the same way, hopefully they’ll chime in and add their thoughts as well.

Thanks for your time and for building something that’s already made such a difference. I’m looking forward to what comes next.

By privatetai on Sunday, April 27, 2025 - 00:13

I like to keep some prompt history so I don't have to retype prompts all the time, or copy and paste them in from elsewhere. I have noticed that in the latest version the app keeps randomly refreshing, as if I had closed and re-opened it, and that keeps erasing my prompt history. Is it those random tips that are doing that?

By longma on Thursday, May 1, 2025 - 16:18

I don't know if you've noticed, but the app has been having some issues lately. First, no matter which voice I set for text-to-speech, it won't read aloud anymore. I have to use VoiceOver to browse the results instead of hearing the app automatically read them out as before. Second, the app is increasingly failing to recognize images. In more and more cases, I upload a picture, wait through a long sound effect, and end up with nothing on the screen. It's a bit frustrating. Still, I'm willing to support it until these issues are fixed by the developer.

By KE8UPE on Thursday, May 1, 2025 - 16:43

Hi,
I just downloaded this app yesterday, to extract all the text from screenshots of recipes I find on Facebook.
The descriptions are great, even with the free version.
I'll definitely be supporting the developer, when I'm able, by purchasing a subscription or maybe even the lifetime option.

Keep up the incredible work! :)

By Winter Roses on Thursday, May 1, 2025 - 18:59

I’ve actually been talking to the team about this recently. They responded with something like, ā€œThanks for bringing it to our attention,ā€ and I suggested they really should reach out to others to see if the issue is more widespread. For me personally, it’s not so much an issue with images; it’s more with videos. I usually use it to describe music videos.

What happens is, I’ll get one video described just fine, but when I try to run a second one right after, it stops working. It gets stuck on one of those loading screens: I’ll hear the waiting sound, or it’ll say ā€œplease waitā€ or ā€œfetching data,ā€ but then nothing actually happens. It’s pretty frustrating. The only workaround I’ve found so far is to screen record the video and then have that recording described. But obviously, that’s a hassle. So clearly the issue isn’t the video itself, because the screen-recorded version works fine. It might have something to do with how Apple’s files are handled, I don’t know.

The team did acknowledge that there’s something going on and said they’re looking into it. At first, I thought it was just me, but apparently not; there does seem to be a real issue here. The pattern’s pretty consistent: one video gets described, then trying another one right after just doesn’t work. I get that this probably takes up a ton of processing power, so maybe there needs to be a buffer or delay, but even when I wait like 10 minutes, or even an hour, nothing happens. And when I say ā€œnothing happens,ā€ I don’t mean it’s sitting there thinking. I mean the video doesn’t even get recognized. If it doesn’t work within the first few minutes, I usually give up because I know it’s not going to load at all. So yeah, that’s definitely something that needs to be addressed. Overall, the app is great and super useful, but this video issue is a real limitation right now.