Image Caption Generator

Generate AI-powered captions for images using the vit-gpt2-image-captioning model. Runs entirely in your browser with Transformers.js.

ℹ️ How it works

This tool:

Runtime: Transformers.js runs the model entirely in your browser
Model: Xenova/vit-gpt2-image-captioning (Vision Encoder-Decoder)
First Load: The model (~200 MB) is downloaded the first time you use the tool
Caching: The model is cached in your browser, so later runs start faster
Privacy: All processing happens locally, and your images are never sent to a server
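
For readers curious how the steps above might fit together, here is a minimal sketch, not the tool's actual source. It assumes the model is loaded lazily on the first request and then reused. The names `createCaptionService`, `Captioner`, and `CaptionerLoader` are hypothetical. The loader is passed in so the sketch stays self-contained; the comment shows how a Transformers.js loader could be supplied in the browser.

```typescript
// Result shape of the Transformers.js "image-to-text" pipeline:
// an array of objects with a `generated_text` field.
type CaptionResult = { generated_text: string }[];
type Captioner = (image: string) => Promise<CaptionResult>;
type CaptionerLoader = () => Promise<Captioner>;

// In the browser, a loader could look like this (assumption, not executed here):
//   import { pipeline } from "@xenova/transformers";
//   const loader = () => pipeline("image-to-text", "Xenova/vit-gpt2-image-captioning");

// Hypothetical helper: loads the model once, on first use, and reuses it afterwards.
function createCaptionService(loader: CaptionerLoader) {
  let pending: Promise<Captioner> | null = null;

  return async function caption(imageUrl: string): Promise<string> {
    if (pending === null) {
      // Start the (possibly slow) model load only once; forget it on failure
      // so that a later call can retry.
      pending = loader().catch((err) => {
        pending = null;
        throw err;
      });
    }
    const captioner = await pending;
    const output = await captioner(imageUrl);
    // Use the first candidate caption, or an empty string if none was produced.
    return output[0]?.generated_text.trim() ?? "";
  };
}
```

Because the pending load is shared, two captions requested in quick succession trigger only one model download.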

Upload Image

Click to upload or drag and drop

Supports JPG, PNG, GIF, WebP

Model not loaded

Generated Caption

Upload an image and click "Generate Caption" to get a description

FAQ

How accurate are the captions?

The model provides reasonable captions for most images, but accuracy depends on image complexity and content. It works best with clear, well-lit images containing recognizable objects and scenes.

What image formats are supported?

JPG, PNG, GIF, and WebP formats are supported. You do not need to prepare the image yourself: it is resized and normalized automatically before being passed to the model.
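
A format check like this can be done before any model work starts. The sketch below is one possible approach, assuming the check is based on the file's MIME type (in a browser this would typically be the `type` property of the selected `File`). `SUPPORTED_TYPES` and `isSupportedImage` are hypothetical names.

```typescript
// MIME types corresponding to the formats listed above (JPG, PNG, GIF, WebP).
const SUPPORTED_TYPES = new Set<string>([
  "image/jpeg",
  "image/png",
  "image/gif",
  "image/webp",
]);

// Hypothetical helper: returns true when the MIME type is one of the supported formats.
// Comparison ignores case and surrounding whitespace.
function isSupportedImage(mimeType: string): boolean {
  return SUPPORTED_TYPES.has(mimeType.trim().toLowerCase());
}
```

Rejecting unsupported files early avoids loading the model for an image it cannot process.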

How long does it take?

On first use, the model (~200 MB) has to be downloaded, which typically takes 30-60 seconds depending on your connection. Subsequent uses are much faster (5-15 seconds) because the model is already cached in your browser.
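
Since the first download can take a while, a page like this would usually show progress while the model loads. The sketch below turns a load event into a status message. The event shape (`status`, `file`, `loaded`, `total`) is an assumption modeled on what the Transformers.js `progress_callback` option reports, and `ModelLoadEvent` and `formatLoadStatus` are hypothetical names.

```typescript
// Assumed shape of a model-loading event; only the fields used below are modeled.
interface ModelLoadEvent {
  status: string;   // e.g. "progress" or "ready" (assumed values)
  file?: string;    // name of the file being downloaded
  loaded?: number;  // bytes received so far
  total?: number;   // total bytes expected
}

// Hypothetical helper: converts a load event into a short status line for the UI.
function formatLoadStatus(e: ModelLoadEvent): string {
  if (e.status === "progress" && e.loaded !== undefined && e.total) {
    const pct = Math.round((e.loaded / e.total) * 100);
    const toMB = (bytes: number) => (bytes / (1024 * 1024)).toFixed(1);
    return `Downloading ${e.file ?? "model"}: ${pct}% (${toMB(e.loaded)} / ${toMB(e.total)} MB)`;
  }
  if (e.status === "ready") {
    return "Model ready";
  }
  // Any other status, or a progress event without usable byte counts.
  return "Loading model...";
}
```

In the browser, such a function could be wired up by passing `{ progress_callback: (e) => show(formatLoadStatus(e)) }` as the options argument when creating the pipeline, where `show` is a placeholder for whatever updates the page.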

Is my data sent to a server?

No. All processing happens entirely in your browser. The model is downloaded once and cached locally.

Can I use this offline?

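Yes. Once the model has been downloaded and cached, the tool works without an internet connection. The model files are kept in your browser's storage for this site (Transformers.js uses the browser's Cache API by default), so clearing site data removes them and the model will be downloaded again on the next use.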
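
Offline use depends on the browser having enough room to keep the model. The sketch below is one way to check this before downloading, assuming the estimate comes from the browser's `navigator.storage.estimate()`, which reports `quota` and `usage` in bytes. `MODEL_BYTES` is an approximation taken from the ~200 MB figure above, and `hasRoomForModel` is a hypothetical name.

```typescript
// Subset of the object returned by navigator.storage.estimate().
interface StorageEstimateLike {
  quota?: number; // bytes the browser is willing to give this site
  usage?: number; // bytes this site already uses
}

// Approximate model size, based on the ~200 MB figure quoted on this page (assumption).
const MODEL_BYTES = 200 * 1024 * 1024;

// Hypothetical helper: true when the remaining quota can hold the model.
// If the browser does not report numbers, assume there is room and let the download decide.
function hasRoomForModel(estimate: StorageEstimateLike, needed: number = MODEL_BYTES): boolean {
  if (estimate.quota === undefined || estimate.usage === undefined) {
    return true;
  }
  return estimate.quota - estimate.usage >= needed;
}
```

In a browser this could be called as `hasRoomForModel(await navigator.storage.estimate())` before the first download.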