fal-ai/moondream3-preview/caption

Moondream 3 is a vision language model that brings frontier-level visual reasoning with native object detection, pointing, and OCR capabilities to real-world applications requiring fast, inexpensive inference at scale.

Inference

Commercial use

Input

Image URL*

Hint: Provide an image by URL, by dragging and dropping a file from your computer or a web page, or by pasting from the clipboard (Ctrl/Cmd+V). Accepted file types: jpg, jpeg, png, webp, gif, avif.
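As a rough illustration of the input described above, the sketch below checks a URL against the listed file types before building a request payload. The key name `image_url` is an assumption inferred from the "Image URL" field label, and `build_caption_arguments` is a hypothetical helper, not part of any fal library.

```python
import os
from urllib.parse import urlparse

# Extensions listed in the playground hint.
ACCEPTED_EXTENSIONS = {"jpg", "jpeg", "png", "webp", "gif", "avif"}


def build_caption_arguments(image_url: str) -> dict:
    """Return a request payload after checking the URL's file extension.

    The key "image_url" is an assumption based on the "Image URL" field
    label; the page itself does not confirm the exact parameter name.
    """
    # Look only at the path so query strings do not confuse the check.
    path = urlparse(image_url).path
    extension = os.path.splitext(path)[1].lstrip(".").lower()
    if extension not in ACCEPTED_EXTENSIONS:
        raise ValueError(f"unsupported file type: {extension or '(none)'}")
    return {"image_url": image_url}


print(build_caption_arguments("https://example.com/hedgehog.PNG"))
```

An extension check is only a client-side convenience: a URL with no extension may still serve a valid image, so a stricter or looser check may suit a given application.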

Result

A hedgehog is captured in a close-up shot, focusing on its face and nose. The hedgehog's spines are visible along its back, and its nose is dark and wet-looking. A gold ring with a small diamond is positioned on the grass in front of the hedgehog, partially obscured by its nose. The background is a blurred green grassy field, with small white flowers scattered throughout the grass.

{
  "finish_reason": "stop",
  "usage_info": {
    "prefill_time_ms": 54.45315001998097,
    "decode_time_ms": 811.5944429300725,
    "input_tokens": 737,
    "ttft_ms": 91.87838807702065,
    "output_tokens": 23
  },
  "output": "A hedgehog is captured in a close-up shot, focusing on its face and nose. The hedgehog's spines are visible along its back, and its nose is dark and wet-looking. A gold ring with a small diamond is positioned on the grass in front of the hedgehog, partially obscured by its nose. The background is a blurred green grassy field, with small white flowers scattered throughout the grass."
}
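The result above can be read programmatically. The following sketch parses a copy of that payload (with the `output` string shortened to its first sentence for brevity) and derives two summary figures. Reading `decode_time_ms` as the time spent generating output tokens is an assumption based on the field name, and `summarize_usage` is a hypothetical helper.

```python
import json

# Copy of the sample result; "output" is shortened to its first sentence.
response_text = """
{
  "finish_reason": "stop",
  "usage_info": {
    "prefill_time_ms": 54.45315001998097,
    "decode_time_ms": 811.5944429300725,
    "input_tokens": 737,
    "ttft_ms": 91.87838807702065,
    "output_tokens": 23
  },
  "output": "A hedgehog is captured in a close-up shot, focusing on its face and nose."
}
"""


def summarize_usage(response: dict) -> dict:
    """Derive total token count and decode throughput from usage_info."""
    usage = response["usage_info"]
    decode_seconds = usage["decode_time_ms"] / 1000.0
    return {
        "total_tokens": usage["input_tokens"] + usage["output_tokens"],
        # Assumes decode_time_ms covers generation of all output tokens.
        "decode_tokens_per_second": usage["output_tokens"] / decode_seconds,
    }


response = json.loads(response_text)
summary = summarize_usage(response)
print(response["finish_reason"], summary["total_tokens"])
print(round(summary["decode_tokens_per_second"], 1))
```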

Requests cost $0.40 per million input tokens and $3.50 per million output tokens.
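To make the pricing concrete, here is a small worked calculation using the token counts from the sample result (737 input tokens, 23 output tokens). The function name is hypothetical, and the sketch assumes billing is strictly linear in token count with no minimum charge or rounding, which the page does not state.

```python
# Prices quoted on the page, in US dollars per million tokens.
INPUT_PRICE_PER_MILLION = 0.4
OUTPUT_PRICE_PER_MILLION = 3.5


def estimate_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate request cost, assuming purely linear per-token billing."""
    total = (
        input_tokens * INPUT_PRICE_PER_MILLION
        + output_tokens * OUTPUT_PRICE_PER_MILLION
    )
    return total / 1_000_000


# Token counts taken from the usage_info block of the sample result.
cost = estimate_cost_usd(input_tokens=737, output_tokens=23)
print(f"${cost:.7f}")  # prints $0.0003753
```

At these rates, roughly 2,600 requests of this size would cost about one dollar, with input tokens accounting for most of each request's cost.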
