MoonDream Vision Overview

MoonDream is a small, fast vision-language model that provides the computer vision capabilities of the FYI Automation Tool. It enables automated UI element detection, screen analysis, and natural language queries about visual content.

What is MoonDream?​

MoonDream is a compact vision-language model that can:

  • Answer questions about images
  • Detect objects and UI elements
  • Point to specific locations in images
  • Run efficiently on a range of hardware (CPU, NVIDIA GPUs via CUDA, Apple GPUs via MPS)

Integration with FYI Automation Tool​

MoonDream powers the computer vision capabilities of the FYI Automation Tool in two areas:

🤖 AI Automation Features

  • UI Element Detection: Automatically finds buttons, inputs, and interactive elements
  • Screen Analysis: Understands what's visible on device screens
  • Natural Language Commands: Interprets commands like "tap the login button"
  • Coordinate Caching: Speeds up repeated operations by remembering element locations

📱 Device Automation

  • Screenshot Analysis: Processes device screenshots in real-time
  • Gesture Targeting: Provides precise coordinates for touch interactions
  • Validation: Verifies screen state and element presence
  • Error Recovery: Detects when UI elements are missing or changed

Architecture​

┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│ Android Device  │────│   FYI Backend    │────│  MoonDream API  │
│  (Screenshot)   │    │ (Vision Service) │    │    (Computer    │
│                 │    │                  │    │     Vision)     │
└─────────────────┘    └──────────────────┘    └─────────────────┘
                                │
                                ▼
                       ┌─────────────────┐
                       │   Coordinate    │
                       │    Cache DB     │
                       │  (Performance)  │
                       └─────────────────┘
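
The cache lookup in the diagram can be sketched as follows. This is a minimal illustration under stated assumptions, not the tool's actual implementation: `CoordinateCache`, the key format, the TTL, and the `locate` helper are all hypothetical, an in-memory `Map` stands in for the cache DB, and `detect` is whatever caller-supplied function talks to the MoonDream API.

```javascript
// Hypothetical in-memory coordinate cache; the real tool uses a cache DB.
class CoordinateCache {
  constructor(ttlMs = 60000, now = () => Date.now()) {
    this.ttlMs = ttlMs;
    this.now = now; // injectable clock, useful for testing expiry
    this.entries = new Map();
  }

  // Key on a screen identifier plus the element label (assumed scheme).
  static key(screenId, label) {
    return `${screenId}::${label}`;
  }

  get(screenId, label) {
    const k = CoordinateCache.key(screenId, label);
    const entry = this.entries.get(k);
    if (!entry) return undefined;
    if (this.now() - entry.storedAt > this.ttlMs) {
      this.entries.delete(k); // entry is older than the TTL
      return undefined;
    }
    return entry.coordinates;
  }

  set(screenId, label, coordinates) {
    this.entries.set(CoordinateCache.key(screenId, label), {
      coordinates,
      storedAt: this.now(),
    });
  }
}

// Return cached coordinates when available; otherwise call `detect`
// and cache the result only if the element was found.
async function locate(cache, screenId, label, detect) {
  const cached = cache.get(screenId, label);
  if (cached) return { coordinates: cached, fromCache: true };
  const result = await detect(label);
  if (result && result.found) {
    cache.set(screenId, label, result.coordinates);
    return { coordinates: result.coordinates, fromCache: false };
  }
  return { coordinates: null, fromCache: false };
}
```

Expiring entries after a TTL is one simple way to limit stale coordinates when a layout changes; a real cache might instead invalidate on failed taps or on a screenshot hash.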

Key Features​

🚀 Performance

  • Tiny Model: Only ~1.8GB, small enough to run on CPU-only machines
  • Fast Inference: Sub-second responses
  • Hardware Acceleration: Supports MPS (macOS), CUDA (NVIDIA), and CPU
  • Load Balancing: Multiple model instances for concurrent requests

🎯 Accuracy​

  • UI Element Detection: High precision for mobile app elements
  • Context Awareness: Understands app layouts and hierarchies
  • Variation Handling: Adapts to different themes and orientations
  • Confidence Scoring: Provides reliability metrics

🔧 Integration

  • REST API: Simple HTTP endpoints
  • Multi-format Support: Handles various image formats
  • Error Handling: Robust failure recovery
  • Monitoring: Health checks and performance metrics

Use Cases​

Mobile App Testing​

// Detect login button
const result = await detectObject("login_button", screenshot);

// Result contains precise coordinates for automation, for example:
// {
//   found: true,
//   coordinates: { x: 540, y: 1200 },
//   confidence: 0.95
// }

Screen Validation​

// Verify expected elements are present
const validation = await validateScreen(screenshot, "welcome message");

if (validation.found) {
  console.log("Screen state is correct");
}

Natural Language Commands​

// Convert natural language to coordinates
const coords = await processCommand("tap the search icon", screenshot);

// Use coordinates for device automation
await device.tap(coords.x, coords.y);

Quick Start​

1. Start MoonDream Service​

cd moondream
python -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
pip install -r requirements.txt
python main.py

2. Configure FYI Tool​

# In server/.env
MOONDREAM_URL=http://localhost:20200
MOONDREAM_MAX_VARIATIONS=2

3. Test Integration​

# Health check
curl http://localhost:20200/

# Test detection
curl -X POST "http://localhost:20200/v1/point" \
-F "object=button" \
-F "init_image=@screenshot.png"
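
The same detection request can be issued from JavaScript. The sketch below mirrors the curl call's form fields (`object`, `init_image`); `pointRequest` is a hypothetical helper name, and it assumes a runtime with global `fetch`, `FormData`, and `Blob` (Node 18+ or a browser). The response schema is not described here, so the helper returns the parsed JSON unchanged.

```javascript
// Sketch: POST a screenshot to /v1/point, mirroring the curl example.
// `fetchImpl` is injectable so the helper can be exercised without a server.
async function pointRequest(baseUrl, objectLabel, imageBlob, fetchImpl = fetch) {
  const form = new FormData();
  form.append("object", objectLabel); // same as -F "object=button"
  form.append("init_image", imageBlob, "screenshot.png"); // same as -F "init_image=@screenshot.png"

  const response = await fetchImpl(new URL("/v1/point", baseUrl), {
    method: "POST",
    body: form,
  });
  if (!response.ok) {
    throw new Error(`MoonDream request failed with status ${response.status}`);
  }
  // Response shape is an unknown here; hand the JSON back untouched.
  return response.json();
}
```

In Node, the image could be loaded with `new Blob([await fs.promises.readFile("screenshot.png")])` and passed together with the `MOONDREAM_URL` value as `baseUrl`.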

API Endpoints​

| Endpoint | Method | Description |
| --- | --- | --- |
| / | GET | Health check with statistics |
| /v1/query | POST | Visual question answering |
| /v1/detect | POST | Object detection |
| /v1/point | POST | Point detection for UI elements |

Configuration Options​

Environment Variables​

# MoonDream Service (moondream/.env)
MOONDREAM_WORKERS=1 # Number of model instances

# FYI Backend (server/.env)
MOONDREAM_URL=http://localhost:20200 # API endpoint
MOONDREAM_MAX_VARIATIONS=2 # Detection variations
MOONDREAM_API_KEY= # Optional API key
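
On the backend side, these variables might be read and validated once at startup. The loader below is a hypothetical sketch: the function name, the returned object shape, and the choice to use the example values above as defaults are assumptions, not the tool's documented behavior.

```javascript
// Hypothetical loader for the FYI backend settings listed above.
function loadMoondreamConfig(env = process.env) {
  // Assumed default of 2, taken from the example value in server/.env.
  const maxVariations = Number.parseInt(env.MOONDREAM_MAX_VARIATIONS ?? "2", 10);
  if (!Number.isInteger(maxVariations) || maxVariations < 1) {
    throw new Error("MOONDREAM_MAX_VARIATIONS must be a positive integer");
  }
  return {
    url: env.MOONDREAM_URL ?? "http://localhost:20200",
    maxVariations,
    apiKey: env.MOONDREAM_API_KEY || null, // empty or unset means "no key"
  };
}
```

Failing fast on a malformed value tends to be easier to debug than a detection loop that silently runs zero variations.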

Hardware Optimization​

macOS (MPS)​

  • Automatically detected
  • Metal Performance Shaders acceleration
  • Best performance on Apple Silicon

NVIDIA GPU (CUDA)​

  • Automatic CUDA detection
  • Parallel processing on GPU cores
  • Maximum throughput for batch processing

CPU Fallback​

  • Works on any system
  • Slower but reliable
  • Good for development/testing

Best Practices​

Performance Optimization​

  • Load Balancing: Use multiple workers for concurrent requests
  • Caching: Enable coordinate caching for repeated elements
  • Image Size: Resize large screenshots before processing
  • Batch Processing: Group similar detection requests
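
For the image-size advice, the arithmetic matters more than the resizing library: coordinates detected on a downscaled screenshot must be scaled back up before tapping. The helpers below are an illustrative sketch with hypothetical names; they compute the target size and the inverse mapping but do not resize the image, and the `maxSide` limit is an arbitrary example value.

```javascript
// Compute a target size whose longest side is at most `maxSide`,
// preserving aspect ratio. Images already within the limit are not upscaled.
function fitWithin(width, height, maxSide) {
  const scale = Math.min(1, maxSide / Math.max(width, height));
  return {
    width: Math.round(width * scale),
    height: Math.round(height * scale),
    scale,
  };
}

// Map a point detected on the resized image back to device pixels.
function toDeviceCoordinates(point, scale) {
  return { x: Math.round(point.x / scale), y: Math.round(point.y / scale) };
}

// Example: a 1080x2400 screenshot limited to 1200 px on its longest side.
const target = fitWithin(1080, 2400, 1200); // scale is 0.5
const tapPoint = toDeviceCoordinates({ x: 270, y: 600 }, target.scale);
```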

Accuracy Improvement​

  • Clear Screenshots: Ensure good image quality
  • Descriptive Labels: Use specific element descriptions
  • Context Awareness: Include surrounding elements in queries
  • Confidence Thresholds: Adjust based on use case requirements
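
One way to apply confidence thresholds per use case is to require more certainty before acting on a result than before merely checking it. The sketch below assumes the `{ found, coordinates, confidence }` result shape shown under Use Cases; the action names and threshold values are hypothetical and should be tuned for the app under test.

```javascript
// Hypothetical per-action thresholds: tapping the wrong element is
// costlier than a false validation, so it demands higher confidence.
const THRESHOLDS = { tap: 0.9, validate: 0.7 };

function isUsable(result, action, thresholds = THRESHOLDS) {
  const threshold = thresholds[action];
  if (threshold === undefined) throw new Error(`Unknown action: ${action}`);
  return Boolean(result && result.found) && result.confidence >= threshold;
}
```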

Error Handling​

  • Fallback Strategies: Implement coordinate fallbacks
  • Retry Logic: Handle temporary service unavailability
  • Validation: Always validate detection results
  • Monitoring: Track detection success rates
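
The retry and fallback points can be combined in a small wrapper. This is a minimal sketch, not the tool's built-in behavior: `withRetry` and `coordinatesOrFallback` are hypothetical helpers, the attempt count and delays are example values, and the result shape is the one shown under Use Cases.

```javascript
// Retry an async operation with exponential backoff (200 ms, 400 ms, ...).
// `sleep` is injectable so delays can be skipped in tests.
async function withRetry(operation, { attempts = 3, baseDelayMs = 200, sleep } = {}) {
  const wait = sleep ?? ((ms) => new Promise((resolve) => setTimeout(resolve, ms)));
  let lastError;
  for (let attempt = 0; attempt < attempts; attempt++) {
    try {
      return await operation(attempt);
    } catch (error) {
      lastError = error;
      // Back off before the next attempt, but not after the final one.
      if (attempt < attempts - 1) await wait(baseDelayMs * 2 ** attempt);
    }
  }
  throw lastError;
}

// Use detected coordinates only when they are present and numeric;
// otherwise return the caller-supplied fallback.
function coordinatesOrFallback(result, fallback) {
  const c = result && result.found ? result.coordinates : null;
  const valid = Boolean(c) && Number.isFinite(c.x) && Number.isFinite(c.y);
  return valid ? c : fallback;
}
```

Counting how often the fallback path is taken gives a rough detection success rate for monitoring.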

Troubleshooting​

Common Issues​

Service Won't Start​

# Check Python version
python --version # Should be 3.13+

# Verify dependencies
pip list | grep torch

# Check GPU availability (CUDA on NVIDIA, MPS on macOS)
python -c "import torch; print(torch.cuda.is_available())"
python -c "import torch; print(torch.backends.mps.is_available())"

Low Accuracy​

  • Use more descriptive element names
  • Ensure screenshots are clear and well-lit
  • Try different variations of the same query
  • Adjust confidence thresholds

Slow Performance​

  • Reduce number of concurrent requests
  • Use smaller images when possible
  • Enable GPU acceleration
  • Add more worker instances

Memory Issues​

  • Reduce MOONDREAM_WORKERS
  • Use CPU mode if GPU memory is limited
  • Monitor memory usage with system tools
  • Restart service periodically

Next Steps​