MoonDream Vision Overview
MoonDream is a tiny, fast vision-language model that gives the FYI Automation Tool its computer vision capabilities: automated UI element detection, screen analysis, and natural language queries about visual content.
What is MoonDream?
MoonDream is a compact vision-language model that can:
- Answer questions about images
- Detect objects and UI elements
- Point to specific locations in images
- Run efficiently on various hardware (CPU, GPU, MPS)
Integration with FYI Automation Tool
MoonDream powers the computer vision capabilities of the FYI Automation Tool in two areas:
🤖 AI Automation Features
- UI Element Detection: Automatically finds buttons, inputs, and interactive elements
- Screen Analysis: Understands what's visible on device screens
- Natural Language Commands: Interprets commands like "tap the login button"
- Coordinate Caching: Speeds up repeated operations by remembering element locations
📱 Device Automation
- Screenshot Analysis: Processes device screenshots in real time
- Gesture Targeting: Provides precise coordinates for touch interactions
- Validation: Verifies screen state and element presence
- Error Recovery: Detects when UI elements are missing or changed
Architecture
┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│ Android Device  │────│   FYI Backend    │────│  MoonDream API  │
│  (Screenshot)   │    │ (Vision Service) │    │    (Computer    │
│                 │    │                  │    │     Vision)     │
└─────────────────┘    └──────────────────┘    └─────────────────┘
                                │
                                ▼
                       ┌─────────────────┐
                       │   Coordinate    │
                       │    Cache DB     │
                       │  (Performance)  │
                       └─────────────────┘
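The flow in the diagram can be read as a cache-first lookup: the backend checks the coordinate cache before calling MoonDream and stores any new result afterwards. The following is a minimal Python sketch of that idea, not the tool's actual implementation. It assumes an in-memory dict stands in for the cache DB and that the cache key combines a hash of the screenshot bytes with the element label; `CoordinateCache`, `locate`, and the `detect` callable are hypothetical names.

```python
import hashlib
from typing import Callable, Dict, Optional, Tuple

Point = Tuple[int, int]


class CoordinateCache:
    """In-memory stand-in for the coordinate cache DB (an assumption)."""

    def __init__(self) -> None:
        self._store: Dict[Tuple[str, str], Point] = {}

    @staticmethod
    def _key(screenshot: bytes, label: str) -> Tuple[str, str]:
        # Hash the image bytes so byte-identical screens share one entry.
        return hashlib.sha256(screenshot).hexdigest(), label

    def get(self, screenshot: bytes, label: str) -> Optional[Point]:
        return self._store.get(self._key(screenshot, label))

    def put(self, screenshot: bytes, label: str, point: Point) -> None:
        self._store[self._key(screenshot, label)] = point


def locate(
    cache: CoordinateCache,
    detect: Callable[[bytes, str], Optional[Point]],
    screenshot: bytes,
    label: str,
) -> Optional[Point]:
    """Return cached coordinates if present, otherwise ask the vision service."""
    cached = cache.get(screenshot, label)
    if cached is not None:
        return cached
    point = detect(screenshot, label)  # hypothetical call into MoonDream
    if point is not None:
        cache.put(screenshot, label, point)
    return point
```

Keying on the exact image bytes means any pixel change causes a cache miss; a real cache would more likely key on a screen or app identifier.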
Key Features
🚀 Performance
- Tiny Model: Only ~1.8 GB, runs anywhere
- Fast Inference: Sub-second responses
- Hardware Acceleration: Supports MPS (macOS), CUDA (NVIDIA), and CPU
- Load Balancing: Multiple model instances for concurrent requests
🎯 Accuracy
- UI Element Detection: High precision for mobile app elements
- Context Awareness: Understands app layouts and hierarchies
- Variation Handling: Adapts to different themes and orientations
- Confidence Scoring: Provides reliability metrics
🔧 Integration
- REST API: Simple HTTP endpoints
- Multi-format Support: Handles various image formats
- Error Handling: Robust failure recovery
- Monitoring: Health checks and performance metrics
Use Cases
Mobile App Testing
// Detect the login button in a screenshot
const result = await detectObject("login_button", screenshot);
// The result contains precise coordinates for automation, for example:
// {
//   found: true,
//   coordinates: { x: 540, y: 1200 },
//   confidence: 0.95
// }
Screen Validation
// Verify expected elements are present
const validation = await validateScreen(screenshot, "welcome message");
if (validation.found) {
console.log("Screen state is correct");
}
Natural Language Commands
// Convert natural language to coordinates
const coords = await processCommand("tap the search icon", screenshot);
// Use coordinates for device automation
await device.tap(coords.x, coords.y);
Quick Start
1. Start MoonDream Service
cd moondream
python -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
pip install -r requirements.txt
python main.py
2. Configure FYI Tool
# In server/.env
MOONDREAM_URL=http://localhost:20200
MOONDREAM_MAX_VARIATIONS=2
3. Test Integration
# Health check
curl http://localhost:20200/
# Test detection
curl -X POST "http://localhost:20200/v1/point" \
-F "object=button" \
-F "init_image=@screenshot.png"
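The same detection request can be sent from Python using only the standard library. This is a sketch under stated assumptions: the form field names `object` and `init_image` come from the curl command above, while the helper names `build_multipart` and `point` are hypothetical, and the assumption that the service replies with JSON is not confirmed here.

```python
import json
import mimetypes
import urllib.request
import uuid
from typing import Any, Dict, Tuple


def build_multipart(
    fields: Dict[str, str], files: Dict[str, Tuple[str, bytes]]
) -> Tuple[bytes, str]:
    """Encode form fields and files as multipart/form-data, as `curl -F` does."""
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            f'--{boundary}\r\nContent-Disposition: form-data; name="{name}"'
            f"\r\n\r\n{value}\r\n".encode()
        )
    for name, (filename, data) in files.items():
        mime = mimetypes.guess_type(filename)[0] or "application/octet-stream"
        header = (
            f'--{boundary}\r\nContent-Disposition: form-data; name="{name}"; '
            f'filename="{filename}"\r\nContent-Type: {mime}\r\n\r\n'
        ).encode()
        parts.append(header + data + b"\r\n")
    parts.append(f"--{boundary}--\r\n".encode())
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def point(base_url: str, obj: str, image_path: str, timeout: float = 30.0) -> Any:
    """POST a screenshot to /v1/point; a JSON response body is an assumption."""
    with open(image_path, "rb") as fh:
        image = fh.read()
    body, content_type = build_multipart(
        {"object": obj}, {"init_image": ("screenshot.png", image)}
    )
    request = urllib.request.Request(
        f"{base_url.rstrip('/')}/v1/point",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the service running locally, `point("http://localhost:20200", "button", "screenshot.png")` would mirror the curl test.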
API Endpoints
| Endpoint | Method | Description |
|---|---|---|
| / | GET | Health check with statistics |
| /v1/query | POST | Visual question answering |
| /v1/detect | POST | Object detection |
| /v1/point | POST | Point detection for UI elements |
Configuration Options
Environment Variables
# MoonDream Service (moondream/.env)
MOONDREAM_WORKERS=1 # Number of model instances
# FYI Backend (server/.env)
MOONDREAM_URL=http://localhost:20200 # API endpoint
MOONDREAM_MAX_VARIATIONS=2 # Detection variations
MOONDREAM_API_KEY= # Optional API key
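A client that talks to the service needs these values at start-up. The sketch below shows one way to read them in Python, falling back to the values listed above and treating an empty `MOONDREAM_API_KEY` as "no key". `MoonDreamConfig` and `load_config` are hypothetical names, and the FYI backend's own loader may work differently.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class MoonDreamConfig:
    url: str
    max_variations: int
    api_key: Optional[str]


def load_config(env: Mapping[str, str] = os.environ) -> MoonDreamConfig:
    """Read the backend settings, using the documented values as defaults."""
    max_variations = int(env.get("MOONDREAM_MAX_VARIATIONS", "2"))
    if max_variations < 1:
        raise ValueError("MOONDREAM_MAX_VARIATIONS must be at least 1")
    return MoonDreamConfig(
        # Strip a trailing slash so endpoint paths can be appended safely.
        url=env.get("MOONDREAM_URL", "http://localhost:20200").rstrip("/"),
        max_variations=max_variations,
        # An empty MOONDREAM_API_KEY is treated as "no key configured".
        api_key=env.get("MOONDREAM_API_KEY") or None,
    )
```

Passing a plain dict instead of `os.environ` keeps the function easy to test.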
Hardware Optimization
macOS (MPS)
- Automatically detected
- Metal Performance Shaders acceleration
- Best performance on Apple Silicon
NVIDIA GPU (CUDA)
- Automatic CUDA detection
- Parallel processing on GPU cores
- Maximum throughput for batch processing
CPU Fallback
- Works on any system
- Slower but reliable
- Good for development/testing
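Automatic backend detection of this kind is usually a short probe of PyTorch at start-up. The sketch below illustrates it; the preference order (CUDA, then MPS, then CPU) and the function names are assumptions, not the service's confirmed logic.

```python
def choose_device(cuda_available: bool, mps_available: bool) -> str:
    """Prefer CUDA, then MPS, then CPU (the order is an assumption)."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


def detect_device() -> str:
    """Probe PyTorch for accelerators; fall back to CPU if torch is missing."""
    try:
        import torch
    except ImportError:
        return "cpu"
    mps = getattr(torch.backends, "mps", None)  # absent on older PyTorch builds
    return choose_device(
        torch.cuda.is_available(),
        mps is not None and mps.is_available(),
    )
```

Separating the decision (`choose_device`) from the probing (`detect_device`) lets the selection rule be checked on machines without a GPU.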
Best Practices
Performance Optimization
- Load Balancing: Use multiple workers for concurrent requests
- Caching: Enable coordinate caching for repeated elements
- Image Size: Resize large screenshots before processing
- Batch Processing: Group similar detection requests
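Resizing a screenshot before detection means any pixel coordinates found on the smaller image must be scaled back before tapping the device. The arithmetic can be sketched as follows. The 1024-pixel limit and the function names are illustrative assumptions, and the mapping step only applies if the service returns pixel coordinates rather than normalized ones.

```python
from typing import Tuple


def fit_within(width: int, height: int, max_side: int = 1024) -> Tuple[int, int, float]:
    """Return (new_width, new_height, scale) so the longest side is <= max_side."""
    longest = max(width, height)
    if longest <= max_side:
        return width, height, 1.0  # already small enough, leave untouched
    scale = max_side / longest
    return max(1, round(width * scale)), max(1, round(height * scale)), scale


def to_original(x: float, y: float, scale: float) -> Tuple[int, int]:
    """Map a point found on the resized image back to original pixel coordinates."""
    return round(x / scale), round(y / scale)
```

For a 1080x2400 screenshot and `max_side=1200`, the scale is 0.5, so a point found at (270, 600) on the resized image corresponds to (540, 1200) on the device.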
Accuracy Improvement
- Clear Screenshots: Ensure good image quality
- Descriptive Labels: Use specific element descriptions
- Context Awareness: Include surrounding elements in queries
- Confidence Thresholds: Adjust based on use case requirements
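Adjusting confidence thresholds per use case can be as simple as a lookup table: a tap on the wrong element is costly, while a presence check can tolerate more uncertainty. The sketch below assumes a result shaped like the `found` / `confidence` example under Mobile App Testing. The threshold values and the `is_reliable` helper are illustrative only.

```python
from typing import Any, Mapping, Optional

# Illustrative thresholds only; tune them against your own screens.
THRESHOLDS = {"tap": 0.90, "validate": 0.70}


def is_reliable(result: Optional[Mapping[str, Any]], use_case: str) -> bool:
    """Accept a detection only if it was found and clears the use-case threshold."""
    if not result or not result.get("found"):
        return False
    return float(result.get("confidence", 0.0)) >= THRESHOLDS[use_case]
```

A result with confidence 0.75 would pass the `validate` check here but be rejected for `tap`.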
Error Handling
- Fallback Strategies: Implement coordinate fallbacks
- Retry Logic: Handle temporary service unavailability
- Validation: Always validate detection results
- Monitoring: Track detection success rates
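Retry logic for temporary unavailability is commonly written as exponential backoff around the request. The following is a generic sketch, not the tool's own retry code; `with_retry`, the default attempt count, and the delays are assumptions.

```python
import time
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")


def with_retry(
    call: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
    retry_on: Tuple[type, ...] = (ConnectionError, TimeoutError),
) -> T:
    """Run `call`, retrying transient failures with exponential backoff."""
    for attempt in range(attempts):
        try:
            return call()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)  # 0.5 s, 1 s, 2 s, ...
    raise ValueError("attempts must be at least 1")
```

When the request is made with `urllib`, pass `retry_on=(urllib.error.URLError,)`, since that exception is not a `ConnectionError`. If all attempts fail, the caller can catch the error and fall back to known coordinates.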
Troubleshooting
Common Issues
Service Won't Start
# Check Python version
python --version # Should be 3.13+
# Verify dependencies
pip list | grep torch
# Check GPU availability (CUDA on NVIDIA, MPS on macOS)
python -c "import torch; print('CUDA:', torch.cuda.is_available(), 'MPS:', torch.backends.mps.is_available())"
Low Accuracy
- Use more descriptive element names
- Ensure screenshots are clear and well-lit
- Try different variations of the same query
- Adjust confidence thresholds
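Trying several phrasings of the same query can be automated: send each phrasing in turn and keep the most confident hit. The sketch below shows the idea with a hypothetical `detect` callable that returns a `found` / `confidence` mapping. How the tool itself applies `MOONDREAM_MAX_VARIATIONS` is not specified here, so the `max_variations` parameter is only an analogy.

```python
from typing import Any, Callable, Iterable, Mapping, Optional

Result = Mapping[str, Any]


def best_of_variations(
    detect: Callable[[str], Optional[Result]],
    phrasings: Iterable[str],
    max_variations: int = 2,
) -> Optional[Result]:
    """Try up to `max_variations` phrasings and keep the most confident hit."""
    best: Optional[Result] = None
    for index, phrase in enumerate(phrasings):
        if index >= max_variations:
            break  # cap the number of model calls per lookup
        result = detect(phrase)
        if not result or not result.get("found"):
            continue
        if best is None or result.get("confidence", 0.0) > best.get("confidence", 0.0):
            best = result
    return best
```

Each extra phrasing costs one more model call, so the cap trades accuracy against latency.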
Slow Performance
- Reduce the number of concurrent requests
- Use smaller images when possible
- Enable GPU acceleration
- Add more worker instances
Memory Issues
- Reduce MOONDREAM_WORKERS
- Use CPU mode if GPU memory is limited
- Monitor memory usage with system tools
- Restart the service periodically
Next Steps
- Setup Guide: Complete installation and configuration
- API Reference: Detailed endpoint documentation
- Integration: How MoonDream works with FYI Automation Tool
- Optimization: Performance tuning and best practices
- Troubleshooting: Common issues and solutions