Skip to main content

MoonDream Optimization

Performance tuning and optimization strategies for MoonDream Vision in the FYI Automation Tool.

Hardware Acceleration

MPS (Apple Silicon)

MoonDream automatically detects and optimizes for Apple Silicon:

# Verify MPS acceleration
python -c "import torch; print('MPS available:', torch.backends.mps.is_available())"

Performance Characteristics:

  • Memory Usage: ~1.8GB per model instance
  • Throughput: 15-25 requests/second
  • Warm-up Time: 5-10 seconds
  • Precision: Full FP32 precision

CUDA (NVIDIA GPUs)

For NVIDIA GPUs with CUDA support:

# Check CUDA availability
python -c "import torch; print('CUDA available:', torch.cuda.is_available())"
python -c "import torch; print('GPU:', torch.cuda.get_device_name(0))"

Performance Characteristics:

  • Memory Usage: ~2.2GB per model instance (with optimizations)
  • Throughput: 20-40 requests/second
  • Warm-up Time: 3-8 seconds
  • Precision: Mixed FP16/FP32 for optimal performance

CPU Fallback

When GPU acceleration isn't available:

# Check CPU configuration
python -c "import torch; print('CPU threads:', torch.get_num_threads())"

Performance Characteristics:

  • Memory Usage: ~1.5GB per model instance
  • Throughput: 2-5 requests/second
  • Warm-up Time: 10-20 seconds
  • Precision: Full FP32 precision

Model Optimization

Multi-Worker Configuration

Scale performance by running multiple model instances:

# moondream/.env
MOONDREAM_WORKERS=4

Worker Scaling Guidelines:

  • 1 Worker: Basic usage, development
  • 2-4 Workers: Moderate load, CI/CD pipelines
  • 8+ Workers: High-throughput production
  • Memory: ~2GB × number of workers

Memory Optimization

# In main.py, add memory optimizations
import os

# CUDA memory optimization
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:512"

# CPU memory optimization
os.environ["OMP_NUM_THREADS"] = "4" # Limit OpenMP threads

Model Precision Tuning

# Use mixed precision for CUDA
model = AutoModelForCausalLM.from_pretrained(
"vikhyatk/moondream2",
revision="2025-06-21",
trust_remote_code=True,
device_map=device_map,
torch_dtype=torch.float16, # Half precision for speed
low_cpu_mem_usage=True
)

Image Processing Optimization

Image Preprocessing

Optimize images before sending to MoonDream:

// Resize large images
const MAX_SIZE = 1024;
const optimizedImage = await resizeImage(screenshot, MAX_SIZE);

// Compress images
const compressedImage = await compressImage(optimizedImage, {
quality: 0.8,
format: 'jpeg'
});

Batch Processing

Process multiple detections efficiently:

// Instead of individual requests
const results = await Promise.all([
detectObject(screenshot, "button1"),
detectObject(screenshot, "button2"),
detectObject(screenshot, "input1")
]);

// Use batch API when available
const batchResult = await detectObjectsBatch(screenshot, [
"button1", "button2", "input1"
]);

Coordinate Caching Strategy

Cache Configuration

// Configure cache behavior
const CACHE_CONFIG = {
ttl: 24 * 60 * 60 * 1000, // 24 hours
maxEntries: 10000, // Per flow-device combination
minConfidence: 0.8, // Minimum confidence to cache
compression: true // Compress cached data
};

Intelligent Cache Invalidation

// Invalidate cache when UI changes
async function invalidateOnUIChange(deviceId: string, flowId: string) {
const currentScreenshot = await takeScreenshot(deviceId);
const previousScreenshot = await getLastScreenshot(deviceId, flowId);

if (previousScreenshot) {
const similarity = await compareScreenshots(currentScreenshot, previousScreenshot);
if (similarity < 0.9) { // Less than 90% similar
await invalidateFlowCache(flowId, deviceId);
}
}
}

Cache Warming

Pre-populate cache for common elements:

const COMMON_ELEMENTS = [
"login button", "search button", "menu button",
"username field", "password field", "submit button"
];

async function warmCache(deviceId: string, flowId: string) {
const screenshot = await takeScreenshot(deviceId);

for (const element of COMMON_ELEMENTS) {
try {
const result = await detectObject(screenshot, element);
if (result.found && result.confidence > 0.8) {
await storeCoordinateCache(flowId, deviceId, element, result);
}
} catch (error) {
console.warn(`Failed to cache ${element}:`, error);
}
}
}

Network Optimization

Connection Pooling

// Configure HTTP connection pooling
const agent = new https.Agent({
keepAlive: true,
maxSockets: 10,
maxFreeSockets: 5,
timeout: 30000
});

// Use in requests
const response = await fetch(MOONDREAM_URL, {
agent,
method: 'POST',
body: formData
});

Request Batching

// Group similar requests
class RequestBatcher {
private queue: Map<string, Promise<any>> = new Map();
private batchSize = 5;
private delay = 100; // ms between batches

async batchRequest<T>(
key: string,
requestFn: () => Promise<T>
): Promise<T> {
if (this.queue.has(key)) {
return this.queue.get(key)!;
}

const promise = this.processBatch(key, requestFn);
this.queue.set(key, promise);

promise.finally(() => {
this.queue.delete(key);
});

return promise;
}

private async processBatch<T>(
key: string,
requestFn: () => Promise<T>
): Promise<T> {
// Rate limiting logic
await this.rateLimit();
return requestFn();
}
}

Detection Strategy Optimization

Confidence Threshold Tuning

// Dynamic confidence thresholds based on element type
const CONFIDENCE_THRESHOLDS = {
button: 0.85,
input: 0.80,
icon: 0.90,
text: 0.75,
image: 0.70
};

function getConfidenceThreshold(elementType: string): number {
return CONFIDENCE_THRESHOLDS[elementType] || 0.80;
}

Multi-Attempt Detection

async function detectWithFallbacks(
screenshot: string,
element: string,
maxAttempts = 3
): Promise<DetectionResult> {
const variations = generateVariations(element);

for (let attempt = 0; attempt < Math.min(maxAttempts, variations.length); attempt++) {
try {
const result = await detectObject(screenshot, variations[attempt]);

if (result.found && result.confidence > 0.7) {
return result;
}
} catch (error) {
console.warn(`Detection attempt ${attempt + 1} failed:`, error);
}
}

// Return best result or failure
return { found: false, confidence: 0 };
}

Element Description Optimization

function generateVariations(element: string): string[] {
const base = element.toLowerCase().trim();

return [
base,
`${base} button`,
`${base} icon`,
`the ${base}`,
`${base} field`,
`${base} input`,
base.replace(/button|icon|field|input/g, '').trim(),
// Add context-specific variations
...generateContextVariations(base)
].filter((v, i, arr) => arr.indexOf(v) === i); // Remove duplicates
}

Database Optimization

Coordinate Cache Indexing

-- Optimize coordinate cache queries
CREATE INDEX idx_coordinate_cache_lookup
ON flow_coordinate_cache(flow_id, device_id, target_element);

CREATE INDEX idx_coordinate_cache_usage
ON flow_coordinate_cache(last_used_at, hit_count DESC);

CREATE INDEX idx_coordinate_cache_valid
ON flow_coordinate_cache(is_valid, updated_at DESC);

Query Optimization

// Use indexed queries
const cachedCoords = await prisma.flowCoordinateCache.findUnique({
where: {
flowId_deviceId_targetElement: {
flowId,
deviceId,
targetElement
}
}
});

// Batch coordinate updates
await prisma.flowCoordinateCache.updateMany({
where: {
flowId,
deviceId,
isValid: true
},
data: {
lastUsedAt: new Date()
}
});

Monitoring and Profiling

Performance Metrics

class VisionPerformanceMonitor {
private metrics = {
requestCount: 0,
totalResponseTime: 0,
cacheHits: 0,
cacheMisses: 0,
errors: 0
};

recordRequest(responseTime: number, cached: boolean) {
this.metrics.requestCount++;
this.metrics.totalResponseTime += responseTime;

if (cached) {
this.metrics.cacheHits++;
} else {
this.metrics.cacheMisses++;
}
}

getStats() {
return {
averageResponseTime: this.metrics.totalResponseTime / this.metrics.requestCount,
cacheHitRate: this.metrics.cacheHits / this.metrics.requestCount,
errorRate: this.metrics.errors / this.metrics.requestCount,
totalRequests: this.metrics.requestCount
};
}
}

Profiling Tools

# Profile Python performance
python -m cProfile main.py

# Monitor GPU usage
nvidia-smi --query-gpu=utilization.gpu,utilization.memory --format=csv

# Check memory usage
ps aux | grep python | head -5

# Network monitoring
netstat -tlnp | grep :20200

Benchmarking

async function benchmarkDetection(
screenshot: string,
elements: string[],
iterations = 10
) {
const results = [];

for (const element of elements) {
const times = [];

for (let i = 0; i < iterations; i++) {
const start = Date.now();
await detectObject(screenshot, element);
times.push(Date.now() - start);
}

results.push({
element,
averageTime: times.reduce((a, b) => a + b) / times.length,
minTime: Math.min(...times),
maxTime: Math.max(...times)
});
}

return results;
}

Load Testing

Concurrent Request Testing

# Test with multiple concurrent requests
for i in {1..10}; do
curl -X POST "http://localhost:20200/v1/point" \
-F "object=button" \
-F "init_image=@test.png" &
done
wait

Stress Testing Script

#!/bin/bash
# stress_test.sh

CONCURRENT_REQUESTS=20
TOTAL_REQUESTS=100

for i in $(seq 1 $TOTAL_REQUESTS); do
curl -X POST "http://localhost:20200/v1/point" \
-F "object=test_button_$i" \
-F "init_image=@screenshot.png" \
-w "%{time_total}\n" \
-o /dev/null \
-s &
done | awk '{sum+=$1} END {print "Average response time:", sum/NR, "seconds"}'

Production Deployment Optimization

Kubernetes Optimization

apiVersion: apps/v1
kind: Deployment
metadata:
name: moondream
spec:
replicas: 3
selector:
matchLabels:
app: moondream
template:
metadata:
labels:
app: moondream
spec:
containers:
- name: moondream
image: moondream:latest
resources:
requests:
memory: "2Gi"
cpu: "500m"
limits:
memory: "4Gi"
cpu: "2000m"
env:
- name: MOONDREAM_WORKERS
value: "2"
ports:
- containerPort: 20200
livenessProbe:
httpGet:
path: /
port: 20200
initialDelaySeconds: 30
periodSeconds: 10
readinessProbe:
httpGet:
path: /
port: 20200
initialDelaySeconds: 5
periodSeconds: 5

Troubleshooting Performance Issues

High Latency

Symptoms: Response times >2 seconds

Solutions:

  1. Check MoonDream service logs for bottlenecks
  2. Verify GPU/CPU utilization
  3. Check network latency between services
  4. Implement request queuing for peak loads

Memory Issues

Symptoms: Out of memory errors, service crashes

Solutions:

  1. Reduce MOONDREAM_WORKERS
  2. Enable memory optimization flags
  3. Monitor memory usage patterns
  4. Implement memory limits

Low Throughput

Symptoms: Can't handle request volume

Solutions:

  1. Scale up workers horizontally
  2. Optimize model loading
  3. Implement request batching
  4. Use faster hardware

Cache Inefficiency

Symptoms: Low cache hit rates, repeated detections

Solutions:

  1. Analyze cache usage patterns
  2. Adjust cache invalidation strategy
  3. Implement smarter cache warming
  4. Optimize cache storage format

Performance Benchmarks

Expected Performance by Hardware

HardwareThroughput (req/s)Latency (ms)Memory (GB)
Apple M1/M215-25200-4001.8
NVIDIA RTX 306020-35150-3002.2
NVIDIA RTX 408030-50100-2502.5
CPU (8 cores)2-5800-20001.5
CPU (16 cores)3-8600-15001.5

Optimization Impact

OptimizationPerformance GainMemory ImpactComplexity
Multi-worker2-4x throughput+2GB per workerLow
Mixed precision1.5-2x speed-0.5GBMedium
Image compression1.2-1.5x speedMinimalLow
Coordinate caching5-10x for repeats+100MBLow
Request batching1.3-2x throughputMinimalHigh

Continuous Optimization

Automated Monitoring

// Set up automated performance monitoring
setInterval(async () => {
const metrics = await getVisionMetrics();

if (metrics.averageResponseTime > 1000) {
console.warn('High latency detected, investigating...');
// Log detailed diagnostics
// Consider scaling up workers
}

if (metrics.errorRate > 0.05) {
console.error('High error rate detected');
// Alert administrators
// Check MoonDream service health
}
}, 60000); // Check every minute

Adaptive Scaling

// Implement adaptive worker scaling
async function adjustWorkersBasedOnLoad() {
const metrics = await getSystemMetrics();
const currentWorkers = getCurrentWorkerCount();

if (metrics.cpuUsage > 80 && currentWorkers < MAX_WORKERS) {
await scaleWorkers(currentWorkers + 1);
} else if (metrics.cpuUsage < 30 && currentWorkers > MIN_WORKERS) {
await scaleWorkers(currentWorkers - 1);
}
}

Performance Regression Testing

// Automated regression tests
async function runPerformanceRegression() {
const baseline = await loadBaselineMetrics();
const current = await measureCurrentPerformance();

const regressions = findRegressions(baseline, current);

if (regressions.length > 0) {
console.error('Performance regressions detected:', regressions);
// Alert team
// Create issue for investigation
}
}

This comprehensive optimization guide ensures MoonDream Vision delivers maximum performance for your automation needs while maintaining reliability and accuracy. Regular monitoring and tuning based on your specific use case will help maintain optimal performance over time. 🚀