MoonDream Optimization
Performance tuning and optimization strategies for MoonDream Vision in the FYI Automation Tool.
Hardware Acceleration
MPS (Apple Silicon)
MoonDream automatically detects and optimizes for Apple Silicon:
# Verify MPS acceleration
python -c "import torch; print('MPS available:', torch.backends.mps.is_available())"
Performance Characteristics:
- Memory Usage: ~1.8GB per model instance
- Throughput: 15-25 requests/second
- Warm-up Time: 5-10 seconds
- Precision: Full FP32 precision
CUDA (NVIDIA GPUs)
For NVIDIA GPUs with CUDA support:
# Check CUDA availability
python -c "import torch; print('CUDA available:', torch.cuda.is_available())"
python -c "import torch; print('GPU:', torch.cuda.get_device_name(0))"
Performance Characteristics:
- Memory Usage: ~2.2GB per model instance (with optimizations)
- Throughput: 20-40 requests/second
- Warm-up Time: 3-8 seconds
- Precision: Mixed FP16/FP32 for optimal performance
CPU Fallback
When GPU acceleration isn't available:
# Check CPU configuration
python -c "import torch; print('CPU threads:', torch.get_num_threads())"
Performance Characteristics:
- Memory Usage: ~1.5GB per model instance
- Throughput: 2-5 requests/second
- Warm-up Time: 10-20 seconds
- Precision: Full FP32 precision
Model Optimization
Multi-Worker Configuration
Scale performance by running multiple model instances:
# moondream/.env
MOONDREAM_WORKERS=4
Worker Scaling Guidelines:
- 1 Worker: Basic usage, development
- 2-4 Workers: Moderate load, CI/CD pipelines
- 8+ Workers: High-throughput production
- Memory: ~2GB × number of workers
Memory Optimization
# In main.py, add memory optimizations
import os
# CUDA memory optimization
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:512"
# CPU memory optimization
os.environ["OMP_NUM_THREADS"] = "4" # Limit OpenMP threads
Model Precision Tuning
# Use mixed precision for CUDA
model = AutoModelForCausalLM.from_pretrained(
"vikhyatk/moondream2",
revision="2025-06-21",
trust_remote_code=True,
device_map=device_map,
torch_dtype=torch.float16, # Half precision for speed
low_cpu_mem_usage=True
)
Image Processing Optimization
Image Preprocessing
Optimize images before sending to MoonDream:
// Resize large images
const MAX_SIZE = 1024;
const optimizedImage = await resizeImage(screenshot, MAX_SIZE);
// Compress images
const compressedImage = await compressImage(optimizedImage, {
quality: 0.8,
format: 'jpeg'
});
Batch Processing
Process multiple detections efficiently:
// Instead of individual requests
const results = await Promise.all([
detectObject(screenshot, "button1"),
detectObject(screenshot, "button2"),
detectObject(screenshot, "input1")
]);
// Use batch API when available
const batchResult = await detectObjectsBatch(screenshot, [
"button1", "button2", "input1"
]);
Coordinate Caching Strategy
Cache Configuration
// Configure cache behavior
const CACHE_CONFIG = {
ttl: 24 * 60 * 60 * 1000, // 24 hours
maxEntries: 10000, // Per flow-device combination
minConfidence: 0.8, // Minimum confidence to cache
compression: true // Compress cached data
};
Intelligent Cache Invalidation
// Invalidate cache when UI changes
async function invalidateOnUIChange(deviceId: string, flowId: string) {
const currentScreenshot = await takeScreenshot(deviceId);
const previousScreenshot = await getLastScreenshot(deviceId, flowId);
if (previousScreenshot) {
const similarity = await compareScreenshots(currentScreenshot, previousScreenshot);
if (similarity < 0.9) { // Less than 90% similar
await invalidateFlowCache(flowId, deviceId);
}
}
}
Cache Warming
Pre-populate cache for common elements:
const COMMON_ELEMENTS = [
"login button", "search button", "menu button",
"username field", "password field", "submit button"
];
async function warmCache(deviceId: string, flowId: string) {
const screenshot = await takeScreenshot(deviceId);
for (const element of COMMON_ELEMENTS) {
try {
const result = await detectObject(screenshot, element);
if (result.found && result.confidence > 0.8) {
await storeCoordinateCache(flowId, deviceId, element, result);
}
} catch (error) {
console.warn(`Failed to cache ${element}:`, error);
}
}
}
Network Optimization
Connection Pooling
// Configure HTTP connection pooling
const agent = new https.Agent({
keepAlive: true,
maxSockets: 10,
maxFreeSockets: 5,
timeout: 30000
});
// Use in requests
const response = await fetch(MOONDREAM_URL, {
agent,
method: 'POST',
body: formData
});
Request Batching
// Group similar requests
class RequestBatcher {
private queue: Map<string, Promise<any>> = new Map();
private batchSize = 5;
private delay = 100; // ms between batches
async batchRequest<T>(
key: string,
requestFn: () => Promise<T>
): Promise<T> {
if (this.queue.has(key)) {
return this.queue.get(key)!;
}
const promise = this.processBatch(key, requestFn);
this.queue.set(key, promise);
promise.finally(() => {
this.queue.delete(key);
});
return promise;
}
private async processBatch<T>(
key: string,
requestFn: () => Promise<T>
): Promise<T> {
// Rate limiting logic
await this.rateLimit();
return requestFn();
}
}
Detection Strategy Optimization
Confidence Threshold Tuning
// Dynamic confidence thresholds based on element type
const CONFIDENCE_THRESHOLDS = {
button: 0.85,
input: 0.80,
icon: 0.90,
text: 0.75,
image: 0.70
};
function getConfidenceThreshold(elementType: string): number {
return CONFIDENCE_THRESHOLDS[elementType] || 0.80;
}
Multi-Attempt Detection
async function detectWithFallbacks(
screenshot: string,
element: string,
maxAttempts = 3
): Promise<DetectionResult> {
const variations = generateVariations(element);
for (let attempt = 0; attempt < Math.min(maxAttempts, variations.length); attempt++) {
try {
const result = await detectObject(screenshot, variations[attempt]);
if (result.found && result.confidence > 0.7) {
return result;
}
} catch (error) {
console.warn(`Detection attempt ${attempt + 1} failed:`, error);
}
}
// Return best result or failure
return { found: false, confidence: 0 };
}
Element Description Optimization
function generateVariations(element: string): string[] {
const base = element.toLowerCase().trim();
return [
base,
`${base} button`,
`${base} icon`,
`the ${base}`,
`${base} field`,
`${base} input`,
base.replace(/button|icon|field|input/g, '').trim(),
// Add context-specific variations
...generateContextVariations(base)
].filter((v, i, arr) => arr.indexOf(v) === i); // Remove duplicates
}
Database Optimization
Coordinate Cache Indexing
-- Optimize coordinate cache queries
CREATE INDEX idx_coordinate_cache_lookup
ON flow_coordinate_cache(flow_id, device_id, target_element);
CREATE INDEX idx_coordinate_cache_usage
ON flow_coordinate_cache(last_used_at, hit_count DESC);
CREATE INDEX idx_coordinate_cache_valid
ON flow_coordinate_cache(is_valid, updated_at DESC);
Query Optimization
// Use indexed queries
const cachedCoords = await prisma.flowCoordinateCache.findUnique({
where: {
flowId_deviceId_targetElement: {
flowId,
deviceId,
targetElement
}
}
});
// Batch coordinate updates
await prisma.flowCoordinateCache.updateMany({
where: {
flowId,
deviceId,
isValid: true
},
data: {
lastUsedAt: new Date()
}
});
Monitoring and Profiling
Performance Metrics
class VisionPerformanceMonitor {
private metrics = {
requestCount: 0,
totalResponseTime: 0,
cacheHits: 0,
cacheMisses: 0,
errors: 0
};
recordRequest(responseTime: number, cached: boolean) {
this.metrics.requestCount++;
this.metrics.totalResponseTime += responseTime;
if (cached) {
this.metrics.cacheHits++;
} else {
this.metrics.cacheMisses++;
}
}
getStats() {
return {
averageResponseTime: this.metrics.totalResponseTime / this.metrics.requestCount,
cacheHitRate: this.metrics.cacheHits / this.metrics.requestCount,
errorRate: this.metrics.errors / this.metrics.requestCount,
totalRequests: this.metrics.requestCount
};
}
}
Profiling Tools
# Profile Python performance
python -m cProfile main.py
# Monitor GPU usage
nvidia-smi --query-gpu=utilization.gpu,utilization.memory --format=csv
# Check memory usage
ps aux | grep python | head -5
# Network monitoring
netstat -tlnp | grep :20200
Benchmarking
async function benchmarkDetection(
screenshot: string,
elements: string[],
iterations = 10
) {
const results = [];
for (const element of elements) {
const times = [];
for (let i = 0; i < iterations; i++) {
const start = Date.now();
await detectObject(screenshot, element);
times.push(Date.now() - start);
}
results.push({
element,
averageTime: times.reduce((a, b) => a + b) / times.length,
minTime: Math.min(...times),
maxTime: Math.max(...times)
});
}
return results;
}
Load Testing
Concurrent Request Testing
# Test with multiple concurrent requests
for i in {1..10}; do
curl -X POST "http://localhost:20200/v1/point" \
-F "object=button" \
-F "init_image=@test.png" &
done
wait
Stress Testing Script
#!/bin/bash
# stress_test.sh
CONCURRENT_REQUESTS=20
TOTAL_REQUESTS=100
for i in $(seq 1 $TOTAL_REQUESTS); do
curl -X POST "http://localhost:20200/v1/point" \
-F "object=test_button_$i" \
-F "init_image=@screenshot.png" \
-w "%{time_total}\n" \
-o /dev/null \
-s &
done | awk '{sum+=$1} END {print "Average response time:", sum/NR, "seconds"}'
Production Deployment Optimization
Kubernetes Optimization
apiVersion: apps/v1
kind: Deployment
metadata:
name: moondream
spec:
replicas: 3
selector:
matchLabels:
app: moondream
template:
metadata:
labels:
app: moondream
spec:
containers:
- name: moondream
image: moondream:latest
resources:
requests:
memory: "2Gi"
cpu: "500m"
limits:
memory: "4Gi"
cpu: "2000m"
env:
- name: MOONDREAM_WORKERS
value: "2"
ports:
- containerPort: 20200
livenessProbe:
httpGet:
path: /
port: 20200
initialDelaySeconds: 30
periodSeconds: 10
readinessProbe:
httpGet:
path: /
port: 20200
initialDelaySeconds: 5
periodSeconds: 5
Troubleshooting Performance Issues
High Latency
Symptoms: Response times >2 seconds
Solutions:
- Check MoonDream service logs for bottlenecks
- Verify GPU/CPU utilization
- Check network latency between services
- Implement request queuing for peak loads
Memory Issues
Symptoms: Out of memory errors, service crashes
Solutions:
- Reduce MOONDREAM_WORKERS
- Enable memory optimization flags
- Monitor memory usage patterns
- Implement memory limits
Low Throughput
Symptoms: Can't handle request volume
Solutions:
- Scale up workers horizontally
- Optimize model loading
- Implement request batching
- Use faster hardware
Cache Inefficiency
Symptoms: Low cache hit rates, repeated detections
Solutions:
- Analyze cache usage patterns
- Adjust cache invalidation strategy
- Implement smarter cache warming
- Optimize cache storage format
Performance Benchmarks
Expected Performance by Hardware
| Hardware | Throughput (req/s) | Latency (ms) | Memory (GB) |
|---|---|---|---|
| Apple M1/M2 | 15-25 | 200-400 | 1.8 |
| NVIDIA RTX 3060 | 20-35 | 150-300 | 2.2 |
| NVIDIA RTX 4080 | 30-50 | 100-250 | 2.5 |
| CPU (8 cores) | 2-5 | 800-2000 | 1.5 |
| CPU (16 cores) | 3-8 | 600-1500 | 1.5 |
Optimization Impact
| Optimization | Performance Gain | Memory Impact | Complexity |
|---|---|---|---|
| Multi-worker | 2-4x throughput | +2GB per worker | Low |
| Mixed precision | 1.5-2x speed | -0.5GB | Medium |
| Image compression | 1.2-1.5x speed | Minimal | Low |
| Coordinate caching | 5-10x for repeats | +100MB | Low |
| Request batching | 1.3-2x throughput | Minimal | High |
Continuous Optimization
Automated Monitoring
// Set up automated performance monitoring
setInterval(async () => {
const metrics = await getVisionMetrics();
if (metrics.averageResponseTime > 1000) {
console.warn('High latency detected, investigating...');
// Log detailed diagnostics
// Consider scaling up workers
}
if (metrics.errorRate > 0.05) {
console.error('High error rate detected');
// Alert administrators
// Check MoonDream service health
}
}, 60000); // Check every minute
Adaptive Scaling
// Implement adaptive worker scaling
async function adjustWorkersBasedOnLoad() {
const metrics = await getSystemMetrics();
const currentWorkers = getCurrentWorkerCount();
if (metrics.cpuUsage > 80 && currentWorkers < MAX_WORKERS) {
await scaleWorkers(currentWorkers + 1);
} else if (metrics.cpuUsage < 30 && currentWorkers > MIN_WORKERS) {
await scaleWorkers(currentWorkers - 1);
}
}
Performance Regression Testing
// Automated regression tests
async function runPerformanceRegression() {
const baseline = await loadBaselineMetrics();
const current = await measureCurrentPerformance();
const regressions = findRegressions(baseline, current);
if (regressions.length > 0) {
console.error('Performance regressions detected:', regressions);
// Alert team
// Create issue for investigation
}
}
This comprehensive optimization guide ensures MoonDream Vision delivers maximum performance for your automation needs while maintaining reliability and accuracy. Regular monitoring and tuning based on your specific use case will help maintain optimal performance over time. 🚀