MoonDream Optimization

Performance tuning and optimization strategies for MoonDream Vision in the FYI Automation Tool.

Hardware Acceleration

MPS (Apple Silicon)

MoonDream automatically detects and optimizes for Apple Silicon:

# Verify MPS acceleration
python -c "import torch; print('MPS available:', torch.backends.mps.is_available())"

Performance Characteristics:

Memory Usage: ~1.8GB per model instance
Throughput: 15-25 requests/second
Warm-up Time: 5-10 seconds
Precision: Full FP32 precision

CUDA (NVIDIA GPUs)

For NVIDIA GPUs with CUDA support:

# Check CUDA availability
python -c "import torch; print('CUDA available:', torch.cuda.is_available())"
python -c "import torch; print('GPU:', torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'none')"

Performance Characteristics:

Memory Usage: ~2.2GB per model instance (with optimizations)
Throughput: 20-40 requests/second
Warm-up Time: 3-8 seconds
Precision: Mixed FP16/FP32 for optimal performance

CPU Fallback

When GPU acceleration isn't available:

# Check CPU configuration
python -c "import torch; print('CPU threads:', torch.get_num_threads())"

Performance Characteristics:

Memory Usage: ~1.5GB per model instance
Throughput: 2-5 requests/second
Warm-up Time: 10-20 seconds
Precision: Full FP32 precision

Model Optimization

Multi-Worker Configuration

Scale performance by running multiple model instances:

# moondream/.env
MOONDREAM_WORKERS=4

Worker Scaling Guidelines:

1 Worker: Basic usage, development
2-4 Workers: Moderate load, CI/CD pipelines
8+ Workers: High-throughput production
Memory: ~2GB × number of workers
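
The ~2GB-per-worker guideline puts a ceiling on MOONDREAM_WORKERS for a given machine. A minimal sketch of that arithmetic follows; the helper name `maxWorkersForMemory` and the 2 GB of reserved headroom are illustrative assumptions, not part of the tool.

```typescript
// Hypothetical helper: estimate how many workers fit in memory.
// perWorkerGb defaults to the ~2 GB per worker guideline.
function maxWorkersForMemory(
  totalGb: number,
  reservedGb = 2, // assumed headroom for the OS and other services
  perWorkerGb = 2
): number {
  const usableGb = totalGb - reservedGb;
  // Always report at least one worker, even on a memory-constrained host.
  return Math.max(1, Math.floor(usableGb / perWorkerGb));
}

console.log(maxWorkersForMemory(16)); // 7
console.log(maxWorkersForMemory(8));  // 3
```

Treat the result as an upper bound and confirm it against observed memory usage before raising MOONDREAM_WORKERS.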

Memory Optimization

# In main.py, set these before importing torch so they take effect
import os

# CUDA memory optimization: limit allocator block splitting to reduce fragmentation
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:512"

# CPU memory optimization
os.environ["OMP_NUM_THREADS"] = "4"  # Limit OpenMP threads

Model Precision Tuning

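# Use half precision on CUDA; keep FP32 on other devices
import torch
from transformers import AutoModelForCausalLM

dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model = AutoModelForCausalLM.from_pretrained(
    "vikhyatk/moondream2",
    revision="2025-06-21",
    trust_remote_code=True,
    device_map=device_map,  # assumed to be defined earlier in main.py
    torch_dtype=dtype,  # Half precision for speed on CUDA
    low_cpu_mem_usage=True
)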

Image Processing Optimization

Image Preprocessing

Optimize images before sending to MoonDream:

// Resize large images
const MAX_SIZE = 1024;
const optimizedImage = await resizeImage(screenshot, MAX_SIZE);

// Compress images
const compressedImage = await compressImage(optimizedImage, {
  quality: 0.8,
  format: 'jpeg'
});
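
`resizeImage` is not defined in this guide. The part that is easy to get wrong is the target size: the longer side should be capped at MAX_SIZE while the aspect ratio is preserved and small images are left alone. A minimal sketch of that calculation, with `fitWithin` and `Size` as hypothetical names:

```typescript
interface Size {
  width: number;
  height: number;
}

// Scale so the longer side is at most maxSize, preserving aspect ratio.
// Images already within the limit are returned unchanged (no upscaling).
function fitWithin(size: Size, maxSize = 1024): Size {
  const longest = Math.max(size.width, size.height);
  if (longest <= maxSize) {
    return { width: size.width, height: size.height };
  }
  const scale = maxSize / longest;
  return {
    width: Math.max(1, Math.round(size.width * scale)),
    height: Math.max(1, Math.round(size.height * scale)),
  };
}

console.log(fitWithin({ width: 2560, height: 1440 })); // { width: 1024, height: 576 }
```

If the detection endpoint returns pixel coordinates rather than normalized ones, they refer to the resized image and need to be divided by the same scale factor before use on the device.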

Batch Processing

Process multiple detections efficiently:

// Instead of individual requests
const results = await Promise.all([
  detectObject(screenshot, "button1"),
  detectObject(screenshot, "button2"),
  detectObject(screenshot, "input1")
]);

// Use batch API when available
const batchResult = await detectObjectsBatch(screenshot, [
  "button1", "button2", "input1"
]);

Coordinate Caching Strategy

Cache Configuration

// Configure cache behavior
const CACHE_CONFIG = {
  ttl: 24 * 60 * 60 * 1000,        // 24 hours
  maxEntries: 10000,               // Per flow-device combination
  minConfidence: 0.8,              // Minimum confidence to cache
  compression: true                // Compress cached data
};
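
How a cache might honor `ttl`, `maxEntries`, and `minConfidence` can be sketched with an in-memory map. This is an illustration only: the tool's cache is database-backed, the `CoordinateCache` class and `CachedPoint` shape are assumptions, and `compression` is left out. One instance stands in for one flow-device combination.

```typescript
interface CachedPoint {
  x: number;
  y: number;
  confidence: number;
}

interface CacheOptions {
  ttl: number; // milliseconds
  maxEntries: number;
  minConfidence: number;
}

// Hypothetical in-memory cache for a single flow-device combination.
class CoordinateCache {
  private entries = new Map<string, { value: CachedPoint; storedAt: number }>();
  private options: CacheOptions;

  constructor(options: CacheOptions) {
    this.options = options;
  }

  // Returns false when the result is too uncertain to be worth caching.
  set(key: string, value: CachedPoint, now = Date.now()): boolean {
    if (value.confidence < this.options.minConfidence) {
      return false;
    }
    this.entries.delete(key); // re-insert so the key becomes the newest entry
    this.entries.set(key, { value, storedAt: now });
    if (this.entries.size > this.options.maxEntries) {
      // Map iterates in insertion order, so the first key is the oldest.
      const oldest = this.entries.keys().next().value;
      if (oldest !== undefined) {
        this.entries.delete(oldest);
      }
    }
    return true;
  }

  // Returns undefined for missing or expired entries.
  get(key: string, now = Date.now()): CachedPoint | undefined {
    const entry = this.entries.get(key);
    if (!entry) {
      return undefined;
    }
    if (now - entry.storedAt > this.options.ttl) {
      this.entries.delete(key);
      return undefined;
    }
    return entry.value;
  }
}
```

Passing `now` explicitly keeps expiry testable; production callers can rely on the `Date.now()` default.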

Intelligent Cache Invalidation

// Invalidate cache when UI changes
async function invalidateOnUIChange(deviceId: string, flowId: string) {
  const currentScreenshot = await takeScreenshot(deviceId);
  const previousScreenshot = await getLastScreenshot(deviceId, flowId);

  if (previousScreenshot) {
    const similarity = await compareScreenshots(currentScreenshot, previousScreenshot);
    if (similarity < 0.9) { // Less than 90% similar
      await invalidateFlowCache(flowId, deviceId);
    }
  }
}
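
`compareScreenshots` is not defined in this guide. One simple way to produce a score between 0 and 1 is the fraction of pixels that match within a tolerance. The sketch below assumes both screenshots have already been decoded to raw RGBA buffers of the same dimensions; `pixelSimilarity` and the tolerance of 8 are hypothetical choices, and a perceptual hash may suit screens with animations better.

```typescript
// Hypothetical helper: fraction of pixels whose RGB channels differ by at most `tolerance`.
// Both inputs are assumed to be raw RGBA buffers (4 bytes per pixel).
function pixelSimilarity(a: Uint8Array, b: Uint8Array, tolerance = 8): number {
  if (a.length !== b.length || a.length % 4 !== 0) {
    return 0; // different sizes or malformed data: treat as a changed UI
  }
  const pixelCount = a.length / 4;
  if (pixelCount === 0) {
    return 1; // two empty images are trivially identical
  }
  let matching = 0;
  for (let i = 0; i < a.length; i += 4) {
    const close =
      Math.abs(a[i] - b[i]) <= tolerance &&
      Math.abs(a[i + 1] - b[i + 1]) <= tolerance &&
      Math.abs(a[i + 2] - b[i + 2]) <= tolerance; // alpha channel is ignored
    if (close) {
      matching++;
    }
  }
  return matching / pixelCount;
}
```

A result below 0.9 would trigger the invalidation branch shown above.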

Cache Warming

Pre-populate cache for common elements:

const COMMON_ELEMENTS = [
  "login button", "search button", "menu button",
  "username field", "password field", "submit button"
];

async function warmCache(deviceId: string, flowId: string) {
  const screenshot = await takeScreenshot(deviceId);

  for (const element of COMMON_ELEMENTS) {
    try {
      const result = await detectObject(screenshot, element);
      if (result.found && result.confidence > 0.8) {
        await storeCoordinateCache(flowId, deviceId, element, result);
      }
    } catch (error) {
      console.warn(`Failed to cache ${element}:`, error);
    }
  }
}

Network Optimization

Connection Pooling

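// Configure HTTP connection pooling
// Use http.Agent instead if MOONDREAM_URL is a plain http:// URL
import https from 'https';
import fetch from 'node-fetch'; // Node's built-in fetch does not accept an `agent` option

const agent = new https.Agent({
  keepAlive: true,
  maxSockets: 10,
  maxFreeSockets: 5,
  timeout: 30000
});

// Use in requests
const response = await fetch(MOONDREAM_URL, {
  agent,
  method: 'POST',
  body: formData
});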

Request Batching

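// Share one in-flight request per key and space out new requests
class RequestBatcher {
  private queue: Map<string, Promise<unknown>> = new Map();
  private delay = 100; // minimum ms between request starts
  private nextSlot = 0; // earliest time the next request may start

  async batchRequest<T>(
    key: string,
    requestFn: () => Promise<T>
  ): Promise<T> {
    // Reuse the pending request if an identical one is already running
    const pending = this.queue.get(key);
    if (pending) {
      return pending as Promise<T>;
    }

    // Store the chained promise so a rejection is always seen by the caller
    const promise = this.processBatch(key, requestFn).finally(() => {
      this.queue.delete(key);
    });
    this.queue.set(key, promise);

    return promise;
  }

  private async processBatch<T>(
    key: string,
    requestFn: () => Promise<T>
  ): Promise<T> {
    await this.rateLimit();
    return requestFn();
  }

  // Reserve the next start slot and wait until it arrives
  private async rateLimit(): Promise<void> {
    const now = Date.now();
    const startAt = Math.max(now, this.nextSlot);
    this.nextSlot = startAt + this.delay;
    if (startAt > now) {
      await new Promise((resolve) => setTimeout(resolve, startAt - now));
    }
  }
}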

Detection Strategy Optimization

Confidence Threshold Tuning

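// Confidence thresholds per element type
const CONFIDENCE_THRESHOLDS: Record<string, number> = {
  button: 0.85,
  input: 0.80,
  icon: 0.90,
  text: 0.75,
  image: 0.70
};

// Fallback for element types without a dedicated threshold
const DEFAULT_CONFIDENCE_THRESHOLD = 0.80;

function getConfidenceThreshold(elementType: string): number {
  return CONFIDENCE_THRESHOLDS[elementType] ?? DEFAULT_CONFIDENCE_THRESHOLD;
}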

Multi-Attempt Detection

async function detectWithFallbacks(
  screenshot: string,
  element: string,
  maxAttempts = 3
): Promise<DetectionResult> {
  const variations = generateVariations(element);

  for (let attempt = 0; attempt < Math.min(maxAttempts, variations.length); attempt++) {
    try {
      const result = await detectObject(screenshot, variations[attempt]);

      if (result.found && result.confidence > 0.7) {
        return result;
      }
    } catch (error) {
      console.warn(`Detection attempt ${attempt + 1} failed:`, error);
    }
  }

  // No variation cleared the confidence threshold
  return { found: false, confidence: 0 };
}

Element Description Optimization

function generateVariations(element: string): string[] {
  const base = element.toLowerCase().trim();

  return [
    base,
    `${base} button`,
    `${base} icon`,
    `the ${base}`,
    `${base} field`,
    `${base} input`,
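    // Strip type words as whole words and collapse the leftover whitespace
    base.replace(/\b(button|icon|field|input)\b/g, '').replace(/\s+/g, ' ').trim(),
    // Add context-specific variations
    ...generateContextVariations(base)
  ].filter((v, i, arr) => v.length > 0 && arr.indexOf(v) === i); // Drop empty strings and duplicates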
}

Database Optimization

Coordinate Cache Indexing

-- Optimize coordinate cache queries
CREATE INDEX idx_coordinate_cache_lookup
ON flow_coordinate_cache(flow_id, device_id, target_element);

CREATE INDEX idx_coordinate_cache_usage
ON flow_coordinate_cache(last_used_at, hit_count DESC);

CREATE INDEX idx_coordinate_cache_valid
ON flow_coordinate_cache(is_valid, updated_at DESC);

Query Optimization

// Use indexed queries (findUnique needs a @@unique([flowId, deviceId, targetElement]) constraint in the Prisma schema)
const cachedCoords = await prisma.flowCoordinateCache.findUnique({
  where: {
    flowId_deviceId_targetElement: {
      flowId,
      deviceId,
      targetElement
    }
  }
});

// Batch coordinate updates
await prisma.flowCoordinateCache.updateMany({
  where: {
    flowId,
    deviceId,
    isValid: true
  },
  data: {
    lastUsedAt: new Date()
  }
});

Monitoring and Profiling

Performance Metrics

class VisionPerformanceMonitor {
  private metrics = {
    requestCount: 0,
    totalResponseTime: 0,
    cacheHits: 0,
    cacheMisses: 0,
    errors: 0
  };

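  // Record one request; pass failed = true when the detection call errored
  recordRequest(responseTime: number, cached: boolean, failed = false) {
    this.metrics.requestCount++;
    this.metrics.totalResponseTime += responseTime;

    if (cached) {
      this.metrics.cacheHits++;
    } else {
      this.metrics.cacheMisses++;
    }

    if (failed) {
      this.metrics.errors++;
    }
  }

  getStats() {
    // Avoid NaN from dividing by zero before the first request is recorded
    const total = this.metrics.requestCount;
    const perRequest = (value: number) => (total > 0 ? value / total : 0);

    return {
      averageResponseTime: perRequest(this.metrics.totalResponseTime),
      cacheHitRate: perRequest(this.metrics.cacheHits),
      errorRate: perRequest(this.metrics.errors),
      totalRequests: total
    };
  }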
}

Profiling Tools

# Profile Python performance
python -m cProfile main.py

# Monitor GPU usage
nvidia-smi --query-gpu=utilization.gpu,utilization.memory --format=csv

# Check memory usage
ps aux | grep python | head -5

# Network monitoring
netstat -tlnp | grep :20200

Benchmarking

async function benchmarkDetection(
  screenshot: string,
  elements: string[],
  iterations = 10
) {
  const results = [];

  for (const element of elements) {
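    const times: number[] = [];

    // Untimed warm-up call so model warm-up does not skew the first sample
    await detectObject(screenshot, element);

    for (let i = 0; i < iterations; i++) {
      const start = performance.now();
      await detectObject(screenshot, element);
      times.push(performance.now() - start);
    }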

    results.push({
      element,
      averageTime: times.reduce((a, b) => a + b) / times.length,
      minTime: Math.min(...times),
      maxTime: Math.max(...times)
    });
  }

  return results;
}
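
Averages hide tail latency, so it can help to report a high percentile next to `averageTime`. A minimal sketch using the nearest-rank method; `percentile` is a hypothetical helper that would be applied to the `times` array collected above.

```typescript
// Nearest-rank percentile: the smallest sample such that at least p% of
// samples are less than or equal to it. Returns 0 for an empty input.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) {
    return 0;
  }
  const sorted = [...samples].sort((x, y) => x - y); // numeric sort on a copy
  const rank = Math.ceil((p / 100) * sorted.length);
  const index = Math.min(sorted.length - 1, Math.max(0, rank - 1));
  return sorted[index];
}

const sampleTimes = [120, 95, 310, 101, 99];
console.log(percentile(sampleTimes, 50)); // 101
console.log(percentile(sampleTimes, 95)); // 310
```

With only 10 iterations the 95th percentile is simply the slowest sample, so raise `iterations` before drawing conclusions from it.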

Load Testing

Concurrent Request Testing

# Test with multiple concurrent requests
for i in {1..10}; do
  curl -X POST "http://localhost:20200/v1/point" \
    -F "object=button" \
    -F "init_image=@test.png" &
done
wait

Stress Testing Script

#!/bin/bash
# stress_test.sh

CONCURRENT_REQUESTS=20
TOTAL_REQUESTS=100

# xargs -P caps how many curl processes run at once
seq 1 "$TOTAL_REQUESTS" | xargs -P "$CONCURRENT_REQUESTS" -I{} \
  curl -X POST "http://localhost:20200/v1/point" \
    -F "object=test_button_{}" \
    -F "init_image=@screenshot.png" \
    -w "%{time_total}\n" \
    -o /dev/null \
    -s \
  | awk '{sum+=$1} END {if (NR) print "Average response time:", sum/NR, "seconds"}'

Production Deployment Optimization

Kubernetes Optimization

apiVersion: apps/v1
kind: Deployment
metadata:
  name: moondream
spec:
  replicas: 3
  selector:
    matchLabels:
      app: moondream
  template:
    metadata:
      labels:
        app: moondream
    spec:
      containers:
      - name: moondream
        image: moondream:latest
        resources:
          requests:
            memory: "2Gi"
            cpu: "500m"
          limits:
            memory: "4Gi"
            cpu: "2000m"
        env:
        - name: MOONDREAM_WORKERS
          value: "2"
        ports:
        - containerPort: 20200
        livenessProbe:
          httpGet:
            path: /
            port: 20200
          initialDelaySeconds: 30
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /
            port: 20200
          initialDelaySeconds: 5
          periodSeconds: 5

Troubleshooting Performance Issues

High Latency

Symptoms: Response times >2 seconds

Solutions:

Check MoonDream service logs for bottlenecks
Verify GPU/CPU utilization
Check network latency between services
Implement request queuing for peak loads
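
Request queuing for peak loads can be as simple as capping how many detection calls are in flight and making the rest wait their turn. The sketch below is one way to do that on the client side; `RequestQueue` is a hypothetical class, not an existing part of the tool.

```typescript
// Hypothetical client-side queue: at most maxConcurrent tasks run at once,
// and additional tasks wait in FIFO order.
class RequestQueue {
  private active = 0;
  private waiting: Array<() => void> = [];
  private readonly maxConcurrent: number;

  constructor(maxConcurrent: number) {
    this.maxConcurrent = Math.max(1, maxConcurrent);
  }

  async run<T>(task: () => Promise<T>): Promise<T> {
    if (this.active >= this.maxConcurrent) {
      // Wait until a finishing task hands over its slot.
      await new Promise<void>((resolve) => this.waiting.push(resolve));
    } else {
      this.active++;
    }
    try {
      return await task();
    } finally {
      const next = this.waiting.shift();
      if (next) {
        next(); // the slot passes directly to the next waiter, so active is unchanged
      } else {
        this.active--;
      }
    }
  }
}
```

A caller would wrap each detection, for example `queue.run(() => detectObject(screenshot, "login button"))`, with the limit chosen to match the number of MoonDream workers.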

Memory Issues

Symptoms: Out of memory errors, service crashes

Solutions:

Reduce MOONDREAM_WORKERS
Enable memory optimization flags
Monitor memory usage patterns
Implement memory limits

Low Throughput

Symptoms: Can't handle request volume

Solutions:

Add workers or replicas (scale horizontally)
Optimize model loading
Implement request batching
Use faster hardware

Cache Inefficiency

Symptoms: Low cache hit rates, repeated detections

Solutions:

Analyze cache usage patterns
Adjust cache invalidation strategy
Implement smarter cache warming
Optimize cache storage format

Performance Benchmarks

Expected Performance by Hardware

Hardware	Throughput (req/s)	Latency (ms)	Memory (GB)
Apple M1/M2	15-25	200-400	1.8
NVIDIA RTX 3060	20-35	150-300	2.2
NVIDIA RTX 4080	30-50	100-250	2.5
CPU (8 cores)	2-5	800-2000	1.5
CPU (16 cores)	3-8	600-1500	1.5

Optimization Impact

Optimization	Performance Gain	Memory Impact	Complexity
Multi-worker	2-4x throughput	+2GB per worker	Low
Mixed precision	1.5-2x speed	-0.5GB	Medium
Image compression	1.2-1.5x speed	Minimal	Low
Coordinate caching	5-10x for repeats	+100MB	Low
Request batching	1.3-2x throughput	Minimal	High
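
The caching gain applies only to repeated lookups, so the overall effect depends on the cache hit rate. A short sketch of the weighted average; the function name and the latency figures in the example are illustrative assumptions, not measurements.

```typescript
// Expected average latency when a fraction of requests is served from cache.
function effectiveLatencyMs(
  hitRate: number,
  missLatencyMs: number,
  hitLatencyMs: number
): number {
  const rate = Math.min(1, Math.max(0, hitRate)); // clamp to [0, 1]
  return rate * hitLatencyMs + (1 - rate) * missLatencyMs;
}

// Illustrative numbers only: 300 ms uncached, 30 ms cached (a 10x gain), 50% hit rate.
console.log(effectiveLatencyMs(0.5, 300, 30)); // 165
```

In this example a 10x gain on repeats yields roughly a 1.8x improvement overall (300 ms down to 165 ms), which is why raising the hit rate often matters more than making cached lookups faster.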

Continuous Optimization

Automated Monitoring

// Set up automated performance monitoring
setInterval(async () => {
  try {
    const metrics = await getVisionMetrics();

    if (metrics.averageResponseTime > 1000) {
      console.warn('High latency detected, investigating...');
      // Log detailed diagnostics
      // Consider scaling up workers
    }

    if (metrics.errorRate > 0.05) {
      console.error('High error rate detected');
      // Alert administrators
      // Check MoonDream service health
    }
  } catch (error) {
    // Without this, a failed metrics fetch becomes an unhandled rejection
    console.error('Failed to collect vision metrics:', error);
  }
}, 60000); // Check every minute

Adaptive Scaling

// Implement adaptive worker scaling
async function adjustWorkersBasedOnLoad() {
  const metrics = await getSystemMetrics();
  const currentWorkers = getCurrentWorkerCount();

  if (metrics.cpuUsage > 80 && currentWorkers < MAX_WORKERS) {
    await scaleWorkers(currentWorkers + 1);
  } else if (metrics.cpuUsage < 30 && currentWorkers > MIN_WORKERS) {
    await scaleWorkers(currentWorkers - 1);
  }
}

Performance Regression Testing

// Automated regression tests
async function runPerformanceRegression() {
  const baseline = await loadBaselineMetrics();
  const current = await measureCurrentPerformance();

  const regressions = findRegressions(baseline, current);

  if (regressions.length > 0) {
    console.error('Performance regressions detected:', regressions);
    // Alert team
    // Create issue for investigation
  }
}
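
`findRegressions` is left undefined above. One possible shape, assuming both inputs map metric names to lower-is-better numbers (such as latency in milliseconds) and that a 10% tolerance absorbs normal run-to-run noise; the `Regression` type and the tolerance are assumptions for illustration.

```typescript
interface Regression {
  metric: string;
  baseline: number;
  current: number;
  changePct: number;
}

// Flags metrics that got worse (higher) by more than tolerancePct.
// Metrics missing from `current` or with a non-positive baseline are skipped.
function findRegressions(
  baseline: Record<string, number>,
  current: Record<string, number>,
  tolerancePct = 10
): Regression[] {
  const regressions: Regression[] = [];
  for (const metric of Object.keys(baseline)) {
    const base = baseline[metric];
    const now = current[metric];
    if (now === undefined || base <= 0) {
      continue;
    }
    const changePct = ((now - base) / base) * 100;
    if (changePct > tolerancePct) {
      regressions.push({ metric, baseline: base, current: now, changePct });
    }
  }
  return regressions;
}

const found = findRegressions(
  { averageMs: 200, p95Ms: 400 },
  { averageMs: 210, p95Ms: 500 }
);
console.log(found.map((r) => r.metric)); // [ 'p95Ms' ]
```

Higher-is-better metrics such as throughput would need the comparison inverted or should be stored as their reciprocal.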

The strategies in this guide help MoonDream Vision stay fast without sacrificing reliability or accuracy. Monitor regularly and tune for your specific workload to maintain performance over time.
