Generating one AI image is easy. Generating 12 simultaneously while handling rate limits, API failures, memory constraints, and cost tracking is an engineering problem that will break your system if you haven't planned for it. Here's what we learned running batch generation in production.
Rate Limiting Is Your Primary Enemy
Cloud AI APIs aggressively rate-limit requests to manage GPU capacity. Replicate, the most common API for adult content generation, will return HTTP 429 (Too Many Requests) when you exceed their concurrency limits. Without proper handling, a batch of 12 simultaneous requests might see 8 succeed and 4 fail with rate limit errors.
The solution is exponential backoff with jitter:
// Retry with exponential backoff; the final attempt's exception propagates to the caller
for (int attempt = 0; attempt < maxRetries; attempt++)
{
    try { return await callReplicateApi(prompt); }
    catch (RateLimitException) when (attempt < maxRetries - 1)
    {
        int delay = (int)Math.Pow(2, attempt) * 1000; // 1s, 2s, 4s, 8s, 16s
        delay += random.Next(0, 500); // Add jitter to prevent thundering herd
        await Task.Delay(delay);
    }
}
In production, we use 5 retry attempts with a 2-second base delay, giving backoff waits of 2–32 seconds. This handles burst rate limits while keeping the user experience acceptable.
Concurrency Throttling
Even with retry logic, firing 12 API calls simultaneously is wasteful when the API will only process 3–5 at a time. Use a semaphore or concurrency limiter to control how many requests are in-flight:
- 3 concurrent requests is a safe default for Replicate free-tier accounts
- 5–8 concurrent requests for paid Replicate plans
- Unlimited for self-hosted GPU inference (limited only by your hardware)
This means a batch of 12 images takes 3–4 rounds of parallel generation rather than one massive burst. It's slower but dramatically more reliable.
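A minimal sketch of this throttling using .NET's SemaphoreSlim. The limit of 3 and the GenerateImageAsync helper (your API call, with the retry logic above) are illustrative:

```csharp
// Cap in-flight requests at 3 — a safe default for free-tier accounts
var throttle = new SemaphoreSlim(3);

async Task<byte[]> GenerateThrottledAsync(string prompt)
{
    await throttle.WaitAsync();
    try
    {
        // Hypothetical helper: the API call plus retry/backoff logic
        return await GenerateImageAsync(prompt);
    }
    finally
    {
        throttle.Release(); // Always release, even on failure
    }
}

// Start all 12 tasks at once; the semaphore ensures only 3 run concurrently
var tasks = prompts.Select(GenerateThrottledAsync).ToList();
byte[][] images = await Task.WhenAll(tasks);
```

Because every task acquires the semaphore before calling the API, the batch naturally proceeds in rounds without any explicit batching logic.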
Memory Pressure in Docker
Each generation request holds data in memory: the prompt, the API response, the generated image binary (1–5 MB), and metadata. With concurrent generations in a containerized environment, memory usage spikes fast:
- 12 simultaneous generations: 50–100 MB peak memory just for image data
- With base64 encoding: Each image is ~33% larger in base64 than binary, pushing memory higher
- Database writes: Concurrent EF Core / database operations during batch saves can cause context threading issues
A critical lesson we learned: never perform database writes inside a parallel batch operation. Collect all results from the parallel generation step first, then write to the database sequentially. Shared DbContext objects in EF Core are not thread-safe and will throw concurrency exceptions under parallel writes.
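The collect-then-write pattern looks like this in outline (the GeneratedImage entity name is illustrative):

```csharp
// 1. Parallel step: generate everything and hold results in memory
byte[][] results = await Task.WhenAll(prompts.Select(GenerateThrottledAsync));

// 2. Sequential step: write on a single DbContext, one thread only.
//    DbContext is not thread-safe — never call it from inside the parallel step.
foreach (byte[] image in results)
{
    db.GeneratedImages.Add(new GeneratedImage { Data = image });
}
await db.SaveChangesAsync(); // one sequential save for the whole batch
```

This also batches the inserts into a single round trip instead of twelve concurrent ones.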
Token/Credit Deduction
If your platform uses a credit system for AI generation, batch operations need careful transaction handling:
- Deduct credits before starting generation (reserve the cost)
- If generation fails, refund credits for failed images
- Use database transactions to ensure atomicity — a partial batch failure shouldn't result in lost credits
- Track deductions and refunds as separate transaction records for audit trails
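A sketch of the reserve-then-refund flow, assuming hypothetical CreditTransaction records and an EF Core DbContext:

```csharp
// Reserve the full batch cost up front, atomically
await using (var tx = await db.Database.BeginTransactionAsync())
{
    db.CreditTransactions.Add(new CreditTransaction
    {
        UserId = userId,
        Amount = -batchSize * costPerImage,
        Type = "deduction" // separate record type for the audit trail
    });
    await db.SaveChangesAsync();
    await tx.CommitAsync();
}

// ... run the batch ...

// Refund only the images that failed, as a distinct audit record
if (failedCount > 0)
{
    await using var refundTx = await db.Database.BeginTransactionAsync();
    db.CreditTransactions.Add(new CreditTransaction
    {
        UserId = userId,
        Amount = failedCount * costPerImage,
        Type = "refund"
    });
    await db.SaveChangesAsync();
    await refundTx.CommitAsync();
}
```

Keeping the deduction and refund as separate rows (rather than mutating one balance field) is what makes partial failures auditable later.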
User Experience During Batch Generation
A batch of 12 images takes 15–60 seconds depending on the model and concurrency limits. Users need feedback:
- Progress indicators: Show “Generating 4 of 12...” with individual image status
- Stream results: Display each image as it completes rather than waiting for all 12. This reduces perceived wait time dramatically
- Failure communication: If 2 of 12 images fail, show the 10 that succeeded and offer to retry the failures. Don't fail the entire batch for partial errors
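One way to stream results as they complete is a Task.WhenAny loop over the pending generations. PushToClientAsync, NotifyProgressAsync, and NotifyFailureAsync are hypothetical stand-ins for your WebSocket or SSE layer:

```csharp
var pending = prompts.Select(GenerateThrottledAsync).ToList();
int done = 0;

while (pending.Count > 0)
{
    // Wait for whichever generation finishes next, success or failure
    Task<byte[]> finished = await Task.WhenAny(pending);
    pending.Remove(finished);
    done++;

    try
    {
        byte[] image = await finished;               // rethrows if this one failed
        await PushToClientAsync(image);              // show it immediately
        await NotifyProgressAsync(done, prompts.Count); // "Generating 4 of 12..."
    }
    catch (Exception ex)
    {
        // One failure doesn't abort the batch — offer a retry for this image only
        await NotifyFailureAsync(ex);
    }
}
```

The catch-per-image is what keeps 2 failures from sinking the 10 successes.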
Architecture Recommendation
For production batch generation, use an async job queue pattern:
- User initiates batch → create a job record in the database
- Background worker picks up the job and processes images with throttled concurrency
- Frontend polls or uses WebSocket for real-time progress
- Each completed image is saved individually to S3 and database
- Job marked complete when all images are done or max retries exhausted
This decouples the user-facing request from the generation workload, prevents HTTP timeouts on long batches, and enables horizontal scaling by adding more workers.
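The worker side of this pattern can be hosted as an ASP.NET Core BackgroundService. The _queue, _storage, and _db abstractions here are illustrative, not a prescribed API:

```csharp
public class BatchGenerationWorker : BackgroundService
{
    protected override async Task ExecuteAsync(CancellationToken stoppingToken)
    {
        while (!stoppingToken.IsCancellationRequested)
        {
            // Job record was created by the user-facing request
            GenerationJob job = await _queue.DequeueAsync(stoppingToken);

            foreach (string prompt in job.Prompts)
            {
                // Throttled + retried generation, as described above
                byte[] image = await GenerateThrottledAsync(prompt);

                // Save each image individually so progress survives a crash
                await _storage.UploadAsync(job.Id, image);   // e.g. S3
                await _db.MarkImageCompleteAsync(job.Id);    // polled by the frontend
            }

            await _db.MarkJobCompleteAsync(job.Id);
        }
    }
}
```

Because each image is persisted as it completes, a worker restart resumes mid-batch instead of regenerating (and re-billing) the whole job.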