Engineering · 8 min read

How we built scalable cloud infrastructure for growing businesses

How we designed our cloud platform to handle traffic spikes 10× above normal. A deep dive into auto-scaling, load balancing, and cost optimization.

Ioana Dragomir

Marketing Team · March 18, 2026

IT professional examining data servers in a modern data center

Photo by Christina Morillo · Pexels

The Challenge

When we started offering managed hosting and cloud services at GRADAX, most of our clients ran predictable workloads. A local restaurant's website would get steady traffic, maybe a small spike on weekends. An accountant's portfolio site barely moved the needle on server resources. We provisioned conservatively, and it worked. But as we onboarded e-commerce brands, SaaS startups, and media outlets, the traffic patterns became wildly unpredictable.

One of our clients, a Romanian fashion marketplace, ran a flash sale that drove 14x their normal traffic in under ninety seconds. Their previous host buckled. They came to us the next day asking for a guarantee that it would never happen again. That single conversation forced us to rethink our entire infrastructure philosophy. We could no longer design for averages — we had to design for extremes.

The core challenge was threefold: handle sudden traffic spikes without manual intervention, keep costs reasonable during quiet periods, and maintain sub-200ms response times across European and global audiences. Solving any one of these in isolation is straightforward. Solving all three simultaneously required a fundamentally different architecture.

Architecture Decisions

We evaluated several approaches before settling on our current stack. A fully serverless architecture was tempting but introduced cold-start latencies that violated our performance targets, especially for server-rendered Next.js applications that our clients rely on. A traditional VM-based setup gave us control but lacked the elasticity we needed. We landed on a hybrid model: a baseline of dedicated compute backed by on-demand cloud server pools that can spin up in under four seconds.

Our control plane runs on Kubernetes, orchestrating workloads across multiple availability zones within each data center region. Each client's application is containerized with a standardized runtime image that includes our monitoring agent, log forwarder, and health check endpoint. This uniformity is what makes auto-scaling possible: every container is interchangeable, so the orchestrator can add or remove instances without worrying about environment-specific quirks.

For the data layer, we adopted a tiered storage strategy. Hot data lives on NVMe SSDs with replication across two nodes. Warm data moves to standard SSDs after seven days. Cold data archives to object storage after thirty. This alone reduced our storage costs by 40% without any perceptible impact on client performance.
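The age thresholds above translate into a simple classification rule. The following is a minimal sketch, assuming that "age" is measured from the last write (the text does not say whether it is creation, last write, or last access); the function name `storage_tier` is hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Thresholds taken from the text: warm after 7 days, cold after 30.
WARM_AFTER = timedelta(days=7)
COLD_AFTER = timedelta(days=30)


def storage_tier(last_written: datetime, now: datetime) -> str:
    """Return "hot", "warm", or "cold" for an object, based on its age.

    Assumption: age is counted from the last write.
    """
    age = now - last_written
    if age >= COLD_AFTER:
        return "cold"   # object storage
    if age >= WARM_AFTER:
        return "warm"   # standard SSD
    return "hot"        # replicated NVMe SSD


now = datetime(2026, 3, 18, tzinfo=timezone.utc)
for days in (1, 10, 45):
    print(days, storage_tier(now - timedelta(days=days), now))
```

A background job could run a rule like this periodically and move any object whose computed tier differs from where it currently lives.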

Auto-Scaling Strategy

Auto-scaling sounds simple in theory: when demand goes up, add capacity; when it drops, remove it. In practice, the timing and thresholds matter enormously. Scale too late and users hit errors. Scale too early and you burn money on idle resources. We spent three months tuning our scaling policies before we were confident enough to run them in production without human oversight.

Our system layers a predictive scaling model on top of reactive triggers. The predictive layer analyzes historical traffic patterns (day of week, time of day, seasonal trends) along with known events, such as product launches or marketing campaigns, that clients schedule through our dashboard. It pre-provisions capacity fifteen minutes before anticipated demand. The reactive layer watches real-time metrics: CPU utilization, memory pressure, request queue depth, and P95 response latency. If any metric crosses its threshold, new containers spin up within seconds.
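To illustrate how the two layers can combine, here is a minimal sketch in which the predictive layer sets a floor on the replica count and the reactive layer adds capacity when any metric is over its limit. The threshold values, the one-replica step size, and the names `Metrics`, `breached`, and `desired_replicas` are all assumptions for illustration; the article does not publish the real policy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metrics:
    cpu_utilization: float   # fraction, 0.0-1.0
    memory_pressure: float   # fraction, 0.0-1.0
    queue_depth: int         # pending requests
    p95_latency_ms: float


# Hypothetical thresholds, chosen only to make the example concrete.
THRESHOLDS = Metrics(cpu_utilization=0.75, memory_pressure=0.80,
                     queue_depth=100, p95_latency_ms=200.0)

METRIC_NAMES = ("cpu_utilization", "memory_pressure", "queue_depth", "p95_latency_ms")


def breached(m: Metrics, limits: Metrics = THRESHOLDS) -> list[str]:
    """Return the names of the metrics that exceed their threshold."""
    return [n for n in METRIC_NAMES if getattr(m, n) > getattr(limits, n)]


def desired_replicas(current: int, predicted: int, m: Metrics) -> int:
    """Never drop below the predictive floor; add one replica
    (an assumed step size) when any reactive metric is breached."""
    reactive = current + 1 if breached(m) else current
    return max(predicted, reactive)
```

With four replicas running, a forecast of six yields six even when all metrics are calm, while a forecast of three with CPU at 90% yields five.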

We also built a custom cooldown mechanism. After a spike subsides, we don't immediately tear down extra capacity. Instead, we hold it for a configurable window (default: ten minutes) because traffic spikes often come in waves. This prevents the costly churn of repeatedly spinning containers up and down during sustained high-traffic events like Black Friday or viral social media moments.
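The hold-before-teardown idea can be sketched as a small state machine. This is an illustrative version, not the production mechanism: the class name `ScaleDownCooldown` and the rule that any new burst restarts the clock are assumptions. Only the ten-minute default comes from the text.

```python
from typing import Optional


class ScaleDownCooldown:
    """Allow scale-down only after load has stayed low for `window_s` seconds."""

    def __init__(self, window_s: float = 600.0) -> None:  # 10-minute default from the text
        self.window_s = window_s
        self._low_since: Optional[float] = None

    def may_scale_down(self, load_is_low: bool, now: float) -> bool:
        """`now` is a monotonic timestamp in seconds."""
        if not load_is_low:
            self._low_since = None    # a new wave arrived; restart the clock
            return False
        if self._low_since is None:
            self._low_since = now     # first quiet observation
        return now - self._low_since >= self.window_s
```

Because a second wave resets the timer, capacity added for the first wave is still available when the next one arrives, which is the churn the cooldown is meant to avoid.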

Load Balancing

Load balancing at GRADAX operates at two layers. The outer layer is a global anycast network that routes users to the nearest data center based on latency, not just geography. A user in Munich might be routed to our Frankfurt node or our Amsterdam node depending on real-time network conditions. This decision is re-evaluated every sixty seconds using active health probes sent from each edge location.
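The selection step can be illustrated with a few lines of code. In a real anycast network the routing itself is handled at the network layer, so this sketch only models the decision "lowest measured latency among healthy nodes". The function name `nearest_node` and the latency figures are made up for the example.

```python
from typing import Optional


def nearest_node(probe_ms: dict[str, Optional[float]]) -> str:
    """Pick the data center with the lowest probed latency.

    `probe_ms` maps a node name to its latest round-trip time in
    milliseconds, or None when the health probe failed.
    """
    healthy = {node: rtt for node, rtt in probe_ms.items() if rtt is not None}
    if not healthy:
        raise RuntimeError("no healthy data center")
    return min(healthy, key=healthy.get)


# A user in Munich: Frankfurt is geographically closer, but (in this
# invented snapshot) Amsterdam currently answers faster.
print(nearest_node({"frankfurt": 31.0, "amsterdam": 18.5}))  # prints "amsterdam"
```

Re-running the selection with fresh probe results every sixty seconds gives the periodic re-evaluation described above.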

The inner layer distributes traffic across containers within a data center. We moved away from simple round-robin early on because it doesn't account for container health or current load. Our balancer uses a weighted least-connections algorithm that factors in each container's current request count, CPU utilization, and recent error rate. Containers that are struggling get fewer requests, giving them time to recover without being pulled entirely out of rotation.
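A minimal sketch of a weighted least-connections choice follows. It folds request count, CPU utilization, and recent error rate into one score and picks the lowest. The scoring formula and the weights (`cpu_weight`, `error_weight`) are illustrative assumptions, not the values used in production.

```python
from dataclasses import dataclass


@dataclass
class Container:
    name: str
    active_requests: int
    cpu_utilization: float   # 0.0-1.0
    error_rate: float        # recent fraction of failed requests, 0.0-1.0


def load_score(c: Container, cpu_weight: float = 10.0, error_weight: float = 50.0) -> float:
    """Lower is better. The weights are illustrative only."""
    return c.active_requests + cpu_weight * c.cpu_utilization + error_weight * c.error_rate


def pick_container(pool: list[Container]) -> Container:
    """Route the next request to the container with the lowest score."""
    if not pool:
        raise ValueError("no containers available")
    return min(pool, key=load_score)


pool = [
    Container("a", active_requests=12, cpu_utilization=0.40, error_rate=0.00),
    Container("b", active_requests=8, cpu_utilization=0.95, error_rate=0.10),
    Container("c", active_requests=10, cpu_utilization=0.30, error_rate=0.00),
]
print(pick_container(pool).name)  # prints "c"
```

Container "b" has the fewest open requests, so plain least-connections would choose it. Its high CPU and error rate push its score above the others, so it receives less traffic while staying in rotation.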

We also implemented connection draining for graceful shutdowns. When the auto-scaler decides to remove a container, it first stops sending new connections to it, then waits up to thirty seconds for existing requests to complete. This eliminated the 502 errors that plagued our earlier iterations during scale-down events.
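The two-step drain sequence can be sketched as follows. The callbacks `stop_new_connections` and `in_flight` are hypothetical hooks into a load balancer and a container; the polling approach and the injectable `clock` and `sleep` parameters are assumptions made so the example is testable. Only the thirty-second limit comes from the text.

```python
import time
from typing import Callable


def drain(stop_new_connections: Callable[[], None],
          in_flight: Callable[[], int],
          timeout_s: float = 30.0,    # upper bound from the text
          poll_s: float = 0.1,
          clock: Callable[[], float] = time.monotonic,
          sleep: Callable[[float], None] = time.sleep) -> bool:
    """Return True if all in-flight requests finished before the timeout."""
    stop_new_connections()            # step 1: take the container out of rotation
    deadline = clock() + timeout_s
    while in_flight() > 0:            # step 2: wait for existing requests
        if clock() >= deadline:
            return False              # caller decides whether to force-stop
        sleep(poll_s)
    return True
```

Stopping new connections before waiting is what keeps requests from arriving at a container that is about to disappear, which is the usual source of 502 responses during scale-down.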

Cost Optimization

Cloud infrastructure can become expensive fast, especially when you're running auto-scaling across multiple regions. We took a disciplined approach to cost management from day one, treating infrastructure spend as a product metric just as important as uptime or latency. Every architectural decision goes through a cost-impact review before it reaches production.

Our biggest savings came from right-sizing. We built an analysis tool that samples container resource usage over fourteen-day windows and recommends optimal CPU and memory allocations. Most clients were over-provisioned by 30-50% based on our initial estimates. After right-sizing, we reduced baseline compute costs by 35% across the fleet. We also negotiated reserved capacity agreements with our upstream providers for the baseline workload, locking in rates 45% below on-demand pricing.
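One common way to turn usage samples into a recommendation is to take a high percentile and add headroom. The sketch below uses that approach as an assumption; the article does not describe the actual algorithm, and the 95th percentile, the 20% headroom, and the function names are illustrative.

```python
import math


def recommend(samples: list[float], percentile: float = 95.0, headroom: float = 0.20) -> float:
    """Recommend an allocation: a usage percentile plus headroom.

    `samples` are usage readings over the analysis window (fourteen days
    in the text), in any consistent unit such as CPU cores or MiB.
    """
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    # Nearest-rank percentile: the smallest reading with at least
    # `percentile` percent of all samples at or below it.
    rank = max(1, math.ceil(percentile * len(ordered) / 100.0))
    return ordered[rank - 1] * (1.0 + headroom)


def over_provisioned_by(allocated: float, recommended: float) -> float:
    """Fraction of the current allocation that could be reclaimed."""
    return max(0.0, (allocated - recommended) / allocated)
```

For example, a container allocated 2.0 cores whose recommendation comes out at 1.2 cores is over-provisioned by 40%, which falls inside the 30-50% range mentioned above.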

For clients who want maximum cost efficiency, we offer a "burstable" tier that runs on shared infrastructure during normal hours and scales to dedicated resources only during peak periods. This tier costs 60% less than our standard offering and is ideal for businesses with predictable low-traffic periods, like B2B SaaS products that see almost no usage outside business hours.

Monitoring & Observability

You cannot manage what you cannot measure, and in a distributed auto-scaling system, visibility is everything. We built our observability stack around three pillars: metrics, logs, and traces. Every container emits structured metrics at five-second intervals covering CPU, memory, disk I/O, network throughput, request count, error rate, and response latency percentiles. These flow into a time-series database that retains high-resolution data for seven days and downsampled data for thirteen months.

Distributed tracing was the piece that took the longest to get right but delivered the most value. Every incoming request receives a trace ID that follows it through the load balancer, application container, cache layer, and database. When a client reports slow page loads, we can pull up the exact trace and see precisely where time was spent. Nine times out of ten, the bottleneck is an unoptimized database query or a missing cache key — not infrastructure.

Our alerting system, which runs alongside our website security monitoring, uses tiered severity levels tied to business impact, not just technical thresholds. A single container running hot is informational. A sustained error rate above 1% across a client's fleet is a warning. An availability drop below 99.9% over a five-minute window pages the on-call engineer immediately. This approach reduced alert noise by 70% compared to our initial threshold-based system.
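The three tiers can be expressed as an ordered set of checks, most severe first. This is a sketch under assumptions: the function name `severity` and the input shape are hypothetical, and the error rate is assumed to be already averaged over a sustained period whose length the text does not give.

```python
from typing import Optional


def severity(containers_hot: int, fleet_error_rate: float, availability_5m: float) -> Optional[str]:
    """Map fleet-level signals to a severity level. Fractions are 0.0-1.0.

    `availability_5m` is availability over the last five minutes.
    """
    if availability_5m < 0.999:
        return "page"       # on-call engineer is paged immediately
    if fleet_error_rate > 0.01:
        return "warning"
    if containers_hot >= 1:
        return "info"
    return None             # nothing to report
```

Checking the most severe condition first means a single incident produces one alert at the right level rather than three overlapping ones.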

Results

After eighteen months of running this architecture in production, the numbers speak for themselves. We maintained 99.97% uptime across all client workloads, including during three major traffic events that each exceeded 50x baseline load. The platform handled over 2.8 billion requests in the last quarter alone, with a global P95 response time of 142 milliseconds. Not a single client experienced downtime due to scaling failures.

On the cost side, our clients collectively save an estimated 40% compared to equivalent configurations on major public cloud providers. The auto-scaling system prevented an estimated €180,000 in unnecessary compute spend last year by scaling down during off-peak hours rather than running hot 24/7. For our fashion marketplace client, the one whose flash sale started this whole journey, their monthly infrastructure bill dropped by 52% after migrating to our platform, despite handling 3x more traffic than before.

The operational improvements were equally significant. Mean time to detection for performance issues dropped from 12 minutes to under 90 seconds. Mean time to resolution fell from 45 minutes to 8 minutes on average. Our infrastructure team, which previously spent 60% of their time on reactive firefighting, now spends 80% of their time on proactive improvements and new feature development.

Looking Forward

We are not done. The next phase of our cloud infrastructure roadmap focuses on three areas: edge computing, AI-driven capacity planning, and carbon-aware scheduling. We are currently testing edge compute nodes in twelve additional cities that would allow latency-sensitive workloads like image optimization and API responses to run closer to end users. Early benchmarks show a 40% reduction in P99 latency for users more than 1,000 kilometers from our nearest data center.

On the AI front, we are training a capacity forecasting model on two years of historical data across all client workloads. The goal is to predict traffic patterns 24 hours in advance with enough confidence to pre-provision capacity before spikes occur, eliminating even the brief latency increase that happens during reactive scale-up events. Initial results are promising, with the model achieving 89% accuracy on next-day peak traffic predictions.

Finally, we are exploring carbon-aware scheduling that would shift non-urgent workloads like backups, batch processing, and pre-rendering to times when the local electricity grid has the highest proportion of renewable energy. This aligns with our broader commitment to sustainable infrastructure and gives our environmentally conscious clients a tangible metric they can point to. We expect to roll out all three capabilities before the end of 2026. If you want to learn how these improvements can benefit your business, get in touch with our team.