During an online sale that includes promotion to a list, brands worry about downtime due to “unexpected high load”. We believe that while every system has a rated limit, there can be flexibility added to your infrastructure. With some planning, including autoscale and tuning, a successful sale event can be a very likely outcome. See how we handled one such sale without a hitch.
In our experience promotions via SMS generates very high peaks. In this article we will show our recent experience with a sale day that generated a peak of 30 hits per second to Magento.
On Saturday 21st March, 2020 one of our customers had a sale event. The merchant – a fashion retail brand – is an omni channel retailer in India with about 145 stores across 45 cities. There have been previous sale events, but this one was different as most stores were closed due to the corona virus measures in place. This meant that more customers were likely to shop online than visit the store.
The sale was promoted via sms and email. The SMS list was about 1 million. Promotional messages were sent in batches of 200k every half hour. Marketing was to start at 10 AM local time.
Monitoring the website
We monitor the nginx log file, analyze and display information on our dashboard. A key component of our dashboard is the plot of number of hits / minute (we count only php hits, static hits like css, js and images are excluded). We also calculate avg response time per minute as well as standard deviation per minute. Both of these have a unit of seconds. We also have alerts setup for many parameters including slowdown and a 5xx error response.
The graph below is for a 24 hour period, including the sale time, which lasted about 5 hours with the effect keeping the site more busy than usual afterwards. Note these are server logs graphs – they look quite different from google analytics!
The traffic & response
The graph gives as an idea of the traffic to the website. The turquoise line is hits per minute served by php.
- Traffic peaked a about 2000 hits per minute – about 30 hits per second.
- 3476 items were added to cart during the sale.
- Average server response was mostly below 1 second.
- There were 3 errors – all of them related to checkout function (“This payment method is not available”)
Mysql query cache and response time
The store is a Magento 1 website. Just as the traffic / hits to the website started peaking, we got a slow alert – represented by the large black line close to the first traffic peak around 10.45 am in the previous figure. Our prior experience with the website showed that the mysql query cache plays a role in the site performance. Before the sale started, mysql query cache was turned off. On seeing the slowdown alert, we turned on the mysql query cache and saw an immediate improvement in performance. Both avg response per minute and standard deviation per minute improved – for some time.
However, standard deviation – higher number indicates some visitors faced slower response to the website than others – deteriorated with time. After the sale (10pm local time), mysql query cache was turned off – resulting in improved standard deviation, but slightly worse avg response time.
The following graph plots AWS autoscale instances over the same 24 hour period. A time based minimum 5 instances was planned during the sale from 9.15 AM to 7.30PM. At the peak 7 instances were required – our autoscale policy adds 2 instances and removes 1 at a time.
luroConnect Architecture for AWS and controls
Our standard scalable architecture uses nginx as a load balancer. The communication between the load balancer to php servers is using fcgi – individual php servers do not have nginx. A NFS server is used to share folders that need sharing – media and var folders in Magento 1. Magento cache and sessions are served from redis. Code is also shared using nfs.
A custom lambda function communicates between autoscale lifecycle hooks and the nginx load balancer adding and remove instances as the AWS autoscale policy decides.
Our cache controls including the ability to turn on and off mysql cache, resize memory allocated to redis cache, resize memory allocated to php-fpm opcache.
Interested in knowing about the advanced architecture of luroConnect?
Fill the form below and we will contact you.