
Crafting an architecture solution at a tech interview

From time to time at technical interviews, you can face architecture tasks. The main aim of such tasks is to find out whether you can create scalable and solid solutions with attention to bottlenecks. Moreover, such a task gives the interviewer an opportunity to understand how familiar you are with modern tools and frameworks. In addition, in the real world this type of problem comes with a lot of unclear/unknown facts and numbers, and you should expect the same in the interview task. I have faced such a task and am going to show an example of a possible solution.

What is the task?

We need to create a service that "records" user actions on customers' sites and then replays them on the back-office side of the service, so customers can understand how to improve the UX of their sites. The service should be ready to track 100K sites (the level of the top services with the same functionality).

A simple version of the recording feature looks like an easy task at first glance. It is not.

High-level feature requirements

  • The customer uses the service to record user actions on a target website.
  • The customer should be able to replay the recorded user actions on the service side at any time.
  • The service should collect as many actions from the website as possible (mouse events, user inputs).
  • There is no information about how many users and recordings the service must handle. To be defined. (This is one of the main challenges of such tasks.)

Collecting events from customer’s site

A JS script can be used to capture and send user actions. The customer should include it on the tracked site.

The JS script should track and transmit the following events:

  • Mouse and keyboard events over the whole document (e.g. onmousemove).
  • Input field values (jQuery's bind function can sniff them).
  • DOM changes, tracked with a MutationObserver on the page's elements.
  • The end of a user session (an inactivity timer and the onbeforeunload event).

All these events can be easily captured via JS and sent to the service. Events should be throttled (sending at most one every ~50-100 ms is fine), which reduces the number of events and decreases the browser load.
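As an illustration, a minimal capture script could look like the sketch below (the collector URL, field names, and the one-second flush interval are assumptions, not part of the original task):

```javascript
// A minimal event-capture sketch: throttle mouse moves to ~10/s, record inputs,
// and flush batches to a hypothetical collector endpoint with a per-session id.
(function () {
  var sessionId = Math.random().toString(36).slice(2); // simplistic session hash
  var buffer = [];
  var lastMove = 0;

  function record(event) {
    buffer.push(event);
  }

  document.addEventListener('mousemove', function (e) {
    var now = Date.now();
    if (now - lastMove < 100) { return; }   // throttle to one event per ~100 ms
    lastMove = now;
    record({ type: 'mousemove', x: e.clientX, y: e.clientY, t: now });
  });

  document.addEventListener('input', function (e) {
    record({ type: 'input', target: e.target.name || e.target.id, value: e.target.value, t: Date.now() });
  }, true);

  window.addEventListener('beforeunload', function () {
    record({ type: 'session_end', t: Date.now() });
    flush();
  });

  function flush() {
    if (buffer.length === 0) { return; }
    var payload = JSON.stringify({ session: sessionId, page: location.href, events: buffer });
    buffer = [];
    if (navigator.sendBeacon) {
      // sendBeacon survives page unload; fall back to an async request otherwise.
      navigator.sendBeacon('https://collector.example.com/collect', payload);
    } else {
      var xhr = new XMLHttpRequest();
      xhr.open('POST', 'https://collector.example.com/collect', true);
      xhr.setRequestHeader('Content-Type', 'application/json');
      xhr.send(payload);
    }
  }

  setInterval(flush, 1000); // flush buffered events once per second
})();
```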

Replaying events by user request

  1. After picking up all the session's events, load the site in a headless browser on the backend side.
  2. Put a transparent canvas in the foreground over the whole web page.
  3. Emulate the recorded actions:
    3.1. Emulate mouse moves by moving an image of a mouse cursor drawn on the canvas according to the saved events.
    3.2. Clicks on hyperlinks can be reproduced by loading the target page in the headless browser.
    3.3. Input values can be emulated via JS by filling in the input fields.
  4. Take screenshots every ~50-100 ms (approximately 10-20 screenshots per second) via the API of the headless browser during the emulation process. The collected images will be used to produce a video showing the user's actions on the web page in the customer's back office.

PhantomJS provides a headless browser that can load the target web page, manipulate its DOM, and take screenshots of the loaded page. A video can then be created with a server-side utility (e.g. ffmpeg with the command ffmpeg -f image2 -i image%d.jpg video.mpg).
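A sketch of such a replay script for PhantomJS is shown below (an assumed implementation, not the exact one: the CLI arguments, the events.json format, and the cursor overlay are illustrative):

```javascript
// replay.js - a minimal PhantomJS replay sketch.
// Usage: phantomjs replay.js <pageUrl> <workDir>
// Expects <workDir>/events.json with the recorded events; writes frames to <workDir>.
var system = require('system');
var fs = require('fs');
var page = require('webpage').create();

var pageUrl = system.args[1];
var workDir = system.args[2];
var events = JSON.parse(fs.read(workDir + '/events.json'));
var frame = 0;

page.viewportSize = { width: 1280, height: 720 };

page.open(pageUrl, function (status) {
  if (status !== 'success') { phantom.exit(1); }

  // Draw a fake cursor as an overlay element on top of the page.
  page.evaluate(function () {
    var cursor = document.createElement('div');
    cursor.id = '__replay_cursor';
    cursor.style.cssText = 'position:fixed;z-index:99999;width:12px;height:12px;' +
                           'background:red;border-radius:50%;pointer-events:none;';
    document.body.appendChild(cursor);
  });

  var timer = setInterval(function () {
    var e = events.shift();
    if (e && e.type === 'mousemove') {
      // Move the fake cursor according to the saved coordinates.
      page.evaluate(function (x, y) {
        var c = document.getElementById('__replay_cursor');
        c.style.left = x + 'px';
        c.style.top = y + 'px';
      }, e.x, e.y);
    } else if (e && e.type === 'input') {
      // Fill input fields via JS, as described above.
      page.evaluate(function (name, value) {
        var el = document.querySelector('[name="' + name + '"]');
        if (el) { el.value = value; }
      }, e.target, e.value);
    }

    // ~10 screenshots per second; later: ffmpeg -f image2 -i image%d.jpg video.mpg
    page.render(workDir + '/image' + (frame++) + '.jpg');

    if (events.length === 0) {
      clearInterval(timer);
      phantom.exit();
    }
  }, 100);
});
```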

Architecture solution

Let's try to estimate the possible load for such a service. There are no load requirements in the original task, so let's assume a load similar to that of one of the most popular services with the same functionality (~100K sites).

Assumptions about load

  • Let's use the 80/20 principle and assume that ~20K of the sites are in the Alexa top 1M.
  • Sites in the Alexa top 1M are visited ~10-100K times per day.
  • Assume that, on average, these 20K sites have 50K users per day.
  • The other 80K sites have significantly fewer visitors; let's take 50 users/day on average.
  • Like similar services, let's provide a free plan allowing a customer to capture 100 sessions per account.
  • Let's collect events once per 100 ms.
  • The service should be ready to scale to a larger number of processed sites and more stored sessions for non-free plans.

Peak load

  • Every site from the top 20K produces ~0.6 sessions/s (50K visitors per day / (24 * 3600 s)).
  • The top 20K sites in total produce: 0.6 sessions/s * 20K sites = ~12K sessions/s.
  • Each visitor produces at most 10 events/s (one per 100 ms): 12K sessions/s * 10 events/s = ~120K incoming events/s.
  • The remaining 80K sites produce: 80K sites * 50 sessions/day / (24 * 3600 s) = ~47 sessions/s => 47 sessions/s * 10 events/s = ~470 events/s.
  • In total, the peak request rate is ~120.5K requests/s (still ~120K requests/s).
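For reference, a tiny script reproducing the back-of-envelope arithmetic above (the figures and the 10 events/s rate are the assumptions already listed, not measurements):

```javascript
// Back-of-envelope peak-load arithmetic, mirroring the estimates above.
const secondsPerDay = 24 * 3600;

const topSites = 20000;
const topVisitorsPerDayPerSite = 50000;
const topSessionsPerSecond = topSites * topVisitorsPerDayPerSite / secondsPerDay; // ~11.6K

const eventsPerSecondPerSession = 10; // one event per 100 ms
const topEventsPerSecond = topSessionsPerSecond * eventsPerSecondPerSession;      // ~116K

const restSites = 80000;
const restSessionsPerDayPerSite = 50;
const restEventsPerSecond =
  restSites * restSessionsPerDayPerSite / secondsPerDay * eventsPerSecondPerSession; // ~460

// ~116K exactly; ~120K with the rounding (0.6 sessions/s per site) used above.
console.log(Math.round(topEventsPerSecond + restEventsPerSecond));
```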

What about the data?

Let's estimate the size of an event object:

  • session hash (e.g. a UUID) - 16 bytes
  • page address - ~500 bytes
  • the event itself (mouse event, click, input, session termination) - ~100 bytes
  • other possible information, object structure overhead

So, let's roughly estimate the size of one event as ~1 KB.
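For illustration, one buffered event might look roughly like this (the field names are assumptions consistent with the size estimate above):

```javascript
// An assumed shape of a single event; serialized with metadata it is roughly ~1 KB.
var exampleEvent = {
  session: '550e8400-e29b-41d4-a716-446655440000',                 // session hash (UUID)
  page: 'https://tracked-site.example.com/catalog/item?id=42',     // page address, up to ~500 bytes
  event: { type: 'mousemove', x: 312, y: 540, t: 1524000000000 },  // the event itself, ~100 bytes
  meta: { userAgent: 'Mozilla/5.0 ...', viewport: { w: 1280, h: 720 } } // other information
};
```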

Incoming peak traffic will be 1 KB * 120K events/s = 120 MB/s.
With an unlimited number of collected sessions, that gives us ~10 TB of incoming data per day.

In the case of a limit of 100 sessions per site:

Assume that, on average, users clean up old records once a day; then:

  • Daily sessions: 100K sites * 100 sessions/day = 10M sessions/day.
  • Average time a user spends on a site: ~1 min.
  • Events per session: ~60 s * 10 events/s = 600 events per session.
  • Daily load: 10M sessions/day * 600 events = 6 BN events/day (~65-70K events/s).
  • Data size per day: 6 BN * 1 KB = 6 TB of incoming data/day (~70 MB/s).
  • The peak of incoming traffic is still ~120 MB/s.

Incoming events should be considered temporary data needed only for creating videos, which means there is no need to persist the events themselves. Videos should be persisted in a disk storage system. The path to the video and some basic information are the only data persisted in the database.

An in-memory database (e.g. Redis) is the most suitable place for temporarily storing events. 120 MB/s (120K events/s) is a normal load for an in-memory database hosted on one medium-level server. If the maximum lifetime of stored events is ~1 hour, the in-memory database should have a capacity of 120 MB/s * 3600 s ~ 500 GB.
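A minimal sketch of that temporary storage, assuming Redis with the node-redis (v4) client; the key names and the 1-hour TTL follow the assumptions above:

```javascript
// Buffer events in Redis keyed by session id; unfinished sessions expire after ~1 hour.
const { createClient } = require('redis');

const client = createClient();           // in practice, a shared, long-lived connection
const connected = client.connect();

async function storeEvents(sessionId, events) {
  await connected;
  const key = `session:${sessionId}:events`;
  await client.rPush(key, events.map(e => JSON.stringify(e))); // append the batch
  await client.expire(key, 3600);                              // refresh the ~1 h lifetime
}

async function readEvents(sessionId) {
  await connected;
  const raw = await client.lRange(`session:${sessionId}:events`, 0, -1);
  return raw.map(s => JSON.parse(s));
}
```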

Let's estimate the amount of data written at the database layer:

Session object size:

  • path to the video - ~200 bytes
  • session id - 16 bytes
  • some additional info (e.g. date stamps)

In total, let's consider it ~0.5 KB.
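As an illustration, a persisted session record could look like this (the names are assumptions; the fields follow the list above):

```javascript
// An assumed shape of the persisted session metadata, roughly ~0.5 KB per record.
var sessionRecord = {
  id: '550e8400-e29b-41d4-a716-446655440000',                                // session id
  video: '/mnt/video-ebs-1/videos/550e8400-e29b-41d4-a716-446655440000.mpg', // path to the video
  site: 'tracked-site.example.com',
  createdAt: '2018-04-20T10:15:30Z',                                         // date stamps, etc.
  durationSeconds: 62
};
```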

Incoming daily data on the database side: 10M sessions/day * 0.5 KB = ~5 GB/day.
The stored amount will not grow much the next day due to the 100-session limit; deleted sessions free disk space.
Amount of data written per second: ~100 sessions/s => ~50 KB/s written at the DB layer.

These are very comfortable numbers for the DB layer. A regular server can manage the incoming data.

Disk space consumed by video:

  • Let's consider the average size of a 1-minute session video to be ~5 MB (it could be less; video compression algorithms need to be investigated).
  • Daily data: 10M sessions * 5 MB = 50 TB/day

It is a big amount of data; however, it will not keep growing, due to the 100-sessions-per-site limit.

Summary of math

  • Servers: able to process 120K requests/s
  • In-memory database: capacity 500 GB, write speed 120 MB/s (120K cache puts per second) plus approximately the same read speed
  • Database: ~5 GB/day to start with, not increasing heavily afterwards
  • Disk: starting amount ~50 TB, not increasing heavily, write throughput ~600 MB/s

The real-life load will be several times lower, but the setup should be ready to increase capacity up to these numbers.

An app-layer server could be built on the Disruptor framework. From my experience, it can serve about 15K requests/s with complex business logic. Here the logic is quite simple, so performance could be much higher (I would expect ~50K events/s, and some optimisation could increase it significantly).

The in-memory database can handle the estimated load: Redis hosted on a medium-level server can process up to 500K puts/gets per second.

Database: no worries. We are safe.

Disk storage: the required write throughput (~600 MB/s) can be reached by routing the writing processes to different disks.

Algorithm of processing

  1. Incoming events are picked up by the app servers, which put them into the in-memory DB using the session id as a key.
  2. Once an end-session event is received, the app server fetches all the corresponding events from the in-memory DB by session id, packs them into a job object, and puts it into a job queue.
  3. Worker servers pick up the job objects, load the website in a headless browser, emulate the user actions, take screenshots, and render a video to disk storage.
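A rough sketch of a worker's job handler under these assumptions (readEvents and saveSessionMeta are hypothetical helpers, replay.js is the PhantomJS script sketched earlier, and the CLI contracts are assumed):

```javascript
// Worker-side processing of one job: events -> frames -> video -> metadata.
const fs = require('fs');
const { execFileSync } = require('child_process');

async function processJob(job) {
  // 1. Collect all buffered events for the finished session from the in-memory DB.
  const events = await readEvents(job.sessionId);                  // hypothetical helper

  // 2. Hand them to the headless browser, which replays them and dumps frames.
  const workDir = `/tmp/replay-${job.sessionId}`;
  fs.mkdirSync(workDir, { recursive: true });
  fs.writeFileSync(`${workDir}/events.json`, JSON.stringify(events));
  execFileSync('phantomjs', ['replay.js', job.pageUrl, workDir]);  // assumed CLI contract

  // 3. Render the ~10-20 frames/s into a video on the disk assigned to this job.
  const videoPath = `${job.disk}/videos/${job.sessionId}.mpg`;
  execFileSync('ffmpeg', ['-f', 'image2', '-i', `${workDir}/image%d.jpg`, videoPath]);

  // 4. Persist only the metadata; the buffered events can now expire.
  await saveSessionMeta({ id: job.sessionId, video: videoPath });  // hypothetical DB call
}
```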

Architecture of service

According to the constraints described above, the service architecture should have 5 main components:

  1. App servers for processing incoming events (scalability important)
  2. In-memory DB (scalability important)
  3. Worker servers to create videos (scalability important)
  4. Disk storage for created videos (scalability & resilience important)
  5. Database to store meta info about videos (resilience important)

How to connect all these components?

  1. All incoming requests should be routed by a load balancer between the app servers.
  2. According to the math, 1-2 active servers will be enough to process all incoming events (to be validated with load tests). They can be scaled easily.
  3. To make the in-memory database scalable, solutions like Redis Cluster or Amazon ElastiCache can be used. They take care of scalability and resilience out of the box.
  4. A message queue can store the prepared job objects.
  5. Stateless workers can be easily scaled to process the incoming jobs, creating videos from the collected events.
  6. The database layer storing metadata about videos is not a part of the system where high load is expected. A usual database server should be enough to store all the expected data with a huge reserve.
  7. The load on the disk storage system should be split, because the potential write speed can be several times higher than the write speed of a modern SSD. Amazon EBS can be used, where max throughput can be scaled up automatically to 1.2 GB/s. The same solution can also be organised without Amazon: the main idea is to split the writing threads across several physical disks. This can be done at the moment a job is put into the message queue (just put the disk name into a field of the job object, as in the sketch below) or by using a load balancer between the worker servers and the disk storage system.
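For instance, the disk assignment could be a simple round-robin at enqueue time (a sketch; the mount paths and job fields are illustrative):

```javascript
// Assign a target disk when the job is enqueued, so video writes are spread
// across several volumes. Paths and job fields are assumptions for this sketch.
const DISKS = ['/mnt/video-ebs-1', '/mnt/video-ebs-2', '/mnt/video-ebs-3'];
let next = 0;

function buildJob(sessionId, pageUrl) {
  const disk = DISKS[next % DISKS.length];
  next += 1;
  return { sessionId, pageUrl, disk }; // the worker writes the video under job.disk
}
```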

[Diagram: service architecture]

Resilience is critical for the database and the disk storage system (it can easily be achieved by using replication servers); the other components are not as critical. In the worst case, the user sessions currently being processed will be lost. Considering the limit of 100 sessions per site, this is not critical for the system: there is no goal to save every user session; the service just has to provide a set of sessions for analysis. Lost sessions will be replaced with new incoming ones once the service components are restarted.

In conclusion, I'd like to mention that the calculated numbers make it possible to identify the bottleneck paths and to design an architecture that takes those paths into account. However, to estimate the exact number of servers in the application cluster, the number of cache instances, and the capacity and latency of the system, some experimental research should be done. Please remember that the peak values were calculated with a pessimistic load scenario (no unused accounts, no dead/unvisited sites, customers capturing sessions all the time, etc.).

During preparation, I drew on knowledge obtained from:

highscalability.com