The challenge of choosing the best solution
O2 is a fibre and mobile operator with a value proposition focused on offering its customers excellent service at a very competitive price. The company, part of Telefónica España, has been operating since its launch in 2018 with the philosophy of being an invisible part of its customers’ lives, thanks to its commitment to providing the best fibre and mobile networks in the Spanish market, never bothering its customers with commercial calls and offering an ethical and straightforward service. At Paradigma, we are fortunate to work hand in hand with them to host the infrastructure for their online sales and contract channel solutions.
As a digital operator, O2 was challenged to make it easy for customers to order products online and to provide a highly available solution that could handle customer order intake campaigns during peak periods. To achieve this, O2 turned to Paradigma Digital to help them design a cloud solution tailored to their needs and to guide them through the deployment of a robust, reliable and elastic architecture in AWS.
Building the pillars
The main challenge we faced was to design a reliable, highly available infrastructure that could scale elastically to cope with peaks in contracting and optimise resource consumption when demand fell. For this reason, it was decided to deploy the infrastructure on AWS, following the principles and using some of the services referenced in the AWS Well-Architected Framework.
Paradigma and O2 analysed the needs to be met by the infrastructure and agreed on the technology stack on which to deploy the solution, focusing on operational excellence and automating the deployment of the solution components.
Aligned with the pillars of the AWS Well-Architected Framework, a solution was proposed that included:
- Deployment of infrastructure as code.
- Integration of the code with version control tools.
- Integration of solution deployment with continuous integration and deployment tools.
- Infrastructure monitoring and business metrics.
- Deployment of solution security tools.
- Solution scalability and high availability.
A quality solution
The solution was designed with three different, logically separated environments. There are two pre-production environments where new functionality is developed, tested and deployed. Before any functionality is deployed into production, an exhaustive battery of tests and code quality analysis is performed to ensure that all new functionality meets security and quality standards.
Once the component to be promoted to production has been analysed, the new functionality is deployed to production in an agile and efficient manner.
Infrastructure implementation details
O2 has built the infrastructure for its platform on AWS. This solution was chosen for the high availability and scalability that cloud infrastructures offer. The cloud model provides multiple managed, scalable services, which makes ease of maintenance and availability major advantages.
The infrastructure has three environments that are logically separated at the network level, which keeps the production environment isolated from the non-production environments.
The pre-production environments are smaller replicas of the production environment, with virtually all services deployed on a smaller scale. This is critical for the development team to be able to test the new functionality they are developing in environments that are as similar as possible to production.
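As an illustration of how a smaller replica can be derived from the production definition when infrastructure is described as code, here is a minimal Python sketch. The class `EnvironmentSpec`, the function `scaled_replica`, the CIDR ranges, the service names and all sizes are hypothetical and chosen only for the example; the actual tooling and values used for O2 are not described here.

```python
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class EnvironmentSpec:
    """Hypothetical sizing parameters for one environment."""
    name: str
    vpc_cidr: str
    asg_min_size: int
    asg_max_size: int
    services: Tuple[str, ...]


# Illustrative production values; not O2's real configuration.
PRODUCTION = EnvironmentSpec(
    name="production",
    vpc_cidr="10.0.0.0/16",
    asg_min_size=4,
    asg_max_size=20,
    services=("bastion", "asg", "load_balancer", "rds", "elasticache"),
)


def scaled_replica(base: EnvironmentSpec, name: str, vpc_cidr: str,
                   factor: float) -> EnvironmentSpec:
    """Derive a smaller copy of `base` that keeps the same set of services."""
    return replace(
        base,
        name=name,
        vpc_cidr=vpc_cidr,
        # Never scale below one instance, so every service still runs.
        asg_min_size=max(1, round(base.asg_min_size * factor)),
        asg_max_size=max(1, round(base.asg_max_size * factor)),
    )


PREPRODUCTION = scaled_replica(PRODUCTION, "preproduction", "10.1.0.0/16", factor=0.25)
```

Keeping a single definition and varying only the scale is one way to ensure that the pre-production environments stay structurally close to production.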
Common components of the environments
- A dedicated isolated (VPC) network.
- A bastion server that provides connectivity to the other instances in the environment.
- An autoscaling cluster of servers on which the application is deployed.
- A load balancer associated with a DNS record managed in Route 53, placed in front of the autoscaling clusters to distribute traffic to them.
- Databases deployed on the AWS RDS service, to which the ASG servers have access.
- A Redis cluster managed by the ElastiCache service to cache operations that will be processed asynchronously later.
Security measures on the elements of the account:
- Automatic backups of all assets with relevant information (file versioning in S3, disk copies, DB copies, etc.) using the AWS Backup managed service.
- Segregation and isolation of services and resources by environment at the network level.
- Encryption of services containing information using KMS keys.
- Detection and blocking of potentially malicious connections using WAF.
High availability and scalability with AWS
AWS services used in the solution
- Network resources and content delivery: services related to network elements, communications, resource access and publication: VPC, API Gateway, Route 53, AWS WAF, Elastic Load Balancing
- Databases: database configuration, operation and scaling services: RDS
- Computing: services for secure, capacity-adjustable workload execution: EC2, Auto Scaling, Lambda, ECR, ECS
- Storage: services designed to store files and data needed for the solution: EFS, S3, AWS Backup
- Security: services designed to maintain the security and integrity of the infrastructure: Certificate Manager, IAM, KMS, WAF
- Administration and governance: dedicated AWS account management and governance services: CloudTrail
- Cache management: cache and event queue management services for subsequent asynchronous processing: ElastiCache for Redis
Reliable, resilient, secure and cost-effective architecture
Since its inception, O2 has relied on Paradigma Digital to deliver its infrastructure and applications. Over the years, a relationship of trust has developed between Paradigma and O2, evolving from a customer-supplier relationship to a partnership.
Thanks to this model of trust, Paradigma has been able to help O2 build and then operate a robust, reliable and elastic infrastructure in the cloud on which to deploy an application that follows the same premises of reliability and efficiency.
O2’s architecture follows the principles of the AWS Well-Architected Framework and meets the standards defined in its pillars. Below are examples of how those pillars apply to the specific case of O2:
Reliability
The infrastructure deployed at O2 has the following features to ensure a reliable, resilient and fault-tolerant architecture:
- Automatic failover: The key to successful automatic failover is the monitoring and automation of recovery actions. O2 monitors specific KPIs covering both service performance and business metrics. If the defined thresholds for these KPIs are exceeded, a series of automations is executed to recover the systems from the potential failure. This combination of monitoring and automation makes it possible to anticipate failures and act before they affect the service.
- Test procedures for system recovery: At O2, performance and load tests are carried out to ensure that both the monitoring and the automated responses to failures are correctly defined and adapted to the reality of the service.
- Horizontal scaling: Horizontal scaling of resources is the main strategy for coping with peak loads or increased requests during periods of stress. The O2 infrastructure defines several auto-scaling groups (ASGs) for the different components that house each part of the solution. The machines in each group are sized for the usual system load, and the monitoring of certain KPIs triggers automations that scale the group of machines that make up each ASG.
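The horizontal scaling idea above can be sketched as a simple proportional rule: resize the group so that the monitored KPI per machine moves back toward a target value, within fixed limits. This is a minimal illustration only; the function name `desired_capacity`, the metric values and the group limits are assumptions made for the example, and it does not reproduce the scaling policies actually configured for O2 or the internal behaviour of AWS Auto Scaling.

```python
import math


def desired_capacity(current: int, metric: float, target: float,
                     min_size: int, max_size: int) -> int:
    """Proportional rule: resize so the per-machine metric moves toward `target`.

    `metric` is the observed average KPI per machine (for example CPU %),
    and `target` is the value the group should hover around.
    """
    if current < 1 or target <= 0:
        raise ValueError("current must be >= 1 and target must be > 0")
    proposed = math.ceil(current * metric / target)
    # Clamp to the limits configured for the group.
    return max(min_size, min(max_size, proposed))


# Hypothetical group of 4 machines with a 60 % CPU target and limits 2..10.
peak = desired_capacity(current=4, metric=90.0, target=60.0, min_size=2, max_size=10)
quiet = desired_capacity(current=4, metric=15.0, target=60.0, min_size=2, max_size=10)
```

Under these assumed numbers the group grows during the peak and shrinks to its minimum size when load falls, which is the behaviour that lets resource consumption follow demand.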
Security
The architecture defined for O2 follows the basic security principles defined by AWS in the AWS Well-Architected Framework. Work is currently underway to implement security measures in the architecture to comply with those defined by the major industry standards.
Identity and access management: A team of administrators is responsible for managing and validating all processes related to the registration, cancellation and modification of users. Procedures are followed to periodically review both users and their assigned privileges, always following the principle of least privilege.
Infrastructure protection: All access to the components used in the infrastructure is secured through the use of various measures:
- Security groups to restrict access to only permitted ranges of resources.
- Encryption of data in transit through the use of secure protocols and certificates on the balancers placed in front of the resource pools.
- Encryption of data at rest, whether on machine-attached disks, databases, or files stored in buckets, using the managed keys of the AWS KMS service.
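To make the security-group idea concrete, the following Python sketch models an allowlist of source ranges and ports and checks whether a connection would be admitted. It is a simplified model built on the standard `ipaddress` module, not the AWS API; the `IngressRule` type, the `is_allowed` function, the CIDR ranges and the ports are all hypothetical.

```python
from ipaddress import ip_address, ip_network
from typing import Iterable, NamedTuple


class IngressRule(NamedTuple):
    """One allow rule: traffic from `cidr` may reach `port`."""
    cidr: str
    port: int


def is_allowed(rules: Iterable[IngressRule], source_ip: str, port: int) -> bool:
    """Return True only if some rule permits this source and port (default deny)."""
    addr = ip_address(source_ip)
    return any(port == rule.port and addr in ip_network(rule.cidr) for rule in rules)


# Hypothetical rules: SSH only from the bastion subnet,
# application traffic only from the load balancer subnet.
APP_SERVER_RULES = [
    IngressRule(cidr="10.0.1.0/24", port=22),
    IngressRule(cidr="10.0.2.0/24", port=8080),
]
```

With these assumed rules, `is_allowed(APP_SERVER_RULES, "10.0.1.17", 22)` is accepted, while the same port from an address outside the listed ranges is rejected because nothing matches.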
Threat detection: O2’s infrastructure is protected by the AWS WAF managed service, which detects and blocks malicious access attempts to the various resources that O2 exposes to the Internet.
Data protection: O2 has a defined backup strategy for the various services used by the infrastructure. This point is implemented by using the strategies defined by the AWS Backup service.
Operational excellence
In keeping with the best practices defined by the AWS Well-Architected Framework, O2’s infrastructure has been defined to follow the operational excellence pillar through the following points:
- Run operations as code: The entire O2 infrastructure is deployed as code to ensure the integrity and availability of resources in the event of a recovery scenario.
- Frequent refinement of operational procedures: The operational procedures executed on the infrastructure are continuously reviewed and modified by the team of administrators who execute them. This is an essential part of the service to ensure efficiency and adaptability to infrastructure and customer needs and requirements.
- Anticipating failures: O2 has extensive monitoring of all services and resources that are part of the infrastructure. The monitoring is reviewed, refined and modified according to project requirements and infrastructure changes in order to detect and anticipate any infrastructure failures that may occur.
- Use of managed services: The infrastructure deployed at O2 relies on the use of managed services and automated processes to deliver operational excellence. Examples include:
  - AWS Backup: Managed service for backups and backup recovery testing.
  - AWS WAF: Managed service for detecting and blocking suspicious or malicious requests.
  - AWS RDS: Managed database service with a choice of database engines based on the project requirements.
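The monitoring described under "Anticipating failures" depends on deciding when a KPI has really crossed its threshold rather than spiked once. One common approach, sketched below in Python, is to raise an alarm only when at least M of the last N datapoints breach the threshold. The function `alarm_breached`, the latency values and the threshold are assumptions for illustration; the specific KPIs and alarm settings used for O2 are not described here.

```python
from typing import Sequence


def alarm_breached(datapoints: Sequence[float], threshold: float,
                   evaluation_periods: int, datapoints_to_alarm: int) -> bool:
    """Return True if at least `datapoints_to_alarm` of the most recent
    `evaluation_periods` datapoints are above `threshold`."""
    if not 1 <= datapoints_to_alarm <= evaluation_periods:
        raise ValueError("need 1 <= datapoints_to_alarm <= evaluation_periods")
    recent = datapoints[-evaluation_periods:]
    breaching = sum(1 for value in recent if value > threshold)
    return breaching >= datapoints_to_alarm


# Hypothetical response-time samples in milliseconds, oldest first.
latency_ms = [120, 130, 480, 510, 90, 530]
sustained = alarm_breached(latency_ms, threshold=400,
                           evaluation_periods=5, datapoints_to_alarm=3)
```

Requiring several breaching datapoints trades a slightly slower reaction for fewer false alarms, which matters when the alarm triggers an automated recovery action.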
Robust, flexible and highly available architecture
In conclusion, the architecture used for O2 is an example of robustness and resilience, where the high availability and elasticity of the solution combine to provide service even at times of peak load, making it a reliable and adaptable solution to the different needs of the project at each stage.
“The success is due to the commitment and knowledge of the Paradigma team for O2, who have worked with us as if they were part of our team.”
Ignacio Ceña Tutor
Operations Manager, O2 Spain