Google Zero Trust Architecture Practice

Background introduction

The author of this article is Chen Zhijie, who was fortunate enough to take part in the theory and practice of zero trust in Google's production environment (Zero Trust in Production Environments) from 2015 to 2020. The Binary Authorization for Borg (BAB) system developed in this context has achieved full coverage of Google's production environment: before anyone can run any software package as any service in production, a sufficiently strong BAB security policy must be established for the target service. Programs that do not comply with the BAB security policy are not allowed to run as the corresponding service.

In the process of implementing and promoting this zero-trust production environment, the BAB team took many detours but also gained a great deal of experience. Starting in 2017, the team began to turn this practical experience into theory and successively released a series of white papers (BeyondProd, Binary Authorization for Borg, SLSA: Supply-chain Levels for Software Artifacts), books (Building Secure and Reliable Systems) and reports (Evolve to zero trust security model with Anthos security, Zero Touch Prod). At the same time, the zero-trust concept has been extended to more application scenarios, including zero trust on the public cloud, zero trust in the public cloud's own infrastructure, and zero trust in the development of Android and Chrome themselves and of their apps.

The content shared this time is all based on the above information that Google has made public, and does not leak Google company secrets or violate any confidentiality agreement. The conclusions in this article represent only the personal views of the author and are not necessarily the official views of Google.

What is zero trust

What is zero trust? Different people are likely to give different answers. Some say zero trust is workload micro-segmentation; some say it is continuous threat monitoring; some say it is replacing trust in the network with trust in endpoints (Trust endpoints, not the network); others say it is mutual TLS authentication (mTLS). All of these make sense, but none of them is the whole picture.

Here, the author would like to borrow a joke about machine learning: "Machine learning is glorified statistics." Similarly, zero trust is glorified least privilege. This is obviously also inaccurate, because machine learning and zero trust place more emphasis on specific problems and application scenarios than statistics and least privilege do. Neither machine learning nor zero trust is a single discipline or theory; each is a set of innovative practices that draws on multiple fields (such as mathematics, computer architecture, distributed systems, storage, and networking) and was developed to solve practical problems.

It follows that, for zero trust, we need not be too rigid about definitions and theory; we should instead combine it with practice and start from the specific problems and application scenarios to be solved. For the problems and application scenarios in this article, we therefore provisionally define zero trust as: starting from the data and permissions to be protected, the comprehensive reduction and reconstruction of trust in the production environment (the comprehensive reduction and reconstruction of trust in production with regard to protected data and privileges).

A more important question than how to define zero trust is: why do we need zero trust?

Why do we need zero trust?

Why do we need zero trust? The most fundamental reason is that we have important data and permissions that need to be protected, and the existing security system can no longer protect them adequately in the cloud-native era. Users' private personal data, employee salary data, permission to change passwords, permission to shut down key systems, and so on are all objects to be protected. At first, such data and permissions were protected by network access control based on perimeter security, such as the company intranet and VPN: once a device was connected to the company intranet, it inherited the right to obtain this data and these permissions. Later, people introduced access control based on IP addresses or on usernames and passwords, as well as finer-grained access control based on identity and role (Identity and Role). However, none of these can meet the requirements for protecting data and permissions in a cloud-native environment, especially in an enterprise IT system as complex as Google's.

Cloud native brings the following challenges. First, enterprise IT systems have evolved from a single data center model to a multi-cloud and hybrid-cloud (multi and hybrid cloud) model, which blurs the perimeter and makes perimeter-based security ineffective. Second, container-based computing makes the host uncertain: the same service may be migrated from one physical host to another at any time without going offline, so host-based security protection cannot go further and deliver defense in depth based on service logic. Third, microservices and fine-grained APIs increase the attack surface: identity and role are no longer just the identity and role of the end user, and fine-grained identity and access control for the interactions between microservices becomes even more important.

Consider Google's situation at the time. Around 2015, we had already implemented internally a relatively mature permission management system based on keys and identity authentication, but two security threats caught our attention. First, in the post-Snowden era, how do machines that communicate with each other inside the data center establish mutual trust, and how can we ensure that a compromised host or compromised service does not affect other parts of the production environment? Second, given a sound authorization and audit system for one-off access to data and permissions, how do we deal with the risk of large-scale data leakage posed by batch access to data and permissions, as in machine learning? It became urgent to establish an insider risk management and control system based on zero trust.

Of course, all of this rests on solid traditional security fundamentals: network and host protection systems, trusted boot, key management systems, identity authentication and management systems, and so on. Zero trust is a superstructure built on top of these security fundamentals.

The three elements of Zero Trust: Chain of Trust, Identity 2.0 and Continuous Access Control

Chain of trust, identity 2.0 and continuous access control are the three major elements of zero trust.

Chain of trust

Zero trust does not mean the complete absence of trust. It means explicitly rebuilding the chain of trust (Chain of Trust) starting from a small number of basic, minimal roots of trust (Root of Trust). Typical examples include: multi-factor authentication (MFA) is the root of trust for a person's identity; the Trusted Platform Module (TPM) and trusted boot (Trusted Boot) are the root of trust for a machine's identity; source code and trusted build (Trusted Build) are the root of trust for software. Trust in a huge IT system starts from these most basic roots of trust and is extended through a series of standardized processes into a complete chain of trust (some call it a Tree of Trust or a Web of Trust).
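To make the idea of extending trust link by link concrete, here is a minimal sketch. It is not how TPMs or trusted boot actually work: real systems use asymmetric signatures and hardware-protected keys, whereas this sketch uses HMAC with shared secrets from the Python standard library only to stay self-contained. The names Link, sign and verify_chain are hypothetical.

```python
import hashlib
import hmac
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Link:
    """One step in the chain: `issuer` vouches for `subject` (hypothetical model)."""
    subject: str        # entity being vouched for, e.g. "bootloader"
    subject_key: bytes  # key the subject uses to vouch for the next link
    issuer: str         # entity that vouches for the subject
    tag: bytes          # HMAC by the issuer over (subject, subject_key)


def sign(issuer_key: bytes, subject: str, subject_key: bytes) -> bytes:
    """Assumption: HMAC-SHA256 stands in for a real asymmetric signature."""
    message = subject.encode() + b"\x00" + subject_key
    return hmac.new(issuer_key, message, hashlib.sha256).digest()


def verify_chain(root_name: str, root_key: bytes, chain: Sequence[Link]) -> bool:
    """Walk the chain from the root of trust; every link must be vouched for
    by the entity verified in the previous step."""
    issuer, key = root_name, root_key
    for link in chain:
        if link.issuer != issuer:
            return False
        expected = sign(key, link.subject, link.subject_key)
        if not hmac.compare_digest(expected, link.tag):
            return False
        issuer, key = link.subject, link.subject_key
    return True
```

The point of the sketch is that nothing is trusted by default: a link is accepted only if it can be traced back, step by step, to the single root the verifier already holds.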

Identity 2.0

Identity 2.0 is the standardization of the chain of trust described above, so that the information collected while trust is being established can be used in security access policies. In Identity 2.0, every entity has an identity: users have user identities, employees have employee identities, machines have machine identities, and software has software identities. In Identity 2.0, every access also carries multiple identities (also known as the Access Context). For example, an access to a row of data in a database carries an access context such as: "In order to help the user solve a technical problem, employee A requested access on machine D through software C with the authorization of user B."
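As an illustration only, an access context of this kind could be modeled as a small record that carries every identity involved in one access. The class and field names below are hypothetical and do not reflect Google's internal format.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass(frozen=True)
class AccessContext:
    """All identities attached to a single access (hypothetical field names)."""
    employee: str       # who is asking
    end_user: str       # whose data is touched, and who authorized it
    software: str       # which software issues the request
    machine: str        # which machine the request comes from
    justification: str  # why the access is needed

    def describe(self) -> str:
        return (
            f"In order to {self.justification}, employee {self.employee} "
            f"requested access on machine {self.machine} through software "
            f"{self.software} with the authorization of user {self.end_user}."
        )


def missing_identities(ctx: AccessContext) -> List[str]:
    """Names of the fields that are empty; a policy could reject such requests."""
    return [f.name for f in fields(ctx) if not getattr(ctx, f.name)]
```

Keeping the identities as separate, structured fields (rather than one opaque principal) is what lets an access policy refer to each of them individually.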

Continuous access control

With the rich identity and access context information provided by Identity 2.0, we can build a continuous access control system (Continuous Access Control) on top of it. Continuous access control applies access control at every stage of software development and operation. Typical examples include: requiring multi-factor authentication when employees log in; requiring, when software is deployed, that it was built in a secure environment from a trusted source code repository and has undergone code review (Code Review); requiring both parties to present proof of host integrity when a connection is established between microservices; and requiring the user's authorization token (Authorization Token) when a microservice obtains a specific user's data.
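The four examples above can be sketched as a table of checks keyed by stage, with default deny for anything not listed. This is a simplified illustration, not Google's policy engine; the stage names, the Request fields and the allow function are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Request:
    """Evidence attached to one access attempt (hypothetical fields)."""
    stage: str  # "login", "deploy", "rpc" or "data_access"
    mfa_passed: bool = False
    built_from_trusted_source: bool = False
    code_reviewed: bool = False
    peer_host_attested: bool = False
    has_user_token: bool = False


Check = Callable[[Request], bool]

# One list of checks per stage, mirroring the examples in the text.
POLICY: Dict[str, List[Check]] = {
    "login": [lambda r: r.mfa_passed],
    "deploy": [lambda r: r.built_from_trusted_source, lambda r: r.code_reviewed],
    "rpc": [lambda r: r.peer_host_attested],
    "data_access": [lambda r: r.has_user_token],
}


def allow(request: Request) -> bool:
    """Allow only if every check for the request's stage passes."""
    checks = POLICY.get(request.stage)
    if checks is None:
        return False  # assumption: unknown stages are denied by default
    return all(check(request) for check in checks)
```

For example, allow(Request("deploy", built_from_trusted_source=True)) is False because the code review evidence is missing, even though the build itself is trusted.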

Zero Trust Deployment Example

In this section, we provide two specific zero-trust deployment examples: the first is how users obtain their own data, and the second is how developers change the data access behavior in the production environment by modifying the source code. Google's data access control in both cases follows the principle of zero trust.

Users access their own data

When a user accesses their data through Google's services, the request first reaches the Google Front End (GFE) over an encrypted connection (TLS) between the user and the GFE. The GFE then switches to more efficient and more secure protocols and data structures and distributes the request to the various back-end services that jointly complete it. For example, TLS is replaced by Application Layer TLS (ATLS), and user-facing passwords are replaced by more secure End User Context Tickets (EUC). These substitutions are designed to reduce the privileges of internal connections and tokens to what the actual request needs, so that a given ATLS connection and EUC ticket can only reach the data and permissions required by that request.
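The idea of replacing a long-lived, broadly usable credential with a short-lived ticket bound to one user and one purpose can be sketched as follows. This is not the real EUC format, which is not described here; the payload layout, the HMAC construction and the function names mint_ticket and check_ticket are assumptions chosen to keep the example self-contained.

```python
import hashlib
import hmac
import json
import time
from typing import Optional


def mint_ticket(key: bytes, user: str, scope: str, ttl_s: float,
                now: Optional[float] = None) -> str:
    """Issue a ticket valid for one user, one scope and a short time window."""
    now = time.time() if now is None else now
    payload = json.dumps({"user": user, "scope": scope, "exp": now + ttl_s},
                         sort_keys=True)
    tag = hmac.new(key, payload.encode(), hashlib.sha256).hexdigest()
    return payload + "." + tag


def check_ticket(key: bytes, ticket: str, user: str, scope: str,
                 now: Optional[float] = None) -> bool:
    """Accept the ticket only for the exact user and scope it was minted for."""
    now = time.time() if now is None else now
    payload, sep, tag = ticket.rpartition(".")  # the hex tag contains no dots
    if not sep:
        return False
    expected = hmac.new(key, payload.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, tag):
        return False  # forged or tampered ticket
    claims = json.loads(payload)
    return (claims["user"] == user
            and claims["scope"] == scope
            and now < claims["exp"])
```

A back end that receives such a ticket learns nothing reusable: the ticket cannot be presented for another user, another scope, or after it expires.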

(Figure from the BeyondProd white paper. The "Developer" shown in the figure should actually be the "User".)

Developers change software data access behavior

When a developer wants to change a service's data and permission access behavior by changing the service's code (developers cannot obtain any command-execution access, such as SSH, on production hosts), the code change goes through a series of processes that transform trust in the developer and the developer's team into trust in the new service. Everyone involved in the process establishes their identity through multi-factor authentication, and the code change is reviewed by one or more people with approval authority. Only after enough approvals have been obtained is the code merged into the central code base. Code in the central code base is built, tested and signed centrally by a trusted build service that is itself protected by BAB. During deployment, the built software must pass verification against the BAB policy of the target service (for example, the Gmail service can only run code that comes from a specific location in the code base and has been fully reviewed and tested). At run time, the software is also isolated according to the corresponding BAB security policy: services governed by different BAB service policies run in different isolation zones.
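The deployment-time check described above can be sketched as a function that compares what is known about a build against the policy of the target service. This is an illustration under stated assumptions, not BAB's actual policy language: the Provenance and ServicePolicy fields, the example path prefix, and the rule that an author's own approval does not count are all hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class Provenance:
    """What is recorded about how a binary came to exist (hypothetical fields)."""
    author: str
    approvers: FrozenSet[str]
    source_path: str
    builder: str
    tests_passed: bool


@dataclass(frozen=True)
class ServicePolicy:
    """Per-service requirements checked at deployment (hypothetical fields)."""
    allowed_source_prefix: str
    trusted_builders: FrozenSet[str]
    min_approvals: int = 1


def admit(p: Provenance, policy: ServicePolicy) -> bool:
    """Admit a binary only if it was reviewed by someone other than its author,
    comes from the allowed source location, was built by a trusted builder,
    and passed its tests."""
    independent_approvers = p.approvers - {p.author}  # no unilateral change
    return (
        len(independent_approvers) >= policy.min_approvals
        and p.source_path.startswith(policy.allowed_source_prefix)
        and p.builder in policy.trusted_builders
        and p.tests_passed
    )
```

Excluding the author from the approver count is the step that encodes the requirement that no single person can change production behavior alone.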

(Figure from the BeyondProd white paper. Please see the original white paper for details.)

Practical experience and lessons

Put people first and build trust from the process

When using BeyondCorp to implement zero trust in the development environment (Corp Environment), we put people first and conduct identity management and multi-factor authentication for employees. At the same time, we have established a process for managing company devices, and each company device is equipped with TPM modules and operating system integrity verification. These two efforts ensure that the right person, using the right device, provides trustworthy authentication information. Finally, we use this authentication information to continuously control employee access to the development environment.

When using BeyondProd to implement zero trust in the production environment (Prod Environment), we likewise try to make people and processes the foundation of trust. The problem BeyondProd faces is that the production environment has no direct interaction with people. We therefore established a mechanism for tracing software in the production environment back to its developers (Software Provenance), starting from developer authentication and the hardening of the development process, to ensure that no single person can change the behavior of software in the production environment (No Unilateral Change).

Security rule levels

Rome was not built in a day, and promoting zero trust is also a gradual process. To quantify and incentivize security improvements, we use security rule levels to measure whether one security rule is "more secure" than another. For example, in the Binary Authorization for Borg system, we introduced the following security levels:

Security level 0: No protection. This is the lowest security level and means that the service is not protected by BAB at all. This may be because the system has no sensitive permissions, or because the system uses other protection equivalent to BAB.

Security level 1: Auditable code. Security rules at this level ensure that the software used by the corresponding service is built from known source code in a secure and verifiable environment.

Security level 2: Trusted code. In addition to the guarantees of security level 1, security rules at this level ensure that the software used by the corresponding service is built from code in a specific code base (such as Gmail's own code base) that has been code reviewed and tested. As of February 2020, this level is the default protection level for all Google services.

Security level 3: Trusted code and configuration. In addition to the guarantees of security level 2, security rules at this level ensure that the configuration files used by the corresponding service have gone through the same security process as code (Configuration as Code). As of February 2020, this level is the default level for all of Google's critical protected services.
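Because each level adds guarantees on top of the previous one, the levels form a total order, which is what makes it possible to say that one rule is more secure than another. A minimal sketch of that ordering follows; the enum member names and the three boolean inputs are hypothetical simplifications of the descriptions above.

```python
from enum import IntEnum


class Level(IntEnum):
    """The four security levels from the text, ordered by strength."""
    NO_PROTECTION = 0
    AUDITABLE_CODE = 1
    TRUSTED_CODE = 2
    TRUSTED_CODE_AND_CONFIG = 3


def achieved_level(verifiable_build: bool, reviewed_code: bool,
                   reviewed_config: bool) -> Level:
    """Highest level whose requirements, and those of all lower levels, hold."""
    if not verifiable_build:
        return Level.NO_PROTECTION
    if not reviewed_code:
        return Level.AUDITABLE_CODE
    if not reviewed_config:
        return Level.TRUSTED_CODE
    return Level.TRUSTED_CODE_AND_CONFIG


def meets(required: Level, achieved: Level) -> bool:
    """A service meets its requirement if it reaches at least that level."""
    return achieved >= required
```

Note that reviewed configuration without reviewed code still yields level 1 in this sketch, since level 3 is defined as level 2 plus configuration guarantees.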

Alarm mode and authorization mode

In the process of promoting zero trust, and in order to give all parties a smooth migration experience, we do not directly prohibit every access that fails to comply with the security rules. Instead, the security rules themselves provide two modes: an alarm mode and an authorization mode. In alarm mode, an access that violates the security rules is not blocked, but it is recorded and an alarm is sent to the relevant personnel. In authorization mode, an access that violates the security rules is blocked immediately. Having both modes gives people the opportunity to iterate on and fix non-compliant behavior based on the alarms, and it also provides an effective mechanism for tightening the security rules and preventing regression once the non-compliant behavior has been eliminated.
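The difference between the two modes comes down to what happens on a violation, which can be sketched in a few lines. The Mode enum, the decide function and the use of a plain list as the alarm channel are hypothetical simplifications.

```python
from enum import Enum
from typing import List


class Mode(Enum):
    ALARM = "alarm"                  # record and notify, but let the access through
    AUTHORIZATION = "authorization"  # block the access


def decide(compliant: bool, mode: Mode, alarms: List[str], description: str) -> bool:
    """Return True if the access may proceed under the given mode."""
    if compliant:
        return True
    if mode is Mode.ALARM:
        alarms.append(f"policy violation: {description}")
        return True
    return False
```

A typical migration under this scheme would start a rule in alarm mode, work the alarm list down to empty, and then switch the same rule to authorization mode so that the fixed behavior cannot regress.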

Balance security and reliability

The complexity of zero trust means that it also brings new challenges for maintaining system reliability (Reliability). In practicing zero trust, we provide an emergency break-glass mechanism (Break-glass Mechanism) for most scenarios. This ensures that, in an emergency, operators can bypass the restrictions of the zero-trust system and perform complex emergency operations. To keep security intact, the security team receives an alarm as soon as the break-glass mechanism is invoked, and every operation performed under it is recorded in detail in the security log. These security logs are then carefully reviewed to verify that breaking the glass was necessary. They also help in designing new zero-trust features, so that the break-glass mechanism does not have to be invoked again in similar situations.
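A break-glass path can be sketched as an override that always succeeds for the operator but never happens silently. The BreakGlass class, its fields, and the rule that a non-empty justification is required are assumptions made for this illustration.

```python
import time
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class BreakGlass:
    """Emergency override that alerts the security team and logs every use."""
    alerts: List[str] = field(default_factory=list)
    security_log: List[Dict[str, object]] = field(default_factory=list)

    def invoke(self, operator: str, action: str, justification: str) -> bool:
        """Bypass normal policy for one action; return True if the bypass is granted."""
        if not justification.strip():
            return False  # assumption: a written justification is mandatory
        # Alert first, so the security team learns of the use immediately.
        self.alerts.append(f"break-glass used by {operator}: {action}")
        # Keep a detailed record for the later review described in the text.
        self.security_log.append({
            "time": time.time(),
            "operator": operator,
            "action": action,
            "justification": justification,
        })
        return True
```

The log entries are what a later review would examine, both to confirm that the bypass was necessary and to find cases that a new zero-trust feature should cover.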

Pay attention to insider risk

From a defense perspective, insider risk is a superset of external risk: once an attacker compromises the device of any insider (a legitimate user or employee), the attacker becomes an insider. So whether the threat starts as an external attacker or as an internal violator, it ultimately becomes an insider risk. From this perspective, zero trust assumes that any host can be compromised.

Security infrastructure

The implementation of zero trust relies on a solid basic security architecture. Without the foundation, there is no superstructure. Google Zero Trust relies on the following infrastructure to provide basic security:

Encryption and Key Management
Identity and Access Management
Digital Human Resources
Digital Device Management
Data Center Security
Network Security
Host Security
Container Isolation (gVisor)
Trusted Boot
Verifiable Build
Software Integrity Verification
Mutual TLS (mTLS)
Service Access Policy
End User Context Tokens
Configuration as Code
Standard Development and Deployment

Other

In addition to the lessons above, the following are also principles we have accumulated in practice, which we will not go into in detail here: start small and then iterate, defense in depth, quantify the return on security investment (Quantify return over investment), lower cost through standardization (Lowering cost through homogeneity), and shifting security left.

Conclusion

Doing zero trust well is 20% theory and 80% practice, and there is no single practical solution for zero trust. The author hopes that sharing the above example of zero trust in practice can serve as a starting point for discussion. Criticism and corrections are welcome!

This article is a contributed piece and does not represent the position of Chief Security Officer. If you reproduce it, please cite the source: https://www.cncso.com/en/googles-zero-trust-architecture.html
