This blog post offers a first look at running real bare-metal Kubernetes clusters for application workloads, from a networking point of view. Special thanks go to Lachlan Evenson and his colleagues from Lithium for collaborating on this post and providing real use cases.
At the last OpenStack Summit in Tokyo last November, we realized how large an impact containers have had on the global community. Everyone has been talking about using containers and Kubernetes instead of standard virtual machines. There are a couple of reasons for that, chiefly their lightweight nature and easy, fast deployment, and developers love them: they can easily develop, maintain, scale and rolling-update their applications. At tcp cloud we focus on building private cloud solutions based on open source technologies, so we wanted to try Kubernetes and see whether it can really be used in a production setup alongside or within OpenStack-powered virtualization.
Kubernetes brings a new way to manage container-based workloads and, to start with, offers features similar to those OpenStack provides for VMs. Once you start using Kubernetes, you soon realize that you can easily deploy it on AWS, GCE or Vagrant, but what about an on-premise bare-metal deployment? How do you integrate it into your current OpenStack or virtualized infrastructure? Existing blog posts and manuals document small clusters running in virtual machines with sample web applications, but none of them shows a real scenario for bare-metal or enterprise performance workloads integrated into an existing network design. As with OpenStack, designing the networking properly is the most difficult part of the architecture. We therefore defined the following networking requirements:
- Multi-tenancy - separating container workloads is a basic requirement of every security policy standard. Default Flannel networking, for example, provides only a flat network architecture.
- Multi-cloud support - not every workload is suitable for containers, and you still need to put heavy loads like databases in VMs or even on bare metal. For this reason, a single control plane for the SDN is the best option.
- Overlay - this is related to multi-tenancy. Almost every OpenStack Neutron deployment uses some kind of overlay (VXLAN, GRE, MPLSoverGRE, MPLSoverUDP), and we have to be able to interconnect them.
- Distributed routing engine - East-West and North-South traffic cannot go through one central software service. Network traffic has to flow directly between OpenStack compute nodes and Kubernetes nodes. Ideally, routing is provided on routers instead of proprietary gateway appliances.
Based on these requirements, we decided to start with OpenContrail SDN. Our mission was to integrate OpenStack workloads with Kubernetes and then find a suitable application stack for the actual load testing.
OpenContrail is an open source SDN & NFV solution with tight ties to OpenStack since Havana. It was one of the first production-ready Neutron plugins, along with Nicira (now VMware NSX-MH), and the last summit's survey showed it is the second most deployed solution after Open vSwitch and the first among vendor-based solutions. OpenContrail integrates with OpenStack, VMware, Docker and Kubernetes.
The Kubernetes network plugin kube-network-manager has been under development since the OpenStack Summit in Vancouver last year, and it was first announced at the end of the year.
The kube-network-manager process uses the Kubernetes controller framework to listen for changes to objects defined in the API and adds annotations to some of these objects. It then builds the network for the application through the OpenContrail API, which defines objects such as virtual networks, network interfaces and access control policies. More information is available in this blog post.
We started testing with two independent Contrail deployments and then set up BGP federation. The reason for federation is Keystone authentication of kube-network-manager: when contrail-neutron-plugin is enabled, the Contrail API uses Keystone authentication, and this feature is not yet implemented in the Kubernetes plugin. Contrail federation is described in more detail later in this post.
The following diagram shows the high-level architecture, with the OpenStack cluster on the left and the Kubernetes cluster on the right. OpenStack and OpenContrail are deployed in a fully highly available, best-practice design that can be scaled up to hundreds of compute nodes.
The following figure shows the federation of two Contrail clusters. In general, this feature connects Contrail controllers at different sites of a multi-site DC without the need for a physical gateway. The control nodes at each site are peered with those at the other sites using BGP. This makes it possible to stretch both L2 and L3 networks across multiple DCs.
This design is usually used for two independent OpenStack clouds or two OpenStack regions. All Contrail components, including vRouter, are exactly the same. Kube-network-manager and contrail-neutron-plugin just translate API requests for the different platforms. The core functionality of the networking solution remains unchanged. This brings not only a robust networking engine but analytics too.
Let's have a look at a typical scenario. Our developers gave us a docker-compose.yml, which they use for development and local tests on their laptops. This makes things easier, because our developers already know Docker and the application workload is Docker-ready. The application stack contains the following components:
- Database - PostgreSQL or MySQL database cluster.
- Memcached - used for content caching.
- Django app Leonardo - Django CMS Leonardo was used for application stack testing.
- Nginx - web proxy.
- Load balancer - HAProxy load balancer for containers scaling.
When we want to take it into production, we can transform everything into Kubernetes replication controllers with services, but as we mentioned at the beginning, not everything is suitable for containers. We therefore moved the database cluster to OpenStack VMs and rewrote the rest as Kubernetes manifests.
This section describes the workflow for application provisioning on OpenStack and Kubernetes.
As the first step, we launched a Heat database stack on OpenStack. This created 3 VMs with PostgreSQL and a database network. The database network is a private, tenant-isolated network.
    # nova list
    +--------------------------------------+--------------+--------+------------+-------------+-----------------------+
    | ID                                   | Name         | Status | Task State | Power State | Networks              |
    +--------------------------------------+--------------+--------+------------+-------------+-----------------------+
    | d02632b7-9ee8-486f-8222-b0cc1229143c | PostgreSQL-1 | ACTIVE | -          | Running     | leonardodb=10.0.100.3 |
    | b5ff88f8-0b81-4427-a796-31f3577333b5 | PostgreSQL-2 | ACTIVE | -          | Running     | leonardodb=10.0.100.4 |
    | 7681678e-6e75-49f7-a874-2b1bb0a120bd | PostgreSQL-3 | ACTIVE | -          | Running     | leonardodb=10.0.100.5 |
    +--------------------------------------+--------------+--------+------------+-------------+-----------------------+
On the Kubernetes side, we have to launch manifests with the Leonardo and Nginx services. All of them can be viewed there.
For them to run successfully with network isolation, look at the following sections.
- leonardo-rc.yaml - Replication Controller for the Leonardo app with 3 replicas and virtual network leonardo.
    apiVersion: v1
    kind: ReplicationController
    ...
      template:
        metadata:
          labels:
            app: leonardo
            name: leonardo # label name defines and creates new virtual network in contrail
    ...
- leonardo-svc.yaml - the leonardo service exposes the application pods with a virtual IP from the cluster network on port 8000.
    apiVersion: v1
    kind: Service
    metadata:
      labels:
        name: ftleonardo
      name: ftleonardo
    spec:
      ports:
        - port: 8000
      selector:
        name: leonardo # selector/name matches label/name in replication controller to receive traffic for this service
    ...
- nginx-rc.yaml - NGINX replication controller with 3 replicas and virtual network nginx and policy allowing traffic to leonardo-svc network. This sample does not use SSL.
    apiVersion: v1
    kind: ReplicationController
    ...
      template:
        metadata:
          labels:
            app: nginx
            uses: ftleonardo # uses creates policy to allow traffic between leonardo service and nginx pods.
            name: nginx # creates virtual network nginx with policy ftleonardo
    ...
- nginx-svc.yaml - creates a service with a cluster virtual IP and a public virtual IP to access the application from the Internet.
    apiVersion: v1
    kind: Service
    metadata:
      name: nginx
      labels:
        app: nginx
        name: nginx
    ...
      selector:
        app: nginx # selector/name matches label/name in RC to receive traffic for the svc
      type: LoadBalancer # this creates new floating IPs from external virtual network and associate with VIP IP of the service.
    ...
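The label conventions used in these manifests can be summarized in a small model. The following Python sketch is purely illustrative and is not kube-network-manager's actual code: the `NetworkPlan` class and both function names are hypothetical. It only assumes what the manifest comments state, namely that the `name` label selects the virtual network, that the `uses` label requests a policy towards a service, and that a service selects pods whose labels match its selector.

```python
from dataclasses import dataclass, field


@dataclass
class NetworkPlan:
    """Networks and allow-policies derived from pod labels (illustrative only)."""
    networks: set = field(default_factory=set)
    policies: set = field(default_factory=set)  # (source network, target service)


def plan_from_pod_labels(pods):
    """Derive virtual networks from `name` labels and policies from `uses` labels."""
    plan = NetworkPlan()
    for labels in pods:
        network = labels.get("name")
        if network is None:
            continue  # assumption of this sketch: pods without `name` are skipped
        plan.networks.add(network)
        service = labels.get("uses")
        if service is not None:
            plan.policies.add((network, service))
    return plan


def selects(selector, labels):
    """A service selects a pod when every selector pair appears in the pod labels."""
    return all(labels.get(key) == value for key, value in selector.items())


# Labels copied from the pod templates in leonardo-rc.yaml and nginx-rc.yaml.
leonardo_pod = {"app": "leonardo", "name": "leonardo"}
nginx_pod = {"app": "nginx", "uses": "ftleonardo", "name": "nginx"}

plan = plan_from_pod_labels([leonardo_pod, nginx_pod])
print(sorted(plan.networks))   # virtual networks to create
print(sorted(plan.policies))   # nginx is allowed to reach the ftleonardo service

# The ftleonardo service selector (name: leonardo) picks only the leonardo pods.
print(selects({"name": "leonardo"}, leonardo_pod))
print(selects({"name": "leonardo"}, nginx_pod))
```

The model shows why the `name` label does double duty in this setup: it places the pod in a virtual network, and it is also what the ftleonardo service selector matches.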
Let's run all the manifests by calling kubectl:
    kubectl create -f /directory_with_manifests/
This creates the following pods and services in Kubernetes.
    # kubectl get pods
    NAME             READY     STATUS    RESTARTS   AGE
    leonardo-369ob   1/1       Running   0          35m
    leonardo-3xmdt   1/1       Running   0          35m
    leonardo-q9kt3   1/1       Running   0          35m
    nginx-jaimw      1/1       Running   0          35m
    nginx-ocnx2      1/1       Running   0          35m
    nginx-ykje9      1/1       Running   0          35m

    # kubectl get service
    NAME         CLUSTER_IP      EXTERNAL_IP      PORT(S)    SELECTOR        AGE
    ftleonardo   10.254.98.15    <none>           8000/TCP   name=leonardo   35m
    kubernetes   10.254.0.1      <none>           443/TCP    <none>          35m
    nginx        10.254.225.19   22.214.171.124   80/TCP     app=nginx       35m
Only the Nginx service has a public IP, 126.96.36.199, which is a floating IP configured as a LoadBalancer. All traffic is now balanced by ECMP on the Juniper MX.
To get the cluster fully working, routing must be set up between the leonardo virtual network in the Kubernetes Contrail and the database virtual network in the OpenStack Contrail. Go into both Contrail UIs and set the same Route Target for both networks. This can also be automated through Contrail Heat resources.
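To illustrate why a shared Route Target is enough to connect the two networks, here is a minimal Python sketch of route exchange by route target. It is a simplified model, not Contrail code: the `VirtualNetwork` class and `learned_prefixes` function are hypothetical, a single set stands in for both import and export targets, and the route target value is made up because the post does not state the one used. The prefixes are the pod and VM addresses shown elsewhere in this post.

```python
from dataclasses import dataclass, field


@dataclass
class VirtualNetwork:
    name: str
    prefixes: list
    # Simplification: one set serves as both import and export route targets.
    route_targets: set = field(default_factory=set)


def learned_prefixes(network, all_networks):
    """Prefixes a network learns from other networks sharing a route target."""
    learned = []
    for other in all_networks:
        if other is network:
            continue
        if network.route_targets & other.route_targets:
            learned.extend(other.prefixes)
    return learned


# Hypothetical route target value, chosen only for this example.
SHARED_RT = "target:64512:10000"

leonardo = VirtualNetwork(
    "leonardo",
    ["10.150.255.245/32", "10.150.255.247/32", "10.150.255.252/32"],
)
leonardodb = VirtualNetwork(
    "leonardodb",
    ["10.0.100.3/32", "10.0.100.4/32", "10.0.100.5/32"],
)
networks = [leonardo, leonardodb]

before = learned_prefixes(leonardo, networks)  # nothing is shared yet

# Setting the same route target on both networks makes them exchange routes.
leonardo.route_targets.add(SHARED_RT)
leonardodb.route_targets.add(SHARED_RT)
after = learned_prefixes(leonardo, networks)

print(before)
print(after)
```

In the federated setup, the exchange itself happens over the BGP peering between the two Contrail control planes; the sketch only models the matching rule.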
The following figure shows what the final production application stack should look like. At the top are 2 Juniper MXs with a Public VRF, where the floating IPs are propagated. Traffic is balanced through ECMP into MPLSoverGRE tunnels to the 3 nginx pods. Nginx proxies requests to the Leonardo application server, which stores sessions and content in the PostgreSQL database cluster running on OpenStack VMs. The connection between pods and VMs is direct, without any central routing point. The Juniper MXs are used only for outgoing connections to the Internet. Because application sessions are stored in the database (normally this would be memcached or redis), we do not need a dedicated L7 load balancer, and ECMP works without any problem.
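The idea behind ECMP balancing can be sketched in a few lines of Python. This is a generic illustration of hash-based path selection, not the algorithm used by the Juniper MX or the vRouter, both of which use their own hash functions and fields. The function name, the client address and the floating IP placeholder (taken from documentation address ranges) are assumptions made for the example; the pod addresses are the nginx endpoints from this post.

```python
import hashlib


def pick_next_hop(flow, next_hops):
    """Pick a next hop by hashing the flow 5-tuple (one common ECMP approach)."""
    key = "|".join(str(part) for part in flow).encode("utf-8")
    digest = hashlib.sha256(key).digest()
    index = int.from_bytes(digest[:4], "big") % len(next_hops)
    return next_hops[index]


NGINX_PODS = ["10.150.255.243", "10.150.255.248", "10.150.255.250"]
FLOATING_IP = "203.0.113.10"  # placeholder, not the address used in the post

# Hypothetical client flows: (src_ip, src_port, dst_ip, dst_port, protocol).
flows = [("198.51.100.7", 40000 + i, FLOATING_IP, 80, "tcp") for i in range(300)]
chosen = [pick_next_hop(flow, NGINX_PODS) for flow in flows]

# Packets of one flow always hash to the same pod, while different flows
# are spread across all pods.
for pod in NGINX_PODS:
    print(pod, chosen.count(pod))
```

Because the choice depends only on the flow, a new connection from the same client may land on a different pod. That is harmless here precisely because sessions live in the database rather than in a single pod.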
This section shows other interesting outputs from the application stack. The description of the nginx service with LoadBalancer shows the floating IP and the private cluster IP, followed by the 3 IP addresses of the nginx pods. Traffic is distributed through vRouter ECMP.
    # kubectl describe svc/nginx
    Name:                   nginx
    Namespace:              default
    Labels:                 app=nginx,name=nginx
    Selector:               app=nginx
    Type:                   LoadBalancer
    IP:                     10.254.225.19
    LoadBalancer Ingress:   188.8.131.52
    Port:                   http 80/TCP
    NodePort:               http 30024/TCP
    Endpoints:              10.150.255.243:80,10.150.255.248:80,10.150.255.250:80
    Session Affinity:       None
The nginx routing table shows internal routes between pods and the route 10.254.98.15/32, which points to the leonardo service.
The route 10.254.98.15/32 corresponds to the IP in the description of the leonardo service.
    # kubectl describe svc/ftleonardo
    Name:              ftleonardo
    Namespace:         default
    Labels:            name=ftleonardo
    Selector:          name=leonardo
    Type:              ClusterIP
    IP:                10.254.98.15
    Port:              <unnamed> 8000/TCP
    Endpoints:         10.150.255.245:8000,10.150.255.247:8000,10.150.255.252:8000
The routing table for leonardo looks similar to the one for nginx, except for the 10.0.100.X/32 routes, which point to OpenStack VMs in the other Contrail.
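As a rough illustration of how such host routes are used, the following Python sketch performs a longest-prefix-match lookup over a table shaped like the one described for leonardo. The table contents are an assumption: the /32 prefixes come from the outputs in this post, but the next-hop descriptions and the default route are hypothetical and added only to show the more specific route winning.

```python
import ipaddress


def lookup(routes, destination):
    """Return the next hop of the most specific route containing destination."""
    address = ipaddress.ip_address(destination)
    best = None
    for prefix, next_hop in routes.items():
        network = ipaddress.ip_network(prefix)
        if address in network:
            if best is None or network.prefixlen > best[0].prefixlen:
                best = (network, next_hop)
    return best[1] if best else None


# Illustrative routing table for the leonardo network; next hops are made up.
LEONARDO_ROUTES = {
    "10.150.255.245/32": "local leonardo pod",
    "10.0.100.3/32": "tunnel to OpenStack compute node (PostgreSQL-1)",
    "10.0.100.4/32": "tunnel to OpenStack compute node (PostgreSQL-2)",
    "10.0.100.5/32": "tunnel to OpenStack compute node (PostgreSQL-3)",
    "0.0.0.0/0": "hypothetical default route",
}

print(lookup(LEONARDO_ROUTES, "10.0.100.4"))  # matched by its /32 host route
print(lookup(LEONARDO_ROUTES, "192.0.2.1"))   # falls back to the default route
```

The /32 routes are what make the pod-to-VM path direct: each database VM is reachable through its own host route rather than through a central gateway.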
The last output is from the Juniper MX VRF, showing multiple routes to the nginx pods.
    184.108.40.206/32 @[BGP/170] 00:53:48, localpref 200, from 10.0.170.71
                        AS path: ?, validation-state: unverified
                      > via gr-0/0/0.32782, Push 20
                       [BGP/170] 00:53:31, localpref 200, from 10.0.170.71
                        AS path: ?, validation-state: unverified
                      > via gr-0/0/0.32778, Push 36
                       [BGP/170] 00:53:48, localpref 200, from 10.0.170.72
                        AS path: ?, validation-state: unverified
                      > via gr-0/0/0.32782, Push 20
                       [BGP/170] 00:53:31, localpref 200, from 10.0.170.72
                        AS path: ?, validation-state: unverified
                      > via gr-0/0/0.32778, Push 36
                      #[Multipath/255] 00:53:48, metric2 0
                      > via gr-0/0/0.32782, Push 20
                        via gr-0/0/0.32778, Push 36
We have shown that a single SDN solution can be used for OpenStack, Kubernetes, bare metal and VMware vCenter. More importantly, this use case can actually be used in production environments.
If you are interested in this topic, you can vote for our session Multi-cloud Networking for the OpenStack Summit in Austin.
We are currently working on requirements for Kubernetes networking stacks and will then provide a detailed comparison of different Kubernetes network plugins such as Weave, Calico, Open vSwitch, Flannel and Contrail at a scale of 250 bare-metal servers.
We are also working on OpenStack Magnum with a Kubernetes backend to give developers a self-service portal for simple testing and development. They will then be able to prepare application manifests inside OpenStack VMs, push the final production definitions into git, and finally use them in production.
Special thanks go to Pedro Marques from Juniper for his support and contribution during testing.
Jakub Pavlik, Marek Celoud & tcp cloud team