Sunday, February 5, 2017

Debate on database in a container versus database service 

Containers have been widely accepted as the new paradigm for software services and Platform as a Service (PaaS). To quote some trends from Datadog HQ: 
  1. Real Docker adoption is up 30% in one year 
  1. Docker now runs on 10% of the hosts we monitor 
  1. Larger companies are leading adoption 
  1. 2/3 of the companies that try Docker adopt it 
  1. Adopters 5x their container count within 9 months 
  1. The most widely used images are still Registry, NGINX and Redis. MySQL moved up from its former position at #9, and Postgres, the second most widely used open-source database, became the new #9. Running databases in containers is therefore popular. 
  1. Docker hosts often run five containers at a time. 
  1. VMs live 6x longer than containers. 

Container use for a database is generally suggested only along the following lines (a sketch of all three follows the list): 
  1. use of the volume API for the persisted storage layer. Volumes are efficient because they bypass the layered union file system whose layers otherwise stack up to form the unified view of the image. 
  1. use of a specified directory on the host mounted to a specified location inside the container. 
  1. use of a shared data volume container dedicated to a database 
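
To make the three options concrete, here is a minimal sketch using the Docker SDK for Python (docker-py). The image tag, volume names, host path, and root password are placeholder assumptions, not recommendations from any of the sources above.

```python
# Sketch of the three container storage options for a database,
# using the Docker SDK for Python ("pip install docker").
import docker

client = docker.from_env()

# 1. Volume API: create a named volume and mount it at MySQL's data directory.
client.volumes.create(name="mysql-data")
client.containers.run(
    "mysql:5.7",
    name="mysql-volume-api",
    detach=True,
    environment={"MYSQL_ROOT_PASSWORD": "example"},
    volumes={"mysql-data": {"bind": "/var/lib/mysql", "mode": "rw"}},
)

# 2. Host directory: bind-mount a directory on the host into the container.
client.containers.run(
    "mysql:5.7",
    name="mysql-bind-mount",
    detach=True,
    environment={"MYSQL_ROOT_PASSWORD": "example"},
    volumes={"/srv/mysql-data": {"bind": "/var/lib/mysql", "mode": "rw"}},
)

# 3. Data volume container: a dedicated container owns the volume, and the
#    database container borrows it with volumes_from.
client.volumes.create(name="shared-dbdata")
client.containers.create(
    "alpine",
    name="mysql-data-container",
    volumes={"shared-dbdata": {"bind": "/var/lib/mysql", "mode": "rw"}},
)
client.containers.run(
    "mysql:5.7",
    name="mysql-volumes-from",
    detach=True,
    environment={"MYSQL_ROOT_PASSWORD": "example"},
    volumes_from=["mysql-data-container"],
)
```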

As we can see from this methodology, it is rather limited in scope, size, and scale of growth for a database, and particularly for any production readiness or availability. The data storage and access are independent of whether the MySQL software runs in a Docker container competing with other applications or not. Moreover, the data should not be kept on the container's file system rather than a volume, because that adds overhead. In the end, compute and storage requirements are different for applications and databases. 

On the other hand, the same automation of failover, cluster-based replication, and availability can be achieved with a database cluster, replica sets, multi-DC failover, connection pooling, and so on, all of which come with the deployment topology of a dedicated database server. 
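
As one concrete example of the last point, connection pooling typically lives in the application tier regardless of where the database runs. A minimal sketch with SQLAlchemy follows; the DSN, credentials, and pool sizes are placeholder assumptions, and a real cluster would usually be reached through its router or proxy rather than a single node.

```python
# Connection pooling sketch ("pip install sqlalchemy pymysql").
from sqlalchemy import create_engine, text

engine = create_engine(
    "mysql+pymysql://app:secret@db-primary.internal/appdb",
    pool_size=10,        # connections kept open in the pool
    max_overflow=20,     # extra connections allowed under burst load
    pool_pre_ping=True,  # detect connections broken by a failover
    pool_recycle=3600,   # recycle connections before the server drops them
)

with engine.connect() as conn:
    print(conn.execute(text("SELECT 1")).scalar())
```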

If it makes no difference to data access, then the database server can be moved into its own container, provided it remains available across container restarts. 
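
A hedged sketch of that arrangement, again with the Docker SDK for Python: a restart policy brings the container back automatically, and a named volume keeps the data across restarts. All names and the password are placeholders.

```python
# Database in its own container that survives restarts.
import docker

client = docker.from_env()
client.volumes.create(name="orders-db-data")
client.containers.run(
    "mysql:5.7",
    name="orders-db",
    detach=True,
    environment={"MYSQL_ROOT_PASSWORD": "example"},
    volumes={"orders-db-data": {"bind": "/var/lib/mysql", "mode": "rw"}},
    restart_policy={"Name": "always"},  # restart the container when it exits
)
```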

MySQL published a performance study of containers covering the three usages mentioned above. They measured I/O and network overhead and compared the results against a stock instance of MySQL. Under a heavy I/O-bound load the results were fairly even; there was neither I/O nor network overhead in that case. When the buffer pool size was scaled up, however, the container showed significant overhead compared with the stock instance. 
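
One might repeat a comparison along those lines by starting the containerized MySQL with a larger InnoDB buffer pool and benchmarking it against a stock install configured the same way. The sketch below only shows the container side; the 4G figure and the image tag are arbitrary assumptions, and the original study's benchmark harness is not reproduced here.

```python
# Start a containerized MySQL with a larger InnoDB buffer pool.
import docker

client = docker.from_env()
client.containers.run(
    "mysql:5.7",
    name="mysql-big-buffer-pool",
    detach=True,
    environment={"MYSQL_ROOT_PASSWORD": "example"},
    volumes={"mysql-data": {"bind": "/var/lib/mysql", "mode": "rw"}},
    command=["--innodb-buffer-pool-size=4G"],  # passed through to mysqld
)
```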

There is a lot of inefficiency introduced by container networking when the data does not reside locally on the host, which is usually the case once the data grows larger than the storage available on most host virtual machines. First, there is the overhead of the bridged network. Second, there is the overhead of accessing a remote volume over a mount point via a distributed file system. Compare this to direct access to a database over a gateway-routed connection. Certainly, if the user can tolerate a minute of delay from service or container restarts and the usage is sporadic or shallow, these costs do not matter. For more involved access to a database, where query execution is on the order of milliseconds, disk access latencies are nothing compared to the network delay we introduce by relaying the data over the network. Why then should we not push the database server closer to the data? 
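
A rough way to see where the time goes is to time the same trivial query against a database reached over the local network and one reached through extra hops (a bridge, an overlay, or a remote volume). This is only an illustrative sketch; both DSNs below are hypothetical.

```python
# Measure median round-trip time for a trivial query against two endpoints.
import time
from sqlalchemy import create_engine, text

def median_roundtrip_ms(dsn, runs=50):
    engine = create_engine(dsn)
    samples = []
    with engine.connect() as conn:
        for _ in range(runs):
            start = time.perf_counter()
            conn.execute(text("SELECT 1"))
            samples.append((time.perf_counter() - start) * 1000)
    engine.dispose()
    samples.sort()
    return samples[len(samples) // 2]

print("local :", median_roundtrip_ms("mysql+pymysql://app:secret@localhost/appdb"))
print("remote:", median_roundtrip_ms("mysql+pymysql://app:secret@db.remote.internal/appdb"))
```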
