Netflix built its own monitoring system – and why you probably shouldn’t – by Roy Rapoport at QConLondon
This was the last keynote of QConLondon. The talk was divided into two parts: the first dealt with the question « Should we build our own system? », i.e. the NIH (Not Invented Here) syndrome, while the second was a short status update on what Roy’s team has done and where they are going with their own monitoring system. First, Roy started with a checklist of questions you should ask yourself when you are facing a problem:
- are you the first to have this problem?
- are you the first to care about this problem?
- are you really sure you are the first to have this problem?
If you’re not the first:
- does a solution exist?
- do you have to pay for an existing solution?
- do you have to build it yourself?
- have you taken into account the financial cost?
- are there any technical incompatibilities with existing solutions?
- are the products overqualified for your issue? And are you sure their authors haven’t already thought about solutions to issues you haven’t figured out yet?
Then Roy examined the reasons behind the NIH syndrome, which he would rather call « Not Invented By Us ». Mainly, this is a question of trust:
- technical trust: I don’t trust the product that could solve my problem (performance issues, etc.)
- « organisational » trust:
- you’re selling me something, can I really trust you?
- I’m not your only customer: will you address my needs quickly?
- I’m not an important customer: will you address my needs quickly?
- you don’t care about your customers
- will you be able to build your product fast enough to meet my requirements (« unpredictable velocity »)?
But it’s also a matter of honesty: is it really important to you, and from which perspective(s) (technical, personal glory, etc.)? In the end, you may find yourself asking « Maybe I should build my own system? » if you’re a big company, whereas you probably cannot afford to if you’re not. Whatever decision you make, Roy concluded that you have to fight for what you are doing…
Microservices: software that fits in your head – by Dan North at QConLondon
Earlier in the morning at QConLondon, the host of the « Taming microservices » track told us that Dan North was an awesome speaker and that we should attend his talk… he was right. The talk focused less on microservices themselves than on design considerations and on changing developers’ minds about what code really is. Dan started with a few questions: what is the purpose of software development? What is its goal? The purpose of software development is to create a positive business impact; the goal is to sustainably minimize the lead time to business impact. So, the goal is NOT to produce software. In other words, the code is NOT the asset; on the contrary, the code is the COST:
- writing code costs
- waiting for code costs
- changing code costs
- understanding code costs (there are 3 types of code: the code I know, the code everyone knows (well-documented APIs, lots of code examples)… and the code no one knows, which is probably the biggest part! :))
Thus, code is not the asset but the cost. From this perspective, code should be either stabilised or killed off fast! To achieve this, Dan sees two complementary patterns:
- short software half-life
- fits in my head
« Short software half-life » is inspired by the half-life of radioactive elements: the time needed for half of the material to decay. Just as physics does for radioactive elements, we should consider the half-life of our code: whenever we look at code, we should try to think about its half-life. What will this code become? How will it survive? We have to take into account some design considerations when coding:
- writing discrete components
- defining component boundaries (see the sketch after these lists)
- defining component purpose (why is this code here?)
- defining component responsibility (what happens if I remove or change this code?)
In addition, we must consider:
- writing tests and docs (… but only for the code that “survives”; so we may sometimes wait for the code to be stable before starting this)
- optimising for replaceability
- expecting to invest in stabilising
- building stable teams (with shared knowledge and practices)
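To illustrate discrete components with explicit boundaries, here is a minimal Java sketch (the names and the domain are hypothetical, not from the talk). The boundary is a small interface, so the implementation behind it remains cheap to change or to kill off:

```java
// The component boundary: small and purposeful.
// Purpose: produce the next unique invoice number.
public interface InvoiceNumbering {
    String nextInvoiceNumber();
}

// One implementation behind the boundary. Replacing it (say, with a
// database-backed version) does not ripple through the callers,
// which only depend on the interface.
public final class SequentialInvoiceNumbering implements InvoiceNumbering {
    private final java.util.concurrent.atomic.AtomicLong counter =
            new java.util.concurrent.atomic.AtomicLong();

    @Override
    public String nextInvoiceNumber() {
        return "INV-" + counter.incrementAndGet();
    }
}
```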
« Fits in my head » is the ability to reason about code: if code is hard to reason about, it won’t fit in my head. But there are multiple scales: you can reason at the method scale, at the module scale, and so on. « Fits in my head » is also about « What would James do? ». Behind this question hides the challenge of creating a common shared knowledge across the team, and a contextual consistency on which the team and the code can rely. Contextual consistency means « given the same context and the same constraints, we’d likely make the same decisions ». So, once the team agrees on guiding principles and idioms, on doing things the same way, you can state that « difference is data »: if the code is consistent but one piece of it is different, one can infer that special constraints led the team to code it differently. Finally, the team should strive for simplicity, not familiarity. Familiarity is « being used to doing something », and this « something » can contain lots of inconsistencies or bad practices, while simplicity is « we’ve committed to it, we’ve simplified it, we’ve made it obvious ». These principles can lead to an architecture of replaceable components, and microservices can be this kind of architecture as long as we opt for replaceability and consistency: « smaller is not necessarily better, more replaceable is better ». The talk was great and I suggest you watch it if you get a chance. What I’ll remember: « code is not the asset but the cost », « kill code fearlessly » and « more replaceable is better »!
Reactive application design for high volume multi-dimensional temporal data series – by Stuart Williams at QConLondon
This talk was about how Spring Integration, the Spring Expression Language (aka SpEL) and Reactor, together with the LMAX Disruptor, can address finance use cases such as processing billions of events per day.
- Spring Integration is a Spring project that addresses the Enterprise Integration Patterns by providing Spring-based applications with a lightweight messaging system.
- The Spring Expression Language (SpEL) is part of the Spring core libraries and enables evaluating expressions at runtime (a minimal example follows this list)
- Reactor is a library (supported by Pivotal) built on reactive programming patterns; it implements the Reactive Streams specification, which itself grew out of the Reactive Manifesto
- The LMAX Disruptor (LMAX is a company) is a concurrency framework that has been used in finance projects to handle a very high number of events in a short time
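To make the SpEL bullet concrete, here is a minimal sketch using the standard Spring API (the expression and the Order class are made up for the example):

```java
import org.springframework.expression.Expression;
import org.springframework.expression.ExpressionParser;
import org.springframework.expression.spel.standard.SpelExpressionParser;

public class SpelDemo {

    // A simple root object; SpEL resolves "price" and "quantity" via the getters.
    public static class Order {
        private final double price;
        private final int quantity;
        public Order(double price, int quantity) { this.price = price; this.quantity = quantity; }
        public double getPrice() { return price; }
        public int getQuantity() { return quantity; }
    }

    public static void main(String[] args) {
        ExpressionParser parser = new SpelExpressionParser();
        // Parse once, evaluate many times: this evaluation path is what
        // the SpEL team drastically optimised (see below).
        Expression exp = parser.parseExpression("price * quantity");
        Double total = exp.getValue(new Order(10.0, 3), Double.class);
        System.out.println(total); // 30.0
    }
}
```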
The talk gave an overview of the Reactive Manifesto, whose four pillars are:
- Responsiveness: the system must respond in a timely manner if at all possible
- Resiliency: the system must stay responsive in the face of failure
- Scalability / Elasticity: the system must stay responsive under varying workload
- Events / Message Driven: enables responsiveness, resiliency and elasticity
Stuart showed us some code that uses Reactor. He then mentioned some performance issues that were fixed thanks to the SpEL team, who drastically improved the library. Finally, he went through some pieces of code to demonstrate how all these technologies work well together.
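To give a flavour of the reactive style, here is a minimal sketch using the current Reactor 3 API (the talk predates Reactor 3, so Stuart’s code looked different, and the market-tick data is invented):

```java
import reactor.core.publisher.Flux;
import java.time.Duration;

public class ReactorDemo {
    public static void main(String[] args) throws InterruptedException {
        // A tiny stream of market ticks; a real feed would be unbounded.
        Flux.just(101.2, 101.4, 100.9, 101.7)
            .map(price -> price * 100)        // normalise to cents
            .filter(cents -> cents > 10100)   // keep ticks above a threshold
            .buffer(Duration.ofMillis(100))   // micro-batch for downstream consumers
            .subscribe(batch -> System.out.println("batch: " + batch));

        Thread.sleep(500); // let the time-based buffer flush before exiting
    }
}
```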
Protocols of Interaction: Best Current Practices – by Todd Montgomery at QConLondon
This talk was in the « Taming microservices » track but actually was not so much related to microservices, except if we consider that we need protocols to make microservices talk to each other. It was a pretty interesting talk, but quite difficult to transcribe. So, if you are interested in designing or implementing protocols, I would advise you to watch the talk or at least read the slides…
No Free Lunch, Indeed: Three Years of Micro-services at SoundCloud – by Phil Calçado at QConLondon
Phil started his presentation with a reference to Martin Fowler’s article « You must be this tall to use microservices ». To be able to achieve a microservices architecture, you must have at least:
- rapid provisioning (of servers / containers)
- basic monitoring
- rapid application deployment
These are three pillars of the DevOps culture. Phil then described the context in which they started working on a microservices architecture in 2011:
- they were switching from Node.js, Go and Ruby to Clojure, Scala and JRuby for the implementation of microservices
- Docker, Kubernetes and the like didn’t exist yet
So, to handle their microservices architecture, they used:
- for provisioning: LXC containers and Doozer, while trying to follow the best practices stated by 12factor.net
- for telemetry / monitoring: StatsD, Graphite, Nagios and PagerDuty. But they were not satisfied with this and decided to build their own monitoring system (NIH, see Roy Rapoport’s keynote). Thus was born Prometheus, an open-source monitoring system and time-series database. It comes with some nice dashboards, an alerting system, a query language and polyglot client libraries. Prometheus replaces StatsD, Graphite and Nagios in SoundCloud’s monitoring architecture; Icinga and PagerDuty complete the tool suite (see the instrumentation sketch after this list)
- for deployment: Jenkins, which produces Deb packages that can be deployed on AWS, Docker containers and so on. Before that, they used Jenkins with 7 shell scripts, but it was tedious and could not cope with the explosion of microservices they expected
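This is not SoundCloud’s actual code, but to show what Prometheus instrumentation looks like, here is a minimal sketch using the official Prometheus Java client (simpleclient); the metric name, label and port are made up:

```java
import io.prometheus.client.Counter;
import io.prometheus.client.exporter.HTTPServer;

public class PrometheusDemo {
    // A counter metric, registered in the default registry.
    static final Counter requests = Counter.build()
        .name("http_requests_total")
        .help("Total HTTP requests.")
        .labelNames("status")
        .register();

    public static void main(String[] args) throws Exception {
        // Expose the registry over HTTP so the Prometheus server can scrape /metrics.
        HTTPServer server = new HTTPServer(9091);
        requests.labels("200").inc(); // instrument your handlers like this
        requests.labels("500").inc();
    }
}
```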
Since then, technologies have evolved (especially around containers), and SoundCloud could consider migrating to Docker (with Kubernetes). But right now it’s not a priority, and they prefer to see how these new technologies evolve… It was interesting to discover the tooling used at SoundCloud to support their microservices architecture (especially the monitoring tooling)…
Building a Modern Microservices Architecture at Gilt: The Essentials – by Yoni Goldberg at QConLondon
Gilt is a fashion e-commerce website. Originally, it was a monolithic Rails application backed by Postgres databases, but they had issues with it:
- thousands of Ruby processes and lots of « crashes »
- databases were overloaded
- development pain
- the number of lines of code increased dramatically and the code became hard to maintain
- lots of contributors to code modules, with no clear ownership
To fix these issues, they decided to:
- jump into the JVM world by switching from Ruby to Scala and Play for the development of what they call LOSA: Lots Of Small (web) Applications
- use dedicated data stores
- switch to a microservices architecture
As a result:
- teams are now empowered and the ownership of the code is clear
- smaller scopes decrease complexity and enable easier deployments and rollbacks
- each microservice manages its own datastore (Postgres, Mongo, RDS, …) chosen to fit its needs. So, more flexibility for the developers, but beware of the complexity that can come from the evolution of the data schema, for instance
This solved 90 % of their scaling issues, but the changes came with new challenges:
- no one wants to compile hundreds of microservices just to develop their own
- you need to manage the development and integration environments
- service discoverability is needed to enable services to talk to each other (a purely hypothetical sketch follows this list)
- monitoring is a cornerstone of a microservices architecture
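Gilt’s own discovery mechanism wasn’t detailed, so here is a purely hypothetical Java sketch of what client-side service discovery boils down to: ask a registry where a service lives, then call one of its instances (real-world equivalents would be ZooKeeper, Consul or Eureka):

```java
import java.net.URI;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ThreadLocalRandom;

// Illustrative only: a registry mapping service names to live instances.
public class DiscoveryClient {
    private final Map<String, List<URI>> registry;

    public DiscoveryClient(Map<String, List<URI>> registry) {
        this.registry = registry;
    }

    // Resolve a service name to one instance; the random pick is a naive
    // stand-in for proper client-side load balancing.
    public URI resolve(String serviceName) {
        List<URI> instances = registry.get(serviceName);
        if (instances == null || instances.isEmpty()) {
            throw new IllegalStateException("No live instance for " + serviceName);
        }
        return instances.get(ThreadLocalRandom.current().nextInt(instances.size()));
    }
}
```

A caller would then do something like `URI cart = discovery.resolve("cart-service");` before issuing its HTTP request.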
Gilt partially solved the first two challenges:
- by developing an SBT plugin called gilt-sbt-build
- by using Ion-Cannon + SBT for continuous delivery
- by using Docker containers and AWS Elastic Beanstalk for the environments
The last challenge has been addressed by adopting New Relic, Boundary and Cave. In summary:
The Good:
- team ownership
- easier to release and manage code in larger teams
- team Ops and support for unique HW needs
- team boundaries defined by APIs
- breaking complex problems into smaller ones

The Bad:
- requires a lot of tooling
- integration environments are difficult to maintain
- monitoring challenges
- dependencies / network IO
So, there are still challenges to face… It was an interesting talk: you learn a lot from the challenges other people have met and how they tackled them…
Operating microservices – by Michael Brunton-Spall at QConLondon
The last talk of QConLondon, and a fine way to end it. Michael Brunton-Spall worked at the Guardian (which is quite famous for having switched to Scala technologies and a microservices architecture) and is now working for the British government. With some humor, he distilled his experience and thoughts about what a microservices architecture is and how the DevOps culture may help to support it (plus a few anecdotes about how “challenging” it is to work for a government institution…). According to Michael, everyone has an opinion about microservices:
- vertical aligned stacks that communicate via simple and standard interfaces
- team ownership of the code and the runtime: the team owns the full stack, from code to deployment
- the « small is beautiful » paradigm: small systems can be updated more easily and deployed more frequently; teams can move faster… but also break stuff
Because of all that, infrastructure people may view this with fear or distrust (it sounded like personal experience ;)). As a matter of fact, microservices may be as much a new organisational model as a new architecture model: the role of engineering and operations is changing, because microservices tend to focus you on delivering business value. How do you start with microservices? First, you may start small. To do so, you need:
- to automate your infrastructure or containers. Build a combo of « a base image + a “cloud init” (all the environment libraries needed) + your deployed application » that should start in a few minutes
- to monitor. Use monitoring tools that are easy to hook into (a minimal sketch follows this list), with one golden rule:
« If it moves, graph it. If it doesn’t move, graph it anyway just in case it does » @lozzd
- to automate log aggregation. The ELK stack (Elasticsearch, Logstash, Kibana) is used for this purpose at the Cabinet Office
- to use cloud-friendly databases (I found this point interesting because it was the first time such an argument was mentioned in a talk about microservices)
- to automate deployments (it seems they use an internal tool to achieve this). Make sure to log deployments, and hand the deployment keys to developers
- to set up an alerting system
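The Cabinet Office tooling wasn’t shown in code, but as an illustration of « if it moves, graph it », here is a minimal sketch that emits a StatsD counter over UDP; the metric name is made up, and the StatsD line protocol really is just name:value|type:

```java
import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.net.InetAddress;
import java.nio.charset.StandardCharsets;

public class StatsdEmitter {
    public static void main(String[] args) throws Exception {
        try (DatagramSocket socket = new DatagramSocket()) {
            // StatsD line protocol: <metric name>:<value>|<type>, "c" = counter.
            // The aggregated rate then shows up as a graph (e.g. in Graphite).
            byte[] payload = "myapp.deployments:1|c".getBytes(StandardCharsets.UTF_8);
            socket.send(new DatagramPacket(payload, payload.length,
                    InetAddress.getByName("localhost"), 8125)); // default StatsD port
        }
    }
}
```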
And as you hand the deployment keys to developers and have an alerting mechanism, Michael stated that developers should get pagers: « Developers should be exposed to the pain they cause. » Once you have all of this, what about microservices in the large? There are some other considerations to take into account:
- microservices are going to fail (maybe more often than their monolith equivalents), and thus you need to embrace failure, because microservices are subject to network failures, latency and timeouts
- complexity. With this new model of architecture, we switch from complicated monolith applications to complex microservices architectures. As a consequence, you need diagnosis tools and people who understand the system as a whole: all the microservices and how they are supposed to interact
- more monitoring (sketched after this list):
- shallow checks to answer the question “Is my service working?”
- deep checks to answer the question “Are the dependencies of my service working?”
- measuring not only the average response time but also the percentiles (90th, 95th, 99th), because microservices are more sensitive to network variability
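Here is a minimal sketch of the shallow vs deep distinction, using the JDK’s built-in HTTP server; the endpoints and the downstream URL are made up for the example:

```java
import com.sun.net.httpserver.HttpServer;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.URL;

public class HealthChecks {
    public static void main(String[] args) throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);

        // Shallow check: "is my service working?" - just prove we can answer.
        server.createContext("/health", exchange -> {
            byte[] ok = "OK".getBytes();
            exchange.sendResponseHeaders(200, ok.length);
            exchange.getResponseBody().write(ok);
            exchange.close();
        });

        // Deep check: "are my dependencies working?" - probe them too.
        server.createContext("/health/deep", exchange -> {
            boolean depUp = dependencyIsUp("http://localhost:9000/health"); // hypothetical downstream service
            byte[] body = (depUp ? "OK" : "DEPENDENCY DOWN").getBytes();
            exchange.sendResponseHeaders(depUp ? 200 : 503, body.length);
            exchange.getResponseBody().write(body);
            exchange.close();
        });

        server.start();
    }

    private static boolean dependencyIsUp(String url) {
        try {
            HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
            conn.setConnectTimeout(500); // fail fast: a hanging dependency is a failure too
            return conn.getResponseCode() == 200;
        } catch (Exception e) {
            return false;
        }
    }
}
```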
Michael concluded his talk at QConLondon with some “hints”, such as treating operations teams as consultants who would act like pioneers who pay the cost, and then settlers who pave the roads (through documentation and automation). It was a nice talk, and the touches of humor were delightful :).