Secrets Management at Swiss Federal Railways (SBB) with HashiCorp Vault
Hear how Swiss Federal Railways, together with Adfinis, built a self-service HashiCorp Vault platform for their developers.
Swiss Federal Railways (SBB) chose HashiCorp Vault Enterprise to tackle their secrets management challenges, and in less than 6 months SBB went from zero to production.
What You’ll Learn
In a joint session by SBB and Adfinis, they present their learnings, challenges, and integrations in an enterprise environment. They will provide insights into their multi-cloud architecture on OpenShift following cloud native practices and how they address Vault governance using their self-service portal.
Hello, and welcome, everybody. Andreas and I will be talking about the secrets management journey at Swiss Federal Railways [SBB]. I’m Michael Hofer, also called Hofi by most colleagues, and the Head of Engineering at Adfinis, a global open source service provider and HashiCorp hyper-specialized partner. Andreas, who are you and what do you do?
Before I started in application security, I spent several years as a software engineer and then as an architect. Now I’m the product owner of the AppSec team, and we do the planning together. Sometimes I simply have to check that things are getting done. So Hofi, tell us a little bit more about Switzerland.
So yeah, Switzerland, you know it. I already see some faces smiling. There are some stereotypes about the beautiful Alps. We love hiking. Of course, chocolate. You might already have had a taste of it. And we like eating tons and tons of Swiss cheese. But also watches: punctual, handcrafted watches that tell us the time and tell us that we are actually on time. I might even say, as a Swiss citizen, that we are a bit obsessed with being punctual. So Andreas, what else are we passionate about in Switzerland?
Yeah, you can already see it, this beautiful train going through the Swiss landscape. We love trains and especially those that arrive on time. So, SBB’s core business is passenger services, freight services, real estate. We are operating the Swiss railways infrastructure. By Swiss standards, we are quite a big company. We have one of the biggest IT departments in Switzerland. Every day, we are transporting more than 10% of the Swiss population. We are not only connecting Switzerland but also Europe. Therefore, we built the world’s longest train tunnel. Over 90% of our trains arrive on time.
This over 90% punctuality means a lot. To arrive on time and have this confidence. Can you give us some context?
At SBB, “punctual” means the train has less than three minutes delay. With this value, we are in the top three train companies in the world together with Japan and the host here, the Netherlands.
But Andreas, how do you keep up this performance on a daily basis? What do you need for this?
We need to be punctual, we need stability. You can see here the SBB strategy. We want to, of course, get more market share and be more efficient and flexible. At the bottom, you can see our success factors. Of course, our employees are very important to us. But we also drive innovation and technology and our systems are getting more and more interconnected. To interconnect securely, you need authentication, and therefore you need credentials and credential management.
Andreas, can you tell us a bit more about this secrets management journey? When did it start?
It started some years ago, in 2019. Many requests about secrets management, how to store them safely and securely came to my mailbox because at this time the AppSec team didn’t exist. You can believe me, I thought it through more than once, if I should take the hot potato and start something about credential management. I started by collecting requirements in the SBB IT. We defined some initial use cases. Then we went through an evaluation process — looked into the market, what’s there, which tools there are, which features they provide. Finally, we made a decision and it was HashiCorp Vault.
Why did we choose HashiCorp Vault? You can see here the three main aspects we considered. First of all, SBB has a developer-first approach, as all of the HashiCorp products have as well. High availability is also important to SBB because we want our trains to run around the clock. And, looking into the future, we want a broad feature set so we can onboard different use cases with one platform.
We had an initial plan for how we want to operate Vault at SBB. We looked for a decentralized architecture, which allows us to bring the responsibility of operating the different Vault clusters to the DevOps teams. And we even had an idea for how we can address the governance by OpenShift operators.
This is where you joined the journey, Hofi. You and your team from Adfinis had a look at our idea. What was your verdict?
Well together with the SBB colleagues, we had a look at the ecosystem back then. Now, let’s jump back two years from here. If I’m not mistaken, roughly back then, this was when Raft integrated storage hit GA. And when we looked at these different operators, we noticed that most of them were not yet supporting Raft. In addition to that, we also noticed some issues — for example, especially on OpenShift, where you have restricted or additional security requirements. So overall, even though we tried to contribute with pull requests, proposing Raft support, things like that, some bug fixes, in the end, it didn’t feel right at this time. Some operators also would have needed major refactoring to stateful sets deployments.
All of that led to the conclusion, we better ditch this approach for now. It looks like we are too early into this game. So then we basically folded back to a more traditional centralized secrets management approach.
But today, the ecosystem has changed a lot, so make sure to check out these quite interesting and cool projects. There’s also a new kid on the block. Red Hat’s Vault Config Operator. Very interesting. I would say really have a look at them.
Now Andreas, with that defined, the next step was to do the whole commercial topic and get this done. But was this as easy as buying a train ticket on your SBB mobile app?
Those of you who are working in a bigger enterprise know that’s not true. You just can’t do it like an online shop: pick the tools you like, put them in the basket, go to the checkout, and pay with the enterprise’s credit card. You have to align with compliance and governance. You have to follow the processes. You really can’t underestimate the effort and the time it takes until you can get the tool into the company.
One tip from my side: before you go into the discussion with your supplier, first think through how you want to use the platform and how the growth of client usage will look. This makes you more confident in the discussion, and the supplier also knows where the journey will go.
Now with that in place and yeah, the baseline there, it was time to start the project. Now you were the SBB project lead. What were some first few things and important aspects you wanted to get out of the way?
First of all, if you want to do a great project, you need a great team. In our case, this meant we need a diverse team with different colleagues that on one side know the enterprise and on the other side also know the tool we want to implement, HashiCorp Vault. So we worked together with the identity and access management team, who has great experience in operating highly-available platforms. We from the AppSec team know the developers quite well. The guys from Adfinis helped us in understanding and integrating HashiCorp Vault because they have a lot of knowledge and experience in this topic.
You also have to define an initial scope and really make it small and the minimal viable product. Choose two, maybe three, primary use cases you want to cover and stick to it. Also, learn to say no.
The last topic here is to think over what availability requirements you have. As I said, we need high availability, so we had to plan from the beginning. We had to design the architecture for high availability, but also the processes and the responsibilities. You have to know who has to get up at three in the morning if something goes really wrong.
Now we are ready to start, right? We did our project in two phases: platform readiness and service readiness. Hofi, tell us a bit more about that.
The platform readiness was all about getting Vault as a platform up and running in a production-ready state. This meant we looked at the environment and what options we had to deploy and choose the architecture for Vault. Together with the SBB colleagues, we then decided to use both AWS and the so-called Swiss Cloud, which is operated and provided by T-Systems, as a starting point.
On top of that, we consume existing OpenShift services provided by another SBB team. The nice thing is they are both already hardened, managed, and operated, so we can really focus on the Vault parts.
Now, before deploying Vault, we needed to get some prerequisites in place. As engineers, if you have Vault experience, you will know things like AWS S3, for example, for the Raft snapshots. We also consume AWS KMS for auto-unsealing with multi-Region keys. Both of these, and the other additional services we needed, we deployed with HashiCorp Terraform.
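To make the snapshot flow concrete, here is a minimal Python sketch of pulling a Raft snapshot from Vault and handing it to an object store. The talk does not say how SBB produces or ships its snapshots, so this is only an illustration. The endpoint `/v1/sys/storage/raft/snapshot` and the `X-Vault-Token` header are part of Vault’s HTTP API; the object key scheme, the cluster name, and the `fetch`/`store` callables are hypothetical.

```python
import urllib.request
from datetime import datetime, timezone
from typing import Callable, Optional

# Vault HTTP API path that streams a snapshot of the Raft integrated storage.
SNAPSHOT_ENDPOINT = "/v1/sys/storage/raft/snapshot"


def build_snapshot_request(vault_addr: str, token: str) -> urllib.request.Request:
    """Build the GET request that asks Vault for a Raft snapshot."""
    url = vault_addr.rstrip("/") + SNAPSHOT_ENDPOINT
    return urllib.request.Request(url, headers={"X-Vault-Token": token}, method="GET")


def snapshot_key(cluster: str, now: datetime) -> str:
    """Object key for one snapshot. The naming scheme is hypothetical."""
    return f"vault-snapshots/{cluster}/{now.strftime('%Y%m%dT%H%M%SZ')}.snap"


def backup(
    vault_addr: str,
    token: str,
    cluster: str,
    fetch: Callable[[urllib.request.Request], bytes],
    store: Callable[[str, bytes], None],
    now: Optional[datetime] = None,
) -> str:
    """Fetch a snapshot and pass it to `store`; returns the object key used.

    `fetch` and `store` are injected so the flow can be exercised without
    a running Vault or a real bucket.
    """
    now = now or datetime.now(timezone.utc)
    data = fetch(build_snapshot_request(vault_addr, token))
    key = snapshot_key(cluster, now)
    store(key, data)
    return key
```

In a real setup, `fetch` could be `lambda req: urllib.request.urlopen(req).read()` and `store` could wrap an S3 client’s `put_object` call. Both are left out here so that the sketch stays dependency-free.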
On top of that (again, not really a surprise) it was time to deploy Vault. For this we leverage what most organizations use these days: Argo CD, for example, and Artifactory for caching all the artifacts we needed, like the official HashiCorp Helm charts. Again, Vault itself we deployed and configured using Terraform.
So in this case really iterating the simple, secure, and reliable model for SBB was that we deployed a Vault primary and also a Vault DR secondary cluster for disaster recovery. We did this deployment twice, once for production and once for non-production.
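A primary cluster and a DR secondary answer differently on Vault’s health endpoint, which is what monitoring and load balancers typically key on. The sketch below maps the default status codes of `GET /v1/sys/health` to a state and decides whether a node should receive client traffic. The status codes are Vault’s documented defaults; the `VaultState` enum and the `should_receive_traffic` policy are illustrative assumptions, not SBB’s actual configuration.

```python
from enum import Enum


class VaultState(Enum):
    ACTIVE = "active"
    STANDBY = "standby"
    DR_SECONDARY = "dr_secondary"
    PERF_STANDBY = "performance_standby"
    UNINITIALIZED = "uninitialized"
    SEALED = "sealed"
    UNKNOWN = "unknown"


# Default HTTP status codes returned by GET /v1/sys/health.
DEFAULT_HEALTH_CODES = {
    200: VaultState.ACTIVE,         # initialized, unsealed, active
    429: VaultState.STANDBY,        # unsealed, standby
    472: VaultState.DR_SECONDARY,   # disaster recovery secondary, active
    473: VaultState.PERF_STANDBY,   # performance standby
    501: VaultState.UNINITIALIZED,  # not initialized
    503: VaultState.SEALED,         # sealed
}


def classify(status_code: int) -> VaultState:
    """Translate a /v1/sys/health status code into a VaultState."""
    return DEFAULT_HEALTH_CODES.get(status_code, VaultState.UNKNOWN)


def should_receive_traffic(status_code: int, allow_standby: bool = False) -> bool:
    """Hypothetical routing rule: active nodes only, standbys on request.

    A DR secondary never receives client traffic until it is promoted.
    """
    state = classify(status_code)
    if state is VaultState.ACTIVE:
        return True
    return allow_standby and state in (VaultState.STANDBY, VaultState.PERF_STANDBY)
```

With this rule, a DR secondary (472) stays out of rotation while the primary’s active node (200) serves requests.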
That’s not all the components you need to successfully run Vault in production. Another topic is the whole system and audit log aggregation and monitoring. So for this, we integrated the SBB Splunk environment. On top of that, you need operational insights. So you might have heard the nice talk from Julia and her colleagues before about Prometheus and Vault, etc. So we use the Prometheus stack here as well to provide these operational insights to the colleagues who operate Vault.
Now with that in place, it was time for the so-called service readiness. In this step, it was all about filling Vault with life, so to speak. When you deploy Vault, it’s still kind of a blank sheet. This is also one of the biggest challenges for most organizations: understanding how they want to access, structure, and consume all these secrets. So together with the different teams and the use cases defined at the beginning, we defined an initial governance structure and approach. For the consumers, we onboarded Azure AD for single sign-on and the so-called SBB self-service portal. Regarding the primary use cases, we looked at a few, and one of them was secrets management for trains. That is a super interesting use case, I would say: a really small data center on rails. Another was the different OpenShift platforms. As you might have seen at the beginning, SBB has a container-first strategy, so a lot of workload is deployed on Kubernetes and OpenShift. Integrating these platforms is really key.
In parallel, we already started onboarding the different operational colleagues and provided SRE training. This should not come at the very end but go fluently along with your project so they can already participate in certain activities. Now, Andreas, this ominous self-service portal, can you tell us a bit more about it? What are some things it does and what it doesn’t do?
Today we heard some different approaches for how to address governance. To wrap it up, there are two kinds of options for addressing this topic. On one hand, you could have a team that acts like a gatekeeper and checks that everything is fine. On the other hand, you can have some kind of automation. We chose the approach of a self-service portal that acts as an abstraction layer between the customers and Vault itself. This means if a developer or another customer wants to access Vault, they go to the self-service portal and order the access there.
As you mentioned, Hofi, when you start a Vault, it’s kind of like a blank sheet, or you could also say like an empty file system, because everything in Vault, every feature, is addressed with a path. So you have to think through how you want to structure the paths, or the directories, if you want to compare it to that. We chose the pattern of application spaces.
This means the path has two parts: first the organizational unit and then the main application name. In the background of the self-service portal, code is generated and checked into Git. There you get the audit trail for free. From there, Terraform is triggered, which deploys the changes directly to the Vault platform.
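The application-space pattern and the generate-then-commit flow can be sketched in a few lines of Python. This is a guess at the shape of such a generator, not the portal’s actual code: the function names, the validation rule, the KV version 2 mount called `secret`, and the granted capabilities are all assumptions. The output uses Terraform’s JSON syntax with the Vault provider’s `vault_policy` resource, so the result could be committed to Git as a `.tf.json` file.

```python
import json
import re

# Hypothetical naming rule: lowercase, starts with a letter, hyphens allowed.
_SEGMENT = re.compile(r"^[a-z][a-z0-9-]*$")


def app_space_path(org_unit: str, app: str) -> str:
    """Return '<organizational unit>/<application>' after validating both parts."""
    for segment in (org_unit, app):
        if not _SEGMENT.match(segment):
            raise ValueError(f"invalid path segment: {segment!r}")
    return f"{org_unit}/{app}"


def policy_hcl(org_unit: str, app: str, mount: str = "secret") -> str:
    """Render a Vault policy for one application space.

    Assumes a KV version 2 mount, whose API paths use data/ and metadata/.
    """
    space = app_space_path(org_unit, app)
    return (
        f'path "{mount}/data/{space}/*" {{\n'
        '  capabilities = ["create", "read", "update", "delete"]\n'
        "}\n"
        f'path "{mount}/metadata/{space}/*" {{\n'
        '  capabilities = ["list", "read"]\n'
        "}\n"
    )


def terraform_json(org_unit: str, app: str) -> str:
    """Render a Terraform JSON document declaring the policy as a vault_policy."""
    name = f"{org_unit}-{app}"
    doc = {
        "resource": {
            "vault_policy": {
                name: {"name": name, "policy": policy_hcl(org_unit, app)}
            }
        }
    }
    return json.dumps(doc, indent=2, sort_keys=True)
```

Calling `terraform_json("infra", "timetable")` (both names invented for the example) yields a document that a pipeline could apply. The Git history of such generated files is what provides the audit trail mentioned above.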
Here you can see a screenshot of our self-service portal. And I must say, I’m quite proud of it because with this self-service portal, the developers can spin off a whole software project. They can order a Git repo. They can get access to a new OpenShift project. They even can get a Jira project or Confluence space, all with this self-service portal.
Now as we have Vault, they can also get access to Vault. But the self-service portal is only used to start and spin off the project. From then on, the developers work directly with the different systems and no longer over the self-service portal.
Andreas, looking at this, I think an interesting fact is that we used this ourselves for the resources we needed to deploy Vault. So just to rephrase or quote Dave from this morning’s keynote, “shoemaker’s children” or “eating your own dog food,” that was also helpful for us as consumers of the self-service portal to do a first integration there as well for Vault.
So now we have the self-service tool and the Vault platform up and running, so we can start right away. Or how is it Hofi?
Well yeah, almost. But before we actually went live, we wanted to really go through some due diligence activities. So checking again, are our SRE colleagues up to speed? Do they especially feel confident in operating and taking care of such a critical piece of infrastructure? So we conducted things like SRE fire drills, verified escalation paths, and what’s needed to really hand over the platform itself.
In addition to that, we heard a few details about this self-service portal, the governance. Another aspect to it, to really lower this entry barrier for the consumers and developers, is providing some simple things like documentation, code snippets. So in this case for Spring Boot, for example, for the SBB colleagues. And in addition to that, really raise awareness. That was something Andreas constantly did and is still doing: raise awareness, explain what Vault does and what it does not do so consumers really start on the right foot right away.
With all of that done, that’s how we went live in under six months, not including what happened before. (We don’t talk about that.) We started the project roughly last August, I think it was, right? And went live at the beginning of this year. So it’s still in a ramp-up phase. I think it is already a cool project, with great learnings, and a successful one.
Now your favorite question as PO, what’s next? What’s the roadmap?
You said it. I not only like to say no, I also like to look into the future. So we have two main topics that we want to address. On one hand, we want to optimize the infrastructure by simplifying the connectivity and also implement the higher-availability load balancing. You might say, yeah, well, there are a lot of cloud services that provide this functionality. But if you have two different cloud providers in the setup, it’s a bit more complicated.
We also are thinking about spinning up a third Vault platform besides non-prod and prod to do testing of new features in Vault or test new releases without affecting the developers that are working on the non-prod instance in the development stage. On the other hand, we want to onboard more use cases.
Hofi, you told us about the train, the deployment of our secure systems. We want to pursue this further, but we also want to have a transparent OpenShift integration, because we have some application teams that cannot change their code to integrate Vault directly, or are just not willing to. Therefore, we need this integration.
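One common way to integrate Vault without touching application code is the Vault Agent Injector, which is driven by pod annotations and renders secrets as files under `/vault/secrets/`. The talk does not say which mechanism SBB plans to use, so the Python sketch below only illustrates that one option. The annotation keys are the injector’s real ones; the role name and secret path in the usage note are hypothetical.

```python
from typing import Dict, Mapping

ANNOTATION_PREFIX = "vault.hashicorp.com"


def injector_annotations(role: str, secrets: Mapping[str, str]) -> Dict[str, str]:
    """Build pod annotations for the Vault Agent Injector.

    `secrets` maps a file name (rendered as /vault/secrets/<name>) to the
    Vault path the secret is read from.
    """
    annotations = {
        f"{ANNOTATION_PREFIX}/agent-inject": "true",
        f"{ANNOTATION_PREFIX}/role": role,
    }
    for name, path in secrets.items():
        annotations[f"{ANNOTATION_PREFIX}/agent-inject-secret-{name}"] = path
    return annotations
```

For example, `injector_annotations("timetable", {"db": "secret/data/infra/timetable/db"})` produces the three annotations a deployment’s pod template would need. The application then just reads a file and never talks to Vault itself.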
Then the topic of database secrets. This is quite critical in my opinion. We not only want to have them statically safe in Vault but also have them dynamically. Therefore, we want to use the Vault database secrets engine.
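The difference between static and dynamic database secrets shows up in what a client gets back: short-lived credentials with a lease instead of a stored password. The following Python sketch parses the response of `GET /v1/<mount>/creds/<role>` from the database secrets engine. The response fields (`data.username`, `data.password`, `lease_id`, `lease_duration`) follow Vault’s API; the mount name `database`, the `DbLease` type, and the renewal fraction are illustrative assumptions.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class DbLease:
    username: str
    password: str
    lease_id: str
    lease_duration: int  # seconds until the credentials expire


def creds_path(role: str, mount: str = "database") -> str:
    """API path that issues fresh credentials for a database role."""
    return f"/v1/{mount}/creds/{role}"


def parse_creds(body: str) -> DbLease:
    """Extract the generated credentials and lease details from the response."""
    doc = json.loads(body)
    return DbLease(
        username=doc["data"]["username"],
        password=doc["data"]["password"],
        lease_id=doc["lease_id"],
        lease_duration=int(doc["lease_duration"]),
    )


def renew_after(lease: DbLease, fraction: float = 0.5) -> float:
    """Seconds to wait before renewing or re-fetching.

    The default fraction is an arbitrary, conservative choice for illustration.
    """
    return lease.lease_duration * fraction
```

Because every caller receives its own username and password, revoking a single lease cuts off one consumer without rotating a shared secret.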
If we are looking a bit further, then the certificate management we also want to bring into Vault.
Now, Andreas, looking back: if you had to do it again, what are some key takeaways you would like to give our audience and keep for yourself?
Well, if you’re forgetting now everything about our speech, there are the three topics that should stick in your head.
The first one is that consumer onboarding is key. We already talked quite a lot about our self-service portal, which makes it easy to onboard the developers, but that’s not enough. If I started over again, I would start even earlier with this awareness program and with training for developers: how they can use Vault and how they can integrate it without affecting their own high-availability requirements. Even if your platform can do cool stuff, it won’t help if nobody in your IT department uses it.
Also, the preparation is an important aspect of bringing Vault to your company. Maybe you don’t need two years as I did, but really think over what are the critical use cases in your company, and learn to say no. Stick to your initial scope and bring the MVP up. From there on you can cover more use cases and bring more features. Hofi, what’s your takeaway for our audience?
Well, I would say for me, what was really key was that Vault can introduce discussions into your organization that help push certain technical boundaries. For example, the multi-cloud load balancing topic had never been discussed in detail before, and Vault helped bring it to the table. If you share these discussions with the different teams, this in turn benefits other teams and services, which can then consume new capabilities that were previously not available.
It was really worth it to use the whole GitOps process from the beginning, so Terraform, Argo CD, etc. For example, at some point, in one afternoon, three colleagues who were not involved in the project all the time were able to quickly redeploy the DR clusters to a different cloud environment due to a migration scenario. That was kind of the confirmation that we are on a good track here and that it’s actually helpful.
With that said, I would say we finish on time — punctual from an SBB perspective. We are right on track, but with that last bad train joke, we better get off stage. We thank you so much for having us here and please feel free to approach us at any time. Thank you so much.