The final step in the SDLC, and arguably the most crucial, is the testing, deployment, and maintenance of development environments and applications. DZone's category for these SDLC stages serves as the pinnacle of application planning, design, and coding. The Zones in this category offer invaluable insights to help developers test, observe, deliver, deploy, and maintain their development and production environments.
In the SDLC, deployment is the final lever that must be pulled to make an application or system ready for use. Whether it's a bug fix or new release, the deployment phase is the culminating event to see how something works in production. This Zone covers resources on all developers’ deployment necessities, including configuration management, pull requests, version control, package managers, and more.
The cultural movement that is DevOps — which, in short, encourages close collaboration among developers, IT operations, and system admins — also encompasses a set of tools, techniques, and practices. As part of DevOps, the CI/CD process incorporates automation into the SDLC, allowing teams to integrate and deliver incremental changes iteratively and at a quicker pace. Together, these human- and technology-oriented elements enable smooth, fast, and quality software releases. This Zone is your go-to source on all things DevOps and CI/CD (end to end!).
A developer's work is never truly finished once a feature or change is deployed. There is always a need for constant maintenance to ensure that a product or application continues to run as it should and is configured to scale. This Zone focuses on all your maintenance must-haves — from ensuring that your infrastructure is set up to manage various loads and improving software and data quality to tackling incident management, quality assurance, and more.
Modern systems span numerous architectures and technologies and are becoming exponentially more modular, dynamic, and distributed in nature. These complexities also pose new challenges for developers and SRE teams that are charged with ensuring the availability, reliability, and successful performance of their systems and infrastructure. Here, you will find resources about the tools, skills, and practices to implement for a strategic, holistic approach to system-wide observability and application monitoring.
The Testing, Tools, and Frameworks Zone encapsulates one of the final stages of the SDLC as it ensures that your application and/or environment is ready for deployment. From walking you through the tools and frameworks tailored to your specific development needs to leveraging testing practices to evaluate and verify that your product or application does what it is required to do, this Zone covers everything you need to set yourself up for success.
DevOps
The DevOps movement has paved the way for CI/CD and streamlined application delivery and release orchestration. These nuanced methodologies have not only increased the scale and speed at which we release software, but also redistributed responsibilities onto the developer and led to innovation and automation throughout the SDLC.

DZone's 2023 DevOps: CI/CD, Application Delivery, and Release Orchestration Trend Report explores these derivatives of DevOps by diving into how AIOps and MLOps practices affect CI/CD, the proper way to build an effective CI/CD pipeline, strategies for source code management and branching for GitOps and CI/CD, and more. Our research builds on previous years with its focus on the challenges of CI/CD, a responsibility assessment, and the impact of release strategies, to name a few. The goal of this Trend Report is to provide developers with the information they need to further innovate on their integration and delivery pipelines.
In a few words, the idea of canary releases is to deliver a new software version to only a fraction of the users, analyze the results, and decide whether to proceed further or not. If the results are not aligned with expectations, roll back; if they are, increase the number of users exposed until all users benefit from the new version. In this post, I'd like to expand on this brief introduction, explain different ways to define the fraction, and show how to execute a canary release with Apache APISIX.

Introduction to Canary Releases

The term "canary" originates from the coal mining industry. Mining can release toxic gases, and in a small, enclosed space, that can mean quick death. Worse, the gas may be odorless, so miners would breathe it until it was too late to leave. Carbon monoxide is quite common in coal mines and is not detectable by human senses. For this reason, miners brought canaries with them underground. If the canary suddenly dropped dead, chances were high that such a gas pocket had been breached, and it was high time to leave the place.

Years ago, we brought this approach to releasing a new software version. The analogy goes like this: the miners are the Ops team deploying the version, the canary consists of all the tools that measure the impact of the release, and the gas is a (critical) bug. The most crucial part is that you need to measure the impact of the release, including failure rates, HTTP status codes, etc., and compare them with those of the previous version. It's outside the scope of this post, but again, it's critical if you want to benefit from canary releases. The second most important part is the ability to roll back fast if the new version is buggy.

Canary Releases vs. Feature Flags

Note that canary releases are not the only way to manage the risk of releasing new code. Feature flags, for example, are another popular way:

The canary approach delivers the complete set of features in the new component version.
Feature flags deploy the component as well, but dedicated configuration parameters allow activating and deactivating each feature individually.

Feature flags represent a more agile approach (in the true sense of the word) toward rollbacks. If one feature out of 10 is buggy, you don't need to undeploy the new version; you only deactivate the buggy feature. However, this superpower comes at the cost of additional codebase complexity, regardless of whether you rely on third-party products or implement it yourself. On the other hand, canary releases require a mature deployment pipeline to be able to deploy and undeploy at will.

Approaches to Canary Releases

The idea behind canary releases is to allow only a fraction of users to access the new version. Most canary definitions define "fraction" only as a percentage of users, but there's more to it. The first step may be to allow only vetted users to check that the deployment in the production environment works as expected. In this case, you may forward only a specific set of internal users, e.g., testers, to the new version. If you know the people in advance and the system authenticates users, you can configure it by identity; if not, you need to fall back to some generic mechanism, e.g., an HTTP header - X-Canary: Let-Me-Go-To-v2. Remember that we must monitor both the old and the new systems to look for discrepancies. If nothing shows up, it's an excellent time to increase the pool of users forwarded to the new version.

I assume you eat your own dog food, i.e., team members use the software they're developing.
If you don't (for example, if you build an e-commerce site for luxury cars), you're welcome to skip this section. To enlarge the fraction of users while limiting the risks, we can now indiscriminately provide the new version to internal users. To do this, we can configure the system to forward users to the new version based on their client IP. When people were working on-site, this was easy, as their IPs fell within a specific range. Remote work doesn't change much, since users probably access the company's network via a VPN. Again, monitor and compare at this point.

The Whole Nine Yards

At this point, everything should work as expected for internal users, whether a few or all of them. But just as no plan survives contact with the enemy, no internal usage can mimic the whole diversity of a production workload. In short, we need to let regular users access the new version, but in a controlled way, just as we gradually increased the number of users so far: start with a small fraction, monitor it, and if everything is fine, increase the fraction. Here's how to do it with Apache APISIX.

Apache APISIX offers a plugin-based architecture and provides a plugin that caters to our needs, namely the traffic-split plugin.

The traffic-split Plugin can be used to dynamically direct portions of traffic to various Upstream services. This is done by configuring match, which are custom rules for splitting traffic, and weighted_upstreams which is a set of Upstreams to direct traffic to.
— traffic-split

Let's start with some basic upstreams, one for each version:

YAML
upstreams:
  - id: v1
    nodes:
      "v1:8080": 1
  - id: v2
    nodes:
      "v2:8080": 1

We can use the traffic-split plugin to forward most of the traffic to v1 and a fraction to v2:

YAML
routes:
  - id: 1
    uri: "*"                      #1
    upstream_id: v1
    plugins:
      traffic-split:
        rules:
          - weighted_upstreams:   #2
              - upstream_id: v2   #3
                weight: 1         #3
              - weight: 99        #3

1. Define a catch-all route.
2. Configure how to split traffic; here, with weights.
3. Forward 99% of the traffic to v1 and 1% to v2.

Note that the weights are relative to each other. To achieve a 50/50 split, you can set weights 1 and 1, 3 and 3, 50 and 50, etc.

Again, we monitor everything and make sure results are as expected. Then, we can increase the fraction of the traffic forwarded to v2, e.g.:

YAML
routes:
  - id: 1
    uri: "*"
    upstream_id: v1
    plugins:
      traffic-split:
        rules:
          - weighted_upstreams:
              - upstream_id: v2
                weight: 5    #1
              - weight: 95   #1

1. Increase the traffic to v2 to 5%.

Note that Apache APISIX reloads changes to the file above every second. Hence, you split traffic in near-real time. Alternatively, you can use the Admin API to achieve the same.

More Controlled Releases

In the above, I moved from internal users to a fraction of external users. Perhaps releasing to every internal user is too big a risk in your organization, and you need even more control. In this case, you can further configure the traffic-split plugin:

YAML
routes:
  - id: 1
    uri: /*
    upstream_id: v1
    plugins:
      traffic-split:
        rules:
          - match:
              - vars: [["http_X-Canary","~=","Let-Me-Go-To-v2"]]   #1
            weighted_upstreams:
              - upstream_id: v2
                weight: 5
              - weight: 95

1. Only split traffic if the X-Canary HTTP header has the configured value.
The following command always forwards to v1:

Shell
curl http://localhost:9080

The following command also always forwards to v1:

Shell
curl -H 'X-Canary: Let-Me-Go-To-v1' http://localhost:9080

The following command splits the traffic according to the configured weights, i.e., 95/5:

Shell
curl -H 'X-Canary: Let-Me-Go-To-v2' http://localhost:9080

Conclusion

This post explained canary releases and how you can configure one via Apache APISIX. You can start with several routes with different priorities and move on to the traffic-split plugin. The latter can even be configured further to allow more complex use cases. The complete source code for this post can be found on GitHub.

To Go Further

CanaryRelease on Martin Fowler's bliki
traffic-split
Implementation of canary release solution based on Apache APISIX
Canary Release in Kubernetes With Apache APISIX Ingress
Smooth Canary Release Using APISIX Ingress Controller with Flagger
Apache APISIX Canary Deployments
The latest IBM mainframe model, the z16, is fully compatible with the original IBM 360, although many improvements have been made over the 60 years the product line has been in production. Today, IBM mainframes host applications that run many of the world's largest and most successful businesses. An estimated 10,000 mainframe systems are in use today across industries spanning banking, healthcare, insurance, retail, telecommunications, travel, and more. And mainframe applications are used to process credit card payments, stock trades, and other business-critical transactions.

However, the cost of mainframe computing can be significant. As such, many modernization efforts are designed to reduce costs and modernize applications. These efforts can be aided by IBM Z mainframe specialty processors such as the IFL (Integrated Facility for Linux) and zIIP (System Z Integrated Information Processor). Workloads that run on these processors are less expensive than those that run on traditional IBM Z general-purpose processors. Finally, integrating cloud computing techniques and technology can be another key aspect of mainframe modernization efforts. Cloud computing can provide a platform for integrating and extending mainframe systems, offering cost savings and a more modern development paradigm built on microservices and containers.

The Impact and Benefits of Mainframe Specialty Processors

Mainframe specialty processors provide another significant benefit for mainframe modernization initiatives, which raises the question, "What is a specialty processor?" The IBM z16 mainframe runs on the Telum chip, which is used for general-purpose processing on the platform. The z16 can be augmented with specialty processors as well. If so equipped, certain types of workloads run on the specialty processor(s) instead of on the general-purpose CPU. Workload that runs on a specialty processor is not subject to licensed software charges, which can significantly decrease the software bill for mainframe customers.

Although mainframe pricing and licensing are complex to understand thoroughly, an organization's bill for mainframe software is calculated monthly based on the peak average usage during the month. Software cost rises as the capacity and utilization of the mainframe rise. But if work can be run on a specialty processor, then that workload can be removed from the software bill calculation. As the workload redirected to specialty processors increases, cost savings can also increase. However, specialty processors can only run certain, specific types of workloads, so not everything can run on them.

There currently are three different types of mainframe specialty processors:

1. ICF: Internal Coupling Facility - used for processing coupling facility cycles in a data-sharing environment.
2. IFL: Integrated Facility for Linux - used for processing Linux on System Z workload on an IBM mainframe.
3. zIIP: Integrated Information Processor - used for processing certain, specific types of distributed workloads.

The ICF, or Internal Coupling Facility, is designed to be used for processing coupling facility cycles in a mainframe Data Sharing environment. Data Sharing is enabled via a Parallel Sysplex, which is a feature of the IBM Z mainframe that enables up to 32 IBM z/OS systems to be connected and behave as a single, logical computing platform. A Parallel Sysplex requires at least one Coupling Facility (CF).
The CF controls the manner in which data is shared between and among the connected systems, including lock management and synchronization, system availability, and other systems management facilities. A CF can run in a logical partition (LPAR), or it can run on an ICF. Most large organizations choose to use ICFs because they are tightly integrated into the mainframe architecture, provide lower-latency access to shared data, locks, and synchronization services, and the workload performed on the ICF is not chargeable.

The IFL, or Integrated Facility for Linux, is a specialty processor that is optimized for running Linux-based workloads. Using an IFL, organizations can run Linux workloads utilizing the power and reliability of IBM mainframes. Traditional z/OS workloads are not charged for IFL capacity. Organizations are charged for IFL capacity based on the number of IFL processors they have allocated and the usage, generally as a fixed monthly fee or a pay-as-you-go model.

Finally, the zIIP, or Integrated Information Processor, is a dedicated processor that operates asynchronously with mainframe general processors. Relevant workloads can be redirected to run on a zIIP instead of a general processor. Software charges are not imposed on the workload that runs on the zIIP. However, not all types of work can run on the zIIP, only "relevant workloads." What is relevant is decided by IBM, but the types of workloads that run on the zIIP are typically newer functionality that might otherwise be built to run on non-mainframe platforms. The zIIP is a mechanism IBM uses to reduce the cost of such workloads to encourage users to run them on the mainframe. It is also a target for organizations looking to modernize their mainframe processes. The exact types of workloads that can run on the zIIP are ever-evolving and documented by IBM on their website. The predominant workloads that can exploit zIIPs are distributed Db2 SQL requests and Java applications.

As such, legacy modernization efforts that convert COBOL workloads to Java workloads may not only rejuvenate the program code into a more modern language understood by younger developers but also result in a lower overall cost to run the mainframe by exploiting zIIPs. Java is one of the world's most popular programming languages, especially for developing enterprise applications in large organizations. As we mentioned earlier, Java consistently ranks near the top of the TIOBE index, which measures the popularity of programming languages. Furthermore, many organizations are modernizing their legacy applications to use Java as a part of digital transformation efforts. The ability to redirect Java workload from general-purpose CPUs to zIIPs can provide significant cost savings to organizations with heavily used Java applications. Therefore, using technology like IBM watsonx.ai with GenAI to convert from COBOL to Java is a growing and cost-effective component of mainframe modernization initiatives.

As an additional consideration, all the specialty processors are less costly than the general-purpose mainframe CPU. As such, not only can they reduce software costs, as discussed earlier in this section, but they can also reduce hardware costs. By running workload on the lower-cost specialty processors, organizations can utilize lower-cost general-purpose processors and delay costly system upgrades.
Cloud Computing in Mainframe Modernization Cloud computing can be another significant enabler in modernization efforts, fostering cloud-native applications and innovative strategies while addressing migration and integration challenges. A tactic taken by many organizations that rely on mainframe computing is a hybrid multi-cloud approach. But what is meant by hybrid multi-cloud? The term hybrid indicates heterogeneity, composed of multiple components. The term multi-cloud means using more than one cloud computing service. So, a hybrid multi-cloud is an IT infrastructure that uses a mix of on-premises computing with private and public cloud from multiple providers. With the hybrid cloud approach, organizations connect mainframe systems with cloud services. This allows them to leverage cloud resources as needed while still using their mainframes for core applications. Organizations that rely on mainframes tend to adopt this approach because they can continue to benefit from the significant investment they have made in mainframe applications and systems. At the same time, they adopt and integrate cloud services so they can take advantage of their potential for reduced cost and scalability. Cloud computing provides a platform for integrating and extending mainframe systems. In some cases, organizations may move non-core workloads from the mainframe to the cloud. This offloads processing and storage demands from the mainframe, reducing costs and increasing scalability. The cloud also can be used to store and backup mainframe data, providing scalable and cost-effective storage solutions. Data can be easily accessed and recovered from the cloud when and as needed. Another approach to integrating cloud with mainframe applications is through application integration. In this case, cloud-based services can be used to build new applications and interfaces that interact with the mainframe, enabling a modern user experience without extensive changes to the mainframe applications. Deploying microservices as part of a mainframe modernization project can be an important aspect of modernizing legacy mainframe systems. Microservices are a software development approach wherein a complex application is broken down into smaller, independent, and loosely coupled services that communicate using well-defined APIs. Each microservice is designed to perform a specific function or business capability. By contrast, most mainframe applications were built using a traditional monolithic architecture, where all application functionality is tightly integrated into a single codebase. Microservices are well-suited for cloud-native applications and distributed systems. They require architectural planning, strong governance, and effective software tools to manage the complexity and coordination among services. When implemented correctly, microservices can offer improved maintainability, scalability, and agility for software applications. Furthermore, microservices are associated with DevOps, continuous integration, and continuous deployment (CI/CD) pipelines. Adopting DevOps can enable more rapid development, testing, and deployment. However, it also requires a change in culture as changes are integrated and adopted more frequently than is the norm for mainframe systems. Frequently, adopting microservices with DevOps means embracing containerization. 
A container is a lightweight, standalone, and executable package that includes everything needed to run a piece of software, including the code, runtime, libraries, and system tools. Containers provide a consistent and isolated environment for applications to operate, making them highly portable across different computing environments. This is most often associated with containerization technologies like Docker.

IBM z/OS Container Extensions (zCX) delivers integrated container technology, specifically Docker containers, into the mainframe environment. This enables organizations to run Linux-based containers on IBM Z mainframes alongside traditional mainframe workloads. Using zCX, you can isolate the mainframe and container workloads, allowing both to run side by side on the same hardware. This ensures that mainframe operations are not disrupted by the presence of containers and vice versa. Container workloads running on zCX share the same underlying hardware resources as the mainframe, thereby using the mainframe's high-performance computing capabilities. This resource efficiency can be a critical success factor when modernizing legacy mainframe applications.

Furthermore, zCX containers can be easily moved between different IBM Z mainframes and even to other platforms where Docker containers are supported. This portability enhances the flexibility of deploying applications. And zCX supports DevOps practices by allowing mainframe applications and containerized applications to be developed, tested, and deployed together. zCX also integrates with the broader container ecosystem, including Docker tools, Kubernetes, and container orchestration platforms, which allows organizations to leverage familiar container technologies. Finally, zCX can be a valuable tool in the modernization of legacy mainframe applications: it allows organizations to containerize parts of the mainframe application, enabling more flexible development and deployment practices.

In summary, zCX for containers on IBM Z mainframes is a technology that bridges the gap between traditional mainframe workloads and modern containerized applications. It provides a path for organizations to modernize their mainframe environments by adopting containerization and integrating it into their existing mainframe infrastructure. This can offer greater agility, scalability, and flexibility in running workloads on IBM Z systems while taking advantage of the benefits of a microservices architecture.

The cloud brings a similar scalability benefit: cloud resources can be dynamically scaled to handle peak loads, ensuring that the mainframe system doesn't become a bottleneck during periods of high demand. Leveraging the cloud for mainframe modernization is a complex process that requires careful planning and execution. Modernizing legacy systems is never as simple as wholesale replacement of existing processes, many of which have run for decades. Adopting a hybrid cloud approach can help to deliver modernization at a lower cost while taking advantage of the best features of multiple computing architectures and platforms.

Conclusion

While banking on the robust capabilities, expansive scalability, and unwavering reliability of the IBM mainframe, organizations are actively engaged in the integration of contemporary technologies and architectures into their IT framework.
Initiatives to revamp their mainframe systems are underway, presenting notable challenges due to the critical role these systems play in managing the infrastructure of some of the globe's most substantial enterprises. The examination in this article encompasses various technologies aimed at facilitating the success of organizations in the modernization of their mainframe systems and applications. To support converted code and achieve cost reduction, organizations can employ mainframe specialty processors. Additionally, by adopting a hybrid multi-cloud approach, organizations can advance their mainframe modernization efforts, capitalizing on the architectural strengths and best practices inherent in both mainframe and cloud computing.
Back in December of 2022, I started a series taking you on a tour of the Perses project. These articles covered this fairly new open dashboard and visualization project targeting cloud-native environments. I used a "getting started" workshop to guide you through this series and to provide a hands-on experience for those new to visualizing observability data. In the previous article, I kicked off with an introduction to and installation of Perses and provided links to the actual online workshop content. In this article, I'll share a guided tour of the Perses application programming interface (API) and how you can interact with your Perses instance through its API. Let's dive right in, shall we?

I'm going to preview the Perses API as covered in my workshop lab before diving into the actual hands-on with the project. You can find the free workshop online. Note that this article is only a short summary, so please see the complete lab, found online as lab 3, to work through the API experience yourself. The following is a short overview of what is in this specific lab of the Perses workshop.

Perses API

One of the most interesting aspects of any project for developers is the amount and ease of access that a project API provides. Rest assured, everything you might ever want to do with your Perses instance is covered, allowing you to perform operations on your resources. You'll see that a resource is part of the Perses dashboard specification, and you can perform all of the following CRUD (create, read, update, delete) operations on your resources:

Creating a resource
Retrieving a resource
Updating a resource
Deleting a resource
Retrieving a list of resources

After installing Perses in the previous lab, you have a dashboard open in your browser. There is an application programming interface (API) available for you to gain insights into the configuration applied to your Perses instance. You can find out more about the projects, dashboards, and data sources that have been created. This can be handy if you want to start with existing resources and modify them for your own versions. The pre-installed project is called WorkshopProject.

This is how we get started exploring some of the provided API operations for handling a Perses resource, which can be one of the following:

Project
Dashboard
Data source

You learn that you can use a standard browser and the REST API (URL) to leverage the API operations. For example, to view the configured projects on your Perses server with a browser, just enter this URL: http://localhost:8080/api/v1/projects

This shows you the unformatted JSON output of the available projects, and you should see the single WorkshopProject listed. Next, you explore an example REST API tool and query the same operation to see the results embedded and formatted nicely for human consumption. Finally, you'll start using the provided Perses command line executable known as percli. If you built the project from the source code, you have built this executable, but if you are using the container image installation, then you will find an executable percli for your operating system using the pointers provided in the lab.
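To give a feel for what this looks like before you open the lab, here is a rough sketch of such a terminal session. The API URL comes from the lab setup above; the percli subcommands are the ones the lab covers, but the exact flags and resource-name spellings shown here are assumptions, so defer to lab 3 for the authoritative syntax.

Shell
# Query the Perses REST API directly (same endpoint as the browser example above)
curl http://localhost:8080/api/v1/projects

# Log in to the local Perses instance with the CLI, then list the projects;
# the pre-installed WorkshopProject should appear in the output
percli login http://localhost:8080
percli get projects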
The rest of this lab is dedicated to exploring how to use this command line tool, and you explore all of the following commands:

login - Log in to an instance of the Perses API
get - Request a response from the Perses API
project - Select a project to be used as the default
describe - Request details for a specific resource
delete - Delete a specific resource
apply - Create or update existing resources using a JSON or YAML file

You'll learn by using each one of these commands and exploring the output they provide. For example, you can use the command line tool to get all the projects in your Perses instance right from your console, much like the illustrative session shown earlier. This lab leaves you one step closer to creating your own dashboards and visualizations.

More To Come

Next up, I'll continue working with you through the Perses project with a look at the Perses specification. Stay tuned for more insights into a real practical experience as my cloud-native o11y journey continues.
Jenkins pipelines can be employed for Continuous Deployment (CD). But just because you can doesn't mean you should. It's not that it's impossible; some teams manage to make it work. However, the critical question is whether you should, and more often than not, the answer tends to be no. While Jenkins excels in Continuous Integration (CI) scenarios, using it for the full CI/CD workflow proves to be less than ideal, often leading to more complications than initially anticipated. We have seen this scenario occur repeatedly over the past two years of talking to customers: an engineering team uses Jenkins for testing code and building deployable artifacts (their CI process). They decide to employ Jenkins pipelines for executing CD, and while it seems promising initially, it introduces a host of problems and technical debt to the entire CI/CD process. In the following sections, we'll delve into why Jenkins remains a popular solution, where it falls short in CD, and why it's advisable to move away from using Jenkins for CD capabilities.

Why Is Jenkins So Popular in the First Place?

Let's be clear that we don't discredit Jenkins as a tool. Jenkins stands out as an open-source automation server equipped with plugins for streamlined building, deploying, and automating projects. It remains a highly favored tool in the DevOps realm, boasting over 15 million downloads. What many developers appreciate about Jenkins is its maturity and proven resilience. Designed with flexibility in mind, Jenkins provides an extensive Software Development Kit (SDK) covering virtually every imaginable aspect. Whether you're looking to oversee an AWS environment or integrate with GitHub, there's a plugin tailored for that specific purpose. While this adaptability has its drawbacks, it is the primary factor behind Jenkins' popularity and its success in numerous organizations.

Where Does Jenkins Fall Short?

To be blunt, Jenkins is starting to show its age as software. Originally developed in 2004, it wasn't designed to be cloud-native. Although developers and DevOps teams have found ways to make it work, its architecture, rooted in a master node and multiple build agents, reflects an outdated model from the era of data centers and static servers. While this approach isn't necessarily bad, the reliance on plugins for workarounds introduces additional configuration, often seen as a hassle. For instance, configuring containers with the Jenkins Kubernetes plugin involves working with YAML inside a Groovy file, increasing the risk of errors. The variety in plugin syntax further complicates matters. Managing Jenkins is considered challenging, with its requirement for Groovy scripting adding complexity compared to more modern tools.

Despite the usefulness of Jenkins' SDKs, some plugins are limited and offer only basic functionality. That's fine until you realize that most plugins are often not well documented and you need to script your way through a majority of use cases. On top of that, many users complain about buggy plugins that are a hassle to manage from a dependency standpoint. Some plugins unfortunately can't be patched for security vulnerabilities due to inter-dependencies. The user interface is notably outdated, demanding significant human resources for operation. Despite attempts to reduce script reliance, Jenkins remains script-centric. This results in the DevOps team being burdened with maintaining infrastructure, updating plugins, and troubleshooting.
In reality, it's estimated that sustaining Jenkins requires the daily efforts of 2 to 5 engineers. While valuable and proficient in various tasks, the limitations of relying solely on Jenkins become evident as operations expand, leading to increased maintenance, operational challenges, and potential disruptions.

Why Does It Make Sense To Move Your CD Process Off Jenkins?

We firmly believe that software delivery and deployment should be separate from your CI provider, which is usually purpose-built for running unit tests and compiling build artifacts. There are some major motivations for this separation, among them being:

Deployment is frequently a much longer-running process than a standard code change, encompassing staged releases across multiple environments with multiple rounds of integration testing.
A separate CD system can make it easy to manage drift in microservice dependencies by allowing them to be tested in tandem and rolled back independently.
Separating CD enhances your security posture by not requiring master credentials in a build service like Jenkins or a cloud service like CircleCI.
A central CD service drastically simplifies managing polymorphic infrastructure providers, as a pull-based architecture naturally extends to multi-cloud or on-prem delivery models.

Jenkins functions as a CI server, heavily dependent on scripts. The Jenkins pipeline functionality, consisting of a suite of plugins, facilitates the implementation and integration of CD pipelines within Jenkins. This situation presents challenges in establishing a smooth and comprehensive CI/CD pipeline. Putting aside Jenkins plugins, let's quickly break down a few core concepts of a deployment pipeline:

Artifacts
Applications or services
Secrets
Deployment workflow
Stages
Ability to roll back
Software release strategy (Blue/Green, Canary, etc.)
Tests and tooling (security and monitoring)
User groups and users (RBAC)

Jenkins pipeline functionality contains none of those out of the box. Instead, you have to write custom scripting to manage or perform all of the above tasks. You get the point: Jenkins pipelines are just another way of hardcoding together a solution with scripts. Your and your team's time is far too valuable to spend it maintaining deployment pipelines. The reality is that scripting together your pipeline introduces more complexity, which increases the chances of pipelines failing. Sure, if you want to, you can manually script a Canary deployment to a Kubernetes cluster by heavily editing some Jenkinsfiles, but it's not easy and ultimately a pain. Remember what was mentioned earlier: just because you can doesn't mean you should. CD shouldn't be challenging. You shouldn't be coding and maintaining your CD process as your applications and services evolve. There are plenty of CD tools on the market that integrate with and complement Jenkins for CI.
This guide delves into the steps of deploying a Spring MVC application on a local Tomcat server. This hands-on tutorial is designed to equip you with the skills essential for seamless deployment within your development environment. Follow along to enhance your proficiency in deploying robust and reliable Spring MVC apps, ensuring a smooth transition from development to production.

Introduction

In the preliminary stages, it's crucial to recognize the pivotal role of deploying a Spring MVC application on a local Tomcat server. This initial step holds immense significance as it grants developers the opportunity to rigorously test their applications within an environment closely mirroring the production setup. The emphasis on local deployment sets the stage for a seamless transition, ensuring that the application, when deemed ready for release, aligns effortlessly with the intricacies of the production environment. This strategic approach enhances reliability and mitigates potential challenges in the later stages of the development life cycle.

Prerequisites

To get started, ensure you have the necessary tools and software installed:

Spring MVC Project: A well-structured Spring MVC project.
Tomcat Server: Download and install Apache Tomcat, the popular servlet container.
Integrated Development Environment (IDE): Use your preferred IDE (Eclipse, IntelliJ, etc.) for efficient development.

Configuring the Spring MVC App

Initiating the deployment process entails careful configuration of your Spring MVC project. Navigate to your project within the Integrated Development Environment (IDE) and focus on pivotal files such as `web.xml` and `dispatcher-servlet.xml`. These files house crucial configurations that dictate the behavior of your Spring MVC application. Pay close attention to details like servlet mappings and context configurations within these files. This configuration step is foundational, as it establishes the groundwork for the application's interaction with the servlet container, paving the way for a well-orchestrated deployment on the local Tomcat server.

1. Create the Spring Configuration Class

In a typical Spring MVC application, you create a Java configuration class to define the application's beans and configuration settings. Let's call this class 'AppConfig'.

Java
import org.springframework.context.annotation.ComponentScan;
import org.springframework.context.annotation.Configuration;
import org.springframework.web.servlet.config.annotation.EnableWebMvc;

@Configuration
@EnableWebMvc
@ComponentScan(basePackages = "com.example.controller") // Replace with your actual controller package
public class AppConfig {
    // Additional configurations or bean definitions can go here
}

Explanation

'@Configuration': Marks the class as a configuration class.
'@EnableWebMvc': Enables Spring MVC features.
'@ComponentScan': Scans for Spring components (like controllers) in the specified package.

2. Create the DispatcherServlet Configuration

Create a class that extends 'AbstractAnnotationConfigDispatcherServletInitializer' to configure the DispatcherServlet.
Java
import org.springframework.web.servlet.support.AbstractAnnotationConfigDispatcherServletInitializer;

public class MyWebAppInitializer extends AbstractAnnotationConfigDispatcherServletInitializer {

    @Override
    protected Class<?>[] getRootConfigClasses() {
        return null; // No root configuration for this example
    }

    @Override
    protected Class<?>[] getServletConfigClasses() {
        return new Class[]{AppConfig.class}; // Specify your configuration class
    }

    @Override
    protected String[] getServletMappings() {
        return new String[]{"/"};
    }
}

Explanation

'getServletConfigClasses()': Specifies the configuration class (in this case, AppConfig) for the DispatcherServlet.
'getServletMappings()': Maps the DispatcherServlet to the root URL ("/").

Now you've configured the basic setup for a Spring MVC application. This includes setting up component scanning, enabling MVC features, and configuring the DispatcherServlet. Adjust the package names and additional configurations based on your application's structure and requirements.

Setting up Tomcat Server Locally

Transitioning to the next phase involves the establishment of a local Tomcat server. Start by downloading the latest version of Apache Tomcat from the official website and follow the installation instructions. Once the installation process is complete, the next pivotal step is configuring Tomcat within your Integrated Development Environment (IDE). If you're using Eclipse, for example, navigate to the server tab, initiate the addition of a new server, and opt for Tomcat from the available options. This localized setup ensures a synchronized and conducive environment for the impending deployment of your Spring MVC application.

Building the Spring MVC App

As you progress, it's imperative to verify that your Spring MVC project is poised for a seamless build. Leverage automation tools such as Maven or Gradle to expedite this process. Integrate the requisite dependencies into your project configuration file, such as the `pom.xml` for Maven users. Execute the build command to orchestrate the compilation and assembly of your project. This step ensures that your Spring MVC application is equipped with all the necessary components and dependencies, laying a solid foundation for subsequent phases of deployment on the local Tomcat server.

1. Project Structure

Ensure that your project follows a standard Maven directory structure:

project-root
├── src
│   ├── main
│   │   ├── java
│   │   │   └── com
│   │   │       └── example
│   │   │           ├── controller
│   │   │           │   └── MyController.java
│   │   │           └── AppConfig.java
│   │   ├── resources
│   │   └── webapp
│   │       └── WEB-INF
│   │           └── views
├── pom.xml
└── web.xml

2. MyController.java: Sample Controller

Create a simple controller that handles requests. This is a basic example; you can expand it based on your application requirements.

Java
package com.example.controller;

import org.springframework.stereotype.Controller;
import org.springframework.ui.Model;
import org.springframework.web.bind.annotation.RequestMapping;

@Controller
public class MyController {

    @RequestMapping("/hello")
    public String hello(Model model) {
        model.addAttribute("message", "Hello, Spring MVC!");
        return "hello"; // This corresponds to the view name
    }
}

3. View ('hello.jsp')

Create a simple JSP file under 'src/main/webapp/WEB-INF/views/hello.jsp':

Java Server Pages
<%@ page contentType="text/html;charset=UTF-8" language="java" %>
<html>
<head>
    <title>Hello Page</title>
</head>
<body>
    <h2>${message}</h2>
</body>
</html>
4. 'AppConfig.java': Configuration

Ensure that AppConfig.java scans the package where your controllers are located:

Java
package com.example;

import org.springframework.context.annotation.ComponentScan;
import org.springframework.context.annotation.Configuration;
import org.springframework.web.servlet.config.annotation.EnableWebMvc;

@Configuration
@EnableWebMvc
@ComponentScan(basePackages = "com.example.controller")
public class AppConfig {
    // Additional configurations or bean definitions can go here
}

5. 'web.xml': Web Application Configuration

Configure the DispatcherServlet in web.xml:

XML
<web-app xmlns="http://xmlns.jcp.org/xml/ns/javaee"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://xmlns.jcp.org/xml/ns/javaee
                             http://xmlns.jcp.org/xml/ns/javaee/web-app_4_0.xsd"
         version="4.0">

    <servlet>
        <servlet-name>dispatcher</servlet-name>
        <servlet-class>org.springframework.web.servlet.DispatcherServlet</servlet-class>
        <init-param>
            <param-name>contextConfigLocation</param-name>
            <param-value>/WEB-INF/dispatcher-servlet.xml</param-value>
        </init-param>
        <load-on-startup>1</load-on-startup>
    </servlet>

    <servlet-mapping>
        <servlet-name>dispatcher</servlet-name>
        <url-pattern>/</url-pattern>
    </servlet-mapping>

</web-app>

6. 'dispatcher-servlet.xml'

Create a dispatcher-servlet.xml file under src/main/webapp/WEB-INF/ to define additional Spring MVC configurations:

XML
<?xml version="1.0" encoding="UTF-8"?>
<beans xmlns="http://www.springframework.org/schema/beans"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xmlns:context="http://www.springframework.org/schema/context"
       xmlns:mvc="http://www.springframework.org/schema/mvc"
       xsi:schemaLocation="http://www.springframework.org/schema/beans
                           http://www.springframework.org/schema/beans/spring-beans.xsd
                           http://www.springframework.org/schema/context
                           http://www.springframework.org/schema/context/spring-context.xsd
                           http://www.springframework.org/schema/mvc
                           http://www.springframework.org/schema/mvc/spring-mvc.xsd">

    <!-- Enables component scanning for the specified package -->
    <context:component-scan base-package="com.example.controller"/>

    <!-- Enables annotation-driven Spring MVC -->
    <mvc:annotation-driven/>

    <!-- Resolves views selected for rendering by @Controllers to .jsp resources in the /WEB-INF/views directory -->
    <bean class="org.springframework.web.servlet.view.InternalResourceViewResolver">
        <property name="prefix" value="/WEB-INF/views/"/>
        <property name="suffix" value=".jsp"/>
    </bean>

</beans>

7. Run the Application

Run your application (this depends on your IDE). Access the hello endpoint at http://localhost:8080/your-app-context/hello. You should see the "Hello, Spring MVC!" message. Remember to replace "your-app-context" with the actual context path of your deployed application.

War File Creation

Transitioning to the packaging phase, it's time to create a deployable Web Application Archive (WAR) file for your Spring MVC application. This file serves as the standardized encapsulation of your Java web application. Utilize prevalent build tools like Maven to automate this process, simplifying the generation of the WAR file. Typically, you'll find this compiled archive neatly organized within the target directory. The WAR file encapsulates your Spring MVC app, ready to be seamlessly deployed onto the local Tomcat server, marking a pivotal step towards actualizing the functionality of your application in a real-world web environment.
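If you prefer to drive this packaging step from the command line, a minimal sketch looks like the following, assuming a Maven project whose pom.xml declares war packaging (the resulting archive name is whatever your pom.xml defines):

Shell
# Compile, run the tests, and package the application as a WAR file
mvn clean package

# The archive ends up in the target directory
ls target/*.war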
Deploying on Tomcat

Embarking on the deployment phase, the excitement builds as you launch your application onto the local Tomcat server. This involves a straightforward process: copy the previously generated WAR file into the designated `webapps` directory within your Tomcat installation. This directory serves as the portal for deploying web applications. Subsequently, initiate or restart the Tomcat server and watch as it autonomously detects and deploys your Spring MVC application. This automated deployment mechanism streamlines the process, ensuring that your application is swiftly up and running on the local Tomcat server, ready for comprehensive testing and further development iterations.

Testing the Deployed App

Upon successful deployment, it's time to conduct a comprehensive test of your Spring MVC application. Open your web browser and enter the address `http://localhost:8080/your-app-context`, replacing `your-app-context` with the precise context path assigned to your deployed application. This step allows you to visually inspect and interact with your application in a real-time web environment. If all configurations align seamlessly, you should witness your Spring MVC app dynamically come to life, marking a pivotal moment in the deployment process and affirming the correct integration of your application with the local Tomcat server.

Tips for Efficient Development

To enhance your development workflow, consider the following tips:

Hot swapping: Leverage hot-swapping features in your IDE to avoid restarting the server after every code change.
Logging: Implement comprehensive logging to troubleshoot any issues during deployment.
Monitoring: Utilize tools like JConsole or VisualVM to monitor your application's performance metrics.

Conclusion

In reaching this conclusion, congratulations are in order! The successful deployment of your Spring MVC app on a local Tomcat server marks a significant milestone. This guide has imparted a foundational understanding of the deployment process, a vital asset for a seamless transition to production environments. As you persist in honing your development skills, bear in mind that adept deployment practices are instrumental in delivering applications of utmost robustness and reliability. Your achievement in this deployment endeavor underscores your capability to orchestrate a streamlined and effective deployment pipeline for future projects. Well done!
I recently had to add UI tests for an application implemented with the Swing library for the Posmulten project. The GUI does not do any rocket science. It does what the Posmulten project was created for, generating DDL statements that make an RLS policy for the Postgres database, but with a user interface based on Swing components. Now, because Posmulten is an open-source project and the CI/CD process uses GitHub Actions, it would be worth having tests covering the UI application's functionality, tests that could be run in a headless environment.

Testing Framework

For testing purposes, I picked the AssertJ Swing library. It makes it effortless to mimic application users' actions, not to mention that I could, with no effort, check the application state and its components. Below is an example of a simple test case that checks if the correct panel shows up with the expected content after entering text and clicking the correct button.

Java
@Test
public void shouldDisplayCreationScriptsForCorrectConfigurationWhenClickingSubmitButton() throws SharedSchemaContextBuilderException, InvalidConfigurationException {
    // GIVEN
    String yaml = "Some yaml";
    ISharedSchemaContext context = mock(ISharedSchemaContext.class);
    Mockito.when(factory.build(eq(yaml), any(DefaultDecoratorContext.class))).thenReturn(context);
    List<SQLDefinition> definitions = asList(sqlDef("DEF 1", null), sqlDef("ALTER DEFINIT and Function", null));
    Mockito.when(context.getSqlDefinitions()).thenReturn(definitions);
    window.textBox(CONFIGURATION_TEXTFIELD_NAME).enterText(yaml);

    // WHEN
    window.button("submitBtn").click();

    // THEN
    window.textBox(CREATION_SCRIPTS_TEXTFIELD_NAME).requireText("DEF 1" + "\n" + "ALTER DEFINIT and Function");
    // Error panel should not be visible
    findJTabbedPaneFixtureByName(ERROR_TAB_PANEL_NAME).requireNotVisible();
}

You can find the complete test code here.

Posmulten

The library for which the GUI application was created is generally a simple DDL statement builder that makes an RLS policy in the Postgres database. The generated RLS policies allow applications communicating with the Postgres database to work in a multi-tenant architecture with the shared schema strategy. For more info, please check the links below:

Posmulten
GUI module
Shared Schema Strategy With Postgres
Multi-tenancy Architecture With Shared Schema Strategy in Webapp Application Based on Spring-boot, Thymeleaf, and Posmulten-hibernate

Maven Configuration

It is worth excluding UI tests from the unit test run. Although these tests might not be fully e2e (some components are mocked), their execution can take a little longer than standard unit tests, so it is better not to run them together.

XML
<profile>
    <id>swing-tests</id>
    <activation>
        <activeByDefault>false</activeByDefault>
    </activation>
    <build>
        <plugins>
            <plugin>
                <groupId>org.codehaus.gmavenplus</groupId>
                <artifactId>gmavenplus-plugin</artifactId>
                <version>1.5</version>
                <executions>
                    <execution>
                        <goals>
                            <goal>compile</goal>
                            <goal>testCompile</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-surefire-plugin</artifactId>
                <version>2.22.1</version>
                <configuration>
                    <includes>
                        <include>**/*SwingTest.java</include>
                    </includes>
                </configuration>
            </plugin>
        </plugins>
    </build>
</profile>

Full Maven file.

To run the tests locally, on an environment with a graphics card, you need to execute them with the Maven wrapper, like below.
Shell
./mvnw -pl :openwebstart '-DxvfbRunningTests=true' -P !unit-tests,swing-tests test

GitHub Action

Now, moving to the GitHub action: running the UI tests on an environment with a graphics card seems easy. However, there might be situations when a UI window with a WhatsApp or MS Teams notification appears on the desktop on which the UI tests are executed, and our tests will fail. Tests should be repeated in such cases, but that is not the problem. Many more problems can occur when we try to execute tests in a headless environment, which is probably the default environment for every CI/CD pipeline. And we still need to run those tests and ensure they pass even when they are executed in such an environment.

When we ask how to execute UI tests in a headless environment, the first suggestion on the internet is to use Xvfb. However, the contributors of AssertJ Swing suggest a different approach:

Our tests maximize windows and do other stuff the default window manager of xvfb doesn't support. TightVNC makes it easy to use another window manager. Just add gnome-wm & (or the window manager of your choice) to ~/.vnc/xstartup and you're ready to run.
GitHub

So, I followed the suggestions from the contributors' team and used tightvncserver. I had some problems with adding gnome-wm; instead, I used Openbox. Below, you can see the step that runs the UI tests. The full GitHub action file can be found here. The script files used to configure CI can be found here.

YAML
testing_swing_app:
  needs: [compilation_and_unit_tests, database_tests, testing_configuration_jar]
  runs-on: ubuntu-latest
  name: "Testing Swing Application"
  steps:
    - name: Git checkout
      uses: actions/checkout@v2
    # Install JDKs and maven toolchain
    - uses: actions/setup-java@v3
      name: Set up JDK 11
      id: setupJava11
      with:
        distribution: 'zulu' # See 'Supported distributions' for available options
        java-version: '11'
    - name: Set up JDK 1.8
      id: setupJava8
      uses: actions/setup-java@v1
      with:
        java-version: 1.8
    - uses: cactuslab/maven-toolchains-xml-action@v1
      with:
        toolchains: |
          [
            {"jdkVersion": "8", "jdkHome": "${{ steps.setupJava8.outputs.path }}"},
            {"jdkVersion": "11", "jdkHome": "${{ steps.setupJava11.outputs.path }}"}
          ]
    - name: Install tightvncserver
      run: sudo apt-get update && sudo apt install tightvncserver
    - name: Install openbox
      run: sudo apt install openbox
    - name: Copy xstartup
      run: mkdir $HOME/.vnc && cp ./swing/xstartup $HOME/.vnc/xstartup && chmod +x $HOME/.vnc/xstartup
    - name: Setting password for tightvncserver
      run: ./swing/setpassword.sh
    - name: Run Swing tests
      id: swingTest1
      continue-on-error: true
      run: ./mvnw -DskipTests --quiet clean install && ./swing/execute-on-vnc.sh ./mvnw -pl :openwebstart '-DxvfbRunningTests=true' -P !unit-tests,swing-tests test
    # https://www.thisdot.co/blog/how-to-retry-failed-steps-in-github-action-workflows/
    # https://stackoverflow.com/questions/54443705/change-default-screen-resolution-on-headless-ubuntu
    - name: Run Swing tests (Second time)
      id: swingTest2
      if: steps.swingTest1.outcome == 'failure'
      run: ./mvnw -DskipTests --quiet clean install && ./swing/execute-on-vnc.sh ./mvnw -pl :openwebstart '-DxvfbRunningTests=true' -P !unit-tests,swing-tests test

Retry Failed Steps in GitHub Action Workflows

As you probably saw in the GitHub action file, the last step, which executes the UI tests, is added twice. After correctly setting up tightvncserver and Openbox, I didn't observe that the second step had to be executed, except at the beginning during deployment, when there were not a lot of UI components.
Back then I used Xvfb, and sometimes the CI passed only after the second step. So even if there is no problem with executing the tests the first time right now, it is still worth executing those tests a second time in case of failure. To check if the first step failed, we first have to name it; in this case, the name is "swingTest1". In the second step, we use the "if" property like below:

YAML
if: steps.swingTest1.outcome == 'failure'

And that is generally all it takes to run the step a second time in case of failure. Check this resource if you want to see other ways to execute a step a second time.

Summary

Setting up CI for UI tests might not be a trivial task, but it can benefit any project with a GUI. Not all things can be tested with unit tests.
According to several sources we queried, more than 33 percent of the world's web servers are running Apache Tomcat, while other sources show that it's 48 percent of application servers. Some of these instances have been containerized over the years, but many still run in the traditional setup of a virtual machine with Linux. Red Hat JBoss Web Server (JWS) combines a web server (Apache HTTPD), a servlet engine (Apache Tomcat), and modules for load balancing (mod_jk and mod_cluster). Ansible is an automation engine that provides a suite of tools for managing an enterprise at scale. In this article, we'll show how 1+1 becomes 11 by using Ansible to completely automate the deployment of a JBoss Web Server instance on a Red Hat Enterprise Linux 8 server. A prior article covered this subject, but now you can use the Red Hat-certified content collection for the JBoss Web Server, which has been available since the 5.7 release. In this article, you will automate a JBoss Web Server deployment through the following tasks: Retrieve the archive containing the JBoss Web Server from a repository and install the files on the system. Configure the Red Hat Enterprise Linux operating system, creating users, groups, and the required setup files to enable JBoss Web Server as a systemd service. Fine-tune the configuration of the JBoss Web Server server, such as binding it to the appropriate interface and port. Deploy a web application and start the systemd service. Perform a health check to ensure that the deployed application is accessible. Ansible fully automates all those operations, so no manual steps are required. Preparing the Target Environment Before you start the automation, you need to specify your target environment. In this case, you'll be using Red Hat Enterprise Linux 8 with Python 3.6. You'll use this setup on both the Ansible control node (where Ansible is executed) and the Ansible target (the system being configured). On the control node, confirm the following requirements: Shell $ cat /etc/redhat-release Red Hat Enterprise Linux release 8.7 (Ootpa) $ ansible --version ansible [core 2.13.3] config file = /etc/ansible/ansible.cfg configured module search path = ['/root/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules'] ansible python module location = /usr/lib/python3.9/site-packages/ansible ansible collection location = /root/.ansible/collections:/usr/share/ansible/collections executable location = /usr/bin/ansible python version = 3.9.13 (main, Nov 9 2022, 13:16:24) [GCC 8.5.0 20210514 (Red Hat 8.5.0-15)] jinja version = 3.1.2 libyaml = True Note: The procedure in this article might not execute successfully if you use a different Python version or target operating system. Installing the Red Hat Ansible Certified Content Collection Once you have Red Hat Enterprise Linux 8 set up and Ansible ready to go, you need to install the Red Hat Ansible Certified Content Collection 1.2 for the Red Hat JBoss Web Server. Ansible uses the collection to perform the following tasks on the JBoss Web Server: Ensure that the required system dependencies (e.g., unzip) are installed. Install Java (if it is missing and requested). Install the web server binaries and integrate the software into the system (setting the user, group, etc.). Deploy the configuration files. Start and enable JBoss Web Server as a systemd service. To install the certified collection for the JBoss Web Server, you'll have to configure Ansible to use Red Hat Automation Hub as a Galaxy server. 
Follow the instructions on Automation Hub to retrieve your token and update the ansible.cfg configuration file in your project directory. Update the field with the token obtained from Automation Hub: YAML [galaxy] server_list = automation_hub, galaxy [galaxy_server.galaxy] url=https://galaxy.ansible.com/ [galaxy_server.automation_hub] url=https://cloud.redhat.com/api/automation-hub/api/galaxy/ auth_url=https://sso.redhat.com/auth/realms/redhat-external/protocol/openid-connect/token token=<your-token> Install the certified collection: Shell $ ansible-galaxy collection install redhat.jws Starting galaxy collection install process Process install dependency map Starting collection install process Downloading https://console.redhat.com/api/automation-hub/v3/plugin/ansible/content/published/collections/artifacts/redhat-jws-1.2.2.tar.gz to /root/.ansible/tmp/ansible-local-475cum49011/tmptmiuep63/redhat-jws-1.2.2-299_snr4 Installing 'redhat.jws:1.2.2' to '/root/.ansible/collections/ansible_collections/redhat/jws' Downloading https://console.redhat.com/api/automation-hub/v3/plugin/ansible/content/published/collections/artifacts/redhat-redhat_csp_download-1.2.2.tar.gz to /root/.ansible/tmp/ansible-local-475cum49011/tmptmiuep63/redhat-redhat_csp_download-1.2.2-tb4zjzut redhat.jws:1.2.2 was installed successfully Installing 'redhat.redhat_csp_download:1.2.2' to '/root/.ansible/collections/ansible_collections/redhat/redhat_csp_download' redhat.redhat_csp_download:1.2.2 was installed successfully Ansible Galaxy fetches and downloads the collection's dependencies. These dependencies include the redhat_csp_download collection, which helps facilitate the retrieval of the archive containing the JBoss Web Server server from either the Red Hat customer portal or a specified local or remote location. For more information about this step, please refer to the official documentation for Red Hat. Installing the Red Hat JBoss Web Server The configuration steps in this section include downloading JBoss Web Server, installing Java, and enabling JBoss Web Server as a system service (systemd). Downloading the Archive First, you need to download the archive for the JBoss Web Server from the Red Hat Customer Portal. By default, the collection expects the archive to be in the root folder of the Ansible project. The only remaining requirement is to specify the version of the JBoss Web Server being used (5.7) in the playbooks. Based on this information, the collection determines the path and the full name of the archive. Therefore, update the value of jws_version in the jws-article.yml playbook: YAML --- - name: "Red Hat JBoss Web Server installation and configuration" hosts: all vars: jws_setup: True jws_version: 5.7.0 jws_home: /opt/jws-5.7/tomcat … Installing Java JBoss Web Server is a Java-based server, so the target system must have a Java Virtual Machine (JVM) installed. Although Ansible primitives can perform such tasks natively, the redhat.jws collection can also take care of this task, provided that the jws_java_version variable is defined: YAML jws_home: /opt/jws-5.7/tomcat jws_java_version: 1.8.0 … Note: This feature works only if the target system's distribution belongs to the Red Hat family.Enabling JBoss Web Server as a system service (systemd) The JBoss Web Server server on the target system should run as a service system. 
The collection can also take care of this task if the jws_systemd_enabled variable is defined as True: YAML jws_java_version: 1.8.0 jws_systemd_enabled: True jws_service_name: jws Note: This configuration works only when systemd is installed, and the system belongs to the Red Hat family. Now that you have defined all the required variables to deploy the JBoss Web Server, finish the playbook: YAML ... jws_service_name: jws collections: - redhat.jws roles: - role: jws Running the Playbook Run the playbook to see whether it works as expected: Shell $ ansible-playbook -i inventory jws-article.yml PLAY [Red Hat JBoss Web Server installation and configuration] ******************************************************************************************************************************************************************************* TASK [Gathering Facts] *********************************************************************************************************************************************************************************************************************** ok: [localhost] TASK [redhat.jws.jws : Validating arguments against arg spec 'main'] ************************************************************************************************************************************************************************* ok: [localhost] TASK [redhat.jws.jws : Set default values] *************************************************************************************************************************************************************************************************** skipping: [localhost] TASK [redhat.jws.jws : Set default values (jws)] ********************************************************************************************************************************************************************************************* ok: [localhost] TASK [redhat.jws.jws : Set jws_home to /opt/jws-5.7/tomcat if not already defined] *********************************************************************************************************************************************************** skipping: [localhost] TASK [redhat.jws.jws : Check that jws_home has been defined.] 
******************************************************************************************************************************************************************************** ok: [localhost] => { "changed": false, "msg": "All assertions passed" } TASK [redhat.jws.jws : Install required dependencies] **************************************************************************************************************************************************************************************** included: /root/.ansible/collections/ansible_collections/redhat/jws/roles/jws/tasks/fastpackage.yml for localhost => (item=zip) included: /root/.ansible/collections/ansible_collections/redhat/jws/roles/jws/tasks/fastpackage.yml for localhost => (item=unzip) TASK [redhat.jws.jws : Check arguments] ****************************************************************************************************************************************************************************************************** ok: [localhost] … TASK [redhat.jws.jws : Remove apps] ********************************************************************************************************************************************************************************************************** changed: [localhost] => (item=ROOT) ok: [localhost] => (item=examples) TASK [redhat.jws.jws : Create vault configuration (if enabled)] ****************************************************************************************************************************************************************************** skipping: [localhost] RUNNING HANDLER [redhat.jws.jws : Reload Systemd] ******************************************************************************************************************************************************************************************** ok: [localhost] RUNNING HANDLER [redhat.jws.jws : Ensure Jboss Web Server runs under systemd] **************************************************************************************************************************************************************** included: /root/.ansible/collections/ansible_collections/redhat/jws/roles/jws/tasks/systemd/service.yml for localhost RUNNING HANDLER [redhat.jws.jws : Check arguments] ******************************************************************************************************************************************************************************************* ok: [localhost] RUNNING HANDLER [redhat.jws.jws : Enable jws.service] **************************************************************************************************************************************************************************************** changed: [localhost] RUNNING HANDLER [redhat.jws.jws : Start jws.service] ***************************************************************************************************************************************************************************************** changed: [localhost] RUNNING HANDLER [redhat.jws.jws : Restart Jboss Web Server service] ************************************************************************************************************************************************************************** changed: [localhost] PLAY RECAP *********************************************************************************************************************************************************************************************************************************** localhost : ok=64 changed=15 unreachable=0 failed=0 skipped=19 rescued=2 ignored=0 As you can 
see, quite a lot happened during this execution. Indeed, the redhat.jws role took care of the entire setup: Deploying a base configuration Removing unused applications Starting the web server Deploying a Web Application Now that JBoss Web Server is running, modify the playbook to facilitate the deployment of a web application: YAML roles: - role: jws tasks: - name: "Check that the server is running" ansible.builtin.uri: url: "http://localhost:8080/" status_code: 404 return_content: no - name: "Deploy demo webapp" ansible.builtin.get_url: url: 'https://people.redhat.com/~rpelisse/info-1.0.war' dest: "{{ jws_home }}/webapps/info.war" notify: - "Restart Jboss Web Server service" The configuration uses a handler, provided by the redhat.jws collection, to ensure that the JBoss Web Server is restarted once the application is downloaded. Automation Saves Time and Reduces the Chance of Error The Red Hat Ansible Certified Content Collection encapsulates, as much as possible, the complexities and the inner workings of Red Hat JBoss Web Server deployment. With the help of the collection, you can focus on your business use case, such as deploying applications, instead of setting up the underlying application server. The result is reduced complexity and faster time to value. The automated process is also repeatable and can be used to set up as many systems as needed.
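For reference, here is a consolidated sketch of the playbook assembled from the fragments shown in this article. The values are the ones used above; the indentation is approximate, so treat this as a starting point rather than the exact jws-article.yml file. YAML
---
- name: "Red Hat JBoss Web Server installation and configuration"
  hosts: all
  vars:
    jws_setup: True
    jws_version: 5.7.0
    jws_home: /opt/jws-5.7/tomcat
    jws_java_version: 1.8.0
    jws_systemd_enabled: True
    jws_service_name: jws
  collections:
    - redhat.jws
  roles:
    - role: jws
  tasks:
    - name: "Check that the server is running"
      ansible.builtin.uri:
        url: "http://localhost:8080/"
        status_code: 404
        return_content: no
    - name: "Deploy demo webapp"
      ansible.builtin.get_url:
        url: 'https://people.redhat.com/~rpelisse/info-1.0.war'
        dest: "{{ jws_home }}/webapps/info.war"
      notify:
        - "Restart Jboss Web Server service"
Running it uses the same ansible-playbook -i inventory jws-article.yml command shown earlier.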
Site reliability engineering (SRE) is a discipline in which automated software systems are built to manage the development operations (DevOps) of a product or service. In other words, SRE automates the functions of an operations team via software systems. The main purpose of SRE is to encourage the deployment and proper maintenance of large-scale systems. In particular, site reliability engineers are responsible for ensuring that a given system’s behavior consistently meets business requirements for performance and availability. Furthermore, whereas traditional operations teams and development teams often have opposing incentives, site reliability engineers are able to align incentives so that both feature development and reliability are promoted simultaneously. Basic SRE Principles This article covers key principles that underlie SRE, provides some examples of those key principles, and includes relevant details and illustrations to clarify these examples. Principle Description Example Embrace risk No system can be expected to have perfect performance. It’s important to identify potential failure points and create mitigation plans. Additionally, it’s important to budget a certain percentage of business costs to address these failures in real time. A week consists of 168 hours of potential availability. The business sets an expectation of 165 hours of uptime per week to account for both planned maintenance and unplanned failures. Set service level objectives (SLOs) Set reasonable expectations for system performance to ensure that customers and internal stakeholders understand how the system is supposed to perform at various levels. Remember that no system can be expected to have perfect performance. The website is up and running 99% of the time. 99% of all API requests return a successful response. The server output matches client expectations 99% of the time. 99% of all API requests are delivered within one second. The server can handle 10,000 requests per second. Eliminate work through automation Automate as many tasks and processes as possible. Engineers should focus on developing new features and enhancing existing systems at least as often as addressing real-time failures. Production code automatically generates alerts whenever an SLO is violated. The automated alerts send tickets to the appropriate incident response team with relevant playbooks to take action. Monitor systems Use tools, to monitor system performance. Observe performance, incidents, and trends. A dashboard that displays the proportion of client requests and server responses that were delivered successfully in a given time period. A set of logs that displays the expected and actual output of client requests and server responses in a given time period. Keep things simple Release frequent, small changes that can be easily reverted to minimize production bugs. Delete unnecessary code instead of keeping it for potential future use. The more code and systems that are introduced, the more complexity created; it’s important to prevent accidental bloat. Changes in code are always pushed via a version control system that tracks code writers, approvers, and previous states. Outline the release engineering process Document your established processes for development, testing, automation, deployments, and production support. Ensure that the process is accessible and visible. A published playbook lists the steps to address reboot failure. 
The playbook contains references to relevant SLOs, dashboards, previous tickets, sections of the codebase, and contact information for the incident response team. Embrace Risk No system can be expected to have perfect performance. It's important to create reasonable expectations about system performance for both internal stakeholders and external users. Key Metrics For services that are directly user-facing, such as static websites and streaming, two common and important ways to measure performance are time availability and aggregate availability. This article provides an example of calculating time availability for a service. For other services, additional factors are important, including speed (latency), accuracy (correctness), and volume (throughput). An example calculation for latency is as follows: Suppose 10 different users send identical HTTP requests to your website, and they are all served properly. The return times are monitored and recorded as follows: 1 ms, 3 ms, 3 ms, 4 ms, 1 ms, 1 ms, 1 ms, 5 ms, 3 ms, and 2 ms. The average response time, or latency, is 24 ms / 10 returns = 2.4 ms. Choosing key metrics makes explicit how the performance of a service is evaluated, and therefore what factors pose a risk to service health. In the above example, identifying latency as a key metric indicates average return time as an essential property of the service. Thus, a risk to the reliability of the service is "slowness," i.e., high latency. Define Failure In addition to measuring risks, it's important to clearly define which risks the system can tolerate without compromising quality and which risks must be addressed to ensure quality. This article provides an example of two types of measurements that address failure: mean time to failure (MTTF) and mean time between failures (MTBF). The most robust way to define failures is to set SLOs, monitor your services for violations of those SLOs, and create alerts and processes for fixing violations. These are discussed in the following sections. Error Budgets The development of new production features always introduces new potential risks and failures; aiming for a 100% risk-free service is unrealistic. The way to align the competing incentives of pushing development and maintaining reliability is through error budgets. An error budget provides a clear metric that allows a certain proportion of failure from new releases in a given planning cycle. If the number or length of failures exceeds the error budget, no new releases may occur until a new planning period begins. The following is an example error budget. Planning cycle: quarter. Total possible availability: 2,190 hours. SLO: 99% time availability. Error budget: 1% of total time = 21.9 hours. Suppose the development team plans to release 10 new features during the quarter, and the following occurs: The first feature doesn't cause any downtime. The second feature causes downtime of 10 hours until fixed. The third and fourth features each cause downtime of 6 hours until fixed. At this point, the error budget for the quarter has been exceeded (10 + 6 + 6 = 22 > 21.9), so the fifth feature cannot be released. In this way, the error budget has ensured an acceptable feature release velocity while not compromising reliability or degrading user experience. Set Service Level Objectives (SLOs) The best way to set performance expectations is to set specific targets for different system risks. These targets are called service level objectives, or SLOs.
The following table lists examples of SLOs based on different risk measurements. Time availability Website running 99% of the time Aggregate availability 99% of user requests processed Latency 1 ms average response rate per request Throughput 10,000 requests handled every second Correctness 99% of database reads accurate Depending on the service, some SLOs may be more complicated than just a single number. For example, a database may exhibit 99.9% correctness on reads but have the 0.1% of errors it incurs always be related to the most recent data. If a customer relies heavily on data recorded in the past 24 hours, then the service is not reliable. In this case, it makes sense to create a tiered SLObased on the customer’s needs. Here is an example: Level 1 (records within the last 24 hours) 99.99% read accuracy Level 2 (records within the last 7 days) 99.9% read accuracy Level 3 (records within the last 30 days) 99% read accuracy Level 4 (records within the last 6 months) 95% read accuracy Costs of Improvement One of the main purposes of establishing SLOs is to track how reliability affects revenue. Revisiting the sample error budget from the section above, suppose there is a projected service revenue of $500,000 for the quarter. This can be used to translate the SLO and error budget into real dollars. Thus, SLOs are also a way to measure objectives that are indirectly related to system performance. SLO Error Budget Revenue Lost 95% 5% $25,000 99% 1% $5,000 99.90% 0.10% $500 99.99% 0.01% $50 Using SLOs to track indirect metrics, such as revenue, allows one to assess the cost of improving service. In this case, spending $10,000 on improving the SLO from 95% to 99% is a worthwhile business decision. On the other hand, spending $10,000 on improving the SLO from 99% to 99.9% is not. Eliminate Work Through Automation One characteristic that distinguishes SREs from traditional DevOps is the ability to scale up the scope of a service without scaling the cost of the service. Called sublinear growth, this is accomplished via automation. In a traditional development-operations split, the development team pushes new features, while the operations team dedicates 100% of its time to maintenance. Thus, a pure operations team will need to grow 1:1 with the size and scope of the service it is maintaining: If it takes O(10) system engineers to serve 1000 users, it will take O(100) engineers to serve 10K users. In contrast, an SRE team operating according to best practices will devote at least 50% of its time to developing systems that remove the basic elements of effort from the operations workload. Some examples of this include the following: A service that detects which machines in a large fleet need software updates and schedules software reboots in batches over regular time intervals. A “push-on-green” module that provides an automatic workflow for the testing and release of new code to relevant services. An alerting system that automates ticket generation and notifies incident response teams. Monitor Systems To maintain reliability, it is imperative to monitor the relevant analytics for a service and use monitoring to detect SLO violations. 
As mentioned earlier, some important metrics include: The amount of time that a service is up and running (time availability) The number of requests that complete successfully (aggregate availability) The amount of time it takes to serve a request (latency) The proportion of responses that deliver expected results (correctness) The volume of requests that a system is currently handling (throughput) The percentage of available resources being consumed (saturation) Sometimes durability is also measured, which is the length of time that data is stored with accuracy. Dashboards A good way to implement monitoring is through dashboards. An effective dashboard will display SLOs, include the error budget, and present the different risk metrics relevant to the SLO. Example of an effective SRE dashboard (source) Logs Another good way to implement monitoring is through logs. Logs that are both searchable in time and categorized via request are the most effective. If an SLO violation is detected via a dashboard, a more detailed picture can be created by viewing the logs generated during the affected timeframe. Example of a monitoring log (source) Whitebox Versus Blackbox The type of monitoring discussed above that tracks the internal analytics of a service is called whitebox monitoring. Sometimes it’s also important to monitor the behavior of a system from the “outside,” which means testing the workflow of a service from the point of view of an external user; this is called blackbox monitoring. Blackbox monitoring may reveal problems with access permissions or redundancy. Automated Alerts and Ticketing One of the best ways for SREs to reduce effort is to use automation during monitoring for alerts and ticketing. The SRE process is much more efficient than a traditional operations process. A traditional operations response may look like this: A web developer pushes a new update to an algorithm that serves ads to users. The developer notices that the latest push is reducing website traffic due to an unknown cause and manually files a ticket about reduced traffic with the web operations team. A system engineer on the web operations team receives a ticket about the reduced traffic issue. After troubleshooting, the issue is diagnosed as a latency issue caused by a stuck cache. The web operations engineer contacts a member of the database team for help. The database team looks into the codebase and identifies a fix for the cache settings so that data is refreshed more quickly and latency is decreased. The database team updates the cache refresh settings, pushes the fix to production, and closes the ticket. In contrast, an SRE operations response may look like this: The ads SRE team creates a deployment tool that monitors three different traffic SLOs: availability, latency, and throughput. A web developer is ready to push a new update to an algorithm that serves ads, for which he uses the SRE deployment tool. Within minutes, the deployment tool detects reduced website traffic. It identifies a latency SLO violation and creates an alert. The on-call site reliability engineer receives the alert, which contains a proposal for updated cache refresh settings to make processing requests faster. The site reliability engineer accepts the proposed changes, pushes the new settings to production, and closes the ticket. By using an automated system for alerting and proposing changes to the database, the communication required, the number of people involved, and the time to resolution are all reduced. 
The following code block is a simplified, generic implementation (in Python) of latency and throughput thresholds, with automated alerts triggered when violations are detected. Python
import time
from statistics import quantiles

# Define the latency SLO threshold in seconds and keep a list of observed latencies
LATENCY_SLO_THRESHOLD = 0.1
request_latencies = []

# Define the throughput SLO threshold in requests per second and a request counter
THROUGHPUT_SLO_THRESHOLD = 10000
request_count = 0
window_start = time.time()

# Record one served request and its latency in seconds
def record_request(latency_seconds):
    global request_count
    request_latencies.append(latency_seconds)
    request_count += 1

# Check if the latency SLO is violated and send an alert if it is
def check_latency_slo():
    if len(request_latencies) < 2:
        return
    latency_99th_percentile = quantiles(request_latencies, n=100)[98]
    if latency_99th_percentile > LATENCY_SLO_THRESHOLD:
        print(f"Latency SLO violated! 99th percentile response time is {latency_99th_percentile} seconds.")

# Check if the throughput SLO is violated and send an alert if it is
def check_throughput_slo():
    elapsed = time.time() - window_start
    current_throughput = request_count / elapsed if elapsed > 0 else 0
    if current_throughput > THROUGHPUT_SLO_THRESHOLD:
        print(f"Throughput SLO violated! Current throughput is {current_throughput} requests per second.")
Example of automated alert calls Keep Things Simple The best way to ensure that systems remain reliable is to keep them simple. SRE teams should be hesitant to add new code, preferring instead to modify and delete code where possible. Every additional API, library, and function that one adds to production software increases dependencies in ways that are difficult to track, introducing new points of failure. Site reliability engineers should aim to keep their code modular. That is, each function in an API should serve only one purpose, as should each API in a larger stack. This type of organization makes dependencies more transparent and also makes diagnosing errors easier. Playbooks As part of incident management, playbooks for typical on-call investigations and solutions should be authored and published publicly. Playbooks for a particular scenario should describe the incident (and possible variations), list the associated SLOs, reference appropriate monitoring tools and codebases, offer proposed solutions, and catalog previous approaches. Outline the Release Engineering Process Just as an SRE codebase should emphasize simplicity, so should an SRE release process. Simplicity is encouraged through a few principles: Smaller size and higher velocity: Rather than large, infrequent releases, aim for a higher frequency of smaller ones. This allows the team to observe changes in system behavior incrementally and reduces the potential for large system failures. Self-service: An SRE team should completely own its release process, which should be automated effectively. This both eliminates work and encourages small-size, high-velocity pushes. Hermetic builds: The process for building a new release should be hermetic, or self-contained. That is to say, the build process must be locked to known versions of existing tools (e.g., compilers) and not be dependent on external tools. Version Control All code releases should be submitted within a version control system to allow for easy reversions in the event of erroneous, redundant, or ineffective code.
Code Reviews The process of submitting releases should be accompanied by a clear and visible code review process. Basic changes may not require approval, whereas more complicated or impactful changes will require approval from other site reliability engineers or technical leads. Recap of SRE Principles The main principles of SRE are embracing risk, setting SLOs, eliminating work via automation, monitoring systems, keeping things simple, and outlining the release engineering process. Embracing risk involves clearly defining failure and setting error budgets. The best way to do this is by creating and enforcing SLOs, which track system performance directly and also help identify the potential costs of system improvement. The appropriate SLO depends on how risk is measured and the needs of the customer. Enforcing SLOs requires monitoring, usually through dashboards and logs. Site reliability engineers focus on project work, in addition to development operations, which allows for services to expand in scope and scale while maintaining low costs. This is called sublinear growth and is achieved through automating repetitive tasks. Monitoring that automates alerting creates a streamlined operations process, which increases reliability. Site reliability engineers should keep systems simple by reducing the amount of code written, encouraging modular development, and publishing playbooks with standard operating procedures. SRE release processes should be hermetic and push small, frequent changes using version control and code reviews.
As teams moved their deployment infrastructure to containers, monitoring and logging methods changed a lot. Storing logs in containers or VMs just doesn’t make sense – they’re both way too ephemeral for that. This is where solutions like Kubernetes DaemonSet come in. Since pods are ephemeral as well, managing Kubernetes logs is challenging. That’s why it makes sense to collect logs from every node and send them to some sort of central location outside the Kubernetes cluster for persistence and later analysis. A DaemonSet pattern lets you implement node-level monitoring agents in Kubernetes easily. This approach doesn’t force you to apply any changes to your application and uses little resources. Dive into the world of DaemonSets to see how they work on a practical example of network traffic monitoring. What Is Kubernetes DaemonSet? Intro to Node-Level Monitoring in Kubernetes A DaemonSet in Kubernetes is a specific kind of workload controller that ensures a copy of a pod runs on either all or some specified nodes within the cluster. It automatically adds pods to new nodes and removes pods from removed nodes. This makes DaemonSet ideal for tasks like monitoring, logging, or running a network proxy on every node. DaemonSet vs. Deployment While a Deployment ensures that a specified number of pod replicas run and are available across the nodes, a DaemonSet makes sure that a copy of a pod runs on all (or some) nodes in the cluster. It’s a more targeted approach that guarantees that specific services run everywhere they’re needed. DaemonSets provide a unique advantage in scenarios where consistent functionality across every node is crucial. This is particularly important for node-level monitoring within Kubernetes. By deploying a monitoring agent via DaemonSet, you can guarantee that every node in your cluster is equipped with the tools necessary for monitoring its performance and health. This level of monitoring is vital for early detection of issues, load balancing, and maintaining overall cluster efficiency. An alternative approach – which involves manually deploying these agents or using other types of workload controllers like Deployments – could lead to inconsistencies and gaps in monitoring. For example, without a DaemonSet, a newly added node might remain unmonitored until it’s manually configured. This gap could pose a risk to both the performance and security of the entire cluster. The Benefits of DaemonSets DaemonSets automate this process, ensuring that each node is brought under the monitoring umbrella without any manual intervention as soon as it joins the cluster. Furthermore, DaemonSets aren’t just about deploying the monitoring tools. They also manage the lifecycle of these tools on each node. When a node is removed from the cluster, the DaemonSet ensures that the associated monitoring tools are also cleanly removed, keeping your cluster neat and efficient. In essence, Kubernetes DaemonSets simplify the process of maintaining a high level of operational awareness across all nodes. They provide a hands-off, automated solution that ensures no node goes unmonitored, enhancing the reliability and performance of Kubernetes clusters. This makes DaemonSets an indispensable tool in the arsenal of Kubernetes cluster administrators, particularly for tasks like node-level monitoring that require uniform deployment across all nodes. Head over to K8s docs for details about the Kubernetes DaemonSet feature. How Do DaemonSets Work? 
A DaemonSet is a Kubernetes object that is actively controlled by a controller. You can define whatever state you wish for it – for example, declare that a specific pod should be present on all nodes. The tuning control loop compares the intended state to what is currently happening. If a matching pod doesn’t exist on the monitored node, the DaemonSet controller will create one for you. This automated approach applies to both existing and newly created nodes. By default, the DaemonSet creates pods on all nodes. You can use the node selector to limit the number of nodes it can accept. The DaemonSet controller will only create pods on nodes that match the YAML file’s preset nodeSelector field. Here’s a DaemonSet example for creating nginx pods only on nodes that have `disktype=ssd` label: YAML ```yaml apiVersion: apps/v1 kind: DaemonSet metadata: labels: app: nginx name: nginx-daemonset spec: selector: matchLabels: name: nginx-pod template: metadata: labels: name: nginx-pod spec: containers: - image: nginx:latest name: nginx-container ports: - containerPort: 80 nodeSelector: disktype: ssd ``` When you add a new node to the cluster, that pod is also added to the new node. When a node is removed (or the cluster shrinks), Kubernetes automatically garbage-collects that pod. Network Traffic Monitoring With DaemonSet In the ever-evolving landscape of network management, understanding and overseeing network traffic is pivotal. Network traffic essentially refers to the amount and type of data moving across your network – this could be anything from user requests to data transfers. It’s the lifeblood of any digital environment, influencing the performance, security, and overall health of your network. The Role of DaemonSets in Traffic Monitoring How do you keep an eye on this in a Kubernetes environment? This is where DaemonSets come into play. As you already know, DaemonSets are a Kubernetes feature that allows you to deploy a pod on every node in your cluster. Why is that important for network traffic monitoring? Well, each node in your Kubernetes cluster can be involved in different kinds of network activities. By deploying a monitoring agent on every node, you get a comprehensive view of what’s happening across your entire cluster. You might be wondering now: Why not just use a Deployment and adjust the number of replicas to run on one or maybe two nodes to monitor the traffic of all nodes? It sounds simpler, but here’s the catch: Security and isolation: In Kubernetes, each node operates in its own isolated environment. This means that a pod on one node can’t directly monitor or access the network traffic of another node due to the security policies and Linux namespaces. These security measures are crucial for maintaining the integrity of your cluster. Accurate and localized data: By having a monitoring agent on each node, you get precise, localized data about the traffic. This level of granularity is essential for effective monitoring, as it helps in identifying specific issues and bottlenecks that might occur on individual nodes. Scalability and reliability: Using DaemonSets ensures that your monitoring setup scales with your cluster. As nodes are added or removed, the DaemonSet automatically adjusts, deploying or removing pods as needed. This dynamic scalability is a core requirement for maintaining a robust monitoring system in a growing or changing environment. 
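To make the per-node deployment concrete, here is a hypothetical sketch of a node-level traffic-monitoring DaemonSet. The image name and namespace are placeholders rather than any real agent's manifest, and the tolerations simply allow the pod onto tainted nodes (such as control-plane nodes) so that no node is left unmonitored. YAML
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: traffic-agent
  namespace: monitoring
spec:
  selector:
    matchLabels:
      name: traffic-agent
  template:
    metadata:
      labels:
        name: traffic-agent
    spec:
      # Allow the agent onto tainted nodes as well, so every node is covered.
      tolerations:
        - key: node-role.kubernetes.io/control-plane
          operator: Exists
          effect: NoSchedule
      containers:
        - name: traffic-agent
          image: example.com/traffic-agent:latest  # placeholder image
          resources:
            limits:
              cpu: 100m
              memory: 128Mi
The egressd collector described later in this article follows the same DaemonSet pattern, with its own image and configuration.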
As you can see, using DaemonSets for network traffic monitoring in a Kubernetes cluster isn’t just a matter of convenience; it’s a necessity for accurate, secure, and scalable network analysis. Each node has its own unique traffic patterns and potential issues, and DaemonSets ensures you don’t miss out on these critical insights. They empower you to maintain a high-performing and secure Kubernetes environment by providing a bird’s-eye view of your network traffic, node by node. Simplifying Network Traffic Monitoring in Kubernetes When it comes to keeping tabs on network traffic in your Kubernetes cluster, the road can be complex and challenging. Those keen on DIY approaches might consider building a custom solution. This could involve leveraging tools like `conntrack` to monitor each pod’s traffic, crafting intricate logic to process and store data, and continuously tackling a variety of potential issues that might arise along the way. While this approach offers flexibility, it’s often resource-intensive and riddled with complexities. A Streamlined Alternative to Network Monitoring Alternatively, what if you could bypass these hurdles and jump straight to an efficient, ready-to-use solution? That’s exactly what our open-source egressd tool offers. It’s designed to simplify network traffic monitoring in Kubernetes, providing a comprehensive and hassle-free approach. egressd consists of two main components: 1. Collector: A DaemonSet pod responsible for monitoring network traffic on nodes. 2. Exporter: A Deployment pod that fetches traffic data from each collector and export logs to HTTP or Prometheus. Here’s what our solution brings to the table: 1. Efficient Conntrack Monitoring egressd retrieves conntrack entries for pods on each node at a configured interval, defaulting to every five seconds. If you’re using Cilium, it fetches conntrack records directly from eBPF maps located in the host’s `/sys/fs/bpf` directory, which are created by Cilium. For setups using the Linux Netfilter Conntrack module, it leverages Netlink to obtain these records. 2. Intelligent Data Reduction The records are then streamlined, focusing on key parameters like source IP, destination IP, and protocol to provide a clear picture of network interactions. 3. Enhanced With Kubernetes Context We enrich the data by adding Kubernetes-specific context. This includes information about source and destination pods, nodes, node zones, and IP addresses, giving you a comprehensive view of your cluster’s network traffic. 4. Flexible Export Options The exporter in our solution is designed to be versatile, offering the capability to send logs either to an HTTP endpoint or to Prometheus for detailed analysis and alerting. Sidestep the Complexity of Building and Maintaining a Custom Solution With egressd You get a solid, ready-to-deploy system that seamlessly integrates into your Kubernetes environment, providing detailed, real-time insights into your network traffic. This means you can focus more on strategic tasks and less on the intricacies of monitoring infrastructure. Additionally, egressd provides you with two options: egressd can be installed as a standalone tool that will track your network traffic movements within the cluster, which you can then visualize in Grafana to get a better picture of your network: Alternatively, if you’re a CAST AI user, you can connect egressd to your dashboard to get all the benefits of our fancy cost reports. 
This way, you can see not only the amount of traffic within the cluster but also get more insights about workload-to-workload communication – and how much you pay for that traffic as it differentiates between different providers/regions/zones. Wrap Up Kubernetes DaemonSets come in handy for logging and monitoring purposes, but this is just the tip of the iceberg. You can also use them to tighten your security and achieve compliance by running CIS Benchmarks on each node and deploying security agents like intrusion detection systems or vulnerability scanners to run on nodes that handle PCI and PII-compliant data. And if you’re looking for more cost optimization opportunities, get started with a free cost monitoring report that has been fine-tuned to match the needs of Kubernetes teams: Breakdown of costs per cluster, workload, label, namespace, allocation group, and more. Workload efficiency metrics, with CPU and memory hours wasted per workload. Available savings report that shows how much you stand to save if you move your workloads to more cost-optimized nodes.
To log, or not to log? To log! Nowadays, we can’t even imagine a modern software system without logging subsystem implementation, because it’s the very basic tool of debugging and monitoring developers can’t be productive without. Once something gets broken or you just want to know what’s going on in the depths of your code execution, there’s almost no other way than just to implement a similar functionality. With distributed systems, and microservices architectures in particular, the situation gets even more complicated since each service can theoretically call any other service (or several of them at once), using either REST, gRPC, or asynchronous messaging (by means of numerous service buses, queues, brokers, and actor-based frameworks). Background processing goes there as well, resulting in entangled call chains we still want to have control over. In this article we will show you how to implement efficient distributed tracing in .NET quickly, avoiding the modification of low-level code as much as possible so that only generic tooling and base classes for each communication instrument are affected. Ambient Context Is The Core: Exploring The AsyncLocal Let’s start with the root which ensures the growth of our tree - that is, where the tracing information is stored. Because to log the tracing information, we need to store it somewhere and then get it somehow. Furthermore, this information should be available throughout the execution flow - this is exactly what we want to achieve. Thus, I’ve chosen to implement the ambient context pattern (you’re probably familiar with it from HttpContext): simply put, it provides global access to certain resources in the scope of execution flow. Though it’s sometimes considered an anti-pattern, in my opinion, the dependency injection concerns are a bit out of… scope (sorry for the pun), at least for a specific case where we don’t hold any business data. And .NET can help us with that, providing the AsyncLocal<T> class; as opposed to ThreadLocal<T>, which ensures data locality in the scope of a certain thread, AsyncLocal is used to hold data for tasks, which (as we know) can be executed in any thread. It’s worth mentioning that AsyncLocal works top down, so once you set the value at the start of the flow, it will be available for the rest of the ongoing flow as well, but if you change the value in the middle of the flow, it will be changed for the flow branch only; i.e., data locality will be preserved for each branch separately. If we look at the picture above, the following consequent use cases can be considered as examples: We set the AsyncLocal value as 0in the Root Task. If we don’t change it in the child tasks, it will be read as 0 in the child tasks’ branches as well. We set the AsyncLocalvalue as 1 in the Child Task 1. If we don’t change it in the Child Task 1.1, it will be read as 1 in the context of _Child Task 1 _and Child Task 1.1, but not in theRoot Task or Child Task 2 branch - they will keep 0. We set the AsyncLocal value as 2 in the Child Task 2. Similarly to #2, if we don’t change it in the Child Task 2.1, it will be read as 2 in the context of Child Task 2 and Child Task 2.1, but not in the Root Task or Child Task 1 branch - they will be 0 for Root Task, and 1 for Child Task 1 branch. We set the AsyncLocal value as 3 in the Child Task 1.1. This way, it will be read as 3 only in the context of Child Task 1.1, and not others’ - they will preserve previous values. We set the AsyncLocal value as 4 in the Child Task 2.1. 
This way, it will be read as 4 only in the context of Child Task 2.1, and not others’ - they will preserve previous values. OK, words are cheap: let’s get to the code! C# using Serilog; using System; using System.Threading; namespace DashDevs.Framework.ExecutionContext { /// /// Dash execution context uses to hold ambient context. /// IMPORTANT: works only top down, i.e. if you set a value in a child task, the parent task and other execution flow branches will NOT share the same context! /// That's why you should set needed properties as soon you have corresponding values for them. /// public static class DashExecutionContext { private static AsyncLocal _traceIdentifier = new AsyncLocal(); public static string? TraceIdentifier => _traceIdentifier.Value; /// /// Tries to set the trace identifier. /// /// Trace identifier. /// If existing trace ID should be replaced (set to true ONLY if you receive and handle traced entities in a constant context)! /// public static bool TrySetTraceIdentifier(string traceIdentifier, bool force = false) { return TrySetValue(nameof(TraceIdentifier), traceIdentifier, _traceIdentifier, string.IsNullOrEmpty, force); } private static bool TrySetValue( string contextPropertyName, T newValue, AsyncLocal ambientHolder, Func valueInvalidator, bool force) where T : IEquatable { if (newValue is null || newValue.Equals(default) || valueInvalidator.Invoke(newValue)) { return false; } var currentValue = ambientHolder.Value; if (force || currentValue is null || currentValue.Equals(default) || valueInvalidator.Invoke(currentValue)) { ambientHolder.Value = newValue; return true; } else if (!currentValue.Equals(newValue)) { Log.Error($"Tried to set different value for {contextPropertyName}, but it is already set for this execution flow - " + $"please, check the execution context logic! Current value: {currentValue} ; rejected value: {newValue}"); } return false; } } } Setting the trace ID is as simple as DashExecutionContext.TrySetTraceIdentifier(“yourTraceId”)with an optional value replacement option (we will talk about it later), and then you can access the value with DashExecutionContext.TraceIdentifier. We could implement this class to hold a dictionary as well; just in our case, it was enough (you can do this by yourself if needed, initializing a ConcurrentDictionary<TKey, TValue> for holding ambient context information with TValue being AsyncLocal). In the next section, we will enrich Serilog with trace ID values to be able to filter the logs and get complete information about specific call chains. Logging Made Easy With Serilog Dynamic Enrichment Serilog, being one of the most famous logging tools on the market (if not the most), comes with an enrichment concept - logs can include additional metadata of your choice by default, so you don’t need to set it for each write by yourself. While this piece of software already provides us with an existing LogContext, which is stated to be ambient, too, its disposable nature isn’t convenient to use and reduces the range of execution flows, while we need to process them in the widest range possible. So, how do we enrich logs with our tracing information? Among all the examples I’ve found that the enrichment was made using immutable values, so the initial plan was to implement a simple custom enricher quickly which would accept the delegate to get DashExecutionContext.TraceIdentifier value each time the log is written to reach our goal and log the flow-specific data. 
Fortunately, there’s already a community implementation of this feature, so we’ll just use it like this during logger configuration initialization: C# var loggerConfiguration = new LoggerConfiguration() ... .Enrich.WithDynamicProperty(“X-Dash-TraceIdentifier”, () => DashExecutionContext.TraceIdentifier) ... Yes, it's as simple as that - just a single line of code with a lambda, and all your logs now have a trace identifier! HTTP Headers With Trace IDs for ASP.NET Core REST API and GRPC The next move is to set the trace ID in the first place so that something valuable is shown in the logs. In this section, we will learn how to do this for REST API and gRPC communication layers, both server and client sides. Server Side: REST API For the server side, we can use custom middleware and populate our requests and responses with a trace ID header (don’t forget to configure your pipeline so that this middleware is the first one!). C# using DashDevs.Framework.ExecutionContext; using Microsoft.AspNetCore.Hosting; using Microsoft.AspNetCore.Http; using Serilog; using System.Threading.Tasks; namespace DashDevs.Framework.Middlewares { public class TracingMiddleware { private const string DashTraceIdentifier = "X-Dash-TraceIdentifier"; private readonly RequestDelegate _next; public TracingMiddleware(RequestDelegate next) { _next = next; } public async Task Invoke(HttpContext httpContext) { if (httpContext.Request.Headers.TryGetValue(DashTraceIdentifier, out var traceId)) { httpContext.TraceIdentifier = traceId; DashExecutionContext.TrySetTraceIdentifier(traceId); } else { Log.Debug($"Setting the detached HTTP Trace Identifier for {nameof(DashExecutionContext)}, because the HTTP context misses {DashTraceIdentifier} header!"); DashExecutionContext.TrySetTraceIdentifier(httpContext.TraceIdentifier); } httpContext.Response.OnStarting(state => { var ctx = (HttpContext)state; ctx.Response.Headers.Add(DashTraceIdentifier, new[] { ctx.TraceIdentifier }); // there’s a reason not to use DashExecutionContext.TraceIdentifier value directly here return Task.CompletedTask; }, httpContext); await _next(httpContext); } } } Since the code is rather simple, we will stop only on a line where the response header is added. In our practice, we’ve faced a situation when in specific cases the response context was detached from the one we’d expected because of yet unknown reason, and thus the DashExecutionContext.TraceIdentifier value was null. Please, feel free to leave a comment if you know more - we’ll be glad to hear it! Client Side: REST API For REST API, your client is probably a handy library like Refit or RestEase. Not to add the header each time and produce unnecessary code, we can use an HttpMessageHandler implementation that fits the client of your choice. Here we’ll go with Refit and implement a DelegatingHandler for it. 
C# using System; using System.Net.Http; using System.Threading; using System.Threading.Tasks; using DashDevs.Framework.ExecutionContext; namespace DashDevs.Framework.HttpMessageHandlers { public class TracingHttpMessageHandler : DelegatingHandler { private const string DashTraceIdentifier = "X-Dash-TraceIdentifier"; protected override async Task SendAsync(HttpRequestMessage request, CancellationToken cancellationToken) { if (!request.Headers.TryGetValues(DashTraceIdentifier, out var traceValues)) { var traceId = DashExecutionContext.TraceIdentifier; if (string.IsNullOrEmpty(traceId)) { traceId = Guid.NewGuid().ToString(); } request.Headers.Add(DashTraceIdentifier, traceId); } return await base.SendAsync(request, cancellationToken); } } } Then you just need to register this handler as a scoped service in the ConfigureServices method of your Startup class and finally add it to your client configuration as follows. C# public void ConfigureServices(IServiceCollection services) { ... services.AddScoped(); ... services.AddRefitClient(). ... .AddHttpMessageHandler(); ... } Server Side: gRPC For gRPC, the code is generated from Protobuf IDL (interface definition language) definitions, which can use interceptors for intermediate processing. For the server side, we’ll implement a corresponding one that checks the request headers for the trace ID header. C# using DashDevs.Framework.ExecutionContext; using Grpc.Core; using Grpc.Core.Interceptors; using System; using System.Linq; using System.Threading.Tasks; namespace DashDevs.Framework.gRPC.Interceptors { public class ServerTracingInterceptor : Interceptor { private const string DashTraceIdentifier = "X-Dash-TraceIdentifier"; public override Task UnaryServerHandler(TRequest request, ServerCallContext context, UnaryServerMethod continuation) { ProcessTracing(context); return continuation(request, context); } public override Task ClientStreamingServerHandler(IAsyncStreamReader requestStream, ServerCallContext context, ClientStreamingServerMethod continuation) { ProcessTracing(context); return continuation(requestStream, context); } public override Task ServerStreamingServerHandler(TRequest request, IServerStreamWriter responseStream, ServerCallContext context, ServerStreamingServerMethod continuation) { ProcessTracing(context); return continuation(request, responseStream, context); } public override Task DuplexStreamingServerHandler(IAsyncStreamReader requestStream, IServerStreamWriter responseStream, ServerCallContext context, DuplexStreamingServerMethod continuation) { ProcessTracing(context); return continuation(requestStream, responseStream, context); } private void ProcessTracing(ServerCallContext context) { if (string.IsNullOrEmpty(DashExecutionContext.TraceIdentifier)) { var traceIdEntry = context.RequestHeaders.FirstOrDefault(m => m.Key == DashTraceIdentifier.ToLowerInvariant()); var traceId = traceIdEntry?.Value ?? Guid.NewGuid().ToString(); DashExecutionContext.TrySetTraceIdentifier(traceId); } } } } To make your server calls intercepted, you need to pass a new instance of the ServerTracingInterceptor to the ServerServiceDefinition.Intercept method. The ServerServiceDefinition, in turn, is obtained by a call of the BindService method of your generated service. The following example can be used as a starting point. C# ... 
C#
...
var server = new Server
{
    Services = { YourService.BindService(new YourServiceImpl()).Intercept(new ServerTracingInterceptor()) },
    Ports = { new ServerPort("yourServiceHost", Port, ServerCredentials.Insecure) }
};
server.Start();
...

Client Side: gRPC

The ChannelExtensions.Intercept extension method comes to the rescue here - we will call it after channel creation, but first we need to implement the interceptor itself in the form of a Func<Metadata, Metadata>, as shown below.

C#
using DashDevs.Framework.ExecutionContext;
using Grpc.Core;
using System;

namespace DashDevs.Framework.gRPC.Interceptors
{
    public static class ClientInterceptorFunctions
    {
        private const string DashTraceIdentifier = "X-Dash-TraceIdentifier";

        public static Func<Metadata, Metadata> TraceHeaderForwarder = (Metadata source) =>
        {
            var traceId = DashExecutionContext.TraceIdentifier;
            if (string.IsNullOrEmpty(traceId))
            {
                traceId = Guid.NewGuid().ToString();
            }

            source.Add(DashTraceIdentifier, traceId);
            return source;
        };
    }
}

The usage is quite simple:

1. Create the Channel object with your specific parameters.
2. Create your client object, passing the result of calling Intercept on the channel from step 1 with ClientInterceptorFunctions.TraceHeaderForwarder to the client constructor, instead of passing the original Channel instance.

It can be achieved with the following code as an example:

C#
...
var channel = new Channel("yourServiceHost:yourServicePort", ChannelCredentials.Insecure);
var client = new YourService.YourServiceClient(channel.Intercept(ClientInterceptorFunctions.TraceHeaderForwarder));
...

Base Message Class vs. Framework Message Metadata in Asynchronous Communication Software

The next question is how to pass the trace ID through various asynchronous communication software. Basically, you can either use framework-provided features to pass the trace ID along or take the more straightforward route of a base message class. Both have pros and cons:

- The base message approach is ideal for communication channels that provide no features for storing contextual data, and it's the least error-prone overall due to its simplicity. On the other hand, if you already have a set of messages defined, adding another field may break backward compatibility depending on the serialization mechanism (so if you go this way, it's better to do it from the very beginning and to consider it among other infrastructure concerns during design sessions), not to mention that it may touch a lot of code, which is better avoided.
- Setting framework metadata, if available, is the better choice, because you can leave your message processing code as it is with just a minor improvement that is automatically applied to all messaging across the whole system. Also, some software provides additional monitoring of this data (e.g., in a dashboard).
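As an illustration of the base message approach, here is a minimal sketch under the assumption that every message in the system derives from a shared base class (BaseMessage and OrderCreatedMessage are hypothetical names, not part of any framework):

C#
// A hypothetical base class that every message in the system inherits from.
public abstract class BaseMessage
{
    // The trace ID travels explicitly as part of the payload.
    public string TraceIdentifier { get; set; } = string.Empty;
}

// An example business message: serializers see the trace ID as a regular field.
public class OrderCreatedMessage : BaseMessage
{
    public Guid OrderId { get; set; }
}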
Next, we will walk through some real-world examples.

Amazon SQS

One of the most widely used message queues is Amazon Simple Queue Service. Fortunately, it provides message metadata (namely, message attributes) out of the box, so we will gladly use it. The first step is to add the trace ID to the messages we send, so you can do something like this.

C#
public async Task<SendMessageResponse> SendMessageAsync<T>(T message, CancellationToken cancellationToken, string? messageDeduplicationId = null)
{
    var amazonClient = new AmazonSQSClient(yourConfig);
    var messageBody = JsonSerializer.Serialize(message, yourJsonOptions);

    return await amazonClient.SendMessageAsync(
        new SendMessageRequest
        {
            QueueUrl = "yourQueueUrl",
            MessageBody = messageBody,
            MessageDeduplicationId = messageDeduplicationId,
            MessageAttributes = new Dictionary<string, MessageAttributeValue>()
            {
                {
                    "X-Dash-TraceIdentifier",
                    new MessageAttributeValue()
                    {
                        DataType = "String",
                        StringValue = DashExecutionContext.TraceIdentifier,
                    }
                }
            }
        },
        cancellationToken);
}

The second step is to read this trace ID in the receiver so that it can be set for the ambient context and we can continue in the same way.

C#
public async Task<List<Message>> GetMessagesAsync(int maxNumberOfMessages, CancellationToken token)
{
    if (maxNumberOfMessages < 0)
    {
        throw new ArgumentOutOfRangeException(nameof(maxNumberOfMessages));
    }

    var amazonClient = new AmazonSQSClient(yourConfig);

    var asyncMessage = await amazonClient.ReceiveMessageAsync(
        new ReceiveMessageRequest
        {
            QueueUrl = "yourQueueUrl",
            MaxNumberOfMessages = maxNumberOfMessages,
            WaitTimeSeconds = yourLongPollTimeout,
            MessageAttributeNames = new List<string>() { "X-Dash-TraceIdentifier" },
        },
        token);

    return asyncMessage.Messages;
}

Important note (also applicable to other messaging platforms): If you read and handle messages in a background loop one by one (not several at once), wait for each one to complete, and call DashExecutionContext.TrySetTraceIdentifier with the trace ID from the metadata before the handling method that contains your business logic, then the DashExecutionContext.TraceIdentifier value always lives in the same async context. That's why, in this case, it's essential to use the override option of DashExecutionContext.TrySetTraceIdentifier each time: it's safe, since only one message is processed at a time, so nothing gets mixed up. Otherwise, the very first trace ID taken from the metadata would be reused for all subsequent messages, which is wrong. But if you read and process your messages in batches, the simplest way is to add an intermediate async method in which DashExecutionContext.TrySetTraceIdentifier is called and a single message from the batch is processed, so that you preserve execution flow context isolation (and therefore the trace ID) for each message separately. In this case, the override is not needed.
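As a rough sketch of the one-by-one scenario described above (ProcessMessageAsync is a hypothetical handler, and the second argument of TrySetTraceIdentifier is assumed to be the override flag discussed earlier; adjust to the actual signature of your DashExecutionContext):

C#
while (!stoppingToken.IsCancellationRequested)
{
    // Receive at most one message per iteration (see GetMessagesAsync above).
    var messages = await GetMessagesAsync(1, stoppingToken);

    foreach (var message in messages)
    {
        // Pull the trace ID from the message attributes, if present.
        message.MessageAttributes.TryGetValue("X-Dash-TraceIdentifier", out var traceAttribute);
        var traceId = traceAttribute?.StringValue ?? Guid.NewGuid().ToString();

        // Overriding is safe here because messages are handled strictly one at a time.
        DashExecutionContext.TrySetTraceIdentifier(traceId, true);

        // Hypothetical business logic handler.
        await ProcessMessageAsync(message, stoppingToken);
    }
}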
Microsoft Orleans

Microsoft Orleans provides its own execution flow context out of the box, so it's extremely easy to pass metadata by means of the static RequestContext.Set(string key, object value) method and to read it in the receiver with RequestContext.Get(string key). The behavior is similar to the AsyncLocal we've already learned about; i.e., the original caller context always preserves the value that is projected to message receivers, and receiving a response doesn't change the caller's context metadata even if another value has been set on the other side. But how can we efficiently interlink it with the other contexts we use? The answer lies in grain call filters. So, at first, we will add the outgoing filter so that the trace ID is set for calls to other grains (a grain is the actor definition in Orleans).

C#
using DashDevs.Framework.ExecutionContext;
using Microsoft.AspNetCore.Http;
using Orleans;
using Orleans.Runtime;
using System;
using System.Threading.Tasks;

namespace DashDevs.Framework.Orleans.Filters
{
    public class OutgoingGrainTracingFilter : IOutgoingGrainCallFilter
    {
        private const string TraceIdentifierKey = "X-Dash-TraceIdentifier";
        private const string IgnorePrefix = "Orleans.Runtime";

        public async Task Invoke(IOutgoingGrainCallContext context)
        {
            if (context.Grain.GetType().FullName.StartsWith(IgnorePrefix))
            {
                await context.Invoke();
                return;
            }

            var traceId = DashExecutionContext.TraceIdentifier;
            if (string.IsNullOrEmpty(traceId))
            {
                traceId = Guid.NewGuid().ToString();
            }

            RequestContext.Set(TraceIdentifierKey, traceId);
            await context.Invoke();
        }
    }
}

By default, the framework constantly sends numerous service messages between internal actors, and these are not subject to tracing, so it's essential to keep them out of our filters. That's why we've introduced an ignore prefix so that such messages aren't processed. It's also worth mentioning that this filter works on the pure client side, too. For example, if you call an actor from a REST API controller by means of the Orleans cluster client, the trace ID will be passed from the REST API context on to the actors' execution context, and so on.

Then we'll continue with an incoming filter, where we get the trace ID from RequestContext and initialize our DashExecutionContext with it. The ignore prefix is used there, too.

C#
using DashDevs.Framework.ExecutionContext;
using Orleans;
using Orleans.Runtime;
using System.Threading.Tasks;

namespace DashDevs.Framework.Orleans.Filters
{
    public class IncomingGrainTracingFilter : IIncomingGrainCallFilter
    {
        private const string TraceIdentifierKey = "X-Dash-TraceIdentifier";
        private const string IgnorePrefix = "Orleans.Runtime";

        public async Task Invoke(IIncomingGrainCallContext context)
        {
            if (context.Grain.GetType().FullName.StartsWith(IgnorePrefix))
            {
                await context.Invoke();
                return;
            }

            DashExecutionContext.TrySetTraceIdentifier(RequestContext.Get(TraceIdentifierKey).ToString());
            await context.Invoke();
        }
    }
}

Now let's finish with our Silo (the grain server in Orleans) host configuration so that it uses the filters we've just implemented, and we're done here!

C#
var siloHostBuilder = new SiloHostBuilder()
    ...
    .AddOutgoingGrainCallFilter<OutgoingGrainTracingFilter>()
    .AddIncomingGrainCallFilter<IncomingGrainTracingFilter>()
    ...
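Since the outgoing filter also works on the pure client side, it could be registered on the cluster client in a similar way. A minimal sketch, assuming the standard Orleans ClientBuilder (everything besides AddOutgoingGrainCallFilter is just your usual client configuration):

C#
var clusterClient = new ClientBuilder()
    // ... your usual cluster client configuration (clustering, application parts, etc.)
    .AddOutgoingGrainCallFilter<OutgoingGrainTracingFilter>()
    .Build();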
Background Processing

Another piece of software you may use quite often is a background job implementation. Here, the concept itself prevents us from using a base data structure (which would look like an obvious workaround), so we're going to review the features of Hangfire (the best-known background job library for .NET) that help us reach the goal of distributed tracing even for these kinds of execution units.

Hangfire

The feature that fits our goal best is job filtering, implemented in attribute form. Thus, we need to define our own filter attribute that derives from JobFilterAttribute and implements the IClientFilter and IServerFilter interfaces. From the client side we can access our DashExecutionContext.TraceIdentifier value, but not from the server side. So, to be able to reach this value from the server context, we'll pass our trace ID through a job parameter (worth mentioning: this is not a parameter of the job method you write in your code, but metadata handled by the framework). With this knowledge, let's define our job filter.

C#
using DashDevs.Framework.ExecutionContext;
using Hangfire.Client;
using Hangfire.Common;
using Hangfire.Server;
using Hangfire.States;
using Serilog;
using System;

namespace DashDevs.Framework.Hangfire.Filters
{
    public class TraceJobFilterAttribute : JobFilterAttribute, IClientFilter, IServerFilter
    {
        private const string TraceParameter = "TraceIdentifier";

        public void OnCreating(CreatingContext filterContext)
        {
            var traceId = GetParentTraceIdentifier(filterContext);

            if (string.IsNullOrEmpty(traceId))
            {
                traceId = DashExecutionContext.TraceIdentifier;
                Log.Information($"{filterContext.Job.Type.Name} job {TraceParameter} parameter was not set in the parent job, " +
                    "which means it's not a continuation.");
            }

            if (string.IsNullOrEmpty(traceId))
            {
                traceId = Guid.NewGuid().ToString();
                Log.Information($"{filterContext.Job.Type.Name} job {TraceParameter} parameter was not set in the {nameof(DashExecutionContext)} either. " +
                    "Generated a new one.");
            }

            filterContext.SetJobParameter(TraceParameter, traceId);
        }

        public void OnPerforming(PerformingContext filterContext)
        {
            var traceId = SerializationHelper.Deserialize<string>(
                filterContext.Connection.GetJobParameter(filterContext.BackgroundJob.Id, TraceParameter));

            DashExecutionContext.TrySetTraceIdentifier(traceId!);
        }

        public void OnCreated(CreatedContext filterContext)
        {
            return;
        }

        public void OnPerformed(PerformedContext filterContext)
        {
            return;
        }

        private static string? GetParentTraceIdentifier(CreateContext filterContext)
        {
            if (!(filterContext.InitialState is AwaitingState awaitingState))
            {
                return null;
            }

            var traceId = SerializationHelper.Deserialize<string>(
                filterContext.Connection.GetJobParameter(awaitingState.ParentId, TraceParameter));

            return traceId;
        }
    }
}

The specific case here is a continuation. If you don't set DashExecutionContext.TraceIdentifier, enqueue a regular job, and then specify a continuation, your continuation will not get the trace ID of the parent job. And even if you do set DashExecutionContext.TraceIdentifier first and your continuations end up sharing the same trace ID, that is more luck than design, given our job filter implementation and the AsyncLocal principles. Thus, checking the parent job is a must.

Now, the final step is to register the filter globally so that it's applied to all jobs.

C#
GlobalJobFilters.Filters.Add(new TraceJobFilterAttribute());
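For illustration, a parent job and its continuation could be enqueued as in the following sketch (YourJob, Process, and Complete are placeholder names; ContinueJobWith is the method name used in recent Hangfire versions). With the filter registered globally, the continuation picks up the parent's trace ID via the job parameter:

C#
// OnCreating stores the current trace ID as a job parameter of the parent job.
var parentJobId = BackgroundJob.Enqueue<YourJob>(job => job.Process());

// The continuation is created in the AwaitingState, so GetParentTraceIdentifier
// finds the parent's trace ID and reuses it for the continuation.
BackgroundJob.ContinueJobWith<YourJob>(parentJobId, job => job.Complete());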
Well, that's it - your Hangfire jobs are now under control, too! By the way, you can compare this approach with the Correlate integration proposed in the Hangfire docs.

Summary

In this article, we've tried to bring together numerous practices and real-world examples of distributed tracing in .NET so that they cover most use cases in any software solution. We don't cover automatic request/response and message logging directly here - it's the simplest part of the story, so the decision of whether and where to add automatic request/response/message logging (and any other logs) should be made according to your specific needs. Also, in addition to tracing, this approach fits any other data you may need to pass across your system. As you can see, the DashExecutionContext class, relying on AsyncLocal, plays the key role in transferring the trace identifier between different communication instruments within the scope of a single service, so it's crucial to understand how it works. Other interlinking implementations depend on the features of each piece of software and should be carefully reviewed to craft the best possible solution, one that can be applied automatically to all incoming and outgoing calls without modifying existing code. Thank you for reading!