Site reliability engineering (SRE) is a discipline in which automated software systems are built to manage the development operations (DevOps) of a product or service. In other words, SRE automates the functions of an operations team via software systems. The main purpose of SRE is to encourage the deployment and proper maintenance of large-scale systems. In particular, site reliability engineers are responsible for ensuring that a given system's behavior consistently meets business requirements for performance and availability. Furthermore, whereas traditional operations teams and development teams often have opposing incentives, site reliability engineers are able to align incentives so that both feature development and reliability are promoted simultaneously.

Basic SRE Principles

This article covers key principles that underlie SRE, provides some examples of those principles, and includes relevant details and illustrations to clarify those examples.

| Principle | Description | Example |
|---|---|---|
| Embrace risk | No system can be expected to have perfect performance. It's important to identify potential failure points and create mitigation plans. Additionally, it's important to budget a certain percentage of business costs to address these failures in real time. | A week consists of 168 hours of potential availability. The business sets an expectation of 165 hours of uptime per week to account for both planned maintenance and unplanned failures. |
| Set service level objectives (SLOs) | Set reasonable expectations for system performance to ensure that customers and internal stakeholders understand how the system is supposed to perform at various levels. Remember that no system can be expected to have perfect performance. | The website is up and running 99% of the time. 99% of all API requests return a successful response. The server output matches client expectations 99% of the time. 99% of all API requests are delivered within one second. The server can handle 10,000 requests per second. |
| Eliminate work through automation | Automate as many tasks and processes as possible. Engineers should focus on developing new features and enhancing existing systems at least as often as addressing real-time failures. | Production code automatically generates alerts whenever an SLO is violated. The automated alerts send tickets to the appropriate incident response team with relevant playbooks to take action. |
| Monitor systems | Use tools to monitor system performance. Observe performance, incidents, and trends. | A dashboard that displays the proportion of client requests and server responses that were delivered successfully in a given time period. A set of logs that displays the expected and actual output of client requests and server responses in a given time period. |
| Keep things simple | Release frequent, small changes that can be easily reverted to minimize production bugs. Delete unnecessary code instead of keeping it for potential future use. The more code and systems that are introduced, the more complexity is created; it's important to prevent accidental bloat. | Changes in code are always pushed via a version control system that tracks code writers, approvers, and previous states. |
| Outline the release engineering process | Document your established processes for development, testing, automation, deployments, and production support. Ensure that the process is accessible and visible. | A published playbook lists the steps to address a reboot failure. The playbook contains references to relevant SLOs, dashboards, previous tickets, sections of the codebase, and contact information for the incident response team. |
Embrace Risk

No system can be expected to have perfect performance. It's important to create reasonable expectations about system performance for both internal stakeholders and external users.

Key Metrics

For services that are directly user-facing, such as static websites and streaming, two common and important ways to measure performance are time availability and aggregate availability. This article provides an example of calculating time availability for a service. For other services, additional factors are important, including speed (latency), accuracy (correctness), and volume (throughput).

An example calculation for latency is as follows: Suppose 10 different users send identical HTTP requests to your website, and they are all served properly. The return times are monitored and recorded as follows: 1 ms, 3 ms, 3 ms, 4 ms, 1 ms, 1 ms, 1 ms, 5 ms, 3 ms, and 2 ms. The average response time, or latency, is 24 ms / 10 requests = 2.4 ms.

Choosing key metrics makes explicit how the performance of a service is evaluated, and therefore which factors pose a risk to service health. In the example above, identifying latency as a key metric marks average return time as an essential property of the service. Thus, a risk to the reliability of the service is "slowness," or high latency.

Define Failure

In addition to measuring risks, it's important to clearly define which risks the system can tolerate without compromising quality and which risks must be addressed to ensure quality. Two common measurements that address failure are mean time to failure (MTTF) and mean time between failures (MTBF). The most robust way to define failures is to set SLOs, monitor your services for SLO violations, and create alerts and processes for fixing violations. These are discussed in the following sections.

Error Budgets

The development of new production features always introduces new potential risks and failures; aiming for a 100% risk-free service is unrealistic. The way to align the competing incentives of pushing development and maintaining reliability is through error budgets. An error budget provides a clear metric that allows a certain proportion of failure from new releases in a given planning cycle. If the number or length of failures exceeds the error budget, no new releases may occur until a new planning period begins. The following is an example error budget:

Planning cycle: Quarter
Total possible availability: 2,190 hours
SLO: 99% time availability
Error budget: 1% time availability = 21.9 hours

Suppose the development team plans to release 10 new features during the quarter, and the following occurs: The first feature doesn't cause any downtime. The second feature causes downtime of 10 hours until fixed. The third and fourth features each cause downtime of 6 hours until fixed. At this point, the error budget for the quarter has been exceeded (10 + 6 + 6 = 22 > 21.9), so the fifth feature cannot be released. In this way, the error budget has ensured an acceptable feature release velocity while not compromising reliability or degrading user experience.

Set Service Level Objectives (SLOs)

The best way to set performance expectations is to set specific targets for different system risks. These targets are called service level objectives, or SLOs.
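Before looking at concrete SLO examples, here is a minimal sketch of how the error budget bookkeeping above might be automated. It is a hypothetical illustration in Python using the figures from the example quarter; the names and numbers are illustrative and not taken from any particular tool.

Python

# Error budget bookkeeping for the quarterly example above.
TOTAL_HOURS = 2190                              # total possible availability in the quarter
SLO = 0.99                                      # 99% time availability
ERROR_BUDGET_HOURS = TOTAL_HOURS * (1 - SLO)    # 21.9 hours

# Hours of downtime caused by each release shipped so far this quarter.
downtime_per_release = [0, 10, 6, 6]

def may_release(estimated_downtime=0.0):
    """Return True if the error budget still allows another release."""
    spent = sum(downtime_per_release)
    return spent + estimated_downtime <= ERROR_BUDGET_HOURS

consumed = sum(downtime_per_release)
print(f"Error budget: {ERROR_BUDGET_HOURS:.1f} h, consumed: {consumed:.1f} h")
print("Next release allowed:", may_release())   # False: 22 h of downtime exceeds the 21.9 h budget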
The following table lists examples of SLOs based on different risk measurements:

| Risk measurement | Example SLO |
|---|---|
| Time availability | Website running 99% of the time |
| Aggregate availability | 99% of user requests processed |
| Latency | 1 ms average response time per request |
| Throughput | 10,000 requests handled every second |
| Correctness | 99% of database reads accurate |

Depending on the service, some SLOs may be more complicated than just a single number. For example, a database may exhibit 99.9% correctness on reads, but the 0.1% of errors it incurs may always involve the most recent data. If a customer relies heavily on data recorded in the past 24 hours, then the service is not reliable. In this case, it makes sense to create a tiered SLO based on the customer's needs. Here is an example:

| Tier | Example SLO |
|---|---|
| Level 1 (records within the last 24 hours) | 99.99% read accuracy |
| Level 2 (records within the last 7 days) | 99.9% read accuracy |
| Level 3 (records within the last 30 days) | 99% read accuracy |
| Level 4 (records within the last 6 months) | 95% read accuracy |

Costs of Improvement

One of the main purposes of establishing SLOs is to track how reliability affects revenue. Revisiting the sample error budget from the section above, suppose there is a projected service revenue of $500,000 for the quarter. This can be used to translate the SLO and error budget into real dollars. Thus, SLOs are also a way to measure objectives that are indirectly related to system performance.

| SLO | Error budget | Revenue lost |
|---|---|---|
| 95% | 5% | $25,000 |
| 99% | 1% | $5,000 |
| 99.9% | 0.1% | $500 |
| 99.99% | 0.01% | $50 |

Using SLOs to track indirect metrics, such as revenue, allows one to assess the cost of improving service. In this case, spending $10,000 on improving the SLO from 95% to 99% is a worthwhile business decision. On the other hand, spending $10,000 on improving the SLO from 99% to 99.9% is not.

Eliminate Work Through Automation

One characteristic that distinguishes SREs from traditional DevOps is the ability to scale up the scope of a service without scaling the cost of the service. Called sublinear growth, this is accomplished via automation. In a traditional development-operations split, the development team pushes new features, while the operations team dedicates 100% of its time to maintenance. Thus, a pure operations team will need to grow 1:1 with the size and scope of the service it is maintaining: If it takes O(10) system engineers to serve 1,000 users, it will take O(100) engineers to serve 10,000 users. In contrast, an SRE team operating according to best practices will devote at least 50% of its time to developing systems that remove the basic elements of effort from the operations workload. Some examples of this include the following:

- A service that detects which machines in a large fleet need software updates and schedules software reboots in batches over regular time intervals
- A "push-on-green" module that provides an automatic workflow for the testing and release of new code to relevant services
- An alerting system that automates ticket generation and notifies incident response teams

Monitor Systems

To maintain reliability, it is imperative to monitor the relevant analytics for a service and use monitoring to detect SLO violations.
As mentioned earlier, some important metrics include:

- The amount of time that a service is up and running (time availability)
- The number of requests that complete successfully (aggregate availability)
- The amount of time it takes to serve a request (latency)
- The proportion of responses that deliver expected results (correctness)
- The volume of requests that a system is currently handling (throughput)
- The percentage of available resources being consumed (saturation)

Sometimes durability is also measured, which is the length of time that data is stored with accuracy.

Dashboards

A good way to implement monitoring is through dashboards. An effective dashboard will display SLOs, include the error budget, and present the different risk metrics relevant to the SLO.

Example of an effective SRE dashboard (source)

Logs

Another good way to implement monitoring is through logs. Logs that are both searchable in time and categorized via request are the most effective. If an SLO violation is detected via a dashboard, a more detailed picture can be created by viewing the logs generated during the affected timeframe.

Example of a monitoring log (source)

Whitebox Versus Blackbox

The type of monitoring discussed above that tracks the internal analytics of a service is called whitebox monitoring. Sometimes it's also important to monitor the behavior of a system from the "outside," which means testing the workflow of a service from the point of view of an external user; this is called blackbox monitoring. Blackbox monitoring may reveal problems with access permissions or redundancy.

Automated Alerts and Ticketing

One of the best ways for SREs to reduce effort is to use automation during monitoring for alerts and ticketing. The SRE process is much more efficient than a traditional operations process. A traditional operations response may look like this:

1. A web developer pushes a new update to an algorithm that serves ads to users.
2. The developer notices that the latest push is reducing website traffic due to an unknown cause and manually files a ticket about reduced traffic with the web operations team.
3. A system engineer on the web operations team receives a ticket about the reduced traffic issue. After troubleshooting, the issue is diagnosed as a latency issue caused by a stuck cache.
4. The web operations engineer contacts a member of the database team for help.
5. The database team looks into the codebase and identifies a fix for the cache settings so that data is refreshed more quickly and latency is decreased.
6. The database team updates the cache refresh settings, pushes the fix to production, and closes the ticket.

In contrast, an SRE operations response may look like this:

1. The ads SRE team creates a deployment tool that monitors three different traffic SLOs: availability, latency, and throughput.
2. A web developer is ready to push a new update to an algorithm that serves ads, for which he uses the SRE deployment tool.
3. Within minutes, the deployment tool detects reduced website traffic. It identifies a latency SLO violation and creates an alert.
4. The on-call site reliability engineer receives the alert, which contains a proposal for updated cache refresh settings to make processing requests faster.
5. The site reliability engineer accepts the proposed changes, pushes the new settings to production, and closes the ticket.

By using an automated system for alerting and proposing changes to the database, the communication required, the number of people involved, and the time to resolution are all reduced.
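As a complement to the whitebox metrics discussed above, blackbox monitoring can start as a simple external probe. The sketch below is a minimal, hypothetical example in Python; it assumes the requests package is available and uses a placeholder health endpoint and threshold, measuring availability and latency from a user's point of view.

Python

import time
import requests

PROBE_URL = "https://example.com/health"   # placeholder endpoint, not a real service
LATENCY_THRESHOLD_SECONDS = 1.0

def blackbox_probe():
    """Probe the service the way an external user would and report availability and latency."""
    start = time.monotonic()
    try:
        response = requests.get(PROBE_URL, timeout=5)
        available = response.status_code == 200
    except requests.RequestException:
        available = False
    latency = time.monotonic() - start

    if not available:
        print(f"Blackbox probe failed: {PROBE_URL} is unavailable.")
    elif latency > LATENCY_THRESHOLD_SECONDS:
        print(f"Blackbox probe warning: {PROBE_URL} answered in {latency:.2f} seconds.")
    return available, latency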
The following code block is a generic implementation, shown here in Python, of latency and throughput thresholds and automated alerts triggered upon detected violations.

Python

import time

# Define the latency SLO threshold in seconds and keep the observed request latencies
LATENCY_SLO_THRESHOLD = 0.1
request_latencies = []  # latency of each completed request, in seconds

# Define the throughput SLO threshold in requests per second and a counter to track requests
THROUGHPUT_SLO_THRESHOLD = 10000
request_count = 0
window_start = time.monotonic()

def record_request(latency_seconds):
    # Record one completed request so the SLO checks below have data to work with
    global request_count
    request_latencies.append(latency_seconds)
    request_count += 1

# Check if the latency SLO is violated and send an alert if it is
def check_latency_slo():
    if not request_latencies:
        return
    ordered = sorted(request_latencies)
    latency_99th_percentile = ordered[int(0.99 * (len(ordered) - 1))]
    if latency_99th_percentile > LATENCY_SLO_THRESHOLD:
        print(f"Latency SLO violated! 99th percentile response time is {latency_99th_percentile} seconds.")

# Check if the throughput SLO is violated and send an alert if it is
def check_throughput_slo():
    elapsed = time.monotonic() - window_start
    if elapsed <= 0:
        return
    current_throughput = request_count / elapsed
    if current_throughput > THROUGHPUT_SLO_THRESHOLD:
        print(f"Throughput SLO violated! Current throughput is {current_throughput:.0f} requests per second.")

Example of automated alert calls

Keep Things Simple

The best way to ensure that systems remain reliable is to keep them simple. SRE teams should be hesitant to add new code, preferring instead to modify and delete code where possible. Every additional API, library, and function added to production software increases dependencies in ways that are difficult to track, introducing new points of failure. Site reliability engineers should aim to keep their code modular. That is, each function in an API should serve only one purpose, as should each API in a larger stack. This type of organization makes dependencies more transparent and also makes diagnosing errors easier.

Playbooks

As part of incident management, playbooks for typical on-call investigations and solutions should be authored and published publicly. Playbooks for a particular scenario should describe the incident (and possible variations), list the associated SLOs, reference appropriate monitoring tools and codebases, offer proposed solutions, and catalog previous approaches.

Outline the Release Engineering Process

Just as an SRE codebase should emphasize simplicity, so should an SRE release process. Simplicity is encouraged through a few principles:

- Smaller size and higher velocity: Rather than large, infrequent releases, aim for a higher frequency of smaller ones. This allows the team to observe changes in system behavior incrementally and reduces the potential for large system failures.
- Self-service: An SRE team should completely own its release process, which should be automated effectively. This both eliminates work and encourages small-size, high-velocity pushes.
- Hermetic builds: The process for building a new release should be hermetic, or self-contained. That is to say, the build process must be locked to known versions of existing tools (e.g., compilers) and not depend on external tools.

Version Control

All code releases should be submitted within a version control system to allow for easy reversions in the event of erroneous, redundant, or ineffective code.
Code Reviews

The process of submitting releases should be accompanied by a clear and visible code review process. Basic changes may not require approval, whereas more complicated or impactful changes will require approval from other site reliability engineers or technical leads.

Recap of SRE Principles

The main principles of SRE are embracing risk, setting SLOs, eliminating work via automation, monitoring systems, keeping things simple, and outlining the release engineering process. Embracing risk involves clearly defining failure and setting error budgets. The best way to do this is by creating and enforcing SLOs, which track system performance directly and also help identify the potential costs of system improvement. The appropriate SLO depends on how risk is measured and the needs of the customer. Enforcing SLOs requires monitoring, usually through dashboards and logs.

Site reliability engineers focus on project work, in addition to development operations, which allows services to expand in scope and scale while maintaining low costs. This is called sublinear growth and is achieved through automating repetitive tasks. Monitoring that automates alerting creates a streamlined operations process, which increases reliability. Site reliability engineers should keep systems simple by reducing the amount of code written, encouraging modular development, and publishing playbooks with standard operating procedures. SRE release processes should be hermetic and push small, frequent changes using version control and code reviews.
To log, or not to log? To log! Nowadays, we can't even imagine a modern software system without a logging subsystem, because it's the very basic tool of debugging and monitoring that developers can't be productive without. Once something gets broken, or you just want to know what's going on in the depths of your code execution, there's almost no other way than to implement such functionality. With distributed systems, and microservices architectures in particular, the situation gets even more complicated, since each service can theoretically call any other service (or several of them at once), using either REST, gRPC, or asynchronous messaging (by means of numerous service buses, queues, brokers, and actor-based frameworks). Background processing goes there as well, resulting in entangled call chains we still want to have control over. In this article, we will show you how to implement efficient distributed tracing in .NET quickly, avoiding the modification of low-level code as much as possible so that only generic tooling and base classes for each communication instrument are affected.

Ambient Context Is the Core: Exploring AsyncLocal

Let's start with the root which ensures the growth of our tree - that is, where the tracing information is stored. To log the tracing information, we need to store it somewhere and then get it somehow. Furthermore, this information should be available throughout the execution flow - this is exactly what we want to achieve. Thus, I've chosen to implement the ambient context pattern (you're probably familiar with it from HttpContext): simply put, it provides global access to certain resources in the scope of an execution flow. Though it's sometimes considered an anti-pattern, in my opinion, the dependency injection concerns are a bit out of... scope (sorry for the pun), at least for a specific case where we don't hold any business data. And .NET can help us with that, providing the AsyncLocal<T> class; as opposed to ThreadLocal<T>, which ensures data locality in the scope of a certain thread, AsyncLocal<T> is used to hold data for tasks, which (as we know) can be executed on any thread.

It's worth mentioning that AsyncLocal works top down, so once you set the value at the start of the flow, it will be available for the rest of the ongoing flow as well, but if you change the value in the middle of the flow, it will be changed for that flow branch only; i.e., data locality will be preserved for each branch separately. If we look at the picture above, the following consecutive use cases can be considered as examples:

1. We set the AsyncLocal value as 0 in the Root Task. If we don't change it in the child tasks, it will be read as 0 in the child tasks' branches as well.
2. We set the AsyncLocal value as 1 in Child Task 1. If we don't change it in Child Task 1.1, it will be read as 1 in the context of Child Task 1 and Child Task 1.1, but not in the Root Task or Child Task 2 branch - they will keep 0.
3. We set the AsyncLocal value as 2 in Child Task 2. Similarly to #2, if we don't change it in Child Task 2.1, it will be read as 2 in the context of Child Task 2 and Child Task 2.1, but not in the Root Task or Child Task 1 branch - it will be 0 for the Root Task, and 1 for the Child Task 1 branch.
4. We set the AsyncLocal value as 3 in Child Task 1.1. This way, it will be read as 3 only in the context of Child Task 1.1, and not in the others - they will preserve their previous values.
5. We set the AsyncLocal value as 4 in Child Task 2.1. This way, it will be read as 4 only in the context of Child Task 2.1, and not in the others - they will preserve their previous values.
OK, words are cheap: let's get to the code!

C#

using Serilog;
using System;
using System.Threading;

namespace DashDevs.Framework.ExecutionContext
{
    /// <summary>
    /// Dash execution context used to hold the ambient context.
    /// IMPORTANT: works only top down, i.e. if you set a value in a child task, the parent task and other execution flow branches will NOT share the same context!
    /// That's why you should set needed properties as soon as you have corresponding values for them.
    /// </summary>
    public static class DashExecutionContext
    {
        private static AsyncLocal<string> _traceIdentifier = new AsyncLocal<string>();

        public static string? TraceIdentifier => _traceIdentifier.Value;

        /// <summary>
        /// Tries to set the trace identifier.
        /// </summary>
        /// <param name="traceIdentifier">Trace identifier.</param>
        /// <param name="force">If the existing trace ID should be replaced (set to true ONLY if you receive and handle traced entities in a constant context)!</param>
        public static bool TrySetTraceIdentifier(string traceIdentifier, bool force = false)
        {
            return TrySetValue(nameof(TraceIdentifier), traceIdentifier, _traceIdentifier, string.IsNullOrEmpty, force);
        }

        private static bool TrySetValue<T>(
            string contextPropertyName,
            T newValue,
            AsyncLocal<T> ambientHolder,
            Func<T, bool> valueInvalidator,
            bool force)
            where T : IEquatable<T>
        {
            if (newValue is null || newValue.Equals(default) || valueInvalidator.Invoke(newValue))
            {
                return false;
            }

            var currentValue = ambientHolder.Value;
            if (force || currentValue is null || currentValue.Equals(default) || valueInvalidator.Invoke(currentValue))
            {
                ambientHolder.Value = newValue;
                return true;
            }
            else if (!currentValue.Equals(newValue))
            {
                Log.Error($"Tried to set different value for {contextPropertyName}, but it is already set for this execution flow - " +
                    $"please, check the execution context logic! Current value: {currentValue} ; rejected value: {newValue}");
            }

            return false;
        }
    }
}

Setting the trace ID is as simple as DashExecutionContext.TrySetTraceIdentifier("yourTraceId"), with an optional value replacement option (we will talk about it later), and then you can access the value with DashExecutionContext.TraceIdentifier. We could have implemented this class to hold a dictionary as well; in our case, a single value was enough (you can do this yourself if needed, initializing a ConcurrentDictionary<TKey, TValue> for holding ambient context information with TValue being an AsyncLocal). In the next section, we will enrich Serilog with trace ID values to be able to filter the logs and get complete information about specific call chains.

Logging Made Easy With Serilog Dynamic Enrichment

Serilog, being one of the most famous logging tools on the market (if not the most famous), comes with an enrichment concept - logs can include additional metadata of your choice by default, so you don't need to set it for each write yourself. While this piece of software already provides an existing LogContext, which is stated to be ambient too, its disposable nature isn't convenient to use and reduces the range of execution flows it covers, while we need to process them in the widest range possible. So, how do we enrich logs with our tracing information? In all the examples I've found, the enrichment was made using immutable values, so the initial plan was to quickly implement a simple custom enricher which would accept a delegate to get the DashExecutionContext.TraceIdentifier value each time a log is written, to reach our goal and log the flow-specific data.
Fortunately, there's already a community implementation of this feature, so we'll just use it like this during logger configuration initialization:

C#

var loggerConfiguration = new LoggerConfiguration()
    ...
    .Enrich.WithDynamicProperty("X-Dash-TraceIdentifier", () => DashExecutionContext.TraceIdentifier)
    ...

Yes, it's as simple as that - just a single line of code with a lambda, and all your logs now have a trace identifier!

HTTP Headers With Trace IDs for ASP.NET Core REST API and gRPC

The next move is to set the trace ID in the first place so that something valuable is shown in the logs. In this section, we will learn how to do this for the REST API and gRPC communication layers, both server and client sides.

Server Side: REST API

For the server side, we can use custom middleware and populate our requests and responses with a trace ID header (don't forget to configure your pipeline so that this middleware is the first one!).

C#

using DashDevs.Framework.ExecutionContext;
using Microsoft.AspNetCore.Hosting;
using Microsoft.AspNetCore.Http;
using Serilog;
using System.Threading.Tasks;

namespace DashDevs.Framework.Middlewares
{
    public class TracingMiddleware
    {
        private const string DashTraceIdentifier = "X-Dash-TraceIdentifier";

        private readonly RequestDelegate _next;

        public TracingMiddleware(RequestDelegate next)
        {
            _next = next;
        }

        public async Task Invoke(HttpContext httpContext)
        {
            if (httpContext.Request.Headers.TryGetValue(DashTraceIdentifier, out var traceId))
            {
                httpContext.TraceIdentifier = traceId;
                DashExecutionContext.TrySetTraceIdentifier(traceId);
            }
            else
            {
                Log.Debug($"Setting the detached HTTP Trace Identifier for {nameof(DashExecutionContext)}, because the HTTP context misses the {DashTraceIdentifier} header!");
                DashExecutionContext.TrySetTraceIdentifier(httpContext.TraceIdentifier);
            }

            httpContext.Response.OnStarting(state =>
            {
                var ctx = (HttpContext)state;
                ctx.Response.Headers.Add(DashTraceIdentifier, new[] { ctx.TraceIdentifier }); // there's a reason not to use the DashExecutionContext.TraceIdentifier value directly here
                return Task.CompletedTask;
            }, httpContext);

            await _next(httpContext);
        }
    }
}

Since the code is rather simple, we will stop only on the line where the response header is added. In our practice, we've faced a situation where, in specific cases, the response context was detached from the one we'd expected for a yet unknown reason, and thus the DashExecutionContext.TraceIdentifier value was null. Please feel free to leave a comment if you know more - we'll be glad to hear it!

Client Side: REST API

For the REST API, your client is probably a handy library like Refit or RestEase. To avoid adding the header manually each time and producing unnecessary code, we can use an HttpMessageHandler implementation that fits the client of your choice. Here we'll go with Refit and implement a DelegatingHandler for it.
C#

using System;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;
using DashDevs.Framework.ExecutionContext;

namespace DashDevs.Framework.HttpMessageHandlers
{
    public class TracingHttpMessageHandler : DelegatingHandler
    {
        private const string DashTraceIdentifier = "X-Dash-TraceIdentifier";

        protected override async Task<HttpResponseMessage> SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
        {
            if (!request.Headers.TryGetValues(DashTraceIdentifier, out var traceValues))
            {
                var traceId = DashExecutionContext.TraceIdentifier;
                if (string.IsNullOrEmpty(traceId))
                {
                    traceId = Guid.NewGuid().ToString();
                }

                request.Headers.Add(DashTraceIdentifier, traceId);
            }

            return await base.SendAsync(request, cancellationToken);
        }
    }
}

Then you just need to register this handler as a scoped service in the ConfigureServices method of your Startup class and finally add it to your client configuration as follows (IYourApi stands for your Refit client interface):

C#

public void ConfigureServices(IServiceCollection services)
{
    ...
    services.AddScoped<TracingHttpMessageHandler>();
    ...
    services.AddRefitClient<IYourApi>()
        ...
        .AddHttpMessageHandler<TracingHttpMessageHandler>();
    ...
}

Server Side: gRPC

For gRPC, the code is generated from Protobuf IDL (interface definition language) definitions, and the generated services can use interceptors for intermediate processing. For the server side, we'll implement a corresponding one that checks the request headers for the trace ID header.

C#

using DashDevs.Framework.ExecutionContext;
using Grpc.Core;
using Grpc.Core.Interceptors;
using System;
using System.Linq;
using System.Threading.Tasks;

namespace DashDevs.Framework.gRPC.Interceptors
{
    public class ServerTracingInterceptor : Interceptor
    {
        private const string DashTraceIdentifier = "X-Dash-TraceIdentifier";

        public override Task<TResponse> UnaryServerHandler<TRequest, TResponse>(TRequest request, ServerCallContext context, UnaryServerMethod<TRequest, TResponse> continuation)
        {
            ProcessTracing(context);
            return continuation(request, context);
        }

        public override Task<TResponse> ClientStreamingServerHandler<TRequest, TResponse>(IAsyncStreamReader<TRequest> requestStream, ServerCallContext context, ClientStreamingServerMethod<TRequest, TResponse> continuation)
        {
            ProcessTracing(context);
            return continuation(requestStream, context);
        }

        public override Task ServerStreamingServerHandler<TRequest, TResponse>(TRequest request, IServerStreamWriter<TResponse> responseStream, ServerCallContext context, ServerStreamingServerMethod<TRequest, TResponse> continuation)
        {
            ProcessTracing(context);
            return continuation(request, responseStream, context);
        }

        public override Task DuplexStreamingServerHandler<TRequest, TResponse>(IAsyncStreamReader<TRequest> requestStream, IServerStreamWriter<TResponse> responseStream, ServerCallContext context, DuplexStreamingServerMethod<TRequest, TResponse> continuation)
        {
            ProcessTracing(context);
            return continuation(requestStream, responseStream, context);
        }

        private void ProcessTracing(ServerCallContext context)
        {
            if (string.IsNullOrEmpty(DashExecutionContext.TraceIdentifier))
            {
                var traceIdEntry = context.RequestHeaders.FirstOrDefault(m => m.Key == DashTraceIdentifier.ToLowerInvariant());
                var traceId = traceIdEntry?.Value ?? Guid.NewGuid().ToString();
                DashExecutionContext.TrySetTraceIdentifier(traceId);
            }
        }
    }
}

To make your server calls intercepted, you need to pass a new instance of ServerTracingInterceptor to the ServerServiceDefinition.Intercept method. The ServerServiceDefinition, in turn, is obtained by a call to the BindService method of your generated service. The following example can be used as a starting point.
C#

...
var server = new Server
{
    Services = { YourService.BindService(new YourServiceImpl()).Intercept(new ServerTracingInterceptor()) },
    Ports = { new ServerPort("yourServiceHost", Port, ServerCredentials.Insecure) }
};
server.Start();
...

Client Side: gRPC

The ChannelExtensions.Intercept extension method comes to the rescue here - we will call it after channel creation, but first we have to implement the interceptor itself in the form of a Func, as shown below.

C#

using DashDevs.Framework.ExecutionContext;
using Grpc.Core;
using System;

namespace DashDevs.Framework.gRPC.Interceptors
{
    public static class ClientInterceptorFunctions
    {
        private const string DashTraceIdentifier = "X-Dash-TraceIdentifier";

        public static Func<Metadata, Metadata> TraceHeaderForwarder = (Metadata source) =>
        {
            var traceId = DashExecutionContext.TraceIdentifier;
            if (string.IsNullOrEmpty(traceId))
            {
                traceId = Guid.NewGuid().ToString();
            }

            source.Add(DashTraceIdentifier, traceId);
            return source;
        };
    }
}

The usage is quite simple:

1. Create the Channel object with specific parameters.
2. Create your client class object, passing the result of the channel's Intercept method (called with ClientInterceptorFunctions.TraceHeaderForwarder) to the client class constructor instead of the original Channel instance.

This can be achieved with the following code as an example:

C#

...
var channel = new Channel("yourServiceHost:yourServicePort", ChannelCredentials.Insecure);
var client = new YourService.YourServiceClient(channel.Intercept(ClientInterceptorFunctions.TraceHeaderForwarder));
...

Base Message Class vs. Framework Message Metadata in Asynchronous Communication Software

The next question is how to pass the trace ID in various asynchronous communication software. Basically, one can choose either to use framework-related features to pass the trace ID further or to go in a more straightforward manner with a base message. Both have pros and cons:

- The base message approach is ideal for communication where no features are provided to store contextual data, and it's the least error-prone overall due to its simplicity. On the other hand, if you have already defined a set of messages, backward compatibility may break if you just add another field, depending on the serialization mechanism (so if you are to go this way, it's better to do it from the very beginning and consider it among other infrastructure features during design sessions), not to mention that it may affect a lot of code, which is better avoided.
- Setting framework metadata, if available, is a better choice, because you can leave your message processing code as it is with just a minor improvement, which will be automatically applied to all messaging across the whole system. Also, some software may provide features for additional monitoring of this data (e.g., in a dashboard).

Next, we will provide you with some real-world examples.

Amazon SQS

One of the most widely used message queues is Amazon Simple Queue Service. Fortunately, it provides message metadata (namely, message attributes) out of the box, so we will gladly use it. The first step is to add the trace ID to the messages we send, so you can do something like this:
C#

public async Task<SendMessageResponse> SendMessageAsync<T>(T message, CancellationToken cancellationToken, string? messageDeduplicationId = null)
{
    var amazonClient = new AmazonSQSClient(yourConfig);
    var messageBody = JsonSerializer.Serialize(message, yourJsonOptions);

    return await amazonClient.SendMessageAsync(
        new SendMessageRequest
        {
            QueueUrl = "yourQueueUrl",
            MessageBody = messageBody,
            MessageDeduplicationId = messageDeduplicationId,
            MessageAttributes = new Dictionary<string, MessageAttributeValue>()
            {
                {
                    "X-Dash-TraceIdentifier",
                    new MessageAttributeValue()
                    {
                        DataType = "String",
                        StringValue = DashExecutionContext.TraceIdentifier,
                    }
                }
            }
        },
        cancellationToken);
}

The second step is to read this trace ID in a receiver to be able to set it for the ambient context and continue the same way.

C#

public async Task<List<Message>> GetMessagesAsync(int maxNumberOfMessages, CancellationToken token)
{
    if (maxNumberOfMessages < 0)
    {
        throw new ArgumentOutOfRangeException(nameof(maxNumberOfMessages));
    }

    var amazonClient = new AmazonSQSClient(yourConfig);

    var asyncMessage = await amazonClient.ReceiveMessageAsync(
        new ReceiveMessageRequest
        {
            QueueUrl = "yourQueueUrl",
            MaxNumberOfMessages = maxNumberOfMessages,
            WaitTimeSeconds = yourLongPollTimeout,
            MessageAttributeNames = new List<string>() { "X-Dash-TraceIdentifier" },
        },
        token);

    return asyncMessage.Messages;
}

Important note (also applicable to other messaging platforms): If you read and handle messages in the background loop one by one (not several at once) and wait for the completion of each one, calling DashExecutionContext.TrySetTraceIdentifier with the trace ID from the metadata before the message handling method with your business logic, then the DashExecutionContext.TraceIdentifier value always lives in the same async context. That's why in this case it's essential to use the override option of DashExecutionContext.TrySetTraceIdentifier each time: it's safe since only one message is processed at a time, so we don't create a mess. Otherwise, the very first metadata trace ID would be used for all upcoming messages as well, which is wrong. But if you read and process your messages in batches, the simplest way is to add an intermediate async method where DashExecutionContext.TrySetTraceIdentifier is called and a separate message from the batch is processed, so that you preserve execution flow context isolation (and therefore the trace ID) for each message separately. In this case, the override is not needed.

Microsoft Orleans

Microsoft Orleans provides its own execution flow context out of the box, so it's extremely easy to pass metadata by means of the static RequestContext.Set(string key, object value) method and read it in the receiver with RequestContext.Get(string key). The behavior is similar to the AsyncLocal behavior we've already learned about; i.e., the original caller context always preserves the value that is projected to message receivers, and getting responses doesn't imply any caller context metadata changes, even if another value has been set on the other side. But how can we efficiently interlink it with the other contexts we use? The answer lies within Grain call filters. So, at first, we will add the outgoing filter so that the trace ID is set for calls to other Grains (a Grain is an actor definition in Orleans).
C#

using DashDevs.Framework.ExecutionContext;
using Microsoft.AspNetCore.Http;
using Orleans;
using Orleans.Runtime;
using System;
using System.Threading.Tasks;

namespace DashDevs.Framework.Orleans.Filters
{
    public class OutgoingGrainTracingFilter : IOutgoingGrainCallFilter
    {
        private const string TraceIdentifierKey = "X-Dash-TraceIdentifier";
        private const string IgnorePrefix = "Orleans.Runtime";

        public async Task Invoke(IOutgoingGrainCallContext context)
        {
            if (context.Grain.GetType().FullName.StartsWith(IgnorePrefix))
            {
                await context.Invoke();
                return;
            }

            var traceId = DashExecutionContext.TraceIdentifier;
            if (string.IsNullOrEmpty(traceId))
            {
                traceId = Guid.NewGuid().ToString();
            }

            RequestContext.Set(TraceIdentifierKey, traceId);
            await context.Invoke();
        }
    }
}

By default, the framework constantly sends numerous service messages between specific actors, so it's mandatory to exclude them from our filters because they're not subject to tracing. Thus, we've introduced an ignore prefix so that these messages aren't processed. Also, it's worth mentioning that this filter works for the pure client side, too. For example, if you're calling an actor from a REST API controller by means of the Orleans cluster client, the trace ID will be passed from the REST API context further to the actors' execution context and so on.

Then we'll continue with an incoming filter, where we get the trace ID from RequestContext and initialize our DashExecutionContext with it. The ignore prefix is used there, too.

C#

using DashDevs.Framework.ExecutionContext;
using Orleans;
using Orleans.Runtime;
using System.Threading.Tasks;

namespace DashDevs.Framework.Orleans.Filters
{
    public class IncomingGrainTracingFilter : IIncomingGrainCallFilter
    {
        private const string TraceIdentifierKey = "X-Dash-TraceIdentifier";
        private const string IgnorePrefix = "Orleans.Runtime";

        public async Task Invoke(IIncomingGrainCallContext context)
        {
            if (context.Grain.GetType().FullName.StartsWith(IgnorePrefix))
            {
                await context.Invoke();
                return;
            }

            DashExecutionContext.TrySetTraceIdentifier(RequestContext.Get(TraceIdentifierKey).ToString());
            await context.Invoke();
        }
    }
}

Now let's finish with our Silo (a Grain server definition in Orleans) host configuration to use the features we've already implemented, and we're done here!

C#

var siloHostBuilder = new SiloHostBuilder()
    ...
    .AddOutgoingGrainCallFilter<OutgoingGrainTracingFilter>()
    .AddIncomingGrainCallFilter<IncomingGrainTracingFilter>()
    ...

Background Processing

Another piece of software you may use pretty often is a background jobs implementation. Here the concept itself prevents us from using a base data structure (which would look like an obvious workaround), so we're going to review the features of Hangfire (the most famous background jobs software) which will help us reach the goal of distributed tracing even for these kinds of execution units.

Hangfire

The feature that fits our goal best is job filtering, implemented in the Attribute form. Thus, we need to define our own filtering attribute which derives from JobFilterAttribute and implements the IClientFilter and IServerFilter interfaces. From the client side, we can access our DashExecutionContext.TraceIdentifier value, but not from the server. So, to be able to reach this value from the server context, we'll pass our trace ID through a Job Parameter (worth mentioning that it's not a parameter of the job method you write in your code, but metadata handled by the framework). With this knowledge, let's define our job filter.
C#

using DashDevs.Framework.ExecutionContext;
using Hangfire.Client;
using Hangfire.Common;
using Hangfire.Server;
using Hangfire.States;
using Serilog;
using System;

namespace DashDevs.Framework.Hangfire.Filters
{
    public class TraceJobFilterAttribute : JobFilterAttribute, IClientFilter, IServerFilter
    {
        private const string TraceParameter = "TraceIdentifier";

        public void OnCreating(CreatingContext filterContext)
        {
            var traceId = GetParentTraceIdentifier(filterContext);

            if (string.IsNullOrEmpty(traceId))
            {
                traceId = DashExecutionContext.TraceIdentifier;
                Log.Information($"{filterContext.Job.Type.Name} job {TraceParameter} parameter was not set in the parent job, " +
                    "which means it's not a continuation");
            }

            if (string.IsNullOrEmpty(traceId))
            {
                traceId = Guid.NewGuid().ToString();
                Log.Information($"{filterContext.Job.Type.Name} job {TraceParameter} parameter was not set in the {nameof(DashExecutionContext)} either. " +
                    "Generated a new one.");
            }

            filterContext.SetJobParameter(TraceParameter, traceId);
        }

        public void OnPerforming(PerformingContext filterContext)
        {
            var traceId = SerializationHelper.Deserialize<string>(
                filterContext.Connection.GetJobParameter(filterContext.BackgroundJob.Id, TraceParameter));

            DashExecutionContext.TrySetTraceIdentifier(traceId!);
        }

        public void OnCreated(CreatedContext filterContext)
        {
            return;
        }

        public void OnPerformed(PerformedContext filterContext)
        {
            return;
        }

        private static string? GetParentTraceIdentifier(CreateContext filterContext)
        {
            if (!(filterContext.InitialState is AwaitingState awaitingState))
            {
                return null;
            }

            var traceId = SerializationHelper.Deserialize<string>(
                filterContext.Connection.GetJobParameter(awaitingState.ParentId, TraceParameter));

            return traceId;
        }
    }
}

The specific case here is a continuation. If you don't set the DashExecutionContext.TraceIdentifier, enqueue a regular job, and then specify a continuation, your continuations will not get the trace ID of the parent job. And if you do set the DashExecutionContext.TraceIdentifier and then do the same, your continuations will share the same trace ID, but that should be considered simple luck rather than guaranteed behavior, considering our job filter implementation and the AsyncLocal principles. Thus, checking the parent is a must.

Now, the final step is to register the filter globally so that it's applied to all the jobs.

C#

GlobalJobFilters.Filters.Add(new TraceJobFilterAttribute());

Well, that's it - your Hangfire jobs are now under control, too! By the way, you can compare this approach with the Correlate integration proposed by the Hangfire docs.

Summary

In this article, we've tried to compose numerous practices and real-world examples for distributed tracing in .NET so that they can be used for most use cases in any software solution. We don't cover automatic request/response and message logging directly here - it's the simplest part of the story, so the implementation (i.e., if and where to add automatic request/response/message logging, and all other possible logs as well) should be made according to your specific needs. Also, in addition to tracing, this approach fits any other data that you may need to pass across your system. As you can see, the DashExecutionContext class, relying on AsyncLocal features, plays the key role in transferring the trace identifier between different communication instruments in the scope of a single service, so it's crucial to understand how it works.
Other interlink implementations depend on the features of each piece of software and should be carefully reviewed to craft the best solution possible, which can be automatically applied to all incoming and outgoing calls without modifications to existing code. Thank you for reading!
CI/CD Explained

CI/CD stands for continuous integration and continuous deployment, and these practices are the backbone of modern-day DevOps deployment. CI/CD is the process that allows software to be continuously built, tested, automated, and delivered in a continuous cadence. In a rapidly developing world with increasing requirements, the development and integration process needs to move at the same speed to ensure business delivery.

What Is Continuous Integration?

CI, or continuous integration, works on automated tests and builds. Changes made by developers are stored in a source branch of a shared repository. Any changes committed to this branch go through builds and testing before merging. This ensures consistent quality checks of the code that gets merged. As multiple developers work on different complex features, the changes are made to a common repository and merged in increments. Code changes go through pre-designed automated builds, and the code is tested for bugs to make sure it does not break the current workflow. Once all the checks, unit tests, and integration tests have passed, the code can be merged into the source branch. These additional checks ensure code quality, and versioning makes it easier to track any changes in case of issues. Continuous integration has paved the path for rapid development and incremental merging, making it easier to fulfill business requirements faster.

What Is Continuous Delivery?

CD, or continuous deployment, works on making the deployment process easier and bridges the gap between developers, operations teams, and business requirements. This process automatically deploys ready, tested code to the production environment. By automating the effort involved in deployment, frequent deployments can be handled by the operations team. This enables more business requirements to be delivered at a faster rate.

CD can also stand for continuous delivery, which includes the testing of code for bugs before it is deployed to the pre-production environment. Once tests are complete and bugs are fixed, the changes can then be deployed to production. This process allows for a production-ready version of the code to always be present, with newly tested changes added in continuous increments. As code gets merged in short increments, it is easy to test and scan for bugs before it reaches the pre-production and production environments. Code is already scanned in the automated pipelines before getting handed to the testing teams. This cycle of repeated scanning and testing helps reduce issues and also helps in faster debugging. Continuous integration allows for continuous delivery, which is followed by continuous deployment.

Figure 1: CI/CD

What Is the Difference Between Continuous Integration (CI) and Continuous Deployment (CD)?

The biggest difference between CI and CD is that CI focuses on prepping and branching code for the production environment, while CD focuses on automation and ensuring that this production-ready code is released. Continuous integration includes merging the developed features into a shared repository. The code is then built and unit-tested to make sure it is ready for production. This stage also includes UI testing if needed. Once a deployment-ready code version exists, we can move to the next phase, i.e., continuous deployment. The operations team then picks the code version for automated tests to ensure bug-free code. Once the functionality is tested, the code is merged into production using automated deployment pipelines.
Hence, both CI and CD work in sync to deliver at a rapid frequency with reduced manual effort.

Fundamentals of Continuous Integration

Continuous integration is also an important practice when it comes to Agile software development. Code changes are merged into a shared repository and undergo automated tests and checks. This helps in identifying possible issues and bugs at an earlier stage. As multiple developers may work on the same code repository, this step ensures there are proper checks in place that test the code, validate the code, and get a peer review before the changes get merged. Read DZone's guide to DevOps code reviews. Continuous integration works best if developers merge the code in small increments. This helps keep track of all the features and possible bug fixes that get merged into the shared code repository.

Fundamentals of Continuous Deployment

Continuous deployment enables frequent production deployments by automating the deployment process. As a result of CI, a production-ready version of code is always present in the pre-production environment. This allows developers and testers alike to run automated integration and regression tests, UI tests, and more in the staging environment. Once the tests are successfully run and the expected criteria are met, the code can be easily pushed to a live environment by either the development or operations teams.

Advantages and Disadvantages of CI/CD Implementation

CI/CD implementation can have both pros and cons. Having a faster deployment cycle can also lead to other problems down the line. Below are a few benefits and drawbacks of CI/CD implementation.

| Advantages of CI/CD | Disadvantages of CI/CD |
|---|---|
| Automated tests and builds: Automated tests and builds take the strain off of the developers and testers and bring consistency to the code. This is an important step in the CI/CD world. | Rapid deployments where they are not needed: There might be businesses that do not appreciate rapid change. A faster rollout period may not be suitable for the business model, and deep testing before deployment can also ensure fewer bugs and problems down the line. |
| Better code quality: Every commit goes through certain predefined checks before getting merged into the main branch. This ensures consistent code quality and allows bugs or potential issues to be detected at an earlier stage. | Monitoring: Faster rollout leads to less deep testing. Continuous monitoring is important in such cases to quickly identify any issues as they come. Hence, monitoring is a crucial part of a CI/CD process. |
| Faster rollout: Automated deployment leads to faster rollout. More features can be released to the end user in smaller chunks. Business requirements are delivered faster, keeping up with increasing demands and changes. | Issues and fixes: Less thorough testing may lead to escaped corner cases, also known as bugs. Some cases may be left unnoticed for longer periods. |
| Better transparency: As multiple developers work on a common repository, it is easier to track the changes and maintain transparency. Various version management tools help track history and versions, with additional checks before merging to ensure no overlaps or conflicts in the changes. | Dependency management: A change made in one microservice can cause a cascading chain of issues. Orchestration is required in such cases to ensure less breakage due to any change added in one part of the service. |
| Faster rollbacks and fixes: As the history and versioning are tracked, it is easier to roll back any change(s) that are causing issues in the application. Any fixes made can also be deployed to production faster. | Managing resources: With continuous changes being made, development and operations teams also need to keep up with the continuous requirements and maintenance of pipelines. |

Popular CI/CD Tools

Below are a few common CI/CD tools that make life easier for development teams:

AWS

AWS, or Amazon Web Services, is a popular DevOps and CI/CD platform. Like Azure, it provides the infrastructure needed for a CI/CD implementation. DZone has previously covered building CI/CD pipelines with AWS.

Azure DevOps

Azure DevOps services by Microsoft provide a suite of services to run a CI/CD implementation. From continuous builds to deployments, Azure DevOps handles everything in one platform.

Bitbucket

Bitbucket is a cloud-based version control system developed by Atlassian. Bitbucket Pipelines is a CI tool that is easily integrated with Bitbucket.

GitLab

In addition to providing all the features of GitHub, GitLab also provides a complete CI/CD setup. From wiki, branch management, versioning, and builds, to deployment, GitLab provides an array of services.

Jenkins

Jenkins is built using Java and is a commonly used, open-source continuous integration tool. It is easy to plug in and helps manage builds, automated checks, and deployments. It is very handy for real-time testing and reporting. Learn how to set up a CI/CD pipeline from scratch. Alternative comparisons: Jenkins vs. GitLab and Jenkins vs. Bamboo.

Conclusion

As said by Stan Lee, "With great power comes great responsibility." CI/CD provides a powerful array of tools to enable rapid development and deployment of features to keep up with business requirements. CI/CD is a constant process enabling continuous change. Once it is adopted properly, teams can easily deal with new requirements and fix and roll out any bugs or issues as they come. CI/CD is also often used in DevOps practices. Review these best practices further by reading this DevOps tutorial. With new tools available in the market, adoption of or migration to CI/CD has become easier than before. However, one needs to assess whether CI/CD is the right approach depending on their business use case and available resources. Please share your experience with CI/CD and your favorite CI/CD tool in the comments below.
What Are Feature Flags?

Feature flags are a software development technique that helps teams turn certain functionality on and off at runtime without deploying code. Both feature flags and modern development in general are focused on the race to deploy software to customers faster. However, the software not only has to reach the customer faster, it also has to be delivered with less risk. Feature flags are a potent tool (a set of patterns or techniques) that can be used to reinforce the CI/CD pipeline by increasing the velocity and decreasing the risk of the software deployed to the production environment. Feature flags are also known as feature bits, feature flippers, feature gates, conditional features, feature switches, or feature toggles (even though the last one may have a subtle distinction, which we will see a bit later). Related: CI/CD Software Development Patterns.

Feature flags help to control and experiment over the feature lifecycle. They are a DevOps best practice that is often observed in distributed version control systems. Even incomplete features can be pushed to production because feature flags help to separate deployment from release. Earlier, the lowest level of control was at the deployment level. Now, feature flags move the lowest level of control to each individual item or artifact (feature, update, or bug fix) that's in production, which makes it even more granular than the production deployment.

Feature Flags Deployment

Feature flags can be implemented as:

- Properties in JSON files or config maps
- A feature flag service

Once we have a good use case for feature flags (e.g., show or hide a button to access a feature), we have to decide where to implement the flag (frontend, backend, or a mix of both). With a feature flag service, we must install the SDK, create and code the flags within the feature flag platform, and then wrap the new code paths or new features within the flags. This enables the feature flags, and the new feature can be toggled on or off through a configuration file or a visual interface that is part of the feature flagging platform. We also set up the flag rules so that we can manage various scenarios. You may use different SDKs depending on the language of each service used. This also helps product managers run experiments on the new features. After the feature flags are live, we must manage them, which is known as feature flag management. After a feature flag has served its purpose or is no longer serving its purpose, we need to remove it to avoid the technical debt of leaving stale feature flags in the codebase. This can also be automated within the service platform. DZone has previously covered how to trigger pipelines with jobs in Azure DevOps.

Feature Toggles vs. Feature Flags

From an objective perspective, there may be no specific difference between a feature toggle and a feature flag, and for all practical purposes, you may consider them similar terms. However, feature toggles may carry a subtle connotation of a heavier binary "on/off" for the whole application, whereas feature flags can be much lighter and can manage ramp-up testing more easily. For example, a toggle could be an on/off switch (show ads on the site, don't show ads), and it could be augmented by a flag (Region1 gets ads from provider A, Region2 gets ads from provider B). Toggling may turn off all the ads, but a feature flag might be able to switch a region from one provider to another.
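To make the flag-wrapping described above concrete, here is a minimal, hypothetical sketch in plain Python reading a JSON configuration; it is not tied to any specific feature flag platform, and the flag names, regions, and providers are illustrative only. It gates a new code path and applies a region-based rule like the ad provider example:

Python

import json

# Hypothetical flag configuration; in practice this would live in a JSON file or config map.
FLAGS = json.loads("""
{
  "show_new_checkout": {"enabled": true},
  "ads": {"enabled": true, "rules": {"region1": "provider_a", "region2": "provider_b"}}
}
""")

def is_enabled(flag_name):
    """Return True if the flag exists and is switched on."""
    return FLAGS.get(flag_name, {}).get("enabled", False)

def ads_provider_for(region):
    """Pick an ad provider per region; the toggle acts as a kill switch that turns ads off entirely."""
    flag = FLAGS.get("ads", {})
    if not flag.get("enabled", False):
        return None  # toggle off: no ads at all
    return flag.get("rules", {}).get(region, "provider_a")

# Toggle point: the new code path stays dark until the flag is flipped, without redeploying.
if is_enabled("show_new_checkout"):
    print("render new checkout flow")
else:
    print("render old checkout flow")

print(ads_provider_for("region2"))  # provider_b

A feature flag service works the same way conceptually, except the configuration and targeting rules are managed in the service's platform instead of a local file.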
Types of Feature Flags There are different types of feature flags based on various scenarios, and in this section, we will look at some of the important types. The fundamental benefit of feature flags is the ability to ship alternative code paths within one deployable unit and to choose between those paths at runtime. Different user scenarios show that this benefit can be applied in multiple ways in different contexts. Two important facets that can be used to categorize feature flags are longevity (how long the flag will be alive) and dynamism (how frequently the switching decision changes), although other factors can also be considered. Release Flags For teams practicing continuous delivery, release flags enable trunk-based development and faster shipping velocity for the customer. These flags allow incomplete and untested code paths to be shipped to production as latent code, and they support the continuous delivery principle of separating feature release from code deployment. Release flags are very useful for product managers to manage the delivery of the product to customers as per the requirements. Operational Flags Operational flags are used for managing the operational aspects of the system's behavior. If a feature being rolled out has unclear performance implications, we should be able to quickly disable or degrade that feature in production when required. These are generally short-lived flags, but there are also some long-lived ones, a.k.a. kill switches, which can help degrade non-vital system functionality in production under heavy load. These long-lived flags can be seen as a manually managed circuit breaker that is triggered when set thresholds are crossed. Operational flags are very useful for responding quickly during production issues, and they need to be reconfigurable quickly so that they are ready for the next set of issues that may occur. Experimental Flags Experimental flags are generally used in A/B or multivariate testing. Users are placed in cohorts, and at runtime, the toggle router sends different users down different code paths based on the cohort they belong to. By tracking the aggregate behavior of the different cohorts, the effect of the different code paths can be observed, which helps make data-driven optimizations to application functionality, such as identifying the search variables that have the most impact on users. These flags need to operate with the same configuration for a specific time period (determined by traffic patterns and other factors, so that the results of the experiment are not invalidated) in order to generate statistically significant results. At the same time, because the routing decision is made per request and each request may come from a different user, these flags are highly dynamic and need to be managed appropriately. Customer/Permission Flags Customer/permission flags restrict or change the type of features or product experience that a user gets from a product. One example is a premium feature that only some users get based on a subscription. Martin Fowler describes the technique of turning on new features for a set of internal or beta users as a champagne brunch: an early opportunity to drink your own champagne. Compared to other flags, these are quite long-lived (often many years).
Additionally, as the permissions are specific to a user, the switching decision is generally made on a per-request basis, and hence, these flags are very dynamic. Feature Flags and CI/CD Feature flags are one of the important tools that help the CI/CD pipeline work better and deliver code to the customer faster. Continuous integration means integrating code changes from the development teams/members every few hours. With continuous delivery, the software is always ready for deployment. With continuous deployment, we deploy the software as soon as it is ready, using an automated process. CI and CD therefore offer great benefits because, when they work in tandem, they shorten the software development lifecycle (SDLC). However, software has bugs, and delivering code continuously and quickly can rapidly turn from an asset into a liability; this is where feature flags give us a way to enable or disable new features without a build or a deployment. In effect, they act as a safety net, just as tests act as a safety net by letting us know when the code is broken. We can ship new features and turn them on or off as required. Thus, feature flags are part of the release and rollout processes. Many engineering teams are now discussing how to implement continuous testing into the DevOps CI/CD pipeline. Implementation Techniques of Feature Flags Below are a few important implementation patterns and practices that may help reduce messy toggle point issues. Avoiding Conditionals Toggle or switch points are generally implemented using 'if' statements, which is fine for short-lived toggles. However, for long-lived toggles or multiple toggle points, some sort of strategy pattern for implementing alternative code paths is a more maintainable approach. Decision Points and Decision Logic Should Be Decoupled An issue with feature flags is that we may couple the toggle point (where the toggling decision is made) with the toggle router (the logic behind the decision). This creates rigidity, because the toggle points are hard-wired to the feature directly and we may not be able to modify sub-feature functionality easily. By decoupling the decision logic from the decision point, we can manage changes in toggle scope more effectively. Inversion of Decision If the application is tightly linked to the feature flagging service or platform, we again have to deal with rigidity: the application is harder to work with and reason about in isolation, and it also becomes difficult to test. These issues can be resolved by applying the software design principle of inversion of control, decoupling the application from the feature flagging service (see the sketch at the end of this article). Related: Create a Release Pipeline with Azure DevOps. How Feature Flags Can Improve Release Management Some of the benefits of using feature flags for release management are: Turn features on/off without a deployment Test directly in production Segment users based on different attributes Segments are users or groups of users that have certain attributes tied to them, such as location or email ID. Be sure to group segments as collections so that feature flags are tied to specific apps (such as web pages). Here are some benefits of feature flag service platforms for release management: Flags can be centrally managed Features can be turned on/off without modifying properties in your apps/web pages Audit and usage data are available Conclusion Feature flags, in conjunction with CI/CD and release management, help improve many aspects of software delivery.
To name a few, these include increased shipping velocity and reduced time-to-market, with less fear of bugs being released into production. However, feature flags also introduce complexity and challenges into the code that need to be monitored and managed appropriately. To use feature flags effectively, adoption should be an organization-wide initiative and not limited to a few developers. To further your reading, learn more about running a JMeter test with Jenkins pipelines.
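Here is the sketch referenced above: a minimal, hypothetical Python example of the "avoid conditionals" and "decouple decision points from decision logic" guidance, assuming nothing beyond the standard library. The flag name, router, and invoice functions are illustrative and not taken from any particular flag platform.

```python
from typing import Callable, Dict

# --- Decision logic (the toggle router), kept separate from the toggle points ---
class ToggleRouter:
    def __init__(self, flags: Dict[str, bool]):
        self._flags = flags

    def is_enabled(self, name: str) -> bool:
        return self._flags.get(name, False)

# --- Alternative code paths expressed as strategies instead of scattered 'if's ---
def standard_invoice(order: dict) -> str:
    return f"standard invoice for order {order['id']}"

def discounted_invoice(order: dict) -> str:
    return f"discounted invoice for order {order['id']}"

def build_invoice_strategy(router: ToggleRouter) -> Callable[[dict], str]:
    # The only place that consults the router; callers just invoke the strategy.
    if router.is_enabled("next_gen_pricing"):
        return discounted_invoice
    return standard_invoice

if __name__ == "__main__":
    # The router is injected, so tests can pass a fake router with fixed values.
    router = ToggleRouter({"next_gen_pricing": True})
    invoice = build_invoice_strategy(router)
    print(invoice({"id": 42}))
```

Because the router is injected rather than imported from a flagging platform, tests can substitute a stand-in router, which is the inversion-of-control idea described in the implementation techniques above.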
In today's tech landscape, where application systems are numerous and complex, real-time monitoring during deployments has transitioned from being a luxury to an absolute necessity. Ensuring that all the components of an application are functioning as expected during and immediately after deployment, while also keeping an eye on essential application metrics, is paramount to the health and functionality of any software application. This is where Datadog steps in — a leading monitoring and analytics platform that brings visibility into every part of the infrastructure, from front-end apps to the underlying hardware. In tandem with this is Ansible, a robust tool for automation, particularly in deployment and configuration management. In this article, we will discover how Datadog real-time monitoring can be integrated into Ansible-based deployments and how this integration can be leveraged during a rollout. The concept and methodology can be applied to similar sets of monitoring and deployment tools as well. Why Integrate Real-Time Monitoring in Deployments? In the ever-evolving realm of DevOps, the line between development and operations is continuously blurring. This integration drives a growing need for continuous oversight throughout the entire lifecycle of an application, not just post-deployment. Here's why integrating Datadog with your deployment processes and within your deployment scripts is both timely and essential: Immediate Feedback: One of the primary benefits of real-time monitoring during deployments is the instant feedback loop it creates. When an issue arises after deploying to a host or group of hosts during a rolling deployment, the real-time monitoring data can be used immediately to decide whether to pause or initiate a rollback. This quick turnaround can mean the difference between a minor hiccup and a major catastrophe, especially for applications where even one minute of downtime can result in a substantial number of errors and lost revenue. Resource and Performance Oversight: As new features or changes are deployed, there is always the risk of inadvertently impacting performance, resource utilization, and the associated costs. With real-time monitoring, teams get an immediate read on how these changes affect system performance and resource utilization, and can determine any immediate remediation necessary to ensure that users continue to have an optimal experience. Proactive Issue Resolution: Rather than reacting to problems after they have affected end users, integrating Datadog directly into the deployment process allows teams to proactively address potential issues and prevent them from snowballing into a major outage. This proactive approach can lead to increased uptime, more stable releases, and higher overall user satisfaction. The Process of Implementing Real-Time Monitoring Into Deployment As soon as the deployment tool is triggered and the underlying scripts start to execute, we pre-determine an ideal place to perform monitoring checks based on our application needs and send one or more Datadog API requests querying metrics, monitor data, or any other information that helps us determine the health of the deployment and the application in general. Then, we add logic to our scripts so that the API response from Datadog can be parsed and an appropriate decision can be made about whether to roll forward to the next group or not.
For example, if we determine that there are too many errors and the monitors are firing, we parse that information accordingly and decide to abort the deployment before it proceeds to the next group, thereby reducing the blast radius of a potential production incident. The flowchart below is a representation of how the process typically works; the stages will need to be tweaked based on your application needs. Deployment flow with integrated monitoring. Utilizing Datadog and Its API Interface for Real-Time Queries Beyond its foundational monitoring capabilities, Datadog offers another pivotal advantage that empowers DevOps teams: its robust API interface. This isn't just a feature; it's a transformative tool. With the ability to query metrics, traces, and logs programmatically, teams can integrate Datadog dynamically and deeply into their operations. This allows for tailored monitoring configurations, automated alert setups, and on-the-fly extraction of pertinent data. This real-time querying isn't just about fetching data; it's about informing deployment decisions, refining application performance, and creating a more synergistic tech ecosystem. By leveraging Datadog's API, monitoring becomes not just passive observation but an active driver of optimized deployment workflows. Datadog monitors are tools that keep an eye on your tech setup, checking things like performance and errors. They give quick updates, so if something goes wrong, you get alerted right away. This helps teams fix problems faster and keep everything running smoothly. In this implementation, we're going to query monitor data to check for any alerts that are firing. Alternatively, we can also query metrics and other similar data that help determine the health of the application. The following is a sample request to fetch the details of a particular monitor (obtained from Datadog's API reference). Sample curl request to a Datadog API endpoint. Using Ansible as an Example in Deployment Automation As we delve deeper into sophisticated monitoring with tools like Datadog, it's essential to understand the deployment mechanisms that underpin these applications. We're going to use Ansible as our example. This open-source automation tool stands out for its simplicity and power. Ansible uses declarative language to define system configurations, making it both human-readable and straightforward to integrate with various platforms and tools. In the context of deployments, Ansible ensures consistent and repeatable application rollouts, mitigating many of the risks associated with manual processes. When coupled with real-time monitoring solutions like Datadog, Ansible not only deploys applications but also helps guarantee they perform optimally post-deployment. This synergy between deployment automation and real-time monitoring underpins a robust, responsive, and resilient deployment ecosystem. The code snippets below show how we can implement Datadog querying in Ansible. Querying monitors with a tag called 'deployment_priority: blocker' as an example: Monitor querying implemented in Ansible. Next, we parse the status of all such monitors returned from Datadog and decide whether to abort or continue to the next host or group in the deployment: Iterative monitor parsing and decision-making. We now have the capability to parse Datadog monitoring information and make informed decisions in our deployment process. This concludes the implementation portion.
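Because the Ansible snippets above are shown as images, here is an equivalent sketch of the query-and-decide logic in Python against Datadog's v1 monitors endpoint. The tag value, environment variable names, and exit-code convention are assumptions for illustration; in the setup described above, the same request and parsing would be done from Ansible tasks.

```python
import os
import sys
import requests

DD_API = "https://api.datadoghq.com/api/v1/monitor"   # adjust for your Datadog site
HEADERS = {
    "DD-API-KEY": os.environ["DD_API_KEY"],
    "DD-APPLICATION-KEY": os.environ["DD_APP_KEY"],
}

def blocking_monitors(tag: str = "deployment_priority:blocker") -> list:
    """Fetch monitors carrying the given tag and return the names of any that are alerting."""
    resp = requests.get(DD_API, headers=HEADERS, params={"monitor_tags": tag}, timeout=10)
    resp.raise_for_status()
    return [m["name"] for m in resp.json() if m.get("overall_state") == "Alert"]

if __name__ == "__main__":
    firing = blocking_monitors()
    if firing:
        # A non-zero exit code lets the calling deployment task abort the rollout
        # before the next host group is touched.
        print(f"Aborting rollout; monitors in Alert state: {firing}")
        sys.exit(1)
    print("No blocking monitors firing; safe to continue to the next group.")
```

An Ansible playbook could run a script like this (or issue the same request with the uri module) between host groups and fail the play on a non-zero return code, which is exactly the abort-or-continue decision described above.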
Summary The intersection of deployment automation and real-time monitoring is where modern DevOps truly shines. In this exploration, we've used Ansible as a prime example of the power of deployment tools, emphasizing its capacity to deliver consistent and reliable rollouts. Combining it with the granular, real-time insights of a platform like Datadog unlocks operational efficiency and reliability. As the tech landscape continues to evolve, tools like Ansible and Datadog stand as a testament to the potential of integrated, intelligent DevOps practices. Whether you're a seasoned DevOps professional or just beginning your journey, there's immense value in understanding and employing such synergies for a future-ready and resilient tech ecosystem.
DevOps has proven successful in streamlining the product delivery process over time. As firms all over the world adopted data-driven strategies, a well-structured framework for maximizing the value of corporate data had to be established. Data-driven insights enable decision-makers to act on verifiable evidence rather than relying on faulty assumptions and forecasts. To better understand the distinction between DataOps and DevOps, it helps to first establish clear definitions. DevOps represents a paradigm shift in the capacity of development and software teams to deliver output effectively, while DataOps primarily centers on optimizing and refining intelligent systems and analytical models through the expertise of data analysts and data engineers. What Is DataOps? DataOps helps your data analytics and decision-making processes reach their full potential. Reducing the cost of data management is one of its main goals: by automating manual data-gathering and processing steps, you can minimize the need for labor-intensive operations and free up precious resources. This not only saves money but also frees up your team to concentrate on more strategic projects. Additionally, enhancing data quality is at the core of DataOps. Through continuous monitoring, you can spot and fix problems or irregularities in your data pipeline in real time, ensuring the reliability and accuracy of the insights and information you rely on to make educated judgments. What Is DevOps? DevOps is an approach to software development that focuses on making things run smoothly and continuously improving. It's like the agile development method, but it takes things a step further by involving the IT operations and quality assurance teams. The development team is then focused both on creating the product and on how it performs after being deployed. The focus of DevOps is to improve collaboration and reduce obstacles in the development process. It's all about being efficient, and it comes with benefits like better communication between product teams, cost savings, continuous improvement, and quick responses to customer feedback. DataOps and DevOps: Similarities DataOps and DevOps share a common foundation in agile project management. The subsequent sections delve into the specific aspects that highlight this shared background. Agile Methodology The Agile methodology serves as the foundation for both DevOps and DataOps, which extend it to the domains of software development and data analysis, respectively. Agile methodology places a strong emphasis on adaptable thinking and swift adjustments to effectively address evolving business requirements and capitalize on emerging technologies and opportunities. DevOps and DataOps adhere to this philosophy to optimize their respective pipelines. Iterative Cycles Both methodologies employ brief iterative cycles to efficiently generate outcomes and gather input from stakeholders to guide their subsequent actions. Incremental development enables users to promptly benefit from the deliverable and assess its alignment with fundamental requirements. DevOps or DataOps teams can then build subsequent layers of the product or alter their trajectory as necessary. Collaboration DataOps and DevOps are all about teamwork and collaboration!
In DataOps, our awesome data engineers and scientists team up with business users and analysts to uncover valuable insights that align with our business goals. Meanwhile, in DevOps, our development, operations, and quality assurance teams join forces to create top-notch software that our customers will love. The best part? Both models put a huge emphasis on gathering feedback from our end users because we believe that their satisfaction is the ultimate measure of success. DataOps and DevOps: Differences Outcomes When it comes to achieving results, DataOps is all about creating a seamless flow of data and ensuring that valuable information reaches the hands of end users. To maximize efficiency, this includes developing cutting-edge data transformation apps and optimizing infrastructure. DevOps, on the other hand, adopts a somewhat different strategy by emphasizing the rapid delivery of excellent software to clients. DevOps seeks to deliver a minimum viable product (MVP) as rapidly as possible by distributing updates and making incremental adjustments based on insightful consumer input. The best thing, though? In subsequent development cycles, its functionality can be expanded and improved to give clients the greatest experience possible. Testing In DataOps, it is important to verify test results because the true value or statistic is unknown. This may lead to questions about the relevance of the data and the use of the most recent information, which requires validation to ensure confidence in the analysis. In DevOps, the outcomes are clearly defined and expected, making the testing phase simpler. The main focus is on whether the application achieves the desired result. If it is successful, the process continues; if not, debugging and retesting are done. Workflow Real-time data processing for decision-making and ensuring that high-quality data is consistently delivered via the data pipeline are the main goals of data operations. Because data sets are always evolving and expanding, building pipelines for new use cases is only one part of the work; maintaining and improving the underlying infrastructure is just as important. In contrast, DevOps, while also prioritizing efficiency, follows a structured sequence of stages in its pipeline. Some organizations employ DevOps and continuous integration/continuous deployment (CI/CD) to frequently introduce new features. However, the velocity of a DataOps pipeline surpasses that of DevOps, as it promptly processes and transforms newly collected data, potentially resulting in multiple deliveries per second based on the volume of data. Feedback DataOps places a high emphasis on soliciting feedback from business users and analysts to ensure that the final deliverable is in line with their specific requirements. These stakeholders possess valuable contextual knowledge regarding the data-generating business processes and the conclusions they draw from the information provided. In contrast, DevOps does not always need customer feedback unless a particular aspect of the application fails to meet customer needs. If the end users are content, their feedback becomes optional. Nevertheless, teams should actively monitor the usage of the application and DevOps metrics to evaluate overall satisfaction, identify areas for enhancement, and guarantee that the product fulfills all intended use cases. Which One Is Better for You? Although they may sound like fancy buzzwords, DevOps and DataOps are revolutionizing the fields of software development and data engineering.
These two techniques share some concepts, such as efficiency, automation, and cooperation, but they have different focuses. Let's start with DevOps. This approach is all about optimizing software delivery and IT operations. Smooth development, testing, and deployment of your software is like having a well-oiled machine. With DevOps, you can put an end to the annoying delays and bottlenecks that used to bog down your development process. It's all about simplifying processes and making everything operate seamlessly. On the other hand, we have DataOps. This methodology takes data management to a whole new level. It's not just about storing and organizing data anymore. The goal of DataOps is to improve your analytics and decision-making processes. It's analogous to having a crystal ball that provides insights and forecasts based on your data. By making smarter, data-driven decisions with DataOps, you can gain a competitive advantage in the market.
While attending college, I worked part-time at a local recording studio to maintain my serious interest in the music industry. Since you're reading this article in a publication unrelated to the music industry, it is easy to conclude that I parted ways with music since that time (well, aside from the creation of what I feel are some pretty impressive Spotify playlists). Some of my friends still work and thrive in the music industry, and it's impressive to hear how things have changed over the years, especially on the recording side of the spectrum. The industry has continued to innovate, mostly because product manufacturers listened to feedback provided by those who depend on such tooling to create their art. This is no different from working in the tech industry today. Getting and listening to user feedback is critical to successful tech products. However, while Web2 has embraced user feedback both in concept and in tooling, Web3 still lags behind. One example of this discrepancy centers around the concept of using a constant feedback loop to improve Web3 DevOps — an area where it's challenging and uncommon for team members to obtain quality feedback. The concept has yet to gain traction, both in best practices and in the tools available, and the poor user experience is evidence of that. I wondered if there was a better way to bridge this communication gap. Web3 DevOps Use Cases The concept of DevOps is relatively new in software development and is a solid example of the industry listening to the pain points faced by software engineers. So it should be no surprise that Web3 DevOps has already started to gain momentum. Just like their Web2 counterparts, Web3 teams need to bridge the gap between traditional software engineering and operations. And it's important: successful Web3 DevOps can provide benefits such as a faster development experience, compliance with regulations (auditable and secure practices), and scaling in tandem with Web3 adoption. DevOps Needs Continuous User Feedback Under the old model, PMs managed, developers coded, testers tested, and ops deployed. But this was slow and caused the famous "it works on my machine!" problem. With modern DevOps, these roles form a unified team, with everyone working closely together and responsible for the project as a whole. This means exposing everyone to end-user requests is a good thing. And the feedback should be continuous! All team members should know right away, and all the time, what's happening in production. With this continuous feedback loop, it's easier for everyone, as a team, to understand the project and the customers' needs. Where Feedback Forms Provide Value Exposing the team to user feedback might seem like information overload. But in reality, user feedback provides value for the entire team: Software Engineers Become part of the prioritization effort to determine which features will be added next. See how their point of view compares with users' (developers can often become stuck in their way of thinking). Increase ownership of the project, not just the code. Ops Engineers Better understand nonfunctional requirements. Understand performance from a user's perspective, which can often differ from results measured through standard observability practices. Gain insight into the most important features to work on next. Testers Takes testers out of their silo and involves them with the actual users. Helps testers see the project as a whole, not as a series of tests. Helps testers conduct better UAT by gaining a deeper understanding of users.
Product Owners / Project Managers Builds team sharing into the natural development cycle flow. Implement Web3 Feedback Forms with Form xChange and MetaMask So we know why we need feedback. But how do we get it in the world of Web3? We could go with traditional centralized solutions (Google Forms, etc.), but in the spirit of Web3, we really need a decentralized and open solution. That's where the Form xChange open-source tool comes in. It gives you the ability to easily create and use feedback forms on Web3, and it's pretty easy to implement and use. The solution connects to a MetaMask wallet (which users most likely already have) and allows application users to vote anonymously using one or more forms — with each form allowing for multiple questions. The cool part is that the entire feedback process employs its own factory contract written in Solidity, without requiring you to create or maintain your own smart contract. Below is a summary of the Form xChange lifecycle: After installation, the creator authors a new form and deploys it using the factory contract. After deployment, participants simply populate the form anonymously and submit their results. After submission, the results are available to both the creator and the participants. Getting Started with Form xChange At a high level, the following steps are required to start using Form xChange. Note that we'll deploy Form xChange using Truffle on Linea Goerli for this example (the test network of the Linea Ethereum L2) to avoid spending any real funds while exploring the feedback forms. Here are the steps to getting started with Form xChange: Install MetaMask in your browser. Acquire test ETH (LineaETH) from a faucet like Infura's faucet. Establish a Linea RPC endpoint using Infura. Install Node.js and npm as found here. Clone the Form xChange repository. Deploy the feedback form. Deploy the Next.js frontend. Launch the form by browsing to localhost:3000 in the browser where MetaMask is installed. It's pretty easy. You can find a detailed example walking through the setup in full detail at the MetaMask site. After following the above steps, the Form xChange home screen will display in your browser. Next, use the Connect Wallet button to connect your MetaMask wallet. Once connected, use the localhost:3000/create-form URL to create a new feedback form. Now you are ready to create feedback forms. User Feedback Made Simple with MetaMask and More Sitting in a recording studio now bears little resemblance to what I recall from the 1990s. The industry realized there was a better way to do things — by listening to its customers — and provided the necessary innovation. That's no different from what we have seen with the creation and evolution of DevOps as software engineers. The use of feedback forms can provide faster innovation, which I illustrated with a simple Web3 DevOps use case and ConsenSys Form xChange. My readers may recall that I have been focused on the following mission statement, which I feel can apply to any IT professional: "Focus your time on delivering features/functionality that extends the value of your intellectual property. Leverage frameworks, products, and services for everything else." - J. Vester The creators of Form xChange allow me to adhere to my personal mission statement by not forcing me to create my own feedback form process as part of my Web3 development lifecycle.
In doing so, I am able to simply leverage the Form xChange tool to create quick feedback forms that are easy to manage, implement, and deploy. If you are focused on Web3 and find value in getting feedback from your customers, I highly recommend giving the Form xChange tool a try. After all, it comes at no cost to you… other than a small amount of your time. Have a really great day!
Part 1: The Challenge and the Workload FinOps is an evolving practice for delivering maximum value from investments in cloud. Most organizations in their FinOps journey focus on highly tactical and visible activities: they perform activities after the applications are deployed to understand and then optimize their cloud usage and cost. This approach, while able to clearly demonstrate benefits, falls short of the potential of FinOps, as it requires workloads to be effectively built multiple times. Shifting left with FinOps means you build once; this not only saves money on your cloud bill but also increases innovation within your business by freeing up your resources. In this series, we will walk you through an example solution and how to effectively implement a shift-left approach to FinOps, demonstrating techniques to discover and validate cost optimizations throughout a typical cloud software development lifecycle. Part 1: The challenge and the workload Part 2: Creating and implementing the cost model Part 3: Cost Optimization Techniques for Infrastructure Part 4: Cost Optimization Techniques for Applications Part 5: Cost Optimization Techniques for Data Part 6: Implementation / Case Study Results The Challenge In the current format of this evolving discipline, there are three iterative phases: Inform, Optimize, and Operate. The Inform phase gives the visibility needed to create shared accountability. The Optimize phase is intended to identify efficiency opportunities and determine their value. The Operate phase defines and implements processes that achieve the goals of technology, finance, and business. FinOps Phases However, with modern cloud pricing calculators and workload planning tools, it is possible to get visibility into your complete cloud cost well before anything is built, without having to go through the development process. The cost of development, deployment, and operations can be determined based on the architecture, services, and technical components. The current architecture method involves understanding the scope and requirements. The personas involved and the functional requirements are captured as use cases. The non-functional requirements are captured based on qualities (security, performance, scalability, and availability) and constraints. Based on the functional and non-functional requirements, a candidate architecture is proposed. Existing architecture method and FinOps activities As soon as a candidate architecture is proposed, we include a phase to build a FinOps model for the candidate. In this step, we shift some of the FinOps activities left, into the architecture phase itself. The candidate architecture is reviewed through the frame of FinOps for optimizations. This goes through iterations and refinement of the architecture to arrive at an optimal cost for the solution without compromising on any of the functional and non-functional aspects. Shift-Left FinOps Model for Creating a Working Architecture Building a FinOps cost model is very similar to how you can shift security left in a DevOps pipeline by creating a threat model upfront. Creating a FinOps model for the solution is an iterative process. It starts with establishing an initial baseline cost for the candidate architecture (a minimal sketch of such a baseline model follows below). The solution components are then reviewed for cost optimization. In certain cases, it might require the teams to perform a proof of engineering to get the cost estimates or projections.
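The baseline referenced above can start as something as simple as a spreadsheet or a small script. The following Python sketch is a hypothetical illustration: the components and unit prices are placeholders that would come from your cloud provider's pricing calculator, not real quotes.

```python
from dataclasses import dataclass

@dataclass
class Component:
    name: str
    unit_cost_per_month: float   # illustrative placeholder price, not a real quote
    quantity: int

# Hypothetical candidate architecture for the example workload.
candidate = [
    Component("ingestion VMs", 140.0, 3),
    Component("object storage (TB)", 23.0, 10),
    Component("managed database", 410.0, 1),
    Component("data egress (TB)", 90.0, 2),
]

def baseline_cost(components: list[Component]) -> float:
    """Sum the monthly cost of every component in the candidate architecture."""
    return sum(c.unit_cost_per_month * c.quantity for c in components)

if __name__ == "__main__":
    print(f"Baseline monthly cost: ${baseline_cost(candidate):,.2f}")
```

Each iteration of the FinOps model re-runs the calculation after an optimization (right-sizing, storage tiering, and so on) and compares the result against this baseline.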
The cost optimization techniques need to be applied at various levels or layers to arrive at a working architecture. They can be divided as follows: Cost Optimization Techniques for Infrastructure Cost Optimization Techniques for Applications Cost Optimization Techniques for Data The Workload Workloads can be very different; however, when viewed as functional components, you can utilize similar optimization approaches and techniques to maximize efficiency and value. Most workloads have some form of data input, processing, and output, so the example we will use is a cloud-native application that performs data ingestion, processing to enrich and analyze the data, and then outputs the data along with reports and insights for a user. We will utilize a cloud-agnostic approach and break the workload and optimization techniques into the following components: Infrastructure: The compute, storage, and networking. This includes the resources, services, and their associated attributes. Application: The application design and architecture, covering how the application will behave and function on the infrastructure. Data: The data itself, as well as the formatting and handling of the data throughout the workload. The methods and techniques for each component and layer are discussed in detail with the help of an example. The workload for this example is a cloud-native application that involves some domain-specific data ingestion, processing, and analysis. The structured/semi-structured data is enriched and analyzed further to create reports and insights for the end user. The application is architected as a deployment that can leverage services in multiple clouds — for instance, AWS and GCP. Candidate architecture for the representative cloud-native application Conclusion FinOps is a practice that gives the enterprise a better way to manage its cloud spending. Shifting FinOps left creates more opportunities to save costs earlier in the software development lifecycle. This involves introducing a few simple steps before the solution architecture is pushed to the detailed design and implementation phase: creating a FinOps cost model and iterating through it to ensure that you have applied all the cost optimization techniques at the infrastructure, application, and data layers and components. You can optimize your overall cloud expenses by shifting FinOps left. In Part 2 of the blog series, we will create and implement the cost model.
In today's fast-evolving technology landscape, the integration of Artificial Intelligence (AI) into Internet of Things (IoT) systems has become increasingly prevalent. AI-enhanced IoT systems have the potential to revolutionize industries such as healthcare, manufacturing, and smart cities. However, deploying and maintaining these systems can be challenging due to the complexity of the AI models and the need for seamless updates and deployments. This article is tailored for software engineers and explores best practices for implementing Continuous Integration and Continuous Deployment (CI/CD) pipelines for AI-enabled IoT systems, ensuring smooth and efficient operations. Introduction To CI/CD in IoT Systems CI/CD is a software development practice that emphasizes the automated building, testing, and deployment of code changes. While CI/CD has traditionally been associated with web and mobile applications, its principles can be effectively applied to AI-enabled IoT systems. These systems often consist of multiple components, including edge devices, cloud services, and AI models, making CI/CD essential for maintaining reliability and agility. Challenges in AI-Enabled IoT Deployments AI-enabled IoT systems face several unique challenges: Resource Constraints: IoT edge devices often have limited computational resources, making it challenging to deploy resource-intensive AI models. Data Management: IoT systems generate massive amounts of data, and managing this data efficiently is crucial for AI model training and deployment. Model Updates: AI models require periodic updates to improve accuracy or adapt to changing conditions. Deploying these updates seamlessly to edge devices is challenging. Latency Requirements: Some IoT applications demand low-latency processing, necessitating efficient model inference at the edge. Best Practices for CI/CD in AI-Enabled IoT Systems Version Control: Implement version control for all components of your IoT system, including AI models, firmware, and cloud services. Use tools like Git to track changes and collaborate effectively. Create separate repositories for each component, allowing for independent development and testing. Automated Testing: Implement a comprehensive automated testing strategy that covers all aspects of your IoT system. This includes unit tests for firmware, integration tests for AI models, and end-to-end tests for the entire system. Automation ensures that regressions are caught early in the development process. Containerization: Use containerization technologies like Docker to package AI models and application code. Containers provide a consistent environment for deployment across various edge devices and cloud services, simplifying the deployment process. Orchestration: Leverage container orchestration tools like Kubernetes to manage the deployment and scaling of containers across edge devices and cloud infrastructure. Kubernetes ensures high availability and efficient resource utilization. Continuous Integration for AI Models: Set up CI pipelines specifically for AI models. Automate model training, evaluation, and validation. This ensures that updated models are thoroughly tested before deployment, reducing the risk of model-related issues. Edge Device Simulation: Simulate edge devices in your CI/CD environment to validate deployments at scale. This allows you to identify potential issues related to device heterogeneity and resource constraints early in the development cycle. 
Edge Device Management: Implement device management solutions that facilitate over-the-air (OTA) updates. These solutions should enable remote deployment of firmware updates and AI model updates to edge devices securely and efficiently. Monitoring and Telemetry: Incorporate comprehensive monitoring and telemetry into your IoT system. Use tools like Prometheus and Grafana to collect and visualize performance metrics from edge devices, AI models, and cloud services. This helps detect issues and optimize system performance. Rollback Strategies: Prepare rollback strategies in case a deployment introduces critical issues. Automate the rollback process to quickly revert to a stable version in case of failures, minimizing downtime. Security: Security is paramount in IoT systems. Implement security best practices, including encryption, authentication, and access control, at both the device and cloud levels. Regularly update and patch security vulnerabilities. CI/CD Workflow for AI-Enabled IoT Systems Let's illustrate a CI/CD workflow for AI-enabled IoT systems: Version Control: Developers commit changes to their respective repositories for firmware, AI models, and cloud services. Automated Testing: Automated tests are triggered upon code commits. Unit tests, integration tests, and end-to-end tests are executed to ensure code quality. Containerization: AI models and firmware are containerized using Docker, ensuring consistency across edge devices. Continuous Integration for AI Models: AI models undergo automated training and evaluation. Models that pass predefined criteria are considered for deployment. Device Simulation: Simulated edge devices are used to validate the deployment of containerized applications and AI models. Orchestration: Kubernetes orchestrates the deployment of containers to edge devices and cloud infrastructure based on predefined scaling rules. Monitoring and Telemetry: Performance metrics, logs, and telemetry data are continuously collected and analyzed to identify issues and optimize system performance. Rollback: In case of deployment failures or issues, an automated rollback process is triggered to revert to the previous stable version. Security: Security measures, such as encryption, authentication, and access control, are enforced throughout the system. Case Study: Smart Surveillance System Consider a smart surveillance system that uses AI-enabled cameras for real-time object detection in a smart city. Here's how CI/CD principles can be applied: Version Control: Separate repositories for camera firmware, AI models, and cloud services enable independent development and versioning. Automated Testing: Automated tests ensure that camera firmware, AI models, and cloud services are thoroughly tested before deployment. Containerization: Docker containers package the camera firmware and AI models, allowing for consistent deployment across various camera models. Continuous Integration for AI Models: CI pipelines automate AI model training and evaluation. Models meeting accuracy thresholds are considered for deployment. Device Simulation: Simulated camera devices validate the deployment of containers and models at scale. Orchestration: Kubernetes manages container deployment on cameras and cloud servers, ensuring high availability and efficient resource utilization. Monitoring and Telemetry: Metrics on camera performance, model accuracy, and system health are continuously collected and analyzed. 
Rollback: Automated rollback mechanisms quickly revert to the previous firmware and model versions in case of deployment issues. Security: Strong encryption and authentication mechanisms protect camera data and communication with the cloud. Conclusion Implementing CI/CD pipelines for AI-enabled IoT systems is essential for ensuring the reliability, scalability, and agility of these complex systems. Software engineers must embrace version control, automated testing, containerization, and orchestration to streamline development and deployment processes. Continuous monitoring, rollback strategies, and robust security measures are critical for maintaining the integrity and security of AI-enabled IoT systems. By adopting these best practices, software engineers can confidently deliver AI-powered IoT solutions that drive innovation across various industries.
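To make the "Continuous Integration for AI Models" step in the workflow and case study above concrete, here is a short, hypothetical Python gate that a pipeline stage could run after model evaluation. The results-file format, threshold value, and file name are assumptions for illustration; the idea is simply that a model below the criterion fails the stage and never reaches the deployment step.

```python
import json
import sys
from pathlib import Path

ACCURACY_THRESHOLD = 0.90   # hypothetical release criterion

def evaluate(predictions_file: Path) -> float:
    """Compare model predictions against labels produced by the validation job.
    Assumed file format: [{"label": 1, "prediction": 1}, ...]."""
    records = json.loads(predictions_file.read_text())
    correct = sum(1 for r in records if r["label"] == r["prediction"])
    return correct / len(records)

if __name__ == "__main__":
    accuracy = evaluate(Path(sys.argv[1]))
    print(f"Validation accuracy: {accuracy:.3f}")
    if accuracy < ACCURACY_THRESHOLD:
        # A non-zero exit fails the pipeline stage and blocks promotion.
        sys.exit(f"Model below threshold ({ACCURACY_THRESHOLD}); blocking promotion.")
    print("Model meets the release criterion; promoting to the deployment stage.")
```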
IT teams have been observing applications for their health and performance since the beginning. They observe the telemetry data (logs, metrics, traces) emitted from an application or microservice using various observability tools and make informed decisions about scaling, maintaining, or troubleshooting applications in the production environment. If observability is not something new and there is a plethora of monitoring and observability tools available in the market, why bother with OpenTelemetry? What makes it special, such that it is being widely adopted? And most importantly, what is in it for developers, DevOps, and SRE folks? Well, let us find out. What Is OpenTelemetry? OpenTelemetry (OTel) provides open-source standards and formats for collecting and exporting telemetry data from microservices for observability purposes. This standardized way of collecting data helps DevOps and SRE engineers use any compatible observability backend of their choice to observe services and infrastructure, without being locked into a vendor. OpenTelemetry diagram for microservices deployed in a Kubernetes cluster OpenTelemetry is both a set of standards and an open-source project that provides components, such as collectors and agents, for its implementation. In addition, OpenTelemetry offers APIs, SDKs, and data specifications for application developers to standardize the instrumentation of their application code. (Instrumentation is the process of adding observability libraries/dependencies to the application code so that it emits logs, traces, and metrics.) Why Is OpenTelemetry Good News for DevOps and SREs? The whole observability process starts with application developers. Typically, they instrument application code with the proprietary library/agent provided by the observability backend tool that IT teams plan to go with. For example, let us say IT teams want to use Dynatrace as the observability tool. Then, application developers use code/SDKs from Dynatrace to instrument (i.e., to generate and export telemetry data from) all the applications in the system. This fetches and feeds data in the format Dynatrace is compatible with. But this is where the problem lies. The observability requirements of DevOps and SREs seldom stay the same. They will have to switch between observability vendors, or may want to use more than one tool, as their needs evolve. But since all the applications are instrumented with the proprietary code from the current vendor, switching becomes a nightmare: The new vendor may prefer collecting telemetry data in a format (a tracing format, for example) not compatible with the existing vendor. This means developers will have to rewrite the instrumentation code for all applications, which carries severe overhead in terms of cost, developer effort, and potential service disruptions, depending on the deployments and infrastructure. Non-compatible formats also cause problems with historical data when switching vendors; that is, it becomes hard for DevOps and SREs to analyze performance before and after the migration. This is where OpenTelemetry proves helpful, and this is the reason it is being widely adopted. OpenTelemetry prevents such vendor lock-in by standardizing telemetry data collection and exportation. With OpenTelemetry, developers can send the data to one or more observability backends, be they open source or proprietary, as it supports most of the leading observability tools.
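To show what instrumentation looks like in practice, here is a minimal manual-instrumentation sketch using the OpenTelemetry Python SDK (the opentelemetry-sdk package). The service name, span name, and attribute are illustrative; swapping the console exporter for an OTLP exporter pointed at a collector is what allows the backend to change without touching application code.

```python
from opentelemetry import trace
from opentelemetry.sdk.resources import SERVICE_NAME, Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Configure the SDK once at startup. Replacing ConsoleSpanExporter with an
# OTLP exporter (pointed at an OpenTelemetry Collector) changes the backend
# without changing any of the business code below.
provider = TracerProvider(resource=Resource.create({SERVICE_NAME: "checkout-service"}))
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)

def process_order(order_id: str) -> None:
    # Manual instrumentation: each unit of work becomes a span with attributes.
    with tracer.start_as_current_span("process_order") as span:
        span.set_attribute("order.id", order_id)
        # ... business logic would go here ...

if __name__ == "__main__":
    process_order("A-1001")
```

The same pattern applies in other languages through their respective OpenTelemetry SDKs, and auto-instrumentation can add spans for common frameworks without manual code changes.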
OpenTelemetry Components and Workflow OpenTelemetry provides vendor-agnostic components that work together to fetch, process, and export telemetry data to various backends. There are three major components: the instrumentation library, the OpenTelemetry Collector, and exporters. Instrumentation Library OpenTelemetry provides SDKs and libraries for application developers to instrument their code manually or automatically. They support many popular programming languages, such as Java, Python, Ruby, Rust, JavaScript, and more. The instrumentation library is evolving, and developers should check the status of each telemetry data component in the instrumentation library for the programming language they use; the OpenTelemetry docs update this frequently. The status at the time of writing this piece is given below: Status of programming language-specific telemetry data support in OpenTelemetry For Kubernetes workloads, the OpenTelemetry Operator for Kubernetes can be used to inject auto-instrumentation libraries. OpenTelemetry Collector (OTC) The collector has receiver, processor, and exporter components, which gather, process, and export telemetry data from instrumented applications or infrastructure to observability backends for visualization (refer to the image below). It can receive and export data in various formats, such as its native format (OpenTelemetry Protocol, or OTLP), Prometheus, Jaeger, and more. OpenTelemetry Collector components and workflow The OTC can be deployed as an agent — either as a sidecar container that runs alongside the application container or as a DaemonSet that runs on each node — and it can be scaled in or out depending on the data throughput. The OpenTelemetry Collector is not mandatory, since OpenTelemetry is designed to be modular and flexible: IT teams can pick components of their choice as receivers, processors, and exporters, or even add custom ones. Exporters Exporters allow developers to configure any compatible backend they want to send the processed telemetry data to. There are open-source and vendor-specific exporters available; some of them are Apache SkyWalking, Prometheus, Datadog, and Dynatrace, which are part of the contrib projects. You can see the complete list of vendors who provide exporters here. The Difference Between Trace Data Collected by OpenTelemetry and Istio In a distributed system, tracing is the process of monitoring and recording the lifecycle of a request as it goes through different services in the system. It helps DevOps and SREs visualize the interaction between services and troubleshoot issues like latency. Istio is one of the most popular service mesh solutions that provides distributed tracing for observability purposes. In Istio, application containers are accompanied by sidecar containers, i.e., Envoy proxies. The proxy intercepts traffic between services and provides telemetry data for observability (refer to the image below). Istio sidecar architecture and observability Although both OpenTelemetry and Istio provide tracing data, there is a slight difference between them. Istio focuses on the lifecycle of a request as it traverses multiple services in the system (the networking layer), while OpenTelemetry — given that the application is instrumented with the OpenTelemetry library — focuses on the lifecycle of a request as it flows through an application (the application layer), interacting with various functions and modules. For example, let us say service A is talking to service B, and the communication has latency issues.
Istio can show you which service causes the latency and by how much. While this information is enough for DevOps and SREs, it will not help developers debug the part of the application that is causing the problem. This is where OpenTelemetry tracing helps: since the application is instrumented with the OpenTelemetry library, OpenTelemetry tracing can provide details about the specific function of the application that causes the latency. To put it another way, Istio gives traces from outside the application, while OpenTelemetry tracing provides traces from within the application. Istio tracing is good for troubleshooting problems at the networking layer, while OpenTelemetry tracing helps troubleshoot problems at the application level. OpenTelemetry for Microservices Observability and Vendor Neutrality Enterprises adopting microservices architecture have applications distributed across the cloud, with respective IT teams maintaining them. By instrumenting applications with OpenTelemetry libraries and SDKs, IT teams are free to choose any compatible observability backend, and the choice will not affect the Ops/SRE teams' ability to have central visibility into all the services in the system. OpenTelemetry supports a variety of data formats and integrates seamlessly with most open-source and vendor-specific monitoring and observability tools, which also makes switching between vendors painless. Get Started With OpenTelemetry for Istio Service Mesh Watch the following video to learn how to get started with OpenTelemetry for the Istio service mesh to achieve observability in depth. Additionally, you can go through the blog post, "Integrate Istio and Apache Skywalking for Kubernetes Observability," where the OpenTelemetry collector is used to scrape Prometheus endpoints.
Boris Zaikin
Lead Solution Architect,
CloudAstro GmBH
Pavan Belagatti
Developer Evangelist,
SingleStore
Alireza Chegini
DevOps Architect / Azure Specialist,
Coding As Creating