Modern systems span numerous architectures and technologies and are becoming exponentially more modular, dynamic, and distributed in nature. These complexities also pose new challenges for developers and SRE teams that are charged with ensuring the availability, reliability, and successful performance of their systems and infrastructure. Here, you will find resources about the tools, skills, and practices to implement for a strategic, holistic approach to system-wide observability and application monitoring.
If you're into monitoring, Prometheus is probably an essential part of your stack. Thanks to its expressive query language (PromQL), scalability, and configurable data format, it remains one of the most popular tools for data collection. Paired with exporters, Prometheus can adapt to a wide variety of environments, which is one of its strongest points. With the help of exporters, Prometheus can provide insightful data and effectively monitor a variety of services, including databases, web servers, and custom-made applications. Its sophisticated alerting system, lively community, and easy integration make it a go-to option for monitoring intricate and dynamic systems. In this article, we'll take a look at Prometheus exporters, some examples, and their use in action.

What Is a Prometheus Exporter?

Exporters act as go-betweens for Prometheus and other services, making it easier to keep an eye on many different applications and infrastructure components. They collect particular metrics from the target systems and expose them in a format Prometheus can read. These metrics can include CPU, memory, disk I/O, network statistics, and custom application-specific measurements. Prometheus exporters can be community-maintained third-party integrations or custom creations for particular services or applications. With exporters in place, Prometheus can monitor many system components and provide important information on their health and performance.

Prometheus scrapes exporters and stores the results as time-series data points, which feed dashboards, alerts, and graphs. This data-driven approach to monitoring facilitates more effective troubleshooting and better decision-making by providing insight into the behavior of infrastructure and applications. For instance, when Prometheus gathers time-series data points from a custom collector monitoring a microservices architecture, it can track metrics such as individual service response times, error rates, and resource utilization. Once collected and visualized on dashboards, these metrics provide a detailed overview of each service's performance, enabling teams to quickly identify bottlenecks, troubleshoot issues, and make informed decisions that keep the entire microservices ecosystem running smoothly.

How To Use Prometheus Exporters for Monitoring

To use Prometheus exporters for efficient monitoring, they must be configured, deployed, and integrated with Prometheus. Here's a detailed how-to:

Identify Metrics to Monitor

Choose the precise system or application metrics that you wish to keep an eye on. These might be application-specific metrics or conventional metrics like CPU and memory consumption.

Choose an Exporter

Decide between building a custom exporter around your application's metrics and using one of the pre-built exporters available from the community. Note that some third-party software already exposes metrics in the Prometheus format, in which case you don't need a separate exporter.

Deploy

The deployment differs based on the exporter you choose.

Pre-Built Exporters

Download or install the pre-built exporter relevant to your application. Configure the exporter, specifying which metrics to collect and the HTTP endpoint to expose them on. Deploy the exporter alongside your application or on a server accessible to Prometheus.

Custom Exporters

Develop a custom application or script that collects the desired metrics. Expose an HTTP endpoint in the exporter for Prometheus to scrape, and ensure the metrics are formatted correctly, either as key-value pairs or in the Prometheus exposition format.
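To make the custom route concrete, here is a minimal sketch of such an exporter in TypeScript. It is only an illustration, not a prescribed implementation: it assumes the express and prom-client packages, and the metric name and the readQueueDepth helper are hypothetical stand-ins for whatever your application actually measures.

TypeScript
import express from 'express';
import { Gauge, collectDefaultMetrics, register } from 'prom-client';

// Hypothetical stand-in for real application state
const readQueueDepth = () => Math.floor(Math.random() * 100);

// Illustrative application-specific metric
const queueDepth = new Gauge({
  name: 'myapp_queue_depth',
  help: 'Number of jobs currently waiting in the queue',
});

// Optionally expose standard process/runtime metrics as well
collectDefaultMetrics();

// Refresh the gauge periodically from the application's own state
setInterval(() => queueDepth.set(readQueueDepth()), 5000);

const app = express();

// Prometheus scrapes this endpoint; /metrics is the conventional path
app.get('/metrics', async (_req, res) => {
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
});

app.listen(9300);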
Prometheus Configuration

Modify the Prometheus configuration to include the endpoint details of the exporter. Add a new job in the Prometheus configuration file (usually prometheus.yml) and define the target endpoints (the exporter's HTTP endpoints) under that job. The scheme defaults to http and the metrics_path defaults to /metrics; change them if your exporter differs. Here's an example of a Prometheus configuration snippet:

YAML
scrape_configs:
  - job_name: 'example-job'
    static_configs:
      - targets: ['exporter-endpoint:port']
    # The scheme defaults to http; set it to https if the exporter is served over TLS.
    scheme: https
    metrics_path: /metrics

Restart Prometheus

Restart the Prometheus server to apply the updated configuration.

Verify and Explore

Access the Prometheus web interface and verify that the newly added job and targets are visible in the "Targets" or "Service Discovery" section. Then explore the collected metrics using Prometheus' query language (PromQL) through the web interface. Finally, use the data to create custom dashboards, graphs, and alerts based on the collected metrics, either in Prometheus or in visualization tools like Grafana.

Monitoring and Alerting

Set up alerting rules in Prometheus based on the exported metrics to receive notifications when certain conditions are met. Continuously monitor the metrics and adjust alerting thresholds as necessary to maintain an effective monitoring system.
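To make the alerting step concrete, here is a minimal, hypothetical rule file that fires when a scraped exporter target stops responding; the job name, duration, and severity label are placeholders to adapt to your setup.

YAML
# alert-rules.yml, referenced from rule_files in prometheus.yml
groups:
  - name: exporter-availability
    rules:
      - alert: ExporterDown
        expr: up{job="example-job"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Exporter target {{ $labels.instance }} has been unreachable for 5 minutes"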
Prometheus retrieves metrics over a straightforward HTTP request, so a user could have thousands — possibly many, many thousands — of different Prometheus metrics. Nonetheless, there are four primary types of Prometheus metrics: counter, gauge, histogram, and summary. Let's examine each of them.

Counter

Counters have values that increase monotonically; they start at zero and only ever go up. They represent values that accumulate over time, such as the total number of requests fulfilled or tasks completed.

How do counters behave? Counters cannot decrease. The only ways they reset are when the monitored system restarts or when an external action explicitly resets the counter.

What's their use? Counters are perfect for metrics that show cumulative numbers, since they let you track how many times a particular event occurs over time.

Gauge

Gauges are numerical values that can move up or down. They represent instantaneous values at a given moment, such as the system's available RAM or the proportion of CPU currently in use.

How do gauges behave? Gauges can rise or fall in response to the monitored system's condition. They can take any numeric value and are appropriate for measurements that change over time.

What's their use? Gauges record quantities such as temperatures, sizes, or percentages and are useful for metrics that describe a particular state at a particular time.

Histogram

Histograms measure the distribution of observed values across configurable buckets. They also provide a count of events and the sum of all observed values.

How do histograms behave? By tracking the distribution of values, histograms help you understand the variability and dispersion of data points. Under the hood, they automatically maintain bucketed counts.

What's their use? Histograms are handy when monitoring things like response times or request durations, since they help visualize the distribution of values, which is crucial for spotting outliers and bottlenecks.

Summary

Summaries track the distribution of observed values, much like histograms do. However, summaries use quantiles to present a more precise picture of the distribution.

How do summaries behave? By tracking percentiles (such as the 50th, 90th, and 99th) of observed values, summaries help you understand the distribution and provide insight into the variability of the data.

What's their use? Summaries are often used to measure latency; in systems that handle many requests, understanding specific percentiles, such as the 99th, is important for guaranteeing a positive user experience.
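To tie the four metric types together, here is a brief, hypothetical prom-client sketch; the metric names, labels, and observed values are illustrative only.

TypeScript
import { Counter, Gauge, Histogram, Summary } from 'prom-client';

// Counter: cumulative, only ever increases
const requestsTotal = new Counter({
  name: 'myapp_http_requests_total',
  help: 'Total number of HTTP requests handled',
  labelNames: ['method', 'status'],
});
requestsTotal.labels('GET', '200').inc();

// Gauge: a point-in-time value that can go up or down
const activeSessions = new Gauge({
  name: 'myapp_active_sessions',
  help: 'Number of currently active user sessions',
});
activeSessions.set(42);

// Histogram: observations bucketed into configurable ranges
const requestDuration = new Histogram({
  name: 'myapp_request_duration_seconds',
  help: 'Request duration in seconds',
  buckets: [0.1, 0.5, 1, 2, 5],
});
requestDuration.observe(0.3);

// Summary: quantiles calculated on the client side
const payloadSize = new Summary({
  name: 'myapp_payload_size_bytes',
  help: 'Size of request payloads in bytes',
  percentiles: [0.5, 0.9, 0.99],
});
payloadSize.observe(1024);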
Checkly's Prometheus V2 Exporter in Action

Checkly's Prometheus Exporter is a specialized tool designed to seamlessly integrate Checkly's synthetic monitoring data with Prometheus. Our recently released Prometheus Exporter V2 fetches metrics generated by Checkly's synthetic checks — such as response times, status codes, and error rates — and exposes them in a Prometheus format. With this exporter, users can consolidate their real-time monitoring and alerting data within the Prometheus ecosystem, enabling comprehensive analysis, visualization, and alerting alongside metrics collected from other applications and systems.

How To Activate Checkly's Prometheus Exporter V2

1. Go to the Integrations tab in your Checkly dashboard and click the 'Create Prometheus endpoint' button.
2. Checkly creates an endpoint for you and provides its URL and the required Bearer token.
3. Create a new job in your prometheus.yml configuration and set a scrape interval of 60 seconds or more. Add the URL (split into metrics_path, scheme, and target) and the bearer_token. Here is an example:

YAML
# prometheus.yml
- job_name: 'checkly'
  scrape_interval: 60s
  metrics_path: '/accounts/993adb-8ac6-3432-9e80-cb43437bf263/v2/prometheus/metrics'
  bearer_token: 'lSAYpOoLtdAa7ajasoNNS234'
  scheme: https
  static_configs:
    - targets: ['api.checklyhq.com']

Now restart Prometheus, and you should see metrics coming in. Find more metrics here.

Key Takeaways

Prometheus Exporters Overview

Prometheus is one of the most popular open-source monitoring and alerting toolkits, designed for reliability and scalability. Prometheus exporters bridge the gap between Prometheus and various applications and systems by translating metrics into Prometheus-compatible formats. Exporters collect specific metrics, format them, and expose them through HTTP endpoints for Prometheus to scrape. There are four types of Prometheus metrics: counter, gauge, histogram, and summary.

Prometheus Exporter Best Practices

Export only essential metrics to avoid unnecessary overhead.
Optimize exporters for efficiency to minimize resource usage.
Implement robust error handling to ensure stability.
Secure exporters to prevent unauthorized access.
Monitor exporters internally and integrate them with centralized monitoring systems.

Choosing Between Existing and Custom Exporters

Existing exporters save development time and benefit from community contributions. Custom exporters offer flexibility for highly specific use cases but require significant development effort.

Application of Prometheus Exporters

Exporters are available for a wide range of applications, including databases, web servers, messaging systems, and cloud services. Integrating exporters with Prometheus allows for comprehensive monitoring, analysis, visualization, and alerting in diverse computing environments. Prometheus exporters empower businesses to achieve precise and efficient monitoring, helping organizations navigate the complexities of modern IT environments and stay ahead in the ever-evolving technology landscape.
When we introduced the concept of data observability four years ago, it resonated with organizations that had unlocked new value…and new problems…thanks to the modern data stack. Now, four years later, we are seeing organizations grapple with the tremendous potential…and tremendous challenges…posed by generative AI. The answer today is the same as it was then: improve data product reliability by getting full context and visibility into your data systems. However, the systems and processes are evolving in this new AI era, so data observability must evolve with them, too.

Perhaps the best way to think about it is to consider AI another data product, and data observability as the living, breathing system that monitors ALL of your data products. The need for reliability and visibility into what is a very black box is just as critical for building trust in LLMs as it was for building trust in analytics and ML. For GenAI in particular, this means data observability must prioritize resolution, pipeline efficiency, and streaming/vector infrastructures. Let's take a closer look at what that means.

Going Beyond Anomalies

Software engineers have long since gotten a handle on application downtime, thanks in part to observability solutions like New Relic and Datadog (who, by the way, just reported a stunning quarter). Data teams, on the other hand, recently reported that data downtime nearly doubled year over year and that each hour was getting more expensive.

Image courtesy of Monte Carlo.

Data products — analytical, ML, and AI applications — need to become just as reliable as those software applications to truly become enmeshed within critical business operations. How? Well, when you dig deeper into the data downtime survey, a trend starts to emerge: the average time-to-resolution for an incident (once detected) rose from 9 to 15 hours.

In our experience, most data teams (perhaps influenced by the common practice of data testing) start the conversation around detection. While early detection is critically important, teams vastly underestimate the significance of making incident triage and resolution efficient. Just imagine hopelessly jumping between dozens of tools trying to figure out how an anomaly came to be, or whether it even matters. That typically ends with fatigued teams that ignore alerts and suffer from data downtime.

Monte Carlo has accelerated the root cause analysis of this data freshness incident by correlating it to a dbt model error resulting from a GitHub pull request where the model code was incorrectly modified with the insertion of a semi-colon on line 113. Image courtesy of Monte Carlo.

Data observability is characterized by the ability to accelerate root cause analysis across data, system, and code, and to proactively set data health SLAs at the organization, domain, and data product levels.

The Need for Speed (and Efficiency)

Data engineers are going to be building more pipelines faster (thanks, GenAI!), and tech debt is going to accumulate right alongside them. That means degraded query, DAG, and dbt model performance. Slow-running data pipelines cost more, are less reliable, and deliver a poor data consumer experience. That won't cut it in the AI era, when data is needed as soon as possible. Especially not when the economy is forcing everyone to take a judicious approach to expenses. That means pipelines need to be optimized and monitored for performance, and data observability has to cater to that.
Observing the GenAI Data Stack

This will shock no one who has been in the data engineering or machine learning space for the last few years, but LLMs perform better in areas where the data is well-defined, structured, and accurate. Not to mention, there are few enterprise problems to be solved that don't require at least some context of the enterprise. This is typically proprietary data — whether it is user IDs, transaction history, shipping times, or unstructured data from internal documents, images, and videos. These are typically held in a data warehouse or lakehouse. I can't tell a GenAI chatbot to cancel my order if it doesn't have any idea of who I am, my past interactions, or the company's cancellation policy.

Ugh, fine. Be that way, ChatGPT 3.5. Image courtesy of Monte Carlo.

To solve these challenges, organizations are typically turning to RAG or pre-training/fine-tuning approaches, both of which require smart and reliable data pipelines. In an (oversimplified) nutshell, RAG involves providing the LLM additional context through a database (oftentimes a vector database) that regularly ingests data from a pipeline, while fine-tuning or pre-training involves tailoring how the LLM performs on specific or specialized types of requests by providing it a training corpus of similar data points. Data observability needs to help data teams deliver reliability and trust in this emerging stack.

In the Era of AI, Data Engineering Is More Important Than Ever

Data engineering has never been a slowly evolving field. If we had started talking to you ten years ago about Spark clusters, you would have politely nodded your head and then crossed the street. To paraphrase a Greek data engineer philosopher, the only constant is change. To this we would add: the only constants in data engineering are the eternal requirements for more. More data, more reliability, and more speed (but at less cost, please and thank you). GenAI will be no different, and we see data observability as an essential bridge to this future that is suddenly here.
The Problem

There are various scenarios in which we choose to deploy our applications in different AWS accounts:

Multiple microservices are deployed in different AWS accounts and different regions, based on the use case.
An organization has multiple AWS accounts configured that deploy related or unrelated services.
One AWS account, one AWS region, and so on.

AWS provides local metrics and monitoring via AWS CloudWatch. But things get complicated when we need to monitor multiple applications across all these accounts to extrapolate and make decisions based on the metrics.

The Technology

AWS released cross-account observability, which allows monitoring of applications that span multiple accounts in a region. It lets accounts share the following with a central monitoring account:

CloudWatch metrics
CloudWatch log groups
AWS X-Ray traces

Glossary

Monitoring account: A central AWS account that views and interacts with the data generated by source accounts.
Source accounts: Individual AWS accounts that generate the data above. There can be any number of them.

Limitations

A source account can share observability data with at most five monitoring accounts.
Cross-region observability is not allowed.
Only the source account can disable sharing with the monitoring account.

The Solution

Cross-account observability and analyzing data

Above is a high-level diagram of sharing the metrics, logs, and X-Ray data from multiple source accounts to a central monitoring account using cross-account observability, and then extending the solution to further analyze the metrics data using ETL, a data warehouse, and QuickSight. Let me discuss each section:

1. Monitoring Account

This is the central account used to stream and visualize the CloudWatch and X-Ray data from source accounts. Certain AWS permissions and configurations are required. Log in with the permissions below to configure the OAM sink, or log in as admin:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowSinkModification",
      "Effect": "Allow",
      "Action": [
        "oam:CreateSink",
        "oam:DeleteSink",
        "oam:PutSinkPolicy",
        "oam:TagResource"
      ],
      "Resource": "*"
    },
    {
      "Sid": "AllowReadOnly",
      "Effect": "Allow",
      "Action": ["oam:Get*", "oam:List*"],
      "Resource": "*"
    }
  ]
}

List of source accounts: To allow a broader set of source accounts, an organization ID can be used. Replace aws:ResourceAccount with aws:PrincipalOrgID.

{
  "Action": [
    "oam:CreateLink",
    "oam:UpdateLink"
  ],
  "Effect": "Allow",
  "Resource": "arn:*:oam:*:*:sink/*",
  "Condition": {
    "StringEquals": {
      "aws:ResourceAccount": [
        "999999999999"
      ]
    }
  }
}

Final sink policy for an AWS organization (you can remove any resource type that is not needed for your use case):

"Name": "SampleSinkPolicy",
"Policy": {
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": "*",
    "Resource": "*",
    "Action": [
      "oam:CreateLink",
      "oam:UpdateLink"
    ],
    "Condition": {
      "StringEquals": {"aws:PrincipalOrgID": "o-xxxxxxxxxxx"},
      "ForAllValues:StringEquals": {
        "oam:ResourceTypes": [
          "AWS::CloudWatch::Metric",
          "AWS::Logs::LogGroup",
          "AWS::XRay::Trace"
        ]
      }
    }
  }]
}

Configure using the AWS Console: Log in -> CloudWatch -> Monitoring account configuration -> Configure (select the appropriate options). Once the configuration is complete, navigate to CloudWatch -> Monitoring account configuration -> Resources to link accounts, and select the type of configuration to download the CloudFormation template or copy the URL. This will be used later to configure the source accounts.
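If you prefer to script the monitoring-account setup rather than click through the console, the same sink can be created from the command line. This is only a sketch under assumptions: it presumes a recent AWS CLI that includes the oam commands and a sink-policy.json file containing the policy shown above, so verify the exact command names and flags against your CLI version.

# Create the sink in the monitoring account (run with monitoring-account credentials)
aws oam create-sink --name SampleMonitoringSink

# Attach the sink policy; the sink identifier is the ARN returned by create-sink
aws oam put-sink-policy \
  --sink-identifier arn:aws:oam:eu-north-1:1111111111111111:sink/EXAMPLE-206d-4daf-9b42-1e17d5f145ef \
  --policy file://sink-policy.json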
Using CDK:

import * as iam from 'aws-cdk-lib/aws-iam';
import * as oam from 'aws-cdk-lib/aws-oam';

// Sink policy mirroring the organization-wide JSON policy shown above
const sinkPolicy = new iam.PolicyDocument({
  statements: [
    new iam.PolicyStatement({
      actions: ['oam:CreateLink', 'oam:UpdateLink'],
      resources: ['*'],
      // Any principal may link, constrained to the organization by the condition below
      principals: [new iam.AnyPrincipal()],
      conditions: {
        StringEquals: { 'aws:PrincipalOrgID': 'o-xxxxxxxxxxx' }
      }
    })
  ]
});

this.sink = new oam.CfnSink(this, 'SampleMonitoringSink', {
  name: 'SampleMonitoringSink',
  policy: sinkPolicy.toJSON()
});

2. Source Accounts

Source account configuration can be done in three ways.

1. Using CloudFormation

Once the monitoring account is configured and the CloudFormation template is downloaded, it can be used to deploy the stack with AWS CloudFormation. The template will look like this:

{
  "LabelTemplate": "SampleLabel",
  "ResourceTypes": [
    "AWS::CloudWatch::Metric",
    "AWS::Logs::LogGroup",
    "AWS::XRay::Trace"
  ],
  "SinkIdentifier": "arn:aws:oam:eu-north-1:1111111111111111:sink/EXAMPLE-206d-4daf-9b42-1e17d5f145ef"
}

This creates a link between the source and monitoring accounts.

2. Using the URL

Log in to the source account and paste the URL. This pre-populates the page with all the details. Select Link and confirm.

3. Using CDK

new oam.CfnLink(this, 'SampleLink', {
  labelTemplate: 'SampleLabel',
  resourceTypes: [
    'AWS::CloudWatch::Metric',
    'AWS::Logs::LogGroup',
    'AWS::XRay::Trace'
  ],
  sinkIdentifier: 'arn:aws:oam:eu-north-1:1111111111111111:sink/EXAMPLE-206d-4daf-9b42-1e17d5f145ef'
});

Once done, all the selected resources (metrics, log groups, and X-Ray traces) from the source accounts will be available in the monitoring account.

Things To Consider

Never use * as a resource without an account ID or organization ID condition. If * is used without any check in place, the monitoring account is exposed to the risk of any source account establishing a link. Always use a list of account IDs or an organization ID in the monitoring account configuration.

Bonus

The above provides cross-account observability to view and monitor logs, metrics, and X-Ray traces from source accounts in a single monitoring account. This section discusses transporting the log and metrics data to a data warehouse to analyze it further using QuickSight. This needs the following configuration in the monitoring account:

Create an S3 bucket: This bucket will store metrics and logs from the source accounts.
Configure a metric stream: To stream and store the data from source accounts in the monitoring account's S3 bucket, configure a metric stream. Navigate to CloudWatch -> Metrics -> Streams, select Create a metric stream, click Include source account metrics, and select the namespaces (or all namespaces) as required. Follow the remaining configuration steps as needed. Once configured, the new metric stream will send data from all linked source accounts to the defined S3 bucket (a CDK sketch follows below).
Configure an ETL job on top of the S3 data (or use a Firehose delivery stream to do this as well).
Use QuickSight or any other visualization application to visualize the collected data. Make a decision!!
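As a rough sketch of the metric-stream piece referenced in the Bonus section, the CDK snippet below wires a CloudWatch metric stream to an existing Kinesis Data Firehose delivery stream that writes to the S3 bucket. The Firehose ARN and IAM role are assumed to exist already (placeholders below), and property names should be double-checked against your CDK version.

TypeScript
import * as cloudwatch from 'aws-cdk-lib/aws-cloudwatch';

// Placeholders: a Firehose delivery stream targeting the S3 bucket, and a role
// that allows CloudWatch to write to it, are assumed to be created elsewhere.
const firehoseArn = 'arn:aws:firehose:eu-north-1:999999999999:deliverystream/metrics-to-s3';
const streamRoleArn = 'arn:aws:iam::999999999999:role/MetricStreamToFirehoseRole';

new cloudwatch.CfnMetricStream(this, 'CrossAccountMetricStream', {
  name: 'cross-account-metric-stream',
  firehoseArn,
  roleArn: streamRoleArn,
  outputFormat: 'json',
  // Pull in metrics from all linked source accounts, not just the monitoring account
  includeLinkedAccountsMetrics: true,
});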
The OpenTelemetry Collector sits at the center of the OpenTelemetry architecture but is unrelated to the W3C Trace Context. In my tracing demo, I use Jaeger instead of the Collector. Yet the Collector is ubiquitous, appearing in every OpenTelemetry-related post, so I wanted to explore it further. In this post, I explore the different aspects of the Collector:

The data kinds: logs, metrics, and traces
Push and pull models
Operations: reads, transformations, and writes

First Steps

A long time ago, observability as we know it didn't exist; what we had instead was monitoring. Back then, monitoring was a bunch of people looking at screens displaying dashboards. Dashboards themselves consisted of metrics, and only system metrics: mainly CPU, memory, and disk usage. For this reason, we will start with metrics.

Prometheus is one of the primary monitoring solutions. It works on a pull-based model: Prometheus scrapes compatible endpoints of your application(s) and stores the results internally. We will use the OTEL Collector to scrape a Prometheus-compatible endpoint and print out the result in the console. Grafana Labs offers a project that generates random metrics to play with. For simplicity's sake, I'll use Docker Compose; the setup looks like the following:

YAML
version: "3"

services:
  fake-metrics:
    build: ./fake-metrics-generator                                  #1
  collector:
    image: otel/opentelemetry-collector:0.87.0                       #2
    environment:                                                     #3
      - METRICS_HOST=fake-metrics
      - METRICS_PORT=5000
    volumes:
      - ./config/collector/config.yml:/etc/otelcol/config.yaml:ro    #4

1. No Docker image is available for the fake-metrics project; hence, we need to build it
2. Latest version of the OTEL Collector at the time of this writing
3. Parameterize the following configuration file
4. Everything happens here

As I mentioned above, the OTEL Collector can do a lot. Hence, configuration is everything.

YAML
receivers:                                                           #1
  prometheus:                                                        #2
    config:
      scrape_configs:                                                #3
        - job_name: fake-metrics                                     #4
          scrape_interval: 3s
          static_configs:
            - targets: [ "${env.METRICS_HOST}:${env.METRICS_PORT}" ]
exporters:                                                           #5
  logging:                                                           #6
    loglevel: debug
service:
  pipelines:                                                         #7
    metrics:                                                         #8
      receivers: [ "prometheus" ]                                    #9
      exporters: [ "logging" ]                                       #10

1. List of receivers. A receiver reads data; it can be either push-based or pull-based
2. We use the prometheus pre-defined receiver
3. Define pull jobs
4. Job's configuration
5. List of exporters. In contrast to receivers, an exporter writes data
6. The simplest exporter is to write data on the standard out
7. Pipelines assemble receivers and exporters
8. Define a metric-related pipeline
9-10. The pipeline gets data from the previously-defined prometheus receiver and sends it to the logging exporter, i.e., prints them

Here's a sample of the result:

2023-11-11 08:28:54 otel-collector-collector-1 | StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
2023-11-11 08:28:54 otel-collector-collector-1 | Timestamp: 2023-11-11 07:28:54.14 +0000 UTC
2023-11-11 08:28:54 otel-collector-collector-1 | Value: 83.090000
2023-11-11 08:28:54 otel-collector-collector-1 | NumberDataPoints #1
2023-11-11 08:28:54 otel-collector-collector-1 | Data point attributes:
2023-11-11 08:28:54 otel-collector-collector-1 | -> fake__embrace_world_class_systems: Str(concept)
2023-11-11 08:28:54 otel-collector-collector-1 | -> fake__exploit_magnetic_applications: Str(concept)
2023-11-11 08:28:54 otel-collector-collector-1 | -> fake__facilitate_wireless_architectures: Str(extranet)
2023-11-11 08:28:54 otel-collector-collector-1 | -> fake__grow_magnetic_communities: Str(challenge)
2023-11-11 08:28:54 otel-collector-collector-1 | -> fake__reinvent_revolutionary_applications: Str(support)
2023-11-11 08:28:54 otel-collector-collector-1 | -> fake__strategize_strategic_initiatives: Str(internet_solution)
2023-11-11 08:28:54 otel-collector-collector-1 | -> fake__target_customized_eyeballs: Str(concept)
2023-11-11 08:28:54 otel-collector-collector-1 | -> fake__transform_turn_key_technologies: Str(framework)
2023-11-11 08:28:54 otel-collector-collector-1 | -> fake__whiteboard_innovative_partnerships: Str(matrices)
2023-11-11 08:28:54 otel-collector-collector-1 | StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
2023-11-11 08:28:54 otel-collector-collector-1 | Timestamp: 2023-11-11 07:28:54.14 +0000 UTC
2023-11-11 08:28:54 otel-collector-collector-1 | Value: 53.090000
2023-11-11 08:28:54 otel-collector-collector-1 | NumberDataPoints #2
2023-11-11 08:28:54 otel-collector-collector-1 | Data point attributes:
2023-11-11 08:28:54 otel-collector-collector-1 | -> fake__expedite_distributed_partnerships: Str(approach)
2023-11-11 08:28:54 otel-collector-collector-1 | -> fake__facilitate_wireless_architectures: Str(graphical_user_interface)
2023-11-11 08:28:54 otel-collector-collector-1 | -> fake__grow_magnetic_communities: Str(policy)
2023-11-11 08:28:54 otel-collector-collector-1 | -> fake__reinvent_revolutionary_applications: Str(algorithm)
2023-11-11 08:28:54 otel-collector-collector-1 | -> fake__transform_turn_key_technologies: Str(framework)
2023-11-11 08:28:54 otel-collector-collector-1 | StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
2023-11-11 08:28:54 otel-collector-collector-1 | Timestamp: 2023-11-11 07:28:54.14 +0000 UTC
2023-11-11 08:28:54 otel-collector-collector-1 | Value: 16.440000
2023-11-11 08:28:54 otel-collector-collector-1 | NumberDataPoints #3
2023-11-11 08:28:54 otel-collector-collector-1 | Data point attributes:
2023-11-11 08:28:54 otel-collector-collector-1 | -> fake__exploit_magnetic_applications: Str(concept)
2023-11-11 08:28:54 otel-collector-collector-1 | -> fake__grow_magnetic_communities: Str(graphical_user_interface)
2023-11-11 08:28:54 otel-collector-collector-1 | -> fake__target_customized_eyeballs: Str(extranet)

Beyond Printing

The above is an excellent first step, but there's more than printing to the console. We will expose the metrics to be scraped by a regular Prometheus instance; we can add a Grafana dashboard to visualize them.
While it may seem pointless, bear with it: it's only a stepping stone. To achieve the above, we only change the OTEL Collector configuration:

YAML
exporters:
  prometheus:                                     #1
    endpoint: ":${env:PROMETHEUS_PORT}"           #2
service:
  pipelines:
    metrics:
      receivers: [ "prometheus" ]
      exporters: [ "prometheus" ]                 #3

1. Add a prometheus exporter
2. Expose a Prometheus-compliant endpoint
3. Replace printing with exposing

That's it. The OTEL Collector is very flexible. Note that the Collector is multi-input, multi-output. To both print data and expose it via the endpoint, we add both exporters to the pipeline:

YAML
exporters:
  prometheus:                                     #1
    endpoint: ":${env:PROMETHEUS_PORT}"
  logging:                                        #2
    loglevel: debug
service:
  pipelines:
    metrics:
      receivers: [ "prometheus" ]
      exporters: [ "prometheus", "logging" ]      #3

1. Expose data
2. Print data
3. The pipeline will both print data and expose it

With the Prometheus exporter configured, we can visualize metrics in Grafana. Note that receivers and exporters specify their type, and every one of them must be unique. To comply with the last requirement, we can append a qualifier to distinguish between them, e.g., prometheus/foo and prometheus/bar.

Intermediary Data Processing

A valid question would be why the OTEL Collector is set between the source and Prometheus at all, as it makes the overall design more fragile. At this stage, we can leverage the true power of the OTEL Collector: data processing. So far, we have ingested raw metrics, but the source format may not be adapted to how we want to visualize the data. For example, in our setup, metrics come from our fake generator, "business," and from the underlying NodeJS platform, "technical." This is reflected in the metrics' names. We could add a dedicated source label and remove the unnecessary prefix to filter more efficiently.

You declare data processors in the processors section of the configuration file. The collector executes them in the order they are declared. Let's implement the above transformation.

The first step toward our goal is to understand that the collector has two flavors: a "bare" one and a contrib one that builds upon it. Processors included in the former are limited, both in number and in capabilities; hence, we need to switch to the contrib version.

YAML
collector:
  image: otel/opentelemetry-collector-contrib:0.87.0                         #1
  environment:
    - METRICS_HOST=fake-metrics
    - METRICS_PORT=5000
    - PROMETHEUS_PORT=8889
  volumes:
    - ./config/collector/config.yml:/etc/otelcol-contrib/config.yaml:ro      #2

1. Use the contrib flavor
2. For added fun, the configuration file is on another path

At this point, we can add the processor itself:

YAML
processors:
  metricstransform:                               #1
    transforms:                                   #2
      - include: ^fake_(.*)$                      #3
        match_type: regexp                        #3
        action: update
        operations:                               #4
          - action: add_label                     #5
            new_label: origin
            new_value: fake
      - include: ^fake_(.*)$
        match_type: regexp
        action: update                            #6
        new_name: $${1}                           #6-7
      # Do the same with metrics generated by NodeJS

1. Invoke the metrics transform processor
2. List of transforms applied in order
3. Matches all metrics with the defined regexp
4. List of operations applied in order
5. Add the label
6. Rename the metric by removing the regexp group prefix
7. Fun stuff: the syntax is $${x}

Finally, we add the defined processor to the pipeline:

YAML
service:
  pipelines:
    metrics:
      receivers: [ "prometheus" ]
      processors: [ "metricstransform" ]
      exporters: [ "prometheus" ]

Here are the results:

Connecting Receivers and Exporters

A connector is both a receiver and an exporter and connects two pipelines.
The example from the documentation receives the number of spans (traces) and exports the count, which is a metric. I tried to achieve the same with 500 errors — spoiler: it doesn't work as intended.

Let's first add a log receiver:

YAML
receivers:
  filelog:
    include: [ "/var/logs/generated.log" ]

Then, we add a connector:

YAML
connectors:
  count:
    requests.errors:
      description: Number of 500 errors
      condition: [ "status == 500 " ]

Lastly, we connect the log receiver and the metrics exporter:

YAML
service:
  pipelines:
    logs:
      receivers: [ "filelog" ]
      exporters: [ "count" ]
    metrics:
      receivers: [ "prometheus", "count" ]

The metric is named log_record_count_total, but its value stays at 1.

Logs Manipulation

Processors allow data manipulation; operators are specialized processors that work on logs. If you're familiar with the ELK stack, they are the equivalent of Logstash. As of now, the log timestamp is the ingestion timestamp. We shall change it to the timestamp of the log's creation.

YAML
receivers:
  filelog:
    include: [ "/var/logs/generated.log" ]
    operators:
      - type: json_parser                         #1
        timestamp:                                #2
          parse_from: attributes.datetime         #3
          layout: "%d/%b/%Y:%H:%M:%S %z"          #4
        severity:                                 #2
          parse_from: attributes.status           #3
          mapping:                                #5
            error: 5xx                            #6
            warn: 4xx
            info: 3xx
            debug: 2xx
      - id: remove_body                           #7
        type: remove
        field: body
      - id: remove_datetime                       #7
        type: remove
        field: attributes.datetime
      - id: remove_status                         #7
        type: remove
        field: attributes.status

1. The log is in JSON format; we can use the provided JSON parser
2. Metadata attributes to set
3. Fields to read from
4. Parsing pattern
5. Mapping table
6. Accepts a range, e.g., 501-599; the operator has a special interpreted value 5xx (and similar) for HTTP statuses
7. Remove duplicated data

Logs

At this point, we can send the logs to any log aggregation component. We shall stay in the Grafana Labs sphere and use Loki.

YAML
exporters:
  loki:
    endpoint: "http://loki:3100/loki/api/v1/push"

We can also use logs from the collector itself:

YAML
service:
  telemetry:
    logs:

Finally, let's add another pipeline:

YAML
service:
  pipelines:
    logs:
      receivers: [ "filelog" ]
      exporters: [ "loki" ]

Grafana can also visualize the logs; choose Loki as a data source.

Conclusion

In this post, we delved into the OpenTelemetry Collector. While it's not a mandatory part of the OTEL architecture, it's a useful Swiss Army knife for all your data processing needs. If you aren't tied to a specific stack, or don't want to be, it's a tremendous help. The complete source code for this post can be found on GitHub.

To Go Further

OpenTelemetry Collector
OpenTelemetry Operators
Observability is essential in any modern software development and production environment. It allows teams to better identify areas of improvement, enabling them to make informed decisions about their development processes. Telemetry, a critical part of observability, refers to the continuous nature of data collection. This data enables organizations to paint a picture of the whole system's health and achieve a higher level of observability and responsiveness in managing their applications.

This article will provide some guidance for your observability journey, starting with defining the significance of "true" observability. It will explore the different observability maturity models, examining the steps required to ascend the maturity ladder as well as the challenges and solutions involved in advancing maturity levels. Additionally, it will cover techniques for adopting observability, including how to get started, best practices for implementing it, and how to generate useful performance data. Lastly, the article will cover the role of automation and AI in observability and how comprehensive telemetry can impact overall application performance.

Defining "True" Observability

In the domain of application performance management, the term "observability" has evolved beyond its traditional monitoring roots, reaching a level often referred to as "true" observability. At its core, "true" observability is more than just keeping an eye on your systems; it's a holistic approach that provides a 360-degree view of your whole system: infrastructure, applications, and services.

Figure 1: Observability pillars

Traditional monitoring typically involves collecting specific performance metrics and setting predefined thresholds to help identify known issues and alert administrators when those thresholds are exceeded. It primarily focuses on basic health checks, like system uptime, CPU, and memory utilization, thus providing a simplified view of a system's behavior. This makes traditional monitoring limited in its ability to diagnose more complex problems or identify underlying causes, as it lacks the depth and data granularity of "true" observability. Traditional monitoring tends to be passive, addressing only well-understood issues, and may not keep pace with the dynamic nature of modern, distributed applications and infrastructure.

Moving from traditional monitoring to true observability means incorporating a data-rich approach that relies on in-depth telemetry. Unlike traditional monitoring, which often focuses on surface-level metrics, "true" observability incorporates metrics, traces, and logs, providing a more detailed and nuanced view of application behavior. This helps identify the root cause of issues, giving teams visibility into the entire ecosystem and providing a more comprehensive picture of not just what's happening in the system, but why and how it's happening. Ultimately, true observability empowers teams to deliver more reliable, responsive, and efficient applications that elevate the overall user experience.

The Observability Maturity Model

In order to achieve "true" observability, it's important to understand the Observability Maturity Model. This model outlines the stages through which organizations evolve in their observability practices, acting as a roadmap.
Here, we'll describe each maturity stage, highlight their advantages and disadvantages, and offer some practical tips for moving from one stage to the next. As seen in Table 1, the Observability Maturity Model is broken down into four distinct levels of observability: initial, awareness, proactive, and predictive.

OBSERVABILITY MATURITY STAGES: ADVANTAGES AND DISADVANTAGES

Initial (Stage 1)
Purpose: Also called the monitoring level, this is where the basic health of individual system components is tracked. Alarms and notifications are triggered to signal that something went wrong.
Advantages: Simplicity (easy to implement and understand); quick issue detection; easily accessible through many open-source and SaaS solutions; cost-effective; helps ensure basic availability.
Disadvantages: Limited visibility due to lack of insights into system behavior; reactive issue resolution; lack of context; manual root cause analysis; alert noise from multiple sources.

Awareness (Stage 2)
Purpose: This is the observability level, where you have more insight into system behavior by observing its outputs. It focuses on results from metrics, logs, and traces, combined with existing monitoring data, to help answer what went wrong and why.
Advantages: Offers a deeper and broader understanding of overall system health; helps uncover not just known failure types, but unknown ones as well; delivers baseline data for investigating issues.
Disadvantages: Complex manual queries for data correlation can make troubleshooting inefficient; data from different sources may remain in silos, which is challenging for cross-domain and cross-team collaboration; lack of automation.

Proactive (Stage 3)
Purpose: This stage provides more comprehensive insights to help understand a problem's origin and consequences. Building upon Stages 1 and 2, it adds the ability to track topology changes over time in the stack and generates extensive, correlated information that helps identify what went wrong more quickly, why the issue occurred, when it started, and what areas are impacted.
Advantages: Clear contextual view through unified data; accelerates resolution time through visualization and analysis; automated foundation for root cause analysis and alert correlation; enables visualization of the impact of network, infrastructure, and app events on business services.
Disadvantages: Challenges in data normalization may require additional capabilities or organizational changes; time-consuming setup; still some manual effort and limited automation at this level.

Predictive (Stage 4)
Purpose: This is called the intelligent observability phase, as the use of AI/ML algorithms helps identify error correlation patterns and offers remediation workflows. Here you start understanding how to predict anomalies and automate responses.
Advantages: Leverages AI/ML to analyze large volumes of data for more accurate insights; early issue detection; results in more efficient ITOps; automated responses and self-healing systems.
Disadvantages: May require significant configuration and training; handling the velocity and variety of data can be challenging; demonstrating ROI may take time; potential for misinterpretation in self-healing systems.

Table 1

Adopting "True" Observability

After understanding the Observability Maturity Model, it's essential to explore the multifaceted approach companies must embrace for a successful observability transition. Despite the need to adopt advanced tools and practices, the path to "true" observability can demand significant cultural and organizational shifts.
Companies must develop strategies that align with the observability maturity model, nurture a collaborative culture, and make cross-team communication a priority. The rewards are quite substantial — faster issue resolution and improved user experience, making "true" observability a transformative journey for IT businesses.

How To Get Started With Observability

If your organization is at the beginning of its observability journey, start by assessing your current monitoring capabilities and identifying gaps. Invest in observability tools and platforms that align with your maturity level, making sure you capture metrics, logs, and traces effectively. Set clear objectives and key performance indicators (KPIs) to measure progress along the way. As you establish a cross-functional observability team and promote a culture of knowledge sharing and collaboration, you'll be well-prepared to move forward in your observability journey.

Generating Useful Performance Data

Central to this journey is the effective generation of performance data. Telemetry data — metrics, logs, and traces — provides insights into system health and performance. To get started, define what data is most important to your unique system needs.

Logging for Clarity and Accessibility

Implement structured logging practices that ensure logs are accessible and clear. Logs offer insights into system behavior, errors, and transactions, so it's critical to keep logs consistent and in a standardized format. Prioritize log accessibility by implementing log aggregation solutions that centralize logs from multiple sources in the system. This centralized access simplifies troubleshooting and anomaly detection.

Metrics for Insights

Metrics provide quantifiable data points that capture the critical aspects of your applications, like traffic, latency, error rates, and saturation. Define clear objectives and benchmarks for these metrics to provide a baseline for performance assessment. Implement monitoring tools that can capture, store, and visualize these metrics in real time, and analyze them regularly to make data-driven decisions.

Tracing for Precision

Distributed tracing is a powerful tool for understanding the complex flows in today's modern architectures. To implement effective tracing, start by generating trace data in your applications. Ensure these traces are correlated, providing a detailed view of request paths and interactions between services. Invest in tracing tools that can visualize these traces and offer solutions for root cause analysis. This can help pinpoint performance bottlenecks, speed up troubleshooting, and maintain a precise understanding of your system.

The Role of Automation and AI

On the journey to "true" observability, automation and AI become your allies in harnessing the full potential of the data you've collected. They offer capabilities that can elevate your observability game to the next level. Using automation, you can streamline the process of generating insights from the data and detect patterns and anomalies with AI-driven algorithms.

Figure 2: AI at the heart of observability

Using automation and AI, you can analyze telemetry data to identify deviations from expectations. They can recognize early warning signals and predict performance degradation. AI algorithms can sift through vast amounts of data, identify causes, and provide actionable insights to your operators. AI-driven observability doesn't stop at identification and analysis but can extend to intelligent remediation.
When an issue is detected, AI can help provide instructions for resolution and suggest actions to be taken or changes to be implemented in the system. With AI's assistance, your Ops team can be more efficient and effective, ensuring minimal disruption and optimal system availability.

Conclusion

In the evolving landscape of IT and application performance management, true observability is a guide through the complexities of modern systems. As environments become more dynamic, distributed, and modular, adopting true observability is a necessity rather than a luxury. This article uncovered the layers of observability, from understanding the foundations of monitoring to achieving proactive observability with automation and AI. We explored the significance of each maturity level, highlighting the need for cultural and organizational shifts, and we emphasized the benefits of faster issue resolution and an improved user experience. Lastly, we covered how to adopt "true" observability and the components of a telemetry ecosystem (metrics, traces, and logs), as well as the role of automation and AI in more effective collection, storage, and analysis of telemetry data. Moving forward, the key takeaway is that the goal of true observability isn't just to collect data; it's to harness its power to deliver seamless and reliable user experiences.

To continue your exploration of this subject, consider the following resources:

The Observability Maturity Model Refcard by Lodewijk Bogaards
The Getting Started With OpenTelemetry Refcard by Joana Carvalho
"A Deep Dive Into AIOps and MLOps" by Hicham Bouissoumer and Nicolas Giron
AIOps applies AI to IT operations, enabling agility, early issue detection, and proactive resolution to maintain service quality. AIOps integrates DataOps and MLOps, enhancing efficiency, collaboration, and transparency. It aligns with DevOps for application lifecycle management and automation, optimizing decisions throughout DataOps, MLOps, and DevOps. Observability for IT operations is a transformative approach that provides real-time insights, proactive issue detection, and comprehensive performance analysis, ensuring the reliability and availability of modern IT systems.

Why AIOps Is Fundamental to Modern IT Operations

AIOps streamlines operations by automating problem detection and resolution, leading to increased IT staff efficiency, outage prevention, improved user experiences, and optimized utilization of cloud technologies. The major contributions of AIOps are shared in Table 1:

CONTRIBUTIONS OF AIOPS

Event correlation: Uses rules and logic to filter and group event data, prioritizing service issues based on KPIs and business metrics.
Anomaly detection: Identifies normal and abnormal behavior patterns, monitoring multiple services to predict and mitigate potential issues.
Automated incident management: Aims to automate all standardized, high-volume, error-sensitive, audit-critical, repetitive, multi-person, and time-sensitive tasks, while preserving human involvement in low-ROI and customer-support-related activities.
Performance optimization: Analyzes large datasets employing AI and ML, proactively ensuring service levels and identifying issue root causes.
Enhanced collaboration: Fosters collaboration between IT teams, such as DevOps, by providing a unified platform for monitoring, analysis, and incident response.

Table 1

How Does AIOps Work?

AIOps involves the collection and analysis of vast volumes of data generated within IT environments, such as network performance metrics, application logs, and system alerts. AIOps uses these insights to detect patterns and anomalies, providing early warnings for potential issues. By integrating with other DevOps practices, such as DataOps and MLOps, it streamlines processes, enhances efficiency, and ensures a proactive approach to problem resolution. AIOps is a crucial tool for modern IT operations, offering the agility and intelligence required to maintain service quality in complex and dynamic digital environments.

Figure 1: How AIOps works

Popular AIOps Platforms and Key Features

Leading AIOps platforms are revolutionizing IT operations by seamlessly combining AI and observability, enhancing system reliability, and optimizing performance across diverse industries. The following tools are just a few of many options:

Prometheus acts as an efficient AIOps platform by capturing time-series data, monitoring IT environments, and providing anomaly alerts.
OpenNMS automatically discovers, maps, and monitors complex IT environments, including networks, applications, and systems.
Shinken enables users to monitor and troubleshoot complex IT environments, including networks and applications.

The key features of these platforms and the role they play in AIOps are shared in Table 2:

KEY FEATURES OF AIOPS PLATFORMS AND THE CORRESPONDING TASKS

Visibility: Provides insight into the entire IT environment, allowing for comprehensive monitoring and analysis.
Monitoring and management: Monitors the performance of IT systems and manages alerts and incidents.
Performance: Measures and analyzes system performance metrics to ensure optimal operation.
Functionality: Ensures that the AIOps platform offers a range of functionalities to meet various IT needs.
Issue resolution: Utilizes AI-driven insights to address and resolve IT issues more effectively.
Analysis: Analyzes data and events to identify patterns, anomalies, and trends, aiding in proactive decision-making.

Table 2

Observability's Role in IT Operations

Observability plays a pivotal role in IT operations by offering the means to monitor, analyze, and understand the intricacies of complex IT systems. It enables continuous tracking of system performance, early issue detection, and root cause analysis. Observability data empowers IT teams to optimize performance, allocate resources efficiently, and ensure a reliable user experience. It supports proactive incident management, compliance monitoring, and data-driven decision-making. In a collaborative DevOps environment, observability fosters transparency and enables teams to work cohesively toward system reliability and efficiency. Data sources like logs, metrics, and traces play a crucial role in observability by providing diverse and comprehensive insights into the behavior and performance of IT systems.

ROLES OF DATA SOURCES

Logs: Event tracking; compliance and auditing; capacity planning; dependency mapping.
Metrics: Root cause analysis; performance monitoring; end-to-end visibility.
Traces: Anomaly detection; threshold alerts; latency analysis.

Table 3

Challenges of Observability

Observability is fraught with technical challenges. Accidental invisibility occurs when critical system components or behaviors are not monitored, leading to blind spots in observability. Insufficient source data can result in incomplete or inadequate observability, limiting the ability to gain insights into system performance. Dealing with multiple information formats poses difficulties in aggregating and analyzing data from various sources, making it harder to maintain a unified view of the system.

Popular Observability Platforms and Key Features

Observability platforms offer a set of key capabilities essential for monitoring, analyzing, and optimizing complex IT systems.

OpenObserve provides scheduled and real-time alerts and reduces operational costs.
Vector allows users to collect and transform logs, metrics, and traces.
The Elastic Stack — comprising Elasticsearch, Kibana, Beats, and Logstash — can search, analyze, and visualize data in real time.

The capabilities of observability platforms include real-time data collection from various sources such as logs, metrics, and traces, providing a comprehensive view of system behavior. They enable proactive issue detection, incident management, root cause analysis, and performance optimization, and they aid system reliability. Observability platforms often incorporate machine learning for anomaly detection and predictive analysis. They offer customizable dashboards and reporting for in-depth insights and data-driven decision-making. These platforms foster collaboration among IT teams by providing a unified space for developers and operations to work together, fostering a culture of transparency and accountability.
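As a small illustration of the anomaly-detection capability described above, here is a hypothetical Prometheus alerting rule that flags latency drifting more than three standard deviations from its recent average; the metric name and thresholds are placeholders, not a recommendation for any specific platform.

YAML
groups:
  - name: latency-anomalies
    rules:
      - alert: LatencyAnomaly
        # z-score of the current latency against the last hour of samples
        expr: (myapp_request_latency_seconds - avg_over_time(myapp_request_latency_seconds[1h])) / stddev_over_time(myapp_request_latency_seconds[1h]) > 3
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Latency for {{ $labels.instance }} deviates sharply from its recent baseline"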
Leveraging AIOps and Observability for Enhanced Performance Analytics

Synergizing AIOps and observability represents a cutting-edge strategy to elevate performance analytics in IT operations, enabling data-driven insights, proactive issue resolution, and optimized system performance.

Observability Use Cases Best Supported by AIOps

Elevating cloud-native and hybrid cloud observability with AIOps: AIOps transcends the boundaries between cloud-native and hybrid cloud environments, offering comprehensive monitoring, anomaly detection, and seamless incident automation. It adapts to the dynamic nature of cloud-native systems while optimizing on-premises and hybrid cloud operations. This duality makes AIOps a versatile tool for modern enterprises, ensuring a consistent and data-driven approach to observability, regardless of the infrastructure's intricacies.

Seamless collaboration of dev and ops teams with AIOps: AIOps facilitates the convergence of dev and ops teams in observability efforts. By offering a unified space for data analysis, real-time monitoring, and incident management, AIOps fosters transparency and collaboration. It enables dev and ops teams to work cohesively, ensuring the reliability and performance of IT systems.

Challenges To Adopting AIOps and Observability

The three major challenges to adopting AIOps and observability are data complexity, integration complexity, and data security. Handling the vast and diverse data generated by modern IT environments can be overwhelming; organizations need to manage, store, and analyze this data efficiently. Integrating AIOps and observability tools with existing systems and processes can be complex and time-consuming, potentially causing disruptions if not executed properly. The increased visibility into IT systems also raises concerns about data security and privacy, so ensuring the protection of sensitive information is crucial.

Impacts and Benefits of Combining AIOps and Observability Across Sectors

The impacts and benefits of integrating AIOps and observability transcend industries, enhancing reliability, efficiency, and performance across diverse sectors. The combination improves incident response by using machine learning to detect patterns and trends, enabling proactive issue resolution and minimizing downtime. Predictive analytics anticipates capacity needs and optimizes resource allocation in advance, which ensures uninterrupted operations. Full-stack observability leverages data from various sources — including metrics, events, logs, and traces (MELT) — to gain comprehensive insights into system performance, supporting timely issue identification and resolution. MELT capabilities are the key drivers: metrics help pinpoint issues, events automate alert prioritization, logs aid in root cause analysis, and traces assist in locating problems within the system. All contribute to improved operational efficiency.

APPLICATION SCENARIOS OF COMBINING AIOPS AND OBSERVABILITY

Finance: Enhance fraud detection, minimize downtime, and ensure compliance with regulatory requirements, thus safeguarding financial operations.
Healthcare: Improve patient outcomes by guaranteeing the availability and performance of critical healthcare systems and applications, contributing to better patient care.
Retail: Optimize supply chain operations, boost customer experiences, and maintain online and in-store operational efficiency.
Manufacturing Enhance the reliability and efficiency of manufacturing processes through predictive maintenance and performance optimization. Telecommunications Support network performance to ensure reliable connectivity and minimal service disruptions. E-commerce Provide real-time insights into website performance, leading to seamless shopping experiences and improved conversion rates. Table 4 The application scenarios of combining AIOps and observability span diverse industries, showcasing their transformative potential in improving system reliability, availability, and performance across the board. Operational Guidance for AIOps Implementation Operational guidance for AIOps implementation offers a strategic roadmap to navigate the complexities of integrating AI into IT operations, ensuring successful deployment and optimization. Figure 2: Steps for implementing AIOps The Future of AIOps in Observability: The Road Ahead AIOps' future in observability promises to be transformative. As IT environments become more complex and dynamic, AIOps will play an increasingly vital role in ensuring system reliability and performance, and it will continue to evolve, integrating with advanced technologies like cognitive automation, natural language understanding (NLU), large language models (LLMs), and generative AI.
APPLICATION SCENARIOS OF COMBINING AIOPS AND OBSERVABILITY
Data collection and analysis – Role of AIOps: collects and analyzes a wide range of IT data, including performance metrics, logs, and incidents. Synergy with cognitive automation: processes unstructured data, such as emails, documents, and images. LLM and generative AI integration: predicts potential issues based on historical data patterns and generates reports.
Incident management – Role of AIOps: automatically detects, prioritizes, and responds to IT incidents. Synergy with cognitive automation: extracts relevant information from incident reports and suggests or implements appropriate actions. LLM and generative AI integration: understands incident context and generates appropriate responses.
Root cause analysis – Role of AIOps: identifies root causes of incidents. Synergy with cognitive automation: accesses historical documentation and knowledge bases to offer detailed explanations and solutions. LLM and generative AI integration: provides recommendations for resolving issues by analyzing historical data.
NLU – Role of AIOps: uses NLU to process user queries and understand context. Synergy with cognitive automation: engages in natural language conversations with IT staff or end users, improving user experiences. LLM and generative AI integration: powers chatbots and virtual IT assistants, offering user-friendly interaction and support to answer queries and provide guidance.
Table 5
Conclusion The fusion of AI/ML with AIOps has ushered in a new era of observability. IT operations are constantly evolving, and so is the capability to monitor, analyze, and optimize performance. In the age of AI/ML-driven observability, our IT operations won't merely survive, but will thrive, underpinned by data-driven insights, predictive analytics, and an unwavering commitment to excellence.
References:
OpenNMS repositories, GitHub
OpenObserve repositories, GitHub
OpsPAI/awesome-AIOps, GitHub
Precompiled binaries and Docker images for Prometheus components
Shinken documentation
Agile development practices must be supported by an agile monitoring framework. Overlooking the nuances of the system state — spanning infrastructure, application performance, and user interaction — is a risk businesses can't afford. This is particularly true when performance metrics and reliability shape customer satisfaction and loyalty, directly influencing the bottom line. Traditional application performance monitoring (APM) tools were initially designed for environments that were more static and predictable. They were not built to track the swift, iterative changes of microservice architectures or the complexities of cloud-native applications. This led to the gradual evolution of the modern observability approach, which leveraged the data collection principles of APM and extended them to provide deeper insights into a system's state. In this article, we delve into the core concepts of observability and monitoring while discussing how the modern observability approach differs from and complements traditional monitoring practices. Optimizing Application Performance Through Data Quality Performance metrics are only as reliable as the data feeding them. Diverse data sources, each with its own format and scale, can obscure the true picture of application performance. Given the "garbage in, garbage out" challenge, data normalization serves as the corrective measure: a dataset is reorganized to reduce redundancy and improve data integrity. The primary aim is to ensure that data is stored efficiently and consistently, which makes it easier to retrieve, manipulate, and make sense of. For APM, there are various normalization techniques that help bring heterogeneous data onto a common scale so that it can be compared and analyzed more effectively:
Unit conversion – standardizing units of measure, like converting all time-based metrics to milliseconds
Range scaling – adjusting metrics to a common range; useful for comparing metrics that originally existed on different scales
Z-score normalization – converting metrics to a standard distribution, which is especially useful when dealing with outlier values
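To make these techniques concrete, here is a minimal Java sketch of the three normalizations described above. The sample latency values and metric choices are purely illustrative and not tied to any particular APM tool.
Java
import java.util.Arrays;

/** Illustrative normalization helpers for heterogeneous APM metrics. */
public final class MetricNormalization {

    // Unit conversion: bring all time-based metrics to milliseconds.
    static double secondsToMillis(double seconds) {
        return seconds * 1000.0;
    }

    // Range scaling (min-max): map values onto [0, 1] for cross-metric comparison.
    static double[] minMaxScale(double[] values) {
        double min = Arrays.stream(values).min().orElse(0);
        double max = Arrays.stream(values).max().orElse(1);
        return Arrays.stream(values)
                .map(v -> (max == min) ? 0.0 : (v - min) / (max - min))
                .toArray();
    }

    // Z-score normalization: express each value as standard deviations from the mean.
    static double[] zScores(double[] values) {
        double mean = Arrays.stream(values).average().orElse(0);
        double variance = Arrays.stream(values)
                .map(v -> (v - mean) * (v - mean)).average().orElse(0);
        double stdDev = Math.sqrt(variance);
        return Arrays.stream(values)
                .map(v -> (stdDev == 0) ? 0.0 : (v - mean) / stdDev)
                .toArray();
    }

    public static void main(String[] args) {
        double[] pageLoadMs = {120, 135, 128, 910, 131}; // hypothetical page-load times
        System.out.println(Arrays.toString(zScores(pageLoadMs))); // the 910 ms outlier stands out
    }
}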
Monitoring vs. Observability: Core Concepts In optimizing application performance, monitoring and observability play equally critical but distinct roles. The terms are often used interchangeably, but there's a nuanced difference. Monitoring collects data points against predefined thresholds and sets up alarms to flag anomalies. This essentially answers the question, Is my system working as expected? On the other hand, observability allows for deep dives into system behavior, offering insights into issues that you didn't know existed. The approach helps you answer, Why isn't my system working as expected? Example Use Case: E-Commerce Platform For context, consider an e-commerce platform where application uptime and user experience are critical. To ensure everything is running smoothly, the right blend of monitoring and observability strategies can be broken down as follows.
MONITORING vs. OBSERVABILITY FOR AN E-COMMERCE PLATFORM
Monitoring – Availability checks: regular pings to ensure the website is accessible
Monitoring – Latency metrics: measuring page load times to optimize user experience
Monitoring – Error rate tracking: flags raised if HTTP errors like "404 Not Found" exceed a threshold
Monitoring – Transaction monitoring: automated checks for crucial processes like checkout
Observability – Log analysis: deep inspection of server logs to trace failed user requests
Observability – Distributed tracing: maps the path of a request through various services
Observability – Event tagging: custom tags in code for real-time understanding of user behavior
Observability – Query-driven exploration: ad hoc queries to examine system behavior
Table 1
Synergy Between Monitoring and Observability Monitoring and observability don't conflict; instead, they work hand-in-hand to develop an efficient APM framework. Integrating monitoring and observability will allow you to realize numerous advantages, including those listed below:
Enhanced coverage – Monitoring identifies known issues while observability lets you explore the unknown. From system crashes to subtle performance degradations, everything gets covered here. In practical terms, this may mean not just knowing that your server responded with a 500 error but also understanding why it occurred and what its effects are on the entire ecosystem.
Improved analysis – A blended approach enables you to pivot from what is happening to why it's happening. This is crucial for data-driven decision-making. You can allocate resources more effectively, prioritize bug fixes, or even discover optimization opportunities you didn't know existed. For example, you might find that certain API calls are taking longer only during specific times of the day and trace it back to another internal process hogging resources.
Scalability – As your system grows, its complexity often grows exponentially. The scalability of your APM can be significantly improved when both monitoring and observability work in sync. Monitoring helps you keep tabs on performance indicators, but observability allows you to fine-tune your system for optimal performance at scale. As a result, you achieve a scalable way to not just proactively identify bottlenecks and resource constraints but also to investigate and resolve them.
Figure 1: How observability and monitoring overlap
Creating a Cohesive System Synergizing monitoring and observability is one of the most critical aspects of building a robust, scalable, and insightful APM framework. The key here is to build an environment where monitoring and observability are not just coexisting but are codependent, thus amplifying each other's efficacy in maintaining system reliability. While different use cases may require different approaches, consider the following foundational approaches to build a cohesive monitoring and observability stack. Unified Data Storage and Retrieval The first step towards creating a cohesive analytics pipeline is unified data storage. A single data storage and retrieval system enhances the speed and accuracy of your analytics. Your performance analysis stack should accommodate both fixed metrics from monitoring and dynamic metrics from observability. At its core, the underlying system architecture should be capable of handling different data types efficiently. Solutions like time series databases or data lakes can often serve these varied needs well.
However, it's crucial to consider the system's capability for data indexing, searching, and filtering, especially when dealing with large-scale, high-velocity data. Interoperability Between Specialized Tools An agile APM system relies on seamless data exchange between monitoring and observability tools. When each tool operates as a disjointed/standalone system, the chances of getting siloed data streams and operational blind spots increase. Consider building an interoperable system that allows you to aggregate data into a single, comprehensive dashboard. Opt for tools that adhere to common data formats and communication protocols. A more advanced approach of achieving this is to leverage a custom middleware to serve as a bridge between different tools. As an outcome, you can correlate monitoring KPIs with detailed logs and traces from your observability tools. Data-Driven Corrective Actions Knowing exactly what needs to be fixed allows for quicker remediation. This speed is vital in a live production environment where every minute of suboptimal performance can translate to lost revenue or user trust. When your monitoring system flags an anomaly, the logical next step is a deep dive into the underlying issue. For instance, a monitoring system alerts you about a sudden spike in error rates, but it doesn't tell you why. Integrating observability tools helps to correlate the layers. These tools can sift through log files, query databases, and analyze trace data, ultimately offering a more granular view. As a result, you're equipped to take targeted, data-driven actions. To streamline this further, consider establishing automated workflows. An alert from the monitoring system can trigger predefined queries in your observability tools, subsequently fast-tracking the identification of the root cause. Distinguishing Monitoring From Observability While the approach of monitoring and observability often intersect, their objectives, methods, and outcomes are distinct in the following ways. Metrics vs. Logs vs. Traces Monitoring primarily revolves around metrics. Metrics are predefined data points that provide quantifiable information about your system's state, indicating when predefined thresholds are breached. These are typically numerical values, such as CPU utilization, memory usage, or network latency. Observability, on the other hand, focuses typically on logs and traces. Logs capture specific events and information that are essential for deep dives when investigating issues. These contain rich sources of context and detail, allowing you to reconstruct events or understand the flow of a process. Traces additionally provide a broader perspective. They follow a request's journey through your system, tracking its path across various services and components. Traces are particularly useful in identifying bottlenecks, latency issues, and uncovering the root causes of performance problems. Reactive vs. Proactive Management Focusing on predefined thresholds through metrics, monitoring predominantly adopts a reactive management approach. When a metric breaches these predefined limits, it offers quick responses to support a broader performance analysis strategy. This reactive nature of monitoring is ideal for addressing known problems promptly but may not be well-suited for handling complex and novel issues that require a more proactive and in-depth approach. 
While monitoring excels at handling known issues with predefined thresholds, observability extends the scope to tackle complex and novel performance challenges through proactive and comprehensive analysis. This dynamic, forward-looking approach helps constantly analyze data sources, looking for patterns and anomalies that might indicate performance issues, such as a subtle change in response times, a small increase in error rates, or any other deviations from the expected behavior. Observability then initiates a comprehensive investigation to understand the root causes and take corrective actions. Fixed Dashboards vs. Ad Hoc Queries Monitoring systems typically feature fixed dashboards to display a predefined set of metrics and performance indicators. Most modern monitoring tools can be configured with specific metrics and data points that are considered essential for tracking the system's well-being. The underlying metrics can be selected based on the historical understanding of the system and industry best practices. Although fixed dashboards are optimized to answer known questions efficiently, they lack the flexibility to address unforeseen or complex problems and may not provide the necessary data points to investigate effectively. Conversely, observability offers a dynamic and real-time approach to querying your system's performance data. These ad hoc queries can be tailored to specific, context-sensitive issues. The technical foundation of such queries lies in their ability to analyze vast amounts of data from diverse sources and over a rich dataset that includes metrics, logs, and traces. This flexible querying capability provides invaluable flexibility for troubleshooting new or unanticipated issues. When a previously unseen problem occurs, you can create custom queries to extract relevant data for detailed analysis. The following comparative table emphasizes how each set of key performance indicators (KPIs) aligns with the underlying philosophy and how monitoring and observability contribute to system management: MONITORING vs. OBSERVABILITY KPIs KPIs Monitoring Observability Primary objective Ensure system is functioning within set parameters Understand system behavior and identify anomalies Nature of data Metrics Metrics, logs, traces Key metrics CPU usage, memory usage, network latency Error rates, latency distribution, user behavior Data collection method Pre-defined data points Dynamic data points Scope Reactive: addresses known issues Proactive: explores known and unknown issues Visual representation Fixed dashboards Ad hoc queries, dynamic dashboards Alerts Threshold-based Anomaly-based Scale of measurement Usually single-dimension metrics Multi-dimensional metrics Table 2 Conclusion The strength of a perpetually observable system is its proactive nature. To harness the full potential of observability though, one must capture the right data — the kind that deciphers both predictable and unpredictable production challenges. Embrace a culture that emphasizes refining your application's instrumentation. A recommended approach is to set up a stack where any query about your app's performance gets its due response. It is also important to note that observability is an evolving process and not a one-time setup. As your application scales and changes, so should your instrumentation capabilities. 
This approach ensures that queries — whether they probe routine operations or unexpected anomalies — receive the informed responses that only a finely tuned, responsive observability framework can provide. To further strengthen your journey with observability, consider exploring these resources:
OpenTelemetry documentation
"Prioritizing Gartner's APM Model: The APM Conceptual Framework" by Larry Dragich
"Elevating System Management: The Role of Monitoring and Observability in DevOps" by Saurabh Pandey
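Before moving on, here is a hedged illustration of the alert-triggered workflow described in the Data-Driven Corrective Actions section above. Everything in it is hypothetical: the Alert shape, the query endpoint, and the query syntax are placeholders rather than any specific vendor's API.
Java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

/** Hypothetical glue code: a monitoring alert triggers a follow-up observability query. */
public class AlertTriggeredQuery {

    record Alert(String service, String metric, double value, double threshold) {}

    private static final HttpClient HTTP = HttpClient.newHttpClient();

    // Build an ad hoc log query scoped to the alerting service (placeholder query language).
    static String buildLogQuery(Alert alert) {
        return String.format("service=\"%s\" AND level=ERROR | stats count() by endpoint",
                alert.service());
    }

    static void onAlert(Alert alert) throws Exception {
        // 1. The monitoring system tells us *what* breached a threshold.
        System.out.printf("%s: %s=%.1f exceeded %.1f%n",
                alert.service(), alert.metric(), alert.value(), alert.threshold());

        // 2. Fire a predefined observability query to start answering *why*.
        HttpRequest request = HttpRequest.newBuilder(
                        URI.create("https://observability.example.com/api/query")) // placeholder endpoint
                .POST(HttpRequest.BodyPublishers.ofString(buildLogQuery(alert)))
                .build();
        HttpResponse<String> response = HTTP.send(request, HttpResponse.BodyHandlers.ofString());

        // 3. Attach the result to the incident so responders start with context, not just an alert.
        System.out.println("Correlated log summary: " + response.body());
    }
}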
Observability is essential for developing and running modern distributed applications, but fragmented tools and data often obstruct critical insights. AI and unified observability can overcome these challenges. Observability is crucial for modern software development, allowing developers to monitor, troubleshoot, and optimize complex distributed applications. However, many organizations struggle to achieve effective observability due to data silos, complex monitoring tools, and fragmented insights. Observe aims to overcome these challenges by providing a unified observability platform powered by a graph data layer. I recently spoke with Observe CEO Jeremy Burton to learn more about their approach and how it helps developers improve observability. Unifying Observability Data A key challenge with existing observability solutions is that data remains siloed across different tools. As Burton explained, traditionally, companies have used specialized tools for metrics, tracing, and logs that don't interoperate. This fragmentation forces developers, DevOps engineers, and SREs to manually piece together insights. Observe tackles this by ingesting and correlating all observability data — metrics, traces, and logs — into a single platform. Their graph data layer links related data points together, providing context and speeding up troubleshooting. Users can start from any data type and pivot across others for a unified view. According to Burton, "You can attack things at a different point of entry. You can attack things a bit more top-down and bottom-up." Rather than chasing down IDs across dashboards, developers can navigate by logical entities like customers and services. Observe maintains all raw data in an affordable object store data lake. However, their graph indexes and transforms commonly queried data for fast interactive analysis. This powers rapid troubleshooting while allowing users to fetch older data on demand. Optimizing Kubernetes and Cloud-Native Apps Observe provides extensive support for containerized and cloud-native environments like Kubernetes and AWS. The platform auto-discovers infrastructure topology and maps raw Kubernetes data into concepts like pods and containers. As Burton noted, "We transform the data into things that people recognize." This accelerates Kubernetes monitoring and troubleshooting by presenting data in familiar terms. Developers can go directly to impacted containers and services during incidents. Observe also auto-instruments customer applications by scanning code for context like customer IDs. Burton explained how this helped Topgolf quickly resolve issues with their games by linking logs to specific bays. These logical mappings simplify troubleshooting for distributed cloud-native apps. Leveraging AI and Machine Learning Observe uses AI techniques like conversational interfaces and code generation to enhance the user experience. Burton sees AI as the key to making observability feel more intuitive. Their O11y GPT chatbot leverages large language models to understand natural language queries, guide troubleshooting, and generate data transformations. Users can describe problems in plain terms rather than memorizing query syntax. Observe also trained Codex to automatically generate data parsing and analysis code in their Opal query language. This co-pilot capability allows engineers unfamiliar with Opal to be productive immediately. 
As Burton noted, modern applications have made troubleshooting highly complex, so AI can help "eliminate 130 minutes of difference" in the meantime to resolution. By leveraging machine learning to capture expertise, Observe aims to make observability more accessible. Improving Economics and Customer Experience While providing richer functionality, Observe is engineered for cloud scale and economics. Their cloud-native architecture takes advantage of affordable storage and compute. This allows retaining high-resolution observability data for up to 30 months to aid in deep troubleshooting. Observe also integrates tightly with collaboration tools like Slack. Burton explained how surfacing alerts in incident channels and providing an AI assistant improves coordination and reduces mean time to resolution. For customers like Blooma, Observe has delivered strong outcomes. Blooma's Director of Technical Operations, Jason Huling, reported dramatically faster troubleshooting and no platform degradation despite a 10x data increase. He attributed this to Observe's ease of use and stellar customer support. For customers like Reveal, Observe has delivered fast results. As Reveal's Director of Engineering, Stephen Montoya, noted, "We move so fast here like it's a rocket ship over here. We didn't have time to have to devote to really learning Observe. It was easy to learn right out of the box." He also praised Observe's stellar customer support. The Future of Observability When asked about the observability market outlook, Burton highlighted the potential for AI to redefine interactions and blur organizational barriers. He envisions developers initiating and driving incident response via collaboration tools, with machine learning suggesting fixes in real-time. Observe's investments in applied AI aim to make observability seamless. Burton believes this can reduce the skill gap by codifying tribal knowledge into systems engineers can conversationally query. Integrated and proactive observability will enable developers to focus on higher-value tasks. Overall, Observe's unified observability platform aims to help engineers better understand and optimize modern applications. Their innovative data architecture provides interconnected insight across metrics, traces, and logs. Combined with usability enhancements like AI, Observe strives to make observability effortless. This enables developers to spend less time firefighting and more time innovating.
My most-used Gen AI trick is the summarization of web pages and documents. Combined with semantic search, summarization means I waste very little time searching for the words and ideas I need when I need them. Summarization has become so important that I now use it as I write to ensure that my key points show up in ML summaries. Unfortunately, it’s a double-edged sword: will reliance on deep learning lead to an embarrassing, expensive, or career-ending mistake because the summary missed something, or worse, because the summary hallucinated? Fortunately, many years as a technology professional have taught me the value of risk management, and that is the topic of this article: identifying the risks of summarization and the (actually pretty easy) methods of mitigating those risks. Determining the Problem For most of software development history, it has been pretty easy to verify that our code worked as required. Software and computers are deterministic, finite state automata, i.e., they do what we tell them to do (barring cosmic rays or other sources of Byzantine failure). This made testing for correct behavior simple. Every possible unit test case could be handled by assertEquals(actual, expected), assertTrue, assertSame, assertNotNull, assertTimeout, and assertThrows. Even the trickiest dynamic string methods could be handled by assertTrue(string.contains(a) && string.contains(b) && string.contains(c) && string.contains(d)). But that was then. We now have large language models, which are fundamentally random systems. Not even the full alphabet of contains(a), contains(b), or contains(c) is up to the task of verifying the correct behavior of Gen AI when the response to an API call can vary by an unknowable degree. Neither JUnit nor NUnit nor PyUnit has assertMoreOrLessOK(actual, expected). And yet, we still have to test these Gen AI APIs and monitor them in production. Once your Gen AI feature is in production, traditional observability methods will not alert you to any of the potential failure modes described below. So, the problem is how to ensure that the content returned by Gen AI systems is consistent with expectations, and how can we monitor them in production? For that, we have to understand the many failure modes of LLMs. Not only do we have to understand them, we have to be able to explain them to our non-technical colleagues - before there’s a problem. LLM failure modes are unique and present some real challenges to observability. Let me illustrate with a recent example from OpenAI that wasn’t covered in the mainstream news but should have been. Three researchers from Stanford University and UC Berkeley had been monitoring ChatGPT to see if it would change over time, and it did. Problem: Just Plain Wrong In one case, the investigators repeatedly asked ChatGPT a simple question: Is 17,077 a prime number? Think step by step and then answer yes or no. ChatGPT responded correctly 98% of the time in March of 2023. Three months later, they repeated the test, but ChatGPT answered incorrectly 87% of the time! It should be noted that OpenAI released a new version of the API on March 14, 2023. Two questions must be answered: Did OpenAI know the new release had problems, and if so, why did they release it? If they didn’t know, then why not? This is just one example of the challenges you will face in monitoring generative AI. Even if you have full control of the releases, you have to be able to detect outright failures. The researchers have made their code and instructions available on GitHub, which is highly instructive.
They have also added some additional materials and an update. This is a great starting point if your use case requires factual accuracy. Problem: General Harms In addition to accuracy, it’s very possible for Generative AI to produce responses with harmful qualities such as bias or toxicity. HELM, the Holistic Evaluation of Language Models, is a living and rapidly growing collection of benchmarks. It can evaluate more than 60 public or open-source LLMs across 42 scenarios, with 59 metrics. It is an excellent starting point for anyone seeking to better understand the risks of language models and the degree to which various vendors are transparent about the risks associated with their products. Both the original paper and code are freely available online. Model Collapse is another potential risk; if it happens, the results will be known far and wide. Mitigation is as simple as ensuring you can return to the previous model. Some researchers claim that ChatGPT and Bard are already heading in that direction. Problem: Model Drift Why should you be concerned about drift? Let me tell you a story. OpenAI is a startup; the one thing a startup needs more than anything else is rapid growth. The user count exploded when ChatGPT was first released in December of 2022. Starting in June of 2023, however, the user count started dropping and continued to drop through the summer. Many pundits speculated that this had something to do with student users of ChatGPT taking the summer off, but commentators had no internal data from OpenAI, so speculation was all they could do. Understandably, OpenAI has not released any information on the cause of the drop. Now, imagine that this happens to you. One day, usage stats for your Gen AI feature start dropping. None of the other typical business data points to a potential cause. Only 4% of customers tend to complain, and your complaints haven’t increased. You have implemented excellent API and UX observability; neither response time nor availability shows any problems. What could be causing the drop? Do you have any gaps in your data? Model Drift is the gradual change in LLM responses due to changes in the data, the language model, or the cultures that provide the training data. The changes in LLM behavior may be hard to detect when looking at individual responses. Data drift refers to changes in the input data a model processes over time. Model drift refers to changes in the model's performance over time after it has been deployed and can result in:
Performance degradation: the model's accuracy decreases on the same test set due to data drift.
Behavioral drift: the model makes different predictions than it originally did, even on the same data.
However, drift can also refer to concept drift, which leads to models learning outdated or invalid conceptual assumptions, leading to incorrect modeling of the current language. It can cause failures on downstream tasks, like generating appropriate responses to customer messages. And the Risks? So far, the potential problems we have identified are failure and drift in the Generative AI system’s behavior, leading to unexpected outcomes. Unfortunately, it is not yet possible to categorically state what the risks to the business might be because nobody can determine beforehand what the possible range of responses might be with non-deterministic systems.
You will have to anticipate the potential risks on a Gen AI use-case-by-use-case basis: is your implementation offering financial advice or responding to customer questions for factual information about your products? LLMs are not deterministic; a statement that, hopefully, means more to you now than it did three minutes ago. This is another challenge you may have when it comes time to help non-technical colleagues understand the potential for trouble. The best thing to say about risk is that all the usual suspects are in play (loss of business reputation, loss of revenue, regulatory violations, security). Fight Fire With Fire The good news is that mitigating the risks of implementing Generative AI can be done with some new observability methods. The bad news is that you have to use machine learning to do it. Fortunately, it’s pretty easy to implement. Unfortunately, you can’t detect drift using your customer prompts - you must use a benchmark dataset. What You’re Not Doing This article is not about detecting drift in a model’s dataset - that is the responsibility of the model's creators, and the work to detect drift is serious data science. If you have someone on staff with a degree in statistics or applied math, you might want to attempt to detect drift using the method (maximum mean discrepancy) described in this paper: Uncovering Drift In Textual Data: An Unsupervised Method For Detecting And Mitigating Drift In Machine Learning Models What Are You Doing? You are trying to detect drift in a model’s behavior using a relatively small dataset of carefully curated text samples representative of your use case. Like the method above, you will use discrepancy, but not for an entire set. Instead, you will create a baseline collection of prompts and responses, with each prompt-response pair sent to the API 100 times, and then calculate the mean and variance for each prompt. Then, every day or so, you’ll send the same prompts to the Gen AI API and look for excessive variance from the mean. Again, it’s pretty easy to do. Let’s Code! Choose a language model to use when creating embeddings. It should be as close as possible to the model being used by your Gen AI API. You must have complete control over this model’s files, all of its configuration, and all of the supporting libraries that are used when embeddings are created and when similarity is calculated. This model becomes your reference: the equivalent of the 1 kg sphere of pure silicon that serves as a global standard of mass. Java Implementation The how-do-I-do-this-in-Java experience for me, a 20-year veteran of Java coding, was painful until I sorted out the examples from the Deep Java Library (DJL). Unfortunately, DJL has a very limited list of native language models available compared to Python. Though somewhat over-engineered, the Java code is almost as pithy as the Python equivalent. It breaks down into three pieces: setup of the LLM used to create sentence embedding vectors; code to create the text embedding vectors and compare the semantic similarity between two texts; and the function that calculates the semantic similarity.
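A hedged sketch of those pieces follows. The EmbeddingModel interface and its embed() method are stand-ins for however you load and invoke your reference model (the article uses DJL); only the similarity math is shown concretely.
Java
/** Sketch of the semantic-similarity helper described above. Model loading is elided. */
public class SemanticSimilarity {

    /** Placeholder for the reference embedding model (e.g., loaded via DJL). */
    interface EmbeddingModel {
        float[] embed(String text);
    }

    /** Cosine similarity between two embedding vectors: ~1.0 = same meaning, ~0 = unrelated. */
    static double cosineSimilarity(float[] a, float[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /** Compare the semantic similarity between two texts using the reference model. */
    static double similarity(EmbeddingModel model, String text1, String text2) {
        return cosineSimilarity(model.embed(text1), model.embed(text2));
    }
}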
Put It All Together As mentioned earlier, the goal is to be able to detect drift in individual responses. Depending on your use case and the Gen AI API you’re going to use, the number of benchmark prompts, the number of responses that form the baseline, and the rate at which you sample the API will vary. The steps go like this:
1. Create a baseline set of prompts and Gen AI API responses that are strongly representative of your use case: 10, 100, or 1,000. Save these in Table A.
2. Create a baseline set of responses: send each of the prompts to the API 10, 50, or 100 times over a few days to a week, and save the text responses in Table B.
3. Calculate the similarity between the baseline responses: for each baseline response, calculate the similarity between it and the corresponding response in Table A. Save these similarity values with each response in Table B.
4. Calculate the mean, variance, and standard deviation of the similarity values in Table B and store them in Table A.
5. Begin the drift detection runs: perform the same steps as in step 1 every day or so and save the results in Table C.
6. At the end of each detection run, calculate the similarity between the latest responses in Table C and the baseline responses in Table A.
7. When all the similarities have been calculated, look for any outside the original variance.
8. For those responses with excessive variance, review the original prompt, the original response from Table A, and the latest response in Table C. Is there enough of a difference in the meaning of the latest response? If so, your Gen AI API model may be drifting away from what the product owner expects; chat with them about it.
(A minimal sketch of this comparison logic appears at the end of the article.)
Result The data, when collected and charted, should look something like this: The chart shows the result of a benchmark set of 125 prompts sent to the API 100 times over one week - the Baseline samples. The mean similarity for each prompt was calculated and is represented by the points in the Baseline line and mean plot. The latest run of the same 125 benchmark samples was sent to the API yesterday. Their similarity was calculated against the baseline mean values - the Latest samples. The responses of individual samples that seem to vary quite a bit from the mean are reviewed to see if there is any significant semantic discrepancy with the baseline response. If that happens, review your findings with the product owner. Conclusion Non-deterministic software will continue to be a challenge for engineers to develop, test, and monitor until the day that the big AI brain takes all of our jobs. Until then, I hope I have forewarned and forearmed you with clear explanations and easy methods to keep you smiling during your next Gen AI incident meeting. And, if nothing else, this article should help you to make the case for hiring your own data scientist. If that’s not in the cards, then… math?
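As promised, here is a hedged sketch of the comparison logic that ties the steps and the chart together. The table names follow the Table A/B/C convention above, the two-standard-deviation cutoff is an arbitrary illustration, and the similarity() helper is the one sketched earlier; how you persist the tables is up to you.
Java
import java.util.List;
import java.util.Map;

/** Sketch of a drift-detection run, following the Table A/B/C convention above. */
public class DriftCheck {

    /** Baseline stats stored in "Table A" for one benchmark prompt. */
    record Baseline(String prompt, String baselineResponse, double meanSimilarity, double stdDev) {}

    /** Compare the latest responses ("Table C") against the baseline and flag outliers. */
    static void flagDrift(List<Baseline> tableA, Map<String, String> tableC,
                          SemanticSimilarity.EmbeddingModel model) {
        for (Baseline b : tableA) {
            String latest = tableC.get(b.prompt());
            if (latest == null) continue;
            double sim = SemanticSimilarity.similarity(model, b.baselineResponse(), latest);
            if (Math.abs(sim - b.meanSimilarity()) > 2 * b.stdDev()) {
                // Excessive variance: review the prompt and both responses with the product owner.
                System.out.printf("Possible drift for prompt '%s' (similarity %.3f vs baseline mean %.3f)%n",
                        b.prompt(), sim, b.meanSimilarity());
            }
        }
    }
}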
In enterprises, SREs, DevOps, and cloud architects often discuss which platform to choose for observability, for faster troubleshooting of issues and a better understanding of the performance of their production systems. There are certain questions they need to answer to get maximum value for their team, such as: Will an observability tool support all kinds of workloads and heterogeneous systems? Will the tool support all kinds of data aggregation, such as logs, metrics, traces, topology, etc.? Will the investment in the (ongoing or new) observability tool be justified? In this article, we will provide the best way to get started with unified observability of your entire infrastructure using open-source Skywalking and the Istio service mesh. Istio Service Mesh of Multi-Cloud Application Let us take the example of a multi-cloud application where multiple services are hosted on on-prem or managed Kubernetes clusters. The first step toward unified observability will be to form a service mesh using Istio. The idea is that all the services or workloads in Kubernetes clusters (or VMs) should be accompanied by an Envoy proxy to abstract the security and networking out of the business logic. As you can see in the below image, a service mesh is formed, and the network communication between the edge and workloads, among workloads, and between clusters is controlled by the Istio control plane. In this case, the Istio service mesh emits logs, metrics, and traces for each Envoy proxy, which helps achieve unified observability. We need a visualization tool like Skywalking to collect the data and populate it for granular observability. Why Skywalking for Observability SREs from large companies such as Alibaba, Lenovo, ABInBev, and Baidu use Apache Skywalking, and the common reasons are: Skywalking aggregates logs, metrics, traces, and topology. It natively supports popular service mesh software like Istio. While other tools may not support getting data from Envoy sidecars, Skywalking supports sidecar integration. It supports OpenTelemetry (OTel) standards for observability. These days, OTel standards and instrumentation are popular for MTL (metrics, logs, traces). Skywalking supports observability-data collection from almost all the elements of the full stack: database, OS, network, storage, and other infrastructure. It is open source and free (with an affordable enterprise version). Now, let us see how to integrate Istio and Apache Skywalking into your enterprise. Steps To Integrate Istio and Apache Skywalking We have created a demo to establish the connection between the Istio data plane and Skywalking, where Skywalking will collect data from Envoy sidecars and populate it in the observability dashboards. Note: By default, Skywalking comes with predefined dashboards for Apache APISIX and AWS Gateways. Since we are using the Istio Gateway, it will not get a dedicated dashboard out of the box, but we’ll get metrics for it in other locations. If you want to watch the video, check out my latest Istio-Skywalking configuration video. You can refer to the GitHub link here. Step 1: Add Kube-State-Metrics to Collect Metrics From the Kubernetes API Server We have installed the kube-state-metrics service to listen to the Kubernetes API server and send those metrics to Apache Skywalking. First, add the Prometheus community repo: Shell helm repo add prometheus-community https://prometheus-community.github.io/helm-charts Then fetch the latest charts: Shell helm repo update
And now you can install kube-state-metrics. Shell helm install kube-state-metrics prometheus-community/kube-state-metrics Step 2: Install Skywalking Using Helm Charts We will install Skywalking version 9.2.0 for this observability demo. You can run the following command to install Skywalking into a namespace (my namespace is skywalking). You can refer to the values.yaml. Shell helm install skywalking oci://registry-1.docker.io/apache/skywalking-helm -f values.yaml -n skywalking (Optional reading) In the helm chart values.yaml, you will notice that: We have set the oap (observability analysis platform, i.e., the back end) and ui flags to true. Similarly, for databases, we have enabled postgresql as true. For tracking metrics from Envoy access logs, we have configured the following environment variables:
SW_ENVOY_METRIC: default
SW_ENVOY_METRIC_SERVICE: true
SW_ENVOY_METRIC_ALS_HTTP_ANALYSIS: k8s-mesh,mx-mesh,persistence
SW_ENVOY_METRIC_ALS_TCP_ANALYSIS: k8s-mesh,mx-mesh,persistence
This is to select the logs and metrics from Envoy via the Istio configuration (the two ALS_*_ANALYSIS entries are the rules for analyzing Envoy access logs). We will enable the OpenTelemetry receiver and configure it to receive data in otlp format. We will also enable OTel rules according to the data we will send to Skywalking. In a few moments (in Step 3), we will configure the OTel collector to scrape istiod, k8s, kube-state-metrics, and the Skywalking OAP itself. So, we have enabled the appropriate rules:
SW_OTEL_RECEIVER: default
SW_OTEL_RECEIVER_ENABLED_HANDLERS: "otlp"
SW_OTEL_RECEIVER_ENABLED_OTEL_RULES: "istio-controlplane,k8s-cluster,k8s-node,k8s-service,oap"
SW_TELEMETRY: prometheus
SW_TELEMETRY_PROMETHEUS_HOST: 0.0.0.0
SW_TELEMETRY_PROMETHEUS_PORT: 1234
SW_TELEMETRY_PROMETHEUS_SSL_ENABLED: false
SW_TELEMETRY_PROMETHEUS_SSL_KEY_PATH: ""
SW_TELEMETRY_PROMETHEUS_SSL_CERT_CHAIN_PATH: ""
We have instructed Skywalking to collect data from the Istio control plane, Kubernetes cluster, nodes, services, and also the OAP (Observability Analysis Platform by Skywalking). (The SW_TELEMETRY* settings enable Skywalking OAP's self-observability, meaning it will expose Prometheus-compatible metrics at port 1234 with SSL disabled. Again, in Step 3, we will configure the OTel collector to scrape this endpoint.) In the helm chart, we have also enabled the creation of a service account for Skywalking OAP. Step 3: Setting Up Istio + Skywalking Configuration After that, we can install Istio using this IstioOperator configuration. In the IstioOperator configuration, we have set up the meshConfig so that every sidecar will have the Envoy access log service enabled, and we have set the address for the access log service and the metrics service to Skywalking. Additionally, with the proxyStatsMatcher, we are configuring all metrics to be sent via the metrics service. YAML
meshConfig:
  defaultConfig:
    envoyAccessLogService:
      address: "skywalking-skywalking-helm-oap.skywalking.svc:11800"
    envoyMetricsService:
      address: "skywalking-skywalking-helm-oap.skywalking.svc:11800"
    proxyStatsMatcher:
      inclusionRegexps:
        - .*
  enableEnvoyAccessLogService: true
Step 4: OpenTelemetry Collector Once the Istio and Skywalking configuration is done, we need to feed metrics from applications, gateways, nodes, etc., to Skywalking. We have used the opentelemetry-collector.yaml to scrape the Prometheus-compatible endpoints. In the collector, we have mentioned that OpenTelemetry will scrape metrics from istiod, the Kubernetes cluster, kube-state-metrics, and Skywalking.
We have created a service account for OpenTelemetry. Using opentelemetry-serviceaccount.yaml, we have set up a service account and declared a ClusterRole and ClusterRoleBinding to define which actions the OpenTelemetry service account can take on various resources in our Kubernetes cluster. Once you deploy opentelemetry-collector.yaml and opentelemetry-serviceaccount.yaml, there will be data flowing into Skywalking from Envoy, the Kubernetes cluster, kube-state-metrics, and Skywalking (OAP). Step 5: Observability of Kubernetes Resources and Istio Resources in Skywalking To check the Skywalking UI, port-forward the Skywalking UI service to a port (say 8080). Run the following command: Shell kubectl port-forward svc/skywalking-skywalking-helm-ui -n skywalking 8080:80 You can open the Skywalking UI service at localhost:8080. (Note: For setting up load to services and seeing the behavior and performance of apps, the cluster, and the Envoy proxies, check out the full video.) Once you are on the Skywalking UI (refer to the image below), you can select service mesh in the left-side menu and select control plane or data plane. Skywalking provides all the resource consumption and observability data of the Istio control plane and data plane, respectively. The Skywalking Istio-dataplane view provides info about all the Envoy proxies attached to services. Skywalking provides metrics, logs, and traces of all the Envoy proxies. Refer to the below image, where all the observability details are displayed for just one service proxy. Skywalking provides the resource consumption of Envoy proxies in various namespaces. Similarly, Skywalking also provides all the observable data of the Istio control plane. Note: in case you have multiple control planes in different namespaces (or in multiple clusters), you just need to give them access to the Skywalking OAP service. Skywalking provides Istio control-plane details like metrics, the number of pilot pushes, ADS monitoring, etc. Apart from the Istio service mesh, we also configured Skywalking to fetch information about the Kubernetes cluster. You can see in the below image that Skywalking provides all the info for a Kubernetes dashboard, such as the number of nodes, pods, K8s deployments, services, containers, etc. You also get the respective resource utilization metrics of each K8s resource in the same dashboard. Skywalking provides holistic information about a Kubernetes cluster. Similarly, you can drill further down into a service in the Kubernetes cluster and get granular information about its behavior and performance (refer to the below images). Benefits of Istio-Skywalking Integration There are several benefits of integrating Istio and Apache Skywalking for unified observability:
Ensure 100% visibility of the technology stack, including apps, sidecars, network, database, OS, etc.
Reduce the time to find the root cause (MTTR) of issues or anomalies in production by 90% with faster troubleshooting.
Save approximately ~$2M of lifetime spend on closed-source solutions, complex pricing, and custom integrations.
Joana Carvalho
Site Reliability Engineering,
Virtuoso
Eric D. Schabell
Director Technical Marketing & Evangelism,
Chronosphere
Chris Ward
Zone Leader,
DZone