Maintenance Resources

DZone's Featured Maintenance Resources

Scaling SRE Teams

By Stelios Manioudakis

This is an article from DZone's 2023 Observability and Application Performance Trend Report.For more: Read the Report From cultural and structural challenges within an organization to balancing daily work and dividing it between teams and individuals, scaling teams of site reliability engineers (SREs) comes with many challenges. However, fostering a resilient site reliability engineering (SRE) culture can facilitate the gradual and sustainable growth of an SRE team. In this article, we explore the challenges of scaling and review a successful scaling framework. This framework is suitable for guiding emerging teams and startups as they cultivate an evolving SRE culture, as well as for established companies with firmly entrenched SRE cultures. The Challenges of Scaling SRE Teams As teams scale, complexity may increase as it can be more difficult to communicate, coordinate, and maintain a team's coherence. Below is a list of challenges to consider as your team and/or organization grows: Rapid growth – Rapid growth leads to more complex systems, which can outpace the capacity of your SRE team, leading to bottlenecks and reduced reliability. Knowledge-sharing – Maintaining a shared understanding of systems and processes may become difficult, making it challenging to onboard new team members effectively. Tooling and automation – Scaling without appropriate tooling and automation can lead to increased manual toil, reducing the efficiency of the SRE team. Incident response – Coordinating incident responses can become more challenging, and miscommunications or delays can occur. Maintaining a culture of innovation and learning – This can be challenging as SREs may become more focused on solving critical daily problems and less focused on new initiatives. Balancing operational and engineering work – Since SREs are responsible for both operational tasks and engineering work, it is important to ensure that these teams have enough time to focus on both areas. A Framework for Scaling SRE Teams Scaling may come naturally if you do the right things in the right order. First, you must identify what your current state is in terms of infrastructure. How well do you understand the systems? Determine existing SRE processes that need improvement. For the SRE processes that are necessary but are not employed yet, find the tools and the metrics necessary to start. Collaborate with the appropriate stakeholders, use feedback, iterate, and improve. Step 1: Assess Your Current State Understand your system and create a detailed map of your infrastructure, services, and dependencies. Identify all the components in your infrastructure, including servers, databases, load balancers, networking equipment, and any cloud services you utilize. It is important to understand how these components are interconnected and dependent on each other — this includes understanding which services rely on others and the flow of data between them. It's also vital to identify and evaluate existing SRE practices and assess their effectiveness: Analyze historical incident data to identify recurring issues and their resolutions. Gather feedback from your SRE team and other relevant stakeholders. Ask them about pain points, challenges, and areas where improvements are needed. Assess the performance metrics related to system reliability and availability. Identify any trends or patterns that indicate areas requiring attention. Evaluate how incidents are currently being handled. Are they being resolved efficiently? Are post-incident reviews being conducted effectively to prevent recurrences? Step 2: Define SLOs and Error Budgets Collaborate with stakeholders to establish clear and meaningful service-level objectives (SLOs) by determining the acceptable error rate and creating error budgets based on the SLOs. SLOs and error budgets can guide resource allocation optimization. Computing resources can be allocated to areas that directly impact the achievement of the SLOs. SLOs set clear, achievable goals for the team and provide a measurable way to assess the reliability of a service. By defining specific targets for uptime, latency, or error rates, SRE teams can objectively evaluate whether the system is meeting the desired standards of performance. Using specific targets, a team can prioritize their efforts and focus on areas that need improvement, thus fostering a culture of accountability and continuous improvement. Error budgets provide a mechanism for managing risk and making trade-offs between reliability and innovation. They allow SRE teams to determine an acceptable threshold for service disruptions or errors, enabling them to balance the need for deploying new features or making changes to maintain a reliable service. Step 3: Build and Train Your SRE Team Identify talent according to the needs of each and every step of this framework. Look for the right skillset and cultural fit, and be sure to provide comprehensive onboarding and training programs for new SREs. Beware of the golden rule that culture eats strategy for breakfast: Having the right strategy and processes is important, but without the right culture, no strategy or process will succeed in the long run. Step 4: Establish SRE Processes, Automate, Iterate, and Improve Implement incident management procedures, including incident command and post-incident reviews. Define a process for safe and efficient changes to the system. Figure 1: Basic SRE process One of the cornerstones of SRE involves how to identify and handle incidents through monitoring, alerting, remediation, and incident management. Swift incident identification and management are vital in minimizing downtime, which can prevent minor issues from escalating into major problems. By analyzing incidents and their root causes, SREs can identify patterns and make necessary improvements to prevent similar issues from occurring in the future. This continuous improvement process is crucial for enhancing the overall reliability and performance whilst ensuring the efficiency of systems at scale. Improving and scaling your team can go hand in hand. Monitoring Monitoring is the first step in ensuring the reliability and performance of a system. It involves the continuous collection of data about the system's behavior, performance, and health. This can be broken down into: Data collection – Monitoring systems collect various types of data, including metrics, logs, and traces, as shown in Figure 2. Real-time observability – Monitoring provides real-time visibility into the system's status, enabling teams to identify potential issues as they occur. Proactive vs. reactive – Effective monitoring allows for proactive problem detection and resolution, reducing the need for reactive firefighting. Figure 2: Monitoring and observability Alerting This is the process of notifying relevant parties when predefined conditions or thresholds are met. It's a critical prerequisite for incident management. This can be broken down into: Thresholds and conditions – Alerts are triggered based on predefined thresholds or conditions. For example, an alert might be set to trigger when CPU usage exceeds 90% for five consecutive minutes. Notification channels – Alerts can be sent via various notification channels, including email, SMS, or pager, or even integrated into incident management tools. Severity levels – Alerts should be categorized by severity levels (e.g., critical, warning, informational) to indicate the urgency and impact of the issue. Remediation This involves taking actions to address issues detected through monitoring and alerting. The goal is to mitigate or resolve problems quickly to minimize the impact on users. Automated actions – SRE teams often implement automated remediation actions for known issues. For example, an automated scaling system might add more resources to a server when CPU usage is high. Playbooks – SREs follow predefined playbooks that outline steps to troubleshoot and resolve common issues. Playbooks ensure consistency and efficiency during remediation efforts. Manual interventions – In some cases, manual intervention by SREs or other team members may be necessary for complex or unexpected issues. Incident Management Effective communication, knowledge-sharing, and training are crucial during an incident, and most incidents can be reproduced in staging environments for training purposes. Regular updates are provided to stakeholders, including users, management, and other relevant teams. Incident management includes a culture of learning and continuous improvement: The goal is not only to resolve the incident but also to prevent it from happening again. Figure 3: Handling incidents A robust incident management process ensures that service disruptions are addressed promptly, thus enhancing user trust and satisfaction. In addition, by effectively managing incidents, SREs help preserve the continuity of business operations and minimize potential revenue losses. Incident management plays a vital role in the scaling process since it establishes best practices and promotes collaboration, as shown in Figure 3. As the system scales, the frequency and complexity of incidents are likely to increase. A well-defined incident management process enables the SRE team to manage the growing workload efficiently. Conclusion SRE is an integral part of the SDLC. At the end of the day, your SRE processes should be integrated into the entire process of development, testing, and deployment, as shown in Figure 4. Figure 4: Holistic view of development, testing, and the SRE process Iterating on and improving the steps above will inevitably lead to more work for SRE teams; however, this work can pave the way for sustainable and successful scaling of SRE teams at the right pace. By following this framework and overcoming the challenges, you can effectively scale your SRE team while maintaining system reliability and fostering a culture of collaboration and innovation. Remember that SRE is an ongoing journey, and it is essential to stay committed to the principles and practices that drive reliability and performance. This is an article from DZone's 2023 Observability and Application Performance Trend Report.For more: Read the Report More

Infrastructure as Code: The Evolution of Cloud Infrastructure Management

By Aditya Bhuyan

Businesses are increasingly depending on cloud-based services to improve efficiency, increase scalability, and streamline operations in the quickly developing digital age. The requirement for efficient resource management has multiplied as the cloud has become a crucial part of contemporary IT infrastructures. Let us introduce Infrastructure as Code (IaC), a ground-breaking method for managing infrastructure that will fundamentally alter how we deploy and manage cloud resources. Infrastructure as Code has emerged as a pillar of contemporary cloud infrastructure management, allowing businesses to increase automation, efficiency, and scalability while lowering operational risks and complexity related to manual configurations. What Is Infrastructure as Code? Using code, developers and system administrators can manage and provision cloud resources using the same software engineering techniques they would use to manage and provision any other kind of software application. IaC enables teams to define and manage these resources through declarative or imperative code, which can then be version-controlled, tested, and automatically deployed. This eliminates the need for teams to manually configure servers, networking, databases, and other infrastructure components. Infrastructure as Code (IaC) is a practice and approach used in software engineering for managing and provisioning infrastructure resources. IaC enables developers and system administrators to define and manage their infrastructure through code, just like any other software application, as opposed to manually configuring servers, networks, databases, and other components. The idea of infrastructure as a service emerged from the growing demand for quicker, more dependable, and scalable infrastructure management, particularly in cloud-based environments. Traditional manual configuration techniques become ineffective and error-prone as businesses increasingly use virtualized infrastructure and cloud services. By introducing automation, version control, and consistency to infrastructure provisioning, IaC resolves these issues. Key Components of Infrastructure as Code Code: In IaC, infrastructure is represented as code using a programming language or a configuration file format. This code defines the desired state of the infrastructure, specifying the resources, their configurations, and the relationships between them. Version Control: Treating infrastructure as code enables version control, similar to how developers manage software code. This means that changes to the infrastructure can be tracked, reviewed, and rolled back as needed, promoting collaboration and accountability. Automation: One of the core tenets of IaC is automation. By using code to define infrastructure, repetitive and manual tasks can be automated, reducing the risk of human error and saving time and effort. Declarative or Imperative Paradigm: IaC supports both declarative and imperative approaches. In the declarative approach, developers specify the desired end state of the infrastructure, and the system figures out the steps to reach that state. In contrast, the imperative approach involves defining the exact sequence of steps needed to achieve the desired infrastructure state. How Infrastructure as Code Works: Definition: Infrastructure is defined using code written in a domain-specific language or configuration file format. Popular languages and formats include HashiCorp Configuration Language (HCL) for Terraform, JSON or YAML for AWS CloudFormation Azure Resource Manager templates, and Python for Ansible. Version Control and Collaboration: The infrastructure code is stored in version control systems like Git, enabling collaboration among team members, tracking changes, and providing a history of modifications. Continuous Integration and Deployment (CI/CD): IaC is often integrated into CI/CD pipelines, where changes to infrastructure code are automatically tested and deployed to various environments, ensuring consistency and reliability across the development lifecycle. Orchestration and Automation: IaC tools interact with cloud providers’ APIs to create, modify, or delete resources as specified in the infrastructure code. These tools handle resource dependencies, orchestrate the provisioning process, and apply the necessary changes to achieve the desired infrastructure state. Key Principles of Infrastructure as Code Declarative vs. Imperative: IaC supports both declarative and imperative paradigms. In the declarative approach, developers specify the desired end state of the infrastructure, and the system figures out the steps to reach that state. In contrast, the imperative approach involves defining the exact sequence of steps needed to achieve the desired infrastructure state. Version Control: IaC treats infrastructure configurations as code, making it amenable to version control systems like Git. This allows teams to track changes, roll back to previous versions if necessary, and collaborate more effectively. Automation: Automation is a core tenet of Infrastructure as Code. By scripting the infrastructure provisioning process, repetitive tasks can be eliminated, reducing the risk of human error and saving valuable time and resources. Reproducibility: Since the infrastructure is defined in code, it can be reproduced consistently across different environments, such as development, testing, and production. This ensures consistency and reduces the “works on my machine” problem. Infrastructure as Code (IaC) involves managing and provisioning infrastructure resources using code. There are two main paradigms within IaC, each with its key principles: Declarative Approach: In the declarative approach to IaC, developers specify the desired end-state of the infrastructure without explicitly defining the steps to achieve it. The system determines how to reach the desired state based on the provided configuration. Some key principles of the declarative approach include: A. Idempotency: The ability to apply the same configuration repeatedly without causing unintended side effects. When applied multiple times, the result remains the same, ensuring consistency and avoiding unnecessary changes. B. State Management: IaC tools maintain a state file that keeps track of the current infrastructure configuration. This file allows the tool to understand the differences between the desired state and the current state, enabling it to apply only the necessary changes. C. Immutable Infrastructure: In the declarative approach, infrastructure is considered immutable. Instead of modifying existing resources, new resources are created with updated configurations and old resources are replaced. This practice ensures predictability and simplifies rollback processes. D. Dependency Management: Declarative IaC tools handle dependencies between resources automatically. They understand the relationships between different infrastructure components and manage their order of provisioning accordingly. Imperative Approach: In the imperative approach to IaC, developers explicitly define the steps and procedures to achieve the desired infrastructure state. They specify the exact sequence of actions required to create, modify, or delete resources. Some key principles of the imperative approach include: A. Procedural Definition: Developers define the specific steps and commands required to configure the infrastructure. This approach can resemble traditional scripting or programming. B. Flexibility and Control: The imperative approach provides more granular control over the configuration process, allowing developers to manage complex scenarios that may not be easily achievable using a declarative approach. C. Responsibility for Dependencies: In contrast to the declarative approach, imperative IaC requires developers to manage resource dependencies explicitly. This can be both an advantage and a challenge, as it provides more control but may also lead to more complex and error-prone configurations. It’s worth noting that many IaC tools support both declarative and imperative paradigms, allowing users to choose the most appropriate approach for their specific use cases. Additionally, some tools offer a mix of both paradigms, allowing developers to use declarative code for some resources and imperative code for others within the same configuration. Regardless of the approach chosen, the fundamental principles of Infrastructure as Code emphasize automation, version control, consistency, and reproducibility. These principles promote efficient infrastructure management, reduce manual errors, enhance collaboration, and enable organizations to scale and adapt their infrastructure more effectively in the dynamic and rapidly changing cloud environment. Benefits of Infrastructure as Code Infrastructure as Code (IaC) offers a wide range of benefits that significantly improve the management and deployment of cloud infrastructure. Let’s explore some of the key advantages of adopting IaC practices: Agility and Speed: IaC enables rapid and automated provisioning of infrastructure resources. With just a few lines of code, teams can create and configure complex environments, reducing the time required for deployment and accelerating the software development lifecycle. This agility allows businesses to respond quickly to changing requirements and market demands. Consistency and Standardization: Manual infrastructure setup can lead to inconsistencies and configuration drift between environments, causing issues during deployments. IaC ensures that infrastructure is defined and managed consistently across development, testing, and production environments. This standardization improves reliability and reduces the likelihood of errors caused by differences between setups. Version Control and Collaboration: IaC treats infrastructure configurations as code, making it amenable to version control systems like Git. This allows teams to track changes, collaborate effectively, and maintain a history of infrastructure modifications. Developers can work in parallel, review changes, and roll back to previous versions if needed, promoting collaboration and accountability. Cost Efficiency: Traditional infrastructure provisioning might lead to over-provisioning of resources or leave resources idle when not in use. IaC allows for better resource management, ensuring that resources are provisioned only when required and de-provisioned when they are no longer needed. This optimization can result in significant cost savings for organizations. Improved Quality and Reduced Errors: Manual configuration is prone to human errors, which can lead to costly outages and downtime. IaC reduces the risk of misconfigurations by automating the setup process and applying consistent configurations. Automated testing can also be integrated into the deployment pipeline to catch issues before they affect the production environment, thereby improving the overall quality of the infrastructure. Scalability and Flexibility: IaC makes it easy to scale infrastructure up or down based on demand. Auto-scaling configurations can be included in the code to automatically adjust resource capacity according to traffic or workload fluctuations. This elasticity allows organizations to handle varying levels of usage efficiently. Disaster Recovery and Reproducibility: IaC allows organizations to recreate entire infrastructures quickly in the event of a disaster. By keeping the infrastructure configuration as code, disaster recovery processes become more reliable and straightforward. Additionally, the ability to reproduce environments accurately ensures consistency between different stages of development and reduces the likelihood of deployment issues. Vendor-Agnostic and Multi-Cloud Support: Many IaC tools support multiple cloud providers, making it easier to manage infrastructure across different platforms without vendor lock-in. This flexibility allows organizations to choose the best services from various cloud providers while maintaining a consistent management approach. Security and Compliance: IaC encourages the use of best practices and security standards from the outset. By codifying security configurations, access controls, and compliance requirements, organizations can enforce security measures consistently across their entire infrastructure. Documentation and Self-Documentation: IaC code serves as an up-to-date and comprehensive documentation of the infrastructure setup. This self-documentation helps new team members quickly understand the infrastructure architecture and reduces reliance on outdated, separate documentation. Popular Infrastructure as Code Tools: The popularity of Infrastructure as Code (IaC) tools was steadily growing, and several tools had gained significant traction in the industry. While this landscape may have evolved further since then, here are some popular IaC tools that were widely used at that time: Terraform: Terraform, developed by HashiCorp, was one of the most popular and versatile IaC tools. It allowed users to define, manage, and provision infrastructure resources across various cloud providers and even on-premises environments. Terraform’s declarative configuration language, HashiCorp Configuration Language (HCL), made it easy to describe complex infrastructure setups and maintain state information. AWS CloudFormation: As a native IaC tool provided by Amazon Web Services (AWS), CloudFormation was widely used by AWS customers to define and deploy infrastructure resources in an automated and repeatable manner. Users could create CloudFormation templates using JSON or YAML to describe AWS resources and their relationships. Azure Resource Manager (ARM) Templates: For Microsoft Azure users, ARM Templates served as the IaC tool of choice. These templates used JSON to define the Azure resources and configurations required for an application or infrastructure deployment. Google Cloud Deployment Manager: For those leveraging Google Cloud Platform (GCP), Google Cloud Deployment Manager offered an IaC solution. It allowed users to define and deploy GCP resources using YAML or Python templates. Ansible: Although Ansible is primarily known as a configuration management and automation tool, it was also commonly used for IaC purposes. Ansible used YAML-based “playbooks” to describe and automate the provisioning and configuration of infrastructure resources. Pulumi: Pulumi was an IaC tool that aimed to provide a modern infrastructure-as-code platform. It supported multiple cloud providers and programming languages, allowing users to define infrastructure using familiar programming languages like Python, JavaScript, TypeScript, and more. Chef: Similar to Ansible, Chef was initially developed as a configuration management tool but had capabilities for infrastructure automation. It used Ruby-based “cookbooks” to manage infrastructure configurations and deployments. SaltStack: SaltStack, often referred to as Salt, was another popular configuration management and infrastructure automation tool that allowed users to manage infrastructure resources using a declarative approach. Jenkins: Continuous integration and continuous deployment (CI/CD) pipelines frequently used Jenkins, a well-known open-source automation server. Although not exclusively an IaC tool, Jenkins integrated well with other IaC tools and was used in conjunction with them to achieve complete automation. Cloud DevOps Platforms: In addition to individual IaC tools, cloud DevOps platforms like AWS CodePipeline, Azure DevOps, and Google Cloud Build were also gaining popularity. These platforms offered end-to-end CI/CD solutions integrated with multiple IaC tools and other DevOps services. It’s importaant to note that the IaC landscape is continually evolving, and new tools may have emerged or gained prominence since my last update. As such, it’s always a good idea to conduct further research and keep an eye on the latest trends and developments in the field of Infrastructure as Code. Conclusion In the field of managing cloud infrastructure, Infrastructure as Code has emerged as a game-changer. IaC gives organizations the power to achieve higher levels of automation, scalability, and reliability by treating infrastructure as software, which ultimately results in shorter development cycles and greater overall efficiency. Infrastructure as Code will continue to be a vital resource for creating and maintaining the infrastructure of the future as businesses continue to adopt the cloud. In conclusion, Infrastructure as Code offers a revolutionary way to manage cloud infrastructure, offering many advantages like speed, consistency, cost-effectiveness, scalability, and improved security. IaC has evolved into a critical procedure for achieving these goals successfully as organizations embrace the cloud and seek to streamline their operations and development. More

What to Do if You Expose a Secret: How to Stay Calm and Respond to an Incident

By Dwayne McDaniel

Keeping AI Infrastructure Costs Down With API Governance

By Rachael Kiselev

Handling Errors and Maintaining Data Integrity in ETL Processes

By Eduardo Moore

Gossip Protocol in Social Media Networks: Instagram and Beyond

Gossip protocol is a communication scheme used in distributed systems for efficiently disseminating information among nodes. It is inspired by the way people gossip, where information spreads through a series of casual conversations. This article will discuss the gossip protocol in detail, followed by its potential implementation in social media networks, including Instagram. We will also include code snippets to provide a deeper technical understanding. Gossip Protocol The gossip protocol is based on an epidemic algorithm that uses randomized communication to propagate information among nodes in a network. The nodes exchange information about their state and the state of their neighbors. This process is repeated at regular intervals, ensuring that the nodes eventually become aware of each other's states. The key features of gossip protocol include: Fault-tolerance: The protocol can handle node failures effectively, as it does not rely on a central authority or a single point of failure. Scalability: Gossip protocol can efficiently scale to large networks with minimal overhead. Convergence: The system converges to a consistent state quickly, even in the presence of failures or network delays. Gossip Protocol in Social Media Networks: Instagram Social media networks are distributed systems that need to handle massive amounts of data and user interactions. One of the critical aspects of such networks is efficiently propagating updates and notifications to users. Gossip protocol can be employed to achieve this goal by allowing user nodes to exchange information about their state and the state of their connections. For instance, consider Instagram, a social media platform where users can post photos and follow other users. When a user posts a new photo, it needs to be propagated to all their followers. Using the gossip protocol, the photo can be efficiently disseminated across the network, ensuring that all followers receive the update in a timely manner. Technical Implementation of Gossip Protocol in Social Media Networks To illustrate the implementation of gossip protocol in a social media network, let's consider a simplified example using Python. In this example, we will create a basic network of users who can post updates and follow other users, similar to Instagram. First, let's define a User class to represent a user in the network: Python class User: def __init__(self, user_id): self.user_id = user_id self.followers = set() self.posts = [] def post_photo(self, photo): self.posts.append(photo) def follow(self, user): self.followers.add(user) Next, we'll implement the gossip protocol to propagate updates among users. We will create a GossipNetwork class that manages the user nodes and initiates gossip communication: Python import random class GossipNetwork: def __init__(self): self.users = {} def add_user(self, user_id): self.users[user_id] = User(user_id) def post_photo(self, user_id, photo): self.users[user_id].post_photo(photo) self.gossip(user_id, photo) def gossip(self, user_id, photo): user = self.users[user_id] for follower in user.followers: # Propagate the photo to the follower self.users[follower].posts.append(photo) # Continue gossiping with a random subset of the follower's followers if len(self.users[follower].followers) > 0: next_follower = random.choice(list(self.users[follower].followers)) self.gossip(next_follower, photo) The main method to test the behavior: Python if __name__ == "__main__": # Create a gossip network network = GossipNetwork() # Add users to the network for i in range(1, 6): network.add_user(i) # Establish follower relationships network.users[1].follow(2) network.users[2].follow(3) network.users[3].follow(4) network.users[4].follow(5) # Post a photo by user 1 network.post_photo(1, "photo1") # Print the posts of each user for i in range(1, 6): print(f"User {i}: {network.users[i].posts}") This code creates a simple network of five users with a chain of follower relationships (1 -> 2 -> 3 -> 4 -> 5). When user 1 posts a photo, it will be propagated through the gossip protocol to all users in the chain. The output will show that all users have received the posted photo: Plain Text User 1: ['photo1'] User 2: ['photo1'] User 3: ['photo1'] User 4: ['photo1'] User 5: ['photo1'] In this example, when a user posts a photo, the GossipNetwork.post_photo() method is called. This method initiates gossip communication by propagating the photo to the user's followers and their followers using the GossipNetwork.gossip() method. Conclusion The gossip protocol is an efficient and robust method for disseminating information among nodes in a distributed system. Its implementation in social media networks like Instagram can help propagate updates and notifications to users, ensuring timely delivery and fault tolerance. By understanding the inner workings of the gossip protocol in social media networks, developers can better appreciate its role in maintaining a consistent and reliable distributed platform.

By Arun Pandey

Garbage Collection: Unsung Hero

Garbage Collection is a facet often disregarded and underestimated, yet beneath its surface lies the potential for profound impacts on your organization that reach far beyond the realm of application performance. In this post, we embark on a journey to unravel the pivotal role of Garbage Collection analysis and explore seven critical points that underscore its significance. Improve Application Response Time Without Code Changes Automatic garbage collection (GC) is a critical memory management process, but it introduces pauses in applications. These pauses occur when GC scans and reclaims memory occupied by objects that are no longer in use. Depending on various factors, these pauses can range from milliseconds to several seconds or even minutes. During these pauses, no application transactions are processed, causing customer requests to be stranded. However, there’s a solution. By fine-tuning the GC behavior, you can significantly reduce GC pause times. This reduction ultimately leads to a decrease in the overall application’s response time, delivering a smoother user experience. A real-world case study from one of the world’s largest automobile manufacturers demonstrates the impact of GC tuning without making a single line of code change. Read the full case study here. They were able to reduce their response time by 50% just by tuning their GC settings without a single line of code change. Efficient Cloud Cost Reduction In the world of cloud computing, enterprises often unknowingly spend millions of dollars on inefficient garbage collection practices. A high GC Throughput percentage, such as 98%, may initially seem impressive, like achieving an ‘A grade’ score. However, this seemingly minor difference carries substantial financial consequences. Imagine a mid-sized company operating 1,000 AWS t2.2x.large 32G RHEL on-demand EC2 instances in the US West (North California) region. The cost of each EC2 instance is $0.5716 per hour. Let’s assume that their application’s GC throughput is 98%. Now, let’s break down the financial impact of this assumption: With a 98% GC Throughput, each instance loses approximately 28.8 minutes daily due to garbage collection. In a day, there are 1,440 minutes (equivalent to 24 hours x 60 minutes). Thus, 2% of 1,440 minutes equals 28.8 minutes. Over the course of a year, this adds up to 175.2 hours per instance. (i.e. 28.8 minutes x 365 days) For a fleet of 1,000 AWS EC2 instances, this translates to approximately $100.14K in wasted resources annually (calculated as 1,000 EC2 instances x 175.2 hours x $0.5716 per hour) due to garbage collection delays. This calculation vividly illustrates how seemingly insignificant pauses in GC activity can amass substantial costs for enterprises. It emphasizes the critical importance of optimizing garbage collection processes to achieve significant cost savings. Trimming Software Licensing Cost In today’s landscape, many of our applications run on commercial vendor software solutions like Dell Boomi, ServiceNow, Workday, and others. While these vendor software solutions are indispensable, their licensing costs can be exorbitant. What’s often overlooked is that the efficiency of our code and configurations within these vendor software platforms directly impacts software licensing costs. This is where proper Garbage Collection (GC) analysis comes into play. It provides insights into whether there is an overallocation or underutilization of resources within these vendor software environments. Surprisingly, overallocation often remains hidden until we scrutinize GC behavior. By leveraging GC analysis, enterprises gain the visibility needed to identify overallocation and reconfigure resources accordingly. This optimization not only enhances application performance but also results in significant cost savings by reducing the licensing footprint of these vendor software solutions. The impact on the bottom line can be substantial. Forecast Memory Problems in Production Garbage collection logs hold the key to vital predictive micrometrics that can transform how you manage your application’s availability and performance. Among these micrometrics, one stands out: ‘GC Throughput.’ But what is GC Throughput? Imagine your application’s GC throughput is at 98% — it means that your application spends 98% of its time efficiently processing customer activity, with the remaining 2% allocated to GC activity. The significance becomes apparent when your application faces a memory problem. Several minutes before a memory issue becomes noticeable, the GC throughput will begin to degrade. This degradation serves as an early warning, enabling you to take preventive action before memory problems impact your production environment. Troubleshooting tools like yCrash closely monitor ‘GC throughput’ to predict and forecast memory problems, ensuring your application remains robust and reliable. Unearthing Memory Issues One of the primary reasons for production outages is encountering an OutOfMemoryError. In fact, there are nine different types of OutOfMemoryErrors: java.lang.OutOfMemoryError: Java heap space java.lang.OutOfMemoryError: PermGen space java.lang.OutOfMemoryError: GC overhead limit exceeded java.lang.OutOfMemoryError: Requested array size exceeds VM limit java.lang.OutOfMemoryError: Unable to create new native thread java.lang.OutOfMemoryError: Metaspace java.lang.OutOfMemoryError: unable to create new native thread java.lang.OutOfMemoryError: Direct buffer memory java.lang.OutOfMemoryError: Compressed class space GC analysis provides valuable insights into the root cause of these errors and helps in effectively triaging the problem. By understanding the specific OutOfMemoryError type and its associated details, developers can take targeted actions to debug and resolve memory-related issues, minimizing the risk of production outages. Spotting Performance Bottlenecks During Development In the modern software development landscape, the “Shift Left” approach has become a key initiative for many organizations. Its goal is to identify and address production-related issues during the development phase itself. Garbage Collection (GC) analysis enables this proactive approach by helping to isolate performance bottlenecks early in the development cycle. One of the vital metrics obtained through GC analysis is the ‘Object Creation Rate.’ This metric signifies the average rate at which objects are created by your application. Here’s why it matters: If your application, which previously generated data at a rate of 100MB/sec, suddenly starts creating 150MB/sec without a corresponding increase in traffic volume, it’s a red flag indicating potential problems within the application. This increased object creation rate can lead to heightened GC activity, higher CPU consumption, and degraded response times. Moreover, this metric can be integrated into your Continuous Integration/Continuous Deployment (CI/CD) pipeline to gauge the quality of code commits. For instance, if your previous code commit resulted in an object creation rate of 50MB/sec and a subsequent commit increases it to 75MB/sec for the same traffic volume, it signifies an inefficient code change. To streamline this process, you can leverage the GCeasy REST API. This integration allows you to capture critical data and insights directly within your CI/CD pipeline, ensuring that performance issues are identified and addressed early in the development lifecycle. Efficient Capacity Planning Effective capacity planning is vital for ensuring that your application can meet its performance and resource requirements. It involves understanding your application’s demands for memory, CPU, network resources, and storage. In this context, analyzing garbage collection behavior emerges as a powerful tool for capacity planning, particularly when it comes to assessing memory requirements. When you delve into garbage collection behavior analysis, you gain insights into crucial micro-metrics such as the average object creation rate and average object reclamation rate. These micro-metrics provide a detailed view of how your application utilizes memory resources. By leveraging this data, you can perform precise and effective capacity planning for your application. This approach allows you to allocate resources optimally, prevent resource shortages or overprovisioning, and ensure that your application runs smoothly and efficiently. Garbage Collection analysis, with its focus on memory usage patterns, becomes an integral part of the capacity planning process, enabling you to align your infrastructure resources with your application’s actual needs. How To Do Garbage Collection Analysis While there are monitoring tools and JMX MBeans that offer real-time Garbage Collection metrics, they often lack the depth needed for thorough analysis. To gain a complete understanding of Garbage Collection behavior, turn to GC logs. Once you have GC logs, select a free GC log analysis tool that suits your needs. With your chosen GC log analysis tool, examine Garbage Collection behavior in the logs, looking for patterns and performance issues. Pay attention to key metrics, and based on your analysis, optimize your application to reduce GC pauses and enhance performance. Adjust GC settings, allocate memory efficiently, and monitor the impact of your changes over time. Conclusion In the fast-paced world of software development and application performance optimization, Garbage Collection (GC) analysis is often the unsung hero. While it may be considered an underdog, it’s high time for this perception to change. GC analysis wields the power to enhance performance, reduce costs, and empower proactive decision-making. From improving application response times to early issue detection and precise capacity planning, GC analysis stands as a pivotal ally in optimizing applications and resources.

By Ram Lakshmanan

CORE

Your Codebase Is a Cluttered Garage

Unused code adds time and burden to maintaining the codebase, and removing it is the only cure for this side of “more cowbell.” Unfortunately, it’s not always obvious whether developers can remove certain code without breaking the application. As the codebase becomes cluttered and unwieldy, development teams can become mired in mystery code that slows development and lowers morale. Do you remember the first time you walked into your garage, empty and sparkling, yawning with the promise of protecting your vehicles and power tools? How did it look the last time you walked in? If you’re like many of us, the clutter of long-closed boxes taunts you every time you walk around them, losing precious minutes before you can get to the objects you need while your car sits in the driveway. Sadly, development teams have a similar problem with their source code, which has grown into a cluttered mess. Over the last few months, I’ve been working on a way to help development teams maintain less code. Everything we normally read is about working with new frameworks, new tools, and new techniques — but one thing many of us ignore is improving velocity by simply getting rid of things we no longer need. Essentially, as it runs, the JVM streams off its first-call method invocation log to a central location to track "have we used this method recently." When the method appears in the code inventory, the answer is yes — if the method does not appear, then it becomes a candidate for removal of that unused code. Dead Code Removal If you’re a senior developer helping new teammates, consider the work it takes to onboard new members and for them to learn your codebase. Each time they change something, they scroll past methods. Although our IDEs and analyzers can identify fully dead code, the frustration point is code that looks alive but just isn’t used. Often, these are public methods or classes that just aren’t called or have commented/modified annotations. As I’ve talked to teams about the idea that we hoard unused code, I’ve heard comments like these: “I don’t know what this code does, so I don’t want to get rid of it, but I would love to.” "I could clean that up, but I have other priority issues and don’t have time for that." “We never prioritize clean up. We just do new features.” What if Java developers had an easier way to identify dead code for removal — a way where we could prioritize code cleanup during our sprints to reduce technical debt without taking time away from business needs to add features? Code removal is complex and generally takes a back seat to new features. Over time, code becomes unused as teams refactor without removal: commenting on an annotation, changing a path, or moving functionality. Most senior engineers would have to allocate time in their sprints to find what to remove: evaluating missing log statements or reviewing code with static analyzers. Both are problematic from a time perspective, so many teams just leave it in the code repository, active but dead: a problem for a future team lead or delayed until the next big rewrite. The JVM, however, has an overlooked capability to identify dead code and simplify the prioritization problem. By re-purposing the bytecode interpreter, the JVM can identify when a method is first called per execution. When tracked in a central location, these logs produce a treasure map you can follow to remove dead code. reducing the overall cognitive burden and improving team velocity. If a method hasn’t run in a year, you can probably remove it. Team leads can then take classes and methods that haven’t been executed and remove that code either at one time or throughout several sprints. Why remove unused code at all? For many groups, updating libraries and major Java versions requires touching a lot of code. Between Java 8 and Java 17, the XML libraries were deprecated and removed — as you port your application, do you still use all that XML processing? Instead of touching the code and all associated unit tests, what if you could get rid of that code and remove the test? If the code doesn’t run, team members shouldn’t spend hours changing the code and updating tests to pass: removing the dead code is faster and reduces the mental complexity of figuring that code out. Similar situations arise from updates to major frameworks like Spring, iText, and so on. Imagine you paid your neighbor’s kids to mow your lawn with your mower, and it was hidden behind a wall of boxes, expired batteries, old clothes, and old electronics. How hard do you think they would try to navigate around your junk before they gave up and went home? Senior engineers are doing the same thing. What should be an hour’s work of mowing becomes two hours. The problem of cluttered and unused code also affects teams working on decomposing a monolith or re-architecting for the cloud. Without a full measurement of what code is still used, teams end up breaking out huge microservices that are difficult to manage because they include many unnecessary pieces brought out of the monolith. Instead of producing the desired streamlined suite of microservices, these re-architecture projects take longer, cost more, and feel like they need to be rewritten right away because the clutter the team was trying to avoid was never removed. Difficulties stick with the project until teams can decrease the maintenance burden: removing unnecessary code is a rapid way to decrease that burden. Instead of clamoring for a rewrite, reduce the maintenance burden to tidy up what you have. The Benefits of Tracking Used/Unused Code The distinguishing benefit of tracking life vs. unused code from the JVM is that teams can gather data from production applications without impacting performance. The JVM knows when a method is first called, and logging it doesn’t add any measurable overhead. This way, teams that aren’t sure about the robustness of their test environments can rely on the result. A similar experience exists for projects that have had different levels of test-driven development over their lifetime. Changing a tiny amount of code could result in several hours of test refactoring to make tests pass and get that green bar. I’ve seen many projects where the unit tests were the only thing that used the code. Removing the code and the unnecessary tests was more satisfying than updating all the code to the newer library just to get a green bar. The best way of identifying unused code for removal is to passively track what code runs. Instead of figuring it out manually or taking time from sprints, tune your JVM to record the first invocation of each method. It’s like a map of your unused boxes next to your automatic garage door opener. Later on, during sprints or standard work, run a script to compare your code against the list to see what classes and methods never ran. While the team works to build new features and handle normal development, start removing code that never ran. Perform your standard tests – if tests fail, look into removing or changing the test as well because it was just testing unused code. By removing this unused code over time, teams will have less baggage, less clutter, and less mental complexity to sift through as they work on code. If you’ve been working on a project for a long time or just joined a team and your business is pressuring you to go faster, consider finally letting go of unnecessary code. Track Code Within the JVM The JVM provides plenty of capabilities that help development teams create fast-running applications. It already knows when a method will be first called, so unlike profilers, there’s no performance impact on tracking when this occurs. By consolidating this first-call information, teams can identify unused code and finally tidy up that ever-growing codebase.

By Erik Costlow

Avoid Merge Conflicts, Don't Manage Them

When discussing continuous integration on the interwebs, inevitably someone pops into the conversation with this hand grenade of wisdom: BUT ACTUALLY… you don’t need continuous integration. It’s good enough to just merge mainline into your feature branch regularly. You get the same benefit, without the effort of changing your workflow. Sounds nice. But it’s bullsh*t. And I’m going to prove it. There must be a looser version of Godwin’s law that explains this phenomenon, with a dash of Dunbar’s number. Something like: As the number of participants in a conversation grows, the probability of any particular fallacy being presented as truth, approaches 1. But I digress. Let’s talk about feature branches and continuous integration. First, let me set the backdrop for this discussion, with some general explanation and definitions. Right off the bat, when I talk about continuous integration, I’m talking about “the activity of very frequently integrating work to the trunk of version control and verifying that the work is, to the best of our knowledge, releasable.” In particular, I’m not talking about the simple use of Jenkins or GitHub Actions or CircleCI or any other tool that has CI in its name. CI is a practice, not a tool. The importance of this distinction will become apparent as we continue. In contrast, most software development teams these days seem to use long-lived feature branches to develop software. Usually one branch per developer, working more or less in isolation, then after a few days, or sometimes even weeks, of work, they’ll do an integration ritual where they see what changes have occurred on the mainline branch while they were working, and go through the effort of integrating their code into mainline. A simplified diagram of this workflow looks something like this: The practice of continuous integration all but eliminates this merge conflict problem. And now it should be obvious why using a tool called “CI” isn’t at all the same thing. (What’s more, this isn’t even the best thing about continuous integration, but that’s another topic.) And this is where the BUT ACTUALLY folks will chime in. Here are some recent examples, taken from LinkedIn. (Slightly edited, and left anonymous to protect the guilty): If work isn’t being merged “up” onto the main branch quickly enough then you need to merge “down” (i.e. bring the main branch into your working branch) regularly. If you want to stay updated, take the pull from the main branch into your feature branch time to time. This neglects to recognize that, no matter what, conflict resolution must occur at some point. If you don’t regularly rebase or merge your feature branch, and you don’t merge it until it’s “done,” then you will need to handle a lot of conflict resolution at the end. Now before I tear apart the approach that these three folks have recommended, let me make it clear exactly what they are recommending. If we imagine a feature that takes three weeks to complete, rather than waiting three weeks to see what has changed in mainline, we can periodically (perhaps daily?) merge (or rebase) mainline back into our feature branch. Now this approach is not entirely without merit. There is often a benefit in tracking a changing mainline in your local branch. It’s usually easier to incrementally update your code to track one day’s worth of mainline changes at a time, than three weeks all at once. Although even this small benefit is made completely obsolete by continuous integration, as we’ll see in just a moment. What matters for this discussion is that this workflow does absolutely nothing to eliminate code conflicts. Not even a little bit. In fact, it often increases the number of conflict resolutions you have to do. Any time a line or section of code has multiple changes applied in sequence, you’ll find yourself resolving conflicts every time you update mainline, rather than only once at the end. “Okay, so there are the same number of conflicts. At least you’re resolving your own conflicts, like a responsible programmer citizen.” This is just laughably naïve. Where do you think those conflicts you just resolved came from? A conflict is like a two-sided coin. Every conflict is the result of two developers working on the same code in the same time window. Simply re-ordering it, or adding rules about “conflict etiquette” doesn’t resolve the problem. The only way to reduce conflicts is to reduce this time window. But now I’m jumping ahead. Consider two active branches: Bob’s branch and Alice’s branch. Bob and Alice are both waiting until their 3-week feature is done before merging. Unbeknownst to each of them, they’ve introduced a couple of conflicts, represented by the dashed red line. Then Bob merges his work, and it all integrates cleanly. Well done Bob! But then the next day Alice tries to merge, and discovers two conflicts that need to be resolved. So let’s apply our “merge mainline into the branch frequently” strategy, and see how well it solves this particular problem. In this scenario, both Bob and Alice are updating their branch with any changes found in mainline. But their own work remains isolated in their respective branches. As a result, they’re both trodding along for weeks with no problems. Then Bob merges his change, without any fuss. Then Alice comes along and BANG!! She has a massive conflict, caused by Bob’s recently merged changes. Nothing actually changed. Whatever the benefit of merging mainline into your branch frequently has for your own sanity, it makes no difference for your teammates in terms of conflict management. None. Zero. Zilch. Here’s another quote from LinkedIn: When making a change, it is the responsibility of the person making the more recent change to reconcile their local state with the shared state — the following person who changes the same code after you must do the same, etc. This is like saying “When you’re in a car accident, make sure you’re not at fault. And clean up your half of the broken glass.” What the actual f@#$?!? Instead, let’s try to avoid collisions entirely! Let’s now imagine that Bob and Alice are practicing continuous integration. Specifically, this means that they’re integrating their work as frequently as possible. Many times per day, most likely. Every time they’ve added a bit of code that will eventually contribute to their goal, and the test suite passes, they make a commit, and merge it into mainline immediately. Here’s what that looks like: They’ve written the same functional changes. But with no conflicts. Magic! How is that possible? Well let’s remember what a conflict actually is. In another context that we should all be familiar with, we would call this a race condition. And we have various techniques for avoiding race conditions in our code. Can we apply those to our… eh... Code? Yes, Yes, we can. One approach is to lock our changes. This is the approach taken by some ancient version control systems like CVS. In CVS, when you did a “checkout,” you were asking for a lock on files you wanted to change. This prevented anyone else from working on the same files. Conflicts avoided! Of course, we know that when writing code, locks are slow and expensive, and introduce contention. So best to be avoided. The other applicable approach for race avoidance (although there are others I won’t discuss here) is the use of atomic operations. If your database, or CPU, or whatever underlying system, supports it, making an operation atomic avoids the possibility of a race condition. Can we make our code changes atomic? For all practical purposes, we can get extremely close (more on the exceptions in a moment). Continuous integration is as close as we can realistically get to atomic commits, and on human time scales, it’s usually as good. When Bob is ready to make a small change, he makes sure he has an up-to-date version of mainline on his machine. He then makes his small change, and merges it into mainline. Within a few minutes. Or maybe an hour or two at most. Then Alice comes along to add a small change that in our earlier, alternate, reality, would have conflicted with Bob’s change. But now Bob’s change is already in mainline. So Alice makes her change, integrating it into Bob’s change, without realizing there ever even was the possibility for a conflict. A few minutes later, she merges it. Neither Bob nor Alice realize they’ve just averted disaster. There is no glass to clean up. “That’s nice” some of you are saying. “But conflicts are still unavoidable. Someone is going to try update the same file as someone else. Eventually.” Yes. Of course this is technically true. But only technically. Here’s the thing: Such collisions are exceedingly rare. If you’re working in short iterations, on small bits of code, the chances of two random developers making a change to the same code at the same time is infinitesimal. But this isn’t random. So the odds are even smaller. This is why you don’t already have more conflicts than you do. Usually, developers, or teams, divvy up their work in logical ways. Bob may be working on the database access layer, while Alice is working on the logging infrastructure. Only where these two subsystems intersect is a conflict even possible. By way of anecdotal evidence, I once spent a year working on a monorepo with over 1,000 other developers, where we practiced continuous integration without feature branches. Not once in that year did I ever experience a code conflict. I’m sure they did happen. Occasionally. But it’s the exception. Not the rule. It’s absolutely not a problem worth optimizing for. And now here’s the real magic: Even when these conflicts do occur, nobody really cares. They’re super trivial and easy to resolve. And even if they weren’t, by definition, they represent, at most, a few minutes, or maybe hours of work. If you had to throw it all away and start from scratch, the one time per year this occurred, it wouldn’t really matter much. Let’s say you’re convinced that continuously integrating your work into mainline is the best way to avoid merge conflicts. What’s next? The hardest part of continuous integration is not the technical aspects, but rather the human aspect. Humans are habitual creatures, and often resist suggestions to change the way they work, even if the way they work leads to a lot of car accidents merge conflicts. This article is already long enough, so if you find yourself needing additional help in this area, MinimumCD.org is a great website that explains in simple terms the minimum requirements to achieve continuous integration and the related practice of continuous delivery. The Starting the Journey page is a great place to start.

By Jonathan Hall

The Systemic Process of Debugging

Debugging is an integral part of software development. However, as projects grow in size and complexity, the process of debugging requires more structure and collaboration. This process is probably something you already do, as this process is deeply ingrained into most teams. It's also a core part of the academic theory behind debugging. Its purpose is to prevent regressions and increase collaboration in a team environment. Without this process, any issue we fix might come back to haunt us in the future. This process helps developers work cohesively and efficiently. The Importance of Issue Tracking I'm sure we all use an issue tracker. In that sense, we should all be aligned. But do you sometimes "just fix a bug"? Without going through the issue tracker? Honestly, I do that a lot. Mostly in hobby projects but occasionally even in professional settings. Even when working alone, this can become a problem... Avoiding Parallel Work on the Same Bug When working on larger projects, it's crucial to avoid situations where multiple developers are unknowingly addressing the same issue. This can lead to wasted effort and potential conflicts in the codebase. To prevent this: Always log bugs in your issue-tracking system. Before starting work on a bug, ensure it's assigned to you and marked as active. This visibility allows the project manager and other team members to be aware, reducing the chances of overlapping work. Stay updated on other issues. By keeping an eye on the issues your teammates are tackling, you can anticipate potential areas of conflict and adjust your approach accordingly. Assuming you have a daily sync session or even a weekly session, it's important to discuss issues. This prevents collision, where a teammate can hear the description of the bug and might raise a flag. This also helps in pinpointing the root cause of the bug in some situations. An issue might be familiar, and communicating through it leaves a "paper trail." As the project grows, you will find that bugs keep coming back despite everything we do. History that was left behind in the issue tracker by teammates who are no longer on the team can be a lifesaver. Furthermore, the statistics we can derive from a properly classified issue tracker can help us pinpoint the problematic areas of the code that might need further testing and maybe refactoring. The Value of Issue Over Pull Requests We sometimes write the comments and information directly into the pull request instead of the issue tracker. This can work for some situations but isn't as ideal for the general case. Issues in a tracking system are often more accessible than pull requests or specific commits. When addressing a regression, linking the pull request to the originating issue is vital. This ensures that all discussions and decisions related to the bug are centralized and easily traceable. Communication: Issue Tracker vs. Ephemeral Channels I use Slack a lot. This is a problem; it's convenient, but it's ephemeral, and in more than one case, important information written in a Slack chat was gone. Emails aren't much of an improvement, especially in the long term. An email thread I had with a former colleague was cut short, and I had no context as to where it ended. Yes, having a conversation in the issue tracker is cumbersome and awkward, but we have a record. Why We Sometimes Avoid the Issue Tracker Developers might sometimes avoid discussing issues in the tracker because: Complex discussions: Some topics might feel too broad or intricate for the issue tracker. Fear of public criticism: No one wants to appear ignorant or criticize a colleague in a permanent record. As a result, some discussions might shift to private or ephemeral channels. However, while team cohesion and empathy are crucial, it's essential to log all relevant discussions in the issue tracker. This ensures that knowledge isn't lost, especially if a team member departs. The Role of Daily Meetings Daily meetings are invaluable for teams with multiple developers working on related tasks. These meetings provide a platform for: Sharing updates: Inform the team about your current theories and direction. Engaging in discussions: If a colleague's update sounds familiar, it's an opportunity to collaborate and avoid redundant work. However, it's essential to keep these meetings concise. Detailed discussions should transition to the issue tracker for a comprehensive record. I prefer two weekly meetings as I find it's the optimal number. The first day of the week is usually a ramp-up day. Then we have the first meeting in the morning of the second day of the week and the second meeting two days later. That reduces the load of a daily meeting while still keeping information fresh. The Role of Testing in Debugging We all use tests when developing (hopefully), but debugging theory has a special place for tests. Starting With Unit Tests A common approach to debugging is to begin by creating a unit test that reproduces the issue. However, this might not always be feasible before understanding the problem. Nevertheless, once the problem is understood, we should: Create a test before fixing the issue. This test should be part of the pull request that addresses the bug. Maintain a coverage ratio. Aim for a coverage ratio of 60% or higher per pull request to ensure that changes are adequately tested. A test acts as a safeguard against a regression. If the bug resurfaces, it will be a slightly different variant of that same bug. Unit Tests vs. Integration Tests While unit tests are fast and provide immediate feedback, they primarily prevent regressions. They might not be as effective in verifying overall quality. On the other hand, integration tests, though potentially slower, offer a comprehensive quality check. They can sometimes be the only way to reproduce certain issues. Most of the difficult bugs I ran into in my career were in the interconnect area between modules. This is an area that unit tests don't cover very well. That is why integration tests are far more important than unit tests for overall application quality. To ensure quality, focus on integration tests for coverage. Relying solely on unit test coverage can be misleading. It might lead to dead code and added complexity in the system. However, as part of the debugging process, it's very valuable to have a unit test as it's far easier to debug and much faster. Final Word A structured approach to debugging, combined with effective communication and a robust testing strategy, can significantly enhance the efficiency and quality of software development. This isn't about convenience; the process underlying debugging is like a paper trail for the debugging process. I start every debugging session by searching the issue tracker. In many cases, it yields gold that might not lead me to the issue directly but still points me in the right direction. The ability to rely on a unit test that was committed when solving a similar bug is invaluable. It gives me a leg up on resolving similar issues moving forward.

By Shai Almog

CORE

Future-Proofing Data Architecture for Better Data Quality

This is an article from DZone's 2023 Data Pipelines Trend Report.For more: Read the Report Data quality is an undetachable part of data engineering. Because any data insight can only be as good as its input data, building robust and resilient data systems that consistently deliver high-quality data is the data engineering team's holiest responsibility. Achieving and maintaining adequate data quality is no easy task. It requires data engineers to design data systems with data quality in mind. In the hybrid world of data at rest and data in motion, engineering data quality could be significantly different for batch and event streaming systems. This article will cover key components in data engineering systems that are critical for delivering high-quality data: Monitoring data quality – Given any data pipeline, how to measure the correctness of the output data, and how to ensure the output is correct not only today but also in the foreseeable future. Data recovery and backfill – In case of application failures or data quality violations, how to perform data recovery to minimize impact on downstream users. Preventing data quality regressions – When data sources undergo changes or when adding new features to existing data applications, how to prevent unexpected regression. Monitoring Data Quality As the business evolves, the data also evolves. Measuring data quality is never a one-time task, and it is important to continuously monitor the quality of data in data pipelines to catch any regressions at the earliest stage possible. The very first step of monitoring data quality is defining data quality metrics based on the business use cases. Defining Data Quality Defining data quality is to set expectations for the output data and measure the deviation in the actual data from the established expectations in the form of quantitative metrics. When defining data quality metrics, the very first thing data engineers should consider is, "What truth does the data represent?" For example, the output table should contain all advertisement impression events that happened on the retail website. The data quality metrics should be designed to ensure the data system accurately captures that truth. In order to accurately measure the data quality of a data system, data engineers need to track not only the baseline application health and performance metrics (such as job failures, completion timestamp, processing latency, and consumer lag) but also customized metrics based on the business use cases the data system serves. Therefore, data engineers need to have a deep understanding of the downstream use cases and the underlying business problems. As the business model determines the nature of the data, business context allows data engineers to grasp the meanings of the data, traffic patterns, and potential edge cases. While every data system serves a different business use case, some common patterns in data quality metrics can be found in Table 1. METRICS FOR MEASURING DATA QUALITY IN A DATA PIPELINE Type Limitations Application health The number of jobs succeeded or running (for streaming) should be N. SLA/latency The job completion time should be by 8 a.m. PST daily. The max event processing latency should be < 2 seconds (for streaming). Schema Column account_id should be INT type and can't be NULL. Column values Column account_id must be positive integers. Column account_type can only have the values: FREE, STANDARD, or MAX. Comparison with history The total number of confirmed orders on any date should be within +20%/-20% of the daily average of the last 30 days. Comparison with other datasets The number of shipped orders should correlate to the number of confirmed orders. Table 1 Implementing Data Quality Monitors Once a list of data quality metrics is defined, these metrics should be captured as part of the data system and metric monitors should be automated as much as possible. In case of any data quality violations, the on-call data engineers should be alerted to investigate further. In the current data world, data engineering teams often own a mixed bag of batched and streaming data applications, and the implementation of data quality metrics can be different for batched vs. streaming systems. Batched Systems The Write-Audit-Publish (WAP) pattern is a data engineering best practice widely used to monitor data quality in batched data pipelines. It emphasizes the importance of always evaluating data quality before releasing the data to downstream users. Figure 1: Write-Audit-Publish pattern in batched data pipeline design Streaming Systems Unfortunately, the WAP pattern is not applicable to data streams because event streaming applications have to process data nonstop, and pausing production streaming jobs to troubleshoot data quality issues would be unacceptable. In a Lambda architecture, the output of event streaming systems is also stored in lakehouse storage (e.g., an Apache Iceberg or Apache Hudi table) for batched usage. As a result, it is also common for data engineers to implement WAP-based batched data quality monitors on the lakehouse table. To monitor data quality in near real-time, one option is to implement data quality checks as real-time queries on the output, such as an Apache Kafka topic or an Apache Druid datasource. For large-scale output, sampling is typically applied to improve the query efficiency of aggregated metrics. Helper frameworks such as Schema Registry can also be useful for ensuring output events have a compatible as-expected schema. Another option is to capture data quality metrics in an event-by-event manner as part of the application logic and log the results in a time series data store. This option introduces additional side output but allows more visibility into intermediate data stages/operations and easier troubleshooting. For example, assuming the application logic decides to drop events that have invalid account_id, account_type, or order_id, if an upstream system release introduces a large number of events with invalid account_id, the output-based data quality metrics will show a decline in the total number output events. However, it would be difficult to identify what filter logic or column is the root cause without metrics or logs on intermediate data stages/operations. Data Recovery and Backfill Every data pipeline will fail at some point. Some of the common failure causes include: Incompatible source data updates (e.g., critical columns were removed from source tables) Source or sink data systems failures (e.g., sink databases became unavailable) Altered truth in data (e.g., data processing logic became outdated after a new product release) Human errors (e.g., a new build introduces new edge-case errors left unhandled) Therefore, all data systems should be able to be backfilled at all times in order to minimize the impact of potential failures on downstream business use cases. In addition, in event streaming systems, the ability to backfill is also required for bootstrapping large stateful stream processing jobs. The data storage and processing frameworks used in batched and streaming architectures are usually different, and so are the challenges that lie behind supporting backfill. Batched Systems The storage solutions for batched systems, such as AWS S3 and GCP Cloud Storage, are relatively inexpensive and source data retention is usually not a limiting factor in backfill. Batched data are often written and read by event-time partitions, and data processing jobs are scheduled to run at certain intervals and have clear start and completion timestamps. The main technical challenge in backfilling batched data pipelines is data lineage: what jobs updated/read which partitions at what timestamp. Clear data lineage enables data engineers to easily identify downstream jobs impacted by problematic data partitions. Modern lakehouse table formats such as Apache Iceberg provide queryable table-level changelogs and history snapshots, which allow users to revert any table to a specific version in case a recent data update contaminated the table. The less queryable data lineage metadata, the more manual work is required for impact estimation and data recovery. Streaming Systems The source data used in streaming systems, such as Apache Kafka topics, often have limited retention due to the high cost of low-latency storage. For instance, for web-scale data streams, data retention is often set to several hours to keep costs reasonable. As troubleshooting failures can take data engineers hours if not days, the source data could have already expired before backfill. As a result, data retention is often a challenge in event streaming backfill. Below are the common backfill methodologies for event streaming systems: METHODS FOR BACKFILLING STREAMING DATA SYSTEMS Method Description Replaying source streams Reprocess source data from the problematic time period before those events expire in source systems (e.g., Apache Kafka). Tiered storage can help reduce stream retention cost. Lambda architecture Maintain a parallel batched data application (e.g., Apache Spark) for backfill, reading source data from a lakehouse storage with long retention. Kappa architecture The event streaming application is capable of streaming data from both data streams (for production) and lakehouse storage (for backfill) Unified batch and streaming Data processing frameworks, such as Apache Beam, support both streaming (for production) and batch mode (for backfill). Table 2 Preventing Data Quality Regressions Let's say a data pipeline has a comprehensive collection of data quality metrics implemented and a data recovery mechanism to ensure that reasonable historical data can be backfilled at any time. What could go wrong from here? Without prevention mechanisms, the data engineering team can only react passively to data quality issues, finding themselves busy putting out the same fire over and over again. To truly future-proof the data pipeline, data engineers must proactively establish programmatic data contracts to prevent data quality regression at the root. Data quality issues can either come from upstream systems or the application logic maintained by data engineers. For both cases, data contracts should be implemented programmatically, such as unit tests and/or integration tests to stop any contract-breaking changes from going into production. For example, let's say that a data engineering team owns a data pipeline that consumes advertisement impression logs for an online retail store. The expectations of the impression data logging should be implemented as unit and/or regression tests in the client-side logging test suite since it is owned by the client and data engineering teams. The advertisement impression logs are stored in a Kafka topic, and the expectation on the data schema is maintained in a Schema Registry to ensure the events have compatible data schemas for both producers and consumers. As the main logic of the data pipeline is attributing advertisement click events to impression events, the data engineering team developed unit tests with mocked client-side logs and dependent services to validate the core attribution logic and integration tests to verify that all components of the data system together produce the correct final output. Conclusion Data quality should be the first priority of every data pipeline and the data architecture should be designed with data quality in mind. The first step of building robust and resilient data systems is defining a set of data quality metrics based on the business use cases. Data quality metrics should be captured as part of the data system and monitored continuously, and the data should be able to be backfilled at all times to minimize potential impact to downstream users in case of data quality issues. The implementation of data quality monitors and backfill methods can be different for batched vs. event streaming systems. Last but not least, data engineers should establish programmatic data contracts as code to proactively prevent data quality regressions. Only when the data engineering systems are future-proofed to deliver qualitative data, data-driven business decisions can be made with confidence. This is an article from DZone's 2023 Data Pipelines Trend Report.For more: Read the Report

By Xinran Waibel

Revolutionizing Software Testing

This is an article from DZone's 2023 Automated Testing Trend Report.For more: Read the Report Artificial intelligence (AI) has revolutionized the realm of software testing, introducing new possibilities and efficiencies. The demand for faster, more reliable, and efficient testing processes has grown exponentially with the increasing complexity of modern applications. To address these challenges, AI has emerged as a game-changing force, revolutionizing the field of automated software testing. By leveraging AI algorithms, machine learning (ML), and advanced analytics, software testing has undergone a remarkable transformation, enabling organizations to achieve unprecedented levels of speed, accuracy, and coverage in their testing endeavors. This article delves into the profound impact of AI on automated software testing, exploring its capabilities, benefits, and the potential it holds for the future of software quality assurance. An Overview of AI in Testing This introduction aims to shed light on the role of AI in software testing, focusing on key aspects that drive its transformative impact. Figure 1: AI in testing Elastically Scale Functional, Load, and Performance Tests AI-powered testing solutions enable the effortless allocation of testing resources, ensuring optimal utilization and adaptability to varying workloads. This scalability ensures comprehensive testing coverage while maintaining efficiency. AI-Powered Predictive Bots AI-powered predictive bots are a significant advancement in software testing. Bots leverage ML algorithms to analyze historical data, patterns, and trends, enabling them to make informed predictions about potential defects or high-risk areas. By proactively identifying potential issues, predictive bots contribute to more effective and efficient testing processes. Automatic Update of Test Cases With AI algorithms monitoring the application and its changes, test cases can be dynamically updated to reflect modifications in the software. This adaptability reduces the effort required for test maintenance and ensures that the test suite remains relevant and effective over time. AI-Powered Analytics of Test Automation Data By analyzing vast amounts of testing data, AI-powered analytical tools can identify patterns, trends, and anomalies, providing valuable information to enhance testing strategies and optimize testing efforts. This data-driven approach empowers testing teams to make informed decisions and uncover hidden patterns that traditional methods might overlook. Visual Locators Visual locators, a type of AI application in software testing, focus on visual elements such as user interfaces and graphical components. AI algorithms can analyze screenshots and images, enabling accurate identification of and interaction with visual elements during automated testing. This capability enhances the reliability and accuracy of visual testing, ensuring a seamless user experience. Self-Healing Tests AI algorithms continuously monitor test execution, analyzing results and detecting failures or inconsistencies. When issues arise, self-healing mechanisms automatically attempt to resolve the problem, adjusting the test environment or configuration. This intelligent resilience minimizes disruptions and optimizes the overall testing process. What Is AI-Augmented Software Testing? AI-augmented software testing refers to the utilization of AI techniques — such as ML, natural language processing, and data analytics — to enhance and optimize the entire software testing lifecycle. It involves automating test case generation, intelligent test prioritization, anomaly detection, predictive analysis, and adaptive testing, among other tasks. By harnessing the power of AI, organizations can improve test coverage, detect defects more efficiently, reduce manual effort, and ultimately deliver high-quality software with greater speed and accuracy. Benefits of AI-Powered Automated Testing AI-powered software testing offers a plethora of benefits that revolutionize the testing landscape. One significant advantage lies in its codeless nature, thus eliminating the need to memorize intricate syntax. Embracing simplicity, it empowers users to effortlessly create testing processes through intuitive drag-and-drop interfaces. Scalability becomes a reality as the workload can be efficiently distributed among multiple workstations, ensuring efficient utilization of resources. The cost-saving aspect is remarkable as minimal human intervention is required, resulting in substantial reductions in workforce expenses. With tasks executed by intelligent bots, accuracy reaches unprecedented heights, minimizing the risk of human errors. Furthermore, this automated approach amplifies productivity, enabling testers to achieve exceptional output levels. Irrespective of the software type — be it a web-based desktop application or mobile application — the flexibility of AI-powered testing seamlessly adapts to diverse environments, revolutionizing the testing realm altogether. Figure 2: Benefits of AI for test automation Mitigating the Challenges of AI-Powered Automated Testing AI-powered automated testing has revolutionized the software testing landscape, but it is not without its challenges. One of the primary hurdles is the need for high-quality training data. AI algorithms rely heavily on diverse and representative data to perform effectively. Therefore, organizations must invest time and effort in curating comprehensive and relevant datasets that encompass various scenarios, edge cases, and potential failures. Another challenge lies in the interpretability of AI models. Understanding why and how AI algorithms make specific decisions can be critical for gaining trust and ensuring accurate results. Addressing this challenge requires implementing techniques such as explainable AI, model auditing, and transparency. Furthermore, the dynamic nature of software environments poses a challenge in maintaining AI models' relevance and accuracy. Continuous monitoring, retraining, and adaptation of AI models become crucial to keeping pace with evolving software systems. Additionally, ethical considerations, data privacy, and bias mitigation should be diligently addressed to maintain fairness and accountability in AI-powered automated testing. AI models used in testing can sometimes produce false positives (incorrectly flagging a non-defect as a defect) or false negatives (failing to identify an actual defect). Balancing precision and recall of AI models is important to minimize false results. AI models can exhibit biases and may struggle to generalize new or uncommon scenarios. Adequate training and validation of AI models are necessary to mitigate biases and ensure their effectiveness across diverse testing scenarios. Human intervention plays a critical role in designing test suites by leveraging their domain knowledge and insights. They can identify critical test cases, edge cases, and scenarios that require human intuition or creativity, while leveraging AI to handle repetitive or computationally intensive tasks. Continuous improvement would be possible by encouraging a feedback loop between human testers and AI systems. Human experts can provide feedback on the accuracy and relevance of AI-generated test cases or predictions, helping improve the performance and adaptability of AI models. Human testers should play a role in the verification and validation of AI models, ensuring that they align with the intended objectives and requirements. They can evaluate the effectiveness, robustness, and limitations of AI models in specific testing contexts. AI-Driven Testing Approaches AI-driven testing approaches have ushered in a new era in software quality assurance, revolutionizing traditional testing methodologies. By harnessing the power of artificial intelligence, these innovative approaches optimize and enhance various aspects of testing, including test coverage, efficiency, accuracy, and adaptability. This section explores the key AI-driven testing approaches, including differential testing, visual testing, declarative testing, and self-healing automation. These techniques leverage AI algorithms and advanced analytics to elevate the effectiveness and efficiency of software testing, ensuring higher-quality applications that meet the demands of the rapidly evolving digital landscape: Differential testing assesses discrepancies between application versions and builds, categorizes the variances, and utilizes feedback to enhance the classification process through continuous learning. Visual testing utilizes image-based learning and screen comparisons to assess the visual aspects and user experience of an application, thereby ensuring the integrity of its look and feel. Declarative testing expresses the intention of a test using a natural or domain-specific language, allowing the system to autonomously determine the most appropriate approach to execute the test. Self-healing automation automatically rectifies element selection in tests when there are modifications to the user interface (UI), ensuring the continuity of reliable test execution. Key Considerations for Harnessing AI for Software Testing Many contemporary test automation tools infused with AI provide support for open-source test automation frameworks such as Selenium and Appium. AI-powered automated software testing encompasses essential features such as auto-code generation and the integration of exploratory testing techniques. Open-Source AI Tools To Test Software When selecting an open-source testing tool, it is essential to consider several factors. Firstly, it is crucial to verify that the tool is actively maintained and supported. Additionally, it is critical to assess whether the tool aligns with the skill set of the team. Furthermore, it is important to evaluate the features, benefits, and challenges presented by the tool to ensure they are in line with your specific testing requirements and organizational objectives. A few popular open-source options include, but are not limited to: Carina – AI-driven, free forever, scriptless approach to automate functional, performance, visual, and compatibility tests TestProject – Offered the industry's first free Appium AI tools in 2021, expanding upon the AI tools for Selenium that they had previously introduced in 2020 for self-healing technology Cerberus Testing – A low-code and scalable test automation solution that offers a self-healing feature called Erratum and has a forever-free plan Designing Automated Tests With AI and Self-Testing AI has made significant strides in transforming the landscape of automated testing, offering a range of techniques and applications that revolutionize software quality assurance. Some of the prominent techniques and algorithms are provided in the tables below, along with the purposes they serve: KEY TECHNIQUES AND APPLICATIONS OF AI IN AUTOMATED TESTING Key Technique Applications Machine learning Analyze large volumes of testing data, identify patterns, and make predictions for test optimization, anomaly detection, and test case generation Natural language processing Facilitate the creation of intelligent chatbots, voice-based testing interfaces, and natural language test case generation Computer vision Analyze image and visual data in areas such as visual testing, UI testing, and defect detection Reinforcement learning Optimize test execution strategies, generate adaptive test scripts, and dynamically adjust test scenarios based on feedback from the system under test Table 1 KEY ALGORITHMS USED FOR AI-POWERED AUTOMATED TESTING Algorithm Purpose Applications Clustering algorithms Segmentation k-means and hierarchical clustering are used to group similar test cases, identify patterns, and detect anomalies Sequence generation models: recurrent neural networks or transformers Text classification and sequence prediction Trained to generate sequences such as test scripts or sequences of user interactions for log analysis Bayesian networks Dependencies and relationships between variables Test coverage analysis, defect prediction, and risk assessment Convolutional neural networks Image analysis Visual testing Evolutionary algorithms: genetic algorithms Natural selection Optimize test case generation, test suite prioritization, and test execution strategies by applying genetic operators like mutation and crossover on existing test cases to create new variants, which are then evaluated based on fitness criteria Decision trees, random forests, support vector machines, and neural networks Classification Classification of software components Variational autoencoders and generative adversarial networks Generative AI Used to generate new test cases that cover different scenarios or edge cases by test data generation, creating synthetic data that resembles real-world scenarios Table 2 Real-World Examples of AI-Powered Automated Testing AI-powered visual testing platforms perform automated visual validation of web and mobile applications. They use computer vision algorithms to compare screenshots and identify visual discrepancies, enabling efficient visual testing across multiple platforms and devices. NLP and ML are combined to generate test cases from plain English descriptions. They automatically execute these test cases, detect bugs, and provide actionable insights to improve software quality. Self-healing capabilities are also provided by automatically adapting test cases to changes in the application's UI, improving test maintenance efficiency. Quantum AI-Powered Automated Testing: The Road Ahead The future of quantum AI-powered automated software testing holds great potential for transforming the way testing is conducted. Figure 3: Transition of automated testing from AI to Quantum AI Quantum computing's ability to handle complex optimization problems can significantly improve test case generation, test suite optimization, and resource allocation in automated testing. Quantum ML algorithms can enable more sophisticated and accurate models for anomaly detection, regression testing, and predictive analytics. Quantum computing's ability to perform parallel computations can greatly accelerate the execution of complex test scenarios and large-scale test suites. Quantum algorithms can help enhance security testing by efficiently simulating and analyzing cryptographic algorithms and protocols. Quantum simulation capabilities can be leveraged to model and simulate complex systems, enabling more realistic and comprehensive testing of software applications in various domains, such as finance, healthcare, and transportation. Parting Thoughts AI has significantly revolutionized the traditional landscape of testing, enhancing the effectiveness, efficiency, and reliability of software quality assurance processes. AI-driven techniques such as ML, anomaly detection, NLP, and intelligent test prioritization have enabled organizations to achieve higher test coverage, early defect detection, streamlined test script creation, and adaptive test maintenance. The integration of AI in automated testing not only accelerates the testing process but also improves overall software quality, leading to enhanced customer satisfaction and reduced time to market. As AI continues to evolve and mature, it holds immense potential for further advancements in automated testing, paving the way for a future where AI-driven approaches become the norm in ensuring the delivery of robust, high-quality software applications. Embracing the power of AI in automated testing is not only a strategic imperative but also a competitive advantage for organizations looking to thrive in today's rapidly evolving technological landscape. This is an article from DZone's 2023 Automated Testing Trend Report.For more: Read the Report

By Tuhin Chattopadhyay

CORE

How To Handle Technical Debt in Scrum

If you work in software development, you likely encounter technical debt all the time. It accumulates over time as we prioritize delivering new features over maintaining a healthy codebase. Managing technical debt, or code debt, can be a challenge. Approaching it the right way in the context of Scrum won’t just help you manage your tech debt. It can allow you to leverage it to strategically ship faster and gain a very real competitive advantage. In this article, I’ll cover: The basics of technical debt and why it matters How tech debt impacts Scrum teams How to track tech debt How to prioritize tech debt and fix it Why continuous improvement matters in tech debt Thinking About Tech Debt in Scrum: The Basics Scrum is an Agile framework that helps teams deliver high-quality software in a collaborative and iterative way. By leveraging strategies like refactoring, incremental improvement, and automated testing, Scrum teams can tackle technical debt head-on. But it all starts with good issue tracking. Whether you're a Scrum master, product owner, or developer, I’m going to share some practical insights and strategies for you to manage tech debt. The Impact of Technical Debt on Scrum Teams Ignoring technical debt can lead to higher costs, slower delivery times, and reduced productivity. Tech debt makes it harder to implement new features or updates because it creates excessive complexity. Product quality suffers in turn. Then maintenance costs rise. There are more customer issues, and customers become frustrated. Unmanaged technical debt has the potential to touch every part of the business. Technical debt also brings the team down. It’s a serial destroyer of morale. Ignoring tech debt or postponing it is often frustrating and demotivating. It can also exacerbate communication problems and create silos, hindering project goals. Good management of tech debt, then, is absolutely essential for the modern Scrum team. How to Track Tech Debt Agile teams who are successful at managing their tech debt identify it early and often. Technical debt should be identified: During the act of writing code. Scrum teams should feel confident accruing prudent tech debt to ship faster. That’s so long as they track that debt immediately and understand how it could be paid off. Backlog refinement. This is an opportunity to discuss and prioritize the product backlog and have nuanced conversations about tech debt in the codebase. Sprint planning. How technical debt impacts the current sprint should always be a topic of conversation during sprint planning. Allocate resources to paying back tech debt consistently. Retrospectives. An opportunity to identify tech debt that has been accrued or which needs to be considered or prioritized. Use an in-editor issue tracker, which enables your engineers to track issues directly linked to code. This is a weakness of common issue-tracking software like Jira, which often undermines the process entirely. Prioritising Technical Debt in Scrum There are many ways to choose what to prioritize. I suggest choosing a theme for each sprint. Allocate 15-20% of your resources to fixing a specific subset of technical debt issues. For example, you might choose to prioritize issues based on… Their impact on a particular part of the codebase needed to ship new features Their impact on critical system functionality, security, or performance Their impact on team morale, employee retention, or developer experience The headaches around issue resolution often stem from poor issue tracking. Once your Scrum team members have nailed an effective issue-tracking system that feels seamless for engineers, solving tech debt becomes much easier. The Importance of Good Issue Tracking in Managing Technical Debt in Scrum Good issue tracking is the foundation of any effective technical debt management strategy. Scrum teams must be able to track technical debt issues systematically to prioritize and address them effectively. Using the right tools can make or break a tech debt management strategy. Modern engineering teams need issue-tracking tools that: Link issues directly to code. Make issues visible in the code editor Enable engineers to visualize tech debt in the codebase Create issues from the code editor in Stepsize Continuous Improvement in Scrum Identify tech debt early and consistently. Address and fix tech debt continuously. Use Scrum sessions such as retrospectives as an opportunity to reflect on how the team can improve their process for managing technical debt. Consider: Where does tech debt tend to accumulate? Is everybody following a good issue-tracking process? Are issues high-quality? Regularly review and update the team's “Definition of Done” (DoD), which outlines the criteria that must be met for a user story to be considered complete. Refining the DoD increases their likelihood of shipping high-quality code that is less likely to result in technical debt down the line. Behavioral change is most likely when teams openly collaborate, supported by the right tools. I suggest encouraging everybody to reflect on their processes and actively search for opportunities to improve. Wrapping Up Managing technical debt properly needs to be a natural habit for modern Scrum teams. Doing so protects the long-term performance of the team and product. Properly tracking technical debt is the foundation of any effective technical debt management strategy. By leveraging the right issue-tracking tools and prioritizing technical debt in the right way, Scrum teams can strategically ship faster. Doing so also promotes better product quality and maintains team morale and collaboration. Remember, technical debt is an unavoidable part of software development, but with the right approach and tools, it’s possible to drive behavioral change and safeguard the long-term success of your team.

By Ruth Dillon-Mansfield

How to Debug an Unresponsive Elasticsearch Cluster

Elasticsearch is an open-source search engine and analytics store used by a variety of applications from search in e-commerce stores, to internal log management tools using the ELK stack (short for “Elasticsearch, Logstash, Kibana”). As a distributed database, your data is partitioned into “shards” which are then allocated to one or more servers. Because of this sharding, a read or write request to an Elasticsearch cluster requires coordinating between multiple nodes as there is no “global view” of your data on a single server. While this makes Elasticsearch highly scalable, it also makes it much more complex to setup and tune than other popular databases like MongoDB or PostgresSQL, which can run on a single server. When reliability issues come up, firefighting can be stressful if your Elasticsearch setup is buggy or unstable. Your incident could be impacting customers which could negatively impact revenue and your business reputation. Fast remediation steps are important, yet spending a large amount of time researching solutions online during an incident or outage is not a luxury most engineers have. This guide is intended to be a cheat sheet for common issues that engineers running that can cause issues with Elasticsearch and what to look for. As a general purpose tool, Elasticsearch has thousands of different configurations which enables it to fit a variety of different workloads. Even if published online, a data model or configuration that worked for one company may not be appropriate for yours. There is no magic bullet getting Elasticsearch to scale, and requires diligent performance testing and trial/error. Unresponsive Elasticsearch Cluster Issues Cluster stability issues are some of the hardest to debug, especially if nothing changes with your data volume or code base. Check Size of Cluster State What Does It Do? Elasticsearch cluster state tracks the global state of our cluster, and is the heart of controlling traffic and the cluster. Cluster state includes metadata on nodes in your cluster, status of shards and how they are mapped to nodes, index mappings (i.e. the schema), and more. Cluster state usually doesn’t change often. However, certain operations such as adding a new field to an index mapping can trigger an update. Because cluster updates broadcast to all nodes in the cluster, it should be small (<100MB). A large cluster state can quickly make the cluster unstable. A common way this happens is through a mapping explosion (too many keys in an index) or too many indices. What to Look For Download the cluster state using the below command and look at the size of the JSON returned.curl -XGET 'http://localhost:9200/_cluster/state' In particular, look at which indices have the most fields in the cluster state which could be the offending index causing stability issues. If the cluster state is large and increasing. You can also get an idea looking at individual index or match against an index pattern like so:curl -XGET 'http://localhost:9200/_cluster/state/_all/my_index-*' You can also see the offending index’s mapping using the following command:curl -XGET 'http://localhost:9200/my_index/_mapping' How to Fix Look at how data is being indexed. A common way mapping explosion occurs is when high-cardinality identifiers are being used as a JSON key. Each time a new key is seen like “4” and”5”, the cluster state is updated. For example, the below JSON will quickly cause stability issues with Elasticsearch as each key is being added to the global state. { "1": { "status": "ACTIVE" }, "2": { "status": "ACTIVE" }, "3": { "status": "DISABLED" } } To fix, flatten your data into something that is Elasticsearch-friendly:{ [ { "id": "1", "status": "ACTIVE" }, { "id": "2", "status": "ACTIVE" }, { "id": "3", "status": "DISABLED" } ] } Check Elasticsearch Tasks Queue What Does It Do? When a request is made against elasticsearch (index operation, query operation, etc), it’s first inserted into the task queue, until a worker thread can pick it up. Once a worker pool has a thread free, it will pick up a task from the task queue and process it. These operations are usually made by you via HTTP requests on the :9200 and :9300 ports, but they can also be internal to handle maintenance tasks on an index At a given time there may be hundreds or thousands of in-flight operations, but should complete very quickly (like microseconds or milliseconds). What to Look For Run the below command and look for tasks that are stuck running for a long time like minutes or hours. This means something is starving the cluster and preventing it from making forward progress. It’s ok for certain long running tasks like moving an index to take a long time. However, normal query and index operations should be quick.curl -XGET 'http://localhost:9200/_cat/tasks?detailed' With the ?detailed param, you can get more info on the target index and query. Look for patterns in which tasks are consistently at the top of the list. Is it the same index? Is it the same node? If so, maybe something is wrong with that index’s data or the node is overloaded. How to Fix If the volume of requests is higher than normal, then look at ways to optimize the requests (such as using bulk APIs or more efficient queries/writes) If not change in volume and looks random, this implies something else is slowing down the cluster. The backup of tasks is just a symptom of a larger issue. If you don’t know where the requests come from, add the X-Opaque-Id header to your Elasticsearch clients to identify which clients are triggering the queries. Checks Elasticsearch Pending Tasks What Does It Do? Pending tasks are pending updates to the cluster state such as creating a new index or updating its mapping. Unlike the previous tasks queue, pending updates require a multi step handshake to broadcast the update to all nodes in the cluster, which can take some time. There should be almost zero in-flight tasks in a given time. Keep in mind, expensive operations like a snapshot restore can cause this to spike temporarily. What to Look For Run the command and ensure none or few tasks in-flight.curl curl curl -XGET 'http://localhost:9200/_cat/pending_tasks' If it looks to be a constant stream of cluster updates that finish quickly, look at what might be triggering them. Is it a mapping explosion or creating too many indices? If it’s just a few, but they seem stuck, look at the logs and metrics of the master node to see if there are any issues. For example, is the master node running into memory or network issues such that it can’t process cluster updates? Hot Threads What Does It Do? The hot threads API is a valuable built-in profiler to tell you where Elasticsearch is spending the most time. This can provide insights such as whether Elasticsearch is spending too much time on index refresh or performing expensive queries. What to Look For Make a call to the hot threads API. To improve accuracy, it’s recommended to capture many snapshots using the ?snapshotsparamcurl -XGET 'http://localhost:9200/_nodes/hot_threads?snapshots=1000' This will return stack traces seen when the snapshot was taken. Look for the same stack in many different snapshots. For example, you might see the text 5/10 snapshots sharing following 20 elements. This means a thread spends time in that area of the code during 5 snapshots. You should also look at the CPU %. If an area of code has both high snapshot sharing and high CPU %, this is a hot code path. By looking at the code module, disassemble what Elasticsearch is doing. If you see wait or park state, this is usually okay. How to Fix If a large amount of CPU time is spent on index refresh, then try increasing the refresh interval beyond the default 1 second. If you see a large amount in cache, maybe your default caching settings are suboptimal and causing a heavy miss. Memory Issues Check Elasticsearch Heap/Garbage Collection What Does It Do? As a JVM process, the heap is the area of memory where a lot of Elasticsearch data structures are stored and requires garbage collection cycles to prune old objects. For typical production setups, Elasticsearch locks all memory using mlockall on-boot and disables swapping. If you’re not doing this, do it now. If Heap is consistently above 85% or 90% for a node, this means we are coming close to out of memory. What to Look For Search for collecting in the last in Elasticsearch logs. If these are present, this means Elasticsearch is spending higher overhead on garbage collection (which takes time away from other productive tasks). A few of these every now and then ok as long as Elasticsearch is not spending the majority of its CPU cycles on garbage collection (calculate the percentage of time spent on collecting relative to the overall time provided). A node that is spending 100% time on garbage collection is stalled and cannot make forward progress. Nodes that appear to have network issues like timeouts may actually be due to memory issues. This is because a node can’t respond to incoming requests during a garbage collection cycle. How to Fix The easiest is to add more nodes to increase the heap available for the cluster. However, it takes time for Elasticsearch to rebalance shards to the empty nodes. If only a small set of nodes have high heap usage, you may need to better balance your customer. For example, if your shards vary in size drastically or have different query/index bandwidths, you may have allocated too many hot shards to the same set of nodes. To move a shard, use the reroute API. Just adjust the shard awareness sensitivity to ensure it doesn’t get moved back.curl -XPOST -H "Content-Type: application/json" localhost:9200/_cluster/reroute -d ' { "commands": [ { "move": { "index": "test", "shard": 0, "from_node": "node1", "to_node": "node2" } } ] }' If you are sending large bulk requests to Elasticsearch, try reducing the batch size so that each batch is under 100MB. While larger batches help reduce network overhead, they require allocating more memory to buffer the request which cannot be freed until after both the request is complete and the next GC cycle. Check Elasticsearch Old Memory Pressure What Does It Do? The old memory pool contains objects that have survived multiple garbage collection cycles and are long-living objects. If the old memory is over 75%, you might want to pay attention to it. As this fills up beyond 85%, more GC cycles will happen but the objects can’t be cleaned up. What to Look For Look at the old pool used/old pool max. If this is over 85%, that is concerning. How to Fix Are you eagerly loading a lot of field data? These reside in memory for a long time. Are you performing many long-running analytics tasks? Certain tasks should be offloaded to a distributed computing framework designed for map/reduce operations like Apache Spark. Check Elasticsearch FieldData Size What Does It Do? FieldData is used for computing aggregations on a field such as terms aggregation Usually, field data for a field is not loaded in memory until the first time an aggregation is performed on it. However, this can also be precomputed on index refresh if eager_load_ordinals is set. What to Look For Look at an index or all indices field data size, like so:curl -XGET 'http://localhost:9200/index_1/_stats/fielddata' An index could have very large field data structures if we are using it on the wrong type of data. Are you performing aggregations on very high-cardinality fields like a UUID or trace id? Field data is not suited for very high-cardinality fields as they will create massive field data structures. Do you have a lot of fields with eager_load_ordinals set or allocate a large amount to the field data cache. This causes the field data to be generated at refresh time vs query time. While it can speed up aggregations, it’s not optimal if you’re computing the field data for many fields at index refresh and never consume it in your queries. How to Fix Make adjustments to your queries or mapping to not aggregate on very high cardinality keys. Audit your mapping to reduce the number that have eager_load_ordinals set to true. Elasticsearch Networking Issues Node Left or Node Disconnected What Does It Do? A node will eventually be removed from the cluster if it does not respond to requests. This allows shards to be replicated to other nodes to meet the replication factor and ensure high availability even if a node was removed. What to Look For Look at the master node logs. Even though there are multiple masters, you should look at the master node that is currently elected. You can use the nodes API or a tool like Cerebro to do this. Look if there is a consistent node that times out or has issues. For example, you can see which nodes are still pending for a cluster update by looking for the phrase pending nodes in the master node’s logs. If you see the same node keep getting added but then removed, it may imply the node is overloaded or unresponsive. If you can’t reach the node from your master node, it could imply a networking issue. You could also be running into the NIC or CPU bandwidth limitations How to Fix Test with the setting transport.compression set to true This will compress traffic between nodes (such as from ingestion nodes to data nodes) reducing network bandwidth at the expense of CPU bandwidth. Note: Earlier versions called this setting transport.tcp.compression If you also have memory issues, try increasing memory. A node may become unresponsive due to a large time spent on garbage collection. Not Enough Master Node Issues What Does It Do? The master and other nodes need to discover each other to formulate a cluster. On the first boot, you must provide a static set of master nodes so you don’t have a split brain problem. Other nodes will then discover the cluster automatically as long as the master nodes are present. What to Look For Enable Trace logging to review discovery-related activities.curl -XPUT -H "Content-Type: application/json" localhost:9200/_cluster/_settings -d ' { "transient": {"logger.discovery.zen":"TRACE"} }' Review configurations such as minimum_master_nodes (if older than 6.x). Look at whether all master nodes in your initial master nodes list can ping each other. Review whether you have a quorum, which should be number of master nodes / 2 +1. If you have less than a quorum, no updates to the cluster state will occur to protect data integrity. How to Fix Sometimes network or DNS issues can cause the original master nodes to not be reachable. Review that you have at least number of master nodes / 2 +1 master nodes currently running. Shard Allocation Errors Elasticsearch in Yellow or Red State (Unassigned Shards) What Does It Do? When a node reboots or a cluster restore is started, the shards are not immediately available. Recovery is throttled to ensure the cluster does not get overwhelmed. Yellow state means primary indices are allocated, but secondary (replica) shards have not been allocated yet. While yellow indices are both readable and writable, availability is decreased. The yellow state is usually self-healable as the cluster replicates shards. Red indices mean primary shards are not allocated. This could be transient such as during a snapshot restore operation, but can also imply major problems such as missing data. What to Look For See the reason behind why allocation has stopped.curl -XGET 'http://localhost:9200/_cluster/allocation/explain' curl -XGET 'http://localhost:9200/_cat/shards?h=index,shard,prirep,state,unassigned.reason' Get a list of red indices, to understand which indices are contributing to the red state. The cluster state will be in the red state as long as at least one index is red.curl -XGET 'http:localhost:9200/_cat/indices' | grep red For more detail on a single index, you can see the recovery status for the offending index.curl -XGET 'http:localhost:9200/index_1/_recovery' How to Fix If you see a timeout from max_retries (maybe the cluster was busy during allocation), you can temporarily increase the circuit breaker threshold (Default is 5). Once the number is above the circuit breaker, Elasticsearch will start to initialize the unassigned shards.curl -XPUT -H "Content-Type: application/json" localhost:9200/index1,index2/_settings -d ' { "index.allocation.max_retries": 7 }' Elasticsearch Disk Issues Index Is Read-Only What Does It Do? Elasticsearch has three disk-based watermarks that influence shard allocation. The cluster.routing.allocation.disk.watermark.low watermark prevents new shards from being allocated to a node with disk filling up. By default, this is 85% of the disk used. The cluster.routing.allocation.disk.watermark.high watermark will force the cluster to start moving shards off of the node to other nodes. By default, this is 90%. This will start to move data around until below the high watermark. If Elasticsearch disk exceeds the flood stage watermark cluster.routing.allocation.disk.watermark.flood_stage, is when the disk is getting so full that moving might not be fast enough before the disk runs out of space. When reached, indices are placed in a read-only state to avoid data corruption. What to Look For Look at your disk space for each node. Review logs for nodes for a message like below:high disk watermark [90%] exceeded on XXXXXXXX free: 5.9gb[9.5%], shards will be relocated away from this node Once the flood stage is reached, you’ll see logs like so:flood stage disk watermark [95%] exceeded on XXXXXXXX free: 1.6gb[2.6%], all indices on this node will be marked read-only Once this happens, the indices on that node are read-only. To confirm, see which indices have read_only_allow_delete set to true.curl -XGET 'http://localhost:9200/_all/_settings?pretty' | grep read_only How to Fix First, clean up disk space such as by deleting local logs or tmp files. To remove this block of read-only, make the command:curl -XPUT -H "Content-Type: application/json" localhost:9200/_all/_settings -d ' { "index.blocks.read_only_allow_delete": null }' Conclusion Troubleshooting stability and performance issues can be challenging. The best way to find the root cause is by using the scientific method of hypothesis and proving it correct or incorrect. Using these tools and the Elasticsearch management API, you can gain a lot of insights into how Elasticsearch is performing and where issues may be.

By Derric Gilling

CORE

Maximizing Uptime: How to Leverage AWS RDS for High Availability and Disaster Recovery

In today's digital era, businesses depend on their databases for storing and managing vital information. It's essential to guarantee high availability and disaster recovery capabilities for these databases to avoid data loss and reduce downtime. Amazon Web Services (AWS) offers a remarkable solution to meet these goals via its Relational Database Service (RDS). This article dives into implementing high availability and disaster recovery using AWS RDS. Grasping AWS RDS Amazon RDS is a managed database service, making database deployment, scaling, and handling more straightforward. It accommodates database engines like MySQL, PostgreSQL, Oracle, and SQL Server. AWS RDS oversees regular tasks such as backups, software patching, and hardware provisioning, thus enabling users to concentrate on their applications instead of database management. Achieving High Availability Through Multi-AZ Deployments High availability refers to the system's capacity to maintain operation and accessibility despite component failures. AWS RDS provides Multi-AZ (Availability Zone) deployments to ensure your database instances retain high availability. What Do Availability Zones Represent? AWS data centers span several geographic regions, each comprising at least two Availability Zones. These Zones represent distinct locations outfitted with duplicate power, networking, and connectivity. They offer fault tolerance and guarantee that shortcomings in one zone fail to influence others. How Multi-AZ Deployments Work Multi-AZ deployments operate on a system where AWS duplicates your principal database to a backup instance within a separate Availability Zone. Any alterations on the primary instance have synchronous replication on the standby instance. If an outage, planned or unplanned, impacts the primary instance, AWS promotes the standby instance to assume the role of the new primary, thus reducing downtime. Understanding Multi-AZ Deployments Setup Establishing Multi-AZ deployments is done directly via the AWS Management Console or Command Line Interface (CLI). When creating an RDS instance, one chooses the "Multi-AZ Deployment" option. AWS takes over from there, overseeing synchronization, failover, and monitoring tasks. Disaster Recovery Utilizing Read Replicas While Multi-AZ deployments ensure high availability inside one region, disaster recovery necessitates a strategy for managing regional shutdowns. As a solution, AWS RDS presents Read Replicas. Understanding Read Replicas Read Replicas represent asynchronous duplicates of the primary database instance. They permit the creation of numerous read-only copies in varied regions, distributing read traffic and offering options for disaster recovery. Functioning of Read Replicas The primary instance replicates data to the Read Replicas in an asynchronous manner. Although Read Replicas only allow read operations, they hold the potential to be elevated to standalone database instances during a disaster. By channeling read traffic toward the Read Replicas it lightens the load of read operations on the primary instance, enhancing overall performance. Establishing Disaster Recovery Using Read Replicas Through the AWS Management Console or CLI, one can create Read Replicas. The replication settings include the region and are configurable. Managing the promotion process is feasible manually or via automation using AWS tools such as AWS Lambda. Conclusion High availability and disaster recovery form the cornerstone of database management strategies. AWS RDS enables businesses to quickly deploy Multi-AZ for regional high availability and Read Replicas for cross-regional disaster recovery. Utilizing these features and monitoring tools, companies can maintain the resilience and accessibility of their databases, even when encountering unforeseen challenges. Therefore, AWS RDS is a dependable and sturdy option for managing databases in the cloud.

By Raghava Dittakavi

Maintenance

DZone's Featured Maintenance Resources

Top Maintenance Experts

The Latest Maintenance Topics