Event-driven architectures have been used successfully for quite some time by many organizations across a variety of business cases. They excel at performance, scalability, evolvability, and fault tolerance, providing a good level of abstraction and elasticity. These strengths make them a good choice when applications need real-time or near real-time reactiveness.

In terms of implementations, for standard messaging, ActiveMQ and RabbitMQ are good candidates, while for data streaming, platforms such as Apache Kafka and Redpanda are more suitable. Usually, when developers and architects need to choose between these two directions, they analyze and weigh several concerns – message payload, flow and usage of data, throughput, and solution topology. As that discussion can become quite broad and complex, it is not covered in this article.

Conceptually, event-driven architectures involve at least three main actors: message producers, message brokers, and message consumers. Briefly, the purpose is to allow producers and consumers to communicate in a decoupled and asynchronous way, which is accomplished with the help of the message broker. In the optimistic scenario, a producer creates a message and publishes it to a topic owned by the broker, from which the consumer reads it, processes it, and, optionally, sends a response back. Messages are serialized (marshaled) by producers when sent to topics and deserialized (unmarshaled) by consumers when received from topics.

This article focuses on the situation in which a consumer experiences issues when deserializing a received message, and it provides a way of acting further on such failures. A few examples of such actions are constructing a default message or sending feedback back to the message broker. Developers are creative enough to decide on this behavior, depending on the particular use cases implemented.

Setup

Java 21
Maven 3.9.2
Spring Boot – version 3.1.5
Redpanda message broker running in Docker – image version 23.2.15

Redpanda is a lightweight message broker, and it was chosen for this proof of concept to give readers the opportunity to experiment with an option other than the widely used Kafka. As it is Kafka-compatible, the development and configuration of the producers and consumers do not need to change at all when moving from one service provider to the other.

According to the Redpanda documentation, Docker support applies only to development and testing. For the purpose of this project, this is more than enough; thus, a single Redpanda message broker is set up to run in Docker. See Resource 1 at the conclusion of this article for details on how to accomplish the minimal setup.

Once it is up and running, a topic called minifig is created with the following command:

PowerShell
>docker exec -it redpanda-0 rpk topic create minifig
TOPIC    STATUS
minifig  OK

If the cluster is inspected, one may observe that a topic with one partition and one replica was created.

PowerShell
>docker exec -it redpanda-0 rpk cluster info
CLUSTER
=======
redpanda.581f9a24-3402-4a17-af28-63353a602421

BROKERS
=======
ID    HOST        PORT
0*    redpanda-0  9092

TOPICS
======
NAME                PARTITIONS  REPLICAS
__consumer_offsets  3           1
_schemas            1           1
minifig             1           1

Implementation

The flow is straightforward: the producer sends a request to the configured topic, from which the consumer reads it as it is able to.
A request represents a mini-figure that is simplistically modeled by the following record:

Java
public record Minifig(String id, Size size, String name) {

    public Minifig(Size size, String name) {
        this(UUID.randomUUID().toString(), size, name);
    }

    public enum Size {
        SMALL, MEDIUM, BIG;
    }
}

id is the unique identifier of the Minifig, which has a certain name and is of a certain size – small, medium, or big.

For configuring a producer and a consumer, at least these properties are needed (application.properties file):

Properties files
# the path to the message broker
broker.url=localhost:19092

# the name of the broker topic
topic.minifig=minifig

# the unique string that identifies the consumer group of the consumer
topic.minifig.group.id=group-0

For sending messages, the producer needs a KafkaTemplate instance.

Java
@Configuration
public class KafkaConfig {

    @Value("${broker.url}")
    private String brokerUrl;

    @Bean
    public KafkaTemplate<String, String> kafkaTemplate() {
        Map<String, Object> props = new HashMap<>();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, brokerUrl);
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);

        ProducerFactory<String, String> producerFactory = new DefaultKafkaProducerFactory<>(props);

        return new KafkaTemplate<>(producerFactory);
    }
}

One may observe in the producer configuration that a StringSerializer was chosen for marshaling the payload value. Usually, a JsonSerializer provides more robustness to the producer-consumer contract. Nevertheless, the choice here was intentional, to increase the experimental flexibility on the consumer side (as we will see later). Just as a reminder, the interest in this proof of concept is to act on the encountered deserialization errors.

Once the messages reach the minifig topic, a consumer is configured to pick them up.

Java
@Configuration
@EnableKafka
public class KafkaConfig {

    @Value("${broker.url}")
    private String brokerUrl;

    @Value("${topic.minifig.group.id}")
    private String groupId;

    @Bean
    public ConcurrentKafkaListenerContainerFactory<String, Minifig> kafkaListenerContainerFactory() {
        Map<String, Object> props = new HashMap<>();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, brokerUrl);
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, JsonDeserializer.class);
        props.put(ConsumerConfig.GROUP_ID_CONFIG, groupId);
        props.put(JsonDeserializer.TRUSTED_PACKAGES, Minifig.class.getPackageName());
        props.put(JsonDeserializer.VALUE_DEFAULT_TYPE, Minifig.class.getName());

        DefaultKafkaConsumerFactory<String, Minifig> defaultFactory = new DefaultKafkaConsumerFactory<>(props);

        ConcurrentKafkaListenerContainerFactory<String, Minifig> factory = new ConcurrentKafkaListenerContainerFactory<>();
        factory.setConsumerFactory(defaultFactory);
        factory.setCommonErrorHandler(new DefaultErrorHandler());

        return factory;
    }
}

The KafkaListenerContainerFactory interface is responsible for creating the listener container for a particular endpoint. The @EnableKafka annotation on the configuration class enables the detection of @KafkaListener annotations on any Spring-managed beans in the container. Thus, the actual listener (the message consumer) is developed next.
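Before moving to the listener, and purely for completeness, here is a sketch of how a producer component might publish a Minifig as a JSON string through the KafkaTemplate configured above. The class name MinifigProducer is hypothetical and not part of the original project, and it assumes Jackson's ObjectMapper (com.fasterxml.jackson.databind) is available on the classpath:

Java
@Service
public class MinifigProducer {

    private final KafkaTemplate<String, String> kafkaTemplate;
    private final ObjectMapper objectMapper = new ObjectMapper();

    @Value("${topic.minifig}")
    private String topic;

    public MinifigProducer(KafkaTemplate<String, String> kafkaTemplate) {
        this.kafkaTemplate = kafkaTemplate;
    }

    public void send(Minifig minifig) throws JsonProcessingException {
        // the payload is marshaled to JSON manually because the template is configured with a StringSerializer
        kafkaTemplate.send(topic, objectMapper.writeValueAsString(minifig));
    }
}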
Java
@Component
public class MinifigListener {

    private static final Logger LOG = LoggerFactory.getLogger(MinifigListener.class);

    @KafkaListener(topics = "${topic.minifig}", groupId = "${topic.minifig.group.id}")
    public void onReceive(@Payload Minifig minifig) {
        LOG.info("New minifig received - {}.", minifig);
    }
}

Its functionality is trivial: it only logs the messages read from the minifig topic, destined for the configured consumer group. If the application is started, provided the message broker is up, the listener is ready to receive messages.

Plain Text
INFO 10008 --- [main] o.a.k.clients.consumer.KafkaConsumer : [Consumer clientId=consumer-group-0-1, groupId=group-0] Subscribed to topic(s): minifig
INFO 10008 --- [ntainer#0-0-C-1] org.apache.kafka.clients.Metadata : [Consumer clientId=consumer-group-0-1, groupId=group-0] Cluster ID: redpanda.581f9a24-3402-4a17-af28-63353a602421
INFO 10008 --- [ntainer#0-0-C-1] o.a.k.c.c.internals.ConsumerCoordinator : [Consumer clientId=consumer-group-0-1, groupId=group-0] Discovered group coordinator localhost:19092 (id: 2147483647 rack: null)
INFO 10008 --- [ntainer#0-0-C-1] o.a.k.c.c.internals.ConsumerCoordinator : [Consumer clientId=consumer-group-0-1, groupId=group-0] Found no committed offset for partition minifig-0
INFO 10008 --- [ntainer#0-0-C-1] o.s.k.l.KafkaMessageListenerContainer : group-0: partitions assigned: [minifig-0]

In order to check the integration, the following simple test is used. Since a Minifig is expected by the listener, a compliant JSON template was created for convenience.

Java
@SpringBootTest
class AppTest {

    @Autowired
    private KafkaTemplate<String, String> kafkaTemplate;

    @Value("${topic.minifig}")
    private String topic;

    final String template = "{" +
            "\"id\":\"%s\"," +
            "\"size\":\"%s\"," +
            "\"name\":\"%s\"" +
            "}";

    @Test
    void send_compliant() {
        final String minifig = String.format(template, UUID.randomUUID(), Minifig.Size.SMALL, "Spider-Man");

        kafkaTemplate.send(topic, minifig);
    }
}

When the test is run, a "compliant" message is sent to the broker and, as expected, it is successfully picked up by the local consumer.

Plain Text
INFO 10008 --- [ntainer#0-0-C-1] c.h.e.listener.MinifigListener : New minifig received - Minifig[id=0c75b9e4-511a-48b3-a984-404d2fc1d47b, size=SMALL, name=Spider-Man].

Redpanda Console can be helpful in observing what is happening at the broker level, particularly what is flowing through the minifig topic. In scenarios such as the one above, messages are sent from the producer to the consumer via the message broker, as planned.

Recover on Deserialization Failures

In the particular case of this proof of concept, it is assumed the size of a mini-figure can be SMALL, MEDIUM, or BIG, in line with the defined Size enum. If the producer sends a mini-figure of an unknown size – one that deviates a bit from the agreed contract – the messages are basically rejected by the listener, as the payload cannot be deserialized. To simulate this, the following test is run.

Java
@SpringBootTest
class AppTest {

    @Autowired
    private KafkaTemplate<String, String> kafkaTemplate;

    @Value("${topic.minifig}")
    private String topic;

    final String template = "{" +
            "\"id\":\"%s\"," +
            "\"size\":\"%s\"," +
            "\"name\":\"%s\"" +
            "}";

    @Test
    void send_non_compliant() {
        final String minifig = String.format(template, UUID.randomUUID(), "Unknown", "Spider-Man");

        kafkaTemplate.send(topic, minifig);
    }
}

The message reaches the topic, but not the MinifigListener#onReceive() method.
As expected, the error appears when the payload is unmarshaled, and the cause can be identified by examining the stack trace.

Plain Text
Caused by: org.apache.kafka.common.errors.SerializationException: Can't deserialize data from topic [minifig]
Caused by: com.fasterxml.jackson.databind.exc.InvalidFormatException: Cannot deserialize value of type `com.hcd.errhandlerdeserializer.domain.Minifig$Size` from String "Unknown": not one of the values accepted for Enum class: [BIG, MEDIUM, SMALL]
at [Source: (byte[])"{"id":"fbc86874-55ac-4313-bbbb-0ed99341825a","size":"Unknown","name":"Spider-Man"}"; line: 1, column: 53] (through reference chain: com.hcd.errhandlerdeserializer.domain.Minifig["size"])

Another unfortunate aspect, at least from the consumer's point of view, is that the same message is retried continuously, so the error logs keep accumulating.

In order to get past such situations, the JsonDeserializer used for unmarshaling the payload value is wrapped in an ErrorHandlingDeserializer, acting as its actual delegate. Moreover, the ErrorHandlingDeserializer has a failedDeserializationFunction member that, according to its JavaDoc, provides an alternative mechanism when the deserialization fails. The new consumer configuration looks as below:

Java
@Bean
public ConcurrentKafkaListenerContainerFactory<String, Minifig> kafkaListenerContainerFactory() {
    Map<String, Object> props = new HashMap<>();
    props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, brokerUrl);
    props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
    props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, JsonDeserializer.class);
    props.put(ConsumerConfig.GROUP_ID_CONFIG, groupId);
    props.put(JsonDeserializer.TRUSTED_PACKAGES, Minifig.class.getPackageName());
    props.put(JsonDeserializer.VALUE_DEFAULT_TYPE, Minifig.class.getName());

    JsonDeserializer<Minifig> jsonDeserializer = new JsonDeserializer<>(Minifig.class);

    ErrorHandlingDeserializer<Minifig> valueDeserializer = new ErrorHandlingDeserializer<>(jsonDeserializer);
    valueDeserializer.setFailedDeserializationFunction(new MinifigFailedDeserializationFunction());

    DefaultKafkaConsumerFactory<String, Minifig> defaultFactory = new DefaultKafkaConsumerFactory<>(props,
            new StringDeserializer(), valueDeserializer);

    ConcurrentKafkaListenerContainerFactory<String, Minifig> factory = new ConcurrentKafkaListenerContainerFactory<>();
    factory.setConsumerFactory(defaultFactory);
    factory.setCommonErrorHandler(new DefaultErrorHandler());

    return factory;
}

The failedDeserializationFunction used here is simplistic, but its purpose is to prove its utility.

Java
public class MinifigFailedDeserializationFunction implements Function<FailedDeserializationInfo, Minifig> {

    private static final Logger LOG = LoggerFactory.getLogger(MinifigFailedDeserializationFunction.class);

    @Override
    public Minifig apply(FailedDeserializationInfo failedDeserializationInfo) {
        final Exception exception = failedDeserializationInfo.getException();
        LOG.info("Error deserializing minifig - {}", exception.getCause().getMessage());

        return new Minifig("Default");
    }
}

The FailedDeserializationInfo entity (the Function#apply() input) is constructed during the recovery from the deserialization exception, and it encapsulates various pieces of information (here, the exception is the one leveraged). Since the output of the apply() method is the actual deserialization result, one may return either null or whatever is suitable, depending on the aimed behavior.
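As an illustration of that flexibility, a slightly richer recovery function could also inspect the raw payload and the source topic before deciding what to return. The sketch below is not part of the sample project; the class name is hypothetical, and it assumes the FailedDeserializationInfo accessors exposed by spring-kafka (getData(), getTopic(), getException()):

Java
public class LoggingMinifigRecoveryFunction implements Function<FailedDeserializationInfo, Minifig> {

    private static final Logger LOG = LoggerFactory.getLogger(LoggingMinifigRecoveryFunction.class);

    @Override
    public Minifig apply(FailedDeserializationInfo info) {
        // the raw, non-deserializable payload and its source topic are still available here
        String rawPayload = new String(info.getData(), StandardCharsets.UTF_8);
        LOG.warn("Could not deserialize payload '{}' from topic '{}' - {}",
                rawPayload, info.getTopic(), info.getException().getMessage());

        // returning a default keeps the listener flow alive; returning null would skip the record instead
        return new Minifig(Minifig.Size.SMALL, "Default");
    }
}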
If the send_non_compliant() test is run again, the deserialization exception is handled and a default value is returned. Further, the MinifigListener is invoked and has the opportunity to deal with it.

Plain Text
INFO 30160 --- [ntainer#0-0-C-1] e.l.MinifigFailedDeserializationFunction : Error deserializing minifig - Cannot deserialize value of type `com.hcd.errhandlerdeserializer.domain.Minifig$Size` from String "Unknown": not one of the values accepted for Enum class: [BIG, MEDIUM, SMALL]
at [Source: (byte[])"{"id":"f35a77bf-29e5-4f5c-b5de-cc674f22029f","size":"Unknown","name":"Spider-Man"}"; line: 1, column: 53] (through reference chain: com.hcd.errhandlerdeserializer.domain.Minifig["size"])
INFO 30160 --- [ntainer#0-0-C-1] c.h.e.listener.MinifigListener : New minifig received - Minifig[id=null, size=SMALL, name=Undefined].

Conclusion

Configuring Kafka producers and consumers, and fine-tuning them to achieve the desired performance with the chosen message broker, is not always straightforward. Controlling each step of the communication is by all means desirable; moreover, acting fast in unexpected situations helps deliver robust and easy-to-maintain solutions. This post focused on the deserialization issues that might appear at the Kafka consumer level and provided a way of having a backup plan when dealing with non-compliant payloads.

Sample Code
err-handler-deserializer

Resources
Redpanda Quickstart
Spring for Apache Kafka

The featured image was taken at Zoo Brasov, Romania.
Learn how to launch an Apache Kafka cluster with the Apache Kafka Raft (KRaft) consensus protocol and SASL/PLAIN authentication.

PLAIN versus PLAINTEXT: Do not confuse the SASL mechanism PLAIN with the no TLS/SSL encryption option, which is called PLAINTEXT. Configuration parameters such as sasl.enabled.mechanisms or sasl.mechanism.inter.broker.protocol may be configured to use the SASL mechanism PLAIN, whereas security.inter.broker.protocol or listeners may be configured to use the no TLS/SSL encryption option, SASL_PLAINTEXT.

Prerequisites

An understanding of Apache Kafka, Kubernetes, and Minikube.

The following steps were initially taken on a MacBook Pro with 32GB memory running MacOS Ventura v13.4. Make sure to have the following applications installed:

Docker v23.0.5
Minikube v1.29.0 (running K8s v1.26.1 internally)

It's possible the steps below will work with different versions of the above tools, but if you run into unexpected issues, you'll want to ensure you have identical versions. Minikube was chosen for this exercise due to its focus on local development.

Deployment Components

We need to configure brokers and clients to use SASL authentication. Refer to the Kafka Broker and Controller Configurations for Confluent Platform page for a detailed explanation of the configurations used here.

The deployment we will create will have the following components:

Namespace: kafka. This is the namespace within which all components will be scoped.
Service Account: kafka. Service accounts are used to control permissions and access to resources within the cluster.
Headless Service: kafka-headless. It exposes port 9092 (for SASL_PLAINTEXT communication).
StatefulSet: kafka. It manages Kafka pods and ensures they have stable hostnames and storage.

The source code for this deployment can be found in this GitHub repository.

Broker

Enable the SASL/PLAIN mechanism in the server.properties file of every broker.

YAML
# List of enabled mechanisms, can be more than one
- name: KAFKA_SASL_ENABLED_MECHANISMS
  value: PLAIN

# Specify one of the SASL mechanisms
- name: KAFKA_SASL_MECHANISM_INTER_BROKER_PROTOCOL
  value: PLAIN

Tell the Kafka brokers on which ports to listen for client and interbroker SASL connections. Configure listeners and advertised.listeners:

YAML
- command:
  ...
  export KAFKA_ADVERTISED_LISTENERS=SASL://${POD_NAME}.kafka-headless.kafka.svc.cluster.local:9092
  ...
  env:
  ...
  - name: KAFKA_LISTENER_SECURITY_PROTOCOL_MAP
    value: "CONTROLLER:PLAINTEXT,SASL:SASL_PLAINTEXT"
  - name: KAFKA_LISTENERS
    value: SASL://0.0.0.0:9092,CONTROLLER://0.0.0.0:29093

Configure JAAS for the Kafka broker listener as follows:

YAML
- name: KAFKA_LISTENER_NAME_SASL_PLAIN_SASL_JAAS_CONFIG
  value: org.apache.kafka.common.security.plain.PlainLoginModule required username="admin" password="admin-secret" user_admin="admin-secret" user_kafkaclient1="kafkaclient1-secret";

Client

Create a ConfigMap based on the sasl_client.properties file:

Shell
kubectl create configmap kafka-client --from-file sasl_client.properties -n kafka
kubectl describe configmaps -n kafka kafka-client

Output:

configmap/kafka-client created
Name: kafka-client
Namespace: kafka
Labels: <none>
Annotations: <none>

Data
====
sasl_client.properties:
----
sasl.mechanism=PLAIN
security.protocol=SASL_PLAINTEXT
sasl.jaas.config=org.apache.kafka.common.security.plain.PlainLoginModule required \
  username="kafkaclient1" \
  password="kafkaclient1-secret";

BinaryData
====

Events: <none>

Mount the ConfigMap as a volume:

YAML
...
volumeMounts:
  - mountPath: /etc/kafka/secrets/
    name: kafka-client
...
volumes:
  - name: kafka-client
    configMap:
      name: kafka-client

Creating the Deployment

Clone the repo:

Shell
git clone https://github.com/rafaelmnatali/kafka-k8s.git
cd ssl

Deploy Kafka using the following commands:

Shell
kubectl apply -f 00-namespace.yaml
kubectl apply -f 01-kafka-local.yaml

Verify Communication Across Brokers

There should now be three Kafka brokers, each running on separate pods within your cluster. Name resolution for the headless service and the three pods within the StatefulSet is automatically configured by Kubernetes as they are created, allowing for communication across brokers. See the related documentation for more details on this feature.

You can check the first pod's logs with the following command:

Shell
kubectl logs kafka-0

The name resolution of the three pods can take more time to work than it takes the pods to start, so you may see UnknownHostException warnings in the pod logs initially:

Plain Text
WARN [RaftManager nodeId=2] Error connecting to node kafka-1.kafka-headless.kafka.svc.cluster.local:29093 (id: 1 rack: null) (org.apache.kafka.clients.NetworkClient) java.net.UnknownHostException: kafka-1.kafka-headless.kafka.svc.cluster.local

But eventually, each pod will successfully resolve the pod hostnames and end with a message stating the broker has been unfenced:

Plain Text
INFO [Controller 0] Unfenced broker: UnfenceBrokerRecord(id=1, epoch=176) (org.apache.kafka.controller.ClusterControlManager)

Create a Topic Using the SASL_PLAINTEXT Endpoint

The Kafka StatefulSet should now be up and running successfully. Now we can create a topic using the SASL_PLAINTEXT endpoint.

You can deploy the Kafka client using the following command:

Shell
kubectl apply -f 02-kafka-client.yaml

Check if the pod is running:

Shell
kubectl get pods

Output:

NAME       READY  STATUS   RESTARTS  AGE
kafka-cli  1/1    Running  0         12m

Connect to the kafka-cli pod:

Shell
kubectl exec -it kafka-cli -- bash

Create a topic named test-sasl with three partitions and a replication factor of 3:

Shell
kafka-topics --create --topic test-sasl --partitions 3 --replication-factor 3 --bootstrap-server ${BOOTSTRAP_SERVER} --command-config /etc/kafka/secrets/sasl_client.properties
Created topic test-sasl.

The environment variable BOOTSTRAP_SERVER contains the list of brokers, so we do not need to type them out each time.

List all the topics in Kafka:

Shell
kafka-topics --bootstrap-server ${BOOTSTRAP_SERVER} -list --command-config /etc/kafka/secrets/sasl_client.properties
test
test-sasl
test-ssl
test-test

Summary and Next Steps

This tutorial showed you how to get Kafka running in KRaft mode on a Kubernetes cluster with SASL authentication. This is another step toward securing communication between clients and brokers, in addition to the SSL encryption discussed in this article. I invite you to keep studying and investigating how to improve security in your environment.
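As a companion to the sasl_client.properties file above, the same SASL settings can also be supplied programmatically by a Java client. The snippet below is only an illustrative sketch: it assumes the standard kafka-clients library and reuses the in-cluster bootstrap address derived from the advertised listener configured earlier.

Java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class SaslPlainProducerExample {

    public static void main(String[] args) {
        Properties props = new Properties();
        // one of the advertised SASL listeners inside the cluster
        props.put("bootstrap.servers", "kafka-0.kafka-headless.kafka.svc.cluster.local:9092");
        props.put("security.protocol", "SASL_PLAINTEXT");
        props.put("sasl.mechanism", "PLAIN");
        props.put("sasl.jaas.config",
                "org.apache.kafka.common.security.plain.PlainLoginModule required "
                        + "username=\"kafkaclient1\" password=\"kafkaclient1-secret\";");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        // the producer is closed (and flushed) automatically by try-with-resources
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("test-sasl", "key", "hello over SASL/PLAIN"));
        }
    }
}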
Large-scale cluster management has become a critical challenge for organizations in the age of big data and cloud computing. To address this issue, Apache Mesos emerged as a powerful open-source platform that simplifies the management of distributed systems. Mesos allows businesses to abstract and pool computing resources, allowing for dynamic resource allocation and efficient utilization across multiple applications and frameworks. In this article, we will delve into the features, benefits, and use cases of Mesos, exploring how it revolutionizes cluster management. What Is Mesos? Apache Mesos is an open-source distributed systems kernel that provides a scalable and efficient platform for managing and running applications on large-scale clusters. It acts as a middle layer between the hardware infrastructure and the applications, abstracting and pooling computing resources to be dynamically allocated to different applications and frameworks. Mesos enables organizations to achieve high resource utilization, fault tolerance, and scalability while simplifying the management of distributed systems. Mesos was initially developed at the University of California, Berkeley, and later became an Apache Software Foundation project. It is designed to handle thousands of nodes in a cluster and is widely used by major companies like Twitter, Airbnb, Apple, and Netflix. At its core, Mesos allows multiple applications to run concurrently on the same cluster, providing resource isolation and sharing capabilities. It abstracts the cluster’s resources, including CPU, memory, storage, and network, into a unified resource pool. Applications can request resources from Mesos, and it dynamically allocates them based on the availability and requirements of each application. One of the key features of Mesos is its scalability and fault tolerance. It achieves scalability by utilizing a master-slave architecture, where the cluster is managed by one or more master nodes, and the actual tasks are executed on slave nodes. Multiple masters can be run simultaneously, ensuring high availability and fault tolerance. If a master fails, another master is automatically elected to take over its responsibilities. Mesos also offers dynamic resource allocation, which allows applications to accept or decline resource offers based on their needs. This flexibility enables efficient utilization of the cluster by adapting to changing workloads and optimizing resource allocation. Mesos supports various frameworks, such as Apache Hadoop, Apache Spark, Kubernetes, and more, allowing organizations to run different types of applications simultaneously on the same infrastructure. Furthermore, Mesos provides a rich set of APIs and interfaces for developers to build and integrate their own frameworks and applications. This extensibility allows customization and integration with different tools and technologies, making it a versatile platform for various use cases. Overall, Apache Mesos simplifies the management of large-scale clusters, improves resource utilization, and provides fault tolerance and scalability. It has gained popularity in the industry due to its ability to efficiently run diverse workloads, making it an essential tool for organizations operating in the era of big data and cloud computing. Key Features of Mesos Resource Sharing and Isolation: Mesos allows multiple applications and frameworks to share the same cluster while providing resource isolation. 
It abstracts and pools computing resources, such as CPU, memory, storage, and network, making them available for dynamic allocation to different applications. This enables efficient utilization of resources and prevents one application from impacting the performance of others. Scalability and Fault-Tolerance: Mesos is designed to handle large-scale clusters with thousands of nodes. It employs a master-slave architecture, where multiple master nodes manage the cluster and coordinate resource allocation. In case of failures, Mesos automatically elects a new leader, ensuring fault tolerance and high availability. The system scales horizontally by adding more slave nodes to the cluster, accommodating growing workloads. Dynamic Resource Allocation: Mesos uses resource offers to allocate resources to applications and frameworks. Applications receive offers containing available resources, and they can accept or decline the offers based on their requirements. This dynamic allocation allows for efficient utilization of resources, as applications can adapt to workload changes and only utilize resources when needed. Flexible Framework Support: Mesos provides an extensible framework API, allowing developers to build and integrate their own frameworks for specific use cases. It supports a wide range of frameworks, including popular ones like Apache Hadoop, Apache Spark, and Kubernetes. This flexibility enables organizations to leverage existing frameworks or develop custom ones, depending on their requirements. Fine-Grained Resource Allocation: Mesos allows for fine-grained resource allocation by specifying resource constraints and guarantees. Applications can request specific amounts of CPU, memory, and other resources, ensuring that they receive the necessary resources for their execution. This fine-grained control enables efficient resource utilization and allocation based on application requirements. Containerization Support: Mesos integrates well with containerization technologies such as Docker, enabling the deployment and management of containerized applications. It provides seamless integration with container orchestration platforms like Kubernetes, allowing organizations to leverage the benefits of containerization while benefiting from Mesos’ resource management capabilities. Health Monitoring and Fault Recovery: Mesos monitors the health of applications and automatically recovers from failures. It detects failed tasks or applications and reschedules them on healthy nodes, ensuring high availability and preventing data loss. This built-in fault recovery mechanism reduces downtime and improves the robustness of the system. Web-based User Interface and APIs: Mesos offers a web-based user interface that provides visibility into the cluster’s status, resource allocation, and application performance. It also exposes APIs for programmatic access, allowing developers to interact with Mesos programmatically and integrate it into their own systems and tools. Benefits of Mesos Efficient Resource Utilization: Mesos enables organizations to maximize the utilization of their computing resources. By abstracting and pooling resources, it eliminates resource silos and allows multiple applications and frameworks to share the same cluster. This results in better utilization of CPU, memory, storage, and network resources, reducing idle capacity and optimizing infrastructure costs. Simplified Cluster Management: Mesos provides a unified interface for managing applications and frameworks across the cluster. 
It abstracts the underlying infrastructure complexity, allowing administrators to focus on higher-level management tasks. With Mesos, organizations can easily deploy, monitor, and scale applications without the need for manual intervention on individual machines or nodes. Improved Fault Tolerance: Mesos is designed to handle failures gracefully. It employs a master-slave architecture with multiple master nodes, ensuring high availability. In the event of a master node failure, a new leader is automatically elected to take over its responsibilities. Additionally, Mesos monitors the health of applications and automatically recovers failed tasks or applications, minimizing downtime and improving the overall system’s robustness. Scalability and Elasticity: Mesos scales horizontally, allowing organizations to seamlessly expand their cluster as their workload and resource requirements grow. It supports adding more slave nodes to the cluster, providing scalability and elasticity to accommodate increasing demands. This scalability ensures that the cluster can handle large-scale workloads without compromising performance or resource allocation efficiency. Dynamic Workload Management: Mesos offers dynamic resource allocation and scheduling, allowing applications to adapt to changing workloads. Applications can accept or decline resource offers based on their requirements, enabling fine-grained control over resource allocation. This dynamic workload management ensures efficient resource utilization, as resources can be allocated where they are most needed at any given time. Flexibility with Frameworks: Mesos supports a wide range of frameworks, including popular ones like Apache Hadoop, Apache Spark, and Kubernetes. This flexibility allows organizations to choose the frameworks that best fit their specific requirements and seamlessly integrate them into the Mesos ecosystem. It also enables organizations to develop and integrate their own custom frameworks for specialized use cases. Community and Ecosystem: Mesos has a vibrant open-source community and a growing ecosystem of tools and frameworks built on top of it. This active community ensures continuous development, support, and improvement of Mesos. It also provides access to a wide range of resources, documentation, and best practices, making it easier for organizations to adopt and leverage Mesos for their cluster management needs. Conclusion Apache Mesos has emerged as a game-changer in the field of cluster management, offering a unified and scalable platform for running diverse applications and frameworks. Mesos makes managing large-scale clusters simpler by maximizing resource utilization and enhancing operational effectiveness through its resource sharing, fault tolerance, and dynamic allocation capabilities. As organizations continue to embrace the era of big data and cloud computing, Mesos proves to be an invaluable tool for streamlining distributed systems and enabling the seamless execution of diverse workloads. Mesos offers a robust set of features for managing distributed systems and huge clusters overall. Its resource sharing and isolation capabilities, scalability, fault tolerance, dynamic resource allocation, and support for frameworks and containers make it an attractive choice for organizations seeking efficient and flexible cluster management solutions. 
In conclusion, Mesos offers a number of advantages, including efficient resource management, streamlined cluster management, improved fault tolerance, scalability, dynamic workload management, flexibility with frameworks, and access to a thriving community and ecosystem. Because of these benefits, organizations looking to manage large-scale clusters and run diverse workloads more effectively should consider Mesos.
In the dynamic landscape of the Internet of Things (IoT), the convergence of Big Data and IoT software is both a boon and a puzzle for developers. The promise of harnessing vast volumes of real-time data from IoT devices to drive intelligent decision-making is enticing, but it comes with its share of complexities. Challenges such as managing data volume, ensuring quality and reliability, handling complexity, operating in distributed environments, and maintaining robust security demand innovative solutions. Fortunately, Big Data tools and technologies offer a lifeline, empowering developers to conquer these challenges and unlock the transformative potential of IoT applications. Challenges and Solutions in Integrating Big Data with IoT Software Integrating Big Data with IoT software offers immense potential, but it also comes with its set of challenges. Here, we'll discuss some of the key challenges and how Big Data tools and technologies can provide solutions. Challenge 1: Data Volume and Velocity One of the foremost challenges in the IoT landscape is the sheer volume and velocity of data generated by devices. IoT sensors and devices continuously produce data streams, overwhelming traditional data processing systems. Big Data technologies like Apache Kafka and Apache Flink provide solutions by offering real-time data stream processing capabilities. These tools can ingest, process, and store large volumes of data streams in a distributed and scalable manner. Solution 1: Real-time Stream Processing Tools like Apache Kafka and Apache Spark Streaming enable real-time stream processing, allowing IoT applications to analyze data as it's generated. This capability ensures timely responses to critical events and enhances the efficiency of IoT systems. Challenge 2: Data Quality and Veracity IoT data can be noisy and unreliable due to various factors, including sensor errors, connectivity issues, and environmental conditions. To ensure data quality, Big Data tools incorporate data validation and cleansing techniques. They can filter out outliers, correct missing values, and enhance data quality, making it more reliable for analysis. Solution 2: Data Validation and Cleansing Big Data platforms offer data cleansing and validation libraries that can be integrated into IoT pipelines. These libraries help identify and address data quality issues, ensuring that only accurate and relevant data is used for analysis. Challenge 3: Data Complexity IoT data is often complex, with diverse formats and structures. Big Data platforms, such as Hadoop and Spark, provide versatile storage and processing capabilities. They can handle both structured and unstructured data, making it easier for developers to work with heterogeneous data sources from IoT devices. Solution 3: Data Transformation and Integration Big Data frameworks support data transformation and integration processes. Developers can use tools like Apache Nifi or Talend to preprocess and correlate data from multiple IoT devices, creating a cohesive dataset for analysis. Challenge 4: Distributed IoT Applications Many IoT devices operate in remote or resource-constrained environments with limited network connectivity. Big Data tools offer solutions by supporting edge computing, where data processing occurs closer to the source. This approach reduces the need for constant data transmission to centralized servers, minimizing latency and bandwidth usage. 
Solution 4: Edge Computing Edge computing solutions like AWS Greengrass and Azure IoT Edge bring processing capabilities closer to IoT devices. This reduces latency, minimizes bandwidth usage, and enhances the overall performance of IoT applications. Challenge 5: Security and Privacy Security is a paramount concern in IoT applications. Big Data tools come equipped with robust security features, including data encryption, access controls, and authentication mechanisms. By implementing these security measures, IoT software developers can safeguard sensitive data generated by devices and protect against unauthorized access. Solution 5: Enhanced Security Leveraging the security features of Big Data platforms, IoT software developers can implement encryption, access controls, and authentication mechanisms to safeguard IoT data and prevent security breaches. As we navigate the intricate intersection of Big Data and IoT software, it becomes clear that challenges can be surmounted with the right tools and strategies in place. Real-time stream processing, data validation and cleansing, data transformation, edge computing, and robust security measures provided by Big Data platforms all play pivotal roles in transforming these challenges into opportunities. With these solutions at their disposal, IoT software developers can create applications that thrive in the face of massive data streams and contribute to the evolution of smarter, more efficient, and secure IoT ecosystems. As the synergy between Big Data and IoT continues to evolve, innovation in IoT software development is poised to reach new heights.
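To make Solutions 1 and 2 above a bit more concrete, the sketch below uses Kafka Streams – one of several viable options alongside Apache Flink and Spark Streaming – to validate temperature readings in real time, dropping blank or out-of-range values before they reach downstream analytics. The topic names, thresholds, and payload format are illustrative assumptions, not a prescribed design:

Java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class SensorValidationApp {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "iot-sensor-validation");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> readings = builder.stream("sensor-readings");

        readings
                // drop empty payloads produced by flaky sensors or connectivity glitches
                .filter((sensorId, value) -> value != null && !value.isBlank())
                // drop readings that are clearly outside a physically plausible range
                .filter((sensorId, value) -> {
                    try {
                        double temperature = Double.parseDouble(value);
                        return temperature > -50 && temperature < 150;
                    } catch (NumberFormatException e) {
                        return false; // malformed reading, treat as noise
                    }
                })
                .to("sensor-readings-clean");

        new KafkaStreams(builder.build(), props).start();
    }
}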
Apache Kafka, a distributed event streaming technology, can process trillions of events each day, demonstrating tremendous throughput and low latency. That track record has built trust: over 80% of Fortune 100 businesses use and rely on Kafka, and thousands of companies around the globe use it today to build high-performance data pipelines, streaming analytics, data integration, and more.

By leveraging the zero-copy principle, Kafka improves the efficiency of data transfer. In short, the zero-copy technique is employed to prevent the CPU from being used for copying data across memory regions; it removes pointless data copies, conserving memory bandwidth and CPU cycles.

Broadly, the zero-copy principle in Apache Kafka refers to a technique used to improve the efficiency of data transfer between producers and consumers by minimizing the amount of data copying performed by the operating system. By minimizing the CPU and memory overhead involved in copying data across buffers, the zero-copy technique can be very helpful for high-throughput, low-latency systems like Kafka.

Without the zero-copy principle – that is, with the traditional approach – when data is produced by a Kafka producer, irrespective of the programming language, it is typically stored in a buffer (e.g., a byte array). When this data needs to be sent to the Kafka broker for storage in a designated topic, it is copied from the producer's buffer to a network buffer managed by the operating system. The operating system then copies the data from its buffer to the network interface for transmission.

With the zero-copy approach, the operating system leverages features such as scatter-gather I/O or memory-mapped files to avoid unnecessary data copying. Instead of copying the data multiple times between different buffers, the operating system allows the data to be read directly from the source buffer and written directly to the network interface, or vice versa. This eliminates the need for intermediary buffers and reduces the CPU and memory overhead associated with copying large amounts of data.

The advantages of Apache Kafka's zero-copy principle include:

Reduced CPU usage: By minimizing needless data copying, the zero-copy technique lowers the CPU usage related to data transfer activities. This is essential for achieving high-throughput, low-latency performance with Kafka.
Improved latency: Minimizing data copying can lead to lower latency, as data can be transmitted more quickly without the additional overhead of copying between buffers.
Efficient use of system resources: In situations with large volumes of data and strict performance requirements, Kafka can utilize system resources more effectively thanks to the zero-copy principle.

It's crucial to remember that the effectiveness of zero-copy depends on the capabilities of the underlying operating system and network hardware. By utilizing the zero-copy principle, Kafka maximizes data flow between producers and brokers, which enhances the overall effectiveness and performance of Apache Kafka as a distributed event streaming system.

I hope you have enjoyed this read. Please like and share if you find this composition valuable.
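As a closing illustration of the mechanism Kafka builds on, the standalone sketch below (not Kafka code) uses Java's FileChannel#transferTo, which on Linux maps to the sendfile system call – the same OS facility Kafka relies on when transferring log segment data to the network without routing it through user-space buffers:

Java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.channels.FileChannel;
import java.nio.channels.SocketChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class ZeroCopyFileSender {

    // Streams a file to a socket without copying its bytes through user-space buffers.
    public static void send(Path file, String host, int port) throws IOException {
        try (FileChannel source = FileChannel.open(file, StandardOpenOption.READ);
             SocketChannel target = SocketChannel.open(new InetSocketAddress(host, port))) {

            long position = 0;
            long remaining = source.size();
            while (remaining > 0) {
                // transferTo asks the kernel to move bytes directly from the page cache to the socket
                long transferred = source.transferTo(position, remaining, target);
                position += transferred;
                remaining -= transferred;
            }
        }
    }
}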
In the world of data-driven decision-making, ETL (Extract, Transform, Load) processes play a pivotal role. The effective management and transformation of data are essential to ensure that businesses can make informed choices based on accurate and relevant information. Data lakes have emerged as a powerful way to store and analyze massive amounts of data, and Apache NiFi is a robust tool for streamlining ETL processes in a data lake environment. Understanding Data Lake ETL Before diving into Apache NiFi, let's clarify what ETL means in the context of data lakes. Data Lakes: What Are They? Data lakes are repositories for storing vast amounts of structured and unstructured data. Unlike traditional databases, data lakes do not require data to be pre-structured before it's stored. This makes data lakes suitable for storing raw, diverse data, which can then be processed and analyzed as needed. ETL in Data Lakes ETL stands for Extract, Transform, Load. It's a process that involves: Extracting data from various sources Transforming the data to make it suitable for analysis Loading the transformed data into the data lake ETL is crucial for ensuring that the data in the data lake is clean, consistent, and ready for analysis. Challenges in Data Lake ETL Handling ETL processes in a data lake can be challenging for several reasons: Data variety: Data lakes store different data types, including structured and unstructured data, which must be transformed and processed differently. Data volume: Data lakes handle vast amounts of data, often in the petabyte range, making efficient data movement and processing critical. Data velocity: Data is continually ingested into the data lake, and ETL processes must keep up with this fast data flow. Data quality: Ensuring data quality is essential, as poor-quality data can lead to inaccurate insights. Introduction to Apache NiFi Apache NiFi is an open-source data integration tool that provides a powerful and user-friendly way to design data flows. It is well-suited for ETL processes in data lakes due to its flexibility, scalability, and data provenance capabilities. Key Features of Apache NiFi User-friendly interface: NiFi offers a drag-and-drop interface, making it accessible to both technical and non-technical users. Data provenance: NiFi tracks the data's journey from source to destination, allowing you to trace data lineage and monitor data quality. Scalability: NiFi can scale horizontally to handle large data volumes and is designed for high availability. Why Choose Apache NiFi for Data Lake ETL? NiFi's flexibility and versatility make it an excellent choice for data lake ETL: It supports various data sources and destinations, including Hadoop HDFS, AWS S3, Azure Data Lake Store, and many others. Its data transformation capabilities enable you to process data in real-time. Built-in security features ensure that data is protected during the ETL process. Setting up Apache NiFi Let's get started with setting up Apache NiFi for your data lake ETL. 1. Installation You can download Apache NiFi from the official website. Follow the installation instructions for your specific environment, whether it's on-premises or in the cloud. Be sure to meet the system requirements and install any necessary dependencies. 2. Configuration After installation, you'll need to configure NiFi to suit your needs. This involves defining data sources, configuring processors, and setting up connections between components. The NiFi interface is intuitive and user-friendly. 
You'll create a data flow by dragging processors onto the canvas and connecting them to define the flow of data. Building ETL Workflows With NiFi Now, let's explore how to build ETL workflows using Apache NiFi. Creating Data Pipelines To create an ETL workflow in NiFi, follow these steps: Define data sources and destinations. Add processors to perform data extraction, transformation, and loading. Connect processors to define the flow of data. For instance, you can set up a data pipeline that extracts data from an FTP server, transforms it into a structured format, and loads it into your data lake. Data Transformation NiFi provides various processors for data transformation, including: ConvertRecord: Convert data from one format to another. SplitText: Split text data into individual records. MergeContent: Merge multiple records into a single data flow file. By configuring these processors, you can tailor your data transformation to meet your specific ETL requirements. Data Ingestion and Loading NiFi supports a wide range of data destinations. You can easily configure processors to send data to Hadoop HDFS, cloud storage services like AWS S3, databases, or other data lake storage platforms. This flexibility allows you to adapt your ETL processes to your data lake's requirements. Data Lake Integration One of the strengths of Apache NiFi is its seamless integration with various data lake platforms. Hadoop HDFS Integration To integrate NiFi with Hadoop HDFS: Configure the PutHDFS processor to define the destination directory and set up Hadoop connection properties. You can also use the ListHDFS processor to retrieve file listings from HDFS. AWS S3 Integration For integration with AWS S3: Configure the PutS3Object processor to specify the S3 bucket, key, and access credentials. The GetS3Object processor can be used to retrieve objects from S3. Azure Data Lake Store Integration To connect NiFi to Azure Data Lake Store: Configure the PutAzureDataLakeStore processor with your Azure Data Lake Store credentials and target path. Use the FetchAzureDataLakeStore processor to retrieve data from the data lake. This flexibility allows you to seamlessly integrate NiFi with your chosen data lake platform. Monitoring and Management Apache NiFi provides tools for monitoring and managing ETL processes. Data Provenance Data provenance in NiFi is a powerful feature that allows you to track the data's journey. It records all actions on data flow files, helping you trace the origins of your data and identify any issues in your ETL pipeline. Logging and Alerts NiFi offers extensive logging capabilities, which can be essential for troubleshooting. You can set up alerts and notifications to be informed of any errors or issues in your ETL processes. Performance Optimization Optimizing ETL performance is critical for data lake operations. Load Balancing For high data volumes, consider setting up load balancing between multiple NiFi instances. This helps distribute the workload and ensures better performance and fault tolerance. Clustering NiFi can be configured in a clustered setup, providing scalability and high availability. In a cluster, NiFi instances work together to manage data flows and provide redundancy. Resource Allocation Properly allocate system resources (CPU, memory, and network bandwidth) to ensure that NiFi can efficiently process data. Resource allocation ensures that your ETL workflows run smoothly and meet the performance demands of your data lake. 
Security and Data Governance In a data lake environment, security and data governance are paramount. Apache NiFi offers features to ensure data protection and compliance. 1. Data Encryption NiFi supports data encryption both at rest and in transit. You can configure SSL/TLS to secure data while it's being transferred between components, ensuring data confidentiality and integrity. 2. Authentication and Authorization NiFi allows you to set up user authentication and authorization, ensuring that only authorized users can access and modify ETL processes. This is crucial for maintaining data security and compliance with data governance regulations. 3. Data Lineage and Auditing With NiFi's data provenance and auditing features, you can track every action taken on your data. This audit trail helps in compliance with data governance requirements and provides transparency in data management. Real-World Use Cases To illustrate the practical application of Apache NiFi in streamlining data lake ETL, let's explore a couple of real-world use cases. Use Case 1: E-commerce Data Processing Imagine an e-commerce company that collects massive amounts of customer data, including browsing history, purchase records, and customer reviews. This data needs to be ingested into a data lake, transformed into a structured format, and loaded for analysis. By implementing Apache NiFi, the company can create ETL pipelines that extract data from various sources, transform it to meet analysis requirements and load it into their data lake. NiFi's real-time processing capabilities ensure that the latest data is available for analysis. Use Case 2: Financial Services A financial services institution deals with a constant stream of financial transactions, customer records, and market data. It's crucial to efficiently process this data and make it available for risk assessment and compliance reporting. Using Apache NiFi, the institution can create ETL workflows that continuously ingest and process this data. Data is transformed, enriched, and loaded into the data lake, providing real-time insights and ensuring compliance with financial regulations. In both use cases, Apache NiFi's flexibility, scalability, and data lineage features make it an ideal tool for handling complex ETL processes in data lake environments. Conclusion Streamlining ETL processes in a data lake is essential for organizations aiming to leverage their data effectively. Apache NiFi provides a user-friendly, powerful solution for designing and managing data flows, making it a valuable tool for data engineers and analysts. In this practical tutorial, we've covered the fundamentals of data lake ETL, introduced Apache NiFi, and explored its features and benefits. You've learned how to set up NiFi, create ETL workflows, integrate it with data lake platforms, monitor and manage ETL processes, optimize performance, and ensure data security and governance. By following the steps outlined in this tutorial, you can harness the capabilities of Apache NiFi to streamline your data lake ETL processes, making your data more accessible, reliable, and valuable for data-driven decision-making. Whether you're working with a small-scale data lake or managing petabytes of data, Apache NiFi can help you meet the challenges of data lake ETL with confidence.
It’s not easy for data teams working with batch workflows to keep up with today’s real-time requirements. Why? Because the batch workflow – from data delivery and processing to analytics – involves a lot of waiting. There’s waiting for data to be sent to an ETL tool, waiting for data to be processed in bulk, waiting for data to be loaded in a data warehouse, and even waiting for the queries to finish running. But there’s a solution for this from the open source world. Apache Kafka, Flink, and Druid, when used together, create a real-time data architecture that eliminates all these wait states. In this blog post, we’ll explore how the combination of these tools enables a wide range of real-time applications. Architecting Real-Time Applications Kafka-Flink-Druid creates a data architecture that can seamlessly deliver the data freshness, scale, and reliability across the entire data workflow from event to analytics to application. Open-source data architecture for real-time applications. Companies like Lyft, Pinterest, Reddit, and Paytm use the three together because they are each built from complementary stream-native technologies that together handle the full gamut of real-time use cases. This architecture makes it simple to build real-time applications such as observability, IoT/telemetry analytics, security detection/diagnostics, customer-facing insights, and personalized recommendations. Let’s take a closer look at each and how they can be used together. Streaming Pipeline: Apache Kafka Apache Kafka has emerged over the past several years as the de facto standard for streaming data. Prior to it, RabbitMQ, ActiveMQ, and other message queuing systems were used to provide various messaging patterns to distribute data from producers to consumers, but with scale limitations. Fast forward to today, Kafka has become ubiquitous, with at least 80% of the Fortune 100 using it. And it’s because Kafka’s architecture extends well beyond simple messaging. The versatility of its architecture makes Kafka very well suited for streaming at a massive ‘internet’ scale with fault tolerance and data consistency to support mission-critical applications – and its wide range of connectors via Kafka Connect integrate with any data sources. Apache Kafka as the streaming platform for real-time data. Stream Processing: Apache Flink With Kafka delivering real-time data, the right consumers are needed to take advantage of its speed and scale in real time. One of the popular choices is Apache Flink. Why Flink? For starters, Flink’s a high throughput, unified batch and stream processing engine, with its unique strengths lying in its ability to process continuous data streams at scale. Flink is a natural fit as a stream processor for Kafka as it integrates seamlessly and supports exactly-once semantics, guaranteeing that each event is processed exactly once, even with system failures. Simply put, connect to a Kafka topic, define the query logic, and then emit the result continuously – i.e., ‘set it and forget it.’ This makes Flink pretty versatile for use cases where immediate processing of streams and reliability are essential. Here are some of Flink’s common use cases: Enrichment and Transformation If a stream needs to undergo any data manipulation (e.g., modifying, enhancing, or restructuring data) before it can be used, Flink is an ideal engine to make changes or enhancements to those streams as it can keep the data fresh with continuous processing. 
For example, let's say we have an IoT/telemetry use case for processing temperature sensors in a smart building, where each event coming into Kafka has the following JSON structure:

{ "sensor_id": "SensorA", "temperature": 22.5, "timestamp": "2023-07-10T10:00:00" }

If each sensor ID needs to be mapped to a location and the temperature needs to be in Fahrenheit, Flink can update the JSON structure to:

{ "sensor_id": "SensorA", "location": "Room 101", "temperature_Fahrenheit": 73.4, "timestamp": "2023-07-10T10:00:00" }

emitting it directly to an application or sending it back to Kafka.

An illustrative example of Flink's data processing as a structured table for clarity.

An advantage for Flink here is its speed at scale to handle massive Kafka streams in real time. Also, enrichment/transformation is often a stateless process where each data record can be modified without needing to maintain persistent state, making it minimal effort and highly performant too.

Continuous Monitoring and Alerting

The combination of Flink's real-time continuous processing and fault tolerance also makes it an ideal solution for real-time detection and response across various critical applications. When the sensitivity to detection is very high – think sub-second – and the sampling rate is also high, Flink's continuous processing is well suited as a data serving layer for monitoring conditions and triggering alerts and action accordingly.

An advantage for Flink with alerts is that it can support both stateless and stateful alerting. Threshold or event triggers like "notify the fire department when the temperature reaches X" are straightforward but not always intelligent enough. In use cases where the alert needs to be driven by complex patterns that require remembering state – or even aggregating metrics (e.g., sum, avg, min, max, count, etc.) – within a continuous stream of data, Flink can monitor and update state to identify deviations and anomalies.

Something to consider is that using Flink for monitoring and alerting involves continuous CPU usage to evaluate conditions against thresholds and patterns, which is different from, say, a database that only utilizes CPU during query execution. So it's a good idea to understand whether continuous evaluation is actually required.

Real-Time Analytics: Apache Druid

Apache Druid rounds out the data architecture, joining Kafka and Flink as a consumer of streams for powering real-time analytics. While it is a database for analytics, its design center and use are much different than those of other databases and data warehouses.

For starters, Druid is like a brother to Kafka and Flink. It, too, is stream-native. In fact, there is no connector between Kafka and Druid, as it connects directly into Kafka topics, and it supports exactly-once semantics. Druid is also designed for rapid ingestion of streaming data at scale and immediate querying of events in-memory on arrival.

How Apache Druid natively integrates with Apache Kafka for stream ingestion.

On the query side of things, Druid is a high-performance, real-time analytics database that delivers sub-second queries at scale and under load. If the use case is performance-sensitive and requires handling TBs to PBs of data (e.g., aggregations, filters, GroupBys, complex joins, etc.) with high query volume, Druid is an ideal database, as it consistently delivers lightning-fast queries and can easily scale from a single laptop to a cluster of 1000s of nodes.
This is why Druid is known as a real-time analytics database: it's for when real-time data meets real-time queries. Here's how Druid complements Flink: Highly Interactive Queries At its core, engineering teams use Druid to power analytics applications. These are data-intensive applications that include both internal (i.e., operational) and external (i.e., customer-facing) use cases across observability, security, product analytics, IoT/telemetry, manufacturing operations, etc. The applications powered by Druid generally have these characteristics: Performant at scale: Applications that need sub-second read performance on analytics-rich queries against large data sets without pre-computation. Druid is highly performant even if the application's users are arbitrarily grouping, filtering, and slicing/dicing through lots of random queries at TB-PB scale. High query volume: Applications that demand high QPS for analytical queries. An example here would be any external-facing application – i.e., a data product – where sub-second SLAs are needed for workloads producing hundreds to thousands of (different) concurrent queries. Time-series data: Applications that present insights on data with a time dimension (a strength of Druid's, but not a limitation). Druid can process time-series data at scale very quickly because of its time partitioning and data format. This makes time-based WHERE filters incredibly fast. These applications either have a very interactive data visualization / synthesized result-set UI with lots of flexibility in changing the queries on the fly (because Druid is that fast), or, in many cases, they leverage Druid's API for query speed at scale to power a decision workflow. Here's an example of an analytics application powered by Apache Druid. Credit: Confluent – Confluent Health+ dashboard. Confluent, the original creators of Apache Kafka, provide analytics to their customers via Confluent Health+. The application above is highly interactive and packed with insights into their customers' Confluent environments. Under the covers, events are streaming into Kafka and Druid at five million events per second, with the application serving 350 QPS. Real-Time With Historical Data While the example above shows Druid powering a pretty interactive analytics application, you might be wondering, "What's streaming got to do with it?" It's a good question, as Druid is not limited to streaming data. It's very capable of ingesting large batch files as well. But what makes Druid relevant in the real-time data architecture is that it can provide an interactive data experience on real-time data combined with historical data for even richer context. While Flink is great at answering "what is happening now" (i.e., emitting the current status of a Flink job), Druid is technically positioned to answer "what is happening now, how does that compare to before, and what factors/conditions impacted that outcome." Together, these questions are quite powerful: they can, for example, eliminate false positives, help detect new trends, and lead to more insightful real-time decisions. Answering "How does this compare to before?" requires historical context – a day, a week, a year, or other time horizons – for correlation. And "what factors/conditions impacted the outcome" requires mining through a full data set.
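To make the "now versus before" idea a bit more concrete, here is a hedged sketch of what such a comparison could look like through Druid's SQL-over-HTTP API; the router address, the logins datasource, and the exact SQL are assumptions for illustration, not taken from the original example:
Java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class LoginSpikeCheck {
    public static void main(String[] args) throws Exception {
        // Compare login attempts in the last 5 minutes with the same window one week earlier.
        // The "logins" datasource is illustrative; __time is Druid's primary time column.
        String sql = """
            SELECT
              COUNT(*) FILTER (WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '5' MINUTE) AS last_5_min,
              COUNT(*) FILTER (WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '7' DAY - INTERVAL '5' MINUTE
                                 AND __time <  CURRENT_TIMESTAMP - INTERVAL '7' DAY) AS same_window_last_week
            FROM logins
            WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '8' DAY
            """;

        String body = "{\"query\": " + toJsonString(sql) + "}";

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://druid-router:8888/druid/v2/sql")) // assumed router address
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body()); // JSON rows containing both counts
    }

    // Minimal JSON string escaping for the embedded SQL text.
    private static String toJsonString(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"").replace("\n", "\\n") + "\"";
    }
}
A single query along these lines touches both freshly streamed events and older, persisted segments, which is exactly the combination the next paragraph describes.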
As Druid is a real-time analytics database, it ingests streams to give real-time insights, but it also persists data, so it can query historical data and all the other dimensions for ad-hoc exploration, too. How Druid's query engine handles both real-time and historical data. For example, let's say we are building an application that monitors security logins for suspicious behavior. We might want to set a threshold in a five-minute window, i.e., update and emit the state of login attempts. That's easy for Flink. But with Druid, current login attempts can also be correlated with historical data to identify similar login spikes in the past that didn't result in security breaches. The historical context here helps determine whether a present spike is indicative of a problem or just normal behavior. So when you have an application that needs to present a lot of analytics – e.g., current status, a variety of aggregations, grouping, time windows, complex joins – on rapidly changing events, but also needs to provide historical context and let users explore that data set via a highly flexible API, that's Druid's sweet spot. Flink and Druid Checklist Flink and Druid are both built for streaming data. While they share some high-level similarities – both are in-memory, both can scale, and both can parallelize – their architectures are really built for entirely different use cases, as we saw above. Here's a simple workload-based decision checklist: Do you need to transform or join data in real time on streaming data? Look at Flink, as this is its "bread and butter" – it's designed for real-time data processing. Do you need to support many different queries concurrently? Look at Druid, as it supports high-QPS analytics without needing to manage queries/jobs. Do the metrics need to be updated or aggregated continuously? Look at Flink, because it supports stateful complex event processing. Are the analytics more complex, and is historical data needed for comparison? Look at Druid, as it can easily and quickly query real-time data alongside historical data. Are you powering a user-facing application or data visualization? Look at Flink for enrichment, then send that data to Druid as the data serving layer. In most cases, the answer isn't Druid or Flink, but rather Druid and Flink. Each provides technical characteristics that together make them well suited to support a wide range of real-time applications. Conclusion Businesses are increasingly demanding real-time capabilities from their data teams. And that means the data workflow needs to be reconsidered end-to-end. That's why many companies are turning to Kafka-Flink-Druid as the de facto open-source data architecture for building real-time applications.
Learn how to launch an Apache Kafka cluster with the Apache Kafka Raft (KRaft) consensus protocol and SSL encryption. This article is a continuation of my previous article Running Kafka in Kubernetes with KRaft mode. Prerequisites An understanding of Apache Kafka, Kubernetes, and Minikube. The following steps were initially taken on a MacBook Pro with 32GB memory running MacOS Ventura v13.4. Make sure to have the following applications installed: Docker v23.0.5 Minikube v1.29.0 (running K8s v1.26.1 internally) It's possible the steps below will work with different versions of the above tools, but if you run into unexpected issues, you'll want to ensure you have identical versions. Minikube was chosen for this exercise due to its focus on local development. Deployment Components Server Keys and Certificates The first step to enable SSL encryption is to create a public/private key pair for every server. ⚠️ The commands in this section were executed in a Docker container running the image openjdk:11.0.10-jre because it's the same Java version (Java 11) that Confluent runs. With this approach, any possible Java version-related issue is prevented. The next commands were executed following the Confluent Security Tutorial: Shell docker run -it --rm \ --name openjdk \ --mount source=kafka-certs,target=/app \ openjdk:11.0.10-jre Once in the Docker container: Shell keytool -keystore kafka-0.server.keystore.jks -alias kafka-0 -keyalg RSA -genkey Output: Enter keystore password: Re-enter new password: What is your first and last name? [Unknown]: kafka-0.kafka-headless.kafka.svc.cluster.local What is the name of your organizational unit? [Unknown]: test What is the name of your organization? [Unknown]: test What is the name of your City or Locality? [Unknown]: Liverpool What is the name of your State or Province? [Unknown]: Merseyside What is the two-letter country code for this unit? [Unknown]: UK Is CN=kafka-0.kafka-headless.kafka.svc.cluster.local, OU=test, O=test, L=Liverpool, ST=Merseyside, C=UK correct? [no]: yes Repeat the command for each broker: Shell keytool -keystore kafka-1.server.keystore.jks -alias kafka-1 -keyalg RSA -genkey Shell keytool -keystore kafka-2.server.keystore.jks -alias kafka-2 -keyalg RSA -genkey Create Your Own Certificate Authority (CA) Generate a CA, which is simply a public-private key pair and certificate intended to sign other certificates. Shell openssl req -new -x509 -keyout ca-key -out ca-cert -days 90 Output: Generating a RSA private key ...+++++ ........+++++ writing new private key to 'ca-key' Enter PEM pass phrase: Verifying - Enter PEM pass phrase: ----- You are about to be asked to enter information that will be incorporated into your certificate request. What you are about to enter is what is called a Distinguished Name or a DN. There are quite a few fields but you can leave some blank For some fields there will be a default value, If you enter '.', the field will be left blank. ----- Country Name (2 letter code) [AU]:UK State or Province Name (full name) [Some-State]:Merseyside Locality Name (eg, city) []:Liverpool Organization Name (eg, company) [Internet Widgits Pty Ltd]:test Organizational Unit Name (eg, section) []:test Common Name (e.g.
server FQDN or YOUR name) []:*.kafka-headless.kafka.svc.cluster.local Email Address []: Add the generated CA to the clients' truststore so that the clients can trust this CA: Shell keytool -keystore kafka.client.truststore.jks -alias CARoot -importcert -file ca-cert Add the generated CA to the brokers' truststore so that the brokers can trust this CA: Shell keytool -keystore kafka-0.server.truststore.jks -alias CARoot -importcert -file ca-cert keytool -keystore kafka-1.server.truststore.jks -alias CARoot -importcert -file ca-cert keytool -keystore kafka-2.server.truststore.jks -alias CARoot -importcert -file ca-cert Sign the Certificate To sign all certificates in the keystore with the CA that you generated: Export the certificate from the keystore: Shell keytool -keystore kafka-0.server.keystore.jks -alias kafka-0 -certreq -file cert-file-kafka-0 keytool -keystore kafka-1.server.keystore.jks -alias kafka-1 -certreq -file cert-file-kafka-1 keytool -keystore kafka-2.server.keystore.jks -alias kafka-2 -certreq -file cert-file-kafka-2 Sign it with the CA: Shell openssl x509 -req -CA ca-cert -CAkey ca-key -in cert-file-kafka-0 -out cert-signed-kafka-0 -days 90 -CAcreateserial -passin pass:${ca-password} openssl x509 -req -CA ca-cert -CAkey ca-key -in cert-file-kafka-1 -out cert-signed-kafka-1 -days 90 -CAcreateserial -passin pass:${ca-password} openssl x509 -req -CA ca-cert -CAkey ca-key -in cert-file-kafka-2 -out cert-signed-kafka-2 -days 90 -CAcreateserial -passin pass:${ca-password} ⚠️ Don't forget to substitute ${ca-password} Import both the certificate of the CA and the signed certificate into the broker keystore: Shell keytool -keystore kafka-0.server.keystore.jks -alias CARoot -importcert -file ca-cert keytool -keystore kafka-0.server.keystore.jks -alias kafka-0 -importcert -file cert-signed-kafka-0 keytool -keystore kafka-1.server.keystore.jks -alias CARoot -importcert -file ca-cert keytool -keystore kafka-1.server.keystore.jks -alias kafka-1 -importcert -file cert-signed-kafka-1 keytool -keystore kafka-2.server.keystore.jks -alias CARoot -importcert -file ca-cert keytool -keystore kafka-2.server.keystore.jks -alias kafka-2 -importcert -file cert-signed-kafka-2 ⚠️ The keystore and truststore files will be used to create the ConfigMap for our deployment. ConfigMaps Create two ConfigMaps, one for the Kafka broker and another one for the Kafka client. Kafka Broker Create a local folder kafka-ssl and copy the keystore and truststore files into the folder. In addition, create a file broker_creds with the ${ca-password}. Your folder should look similar to this: Shell ls kafka-ssl broker_creds kafka-0.server.truststore.jks kafka-1.server.truststore.jks kafka-2.server.truststore.jks kafka-0.server.keystore.jks kafka-1.server.keystore.jks kafka-2.server.keystore.jks Create the ConfigMap: Shell kubectl create configmap kafka-ssl --from-file kafka-ssl -n kafka kubectl describe configmaps -n kafka kafka-ssl Output: Shell Name: kafka-ssl Namespace: kafka Labels: <none> Annotations: <none> Data ==== broker_creds: ---- <redacted> BinaryData ==== kafka-0.server.keystore.jks: 5001 bytes kafka-0.server.truststore.jks: 1306 bytes kafka-1.server.keystore.jks: 5001 bytes kafka-1.server.truststore.jks: 1306 bytes kafka-2.server.keystore.jks: 5001 bytes kafka-2.server.truststore.jks: 1306 bytes Events: <none> Kafka Client Create a local folder kafka-client and copy the kafka.client.truststore.jks file into the folder.
In addition, create a file broker_creds with the ${ca-password} and a file client_security.properties. Shell #client_security.properties security.protocol=SSL ssl.truststore.location=/etc/kafka/secrets/kafka.client.truststore.jks ssl.truststore.password=<redacted> Your folder should look similar to this: Shell ls kafka-client broker_creds client_security.properties kafka.client.truststore.jks Create the ConfigMap: Shell kubectl create configmap kafka-client --from-file kafka-client -n kafka kubectl describe configmaps -n kafka kafka-client Output: Shell Name: kafka-client Namespace: kafka Labels: <none> Annotations: <none> Data ==== broker_creds: ---- <redacted> client_security.properties: ---- security.protocol=SSL ssl.truststore.location=/etc/kafka/secrets/kafka.client.truststore.jks ssl.truststore.password=test1234 ssl.endpoint.identification.algorithm= BinaryData ==== kafka.client.truststore.jks: 1306 bytes Events: <none> Confluent Kafka This yaml file deploys a Kafka cluster within a Kubernetes namespace named kafka. It defines various Kubernetes resources required for setting up Kafka in a distributed manner. YAML --- apiVersion: v1 kind: ServiceAccount metadata: name: kafka namespace: kafka --- apiVersion: v1 kind: Service metadata: labels: app: kafka name: kafka-headless namespace: kafka spec: clusterIP: None clusterIPs: - None internalTrafficPolicy: Cluster ipFamilies: - IPv4 ipFamilyPolicy: SingleStack ports: - name: tcp-kafka-int port: 9092 protocol: TCP targetPort: tcp-kafka-int - name: tcp-kafka-ssl port: 9093 protocol: TCP targetPort: tcp-kafka-ssl selector: app: kafka sessionAffinity: None type: ClusterIP --- apiVersion: apps/v1 kind: StatefulSet metadata: labels: app: kafka name: kafka namespace: kafka spec: podManagementPolicy: Parallel replicas: 3 revisionHistoryLimit: 10 selector: matchLabels: app: kafka serviceName: kafka-headless template: metadata: labels: app: kafka spec: serviceAccountName: kafka containers: - command: - sh - -exc - | export KAFKA_NODE_ID=${HOSTNAME##*-} && \ export KAFKA_ADVERTISED_LISTENERS=PLAINTEXT://${POD_NAME}.kafka-headless.kafka.svc.cluster.local:9092,SSL://${POD_NAME}.kafka-headless.kafka.svc.cluster.local:9093 export KAFKA_SSL_TRUSTSTORE_FILENAME=${POD_NAME}.server.truststore.jks export KAFKA_SSL_KEYSTORE_FILENAME=${POD_NAME}.server.keystore.jks export KAFKA_OPTS="-Djavax.net.debug=all" exec /etc/confluent/docker/run env: - name: KAFKA_SSL_KEY_CREDENTIALS value: "broker_creds" - name: KAFKA_SSL_KEYSTORE_CREDENTIALS value: "broker_creds" - name: KAFKA_SSL_TRUSTSTORE_CREDENTIALS value: "broker_creds" - name: KAFKA_LISTENER_SECURITY_PROTOCOL_MAP value: "CONTROLLER:PLAINTEXT,PLAINTEXT:PLAINTEXT,SSL:SSL" - name: CLUSTER_ID value: "6PMpHYL9QkeyXRj9Nrp4KA" - name: KAFKA_CONTROLLER_QUORUM_VOTERS value: "0@kafka-0.kafka-headless.kafka.svc.cluster.local:29093,1@kafka-1.kafka-headless.kafka.svc.cluster.local:29093,2@kafka-2.kafka-headless.kafka.svc.cluster.local:29093" - name: KAFKA_PROCESS_ROLES value: "broker,controller" - name: KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR value: "3" - name: KAFKA_NUM_PARTITIONS value: "3" - name: KAFKA_DEFAULT_REPLICATION_FACTOR value: "3" - name: KAFKA_MIN_INSYNC_REPLICAS value: "2" - name: KAFKA_CONTROLLER_LISTENER_NAMES value: "CONTROLLER" - name: KAFKA_LISTENERS value: PLAINTEXT://0.0.0.0:9092,CONTROLLER://0.0.0.0:29093,SSL://0.0.0.0:9093 - name: POD_NAME valueFrom: fieldRef: apiVersion: v1 fieldPath: metadata.name name: kafka image: docker.io/confluentinc/cp-kafka:7.5.0 imagePullPolicy: IfNotPresent livenessProbe: 
failureThreshold: 6 initialDelaySeconds: 60 periodSeconds: 60 successThreshold: 1 tcpSocket: port: tcp-kafka-int timeoutSeconds: 5 ports: - containerPort: 9092 name: tcp-kafka-int protocol: TCP - containerPort: 29093 name: tcp-kafka-ctrl protocol: TCP - containerPort: 9093 name: tcp-kafka-ssl protocol: TCP resources: limits: cpu: "1" memory: 1400Mi requests: cpu: 250m memory: 512Mi securityContext: allowPrivilegeEscalation: false capabilities: drop: - ALL runAsGroup: 1000 runAsUser: 1000 terminationMessagePath: /dev/termination-log terminationMessagePolicy: File volumeMounts: - mountPath: /etc/kafka/secrets/ name: kafka-ssl - mountPath: /etc/kafka name: config - mountPath: /var/lib/kafka/data name: data - mountPath: /var/log name: logs dnsPolicy: ClusterFirst restartPolicy: Always schedulerName: default-scheduler securityContext: fsGroup: 1000 terminationGracePeriodSeconds: 30 volumes: - emptyDir: {} name: config - emptyDir: {} name: logs - name: kafka-ssl configMap: name: kafka-ssl updateStrategy: type: RollingUpdate volumeClaimTemplates: - apiVersion: v1 kind: PersistentVolumeClaim metadata: name: data spec: accessModes: - ReadWriteOnce resources: requests: storage: 10Gi storageClassName: standard volumeMode: Filesystem status: phase: Pending The deployment we will create will have the following components: Namespace: kafka This is the namespace within which all components will be scoped. Service Account: kafka Service accounts are used to control permissions and access to resources within the cluster. Headless Service: kafka-headless It exposes ports 9092 (for PLAINTEXT communication) and 9093 (for SSL traffic). StatefulSet: kafka It manages Kafka pods and ensures they have stable hostnames and storage. The source code for this deployment can be found in this GitHub repository. Specifically for the SSL configuration, the following parameters were configured in the StatefulSet: Configure the truststore, keystore, and password: KAFKA_SSL_KEY_CREDENTIALS KAFKA_SSL_KEYSTORE_CREDENTIALS KAFKA_SSL_TRUSTSTORE_CREDENTIALS Configure the ports for the Kafka brokers to listen for SSL: KAFKA_ADVERTISED_LISTENERS KAFKA_LISTENER_SECURITY_PROTOCOL_MAP KAFKA_LISTENERS Creating the Deployment Clone the repo: git clone https://github.com/rafaelmnatali/kafka-k8s.git cd ssl Deploy Kafka using the following commands: kubectl apply -f 00-namespace.yaml kubectl apply -f 01-kafka-local.yaml Verify Communication Across Brokers There should now be three Kafka brokers, each running on a separate pod within your cluster. Name resolution for the headless service and the three pods within the StatefulSet is automatically configured by Kubernetes as they are created, allowing for communication across brokers. See the related documentation for more details on this feature. You can check the first pod's logs with the following command: kubectl logs kafka-0 The name resolution of the three pods can take more time to work than it takes the pods to start, so you may see UnknownHostException warnings in the pod logs initially: WARN [RaftManager nodeId=2] Error connecting to node kafka-1.kafka-headless.kafka.svc.cluster.local:29093 (id: 1 rack: null) (org.apache.kafka.clients.NetworkClient) java.net.UnknownHostException: kafka-1.kafka-headless.kafka.svc.cluster.local ...
But eventually each pod will successfully resolve the pod hostnames and end with a message stating that the broker has been unfenced: INFO [Controller 0] Unfenced broker: UnfenceBrokerRecord(id=1, epoch=176) (org.apache.kafka.controller.ClusterControlManager) Create a Topic Using the SSL Endpoint The Kafka StatefulSet should now be up and running successfully. Now we can create a topic using the SSL endpoint. You can deploy the Kafka client using the following command: kubectl apply -f 02-kafka-client.yaml Check if the Pod is Running: kubectl get pods Output: NAME READY STATUS RESTARTS AGE kafka-cli 1/1 Running 0 12m Connect to the pod kafka-cli: kubectl exec -it kafka-cli -- bash Create a topic named test-ssl with three partitions and a replication factor of 3: kafka-topics --create --topic test-ssl --partitions 3 --replication-factor 3 --bootstrap-server ${BOOTSTRAP_SERVER} --command-config /etc/kafka/secrets/client_security.properties Created topic test-ssl. The environment variable BOOTSTRAP_SERVER contains the list of brokers, which saves us from typing it out. List all the topics in Kafka: kafka-topics --bootstrap-server kafka-0.kafka-headless.kafka.svc.cluster.local:9093 --list --command-config /etc/kafka/secrets/client_security.properties test test-ssl test-test Summary and Next Steps This tutorial showed you how to get Kafka running in KRaft mode on a Kubernetes cluster with SSL encryption. This is a first step toward securing communication between clients and brokers. I invite you to keep studying and investigating how to improve security in your environment.
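As a quick follow-up exercise (a sketch that is not part of the original tutorial), a minimal Java producer configured with the client truststore created earlier could be used to confirm end-to-end encrypted connectivity to the test-ssl topic; the truststore password shown is a placeholder you would replace with your own:
Java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;
import org.apache.kafka.common.serialization.StringSerializer;

public class SslSmokeTest {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // SSL listener of the first broker; inside the cluster the headless-service DNS name resolves.
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG,
                "kafka-0.kafka-headless.kafka.svc.cluster.local:9093");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // Same settings as client_security.properties; the password here is a placeholder.
        props.put("security.protocol", "SSL");
        props.put("ssl.truststore.location", "/etc/kafka/secrets/kafka.client.truststore.jks");
        props.put("ssl.truststore.password", "changeit");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            RecordMetadata metadata = producer
                    .send(new ProducerRecord<>("test-ssl", "hello", "encrypted in transit"))
                    .get();
            System.out.printf("Written to %s-%d at offset %d%n",
                    metadata.topic(), metadata.partition(), metadata.offset());
        }
    }
}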
Commonly known simply as Kafka, Apache Kafka is an open-source event streaming platform maintained by the Apache Software Foundation. Initially conceived at LinkedIn, Apache Kafka was collaboratively created by Jay Kreps, Neha Narkhede, and Jun Rao and subsequently released as an open-source project in 2011. Today, Kafka is one of the most popular event streaming platforms designed to handle real-time data feeds. It is widely used to build scalable, fault-tolerant, and high-performance streaming data pipelines. Kafka's uses are continually expanding, with the top five cases nicely illustrated by Brij Pandey in the accompanying image. As a brief primer, it is important to understand the components of the Kafka platform and how they work. Kafka works as a distributed event streaming platform designed to handle real-time data feeds efficiently. It operates based on the publish-subscribe messaging model and follows a distributed and fault-tolerant architecture. It maintains a persistent, ordered, and partitioned sequence of records called "topics." Producers write data to these topics, and consumers read from them. This enables decoupling between data producers and consumers and allows multiple applications to consume the same data stream independently. Key components of Kafka include: Topics and Partitions: Kafka organizes data into topics. Each topic is a stream of records, and the data within a topic is split into multiple partitions. Each partition is an ordered, immutable sequence of records. Partitions enable horizontal scalability and parallelism by allowing data to be distributed across multiple Kafka brokers. Producers: Producers are applications that write data to Kafka topics. They publish records to specific topics, which are then stored in the topic's partitions. Producers can send records to a particular partition explicitly or allow Kafka to determine the partition using a partitioning strategy. Consumers: Consumers are applications that read data from Kafka topics. They subscribe to one or more topics and consume records from the partitions they are assigned to. Consumer groups are used to scale consumption, and each partition within a topic can be consumed by only one consumer within a group. This allows multiple consumers to work in parallel to process the data from different partitions of the same topic. Brokers: Kafka runs as a cluster of servers, and each server is called a broker. Brokers are responsible for handling read and write requests from producers and consumers, as well as managing the topic partitions. A Kafka cluster can have multiple brokers to distribute the load and ensure fault tolerance. Replication: To achieve fault tolerance and data durability, Kafka allows configuring replication for topic partitions. Each partition can have multiple replicas, with one replica designated as the leader and the others as followers. The leader replica handles all read and write requests for that partition, while followers replicate the data from the leader to stay in sync. If a broker with a leader replica fails, one of the followers automatically becomes the new leader to ensure continuous operation. Offset Management: Kafka maintains the concept of offsets for each partition. An offset is a unique identifier for a record within a partition. Consumers keep track of their current offset, allowing them to resume consumption from where they left off in case of failure or reprocessing.
ZooKeeper: While not part of Kafka itself, ZooKeeper is often used to manage the metadata and coordinate the brokers in a Kafka cluster. It helps with leader election, maintaining topic and partition information, and coordinating consumer groups. [Note: The ZooKeeper metadata management tool is being phased out in favor of Kafka Raft, or KRaft, a protocol for internally managed metadata.] Overall, Kafka's design and architecture make it a highly scalable, fault-tolerant, and efficient platform for handling large volumes of real-time data streams. It has become a central component in many data-driven applications and data infrastructures, facilitating data integration, event processing, and stream analytics. A typical Kafka architecture would then be as follows: Kafka clustering refers to the practice of running multiple Kafka brokers together as a group to form a Kafka cluster. Clustering is a fundamental aspect of Kafka's architecture, providing several benefits, including scalability, fault tolerance, and high availability. A Kafka cluster is used to handle large-scale data streams and ensure that the system remains operational even in the face of failures. In the cluster, Kafka topics are divided into multiple partitions to achieve scalability and parallelism. Each partition is a linearly ordered, immutable sequence of records. Partitions, therefore, allow data to be distributed across multiple brokers in the cluster. It should be noted that a minimum Kafka cluster consists of three Kafka brokers, each of which can be run on a separate server (virtual or physical). The three-node guidance helps avoid a split-brain scenario in case of a broker failure. (There is a nice article by Dhinesh Sunder Ganapathi that goes into more detail.) Kafka and Kubernetes As more companies adopt Kafka, there is also an increasing interest in deploying Kafka on Kubernetes. In fact, the most recent Kubernetes in the Wild 2023 report by Dynatrace shows that over 40% of large organizations run their open-source messaging platform within Kubernetes, the majority of this being Kafka. The same report also makes a bold claim that "Kubernetes is emerging as the 'operating system' of the cloud." It is imperative, then, for Kafka administrators to understand the interplay between Kafka and Kubernetes and how to implement both appropriately at scale. The Case for Multi-Cluster Kafka Running a Kafka cluster in a single Kubernetes cluster setup is fairly straightforward and, in theory, enables scalability as needed. In production, however, the picture can get a bit murky. We should distinguish the use of the term cluster between Kafka and Kubernetes. A Kubernetes deployment also uses the term cluster to designate a grouping of connected nodes, referred to as a Kubernetes cluster. When the Kafka workload is deployed on Kubernetes, you will end up with a Kafka cluster running inside a Kubernetes cluster, but more relevant to our discussion, you may also have a Kafka cluster that spans multiple Kubernetes clusters - for resiliency, performance, data sovereignty, etc. To begin with, Kafka is not designed for multi-tenant setups. In technical terms, Kafka does not understand concepts such as Kubernetes namespaces or resource isolation. Within a particular topic, there is no easy mechanism to enforce security access restrictions between multiple user groups. Additionally, different workloads may have different update-frequency and scale requirements, e.g., a batch application vs. a real-time application.
Combining the two workloads into a single cluster could cause adverse impacts or consume far more resources than necessary. Data sovereignty and regulatory compliance can also impose restrictions on co-locating data and topics in a specific region or application. Resiliency, of course, is another strong driving force behind the need for multiple Kafka clusters. While Kafka clusters are designed for fault tolerance of topics, we still have to plan for a catastrophic failure of an entire cluster. In such cases, a fully replicated cluster enables proper business continuity planning. Businesses that are migrating workloads to the cloud or have a hybrid cloud strategy may want to set up multiple Kafka clusters and perform a planned workload migration over time rather than a risky full-scale Kafka migration. These are just a few of the reasons why, in practice, enterprises find themselves having to create multiple Kafka clusters that nevertheless need to interact with each other. Multi-Cluster Kafka In order to have multiple Kafka clusters that are connected to each other, key items from one cluster must be replicated to the other cluster(s). These include the topics, offsets, and metadata. In Kafka terms, this duplication is known as mirroring. There are two possible approaches to multi-cluster setups: stretched clusters and connected clusters. Stretched Clusters: Synchronous Replication A stretched cluster is a logical cluster that is 'stretched' across several physical clusters. Topics and replicas are distributed across the physical clusters, but since they are represented as a logical cluster, the applications themselves are not aware of this multiplicity. Stretched clusters have strong consistency and are easier to manage and administer. Since applications are unaware of the existence of multiple clusters, they are easier to deploy on stretched clusters than on connected clusters. The downside of stretched clusters is that they require a synchronous connection between the clusters. They are not ideal for a hybrid cloud deployment and require a quorum of at least three clusters to avoid a 'split-brain' scenario. Connected Clusters: Asynchronous Replication A connected cluster, on the other hand, is deployed by connecting multiple independent clusters. These independent clusters could be running in different regions or cloud platforms and are managed individually. The primary benefit of the connected cluster model is that there is no downtime in case of a cluster failure, since the other clusters run independently. Each cluster can also be optimized for its particular resources. The major downside of connected clusters is that they rely on asynchronous connections between the clusters. Topics that are replicated between the clusters are not 'copy on write' but rather depend on eventual consistency. This can lead to possible data loss during the async mirroring process. Additionally, applications that work across connected clusters have to be modified to be aware of the multiple clusters. Before we address the solution to this conundrum, I'll briefly cover the common tools on the market to enable Kafka cluster connectivity. Open Source Kafka itself ships with a mirroring tool called MirrorMaker. MirrorMaker duplicates topics between different clusters via a built-in producer. This way, data is cross-replicated between clusters with eventual consistency but without interrupting individual processes.
It is important to note that while MirrorMaker is simple in concept, setting it up at scale can be quite a challenge for IT organizations. Managing IP addresses, naming conventions, the number of replicas, etc., must be done correctly, or it could lead to what is known as 'infinite replication,' where a topic is replicated endlessly, eventually leading to a crash. Another downside of MirrorMaker is the lack of dynamic configuration of allowed/disallowed lists for updates. MirrorMaker also does not sync topic properties properly, which makes it an operational headache at scale when adding or removing topics to be replicated. MirrorMaker 2 attempts to fix some of these challenges, but many IT shops still struggle to get it set up correctly. Other open-source tools for Kafka replication include Mirus from Salesforce, uReplicator from Uber, and customized Flink from Netflix. For commercially licensed options, Confluent offers two: Confluent Replicator and Cluster Linking. Confluent Replicator is essentially a Kafka Connect connector that provides a high-performance and resilient way to copy topic data between clusters. Cluster Linking is another offering, developed internally, and is targeted at multi-region replication while preserving topic offsets. Even so, Cluster Linking is an asynchronous replication tool, with data having to cross network boundaries and traverse public traffic pathways. As should be clear by now, Kafka replication is a crucial strategy for production applications at scale; the question is which option to choose. Imaginative Kafka administrators will quickly realize that you may need connected clusters and stretched clusters, or a combination of these deployments, depending on the application performance and resiliency requirements. What is daunting, however, is the exponential challenge of setting up the cluster configurations and managing them at scale across multiple clusters. What is a more elegant way to solve this nightmare? The Answer Is Yes! KubeSlice by Avesha is a simple way to get the best of both worlds. By creating direct service connectivity between clusters or namespaces, KubeSlice obviates the need to manually configure individual connectivity between Kafka clusters. At its core, KubeSlice creates a secure, synchronous Layer 3 network gateway between clusters, isolated at the application or namespace level. Once this is set up, Kafka administrators are free to deploy Kafka brokers in any of the clusters. Each broker has synchronous connectivity to every other broker joined via the slice, even though the brokers themselves may be on separate clusters. This effectively creates a stretched cluster between the brokers and provides the benefits of strong consistency and low administration overhead. Have Your Cake and Eat It Too! For those who may want to deploy MirrorMaker into their clusters, this can be done with minimal effort since the connectivity between the clusters is delegated to KubeSlice. Thus, Kafka applications can have the benefits of synchronous (speed, resiliency) AND asynchronous (independence, scale) replication in the same deployment, with the ability to mix and match the capabilities as needed. This is true of on-prem data centers, across public clouds, or any combination of these in a hybrid setup. KubeSlice is a non-disruptive deployment, meaning that there is no need to uninstall any tool already deployed.
It is simply a matter of establishing a slice and adding the Kafka deployment onto that slice. This blog provided a brief overview of Apache Kafka and touched on some of its more common use cases. We covered the current tools available to scale Kafka deployments across multiple clusters and discussed the advantages and disadvantages of each. Finally, the article introduced KubeSlice, an emerging service connectivity solution that simplifies Kafka multi-cluster deployments and removes the headaches associated with configuring Kafka replication across multiple clusters at scale.
This blog post explores the state of data streaming for the public sector in 2023. The evolution of government digitalization, citizen expectations, and cybersecurity risks requires optimized end-to-end visibility into information, convenient mobile apps, and integration with legacy platforms like the mainframe in conjunction with pioneering technologies like social media. Data streaming provides consistency across all layers and allows integrating and correlating data in real time at any scale. I look at public sector trends to explore how data streaming with Apache Kafka acts as a business enabler, including customer stories from the US Department of Defense (DoD), NASA, Deutsche Bahn (German Railway), and others. A complete slide deck and on-demand video recording are included. General Trends in the Public Sector The public sector covers many different areas. Examples include defense, law enforcement, national security, healthcare, public administration, police, judiciary, finance and tax, research, aerospace, agriculture, etc. Many of these terms and sectors overlap, and many of the use cases apply across several of them. Several disruptive trends impact innovation in the public sector to automate processes, provide a better experience for citizens, and strengthen cybersecurity defense tactics. The two critical pillars across departments in the public sector are IT modernization and data-driven applications. IT Modernization in the Government for 2023 The research company Gartner identified the following technology trends for governments to accelerate digital transformation as they prepare for post-digital government: These trends do not differ much from those in private-sector companies like banks or insurers. Data consistency across monolithic legacy infrastructure and cloud-native applications matters. Accelerating Data Maturity in the Public Sector The public sector is often still slow to innovate. Time-to-market is crucial. IT modernization requires up-to-date technologies and development principles. Data sharing across applications, departments, or states requires a data-driven enterprise architecture. McKinsey & Company says, "Government entities have created real-time pandemic dashboards, conducted geospatial mapping for drawing new public transportation routes, and analyzed public sentiment to inform economic recovery investment. While many of these examples were born out of necessity, public-sector agencies are now embracing the impact that data-driven decision-making can have on residents, employees, and other agencies. Embedding data and analytics at the core of operations can help optimize government resources by targeting them more effectively and enable civil servants to focus their efforts on activities that deliver the greatest results." AI and machine learning help with automation. Chatbots and other conversational AI improve the total experience of citizens and public sector employees. Data Streaming in the Government and Public Sector Real-time data beats slow data in almost all use cases, no matter which agency or department you look at in the government and public sector. Data streaming combines the power of real-time messaging at any scale with storage for true decoupling, data integration, and data correlation capabilities. Apache Kafka is the de facto standard for data streaming. Check out the links below for a broad spectrum of examples and best practices.
Additionally, here are a few new customer stories from the last months. New Customer Stories for Data Streaming in the Public Sector and Government So much innovation is happening worldwide, even in the "slow" public sector. Automation and digitalization change how we search for and buy products and services, communicate with partners and customers, provide hybrid shopping models, and more. More and more governments and non-profit organizations use a cloud-first approach to improve time-to-market, increase flexibility, and focus on business logic instead of operating IT infrastructure. Here are a few customer stories from worldwide organizations in the public sector and government: University of California, San Diego: Integration Platform as a Service (iPaaS) as the "Swiss army knife" of integration U.S. Citizenship and Immigration Services (USCIS): Real-time inter-agency data sharing Deutsche Bahn (German Railway): Customer data platform for real-time notification about delays and cancellations, plus B2B integration with Google Maps NASA: General Coordinates Network (GCN) for multi-messenger astronomy alerts between space- and ground-based observatories, physics experiments, and thousands of astronomers worldwide US Department of Defense (DOD): Joint All Domain Command and Control (JADC2), a strategic warfighting concept that connects the data sensors, shooters, and related communications devices of all U.S. military services; DOD uses the ride-sharing service Uber as an analogy to describe its desired end state for JADC2 leveraging data streaming. Resources To Learn More This blog post is just the starting point. I wrote a blog series exploring why many governments and public infrastructure sectors leverage data streaming for various use cases. Learn about real-world deployments and different architectures for data streaming with Apache Kafka in the public sector: Life is a Stream of Events Smart City Citizen Services Energy and Utilities National Security Learn more about data streaming for the government and public sector in the following on-demand webinar recording, the related slide deck, and further resources, including pretty cool lightboard videos about use cases. I presented with my colleague and SME for the public sector and government, Will LaForest. On-Demand Video Recording and Slides The video recording explores public sector trends and architectures for data streaming leveraging Apache Kafka and other modern, cloud-native technologies. The primary focus is the data streaming case studies. If you prefer learning from slides, check out the deck used for the recording, available at the same link. Case Studies and Lightboard Videos for Data Streaming in the Public Sector and Government The state of data streaming for the public sector in 2023 is fascinating. New use cases and case studies come up every month. Mission-critical deployments at governments in the United States and Germany prove the maturity of data streaming with respect to security and data privacy. The success stories demonstrate better data governance across the entire organization, secure data collection and processing in real time, data sharing and cross-agency partnerships with open APIs for new business models, and many more scenarios. We recorded lightboard videos showing the value of data streaming simply and effectively. These five-minute videos explore the business value of data streaming, related architectures, and customer stories.
Stay tuned; I will update the links in the next few weeks and publish a separate blog post for each story and lightboard video. And this is just the beginning. Every month, we will talk about the status of data streaming in a different industry. Manufacturing was the first. Financial services second, then retail, telcos, gaming, and so on... Let’s connect on LinkedIn and discuss it!