Getting your data modeling wrong is one of the easiest ways to ruin your performance. And it’s especially easy to screw this up when you’re working with NoSQL, which (ironically) tends to be used for the most performance-sensitive workloads. NoSQL data modeling might initially appear quite simple: just model your data to suit your application’s access patterns. But in practice, that’s much easier said than done. Fixing data modeling is no fun, but it’s often a necessary evil. If your data modeling is fundamentally inefficient, your performance will suffer once you scale to some tipping point that varies based on your specific workload and deployment. Even if you adopt the fastest database on the most powerful infrastructure, you won’t be able to tap its full potential unless you get your data modeling right. This article explores three of the most common ways to ruin your NoSQL database performance, along with tips on how to avoid or resolve them. Not Addressing Large Partitions Large partitions commonly emerge as teams scale their distributed databases. Large partitions are partitions that grow too big up to the point when they start introducing performance problems across the cluster’s replicas. One of the questions that we hear often – at least once a month – is, “What constitutes a large partition?” Well, it depends. Some things to consider: Latency expectations: The larger your partition grows, the longer it will take to be retrieved. Consider your page size and the number of client-server round trips needed to fully scan a partition. Average payload size: Larger payloads generally lead to higher latency. They require more server-side processing time for serialization and deserialization and also incur a higher network data transmission overhead. Workload needs: Some workloads organically require larger payloads than others. For instance, I’ve worked with a Web3 blockchain company that would store several transactions as BLOBs under a single key, and every key could easily get past 1 megabyte in size. How you read from these partitions: For example, a time series use case will typically have a timestamp clustering component. In that case, reading from a specific time window will retrieve much less data than if you were to scan the entire partition. The following table illustrates the impact of large partitions under different payload sizes, such as 1, 2, and 4 kilobytes. As you can see, the higher your payload gets under the same row count, the larger your partition is going to be. However, if your use case frequently requires scanning partitions as a whole, then be aware that databases have limits to prevent unbounded memory consumption. For example, ScyllaDB cuts off pages at every 1MB to prevent the system from potentially running out of memory. Other databases (even relational ones) have similar protection mechanisms to prevent an unbounded bad query from starving the database resources. To retrieve a payload size of 4KB and 10K rows with ScyllaDB, you would need to retrieve at least 40 pages to scan the partition with a single query. This may not seem a big deal at first. However, as you scale over time, it could affect your overall client-side tail latency. Another consideration: With databases like ScyllaDB and Cassandra, data written to the database is stored in the commit log and under an in-memory data structure called a "memtable." The commit log is a write-ahead log that is never really read from, except when there’s a server crash or a service interruption. 
Since the memtable lives in memory, it eventually gets full. To free up memory space, the database flushes memtables to disk. That process results in SSTables (sorted strings tables), which is how your data gets persisted. What does all this have to do with large partitions? Well, SSTables have specific components that need to be held in memory when the database starts. This ensures that reads are always efficient and minimizes wasting storage disk I/O when looking for data. When you have extremely large partitions (for example, we recently had a user with a 2.5 terabyte partition in ScyllaDB), these SSTable components introduce heavy memory pressure, therefore shrinking the database’s room for caching and further constraining your latencies. How do you address large partitions via data modeling? Basically, it’s time to rethink your primary key. The primary key determines how your data will be distributed across the cluster, which improves performance as well as resource utilization. A good partition key should have high cardinality and roughly even distribution. For example, a high cardinality attribute like User Name, User ID or Sensor ID might be a good partition key. Something like State would be a bad choice because states like California and Texas are likely to have more data than less populated states such as Wyoming and Vermont. Or consider this example. The following table could be used in a distributed air quality monitoring system with multiple sensors: CQL CREATE TABLE air_quality_data ( sensor_id text, time timestamp, co_ppm int, PRIMARY KEY (sensor_id, time) ); With time being our table’s clustering key, it’s easy to imagine that partitions for each sensor can grow very large, especially if data is gathered every couple of milliseconds. This innocent-looking table can eventually become unusable. In this example, it takes only ~50 days. A standard solution is to amend the data model to reduce the number of clustering keys per partition key. In this case, let’s take a look at the updated `air_quality_data` table: CQL CREATE TABLE air_quality_data ( sensor_id text, date text, time timestamp, co_ppm int, PRIMARY KEY ((sensor_id, date), time) ); After the change, one partition holds the values gathered in a single day, which makes it less likely to overflow. This technique is called bucketing, as it allows us to control how much data is stored in partitions. Bonus: See how Discord applies the same bucketing technique to avoid large partitions. Introducing Hot Spots Hot spots can be a side effect of large partitions. If you have a large partition (storing a large portion of your data set), it’s quite likely that your application access patterns will hit that partition more frequently than others. In that case, it also becomes a hot spot. Hot spots occur whenever a problematic data access pattern causes an imbalance in the way data is accessed in your cluster. One culprit is when the application fails to impose any limits on the client side and allows tenants to potentially spam a given key. For example, think about bots in a messaging app frequently spamming messages in a channel. Hot spots could also be introduced by erratic client-side configurations in the form of retry storms. That is, a client attempts to query specific data, times out before the database does, and retries the query while the database is still processing the previous one. Monitoring dashboards should make it simple for you to find hot spots in your cluster. 
For example, this dashboard shows that shard 20 is overwhelmed with reads. For another example, the following graph shows three shards with higher utilization, which correlates to the replication factor of three configured for the keyspace in question. Here, shard seven introduces a much higher load due to the spamming. How do you address hot spots? First, use a vendor utility on one of the affected nodes to sample which keys are most frequently hit during your sampling period. You can also use tracing, such as probabilistic tracing, to analyze which queries are hitting which shards and then act from there. If you find hot spots, consider: Reviewing your application access patterns. You might find that you need a data modeling change, such as the previously-mentioned bucketing technique. If you need sorting, you could use a monotonically increasing component, such as Snowflake. Or, maybe it’s best to apply a concurrency limiter and throttle down potential bad actors. Specifying per-partition rate limits, after which the database will reject any queries that hit that same partition. Ensuring that your client-side timeouts are higher than the server-side timeouts to prevent clients from retrying queries before the server has a chance to process them ( “retry storms”). Misusing Collections Teams don’t always use collections, but when they do, they often use them incorrectly. Collections are meant for storing/denormalizing a relatively small amount of data. They’re essentially stored in a single cell, which can make serialization/deserialization extremely expensive. When you use collections, you can define whether the field in question is frozen or non-frozen. A frozen collection can only be written as a whole; you cannot append or remove elements from it. A non-frozen collection can be appended to, and that's exactly the type of collection that people most misused. To make matters worse, you can even have nested collections, such as a map that contains another map, which includes a list, and so on. Misused collections will introduce performance problems much sooner than large partitions, for example. If you care about performance, collections can’t be very large at all. For example, if we create a simple key:value table, where our key is a `sensor_id` and our value is a collection of samples recorded over time, our performance will be suboptimal as soon as we start ingesting data. CQL CREATE TABLE IF NOT EXISTS {table} ( sensor_id uuid PRIMARY KEY, events map<timestamp, FROZEN<map<text, int>>>, ) The following monitoring snapshots show what happens when you try to append several items to a collection at once. You can see that while the throughput decreases, the p99 latency increases. Why does this occur? Collection cells are stored in memory as sorted vectors. Adding elements requires a merge of two collections (old and new). Adding an element has a cost proportional to the size of the entire collection. Trees (instead of vectors) would improve the performance, BUT… Trees would make small collections less efficient! Returning that same example, the solution would be to move the timestamp to a clustering key and transform the map into a FROZEN collection (since you no longer need to append data to it). These very simple changes will greatly improve the performance of the use case. CQL CREATE TABLE IF NOT EXISTS {table} ( sensor_id uuid, record_time timestamp, events FROZEN<map<text, int>>, PRIMARY KEY(sensor_id, record_time) )
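Two short CQL sketches tie these fixes together. The first shows how reads might look against the bucketed air_quality_data table from the large-partitions section; the sensor ID, date, and time range are hypothetical values, not part of the original example. The second sketches ingestion and reads for the revised collections table above, with sensor_events standing in for the {table} placeholder.
CQL
-- Bucketed reads target one day's partition for one sensor instead of its entire history
SELECT time, co_ppm
FROM air_quality_data
WHERE sensor_id = 'livingroom-01'
  AND date = '2023-11-15'
  AND time >= '2023-11-15 08:00:00'
  AND time < '2023-11-15 09:00:00';
CQL
-- One row per sample: the frozen map is written whole, never appended to
INSERT INTO sensor_events (sensor_id, record_time, events)
VALUES (123e4567-e89b-12d3-a456-426614174000, '2023-11-15 08:00:00', {'co_ppm': 42, 'no2_ppb': 17});

-- Reads scan small, bounded rows instead of one ever-growing collection cell
SELECT record_time, events
FROM sensor_events
WHERE sensor_id = 123e4567-e89b-12d3-a456-426614174000
  AND record_time >= '2023-11-15 00:00:00';
The trade-off of bucketing is that every query must now supply the date component of the partition key; scanning a longer period means issuing one query per daily bucket.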
Snowflake's evolution over the last few years is simply amazing. It is currently a data platform with a great ecosystem, both in terms of partners and a wide variety of components such as Snowgrid, Snowpark, and Streamlit, but in this article we are going to focus on its role as a modern cloud-based data warehouse. It revolutionizes the traditional concept of data warehousing; it offers a more agile and scalable platform that separates storage, compute, and services, allowing each component to scale independently. This means you can store unlimited data, ramp your computing resources up or down based on your querying needs, and only pay for what you use. Currently, Snowflake is mainly an Online Analytical Processing (OLAP) solution, but as we will see further on, it is evolving to provide transactional and analytical capabilities in a single platform. At a high level, Snowflake's architecture is made up of the following layers. Cloud Services Layer: It coordinates and handles tasks that are not specific to querying or storing data. It includes several responsibilities, such as authenticating user sessions, role-based access control, or ensuring transactional consistency. Compute Layer: This layer is where the actual data processing happens. It comprises one or multiple virtual warehouses, which are essentially clusters of compute resources. Each virtual warehouse can scale up or down independently and can be started or stopped to optimize costs. Storage Layer: This layer is responsible for storing structured and semi-structured data. Data is kept in cloud storage in a columnar format. Optimized for Analytical Queries Snowflake is designed for big data analytics. It can handle complex queries on large datasets efficiently due to its columnar storage and massively parallel processing (MPP) architecture. Analytical queries typically work with a subset of columns and use operations that aggregate, transform, and analyze vast volumes of data to provide insights, trends, or patterns. These are some of the common operations used in analytical queries: Aggregations: Functions like SUM(), AVG(), COUNT(), and MAX() are usually used to summarize data. Range scans: Scanning wide ranges of data (e.g., WHERE sale_date BETWEEN '2022-01-01' AND '2022-12-31'). Group by: Grouping data using GROUP BY clauses in combination with aggregation functions to provide summaries by some attribute. Ordering and window functions: Using ordering (ORDER BY) and window functions (e.g., ROW_NUMBER(), LAG(), LEAD()) to calculate running totals, ranks, and other advanced analytics. We will see these operations come together in a sample query at the end of this article. Columnar Storage, the Foundation of Performance Columnar storage, as opposed to row-based storage, manages data using columns (product_id, name, sale_date, etc.) as the logical units in which information is stored. Each logical unit always holds values of the same data type, so adjacent data in memory is homogeneous. This strategy provides a number of performance benefits: Data access efficiency: Aggregation queries, like those calculating sums or averages, often require data from only a few columns rather than the entire row. In columnar storage, data is stored column by column. This means that when executing an aggregation on a specific column, the database system can read only the data for that column, skipping over all other unrelated columns. This selective reading can significantly reduce I/O operations and speed up query performance.
Compression: Data within a column tends to be more homogeneous (i.e., of the same type and often with similar values) compared to data within a row. This homogeneity makes column data more amenable to compression techniques. For example, if a column storing a month of transaction dates mostly has the same few dates, you can represent those repeated dates once with a count instead of storing them repeatedly. Effective compression reduces storage costs and can also boost performance by reducing the amount of data read from storage. Better CPU cache utilization: Since columnar databases read contiguous memory blocks from a single column, they can better utilize CPU caches. The data loaded into the cache is more likely to be used in subsequent operations, leading to fewer cache misses. On the other hand, in row-based systems, if only a few columns from a wide row are needed, much of the cached data might go unused. Eliminating irrelevant data quickly: Many columnar databases keep metadata about blocks of columnar data, such as min and max values. This metadata can quickly determine whether a block contains relevant data or can be skipped entirely. For instance, if a query is filtering for prices over 200 and the maximum value in a block is 110, the entire block can be ignored. To understand why this layout is efficient for analytical queries, consider a simplified model of how columnar storage could work (this is not necessarily how Snowflake implements it). The values of each column can be stored in the same order: the first value of product_id corresponds to the first value of sale_date and to the first value of amount, the second to the second, and so on. Therefore, when you filter by date, you can quickly determine the offsets for the start and end of the date range, use those same offsets to locate the corresponding values in the amount column, and perform the necessary calculations. Unistore: Unifying Analytical and Transactional Data Snowflake is evolving its platform by applying a modern approach to provide transactional and analytical data operations together in a single platform. The new feature is called Unistore, and it enables transactional workloads by offering fast single-row operations. With it, Snowflake joins a small group of cloud-based databases that offer this type of capability, such as SingleStore or MySQL HeatWave. This feature is still in private preview with limited access, so latency figures remain to be verified. It should also be noted that some features of transactional and relational databases, such as referential integrity, are not supported. Row Storage, Transactional Performance Traditionally, databases are oriented to work at the row level: data is stored and accessed row by row (row-based or row-oriented storage). This approach is especially effective for online transaction processing (OLTP) workloads that frequently involve single-row queries or operations. Some of the benefits of using this type of storage are listed below: Fewer columns in OLTP queries: Transactional queries, like those from web applications or operational systems, often involve a limited number of columns but require complete rows. In such scenarios, reading a full row from row-based storage is more efficient than assembling a row from multiple columns in columnar storage.
Optimized for transactional workloads: OLTP systems often have a high number of small, frequent read and write operations. When updating or inserting a new row in row-based storage, the database writes the whole row at once. This contrasts with columnar systems where an insert or update might involve writing data across various column files. Locking and concurrency: Row-based databases are often optimized for row-level locking. This means that when a row is being updated, the database can lock just that specific row, allowing other operations to proceed concurrently on other rows. This level of granularity in locking is beneficial for high-concurrency transactional systems. Snowflake Platform Layers Cloud Services The Cloud Services layer plays a crucial role in managing and optimizing the overall functionality of the data warehouse and acts as the "brain" that orchestrates processes and resources to deliver a seamless, secure, and scalable data analysis and management experience. It's responsible for handling a wide range of tasks, from authentication and infrastructure management to metadata maintenance and query optimization. It is probably the most unknown layer, which means it is a user-friendly layer that goes unnoticed precisely because of its efficiency and simplicity. This layer offers several key features: Query processing: It receives SQL queries, parses them, and optimizes them for efficient execution, distributing the workload across its compute resources. Metadata management: It maintains metadata for the data stored that includes information about table structures, data types, and compression methods, as well as query history and performance metrics. Access control and security management: It handles user authentication, authorization, and role-based access control. It ensures that users can only access the data and perform the actions their roles permit. Transaction management: Handle the main features of transaction processing, including concurrency control and ensuring the ACID (Atomicity, Consistency, Isolation, Durability) properties of transactions. That, in conjunction with storage layer features (durability, consistency, or data versioning), is crucial for maintaining data integrity and consistency. Infrastructure management: It dynamically allocates and scales computational resources, the Virtual Warehouses, automatically scaling them up or down based on the workload. Data sharing and collaboration: It facilitates secure data sharing across different Snowflake accounts, sharing subsets of data without copying or moving the data, enabling real-time and seamless collaboration. Performance and usage monitoring: It provides tools and dashboards for monitoring the performance and usage of the Snowflake environment. Although, in my opinion, this is one of Snowflake's capabilities that can be improved. Integrations and API support: It provides support for various integrations and APIs, allowing users, applications, and tools to interact with the Snowflake platform. For example, it allows the management of all resources (compute, user management, monitoring, or security) following an as-code approach. Compute Layer This layer is composed of virtual warehouses that are essentially compute clusters and are responsible for executing SQL queries on the data stored in Snowflake. It supports creating multiple virtual warehouses to handle and distribute your workloads. This enables us to create dedicated and sized resources for each scenario or actor. 
For example, if you have different squads accessing data concurrently on top of the applications and BI tools, we can create and assign their own warehouse, ensuring that heavy querying by some of them doesn't affect another's performance. Isolation: Each cluster is a component isolated from the rest and, therefore, is not affected by the load state of other clusters. Independent scaling: It supports scale-up and scale-out independently for each cluster. If you need more performance for larger queries or more users, you can increase the size of your warehouse or add more nodes using multi-clustering capabilities. Independent elasticity: It supports automatic scale-out, although vertical scaling is not automated and, therefore requires us to perform manual or automatic actions. On-the-fly resizing: Scaling a virtual warehouse in Snowflake can be done on the fly without any downtime. This allows for elasticity, where you can adapt to varying workloads as needed. Multi-cluster warehouses: For even higher levels of concurrency, it enables scale-out automatically from one cluster to multiple compute clusters to accommodate many simultaneous users or queries. Storage Layer It is responsible for storing and managing data efficiently and contributes to having an effective and scalable platform. It offers several key features: Types of data: Snowflake supports structured and semi-structured data, including JSON, Avro, XML, Parquet formats, or Iceberg tables. Elastic and Scalable Storage: The storage layer automatically scales to accommodate data growth without manual intervention, so we do not need to worry about storage limits or provisioning additional storage space. Optimized data storage format: it stores data in an optimized columnar format or in row format in the case of Unistore tables, which can be indexed like traditional OLTP engines. Optimizing storage for each data use case. Data clustering and micro-partitions: Snowflake automatically organizes data into micro-partitions, which are internally optimized and compressed to improve query performance in terms of time and compute resources. Time travel and fail-safe features: It provides the capacity to access historical data up to a certain point in the past at table level. This allows us to revert to previous data states within a specified time window, providing data protection and ensuring data integrity or performing historical data analysis. The fail-safe feature offers additional protection by maintaining the data for a set period for disaster recovery. Data sharing: Snowflake enables secure and easy sharing of data between different Snowflake accounts. This feature allows organizations to share live, ready-to-query data with partners and customers without moving or copying data, ensuring data governance and security. Security and compliance: It provides several security features, including encryption of data at rest and in transit, role-based access control, and compliance with various industry standards and regulations. Cost-effective storage: We pay only for the storage they use, with Snowflake compressing and storing data in a cost-efficient manner. Conclusions In this series of articles, we will explore the various ways in which Snowflake can be used to address a wide range of data challenges. We will start with the basics of SQL and how to use it to query data in Snowflake. We will then move on to more advanced topics such as data modeling, query optimization, and machine learning. 
Before embarking on any project, it is crucial to understand its underlying architecture, capabilities, and limitations. Failure to understand the nuances of products and platforms can lead to inefficiencies, performance bottlenecks, excessive costs, and potential security vulnerabilities. This is precisely the purpose of this first article, to understand Snowflake's architecture and fundamental features.
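To tie the architecture back to everyday usage, here is a hedged example of the analytical query pattern described earlier (aggregation, a range scan, GROUP BY, and a window function); the sales table and its columns are assumed for illustration and do not come from the article.
SQL
-- Monthly revenue per product over 2022, ranked within each month
SELECT product_id,
       DATE_TRUNC('MONTH', sale_date)                AS sale_month,
       SUM(amount)                                   AS total_amount,
       RANK() OVER (PARTITION BY DATE_TRUNC('MONTH', sale_date)
                    ORDER BY SUM(amount) DESC)       AS rank_in_month
FROM sales
WHERE sale_date BETWEEN '2022-01-01' AND '2022-12-31'   -- range scan
GROUP BY product_id, DATE_TRUNC('MONTH', sale_date)     -- grouping + aggregation
ORDER BY sale_month, rank_in_month;
The compute layer is likewise driven through SQL. The statements below sketch how a virtual warehouse could be created, configured for multi-cluster scale-out, and resized on the fly; the warehouse name and settings are illustrative, and multi-cluster warehouses require an edition that supports them.
SQL
CREATE WAREHOUSE IF NOT EXISTS reporting_wh
  WITH WAREHOUSE_SIZE = 'XSMALL'
       AUTO_SUSPEND = 60        -- suspend after 60 idle seconds to save credits
       AUTO_RESUME = TRUE
       MIN_CLUSTER_COUNT = 1
       MAX_CLUSTER_COUNT = 3;   -- scale out under high concurrency

ALTER WAREHOUSE reporting_wh SET WAREHOUSE_SIZE = 'LARGE';  -- on-the-fly resize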
There are many situations when applications are requested to be agile and versatile enough so that they can run dynamic reports for which the input comes at runtime. This article aims to present a way of achieving as much by leveraging the temporary configuration parameters supported by PostgreSQL databases. According to the PostgreSQL documentation, starting with version 7.3, it is possible to set a configuration parameter using the set_config(name, value, is_local) function. Later, the value of the previously set parameter may be read using the current_setting(name) function, converted if needed and used. If the third parameter of the former function is true, the changed setting will only apply to the current transaction. This is exactly what is needed here — a way of providing a runtime parameter value that can be used as part of an atomic operation. Set Up The sample application is built with: Java 21 Spring Boot version 3.1.15 PostgreSQL Driver version 42.6.0. Liquibase 4.20.0 Maven 3.6.3 At the application level, the Maven project is configured to use Spring Data JPA and Liquibase dependencies. The domain is represented by products, whose prices are in various currencies. For converting between currencies, a currency exchange rate exists. The goal is to be able to read all products with their prices represented in a certain currency, at the rate of a certain day. Proof of Concept In order to start modelling, one shall first create a new schema, once connected to the database. PLSQL create schema pgsetting; There are three entities: Product, Currency, and CurrencyExchange. Java @Entity @Table(name = "product") public class Product { @Id @Column(name = "id") private Long id; @Column(name = "name", nullable = false) private String name; @Column(name = "price", nullable = false) private Double price; @ManyToOne @JoinColumn(name = "currency_id") private Currency currency; ... } @Entity @Table(name = "currency") public class Currency { @Id @Column(name = "id", nullable = false) private Long id; @Column(name = "name", nullable = false) private String name; ... } @Entity @Table(name = "currency_exchange") public class CurrencyExchange { @Id @Column(name = "id", nullable = false) private Long id; @Column(name = "date", nullable = false) private LocalDate date; @ManyToOne @JoinColumn(name = "from_currency_id", nullable = false) private Currency from; @ManyToOne @JoinColumn(name = "to_currency_id", nullable = false) private Currency to; @Column(name = "value", nullable = false) private Double value; ... } Each one has a corresponding CrudRepository. Java @Repository public interface ProductRepository extends CrudRepository<Product, Long> { } @Repository public interface CurrencyRepository extends CrudRepository<Currency, Long> { } @Repository public interface CurrencyExchangeRepository extends CrudRepository<CurrencyExchange, Long> { } The data source is configured as usual in the application.properties file, together with the path to the Liquibase changelog file that records a few simple change sets for initializing the schema with the three tables and the relations among them. For details, the application properties and db/changelog/schema-init.xml files may be explored. 
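Before wiring the parameters into the application, the mechanism itself can be tried directly against the database. The following minimal sketch uses an arbitrary parameter name under the pgsetting prefix, mirroring what the view will later expect.
PLSQL
BEGIN;
-- third argument true = the setting lives only for the current transaction
SELECT set_config('pgsetting.CurrencyId', '2', true);
-- read it back and cast it, just as the view will do
SELECT current_setting('pgsetting.CurrencyId')::int;
COMMIT;
-- once the transaction ends, the transaction-local value no longer applies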
The root changelog file is: XML <?xml version="1.1" encoding="UTF-8" standalone="no"?> <databaseChangeLog xmlns="http://www.liquibase.org/xml/ns/dbchangelog" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.liquibase.org/xml/ns/dbchangelog https://www.liquibase.org/xml/ns/dbchangelog/dbchangelog-3.8.xsd"> <include file="/db/changelog/schema-init.xml"/> </databaseChangeLog> When the application is started, the change sets are executed in the order they are declared. So far, everything is straightforward, nothing out of the ordinary — a simple Spring Boot application whose database changes are managed with Liquibase. Creating the Dynamic Report Let’s assume that currently the application has two currencies defined — RON and EUR and two products with their prices recorded in different currencies. Currency Plain Text +--+----+ |id|name| +--+----+ |1 |RON | |2 |EUR | +--+----+ Product Plain Text +--+-------------------+-----+-----------+ |id|name |price|currency_id| +--+-------------------+-----+-----------+ |1 |Swatch Moonlight v1|100 |2 | |2 |Winter Sky |1000 |1 | +--+-------------------+-----+-----------+ CurrencyExchange Rates for the 15th of November Plain Text +--+----------+----------------+--------------+-----+ |id|date |from_currency_id|to_currency_id|value| +--+----------+----------------+--------------+-----+ |1 |2023-11-15|2 |1 |5 | |2 |2023-11-15|2 |2 |1 | |3 |2023-11-15|1 |2 |0.2 | |4 |2023-11-15|1 |1 |1 | +--+----------+----------------+--------------+-----+ The aimed result is a product report with all prices in EUR, using the exchange rate from the 15th of November 2023. This means the price of the second product needs to be converted. To ease the design, the previously set goal is divided into smaller parts, then conquered. Conceptually, products shall be fetched and their prices converted (if needed). Fetch the products. Convert the prices in the requested currency, using the exchange rate of the requested day. The former is trivial. A Spring Data Repository method would easily allow getting the products — List<Product> findAll(). The latter is achievable through a query that makes the conversions. PLSQL SELECT p.id, p.name, p.price * e.value price, e.to_currency_id currency_id, e.date FROM product p LEFT JOIN currency_exchange e on p.currency_id = e.from_currency_id and e.to_currency_id = 2 and e.date = '2023-11-15' In order to unite the two, the following are accomplished: A view is defined, for the above query — product_view It is defined in the product-view.sql file and added as an idempotent operation in a repeatable Liquibase change set that is run whenever changed. XML <?xml version="1.1" encoding="UTF-8" standalone="no"?> <databaseChangeLog xmlns="http://www.liquibase.org/xml/ns/dbchangelog" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.liquibase.org/xml/ns/dbchangelog https://www.liquibase.org/xml/ns/dbchangelog/dbchangelog-3.8.xsd"> <include file="/db/changelog/schema-init.xml"/> <changeSet id="repeatable" author="horatiucd" runOnChange="true"> <sqlFile dbms="postgresql" path="db/changelog/product-view.sql"/> </changeSet> </databaseChangeLog> A new entity — ProductView — is defined as part of the domain, together with the corresponding repository. Java @Entity @Immutable public class ProductView { @Id private Long id; private String name; private Double price; private LocalDate date; @ManyToOne @JoinColumn(name = "currency_id") private Currency currency; ... 
} Java @Repository public interface ProductViewRepository extends org.springframework.data.repository.Repository<ProductView, Long> { List<ProductView> findAll(); } The application is now able to construct the desired report, but only for a hardcoded currency and exchange rate. In order to pass the two at runtime, the following are performed in the same transaction: The two parameter values are set as configuration parameters — SELECT set_config(:name, :value, true) The ProductView entities are fetched using the repository method Also, the product_view is modified to read the configuration parameters set as part of the current transaction and select the data accordingly. PLSQL SELECT p.id, p.name, p.price * e.value price, e.date, e.to_currency_id currency_id FROM product p LEFT JOIN currency_exchange e on p.currency_id = e.from_currency_id and e.to_currency_id = current_setting('pgsetting.CurrencyId')::int and e.date = current_setting('pgsetting.CurrencyDate')::date; current_setting('pgsetting.CurrencyId') and current_setting('pgsetting.CurrencyDate') calls read the previously set parameters, which are further converted and used. The implementation needs some additional adjustments. ProductViewRepository is enhanced with a method that allows setting the configuration parameters. Java @Repository public interface ProductViewRepository extends org.springframework.data.repository.Repository<ProductView, Long> { List<ProductView> findAll(); @Query(value = "SELECT set_config(:name, :value, true)") void setConfigParam(String name, String value); } The last parameter is always set to true, thus the value is kept only during the current transaction. Also, a ProductService is defined to clearly mark all operations involved in the transaction. Java @Service public class ProductService { private final ProductViewRepository productViewRepository; public ProductService(ProductViewRepository productViewRepository) { this.productViewRepository = productViewRepository; } @Transactional public List<ProductView> getProducts(Currency currency, LocalDate date) { productViewRepository.setConfigParam("pgsetting.CurrencyId", String.valueOf(currency.getId())); productViewRepository.setConfigParam("pgsetting.CurrencyDate", DateTimeFormatter.ofPattern("yyyy-MM-dd").format(date)); return productViewRepository.findAll(); } } The name of the parameters are the ones used in the product_view definition. To certify the implementation, two tests are set up. 
Java @SpringBootTest class Product1Test { @Autowired private CurrencyRepository currencyRepository; @Autowired private ProductRepository productRepository; @Autowired private CurrencyExchangeRepository rateRepository; @Autowired private ProductService productService; private Currency ron, eur; private Product watch, painting; private CurrencyExchange eurToRon, ronToEur; private LocalDate date; @BeforeEach public void setup() { ron = new Currency(1L, "RON"); eur = new Currency(2L, "EUR"); currencyRepository.saveAll(List.of(ron, eur)); watch = new Product(1L, "Swatch Moonlight v1", 100.0d, eur); painting = new Product(2L, "Winter Sky", 1000.0d, ron); productRepository.saveAll(List.of(watch, painting)); date = LocalDate.now(); eurToRon = new CurrencyExchange(1L, date, eur, ron, 5.0d); CurrencyExchange eurToEur = new CurrencyExchange(2L, date, eur, eur, 1.0d); ronToEur = new CurrencyExchange(3L, date, ron, eur, .2d); CurrencyExchange ronToRon = new CurrencyExchange(4L, date, ron, ron, 1.0d); rateRepository.saveAll(List.of(eurToRon, eurToEur, ronToEur, ronToRon)); } } The former fetches the products with prices in EUR, using the recorded exchange rates. Java @Test void prices_in_eur() { List<ProductView> products = productService.getProducts(eur, date); Assertions.assertEquals(2, products.size()); Assertions.assertTrue(products.stream() .allMatch(product -> product.getCurrency().getId().equals(eur.getId()))); Assertions.assertTrue(products.stream() .allMatch(product -> product.getDate().equals(date))); Assertions.assertEquals(watch.getPrice(), products.get(0).getPrice()); Assertions.assertEquals(painting.getPrice() * ronToEur.getValue(), products.get(1).getPrice()); } When called, product_view is: Plain Text +--+-------------------+-----+-----------+----------+ |id|name |price|currency_id|date | +--+-------------------+-----+-----------+----------+ |1 |Swatch Moonlight v1|100 |2 |2023-11-15| |2 |Winter Sky |200 |2 |2023-11-15| +--+-------------------+-----+-----------+----------+ The latter fetches the products with prices in RON, using the same exchange rates. Java @Test void prices_in_ron() { List<ProductView> products = productService.getProducts(ron, date); Assertions.assertEquals(2, products.size()); Assertions.assertTrue(products.stream() .allMatch(product -> product.getCurrency().getId().equals(ron.getId()))); Assertions.assertTrue(products.stream() .allMatch(product -> product.getDate().equals(date))); Assertions.assertEquals(watch.getPrice() * eurToRon.getValue(), products.get(0).getPrice()); Assertions.assertEquals(painting.getPrice(), products.get(1).getPrice()); } When called, product_view is: Plain Text +--+-------------------+-----+-----------+----------+ |id|name |price|currency_id|date | +--+-------------------+-----+-----------+----------+ |1 |Swatch Moonlight v1|500 |1 |2023-11-15| |2 |Winter Sky |1000 |1 |2023-11-15| +--+-------------------+-----+-----------+----------+ Sample Code Available here.
It’s All About the Performance An SQL database cannot afford to handle only one incoming connection at a time — that would be devastating for the performance of the system. We expect the database to accept many callers in parallel and execute their requests as fast as possible. It is rather clear how to do that when these callers ask for different data, i.e., the first caller reads from Table 1 while the second caller reads from Table 2. However, very often, different callers want to read from and write to the same table. How should we handle these queries? What should be the order of operations and the final result? This is where a transaction isolation level comes into play. A transaction is a set of queries (like SELECT, INSERT, UPDATE, DELETE) sent to the database to execute, which should be completed as a unit of work. This means that either all of them need to be executed or none of them should be executed. It takes time to execute transactions. For instance, a single UPDATE statement may modify multiple rows. The database system needs to modify every row, and this takes time. While performing an update, another transaction may begin and try to read the rows that are currently being modified. The question we may ask here is — should the other transaction read the new values of the rows (despite not all of them being updated yet), the old values of the rows (despite some of them being updated already), or maybe should it wait? And what if the first transaction needs to be canceled later for any reason? What should happen to the other transaction? Transaction isolation levels control how we maintain data integrity between transactions. They decide how transactions should be executed, when they should wait, and what anomalies are allowed to appear. We may want to allow some anomalies to be theoretically possible in order to increase the performance of the system. Read Phenomena Depending on how we control concurrency in the database, different read phenomena may appear. The SQL 92 standard defines three read phenomena describing various issues that may happen when two transactions are executed concurrently with no transaction isolation in place. We’ll use the following People table for the examples:
Plain Text
+--+----+------+
|id|name|salary|
+--+----+------+
|1 |John|150   |
|2 |Jack|200   |
+--+----+------+
Dirty Read When two transactions access the same data and we allow for reading values that are not yet committed, we may get a dirty read. Let’s say that we have two transactions doing the following:
Transaction 2: UPDATE People SET salary = 180 WHERE id = 1
Transaction 1: SELECT salary FROM People WHERE id = 1
Transaction 2: ROLLBACK
Transaction 2 modifies the row with id = 1, then Transaction 1 reads the row and gets a value of 180, and Transaction 2 rolls things back. Effectively, Transaction 1 uses a value that doesn’t exist in the database. What we would expect here is that Transaction 1 only uses values that were successfully committed in the database at some point in time. Non-Repeatable Read A non-repeatable read occurs when a transaction reads the same row twice and gets different results each time (the REPEATABLE READ isolation level is named after the guarantee that prevents this). Let’s say the transactions do the following:
Transaction 1: SELECT salary FROM People WHERE id = 1
Transaction 2: UPDATE People SET salary = 180 WHERE id = 1
Transaction 2: COMMIT
Transaction 1: SELECT salary FROM People WHERE id = 1
Transaction 1 reads a row and gets a value of 150. Transaction 2 modifies the same row and commits. Then Transaction 1 reads the row again and gets a different value (180 this time). What we would expect here is to read the same value twice.
Phantom Read A phantom read is a case where a transaction looks for rows the same way twice but gets different results. Let’s take the following:
Transaction 1: SELECT * FROM People WHERE salary < 250
Transaction 2: INSERT INTO People(id, name, salary) VALUES (3, 'Jacob', 120)
Transaction 2: COMMIT
Transaction 1: SELECT * FROM People WHERE salary < 250
Transaction 1 reads rows and finds two of them matching the conditions. Transaction 2 adds another row that matches the conditions used by Transaction 1. When Transaction 1 reads again, it gets a different set of rows. We would expect to get the same rows for both SELECT statements of Transaction 1. Isolation Levels The SQL 92 standard defines isolation levels that determine which read phenomena can occur. There are four standard levels: READ UNCOMMITTED, READ COMMITTED, REPEATABLE READ, and SERIALIZABLE. READ UNCOMMITTED allows a transaction to read data that is not yet committed to the database. This allows for the highest performance, but it also leads to the most undesired read phenomena. READ COMMITTED allows a transaction to read only data that has been committed. This avoids the issue of reading data that “later disappears” but doesn’t protect against the other read phenomena. The REPEATABLE READ level tries to avoid the issue of reading the same data twice and getting different results. Finally, SERIALIZABLE tries to avoid all read phenomena. The following table shows which phenomena are allowed:
Plain Text
+-----------------+----------+-------------------+------------+
|Level \ Phenomena|Dirty read|Non-repeatable read|Phantom read|
+-----------------+----------+-------------------+------------+
|READ UNCOMMITTED |+         |+                  |+           |
|READ COMMITTED   |-         |+                  |+           |
|REPEATABLE READ  |-         |-                  |+           |
|SERIALIZABLE     |-         |-                  |-           |
+-----------------+----------+-------------------+------------+
The isolation level is defined per transaction. For example, it’s allowed for one transaction to run at the SERIALIZABLE level and for another to run at READ UNCOMMITTED. How Does It Work Under the Hood? The database needs to implement mechanisms guaranteeing the absence of specific read phenomena. There are generally two broad approaches to doing so: pessimistic locking and optimistic locking. Pessimistic Locking The first approach is called pessimistic locking. In this approach, we want to avoid issues by making sure a transaction doesn’t introduce problematic changes. We do that by locking specific parts of the database. When a given part is locked by one transaction, another transaction cannot read or write that data, according to the transaction isolation level, to avoid issues. There are various levels of locks in the database: they can be taken at the row level, the page level (which we can consider a group of rows for the purpose of this article), the table level, and the whole-database level. There are also various types of locks: locks for reading, locks for writing, locks that can be shared between transactions or not, locks for intent, and so on. This article focuses on SQL databases in general, so we won’t go into the details of actual implementations. Conceptually, to avoid a given read phenomenon, a transaction needs to lock specific parts of the database in a way that guarantees other transactions will not introduce changes leading to that type of read phenomenon. For instance, to avoid dirty reads, we need to lock all modified or read rows so that other transactions cannot read or modify them. There are multiple advantages to this approach. First, it allows for fine granularity in terms of what can be modified and which transactions can safely carry on. Second, it scales well and imposes low overhead when multiple transactions work on different data. Third, transactions don’t need to roll things back.
However, this can decrease performance significantly. For instance, if two transactions want to read and modify data in the same table, and both of them operate at the SERIALIZABLE level, then they’ll need to wait for each other to complete, even if they touch different rows of the table. Most database management systems use this approach. For instance, MS SQL Server uses it for its four main isolation levels. Optimistic Locking The other approach is called optimistic locking. This approach is also known as snapshot isolation or Multiversion Concurrency Control (MVCC for short). Each entity in the table has a version number associated with it. When we modify a row, we also increase its row version so other transactions can observe that it changed. When a transaction starts, it records the version number so it knows what the state of the rows is. When it reads from a table, it only extracts rows that were last modified before the transaction started. Next, when the transaction modifies the data and tries to commit it to the database, the database verifies the row versions. If the rows were modified in the meantime by some other transaction, the update is rejected, and the transaction has to start from scratch. This approach works well when transactions touch different rows, because then they can all commit with no issues. This allows for better scaling and higher performance because transactions don’t need to take locks. However, when transactions often modify the same rows, some of them will need to be rolled back frequently, which leads to performance degradation. Another disadvantage is the need to keep the row versions, which increases the complexity of the database system. Various database management systems use this approach; for instance, Oracle, or MS SQL Server with snapshot isolation enabled. Practical Considerations While isolation levels seem to be well defined, there are various little details that affect how database systems work under the hood. Let’s look at some of them. Isolation Levels Are Not Mandatory While the SQL 92 standard defines multiple isolation levels, they are not mandatory. This means that every level in a given database management system may be implemented as SERIALIZABLE. We use the other isolation levels to increase performance, but that is not enforced in any way. This means that if we rely on a particular optimization happening in one database management system, the same optimization may not be applied in another. We shouldn’t rely on implementation details but stick to the standard instead. Default Isolation Level Is Not Standardized The isolation level is configured per transaction, and the default is typically dictated by the library or connectivity technology you use to connect to the database. Depending on your default settings, you may operate at a different isolation level than you expect, and this may lead to different results or different performance. Typical libraries default to SERIALIZABLE or READ COMMITTED. Problems With READ COMMITTED While READ COMMITTED guarantees that a transaction reads only committed data, it doesn’t guarantee that the data it reads is the latest. It is possible that it reads a value that was committed at some point in the past but was overwritten later on by another transaction. There is another issue with the READ COMMITTED level. Due to how entities are stored under the hood, it is possible for a transaction to read a particular row twice or to skip it. Let’s see why.
A typical database management system stores the rows of a table in an ordered fashion, typically using the table’s primary key in a B-tree. This is because the primary key typically imposes a clustered index, which causes the data to be physically ordered on disk. Let’s now assume that there are ten rows with IDs from 1 to 10, and that our transaction has already read eight rows, i.e., the rows with IDs from 1 to 8 inclusive. Now, if another transaction modifies the row with id = 2, changes its id value to 11, and commits, we’ll continue scanning and find 11 rows in total. What’s more, we already read that row when it had id = 2 — an id under which it no longer exists! Based on the same idea, we can also miss a row. Let’s say that we have 10 rows in total and we have already read rows 1 to 4. Next, another transaction changes the row with id = 10 and sets its id to 3. We won’t find this row at all due to the ordering. White and Black Marbles Problem We mentioned two different ways of implementing locks. Pessimistic locking locks rows and prevents other transactions from modifying them while they’re locked. Optimistic locking stores row versions and allows other transactions to move on as long as they work on the latest data. Another issue with the SERIALIZABLE level, when it’s implemented with optimistic locking, is known as the white and black marbles problem. Let’s take the following Marbles table:
Plain Text
+--+-----+-----------+
|id|color|row_version|
+--+-----+-----------+
|1 |black|1          |
|2 |white|1          |
+--+-----+-----------+
Let’s now say that we want to run two transactions. The first tries to change all black marbles into white. The other tries to do the opposite — to change all white marbles into black. We have the following:
Transaction 1: UPDATE Marbles SET color = 'white' WHERE color = 'black'
Transaction 2: UPDATE Marbles SET color = 'black' WHERE color = 'white'
Now, if we implement SERIALIZABLE with pessimistic locking, a typical implementation will lock the entire table. After running both transactions, we end up with either two black marbles (if we execute Transaction 1 first and then Transaction 2) or two white marbles (if we execute Transaction 2 first and then Transaction 1). However, if we use optimistic locking, we’ll end up with the following:
Plain Text
+--+-----+-----------+
|id|color|row_version|
+--+-----+-----------+
|1 |white|2          |
|2 |black|2          |
+--+-----+-----------+
Since both transactions touch different sets of rows, they can run in parallel, which leads to this unexpected result. What To Do Now? We learned how transaction isolation levels work, and we can now use them to improve performance. To do that, we need to understand what SQL queries are executed in the database and how they impact performance. One of the easiest ways to do that is by using the Metis observability dashboard, which can show you insights about all the queries that are executed and how to improve their performance by changing the database configuration. This way, we can see whether we get the expected results and use the right isolation levels. Summary In this article, we have seen what transaction isolation levels are and which read phenomena they allow. We also learned how they are conceptually implemented by database systems and how they can lead to unexpected results.
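As noted above, the isolation level is chosen per transaction. The snippet below shows the standard SQL way of doing so, reusing the People table from the earlier examples; exact syntax and defaults vary slightly between database management systems.
SQL
BEGIN;
SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;  -- must come before the first query
SELECT salary FROM People WHERE id = 1;
-- ... further statements run under REPEATABLE READ ...
SELECT salary FROM People WHERE id = 1;           -- sees the same value as the first read
COMMIT;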
In the ever-evolving landscape of database technology, staying ahead of the curve is not just an option, it’s a necessity. As modern applications continue to grow in complexity and global reach, the role of the underlying database becomes increasingly critical. It’s the backbone that supports the seamless functioning of applications and the storage and retrieval of vast amounts of data. In this era of global-scale applications, having a high-performance, flexible, and efficient database is paramount. As the demands of modern applications surge, the need for a database that can keep pace has never been greater. The “ultra-database” has become a key player in ensuring that applications run seamlessly and efficiently globally. These databases need to offer a unique combination of speed, versatility, and adaptability to meet the diverse requirements of various applications, from e-commerce platforms to IoT systems. They need to be more than just data repositories. They must serve as intelligent hubs that can quickly process, store, and serve data while facilitating real-time analytics, security, and scalability. The ideal ultra-database is not just a storage facility; it’s an engine that drives the dynamic, data-driven applications that define the modern digital landscape. The latest release of HarperDB 4.2 introduces a unified development architecture for enterprise applications, providing an approach to building global-scale applications. HarperDB 4.2 HarperDB 4.2 is a comprehensive solution that seamlessly combines an ultra-fast database, user-programmable applications, and data streaming into a cohesive technology. The result is a development environment that simplifies the complex, accelerates the slow, and reduces costs. HarperDB 4.2 offers a unified platform that empowers developers to create applications that can span the globe, handling data easily and quickly. In this tutorial, we will explore the features of HarperDB 4.2 and show you how to harness its power in conjunction with Java Quarkus. We will take you through the steps to leverage HarperDB’s new capabilities to build robust and high-performance applications with Quarkus, demonstrating the impressive potential of this unified development architecture. So, join us on this enlightening journey and revolutionize your application development process. Creating a Quarkus Microservice API With HarperDB, Part 1: Setting up the Environment This section will guide you through configuring your development environment and creating the necessary project setup to get started. Step 1: Configuring the Environment Before diving into the development, you need to set up your environment. We’ll start by running HarperDB in a Docker container. To do this, open your terminal and run the following command: Shell docker run -d -e HDB_ADMIN_USERNAME=root -e HDB_ADMIN_PASSWORD=password -e HTTP_THREADS=4 -p 9925:9925 -p 9926:9926 harperdb/harperdb This command downloads and runs the HarperDB Docker container with the specified configuration. It exposes the necessary ports for communication. Step 2: Creating a Schema and Table With HarperDB up and running, the next step is to create a schema and define a table to store animal data. We will use the “curl” commands to interact with HarperDB’s RESTful API. 
Create a schema named “dev” by executing the following command: Shell curl --location --request POST 'http://localhost:9925/' \ --header 'Authorization: Basic cm9vdDpwYXNzd29yZA==' \ --header 'Content-Type: application/json' \ --data-raw '{ "operation": "create_schema", "schema": "dev" }' This command sends a POST request to create the “dev” schema. Next, create a table named “animal” with “scientificName” as the hash attribute using the following command: Shell curl --location 'http://localhost:9925' \ --header 'Authorization: Basic cm9vdDpwYXNzd29yZA==' \ --header 'Content-Type: application/json' \ --data '{ "operation": "create_table", "schema": "dev", "table": "animal", "hash_attribute": "scientificName" }' This command establishes the “animal” table within the “dev” schema. 3. Now, add the required attributes for the “animal” table by creating “name,” “genus,” and “species” attributes: Shell curl --location 'http://localhost:9925' \ --header 'Authorization: Basic cm9vdDpwYXNzd29yZA==' \ --header 'Content-Type: application/json' \ --data '{ "operation": "create_attribute", "schema": "dev", "table": "animal", "attribute": "name" }' curl --location 'http://localhost:9925' \ --header 'Authorization: Basic cm9vdDpwYXNzd29yZA==' \ --header 'Content-Type: application/json' \ --data '{ "operation": "create_attribute", "schema": "dev", "table": "animal", "attribute": "genus" }' curl --location 'http://localhost:9925' \ --header 'Authorization: Basic cm9vdDpwYXNzd29yZA==' \ --header 'Content-Type: application/json' \ --data '{ "operation": "create_attribute", "schema": "dev", "table": "animal", "attribute": "species" }' These commands add the “name”, “genus”, and “species” attributes to the “animal” table within the “dev” schema. With HarperDB configured and the schema and table set up, you can start building your Quarkus-based microservice API to manage animal data. Stay tuned for the next part of the tutorial, where we’ll dive into the development process. Building Quarkus Application We configured HarperDB and prepared the environment. Now, we’ll start building our Quarkus application to manage animal data. Quarkus makes it easy with a handy project generator, so let’s begin. Quarkus offers an intuitive web-based project generator that simplifies the initial setup. Visit Quarkus Project Generator, and follow these steps: Select the extensions you need for your project. Add “JAX-RS” and “JSON” for this tutorial to handle REST endpoints and JSON serialization. Click the “Generate your application” button. Download the generated ZIP file and extract it to your desired project directory. With your Quarkus project generated, you’re ready to move on. Our project will use the DataFaker library and the HarperDB Java driver to generate animal data to interact with the HarperDB database. To include the HarperDB Java Driver, please read the previous article. In your Quarkus project, create a Java record to represent the Animal entity. This record will have fields for the scientific name, name, genus, and species, allowing you to work with animal data efficiently. Java public record Animal(String scientificName, String name, String genus, String species) { public static Animal of(Faker faker) { var animal = faker.animal(); return new Animal( animal.scientificName(), animal.name(), animal.genus(), animal.species() ); } } This record includes a factory method, of, that generates an Animal instance with random data using the DataFaker library. 
We’ll use this method to populate our database with animal records. In your Quarkus project, we’ll set up CDI (Contexts and Dependency Injection) to handle database connections and data access. Here’s an example of how to create a ConnectionSupplier class that manages database connections: Java @ApplicationScoped public class ConnectionSupplier { private static final Logger LOGGER = Logger.getLogger(ConnectionSupplier.class.getName()); @Produces @RequestScoped public Connection get() throws SQLException { LOGGER.info("Creating connection"); // Create and return the database connection, e.g., using DriverManager.getConnection } public void dispose(@Disposes Connection connection) throws SQLException { LOGGER.info("Closing connection"); connection.close(); } } The ConnectionSupplier class uses CDI annotations to produce and dispose of database connections. This allows Quarkus to manage the database connection lifecycle for you. Let’s create the AnimalDAO class to interact with the database using JDBC. This class will have methods for inserting and querying animal data. Java @ApplicationScoped public class AnimalDAO { private final Connection connection; public AnimalDAO(Connection connection) { this.connection = connection; } public void insert(Animal animal) { try { // Prepare and execute the SQL INSERT statement to insert the animal data } catch (SQLException exception) { throw new RuntimeException(exception); } } public Optional<Animal> findById(String id) { try { // Prepare and execute the SQL SELECT statement to find an animal by ID } catch (SQLException exception) { throw new RuntimeException(exception); } } // Other methods for data retrieval and manipulation } In the AnimalDAO class, you’ll use JDBC to perform database operations. You can add more methods to handle various database tasks, such as updating and deleting animal records. The AnimalService class will generate animal data and utilize the AnimalDAO for database interaction. Java @ApplicationScoped public class AnimalService { private final Faker faker; private final AnimalDAO dao; @Inject public AnimalService(Faker faker, AnimalDAO dao) { this.faker = faker; this.dao = dao; } // Implement methods for generating and managing animal data } In the AnimalService, you’ll use the DataFaker library to generate random animal data and the AnimalDAO for database operations. With these components in place, you’ve set up the foundation for your Quarkus-based Microservice API with HarperDB. In the next part of the tutorial, we’ll dive into developing RESTful endpoints and data management. Create AnimalResource Class In this final part of the tutorial, we will create an AnimalResource class to expose our animal service through HTTP endpoints. Additionally, we will provide sample curl commands to demonstrate how to consume these endpoints locally. Create an AnimalResource class with RESTful endpoints for managing animal data. This class will interact with the AnimalService to handle HTTP requests and responses. 
Java @Path("/animals") @Produces(MediaType.APPLICATION_JSON) @Consumes(MediaType.APPLICATION_JSON) public class AnimalResource { private final AnimalService service; public AnimalResource(AnimalService service) { this.service = service; } @GET public List<Animal> findAll() { return this.service.findAll(); } @POST public Animal insert(Animal animal) { this.service.insert(animal); return animal; } @DELETE @Path("{id}") public void delete(@PathParam("id") String id) { this.service.delete(id); } @POST @Path("/generate") public void generateRandom() { this.service.generateRandom(); } } In this class, we’ve defined several RESTful endpoints, including: GET /animals: Returns a list of all animals. POST /animals: Inserts a new animal. DELETE /animals/{id}: Deletes an animal by its ID. POST /animals/generate: Generates random animal data. Here are curl commands to test the HTTP endpoints locally using http://localhost:8080/animals/ as the base URL: Retrieve All Animals (GET) Shell curl -X GET http://localhost:8080/animals/ Insert a New Animal (POST) Shell curl -X POST -H "Content-Type: application/json" -d '{ "scientificName": "Panthera leo", "name": "Lion", "genus": "Panthera", "species": "Leo" }' http://localhost:8080/animals/ Delete an Animal by ID (DELETE) Replace {id} with the ID of the animal you want to delete: Shell curl -X DELETE http://localhost:8080/animals/{id} Generate Random Animal Data (POST) This endpoint doesn’t require any request data: Shell curl -X POST http://localhost:8080/animals/generate These curl commands allow you to interact with the Quarkus-based microservice API, performing actions such as retrieving, inserting, and deleting animal data. The generated random data endpoint is valuable for populating your database with test data. With these RESTful endpoints, you have a fully functional Quarkus application integrated with HarperDB to manage animal data over HTTP. You can extend and enhance this application further to meet your specific requirements. Congratulations on completing this tutorial! Conclusion In this tutorial, we embarked on a journey to build a Quarkus-based Microservice API integrated with HarperDB, a robust, high-performance database. We started by setting up our environment and creating a Quarkus project with the necessary extensions. Leveraging the DataFaker library, we generated random animal data to populate our HarperDB database. The core of our application was the seamless integration with HarperDB, showcasing the capabilities of the HarperDB Java driver. We used CDI to manage database connections efficiently and created a structured data access layer with the AnimalDAO class. Through this, we performed database operations, such as inserting and querying animal data. With the implementation of the AnimalService class, we combined the generated data with database operations, bringing our animal data management to life. Finally, we exposed our animal service through RESTful endpoints in the AnimalResource class, allowing us to interact with the service through HTTP requests. You can explore the complete source code of this project on GitHub. Feel free to fork, modify, and extend it to suit your needs. As you continue your journey into the world of HarperDB and Quarkus, remember to consult the comprehensive HarperDB documentation available at HarperDB Documentation to dive deeper into the capabilities and features of HarperDB. 
Stay informed about the latest updates, release notes, and news on HarperDB’s official website to ensure you’re always working with the most up-to-date information. Check out the latest release notes to discover what’s new and improved in HarperDB. By combining Quarkus and HarperDB, you’re well-equipped to build efficient and scalable applications that meet the demands of the modern digital landscape. Happy coding!
In previous articles, we discussed in detail how to architect global API layers and multi-region service meshes using Kong and YugabyteDB. However, the solutions presented still harbored a bottleneck and a single point of failure: the database Kong uses internally to store its metadata and application-specific configurations. This guide demonstrates how to eliminate this final bottleneck by running Kong on YugabyteDB, a distributed SQL database built on PostgreSQL.

Kong's Default Database

Kong uses PostgreSQL as a database for its own needs. Taking a look at the database schema created by Kong during the bootstrap process, you'll find dozens of tables and other database objects that store metadata and application-specific configurations:

SQL
kong=# \d
                 List of relations
 Schema |             Name              | Type  |  Owner
--------+-------------------------------+-------+----------
 public | acls                          | table | postgres
 public | acme_storage                  | table | postgres
 public | basicauth_credentials         | table | postgres
 public | ca_certificates               | table | postgres
 public | certificates                  | table | postgres
 public | cluster_events                | table | postgres
 public | clustering_data_planes        | table | postgres
 public | consumers                     | table | postgres
 public | filter_chains                 | table | postgres
 public | hmacauth_credentials          | table | postgres
 public | jwt_secrets                   | table | postgres
 public | key_sets                      | table | postgres
 public | keyauth_credentials           | table | postgres
 public | keys                          | table | postgres
 .. the list goes on

For Kong deployments that don't need to scale across multiple availability zones, regions, or data centers, PostgreSQL serves perfectly well. However, when an application needs to deploy Kong Gateway or Kong Mesh across various locations, a standalone PostgreSQL server can become a bottleneck or single point of failure. Initially, Kong offered Apache Cassandra as an alternative to PostgreSQL for those wishing to architect distributed APIs and service meshes, but Cassandra support was later officially deprecated, and the Kong team stated that PostgreSQL would remain the only officially supported database.

Why Distributed PostgreSQL?

Even though Cassandra was deprecated, the demand among Kong users for a distributed version of Postgres didn't wane, driven by several requirements:

High availability: API layers and service meshes must be resilient against all kinds of potential outages, including zone- and region-level incidents.
Scalability: From global load balancers to the API and database layers, the entire solution needs to handle both read and write workloads at low latency.
Data regulations: When an API or mesh spans multiple jurisdictions, certain API endpoints may be required to store specific settings and configurations within data centers located in a particular geography.

As a result, members of both the Kong and YugabyteDB communities began working on adapting YugabyteDB for distributed Kong deployments.

Why YugabyteDB?

YugabyteDB is a distributed SQL database built on PostgreSQL. The upper half of YugabyteDB, the query layer, is PostgreSQL, with the modifications needed for YugabyteDB's distributed storage layer. Essentially, you can think of YugabyteDB as a distributed Postgres. Since YugabyteDB maintains feature and runtime compatibility with Postgres, the majority of applications, libraries, drivers, and frameworks designed for Postgres should operate seamlessly with YugabyteDB, requiring no code changes. For instance, one of the earlier articles shows how to deploy Kubernetes on YugabyteDB, using the integration initially created for Postgres. Back in 2022, following the Cassandra deprecation, Kong was not compatible with YugabyteDB due to the absence of certain Postgres features in the distributed database engine. However, this changed with the release of YugabyteDB version 2.19.2, which included support for all the features necessary for Kong. Next, we'll explore how to get Kong Gateway up and running on a multi-node YugabyteDB cluster.

Starting a Multi-Node YugabyteDB Cluster

There are many ways to start Kong Gateway and YugabyteDB. One option is to run everything inside Docker containers, so that's the approach we'll use here. First, create a custom Docker network for the YugabyteDB and Kong containers:

Shell
docker network create custom-network

Next, get a three-node YugabyteDB cluster running:

Shell
mkdir $HOME/yb_docker_data

docker run -d --name yugabytedb_node1 --net custom-network \
  -p 15433:15433 -p 7001:7000 -p 9001:9000 -p 5433:5433 \
  -v $HOME/yb_docker_data/node1:/home/yugabyte/yb_data --restart unless-stopped \
  yugabytedb/yugabyte:latest \
  bin/yugabyted start --base_dir=/home/yugabyte/yb_data --daemon=false

docker run -d --name yugabytedb_node2 --net custom-network \
  -p 15434:15433 -p 7002:7000 -p 9002:9000 -p 5434:5433 \
  -v $HOME/yb_docker_data/node2:/home/yugabyte/yb_data --restart unless-stopped \
  yugabytedb/yugabyte:latest \
  bin/yugabyted start --join=yugabytedb_node1 --base_dir=/home/yugabyte/yb_data --daemon=false

docker run -d --name yugabytedb_node3 --net custom-network \
  -p 15435:15433 -p 7003:7000 -p 9003:9000 -p 5435:5433 \
  -v $HOME/yb_docker_data/node3:/home/yugabyte/yb_data --restart unless-stopped \
  yugabytedb/yugabyte:latest \
  bin/yugabyted start --join=yugabytedb_node1 --base_dir=/home/yugabyte/yb_data --daemon=false

Finally, verify the cluster's status in the YugabyteDB UI, which the first node exposes on port 15433 (http://localhost:15433 with the port mappings above).

Starting Kong Gateway

To deploy Kong Gateway on YugabyteDB using Docker, follow the steps below. First, connect to YugabyteDB and create the kong database using psql or your preferred SQL tool:

Shell
psql -h 127.0.0.1 -p 5433 -U yugabyte

create database kong;
\q

Next, start the Kong bootstrapping and migration process:

Shell
docker run --rm --net custom-network \
  -e "KONG_DATABASE=postgres" \
  -e "KONG_PG_HOST=yugabytedb_node1" \
  -e "KONG_PG_PORT=5433" \
  -e "KONG_PG_USER=yugabyte" \
  -e "KONG_PG_PASSWORD=yugabyte" \
  kong:latest kong migrations bootstrap

KONG_DATABASE is set to postgres, which directs Kong to keep using its PostgreSQL implementation for metadata storage. KONG_PG_HOST can point to any node of the YugabyteDB cluster; the chosen node will route Kong's requests and manage their execution across the cluster. The bootstrapping process can take up to 5 minutes, during which there may be no log output. Once completed, the following log messages indicate that the migrations finished successfully:

Shell
....
migrating response-ratelimiting on database 'kong'...
response-ratelimiting migrated up to: 000_base_response_rate_limiting (executed)
migrating session on database 'kong'...
session migrated up to: 000_base_session (executed)
session migrated up to: 001_add_ttl_index (executed)
session migrated up to: 002_320_to_330 (executed)
58 migrations processed
58 executed
Database is up-to-date

Finally, launch the Kong Gateway container, configured to utilize YugabyteDB as the database backend:

Shell
docker run -d --name kong-gateway \
  --net custom-network \
  -e "KONG_DATABASE=postgres" \
  -e "KONG_PG_HOST=yugabytedb_node1" \
  -e "KONG_PG_PORT=5433" \
  -e "KONG_PG_USER=yugabyte" \
  -e "KONG_PG_PASSWORD=yugabyte" \
  -e "KONG_PROXY_ACCESS_LOG=/dev/stdout" \
  -e "KONG_ADMIN_ACCESS_LOG=/dev/stdout" \
  -e "KONG_PROXY_ERROR_LOG=/dev/stderr" \
  -e "KONG_ADMIN_ERROR_LOG=/dev/stderr" \
  -e "KONG_ADMIN_LISTEN=0.0.0.0:8001, 0.0.0.0:8444 ssl" \
  -p 8000:8000 \
  -p 8443:8443 \
  -p 127.0.0.1:8001:8001 \
  -p 127.0.0.1:8444:8444 \
  kong:latest

Test Kong's operation by sending a request to the Gateway:

Shell
curl -i -X GET --url http://localhost:8001/services

Then, return to the YugabyteDB UI and select 'kong' from the 'Databases' menu to view the dozens of tables and indexes Kong uses internally. Job done!

In Summary

Even though the Kong team stopped supporting Cassandra for distributed deployments, their initial bet on PostgreSQL paid off over time. As one of the fastest-growing databases, Postgres has a rich ecosystem of extensions and other products that extend its use cases. Kong users required a distributed version of Postgres for APIs and service meshes spanning various locations, and that use case was eventually addressed by YugabyteDB, a distributed database built on PostgreSQL.
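As an optional follow-up to the /services request above, you can register a test service and route through the Admin API and proxy a request through the gateway; everything Kong persists along the way lands in the YugabyteDB-backed tables shown in the UI. A minimal sketch, in which the service name, upstream URL, and route path are arbitrary placeholders:

Shell
# Register an upstream service
curl -i -X POST http://localhost:8001/services \
  --data name=httpbin-service \
  --data url=https://httpbin.org

# Attach a route to it
curl -i -X POST http://localhost:8001/services/httpbin-service/routes \
  --data 'paths[]=/test'

# Send a request through the Kong proxy port
curl -i http://localhost:8000/test/get

If the proxied request returns a response from the upstream, Kong is reading and writing its configuration through YugabyteDB end to end.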
The Initial Need Leading to CQRS

The traditional CRUD (Create, Read, Update, Delete) pattern has been a mainstay in system architectures for many years. In CRUD, reading and writing operations are usually handled by the same data model and often by the same database schema. While this approach is straightforward and intuitive, it becomes less effective as systems scale and as requirements become more complex. For instance, consider a large-scale e-commerce application with millions of users. This system may face conflicting demands: it needs to quickly read product details, reviews, and user profiles, but it also has to handle thousands of transactions, inventory updates, and order placements efficiently. As both reading and writing operations grow, using a single model for both can lead to bottlenecks, impacting performance and user experience.

Basics of the CQRS Pattern

CQRS was introduced to address these scaling challenges. The essence of the pattern lies in its name — Command Query Responsibility Segregation. Here, commands are responsible for any change in the system's state (like placing an order or updating a user profile), while queries handle data retrieval without any side effects. In a CQRS system, these two operations are treated as entirely distinct responsibilities, often with separate data models, databases, or even separate servers or services. This allows each to be tuned, scaled, and maintained independently of the other, aligning with the specific demands of each operation.

CQRS Components

Commands
Commands are the directive components that perform actions or changes within the system. They should be named to reflect intent and context, such as PlaceOrder or UpdateUserProfile. Importantly, commands are responsible for changes and therefore should not return data. Validation, authorization, and business logic belong on this side of the pattern: a command is checked and enforced before any state change is applied.

Queries
Queries, on the other hand, handle all request-for-data operations. They are constructed to provide optimized, denormalized views of the data tailored to specific use cases, and structuring these read models well, including indexing and denormalization choices, is what keeps potentially complex reads fast.

Command and Query Handlers
Handlers serve as the brokers that facilitate the execution of commands and queries. Command handlers are responsible for executing the logic tied to data mutations while ensuring validations and business rules are adhered to. Query handlers manage the retrieval of data, potentially involving complex aggregations or joins to form the requested read model. (A short code sketch of this separation appears at the end of the article.)

Single Database vs. Dual Database Approach

Single Database
In this model, both command and query operations are performed on a single database, but with distinct models or schemas. Even though both operations share the same physical storage, they might utilize different tables, views, or indexes optimized for their specific requirements. It could be represented as follows:

Single database representation

Benefits
Simplified infrastructure and reduced overhead
Immediate consistency, as there's no lag between write and read operations

Trade-Offs
The shared resource can still become a bottleneck during heavy concurrent operations.
Less flexibility in tuning and scaling operations independently

Dual Database Approach
Here, command and query operations are entirely separated, using two distinct databases or storage systems. The write database is dedicated to handling commands, while the read database serves query operations, and a synchronization mechanism must be added to keep the two aligned. It could be represented as follows:

Multi database representation

Benefits
Individual tuning, scaling, and optimization for each operation
Potential for each database to be hosted on different servers or services, distributing load; the database technologies can also differ, allowing more modularity for specific needs.

Trade-Offs
Introduces complexity in ensuring data consistency between the two databases
Requires synchronization mechanisms to bridge the potential misalignment or latency between the write and read databases: for instance, a write operation may update the command database, but the read database might not immediately reflect these changes. To address this, synchronization techniques can range from simple polling to more intricate methods like Event Sourcing, where modifications are captured as a series of events and replayed to update the read database.

The single database approach offers simplicity; the dual database configuration provides more flexibility and scalability at the cost of added complexity. The choice between these approaches depends on the specific requirements and challenges of the system in question, particularly around performance needs and consistency requirements.

Benefits and Trade-Offs of the CQRS Pattern

Benefits
Performance optimization: By separating read and write logic, you can independently scale and optimize each aspect. For instance, if a system is read-heavy, you can allocate more resources to handling queries without being bogged down by the demands of write operations.
Flexibility and scalability: With separate models for reading and writing, it's easier to introduce changes in one without affecting the other. This segregation not only allows for more agile development and easier scalability but also offers protection against common issues associated with eager entity loading. By distinctly handling reads and writes, systems can be optimized to avoid unnecessary data loading, thereby improving performance and reducing resource consumption.
Simplified codebase: Separating command and query logic can lead to a more maintainable and less error-prone codebase. Each model has a clear responsibility, reducing the likelihood of bugs introduced due to intertwined logic.

Trade-Offs
Data consistency: As there might be different models for read and write, achieving data consistency can be challenging. CQRS often goes hand-in-hand with the "eventual consistency" model, which might not be suitable for all applications.
Complexity overhead: Introducing CQRS can add complexity, especially if it's paired with other patterns like Event Sourcing. It's crucial to assess whether the benefits gained outweigh the added intricacy.
Increased development effort: Having separate models for reading and writing means, in essence, maintaining two distinct parts of the system. This can increase the initial development effort and also the ongoing maintenance overhead.

Conclusion

CQRS offers a transformative approach to data management, bringing forth significant advantages in terms of performance, scalability, and code clarity. However, it's not a silver bullet. Like all architectural decisions, adopting CQRS should be a measured choice, considering both its benefits and the challenges it introduces.
For systems where scalability and performance are paramount, and where the trade-offs are acceptable, CQRS can be a game-changer, elevating the system’s robustness and responsiveness to new heights.
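To make the segregation concrete, here is a minimal, illustrative Java sketch of the ideas above: a command with its handler on the write side, and a separate read-side query handler returning a denormalized view. All class names and the in-memory maps are hypothetical placeholders; a real system would back each side with its own persistence and, in a dual-database setup, a synchronization mechanism between them.

Java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Write side: a command expresses intent and returns no data
record PlaceOrder(String orderId, String productId, int quantity) {}

class PlaceOrderHandler {
    private final Map<String, PlaceOrder> writeStore = new ConcurrentHashMap<>();

    void handle(PlaceOrder command) {
        // Validation, authorization, and business rules live on this side
        if (command.quantity() <= 0) {
            throw new IllegalArgumentException("Quantity must be positive");
        }
        writeStore.put(command.orderId(), command); // state change only, nothing returned
    }
}

// Read side: a query handler returns a denormalized view and has no side effects
record OrderSummary(String orderId, String productId, int quantity) {}

class OrderSummaryQueryHandler {
    private final Map<String, OrderSummary> readStore = new ConcurrentHashMap<>();

    Optional<OrderSummary> byId(String orderId) {
        return Optional.ofNullable(readStore.get(orderId));
    }
}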
Choosing the right database solution is an essential factor that could significantly influence your application’s overall performance. This article aims to provide a comprehensive comparison between AWS RDS MySQL and Aurora MySQL, two powerful database solutions offered by Amazon Web Services (AWS). I will delve into the specifics of their architecture, performance, data replication capabilities, security measures, cost efficiency, ease of use, integration capabilities, and support resources. By the end of this guide, you will be equipped with all the necessary information to make an informed decision about the most suitable database solution for your specific needs. AWS RDS MySQL and Aurora MySQL are both managed database services offered by Amazon Web Services. AWS RDS MySQL is a relational database service that provides cost-efficient and resizable capacity while automating time-consuming administration tasks. On the other hand, Aurora MySQL is a MySQL-compatible relational database engine that offers superior performance akin to high-end commercial databases at a fraction of the cost. The right database solution not only ensures efficient data management but also supports your applications' performance and scalability requirements. It can help you avoid potential downtime, enhance application responsiveness, and ensure data security and compliance. Thus, understanding the nuances of AWS RDS MySQL and Aurora MySQL becomes crucial in determining the best fit for your particular scenario. Architecture and Performance AWS RDS MySQL uses a traditional monolithic architecture where the database exists on a single server or multiple servers working as one unit. This setup allows it to deliver a very fast, multi-threaded, and robust SQL database server, making it an ideal choice for mission-critical and heavy-load production systems. However, its architecture might have limitations when dealing with extremely high workloads. Unlike RDS MySQL, Aurora MySQL employs a distributed, fault-tolerant, self-healing storage system that auto-scales up to 64TB per database instance. This architecture enables Aurora MySQL to offer up to five times better performance than MySQL, making it a top choice for demanding applications that require high throughput and low latency. When we compare AWS RDS MySQL and Aurora MySQL in terms of performance and scalability, Aurora tends to outshine RDS MySQL. While RDS MySQL offers robust performance for a wide range of applications, Aurora MySQL's distributed architecture allows it to handle higher workloads, offering superior performance and scalability. However, the choice between the two will heavily depend on your specific workload and performance requirements. Data Replication and Availability AWS RDS MySQL supports data replication through its Read Replicas feature, allowing you to create up to five copies of your database. This process aids in enhancing the database's availability and durability. However, compared to Aurora MySQL, RDS MySQL's replication process is relatively slower. Aurora MySQL takes data replication a notch higher by allowing you to provision up to 15 replicas, and it performs replication in milliseconds. This quick replication process, coupled with automatic failover, mitigates data loss risks and ensures higher data availability. In terms of data availability and replication speed, Aurora MySQL has the upper hand over RDS MySQL. 
The ability to provision up to 15 replicas and its lightning-fast replication process make Aurora MySQL more resilient and reliable, especially for applications that demand high data availability. Security and Compliance AWS RDS MySQL offers robust security features, including network isolation using Amazon VPC, encryption at rest and in transit, IAM integration for access control, and automated patches and updates. It also complies with several key industry standards, providing a secure environment for your data. Just like RDS MySQL, Aurora MySQL also provides robust security features, including encryption at rest and in transit, network isolation using Amazon VPC, and IAM integration. Additionally, Aurora MySQL includes advanced features like database activity streams for real-time monitoring of the database, further enhancing its security posture. Both RDS MySQL and Aurora MySQL offer strong security features, ensuring that your data is protected against potential threats. However, Aurora MySQL's additional capabilities, like real-time database activity streams, give it a slight edge over RDS MySQL when it comes to security. Cost Efficiency AWS RDS MySQL follows a pay-as-you-go pricing model. The costs are based on the resources consumed, such as compute instances, storage, and data transfer. This flexible pricing structure can be cost-effective, especially for small to medium-sized workloads. Just like RDS MySQL, Aurora MySQL also follows a pay-as-you-go pricing model, with charges based on the resources used. However, considering its superior performance and scalability features, Aurora MySQL delivers similar performance to high-end commercial databases at almost one-tenth the cost. While both RDS MySQL and Aurora MySQL offer cost-effective solutions, the choice between the two should center around your specific requirements. If you require a database for small to medium-sized workloads, RDS MySQL could be your cost-effective choice. However, if you're dealing with high-volume workloads and need superior performance and scalability, Aurora MySQL's high-end features might justify its higher costs. Ease of Use and Management AWS RDS MySQL offers automated backups, software patching, automatic failover, and recovery mechanisms, which significantly reduce the administrative burden. It also allows easy scaling of compute resources and storage capacity to meet the demands of your application. Aurora MySQL also provides a fully managed service that automates time-consuming tasks such as hardware provisioning, database setup, patching, and backups. Furthermore, it allows on-the-fly modifications to the instance type or storage, providing flexibility in managing your database operations. Both RDS MySQL and Aurora MySQL provide a fully managed experience, simplifying database management. However, Aurora MySQL's ability to make on-the-fly adjustments to instance types and storage adds an extra layer of flexibility, making it slightly more user-friendly in terms of management. Integration Capabilities RDS MySQL integrates well with other AWS services like Lambda, CloudWatch, and IAM. It also supports integration with third-party applications, providing flexibility in building diverse applications. Aurora MySQL not only integrates seamlessly with other AWS services but also supports native integration with Lambda, enabling serverless computing. It also supports cross-region replication with RDS for MySQL, increasing its extensibility. 
While both RDS MySQL and Aurora MySQL provide efficient integration capabilities, Aurora MySQL's native integration with Lambda and support for cross-region replication with RDS MySQL gives it a slight edge when it comes to integration efficiency. Conclusion To summarize, while both AWS RDS MySQL and Aurora MySQL offer robust performance, security, and ease of use, there are key differences. Aurora MySQL stands out with its superior performance, faster data replication, more flexible management, and enhanced integration capabilities. However, RDS MySQL might still be the optimal choice for small to medium-sized workloads, given its cost-efficiency and robust feature set. The decision between AWS RDS MySQL and Aurora MySQL should be made based on your specific needs. If your priority is superior performance, high scalability, and advanced integration capabilities, Aurora MySQL might be the best fit. However, if you're looking for a cost-effective solution for moderate workloads, RDS MySQL might be your go-to option. Ultimately, the choice between RDS MySQL and Aurora MySQL depends on your unique situation. It's important to assess your requirements, workload size, budget, and future growth plans before making a decision. Remember, what works best for one organization may not necessarily work best for another. It's all about aligning your choice with your specific needs and goals.
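As a practical footnote to the replication comparison above, the operational difference also shows up in how replicas are provisioned. The AWS CLI sketch below is illustrative only; the identifiers and instance class are placeholders, and the commands assume the source instance and Aurora cluster already exist.

Shell
# RDS MySQL: each read replica is created explicitly from a source DB instance
aws rds create-db-instance-read-replica \
  --db-instance-identifier mydb-replica-1 \
  --source-db-instance-identifier mydb

# Aurora MySQL: a reader is just another instance added to the cluster,
# sharing the cluster's distributed storage volume
aws rds create-db-instance \
  --db-instance-identifier my-aurora-reader-1 \
  --db-cluster-identifier my-aurora-cluster \
  --engine aurora-mysql \
  --db-instance-class db.r6g.large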
In most financial firms, online transaction processing (OLTP) often relies on static or infrequently updated data, also called reference data. Reference data sources don’t always require ACID transaction capabilities, rather need support for fast read queries often based on simple data access patterns, and event-driven architecture to ensure the target systems remain up-to-date. NoSQL databases emerge as ideal candidates to meet these requirements, and cloud platforms such as AWS offer managed and highly resilient data ecosystems. In this article, I am not going to determine which AWS NoSQL database is better: the concept of a better database only exists within a specific purposeful context. I will share a coding lab to measure the performance of AWS-managed NoSQL databases such as DynamoDB, Cassandra, Redis, and MongoDB. Performance Testing I will start by defining the performance test case, which will concurrently insert a JSON payload 200 times and then read it 200 times. JSON Payload The base/parent class in base_db.py implements the test case logic of executing 10 concurrent threads to create and read 200 records. Python #imports ..... class BaseDB: def __init__(self, file_name='instrument.json', threads=10, records=20): ................................... def execute(self): create_threads = [] for i in range(self.num_threads): thread = threading.Thread( target=self.create_records, args=(i,)) create_threads.append(thread) thread.start() for thread in create_threads: thread.join() read_threads = [] for i in range(self.num_threads): thread = threading.Thread(target=self.read_records, args=(i,)) read_threads.append(thread) thread.start() for thread in read_threads: thread.join() self.print_stats() Each thread executes the write/read routine in the create_records and read_records, respectively. Notice that these functions do not include any database-specific logic, but rather, measure the performance of each read-and-write execution. Python def create_records(self, thread_id): for i in range(1, self.num_records + 1): key = int(thread_id * 100 + i) start_time = time.time() self.create_record(key) end_time = time.time() execution_time = end_time - start_time self.performance_data[key] = {'Create Time': execution_time} def read_records(self, thread_id): for key in self.performance_data.keys(): start_time = time.time() self.read_record(key) end_time = time.time() execution_time = end_time - start_time self.performance_data[key]['Read Time'] = execution_time Once the test case is executed, the print_stats function prints the execution metrics such as the read/write mean and the standard deviation (stdev) values, which indicate database read/write performance and consistency (smaller stdev implies more consistent execution performance). Python def print_stats(self): if len(self.performance_data) > 0: # Create a Pandas DataFrame from performance data df = pd.DataFrame.from_dict(self.performance_data, orient='index') if not df.empty: df.sort_index(inplace=True) # Calculate mean and standard deviation for each column create_mean = statistics.mean(df['Create Time']) read_mean = statistics.mean(df['Read Time']) create_stdev = statistics.stdev(df['Create Time']) read_stdev = statistics.stdev(df['Read Time']) print("Performance Data:") print(df) print(f"Create Time mean: {create_mean}, stdev: {create_stdev}") print(f"Read Time mean: {read_mean}, stdev: {read_stdev}") NoSQL Code Unlike relational databases that support standard SQL, each NoSQL database has its own SDK. 
The child test case classes for each NoSQL database only need to implement a constructor and create_record/read_record functions that use the database's own SDK to instantiate a connection and to create/read records in a few lines of code.

DynamoDB Test Case

Python
import boto3

from base_db import BaseDB


class DynamoDB(BaseDB):

    def __init__(self, file_name='instrument.json', threads=10, records=20):
        super().__init__(file_name, threads, records)
        dynamodb = boto3.resource('dynamodb', region_name='us-east-1')
        table_name = 'Instruments'
        self.table = dynamodb.Table(table_name)

    def create_record(self, key):
        item = {
            'key': key,
            'data': self.json_data
        }
        self.table.put_item(Item=item)

    def read_record(self, key):
        self.table.get_item(Key={'key': key})


if __name__ == "__main__":
    DynamoDB().execute()

AWS Setup

To execute these performance test cases in an AWS account, you should follow these steps:
1. Create an EC2 IAM role with privileges to access the required AWS data services.
2. Launch an EC2 instance and assign the newly created IAM role.
3. Create each NoSQL database instance.

IAM Role
DynamoDB Table
Cassandra Keyspace/Table

Please note that the DB host and credentials in the mongo_db.py and redis_db.py modules were hardcoded and have been removed; they will need to be updated with the corresponding database connection settings for your AWS account. To connect to DynamoDB and Cassandra, I opted to use the Boto3 session credentials temporarily assigned to the db_performnace_iam_role IAM Role. This code will run in any AWS account in the us-east-1 region without any modification.

Python
class CassandraDB(BaseDB):

    def __init__(self, file_name='instrument.json', threads=10, records=20):
        super().__init__(file_name=file_name, threads=threads, records=records)
        self.json_data = json.dumps(
            self.json_data, cls=DecimalEncoder).encode()

        # Cassandra Keyspaces configuration
        contact_points = ['cassandra.us-east-1.amazonaws.com']
        keyspace_name = 'db_performance'

        ssl_context = SSLContext(PROTOCOL_TLSv1_2)
        ssl_context.load_verify_locations('sf-class2-root.crt')
        ssl_context.verify_mode = CERT_REQUIRED

        boto_session = boto3.Session(region_name="us-east-1")
        auth_provider = SigV4AuthProvider(session=boto_session)

        cluster = Cluster(contact_points, ssl_context=ssl_context,
                          auth_provider=auth_provider, port=9142)
        self.session = cluster.connect(keyspace=keyspace_name)

Connect to the EC2 instance (I used the Session Manager) and run the following Shell script to perform these tasks:
1. Install Git.
2. Install Python3.
3. Clone the GitHub db_performance repository.
4. Install and activate the Python3 virtual environment.
5. Install third-party libraries/dependencies.
6. Execute each test case.

Shell
sudo yum install git
sudo yum install python3
git clone https://github.com/dshilman/db_performance.git
cd db_performance
sudo git pull
python3 -m venv venv
source ./venv/bin/activate
sudo python3 -m pip install -r requirements.txt
cd code
sudo python3 -m dynamo_db
sudo python3 -m cassandra_db
sudo python3 -m redis_db
sudo python3 -m mongo_db

You should see the following output for the first two test cases:

(venv) sh-5.2$ sudo python3 -m dynamo_db
Performance Data:
     Create Time  Read Time
1       0.336909   0.031491
2       0.056884   0.053334
3       0.085881   0.031385
4       0.084940   0.050059
5       0.169012   0.050044
..           ...        ...
916     0.047431   0.041877
917     0.043795   0.024649
918     0.075325   0.035251
919     0.101007   0.068767
920     0.103432   0.037742

[200 rows x 2 columns]
Create Time mean: 0.0858926808834076, stdev: 0.07714510154026173
Read Time mean: 0.04880355834960937, stdev: 0.028805479258627295
Execution time: 11.499964714050293

(venv) sh-5.2$ sudo python3 -m cassandra_db
Performance Data:
     Create Time  Read Time
1       0.024815   0.005986
2       0.008256   0.006927
3       0.008996   0.009810
4       0.005362   0.005892
5       0.010117   0.010308
..           ...        ...
916     0.006234   0.008147
917     0.011564   0.004347
918     0.007857   0.008329
919     0.007260   0.007370
920     0.004654   0.006049

[200 rows x 2 columns]
Create Time mean: 0.009145524501800537, stdev: 0.005201661271831082
Read Time mean: 0.007248317003250122, stdev: 0.003557610695674452
Execution time: 1.6279327869415283

Test Results (all times in seconds)

             DynamoDB          Cassandra         MongoDB           Redis
Create       mean:  0.0859     mean:  0.0091     mean:  0.0292     mean:  0.0028
             stdev: 0.0771     stdev: 0.0052     stdev: 0.0764     stdev: 0.0049
Read         mean:  0.0488     mean:  0.0072     mean:  0.0509     mean:  0.0012
             stdev: 0.0288     stdev: 0.0036     stdev: 0.0027     stdev: 0.0016
Exec Time    11.45 sec         1.6279 sec        10.2608 sec       0.3465 sec

My Observations

I was blown away by Cassandra's fast performance. Cassandra's support for CQL, an SQL-like query language, allows rich access-pattern queries, and AWS Keyspaces offers cross-region replication. I find DynamoDB's performance disappointing despite the AWS hype about it. You should try to avoid the cross-partition table scan and thus must use an index for each data access pattern. DynamoDB global tables enable cross-region data replication. MongoDB has a very simple SDK, is fun to use, and has the best support for the JSON data type. You can create indexes and run complex queries on nested JSON attributes. As new binary data formats are emerging, MongoDB may lose its appeal. Redis performance is amazingly fast; however, at the end of the day, it's a key/value cache, even if it supports complex data types. Redis offers powerful features such as pipelining and scripting to further improve query performance by passing code to Redis to execute on the server side.

Conclusion

Choosing an AWS-managed NoSQL database for your enterprise reference data platform depends on your specific priorities. If performance and cross-region replication are your primary concern, AWS-managed Cassandra (Amazon Keyspaces) stands out as a clear winner. DynamoDB integrates well with other AWS services such as Lambda and Kinesis and therefore is a great option for AWS-native or serverless architectures. For applications requiring robust support for JSON data types, MongoDB takes the lead. However, if your focus is on fast lookup or session management for high availability, Redis proves to be an excellent option. Ultimately, the decision should align with your organization's unique requirements. As always, you can find the code in the GitHub repo linked earlier in this article (see Shell script task #3 above). Feel free to contact me if you need help running this code or with the AWS setup.
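For completeness, here is an illustrative sketch of what a Redis child class could look like, following the same pattern as the DynamoDB test case above. The connection settings are deliberately placeholders (as noted earlier, the real redis_db.py had its host and credentials removed), so point it at your own Redis/ElastiCache endpoint and TLS settings before running it:

Python
import json

import redis

from base_db import BaseDB


class RedisDB(BaseDB):

    def __init__(self, file_name='instrument.json', threads=10, records=20):
        super().__init__(file_name, threads, records)
        # Placeholder connection settings; replace host/port/ssl for your environment
        self.client = redis.Redis(host='localhost', port=6379, ssl=False)
        self.payload = json.dumps(self.json_data)

    def create_record(self, key):
        self.client.set(key, self.payload)

    def read_record(self, key):
        self.client.get(key)


if __name__ == "__main__":
    RedisDB().execute()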
Relational Database Management Systems (RDBMS) represent the state of the art, thanks in part to their well-established ecosystem of surrounding technologies and tools and to widespread professional skills. In the current era of technological change spanning both Information Technology (IT) and Operational Technology (OT), it is widely recognized that significant performance challenges arise in specific use cases where NoSQL solutions outperform traditional approaches. Indeed, the market offers many NoSQL DBMS solutions built around a variety of data models:

Key-value stores (e.g., the simplest form of storage, where access to persisted data must be instantaneous and retrieval is done by key, as in a hash map or dictionary);
Document-oriented stores (e.g., widely adopted in serverless and lambda-function architectures where clients need a well-structured DTO directly from the database);
Graph-oriented stores (e.g., useful for knowledge management, the semantic web, or social networks);
Column-oriented stores (e.g., providing highly optimized, "ready-to-use" data projections in query-driven modeling approaches);
Time series stores (e.g., for handling sensor and sample data in Internet of Things scenarios);
Multi-model stores (e.g., combining different data models for mixed functional purposes).

"Errors using inadequate data are much less than those using no data at all." (Charles Babbage)

A less-explored concern is the ability of software architectures that rely on relational solutions to adapt flexibly to rapid and frequent changes in the software domain and in functional requirements. This challenge is exacerbated by Agile-like development methodologies, which aim to satisfy customers facing continuously emerging demands from their business markets. RDBMS, by their very nature, may suffer when software requirements change over time: changes ripple quickly through tabular schemas, introducing new association tables (often replacing pre-existing foreign keys) and producing new JOIN clauses in SQL queries, resulting in more complex and less maintainable solutions. In our enterprise experience, we have successfully implemented and experimented with a graph-oriented DBMS solution based on the Neo4j Graph Database to attenuate the architectural consequences of requirements changes, within an operational context typical of a digital social community with different users and roles. In this article, we:

Exemplify how a graph-oriented DBMS is more resilient to changes in functional requirements;
Discuss the feasibility of adopting graph-oriented DBMSs in a classic N-tier (layered) architecture, proposing approaches for overcoming the main difficulties;
Highlight advantages, disadvantages, and threats to their adoption in various contexts and use cases.

The Neo4j Graph Database

The idea behind graph-oriented data models is to handle entities (i.e., nodes) and the relationships between them (i.e., edges) natively, so that the knowledge base (the knowledge graph) can be queried by navigating relationships between entities. The Neo4j Graph Database works on directed property graphs in which both nodes and edges can carry property attributes. We chose it as our DBMS primarily for:

Its "native" implementation, concretely modeled as a graph meta-model whose runtime instance is composed of nodes (containing the domain entities with their attributes) and edges (representing navigable relationships among the interconnected concepts). In this way, relationships are traversed in O(1);
The Cypher query language, adopted as a powerful and intuitive way to query the knowledge persisted within the graph.

Furthermore, the Neo4j Graph Database also offers Java libraries for Object Graph Mapping (OGM), which help developers automate the mapping, persistence, and management of model entities, nodes, and relationships. Practically, OGM plays, for graph-oriented DBMSs, the same role that the Object Relational Mapping (ORM) pattern plays for relational persistence layers. Comparable to the ORM pattern designed for RDBMS, the OGM pattern streamlines the implementation of Data Access Objects (DAOs): its primary function is to enable the semi-automated persistence of domain model entities that are properly configured and annotated within the source code. With respect to the Java Persistence API (JPA)/Hibernate, widely recognized as a leading ORM technology, Neo4j's OGM library operates in a distinctive manner:

Write operations: OGM propagates persistence changes across all relationships of a managed entity (analyzing the whole tree of object relationships starting from the managed object); JPA performs updates table by table, starting from the managed entity and handling relationships based on cascade configurations.
Read operations: OGM retrieves an entire "tree of relationships" down to a fixed depth per query, starting from the specified node, which acts as the "root of the tree"; JPA allows each relationship to be configured for either EAGER or LAZY loading.

Solution Benefits of an Exemplary Case Study

To illustrate our analysis, we introduce a simple operative scenario: the UML class diagram of Fig. 1.1 depicts an entity User with a 1-to-N relationship to the entity Auth (short for Authorization), which defines permissions and grants inside the application. This domain model may be supported in an RDBMS by a schema like that of Tab. 1.1 and Tab. 1.2 or, in a graph-oriented DBMS, by the knowledge graph of Fig. 1.2.

Fig. 1.1: UML Class Diagram of the Domain Model.

users table: id | firstName | lastName
Tab. 1.1: Table mapped within the RDBMS schema for the User entity.

auths table: id | name | level | user_fk
Tab. 1.2: Table mapped within the RDBMS schema for the Auth entity.

Fig. 1.2: Knowledge graph related to the Domain Model of Fig. 1.1.

Now, imagine that a new requirement emerges during the production lifecycle of the application: for administrative reasons, the customer needs to bind authorizations to specific time periods (i.e., from and until dates of validity), as in Fig. 2.1, transforming the relationship between User and Auth into an N-to-N. This domain model may be supported in an RDBMS by a schema like that of Tab. 2.1, Tab. 2.2, and Tab. 2.3 or, in a graph-oriented DBMS, by the knowledge graph of Fig. 2.2.

Fig. 2.1: UML Class Diagram of the Domain Model after the definition of the new requirements.

users table: id | firstName | lastName
Tab. 2.1: Table mapped within the RDBMS schema for the User entity.

users_auths table: user_fk | auth_fk | from | until
Tab. 2.2: Table mapped within the RDBMS schema for storing associations between User and Auth entities.

auths table: id | name | level
Tab. 2.3: Table mapped within the RDBMS schema for the Auth entity.

Fig. 2.2: Knowledge graph related to the Domain Model of Fig. 2.1.

The advantage is already clear at the schema level: the graph-oriented approach did not change the schema at all but only prescribes the definition of two new properties on the edge (modeling the relationship), while the RDBMS approach required creating the new association table users_auths, replacing the foreign key that the auths table previously held toward the users table. Proceeding with a deeper analysis, we can compare a SQL query with the equivalent query written in the Cypher query language under the two approaches. Suppose we want to identify users with the first name "Paul" having an Auth named "admin" with a level greater than or equal to 3. In SQL, the required queries (the first for retrieving data from Tab. 1.1 and Tab. 1.2, the second for Tab. 2.1, Tab. 2.2, and Tab. 2.3) are:

SQL
SELECT users.*
FROM users
INNER JOIN auths ON users.id = auths.user_fk
WHERE users.firstName = 'Paul' AND auths.name = 'admin' AND auths.level >= 3

SQL
SELECT users.*
FROM users
INNER JOIN users_auths ON users.id = users_auths.user_fk
INNER JOIN auths ON auths.id = users_auths.auth_fk
WHERE users.firstName = 'Paul' AND auths.name = 'admin' AND auths.level >= 3

In the Cypher query language, the required query (for both cases) is:

Cypher
MATCH (u:User)-[:HAS_AUTH]->(auth:Auth)
WHERE u.firstName = 'Paul' AND auth.name = 'admin' AND auth.level >= 3
RETURN u

While the SQL query needs one more JOIN clause after the schema change, the Cypher query not only avoids any additional clause or variation in the MATCH path, it remains completely identical. No changes were necessary to the "query system" of the backend!

Conclusions

Wedge Engineering contributed as the technological partner within an international project where a collaborative social platform was designed as a decoupled web application in a 3-tier architecture composed of:

A backend module, a layered RESTful architecture, leveraging the JakartaEE framework;
A knowledge graph, the NoSQL store provided by the Neo4j Graph Database;
A frontend module, a single-page app based on HTML, CSS, and JavaScript, exploiting the Angular framework.

The most challenging design choice we had to face was whether to use a driver that natively exploits the Cypher query language or to leverage the OGM library to simplify DAO implementations. We found that building an entire application with custom queries written in Cypher is neither feasible nor scalable, while OGM may not be efficient enough when dealing with large data hierarchies that involve a significant number of relationships to referenced external entities. We finally opted for a hybrid approach: OGM as the reference solution for mapping nodes and edges in an ORM-like perspective and for supporting the implementation of ad hoc DAOs, complemented by custom query methods to optimize the specific cases where the generated queries did not perform well. In conclusion, we can claim that the adopted software architecture responded well to changes in the knowledge graph schema and completely fulfilled the customer's needs, while reducing the effort required of the Wedge Engineering development team.

Nevertheless, some threats have to be considered before adopting this architecture:

SQL expertise is far more common than Cypher query language expertise → it is much easier to find (and thus include within a development team) experts able to maintain code for an RDBMS than for the Neo4j Graph Database;
Neo4j's system requirements for on-premise production are significant (i.e., for server-based environments, at least 8 GB of RAM are recommended) → this solution may not be the best fit for limited-resource scenarios and low-cost implementations;
Despite our best efforts, we did not find any open-source editor that is "ready and easy to use" for navigating the Neo4j Graph Database data structure (the official Neo4j data browser does not allow data modifications through the GUI without a custom MERGE/CREATE query), whereas many such tools exist for RDBMS → this may be intrinsically caused by the characteristic data model, which makes tabular views of the data harder to realize.
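To give a flavor of the OGM mapping discussed above, here is a minimal, hypothetical sketch of how the User–Auth model with properties on the relationship could be annotated with the Neo4j OGM library. Class and property names mirror the example domain; this is an illustration under those assumptions, not the project's actual code:

Java
import java.util.HashSet;
import java.util.Set;

import org.neo4j.ogm.annotation.EndNode;
import org.neo4j.ogm.annotation.GeneratedValue;
import org.neo4j.ogm.annotation.Id;
import org.neo4j.ogm.annotation.NodeEntity;
import org.neo4j.ogm.annotation.Relationship;
import org.neo4j.ogm.annotation.RelationshipEntity;
import org.neo4j.ogm.annotation.StartNode;

@NodeEntity
public class User {
    @Id @GeneratedValue
    private Long id;
    private String firstName;
    private String lastName;

    // Each grant is modeled as an edge that carries its own validity period
    @Relationship(type = "HAS_AUTH")
    private Set<HasAuth> auths = new HashSet<>();
}

@RelationshipEntity(type = "HAS_AUTH")
class HasAuth {
    @Id @GeneratedValue
    private Long id;
    @StartNode
    private User user;
    @EndNode
    private Auth auth;
    private String from;  // validity start (ISO date); a real model might use LocalDate with a type converter
    private String until; // validity end
}

@NodeEntity
class Auth {
    @Id @GeneratedValue
    private Long id;
    private String name;
    private int level;
}

The N-to-N change described earlier would then amount to adding the two edge properties, with no change to the node entities or to the Cypher query shown above.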