16. Grid, mesh and hub
New recipes in data integration
No pies, just potatoes, vegetables and meat. In other words, the standard meal Aleksandra grew up with. With the arrival of migrants from Indonesia and Southern Europe, all sorts of new dishes came to the Netherlands. Aleksandra enjoys all of these cuisines that are now available in the Netherlands.
Even the village where Thomas grew up had a Chinese-Indonesian restaurant early on. As a child, he loved going there and would order anything and everything. His grandfather also liked to go there, but he didn't think it should be "too adventurous." So he always ordered the same thing: Bami Goreng (Special).
With newer generations of immigrants from Suriname, Morocco and Turkey, entirely new dishes came to the Netherlands, such as roti, couscous and döner kebab. Largely, these "newer" trends passed Thomas' grandparents by. In fact, his grandfather ate pizza for the first time at the age of 73. "That's darn tasty!", he remarked.
For Thomas' parents, pizza (from Dr. Oetker) and couscous are commonplace. However, things popular with younger generations largely pass them by. Think of trends like vegan food, fusion cuisine and sushi.
Trends in IT
Just as generations of migrants brought new dishes, waves of new technology keep washing over IT. As an IT startup, you jump in on the latest wave, and that becomes your foundation. By the time you have mastered the technology, the next wave is already coming.
If you've been through a few of these waves, it gets a little tiresome that every new technology is presented as the best invention since sliced bread. To stay in the food analogy, Aleksandra then thinks, "That's just old wine in new bottles."
Yet a new technology trend also brings something refreshing. Often it solves a problem from one of the previous waves (while, of course, introducing problems of its own). A new trend therefore does not completely replace the previous one; rather, it is an enrichment.
Technological enrichment works the same way as the enrichment of our food culture. We don't all just eat sushi now, for example. At home, we still enjoy eating potatoes, vegetables and (veggie) meat just as much. And we also still eat noodles, spaghetti and couscous. It has all just become more varied.
A new wave of data integration trends
The potatoes, vegetables and meat of data integration are file exchange, data synchronization and brokers. Think of technologies such as FTP, adapters and database links. In later generations, things like Remote Procedure Calls, the Enterprise Service Bus and streaming brokers were added. In recent years, APIs have been the best thing since sliced bread.
However, there is already another wave of integration trends with technologies such as:
- event grid
- service mesh
- data hub
Let’s open these wine bottles and see how they taste.
Event Grid
An event grid is basically a notification service. Think of a notification on your phone. Something happens: an update becomes available, the battery is low, and so on. You can then take action based on such a notification.
Event grids are notification services between systems. These so-called grids answer a number of questions:
- where is something happening? (source)
- what happened? (event)
- who wants to know? (target)
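To make this concrete, here is a minimal sketch (in Python, with made-up values) of what such a notification can look like, using the field names of the Azure Event Grid event schema:

```python
# A single notification in the Azure Event Grid event schema, expressed as a
# Python dict. All values are illustrative; the field names (id, topic,
# subject, eventType, eventTime, data, dataVersion) follow the schema.
notification = {
    "id": "b3d9e2f0-0000-0000-0000-000000000000",        # unique event id
    "topic": "/subscriptions/.../storageAccounts/demo",  # where? (source)
    "subject": "/blobServices/default/containers/invoices/blobs/inv-001.pdf",
    "eventType": "Microsoft.Storage.BlobCreated",        # what? (event)
    "eventTime": "2023-05-01T12:00:00Z",
    "dataVersion": "1.0",
    "data": {
        "url": "https://demo.blob.core.windows.net/invoices/inv-001.pdf"
    },
}
# "Who wants to know?" (the target) is not part of the event itself:
# subscribers register event subscriptions on the topic, and the grid
# routes matching notifications to them.
```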
Event-driven systems are nothing new. Usually, however, these involve not notifications but messages carrying a lot of data. For example, messages in a broker such as Kafka (sometimes called an event hub) can be practically unlimited in size, while notifications in Azure Event Grid, for example, are limited to 64 KB. Event grids are thus lightweight.
Another difference is that a broker such as Kafka lets clients actively connect; it thinks from the application's point of view. One client application pushes a message to the broker and another client pulls it in (push-pull). An event grid is much more of a pass-through: push-push.
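A minimal sketch of that difference, assuming a local Kafka broker, the third-party confluent-kafka client and an illustrative topic name; the grid side is simulated with a plain webhook receiver, since grids push notifications to a registered endpoint:

```python
# Pull (broker, e.g. Kafka): the consumer actively polls for messages.
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",  # assumed local broker
    "group.id": "demo-consumer",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["orders"])              # hypothetical topic
msg = consumer.poll(timeout=5.0)            # the client pulls; nothing is pushed to it
if msg is not None and msg.error() is None:
    print("pulled:", msg.value())
consumer.close()

# Push (event grid): the grid delivers the notification to your endpoint.
# A minimal webhook receiver using only the standard library:
from http.server import BaseHTTPRequestHandler, HTTPServer
import json

class EventHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        body = self.rfile.read(int(self.headers["Content-Length"]))
        for event in json.loads(body):      # grids typically deliver event batches
            print("pushed:", event.get("eventType"))
        self.send_response(200)
        self.end_headers()

# HTTPServer(("", 8080), EventHandler).serve_forever()  # uncomment to listen
```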
A final point is that endpoints (a source or a target) can be configured from the grid. The sources and targets must therefore already support notifications. Examples are the sources (event sources) and targets (event handlers) in Azure:
So event grids are particularly useful for responding to events within a particular ecosystem. If something happens within this system, another component can take action on it. You no longer need a complex middleware system for this.
Service mesh
A service mesh is also not a completely new concept, but it has been revived by microservices architectures. Microservices require an enormous amount of communication between different services. A service mesh makes this possible.
Normally, most middleware sits in the integration or application layer. With a service mesh, this is one level deeper, in the infrastructure layer. This layer provides service-to-service communication between (micro)services using a proxy.
Each microservice contains a small piece of business logic; it takes several microservices to form a business application or business process. All these services are therefore constantly communicating with each other. Without a mesh, each microservice needs its own module to take care of that communication: think routing, retries and authorization.
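As a rough sketch of what such a module amounts to, here is the kind of communication boilerplate every microservice would otherwise carry itself (the URL, token handling and retry count are illustrative assumptions):

```python
import time
import urllib.request

def call_service(url: str, token: str, retries: int = 3) -> bytes:
    """Call another microservice with auth and retries -- the kind of
    communication logic each service would otherwise implement itself."""
    request = urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})
    for attempt in range(retries):
        try:
            with urllib.request.urlopen(request, timeout=2.0) as response:
                return response.read()
        except OSError:
            time.sleep(2 ** attempt)  # simple exponential backoff
    raise RuntimeError(f"service at {url} unreachable after {retries} attempts")

# A sidecar proxy can take all of this over, as described next.
```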
With a service mesh, this logic no longer needs to be built into the microservice itself. Each microservice gets a proxy (also called a sidecar) that contains this logic. Through a control plane, such proxies can be configured and managed:
As the above diagram shows, a service mesh is not a necessary component of a microservices environment. It is quite possible to program this logic yourself or to use specific libraries (Netflix, for example, previously used only libraries). The use of the sidecar pattern is on the rise, though, because it takes a lot of work off developers' hands.
Some well-known service meshes are:
- Istio (Google)
- Open Service Mesh (Microsoft)
- Consul
The mesh
If the functionality to communicate between services is not in the application, how does communication work?
Once a microservice is deployed in a cluster, it is automatically assigned a sidecar. No request goes directly to the service; all requests pass through the sidecar proxies. Together, the proxies form a mesh:
While the principle of a service mesh is very simple, it is also notorious for its complex configuration. Setting up a service mesh only makes sense when deploying lots of microservices. Cloud specialist Gadi Naor comments on this:
Service mesh platforms — particularly Istio — are incredibly feature rich and extensible, with capabilities for automated security, access control policies, observability, and service connectivity. But with that complexity comes a steep learning curve. For smaller teams or those without extensive experience with Istio already, it may be more trouble than it’s worth.
Hubs
In general, a hub is a point where multiple flows come together; the hub acts as a transfer point to other flows. Think of a:
- Airline hub
- Internet hub
- USB hub
So the basis of the hub concept is that flows come together and are then further distributed. This is also called the hub-spoke model. Consider international airports that serve as interchanges for other, regional airports. In the example below, hubs A and B could be the transfer airports of Amsterdam and Frankfurt:
In data integration, the same principle can be applied: multiple data streams come together and are redistributed further. Within integration, different types of hubs are distinguished:
- Event hub: a streaming broker to which multiple messages (events) are pushed and routed. A client can subscribe to consume certain events.
- Integration hub: usually a marketplace (aka exchange) where multiple connectors can be chosen that together form an integration.
- Data hub: a place where data from multiple sources comes together, is aggregated and shared. So it’s not just about storing data centrally, but also aggregating and delivering data.
Event hubs have already been discussed in the previous chapter. We now turn to the last type, data hubs.
Data hubs: data as a product
It is often said these days that data is the new gold. In practice, though, the focus is still on applications and the functionality surrounding those applications. Data has of course long been a product in its own right, but it has rarely been the center of attention. Data hubs can change this.
Data integration has traditionally focused primarily on enabling access to, and exchange between, different applications. That alone does not make data an independent product. A first step in that direction is often to work API-driven and create a data mesh.
In a data mesh, APIs are designed as products and made available through a self-service developer portal. However, this API-driven way of working also has some limitations.
For example:
- Dependence on the quality of the underlying API/application.
- Dependence on the quality of network connections to an API.
- Dependence on central management (through an API Platform).
- Data is spread across applications (functions/domains).
What if we could bring APIs (and API Management) and Data (and Data Management) together in one solution? This is the idea behind a data hub.
Data hub is a term that originally comes from Gartner. On their website, they state about it:
It turns out that so many organizations had assumed all data and/or policy should be centralized (think ERP), or widely distributed at the edge (think standalone business application or IoT edge device).
It turns out that a 'new' approach, one that permits the notion of intermediate nodes to store/execute the policy or trusted data itself somewhere within all the spaghetti of systems, yields real efficiencies and drives agility.
Those “nodes” we introduced as ‘data hubs’, knowing that the word “hub” was already well used.
Below are diagrams of the different modes of data distribution that Gartner describes:
Hub-Spoke
The basic idea of a data hub, as we saw earlier, is that of an international airport serving as a transfer point for a number of smaller airports nearby. This is known as the hub-spoke model:
Usually there is not one hub but several. The subnodes can communicate with each other directly as well as through the hubs. For airports, this looks as follows:
The concept
For a long time, data hubs were approached skeptically as data collections, because they demanded a lot in terms of storage space, data replication, scalability, connectivity and security. In addition, there was no clear protocol for other systems to tap into them.
A common pattern was that data was collected centrally in a data warehouse but used only for analysis and reporting purposes. Data hubs are explicitly meant for operational purposes as well. Wikipedia states:
A data hub differs from a data warehouse in that it is generally unintegrated and often at different grains. It differs from an operational data store because a data hub does not need to be limited to operational data.
A data hub differs from a data lake by homogenizing data and possibly serving data in multiple desired formats, rather than simply storing it in one place, and by adding other value to the data such as de-duplication, quality, security, and a standardized set of query services.
Conceptually, a data hub sits somewhere between central and decentralized, and between operational and strategic.
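To make the concept tangible, here is a toy sketch of a data hub in Python, loosely following the Wikipedia description above: data from multiple sources is homogenized, de-duplicated and then served in more than one format. All names and fields are illustrative.

```python
import csv
import io
import json

class DataHub:
    """Toy data hub: ingest from multiple sources, homogenize, serve."""

    def __init__(self) -> None:
        self._records: dict[str, dict] = {}  # keyed by id -> de-duplication

    def ingest(self, source: str, rows: list[dict]) -> None:
        for row in rows:
            record = {                       # homogenize to one schema
                "id": str(row["id"]),
                "name": row.get("name") or row.get("title", ""),
                "source": source,
            }
            self._records[record["id"]] = record  # last write wins

    def serve_json(self) -> str:
        return json.dumps(list(self._records.values()))

    def serve_csv(self) -> str:
        out = io.StringIO()
        writer = csv.DictWriter(out, fieldnames=["id", "name", "source"])
        writer.writeheader()
        writer.writerows(self._records.values())
        return out.getvalue()

hub = DataHub()
hub.ingest("erp", [{"id": 1, "name": "Bolt M8"}])
hub.ingest("webshop", [{"id": 1, "title": "Bolt M8 (zinc)"}])  # duplicate id
print(hub.serve_json())  # one de-duplicated record, served as JSON or CSV
```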
Advantages
But what is the advantage of a data hub? The main advantage is the same as with large airports serving as hubs. You don't have to connect every point with every other point; instead, you choose a number of points where data is partially centralized.
Several smaller systems can thus easily exchange data through the hub. Each system does not need to be connected to one large central system, only to the nearest data hub.
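A quick back-of-the-envelope calculation shows why this matters: with point-to-point integration the number of connections grows quadratically with the number of systems, while with hubs it grows roughly linearly. The numbers below are purely illustrative:

```python
def point_to_point(n: int) -> int:
    """Every system connected to every other system."""
    return n * (n - 1) // 2

def hub_spoke(n: int, hubs: int) -> int:
    """Each system connects to its nearest hub; hubs are fully interconnected."""
    return n + hubs * (hubs - 1) // 2

n = 20
print(point_to_point(n))  # 190 connections
print(hub_spoke(n, 2))    # 21 connections (20 spokes + 1 hub-to-hub link)
```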
As a concept, data hubs are often pitted against a central ERP system or a mainframe. However, the approach also contrasts well with more traditional decentralized approaches, which usually assume that storage and integration are separated.
An integration in a decentralized application landscape usually has the following structure:
- source (stored data)
- connector/API to the source
- middleware accesses the source via the connector/API
- middleware: integration logic that routes the data, transforms it and so on
- connector/API to the target: retrieves/sends data from/to the target system
- target: consumes the data (processes and/or stores it)
Above are six steps, but in practice these can be long chains of dozens of steps, sometimes with dependencies on third-party systems and APIs as well.
Data hubs can significantly reduce the number of steps. An example of such a data hub is an item information hub. There are an enormous number of suppliers who make items and an enormous number of companies that sell them. You could build an integration with every individual supplier, but that amounts to a huge amount of work. Therefore, it is often decided to collect the item information at a central point.
Item data from thousands of suppliers is collected, and each buyer can consume part of the data streams via an API (perhaps in its own data format or protocol). The item data example involves multiple sources but the same kind of data each time. Of course, it is also possible to combine different types of data, for example different financial data in one data hub:
OData
A term used with many data hubs is OData (Open Data Protocol). This is a standard based on a set of best practices for creating REST APIs.
OData helps you focus on your business logic while building RESTful APIs without having to worry about the various approaches to define request and response headers, status codes, HTTP methods, URL conventions, media types, payload formats, query options, etc. OData also provides guidance for tracking changes, defining functions/actions for reusable procedures, and sending asynchronous/batch requests.
Several data hub products, such as those from SAP, MarkLogic and Mendix, support OData.
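As a sketch of what querying an OData-enabled hub looks like, here is a request built from OData's standardized query options ($select, $filter, $top); the host and the Items entity set are assumptions for illustration:

```python
import urllib.parse
import urllib.request

# Hypothetical OData endpoint of a data hub; "Items" is an assumed entity set.
base_url = "https://datahub.example.com/odata/Items"
query = urllib.parse.urlencode({
    "$select": "ItemId,Name,Price",   # return only these fields
    "$filter": "Price lt 100",        # standard OData filter syntax
    "$top": "10",                     # first ten results only
})
url = f"{base_url}?{query}"
print(url)  # the standardized query options travel in the URL

# Against a real endpoint you would simply fetch it:
# with urllib.request.urlopen(url) as response:
#     print(response.read().decode())  # OData responds with JSON by default
```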
Products
Thus, a data hub treats data as a product. As yet, however, only a few providers offer a data hub as a product or as part of their product.
There are three types of data hub products:
- data hub as part of a central application. The application is a hub for the rest of the organization. For example, a central financial system to which several small, financially related systems are linked. The central system is the hub for the small applications and for other domains in the organization.
- data hub as a business solution, where the data hub is already filled with data. For example, public data from different sources that is made accessible to one or more parties.
- data hub as a specific technical solution, where data from multiple smaller sources is centralized in an organization and accessed operationally from there. For example, multiple financial systems coming together in a data hub (which itself has no function other than being a hub).