12. Complexity in data integrations

How to deal with complexities and scaling issues

Raymond Meester
13 min read · May 17, 2024

Because of the war in Ukraine, energy prices have risen significantly. Bakeries, including Aleksandra’s, are quite affected by this. After all, all the machines and ovens use quite a lot of energy. In addition, it is a competitive market with many suppliers.

It is up to Aleksandra to always be innovative while also paying attention to costs. To that end, she has purchased new machines to expand her product range. She also has a production scheduling system that can schedule employees for each product.

A bigger assortment, more buyers and more applications. Now things are really starting to get complicated. The processes will have to be tightly regulated so that there are no bottlenecks and they can still be followed by staff.

Complex integrations

When business cases become more complicated and data volumes and the number of applications grow, data integrations also become more complex. In these cases, you will definitely need to have a strategy to deal with the complexity.

But even if the company is still small and use cases seem simple, an integration can quickly change from this:

into this:

Dataflow art

Where does this complexity come from? And what can you do about it? Often an integration becomes complicated because the integration is not created from data and architectural principles, but from applications and use cases. This also makes sense because:

  • the business is usually the client for building integrations
  • business reality is complex and cannot be easily captured 1 to 1 in software
  • the task is often to get data in one place and overcome differences between applications

Case study

In practice it goes like this. Business assignment:

“We have a new webshop where cakes and pastries can be ordered easily. The webshop needs all the article information to show on the site. We need to link the central system to this new webshop.”

The analysis

The central system contains the entire assortment; all items are in here. However, the function of the system is mainly focused on logistics and production. So it contains all kinds of data about item numbers, suppliers and size of packages. Daily, the total article file is exported as CSV to an FTP server.

The webshop has more of a commercial purpose. The idea is to display the cakes nicely with good pictures. In addition, it is required by law to display information about ingredients and allergens. This information is not in the central system, but in the cloud application with recipes.

Aleksandra indicates that there is quite a rush. It is actually assumed that the website will be live in 6 weeks. In fact, the website builders have been working with some dummy data for months. They now have an API and are expecting the data in JSON format. Only the integration still needs to be built.

Design

There are three systems:

  1. ERP system: data in CSV format.
  2. Cloud app with recipes, ingredients and allergens: Data in XML format.
  3. Webshop: Data in JSON format.

Because there are no real standards and architectural principles, the responsibility for bridging the differences lies entirely in the integration layer and thus, in effect, with the developer of the integration.
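To make that bridging work concrete, here is a minimal sketch of what ends up in the integration layer: read the ERP export (CSV), look up the recipe data (XML) and build the structure the webshop expects (JSON). All field names, the semicolon delimiter and the sample data are assumptions for illustration; the article does not specify the actual formats.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

# Hypothetical sample data; the real field names are not given in the article.
ERP_CSV = """item_number;name;package_size
1001;Apple pie;6
1002;Cheesecake;1
"""

RECIPES_XML = """<recipes>
  <recipe item="1001">
    <ingredient>flour</ingredient>
    <ingredient>apple</ingredient>
    <allergen>gluten</allergen>
  </recipe>
</recipes>"""


def parse_recipes(xml_text):
    """Index ingredients and allergens by item number."""
    recipes = {}
    for recipe in ET.fromstring(xml_text).findall("recipe"):
        recipes[recipe.get("item")] = {
            "ingredients": [e.text for e in recipe.findall("ingredient")],
            "allergens": [e.text for e in recipe.findall("allergen")],
        }
    return recipes


def build_webshop_articles(csv_text, xml_text):
    """Merge ERP rows with recipe data into webshop-ready dicts."""
    recipes = parse_recipes(xml_text)
    articles = []
    for row in csv.DictReader(io.StringIO(csv_text), delimiter=";"):
        extra = recipes.get(row["item_number"])
        articles.append({
            "itemNumber": row["item_number"],
            "name": row["name"],
            "packageSize": int(row["package_size"]),
            "ingredients": extra["ingredients"] if extra else [],
            "allergens": extra["allergens"] if extra else [],
            # Flag articles without recipe data so they can be followed up.
            "enriched": extra is not None,
        })
    return articles


if __name__ == "__main__":
    print(json.dumps(build_webshop_articles(ERP_CSV, RECIPES_XML), indent=2))
```

Even in this toy version, three formats, a lookup and a "not found" case already live in one place, and every one of them is the integration developer's problem.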

The developer does indicate that he would prefer to receive changes on a per-article basis, because that’s how the webshop receives them as well. But that’s difficult, because the ERP system is already on the verge of being phased out and it would take too much time and money to modify it. There are only six weeks. And the supplier of the ERP application always takes ages to make adjustments.

Well then we’ll build it the following way:

This still seems fairly straightforward, but in practice the builder runs into some problems:

  • a record must be kept of which file and article has been enriched, so a header will need to be set
  • the ingredient system accepts a maximum of 50 requests per minute, while the assortment contains 800 products. The split items will therefore have to be throttled
  • the system with ingredients uses oAuth2, so that requires tokens to be kept and other things to be taken care of
  • sometimes articles cannot be found in ingredient systems. Then error handling should be built for this
  • the enriched data and the original data must be merged. The original data cannot be sent to the recipe application and must be kept temporarily
  • the webshop sometimes returns an HTTP OK (200 code), but there is an error in the message. That will require special error handling that checks the content of the response.
  • the idea is to keep track of which items have been sent so that there is no discussion about them. To that end, the responses of the shop should be logged for every item sent
  • another separate integration will have to be built for the images
  • and so on
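The request limit from the list above gives a feel for how quickly a "simple" step grows. With 800 products and 50 requests per minute, the enrichment needs 16 one-minute windows of quota. Below is a minimal sketch of a sliding-window throttle; the class and function names are hypothetical, and `fetch` stands in for the call to the ingredient system. The clock and sleep functions are injectable so the behaviour can be checked without actually waiting.

```python
import time


def windows_needed(item_count, max_per_minute):
    """Number of one-minute quota windows needed when every item costs one request."""
    return -(-item_count // max_per_minute)  # ceiling division


class Throttle:
    """Allow at most `max_calls` calls per `period` seconds (sliding window)."""

    def __init__(self, max_calls, period=60.0, clock=time.monotonic, sleep=time.sleep):
        self.max_calls = max_calls
        self.period = period
        self.clock = clock
        self.sleep = sleep
        self.calls = []  # timestamps of calls inside the current window

    def wait(self):
        now = self.clock()
        # Forget calls that have left the window.
        self.calls = [t for t in self.calls if now - t < self.period]
        if len(self.calls) >= self.max_calls:
            # Sleep until the oldest call in the window expires.
            self.sleep(self.period - (now - self.calls[0]))
            now = self.clock()
            self.calls = [t for t in self.calls if now - t < self.period]
        self.calls.append(now)


def enrich_all(item_numbers, fetch, throttle):
    """Call `fetch` for every item without exceeding the rate limit."""
    results = {}
    for item in item_numbers:
        throttle.wait()
        results[item] = fetch(item)
    return results


if __name__ == "__main__":
    print(windows_needed(800, 50))
```

Note that this covers only one bullet from the list. Token handling, error handling and merging each add a comparable amount of logic.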

Every time an editing step is added, for example, splitting, enriching or transforming, it impacts the performance of the integration platform and the complexity of the integration. So every addition and change will have to be carefully considered.

Total complexity

The total of all integrations constitutes the complexity of the integration layer.

Complexity is determined by:

  1. the number of integrations
  2. the number of steps per integration
  3. whether processing steps run internally or call external systems (latency)
  4. dependencies (other systems/flows)
  5. the number of bytes processed (= number of messages * average message size)
  6. batch or real time
  7. the exchange pattern (Fire-And-Forget or Request-Reply)
  8. the integration patterns (splits, loops, enrichments etc)
  9. queueing and throttling
  10. the size of a message

Complexity affects how easy it is to modify and test the integration. It also affects how quickly causes are found when things go wrong, how long processing is delayed, and the performance of the specific integration or the platform as a whole.

In production

The start of development of the integration was difficult. First, obtaining information, such as credentials for the FTP server, took some time. In addition, the firewall had to be adjusted. Also, the mapping with the web shop had to be changed a few times. But after several tests, it could go into production after 8 weeks (the release of the webshop had been delayed for 2 weeks because the webshop was not ready yet).

In production, we immediately run into a problem because a field for allergens is missing from the message, while it is required for a limited number of products. Well, time for another new release. Functionally, the integration now does exactly what it needs to do. Have a nice weekend.

But Saturday morning there is a call. The article file was prepared at 4 a.m., but did not arrive at the webshop!

After an investigation, it appears that the item file is too large and the entire middleware system threw an “Out of Memory Exception.” Fortunately, the machine is running in Azure, so the memory can be increased. Unfortunately, the incident also slowed down the orders, because they run on the same server. Photos also use a lot of memory and slow down other integrations. It is decided to run the photos through another system. All is quiet again.
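Adding memory treats the symptom. One common way to keep memory use bounded is to stream the file and pass it on in chunks instead of loading it as a whole. The sketch below assumes a semicolon-delimited CSV with a header row; `send` is a hypothetical callback standing in for the next step of the integration.

```python
import csv
import io
from itertools import islice


def iter_articles(lines, delimiter=";"):
    """Yield one article dict at a time instead of loading the whole file."""
    yield from csv.DictReader(lines, delimiter=delimiter)


def in_chunks(items, size):
    """Group an iterator into lists of at most `size` items."""
    iterator = iter(items)
    while chunk := list(islice(iterator, size)):
        yield chunk


def process_file(lines, send, chunk_size=100):
    """Stream the file to `send` chunk by chunk; only one chunk is held in memory."""
    sent = 0
    for chunk in in_chunks(iter_articles(lines), chunk_size):
        send(chunk)
        sent += len(chunk)
    return sent


if __name__ == "__main__":
    sample = io.StringIO("item_number;name\n1001;Apple pie\n1002;Cheesecake\n")
    print(process_file(sample, send=print, chunk_size=1))
```

Whether this is possible depends on the integration platform; some load the full message by design, which is exactly the kind of limit worth knowing before going live.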

How to solve problems?

Integration problems can be solved in three ways:

  1. Design
  2. Measure
  3. Isolating

Design

Prevention is better than cure. Thus designing an integration properly can prevent many problems later in production. You make a design based on architectural principles and development guidelines. An example is that the integration platform processes only text data (CSV, XML, JSON), not binary data (such as photos). Binary data can be difficult to edit, and zip files, photos and videos are often large. It may then be better to run them through another platform, such as MFT (Managed File Transfer). An additional guideline might be that XML and JSON are preferable to other text formats.

The lack of design guidelines often causes work to be done on a per use case basis. Each integration (per developer / per use case) is different. Standards make integrations easier for everyone to follow. Developers, including new ones, then face fewer surprises because each integration is built the same.

In addition, it is a well-known fact that the integration layer is seen as the “glue” layer. In that view, integrations are the only place to bridge differences. This saves time and effort on the application side, but can negatively affect the integration layer. Seeing the integration layer purely as a glue layer, a layer to tie applications together, often does not produce a solid integration platform. It is then like holding applications together with duct tape.

It is best to look at the whole, both when purchasing systems and when adapting a system to the application landscape.

In the case with the article integration, for example, the source system could better deliver the messages per article in XML/JSON. This will allow you to work “event-driven,” spreading the load over the day and requiring fewer resources. In addition, an XML/JSON will be easier to process and no conversions are required.

But the source system was about to be phased out and the vendor was slow, right? Yes, this is a difficult consideration (especially from the business perspective). It is often the case that a system is slated to be phased out, but nothing is actually planned. No new system has been chosen and migration to the new system has yet to begin. In short, the system will probably be running for years to come. Engaging in a conversation with the vendor to explore options for improving their data interface may end up paying big dividends.

Yes, but there were only 6 weeks to build the integration, right? That’s right, but the webshop had already been in development for 6 months at that time, and in the end the release was still delayed because the webshop was not finished. So why not spend enough time on the integration? If the integration is included early in the process and some guidelines are drawn up, it can save a lot of trouble. In practice, you will sometimes have to deviate from your standards, but that should be a conscious choice. The impact of the choices is then clear in advance.

Measure

Measuring is knowing. For an integration, measuring means testing the impact of your choices against real-world experience. For example, you can measure how long it takes to process a single message. That way the impact is immediately clear. Ten processing steps together may take 30 milliseconds, while one enrichment step takes 3 seconds. Then it is quickly clear where bottlenecks can occur.
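A minimal sketch of such a per-step measurement is shown below. `StepTimer` is a hypothetical helper, not part of any integration platform; most platforms offer their own tracing or metrics that serve the same purpose.

```python
import time
from contextlib import contextmanager


class StepTimer:
    """Accumulate wall-clock time per named processing step."""

    def __init__(self, clock=time.perf_counter):
        self.clock = clock
        self.durations = {}

    @contextmanager
    def step(self, name):
        start = self.clock()
        try:
            yield
        finally:
            elapsed = self.clock() - start
            self.durations[name] = self.durations.get(name, 0.0) + elapsed

    def slowest(self):
        """Name of the step that took the most time."""
        return max(self.durations, key=self.durations.get)

    def share(self, name):
        """Fraction of the total measured time spent in one step."""
        return self.durations[name] / sum(self.durations.values())


if __name__ == "__main__":
    timer = StepTimer()
    with timer.step("transform"):
        sum(range(1000))           # stand-in for the cheap local steps
    with timer.step("enrich"):
        time.sleep(0.05)           # stand-in for a slow external call
    print(timer.slowest(), round(timer.share("enrich"), 2))
```

With the numbers from the text (30 milliseconds against 3 seconds), the enrichment step accounts for about 99% of the processing time, so that is where optimisation pays off.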

To put it to the test, you can test with the average load. What does this do to the system resources? Where do bottlenecks arise? Next, growth can also be taken into account. If the number of items grows, will this have an impact? Where is the breaking point?

Isolating

In the example, we saw that after the article integration went into production, other integrations were affected. The problems with the article integration caused issues with the order integration. Especially when you have hundreds of integrations running and problems suddenly arise, it is sometimes very difficult to determine what is causing them. How do integrations impact each other? And where in the integration does the problem lie?

The most common solution is to break up the integration layer into independent pieces. In addition, a single integration can be cut into a number of manageable pieces (independent flows). Brokers can then be put between these integrations and flows as a generic transport layer.

The idea is that each individual piece can be placed in a separate container. If one flow has many processing steps, multiple instances of that container can run. Because integrations run in isolation, this has the following advantages:

  • Integrations run independently, so that when problems arise they do not affect each other
  • If something does go wrong, it is clear where the problem is located
  • Integrations can be scaled independently without scaling the entire platform
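The idea of a broker between two independent flows can be sketched in a few lines. Here an in-process `queue.Queue` stands in for a real message broker, and the flow names and message fields are made up for illustration. The point is that the onramp only knows the queue, and a bad message is parked by the offramp instead of bringing down the whole chain.

```python
import queue
import threading

STOP = object()  # sentinel that tells the consumer to shut down


def onramp(supplier_rows, broker):
    """Convert supplier rows to a generic format and hand them to the broker."""
    for item, price in supplier_rows:
        broker.put({"item": item, "price": price})


def offramp(broker, delivered, failed):
    """Consume from the broker; a bad message is parked, not fatal."""
    while (message := broker.get()) is not STOP:
        try:
            delivered.append((message["item"], float(message["price"])))
        except (TypeError, ValueError):
            failed.append(message)


def run(supplier_rows):
    # Bounded queue: a slow consumer slows the producer down instead of
    # letting memory grow without limit.
    broker = queue.Queue(maxsize=100)
    delivered, failed = [], []
    consumer = threading.Thread(target=offramp, args=(broker, delivered, failed))
    consumer.start()
    onramp(supplier_rows, broker)
    broker.put(STOP)
    consumer.join()
    return delivered, failed


if __name__ == "__main__":
    print(run([("1001", "2.50"), ("1002", "n/a"), ("1003", "4")]))
```

In a real landscape the two functions would run in separate containers with a broker in between, which is what makes independent scaling and fault isolation possible.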

Technically, container clusters, such as Kubernetes and OpenShift, lend themselves better to scalability than older technical components, such as application servers or OSGi containers. Those older techniques can also scale, but in practice you often see many integrations sharing the same resources.

Integrations at scale

To better define complexity, let’s give one last example. The Bakery Group receives data from various suppliers, such as article data, address data and prices. From one supplier, the following integration was made:

A supplier transmits data. These are converted to a generic format in the onramp. In an offramp, the data is processed and sent to the ERP system. Because the offramp is the same for every supplier, the idea is to reuse that process:

The problem

With one or a few calling onramp flows, there are no problems. But now there are dozens of flows calling the offramp, and that does cause a problem: that of integrations at scale. In integrations with few dependencies and low data volumes, the choice of constructs is not of great importance. With integrations at scale it is, and that requires thinking differently.

But the question is: why is this a problem? Why can’t the platform take care of this? Let’s explain this using the production process at The Bakery Group.

The production line

The Bakery Group’s factory consists of:

  1. factory hall with machinery
  2. packing department
  3. warehouse

Translated into integration:

  1. machines = onramp flows
  2. packing department = offramp flows
  3. warehouse = queue

  • pies = messages

The process

The machine produces all kinds of cakes and puts them in the warehouse. The packing department takes various cakes, packs them and puts them on the dock for the trucks. The factory receives confirmation once the pies have left the factory.

At first this seems to be a good process and nothing is wrong. Therefore, since there is a demand for more cakes, a few more machines are purchased.

The packing department is slowly becoming a bottleneck. In fact, the factory receives confirmation that the cakes have left the factory later and later.

Here the bottleneck problem can be masked by:

  • agreeing that confirmations will come later
  • taking out the warehouse in between

In the latter case, the factory hall fills up and you end up having problems with production. So in addition to the factory hall, the size of the packing department, the number of docks and the number of trucks will have to be scaled along with it.
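The arithmetic behind this bottleneck is simple: as soon as the machines produce more per hour than the packing department can handle, the backlog grows every hour. The sketch below uses made-up numbers (100 cakes per machine per hour, a packing capacity of 400 per hour) purely for illustration.

```python
def backlog_over_time(machines, cakes_per_machine, packing_capacity, hours):
    """Return the warehouse backlog at the end of each hour."""
    backlog, history = 0, []
    for _ in range(hours):
        backlog += machines * cakes_per_machine      # produced this hour
        backlog -= min(backlog, packing_capacity)    # packed this hour
        history.append(backlog)
    return history


if __name__ == "__main__":
    print(backlog_over_time(3, 100, 400, hours=4))  # packing keeps up
    print(backlog_over_time(5, 100, 400, hours=4))  # backlog grows by 100 per hour
```

Translated back to integration: once dozens of onramps offer more messages than the offramp can process, the queue keeps growing and confirmations arrive later and later, no matter how large the queue is.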

Solutions

If you have dozens of onramps all calling one offramp flow, you will have scalability problems. These differ from normal integration issues. Fundamentally, of course, you want to collect as much data as possible to see where the problems arise. Often timeouts or messages staying on queues are indication enough of a bottleneck.

The easiest solution might be to give each onramp its own copy of the offramp. However, this construction has its own problems:

  • More development work (for 50 flows we need 50 copies of the offramp)
  • When changes are made, every copy must be adjusted (and one is easily forgotten)
  • Each copy to be loaded costs additional resources. Suppose the offramp has 30 steps. Then 30 * 50 = 1500 steps will need to be loaded.

The problem of bottlenecks and scalability is not new. In general, vertical scaling (i.e., giving the server more resources, such as more memory) has been shown to be limited. Usually, you can implement several solutions. Some examples:

Event-Driven Architecture

Event-Driven means that you process based on an event. So for example, instead of processing an entire article file with 6,000 articles, you process only the changed articles. Every time the article data is changed this leads to exactly one event. Instead of processing 6,000 articles each day, possibly only 12 need to be processed by the integration.

This gives you a fixed message size and allows you to spread the load across the system. Should all articles change, they will be offered in batches of, say, 100 per minute.
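When the source system cannot emit events itself, a common workaround is to derive them: compare each article with a fingerprint of what was sent last time and pass on only the differences, in batches. The sketch below assumes articles are dictionaries with an `item_number` key; the function names and the in-memory fingerprint store are hypothetical.

```python
import hashlib
import json


def fingerprint(article):
    """Stable hash of an article's content."""
    payload = json.dumps(article, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def changed_articles(articles, known_fingerprints):
    """Return articles that are new or modified and update the fingerprint store."""
    events = []
    for article in articles:
        digest = fingerprint(article)
        if known_fingerprints.get(article["item_number"]) != digest:
            known_fingerprints[article["item_number"]] = digest
            events.append(article)
    return events


def batches(events, size=100):
    """Split events into batches of at most `size`, e.g. one batch per minute."""
    return [events[i:i + size] for i in range(0, len(events), size)]


if __name__ == "__main__":
    store = {}
    articles = [{"item_number": str(i), "price": 1.0} for i in range(6000)]
    print(len(changed_articles(articles, store)))   # first run: everything is new
    articles[0]["price"] = 2.0
    print(len(changed_articles(articles, store)))   # second run: only the change
```

This still reads the full file each day, so it is a stopgap. The load on everything downstream of the comparison, however, drops to the handful of articles that actually changed.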

An event-driven architecture works “one-way” as much as possible. This eliminates the need for feedback and makes it easier to deploy queues for decoupling, without worrying about timeouts.

In an event-driven architecture, each integration works as independently as possible, so there are no dependencies on other integrations (flows) or external systems. This makes it easier to run them in separate containers (provided the platform allows it).

In addition, an event-driven architecture aims to get messages out of the system as quickly as possible. Structures are often standardized and kept as simple as possible.

You can, of course, stick with other architectures, but then the scalability problem remains in effect.

Observability

Something that can be improved in the integration platform is visibility. With better insight into what is happening, bottlenecks are noticed early, so a separate instance can be started in time and problems can be isolated.

Middleware

Another way to reduce load (and business logic) is to use a data hub, a data staging area or a streaming broker (e.g., Kafka). In these cases, you let the data come in there centrally. You first send the raw data to a separate database/application or streaming broker. There you can then compile data, log it and then send it on again.

Horizontal scaling

In general, these scalability issues are solved by scaling horizontally. In this case, you would have multiple instances of the offramp running (sometimes also scaling automatically through autoscaling). So one instance of each calling flow is running, while 50 instances of the offramp are running. You can then scale back the offramp when it is less busy.
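How many instances are needed can be estimated roughly from the message rate and the processing time per message. The sketch below is a back-of-the-envelope calculation, not an autoscaling algorithm; the target utilisation of 70% and the example numbers are assumptions.

```python
import math


def offramp_instances(messages_per_second, seconds_per_message, target_utilisation=0.7):
    """Instances needed so that each one stays below the target utilisation."""
    # Offered load: how many instances would be busy 100% of the time.
    busy_instances = messages_per_second * seconds_per_message
    return max(1, math.ceil(busy_instances / target_utilisation))


if __name__ == "__main__":
    # Hypothetical: 50 onramps at 2 messages per second each, 0.2 s per message.
    print(offramp_instances(50 * 2, 0.2))
    # Quiet period: 5 messages per second in total.
    print(offramp_instances(5, 0.2))
```

An autoscaler makes this kind of decision continuously, usually based on measured CPU, memory or queue depth rather than on a fixed estimate.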

Using technologies like Kubernetes to achieve scalability sounds very appealing. The fact is that setting up Kubernetes is very complex, especially for an integration layer. Everything has to be developed cloud native, with issues like distributed flows, latency, statelessness, protocols, security, management and so on. But while this is very challenging, it can pay off for complex integrations in the long run.
