10. How do you increase the quality of integrations?

Testing & Monitoring

Raymond Meester
10 min read · May 1, 2024

At some point the factory is set up and the cakes and pastries roll off the conveyor belt. Aleksandra, of course, does not want to stand next to the belt and inspect each cake before it is delivered. She must be able to trust that the process works.

At least, if everything goes well, she doesn’t need to know anything. But of course, not everything always goes right. To make sure that as much as possible goes right, a lot of testing is done first, and if things do go wrong anyway, monitoring informs us about it. Both testing and monitoring can take a lot of effort (to the point that they are a full-time job for many).

Testing and monitoring are both about the quality and continuity of the production environment. Testing catches as many issues as possible before an integration goes into production, while monitoring guards this quality after it has gone live.

Testing

Before a cake rolls off the assembly line, the process goes through many steps. Like the steps in a recipe, the ingredients will pass through several machines before they are transformed into a delicious cake. When setting up the cake factory, engineers and machine builders ensure that the baking process is efficient and in a logical sequence.

The factory will have two docks for this purpose, each on a different side of the factory. On one side, an onramp, where suppliers come to deliver the ingredients, and on the other, an offramp, where the cakes are ready for the buyers. In between are the various machines that bring the ingredients together, knead the dough, build the layers and apply the garnishes. When the cake is ready, it is packaged and placed on the offramp.

At the factory, the different steps are tested for each machine. Finally, it is determined exactly how many cakes can be produced, so that peaks can be accommodated.

A standard integration

You can also think of an integration as a production process with an onramp and an offramp. The onramp is the part where data messages are delivered or retrieved, and the offramp is where they are sent or offered to the target. In between are various processing steps. Associated processing steps are often put together in a module. In integration, a production line can look like this:
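As a rough sketch in code, such a production line might look as follows. The OrderRoute class and the Offramp interface below are illustrative assumptions, not a particular framework’s API: an onramp method receives the message, a chain of processing steps transforms it, and an offramp delivers the result.

```java
import java.util.List;
import java.util.function.UnaryOperator;

/** Offramp: where the processed message leaves the integration (queue, HTTP endpoint, file, ...). */
interface Offramp {
    void send(String payload);
}

/** Minimal, illustrative sketch of an integration "production line". */
public class OrderRoute {

    private final Offramp offramp;

    // Processing steps: validate, transform, enrich (each is just a function on the payload)
    private final List<UnaryOperator<String>> steps = List.of(
            payload -> {
                if (payload.isBlank()) throw new IllegalArgumentException("empty message"); // validate
                return payload;
            },
            payload -> payload.toUpperCase(),                   // transform
            payload -> payload + ";source=supplier-portal"      // enrich
    );

    public OrderRoute(Offramp offramp) {
        this.offramp = offramp;
    }

    /** Onramp: messages are delivered or picked up here. */
    public void onMessage(String payload) {
        String current = payload;
        for (UnaryOperator<String> step : steps) {
            current = step.apply(current);
        }
        offramp.send(current); // hand the result to the target system
    }

    public static void main(String[] args) {
        new OrderRoute(p -> System.out.println("Delivered to target: " + p)).onMessage("order-123");
    }
}
```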

You can run all kinds of tests on this process. The test process specifies in what order and by whom each test is performed. Unsuccessful tests result in remedial work for the developers or for third-party vendors.

Basic process testing of an integration:

  1. Connection test: a new integration starts with connection testing. This test checks the connection between two components, because there is a dependency on third parties such as vendors or network operators. It covers the onramp or offramp block. You want to address connection issues early so that all connections run without problems.
  2. Unit test: tests the software itself. A unit test is not very relevant at this stage, but becomes especially important during later changes. It checks whether the software still works as expected when a change is made. In a unit test, the connections are replaced by a mockup (a minimal sketch follows this list).
  3. Integration test: this test uses real-life input to verify that a message is properly processed by the module and that the output is properly received. Sometimes this test is also seen as a unit test without mockups.
  4. Smoke test: after the release of an integration, the smoke test checks whether everything works the same on a test or acceptance environment. It mainly tests the rollout and the tailoring. Ask whether the rollout is complete, and pay particular attention to differences between the development and test environments, such as connections and environment variables (properties). The goal is also to catch technical errors so that chain testing can be done properly.
  5. Chain test: tests the integration through scenarios, running from source to target application. This is why it is also called end-to-end testing. Here the functional content of the output is actually checked: usually the common situation (happy flow) first, followed by the alternative scenarios and the error scenarios. The chain test is the most complete test. If it passes, the integration works functionally.
  6. Release test: the administrator tests the rollout on the acceptance environment and checks whether all acceptance criteria (technical handover) have been met.
  7. Load test: the administrator tests whether the interface functions under common load (based on representative data). Maximum capacity and any bottlenecks are also tested.
  8. Shadow test: the administrator tests on a pre-production system with live data to ensure that a new change does not introduce errors. This test is less common in smaller environments and is seen mainly during larger upgrades of the integration platform or migrations to another integration platform.
  9. Acceptance test: the user organization tests an integration based on real-life data. They treat the integration layer as a black box and work only from the application layer. Data is entered directly in the source and the result is checked in the target systems.
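To make the unit test from step 2 concrete, here is a minimal JUnit 5 sketch against the hypothetical OrderRoute shown earlier. The real offramp connection is replaced by an in-memory mockup, so only the processing logic is exercised:

```java
import static org.junit.jupiter.api.Assertions.assertEquals;
import static org.junit.jupiter.api.Assertions.assertThrows;

import java.util.ArrayList;
import java.util.List;
import org.junit.jupiter.api.Test;

// Unit test for the illustrative OrderRoute: the offramp is mocked with an in-memory list.
class OrderRouteTest {

    @Test
    void processedMessageReachesOfframp() {
        List<String> delivered = new ArrayList<>();
        OrderRoute route = new OrderRoute(delivered::add); // mockup instead of a real connection

        route.onMessage("order-123");

        assertEquals(1, delivered.size());
        assertEquals("ORDER-123;source=supplier-portal", delivered.get(0));
    }

    @Test
    void emptyMessageIsRejected() {
        OrderRoute route = new OrderRoute(payload -> {});
        assertThrows(IllegalArgumentException.class, () -> route.onMessage("  "));
    }
}
```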

These different tests, each with its own role, make up the testing process:

Most integration testing differs somewhat from application testing. The latter revolves primarily around the functionality as end users experience it. Integration testing is much more about the input and output of data. In other words, does a data message arrive at its destination in a timely, complete and correct manner?
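That question can be translated almost literally into a check. The sketch below is only an illustration of the idea; the targetSystem list stands in for whatever API or table the real target exposes. It waits at most a given time (timely) and verifies that exactly the expected content has arrived (complete and correct).

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;

// Illustrative check: did the expected message arrive at the target within the agreed window?
public class DeliveryCheck {

    public static boolean arrivedTimelyAndCorrect(List<String> targetSystem,
                                                  String expectedPayload,
                                                  Duration timeout) throws InterruptedException {
        Instant deadline = Instant.now().plus(timeout);
        while (Instant.now().isBefore(deadline)) {           // timely: within the agreed window
            if (targetSystem.contains(expectedPayload)) {    // complete and correct: exact expected content
                return true;
            }
            Thread.sleep(500);                               // poll the target, don't hammer it
        }
        return false;
    }
}
```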

Monitoring

Aleksandra and her co-workers are having a nice lunch. Then they hear a warning signal coming from the factory hall. “There must be another croissant stuck,” she jokes. But Thomas is already looking at his cell phone. There he has seen an alert coming in. Things are going wrong in the topping machine. “I’m already on my way!”

All the various tests aim to increase the probability that an integration will run well on production and work as expected. But how do you know for sure? If something goes wrong, you want to know about it, right? And you want to know what exactly goes wrong, where it goes wrong and how to fix it.

With monitoring you detect that something has gone wrong, but good monitoring systems can also warn before something goes wrong. This is the difference between reactive and preventive monitoring. In addition, monitoring systems can collect data and display statistics for trend analysis. Some examples:

  1. Reactive monitoring: a receiving system is unavailable. Several messages fail and are placed in the error queue. The monitoring system raises an alert. The administrator can receive this alert on his phone or as a report in the mail the next day. If necessary, he can manually resubmit the messages as soon as the receiving system is back up and running.
  2. Preventive monitoring: a backlog builds up on a queue; messages are still being processed, but new messages may cause blockages. The monitoring system issues a preventive warning (a minimal sketch of such a check follows this list). If necessary, additional instances can then be added so that the messages keep flowing.
  3. Trend analysis: there is a small memory leak, causing slightly more memory usage every day. After six months, the module would be “OutOfMemory.” Preventive monitoring may give a warning at 90%, but graphs show the exact trend.
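A preventive check on a queue can be as simple as comparing the backlog against two thresholds: warn while messages still flow, escalate before producers get blocked. The thresholds and the way the queue depth is obtained below are illustrative assumptions, not a real monitoring product’s API:

```java
// Sketch of preventive monitoring on a queue: warn before the backlog becomes a blockage.
public class QueueWatcher {

    private static final int WARN_THRESHOLD = 5_000;   // backlog that signals a build-up
    private static final int ERROR_THRESHOLD = 20_000; // backlog that will likely block producers

    public void check(String queueName, int queueDepth) {
        if (queueDepth >= ERROR_THRESHOLD) {
            alert("ERROR", queueName + " backlog " + queueDepth + ": messages are about to block");
        } else if (queueDepth >= WARN_THRESHOLD) {
            alert("WARN", queueName + " backlog " + queueDepth + ": consider adding consumer instances");
        }
    }

    private void alert(String severity, String message) {
        // In practice this would go to the monitoring system (mail, chat, pager).
        System.out.println("[" + severity + "] " + message);
    }
}
```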

Generally, the analysis and the subsequent actions are performed by administrators. This is often because of unforeseen issues that did not surface during testing, or that are caused by surrounding systems.

Automatic actions can also be taken, such as reconnecting, restarting or deploying an error handling process. In environments such as microservices or stateless containers, services can sometimes scale up automatically when busy.
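A reconnect, for example, is usually wrapped in a retry with increasing wait times, so the platform only escalates to a human when the retries are exhausted. The sketch below is illustrative; connect() is a placeholder for whatever client the integration actually uses:

```java
import java.util.concurrent.TimeUnit;

// Sketch of an automatic action: retry a lost connection with exponential backoff.
public class Reconnector {

    public boolean reconnect(Runnable connect, int maxAttempts) throws InterruptedException {
        long waitSeconds = 1;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                connect.run();               // e.g. open the JMS/HTTP/SFTP connection again
                return true;                 // connection restored, no human needed
            } catch (RuntimeException e) {
                System.out.println("Attempt " + attempt + " failed: " + e.getMessage());
                TimeUnit.SECONDS.sleep(waitSeconds);
                waitSeconds = Math.min(waitSeconds * 2, 60); // back off, capped at one minute
            }
        }
        return false;                        // give up and let the alert reach the administrator
    }
}
```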

Alerts, of course, don’t just show up on an administrator’s phone. Often, software components first need to be registered as items. Triggers can be attached to items; when a trigger fires, it produces an event. Finally, actions can be attached to events (such as sending the alert or performing a restart).
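The item, trigger, event and action chain can be pictured roughly as follows. This is only a conceptual sketch of the model, not the configuration syntax of an actual monitoring tool:

```java
import java.util.function.Consumer;
import java.util.function.DoublePredicate;

// Item -> trigger -> event -> action: an item is a measured value, a trigger decides
// when a measurement becomes an event, and an action (notify, restart, ...) handles it.
public class MonitoringRule {

    private final String itemName;            // e.g. "memory usage broker-01 (%)"
    private final DoublePredicate trigger;    // e.g. value -> value > 90.0
    private final Consumer<String> action;    // e.g. send alert, restart service

    public MonitoringRule(String itemName, DoublePredicate trigger, Consumer<String> action) {
        this.itemName = itemName;
        this.trigger = trigger;
        this.action = action;
    }

    /** Feed a new measurement into the rule; fire the action if the trigger matches. */
    public void onMeasurement(double value) {
        if (trigger.test(value)) {
            String event = itemName + " triggered at value " + value;
            action.accept(event);
        }
    }

    public static void main(String[] args) {
        MonitoringRule rule = new MonitoringRule(
                "memory usage broker-01 (%)",
                value -> value > 90.0,
                event -> System.out.println("ALERT: " + event));
        rule.onMeasurement(95.2); // fires the action
    }
}
```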

Types of monitoring

Within the integration layer, you can monitor integrations at different levels.

1. System Monitoring

At the system level, various technical components, such as servers, network and virtual machines, are monitored. Examples include logs, queues, caches, memory and processor usage. Based on items, triggers and events, alerts are set up. Well-known systems are Zabbix, Nagios and Cacti.

A Zabbix Dashboard

2. Performance Monitoring

Performance monitoring looks at complex integration performance issues by detecting and diagnosing bottlenecks to maintain an expected level of service. In doing so, you look primarily at the load of the overall integration or at specific steps (the bottlenecks). You also look at processing and response times between components. Well-known monitoring software includes New Relic, Dynatrace and Datadog.

3. Chain Monitoring

Chain monitoring is an end-to-end approach to monitoring. For example, an application sends thousands of messages throughout the day. Chain monitoring examines exactly how many messages were sent, what steps they went through, how many went wrong and how many were received by the target systems.

Chain monitoring usually captures messages. It stores parts of each message, such as the name of the step, the ID of the message, the timestamp and the type of log (INFO, WARN, ERROR, etc.).
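A minimal sketch of such a record per processing step could look like this; the field names and the messageId format are illustrative assumptions. The same messageId ties the steps together, which is what lets you follow a message through the chain:

```java
import java.time.Instant;

// Sketch of the kind of record chain monitoring stores per processing step.
public record StepLog(String messageId, String step, Instant timestamp, String level, String detail) {

    public static StepLog info(String messageId, String step, String detail) {
        return new StepLog(messageId, step, Instant.now(), "INFO", detail);
    }

    public static void main(String[] args) {
        System.out.println(StepLog.info("8f2c-order-123", "onramp", "received from supplier portal"));
        System.out.println(StepLog.info("8f2c-order-123", "transform", "mapped to canonical order"));
        System.out.println(StepLog.info("8f2c-order-123", "offramp", "delivered to ERP"));
    }
}
```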

Sometimes the whole process is seen as one transaction and you can track the message through the chain. Well-known systems for chain monitoring are Elastic, Splunk and Grafana/Prometheus. Often these systems can measure not only the different steps and numbers, but also the time a step takes (at a higher level than Performance Monitoring).

Elastic Dashboard

Network Mapping & Time Diagrams

Integration layers often consist of different middleware components that contain multiple services. There are often architecture maps and designs that show the interrelationship of components, but that is at design time, not at run time. To get an overview of how these components interrelate in production at runtime, live network maps can be used.

Network maps can be made at the system, performance and chain monitoring levels. At the system level, for example, they show the network and servers and the data flow in bytes. In performance monitoring, you can click on the connections to view processing and response times. With chain monitoring, at the application and service level, you can view the message counts and follow a message through the chain.

Network maps primarily provide a picture of how the various components are related. A data pipeline is often represented in time diagrams. These show by message, by step, how a message is processed.

Message Flow Diagram: Intersystems Iris

Burden of proof

Proper testing and monitoring are also often necessary because the integration layer is frequently used (some say abused) to solve issues that an organization finds difficult to solve in the application layer. The reason is that standard applications or cloud services (SaaS applications) offer little flexibility.

Systems are often developed by different parties. The integration layer in between provides basic communication, but in practice it does much more. A limited set of logic with little room for error and interpretation is unfortunately a pipe dream. Messages often need to be transformed or enriched in the integration layer. Version control of the applications is essential in this regard.

It may happen that a bug fix in a source application causes a field to disappear, be renamed or be added, which then produces a transformation error in the integration layer. The fix is successfully tested in the application and put live, and the resulting error is then often attributed to the middleware. It is therefore essential to test this part along with the change and to perform chain tests.
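One way to make such interface changes visible quickly is to check the required source fields explicitly in the transformation, so a renamed or removed field fails with a clear message instead of a vague mapping error. The sketch below is illustrative and the field names are assumptions:

```java
import java.util.List;
import java.util.Map;

// Defensive transformation: fail loudly when the source application changed its interface.
public class OrderTransformer {

    private static final List<String> REQUIRED_FIELDS = List.of("orderId", "customerNumber", "amount");

    public Map<String, String> transform(Map<String, String> source) {
        for (String field : REQUIRED_FIELDS) {
            if (!source.containsKey(field)) {
                throw new IllegalArgumentException(
                        "Source message is missing field '" + field + "'; "
                        + "check whether the source application changed its interface");
            }
        }
        // The actual mapping to the target format follows here.
        return Map.of(
                "id", source.get("orderId"),
                "customer", source.get("customerNumber"),
                "total", source.get("amount"));
    }
}
```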

It is smart to use messages that are actual output from the other application, and to make the process by which a message is “created” after transformation and enrichment part of the application user test (AUT). This leads to far fewer integration issues when all systems are tied together for go-live.

Proper logging and monitoring of these integrations, once they are in production, is then again important as “proof” of where something goes wrong and who should fix it. An integration is a chain of links, and after an error occurs, some link in the chain is often singled out as responsible. The right data can prove whether this is actually the case.
