Data Integration Language

Comparison: DIL and other approaches

Raymond Meester
Sep 23, 2022 · 17 min read

DIL is just one of the many valid approaches within data integration. In this part, we discuss and compare DIL to other common approaches in data integration.

As DIL transpiles into Camel, we pay extra attention to comparing the two with each other.

Approaches in data integration

Some of the approaches to solve data integration problems are:

  1. General-purpose languages
  2. Domain-specific languages
  3. Middleware
  4. Products
  5. Frameworks
  6. Modeling tools

Let’s explore each of these solutions.

Just coding

The most basic approach is to use a general-purpose programming language. The basic idea is that a data integration is a program. This is a very flexible approach. It’s possible to create a solution for any data integration problem.

The question is how fast you can write such a solution, and how reliable, readable and maintainable it is. In practice, the answer is that most programming languages produce complex integration programs that are hard to read and change.

Programming languages tend to be good at building applications (building stuff) and not at integrations (moving stuff). It’s actually because most programming languages suck at data integration that most other approaches were invented.
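To make this concrete, here is a minimal sketch (not from the article; the directory names are just an assumption) of a trivial "move files from one folder to another" integration in plain Java. Even this simplest of integrations already needs explicit directory handling, and offers no retries, monitoring or transformation out of the box:

import java.io.IOException;
import java.nio.file.*;

public class FileMover {
    public static void main(String[] args) throws IOException {
        // Hypothetical inbound/outbound directories
        Path inbox = Paths.get("data/inbound");
        Path outbox = Paths.get("data/outbound");
        Files.createDirectories(outbox);

        // Move every file from the inbound to the outbound directory
        try (DirectoryStream<Path> files = Files.newDirectoryStream(inbox)) {
            for (Path file : files) {
                Files.move(file, outbox.resolve(file.getFileName()),
                        StandardCopyOption.REPLACE_EXISTING);
            }
        }
    }
}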

Domain-specific languages

General-purpose languages are not a good fit for data integration, but what about DSLs? Are there any programming languages that are good for data integration? Not really. Most DSLs, think of SQL, Regex or XSLT, are used in completely different domains.

There are some research efforts for domain languages in integration. The most notable are Highway and LiLa. Both are declarative languages in the domain of data integration. The first one has a functional approach (using Clojure) and the second a logical approach (using Datalog). Both are more scientific efforts and, as far as I know, have no open source implementations.

Within the JavaScript world there is also a flow-based programming language called FBP. The specification is unfortunately only text-based (there is a converter to JSON, but only for JavaScript). This language is used by middleware such as NoFlo, MsgFlo, imgflo and MicroFlo. Though the language does not fit cleanly in the world of enterprise integration patterns and Java, it’s an inspiration for DIL.

Stuck in the middle

Instead of programming integrations directly, specific data integration tools were created. Together, these so-called middleware tools form a data integration. Instead of programming, the developer mostly needs to configure the various parts. Examples of middleware tools are adapters, message brokers, ESBs, gateways and so on.

Today there are thousands of such tools available. There is, however, not a single middleware tool that solves all data integration problems. Thus people have to learn which tool is the right one for the job. In other words, the right middleware tool for the right integration solution.

Using various tools is a very scattered solution, and one easily loses track of how the data actually flows. So middleware tools are practical means, but they do not provide a way to describe the dataflow.

All-in-one solutions

It’s hard to find the right tools and make them work as a whole. There are two solutions that provide an all-in-one approach:

  1. Product suites
  2. Integration frameworks

Products

Generally, product suites (think of Tibco and Mule) gather various middleware tools into one suite. They also provide ways to design data integrations (often graphically) and run them. A lot of enterprises use these suites.

Besides being quite expensive, product suites are also very complex and less portable than a single tool. And even though a lot of the core integration concepts and protocols are shared among these product suites, it's often a specialized field. You are a Tibco or a Mule expert (very rarely both).

It requires deep knowledge to work with these suites, and even for integration experts it can take a while to become effective when switching between product suites. Solutions are mostly very specific to a product suite. Changing product suites requires a lot of migration work (or even rewrites).

Frameworks

Products are thus very specialized. Frameworks, like Apache Camel, Spring Integration or Zato, provide an approach where you can still use a general-purpose language. So a Java developer can quickly start using, for example, Apache Camel.

Compared to product suites, they provide a more low-level approach. And instead of pure programming languages, they provide the means to create real data integrations. Because of their more low-level nature, they require both programming and integration skills. They are mostly tied to the general-purpose programming language they are written in.

Modeling tools

But wait, aren’t there already a lot of modeling solutions available? Solutions where you can model, instead of code or configure middleware?

Processes that are described as flows, nets or patterns have been around for a very long time (at least since the early seventies). Some popular contemporary examples:

All of the above are excellent modeling platforms (I also added Camel as a framework for comparison). They are, however, not really domain-specific models for integration, either because they focus on other domains or because they don’t have a well-defined syntax.

Consider the first three: they are targeted at workflows. These flows describe a set of work items that are processed manually or automatically. Each model has a notation in an XML format (DAX, PNML and BPMN). They do not, however, say a lot about integration protocols, technologies and patterns.

In the domain of data integration, Apache NiFi and Apache Camel are used more. Both NiFi and Camel like to express the integration logic in a declarative manner (though Camel allows imperative parts through beans and processors). Still, they are often more technologically oriented, both in terminology and in the details of their implementations.

DIL and Camel

DIL is transpiled to Camel. To explore the differences and the relationship between DIL and Apache Camel, it’s good to first see what Apache Camel is.

What is Camel?

Here is a list of the basic characteristics of Apache Camel:

1. A framework

At its base, Camel is an integration framework written in Java. It provides several constructs and solutions that go well together, but it is neutral about where Camel runs or what it is used for. It has its roots in ESB and SOA, but it can be used very well in other architectures or other middleware.

2. Enterprise Integration Patterns

Camel is based on Enterprise Integration Patterns: the flow of a message from one system to the next through channels, routing, transformations and other patterns.

3. Domain Specific Language

It provides its own Domain Specific Language, whose core is a route where the integration flow is defined. There are several formats to write the integration in, such as Java or XML.

4. Integration

The goal of Camel is to solve integration problems. This can be as small as moving a file, or as large as a whole integration service bus.

5. Connectivity

It provides connectivity with more than 300 components. Every component can be defined by a URI. These endpoints can be called in a route.
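As a minimal sketch (not from the article; the directory names are just an assumption), the route below wires two such component URIs together with the Java DSL:

import org.apache.camel.builder.RouteBuilder;
import org.apache.camel.impl.DefaultCamelContext;

public class FileMoveCamel {
    public static void main(String[] args) throws Exception {
        var context = new DefaultCamelContext();
        context.addRoutes(new RouteBuilder() {
            @Override
            public void configure() {
                // Every endpoint is addressed by a component URI:
                // poll files from an inbound folder and write them to an outbound one.
                from("file:data/inbound?noop=true")
                    .log("Moving ${header.CamelFileName}")
                    .to("file:data/outbound");
            }
        });
        context.start();
        Thread.sleep(10_000); // let the route poll for a while, then shut down
        context.stop();
    }
}

Compare this with the plain-Java file mover shown earlier: the integration logic itself shrinks to the three lines of the route.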

Comparing DIL with Camel

So Apache Camel is a real Swiss Army knife of integration. And it’s message-oriented, component-based and a domain-specific language. This sounds a lot like DIL. So is there any difference?

It’s true that there is quite some overlap between DIL and Camel, and with Kamelets in particular. This is on purpose. The reason is that Camel has a solid base in various integration concepts, together with years of practical deployments. Besides, a lot of people are familiar with Camel and its syntax.

Still, when looking more closely, there are some conceptual differences. I would say DIL relates to Camel as Camel itself relates to Java.

1. Concepts

The three basic concepts “data”, “component” and “link” are closer to flow-based programming than to integration patterns. Patterns are secondary language constructs. DIL, for example, doesn’t really distinguish between patterns and endpoints. Consider this example of the Camel Java DSL:

from("activemq:my.queue")
.split(xpath("//foo/bar"))
.to("file:some/directory")

Here we have a from endpoint that contains an ActiveMQ component, a split pattern, and then a to endpoint with a file component. In DIL these are all just components, for example on the step level:

flow
.step("activemq:my.queue")
.step("split:xpath://foo/bar")
.step("file:some/directory")

There is thus not really a difference between a split pattern and a file component in DIL. They are connected by links (Note: in flow-based programming these are called ports and connectors).

2. Conceptual integrity

A lot of concepts in Camel arose from practice. For example, Kamelets are predefined routes or route templates. However, when looking more closely, a Kamelet more resembles a function. Because of its naming conventions, it is not always clear what it does, nor how the various parts are related to each other.

DIL tries to avoid product-specific and overly technical terms. A Kamelet is somewhat equivalent to a “step”. Flows, steps and blocks are things people can readily imagine the use of. We don’t talk about metadata, datasets, bindings, beans, etc.

DIL not only tries to keep the terminology simple, but also how the concepts are related. The concepts do not stand on their own; they are related through levels and links. This makes DIL not just a language to tackle integration problems, but also a framework for thinking about integrations.

3. Specification

DIL is a language specification. We will implement it in Apache Camel, but it can also be implemented in another framework, for example Spring Integration. Or in another programming language like Python with the help of the Zato framework.

4. Levels

DIL has the notion of levels through component composition. This makes it possible to have different concepts on different levels. Older versions of Assimbly (and the same goes more or less for Kamelets) tried to do a lot on the same level as Camel routes. They distance themselves from Camel while at the same time staying close to it. This makes the role a component has confusing.

For example, a Kamelet needs a (route) template where a parameterized route is defined. Then you can also call the Kamelet in a route. In DIL, you define this in blocks; blocks make up a step, and steps are parts of flows. DIL avoids mixing up terminology by using levels of components with different roles.
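For reference, this is roughly what the underlying Camel route template mechanism looks like in the Java DSL (a hedged sketch; the template name, parameters and endpoints are illustrative and not taken from the article):

import org.apache.camel.builder.RouteBuilder;

public class GreetingTemplate extends RouteBuilder {
    @Override
    public void configure() {
        // A parameterized route: the template itself is the reusable part.
        routeTemplate("greeting-template")
            .templateParameter("queue")
            .templateParameter("greeting", "Hello")
            .from("activemq:{{queue}}")
                .setBody(simple("{{greeting}} ${body}"))
                .to("log:greeting");
    }
}

// Instantiating it elsewhere creates a concrete route:
// camelContext.addRouteFromTemplate("greet-orders", "greeting-template",
//         Map.of("queue", "orders", "greeting", "Hi"));

In DIL terms, such a template is roughly what would become (part of) a block.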

5. Higher-Level constructs

In DIL, “low-level” routes can form a component which then can be used by higher-level components.

Consider the following Camel example:

<camelContext id="ID_627a6b7338c74a00130007f9" xmlns="http://camel.apache.org/schema/blueprint" useMDCLogging="true" streamCache="true">
  <jmxAgent id="agent" loadStatisticsEnabled="true"/>
  <streamCaching id="streamCacheConfig" spoolThreshold="0" spoolDirectory="tmp/camelcontext-#camelId#" spoolUsedHeapMemoryThreshold="70"/>
  <threadPoolProfile id="wiretapProfile" defaultProfile="false" poolSize="0" maxPoolSize="5" maxQueueSize="2000" rejectedPolicy="DiscardOldest" keepAliveTime="10"/>
  <threadPoolProfile id="defaultProfile" defaultProfile="true" poolSize="8" maxPoolSize="16" maxQueueSize="1000" rejectedPolicy="CallerRuns" keepAliveTime="30"/>
  <onException>
    <exception>java.lang.Exception</exception>
    <redeliveryPolicy maximumRedeliveries="0" redeliveryDelay="5000"/>
    <setExchangePattern pattern="InOut"/>
  </onException>
  <route id="04a6c550-d067-11ec-83f5-3747809ef661">
    <from uri="jetty:https://0.0.0.0:9001/1/Aggregate?httpBinding=#customHttpBinding&amp;matchOnUriPrefix=false&amp;sslContextParameters=sslContext"/>
    <removeHeaders pattern="CamelHttp*"/>
    <to uri="activemq:ID_627a6b7338c74a00130007f9_test_04a6c550-d067-11ec-83f5-3747809ef661?timeToLive=86400000&amp;exchangePattern=InOut"/>
  </route>
  <route id="797f5ea0-d0f8-11ec-83f5-3747809ef661">
    <from uri="activemq:ID_627a6b7338c74a00130007f9_test_04a6c550-d067-11ec-83f5-3747809ef661"/>
    <split streaming="false" parallelProcessing="false">
      <xpath saxon="true" threadSafety="true">/names/name</xpath>
      <setHeader headerName="CamelSplitIndex">
        <simple>${exchangeProperty.CamelSplitIndex}</simple>
      </setHeader>
      <setHeader headerName="CamelSplitSize">
        <simple>${exchangeProperty.CamelSplitSize}</simple>
      </setHeader>
      <setHeader headerName="CamelSplitComplete">
        <simple>${exchangeProperty.CamelSplitComplete.toString().trim()}</simple>
      </setHeader>
      <to uri="activemq:ID_627a6b7338c74a00130007f9_test_797f5ea0-d0f8-11ec-83f5-3747809ef661_BottomCenter_split?timeToLive=86400000&amp;exchangePattern=InOut"/>
    </split>
  </route>
  <route id="75be5f00-d0f8-11ec-83f5-3747809ef661">
    <from uri="activemq:ID_627a6b7338c74a00130007f9_test_797f5ea0-d0f8-11ec-83f5-3747809ef661_BottomCenter_split"/>
    <setProperty propertyName="Aggregate-Type">
      <simple>text/xml</simple>
    </setProperty>
    <aggregate strategyRef="CurrentAggregateStrategy" completionSize="3">
      <correlationExpression>
        <constant>true</constant>
      </correlationExpression>
      <to uri="activemq:ID_627a6b7338c74a00130007f9_test_75be5f00-d0f8-11ec-83f5-3747809ef661_aggregator?timeToLive=86400000"/>
    </aggregate>
  </route>
  <route id="04a6ec60-d067-11ec-83f5-3747809ef661">
    <from uri="activemq:ID_627a6b7338c74a00130007f9_test_75be5f00-d0f8-11ec-83f5-3747809ef661_aggregator"/>
    <setHeader headerName="CamelVelocityTemplate">
      <simple>Message Body:${bodyAs(String)}</simple>
    </setHeader>
    <to uri="velocity:generate"/>
  </route>
</camelContext>

For someone who doesn’t know Camel, this leaves a lot of questions:

  1. What is a Camel context?
  2. What do the direct endpoints mean?
  3. What is a breadcrumbid?
  4. What is Jetty?
  5. What is a CamelVelocityTemplate?
  6. What is an exchange pattern?
  7. What is the aggregationStrategy?

And a lot of other questions…

The same use case as a DIL flow:

<flow inbound="sample">
  <step source="httpinbound?path=sample&amp;preserveHttpHeader=true"/>
  <step action="split?type=xpath&amp;expression=/names/name">
    <links>
      <link name="unsplited" direction="out"/>
      <link name="splitted" direction="out"/>
    </links>
  </step>
  <step action="aggregate?type=xml&amp;completionCount=3">
    <links>
      <link name="splitted" direction="in"/>
    </links>
  </step>
  <step sink="print:${body}"/>
</flow>

It can be written more compactly, because the components used are composed of more low-level components. The developer, however, should only care about the level they are working on.

6. Representations

As is usual with dataflow programming, DIL focuses on both visual and textual representations. Camel traditionally focused more on the textual way of programming, where the solution is created by a programmer. The various ways to represent it, for example in Hawt.io, Red Hat Fuse or Karavan, make it visual, but in DIL the language itself is visual. It is set up with visualization in mind, and the textual representation is derived from that. Thus, it is the opposite of what Camel and the mentioned tools do.

Higher-level components

It’s really hard to give a direct comparison of how different concepts are used by different projects. We will give it a try in the following table:

Especially on the step level, there are some differences in terminology. The idea is that DIL could bring a common domain-specific language to talk about integrations.

Why do we need levels?

Consider the following use case:

  1. Create an API
  2. Send incoming data to a queue
  3. Insert data into a database

For a business user this is one step, for an integration specialist these are three steps, and programmers (for example using Camel) use many steps.

When you want to develop the above integration in Apache Camel, things aren’t as easy as creating a route or route template with the steps and packing it as a reusable integration. Let’s try to implement the mentioned use case in Camel. To create an API we need the camel-rest component as the base to configure it. Then we need to write the REST configuration (which is a separate DSL from the route DSL). For example:

@Override
public void configure() throws Exception {
    rest("/say")
        .get("/hello").to("direct:hello")
        .get("/bye").consumes("application/json").to("direct:bye")
        .post("/bye").to("mock:update");

    from("direct:hello")
        .transform().constant("Hello World");
    from("direct:bye")
        .transform().constant("Bye World");
}

The next step is to decide which servlet container it will run on, for example Netty, Jetty or Undertow. Then we need to configure that servlet service. And of course we need to enable HTTPS for encrypted communication, so we need to set up SSL/TLS, something like this:

KeyStoreParameters ksp = new KeyStoreParameters();
ksp.setResource("/users/home/server/keystore.jks");
ksp.setPassword("keystorePassword");
KeyManagersParameters kmp = new KeyManagersParameters();
kmp.setKeyStore(ksp);
kmp.setKeyPassword("keyPassword");
SSLContextServerParameters scsp = new SSLContextServerParameters();
scsp.setClientAuthentication(ClientAuthentication.REQUIRE);
SSLContextParameters scp = new SSLContextParameters();
scp.setServerParameters(scsp);
scp.setKeyManagers(kmp);
SSLContext context = scp.createSSLContext();
SSLEngine engine = scp.createSSLEngine();

Then we want to put the result on a queue. For this, we need to create a connection factory (otherwise we get the error ‘ConnectionFactory must be specified’). So we need to write some more code:

ConnectionFactory connectionFactory = new ActiveMQConnectionFactory("vm://localhost?broker.persistent=false");

ComponentsBuilderFactory.jms()
    .connectionFactory(connectionFactory)
    .acknowledgementMode(1)
    .register(context, "test-jms");

And then we need to bind our endpoint in the route to that connection factory. Finally, we need to add more code to make a connection to a JDBC database and use XPath or Simple to retrieve the values and put them into the insert statement. All of this is written in Java. Thus, when you are not both a Java programmer and an integration specialist (to understand the protocols and patterns used), it’s actually hard to write such an integration (though still much easier than in plain Java without a framework).
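To give an idea of how the pieces could come together, here is a hedged sketch of the remaining routes (written inside a RouteBuilder’s configure() method). The endpoint names, XPath expressions, table and columns are illustrative assumptions; it assumes the “test-jms” component registered above and a DataSource registered for the SQL component:

// Receive over HTTPS and hand the message off to the queue
from("jetty:https://0.0.0.0:9001/orders?sslContextParameters=#sslContext")
    .to("test-jms:queue:orders");

// Consume from the queue, pick values out of the XML and insert them
from("test-jms:queue:orders")
    .setHeader("orderId").xpath("/order/id/text()", String.class)
    .setHeader("customer").xpath("/order/customer/text()", String.class)
    .to("sql:INSERT INTO orders (id, customer) VALUES (:#orderId, :#customer)");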

Kamelets

To work use-case based, Kamelets come to the rescue. You can combine all of this into one Kamelet and then call the Kamelet from a normal Camel route. Kamelets have been very powerful from the beginning; however, at first they were very tightly bound to Camel K (running Camel on Kubernetes/OpenShift). Nowadays, they are standalone and more widely available to the broader Camel ecosystem.

Unfortunately, Kamelets aren’t as easy and flexible as they at first seem. Consider a similar use case from the Kamelet documentation. It’s about a scientist who needs to collect data about earthquakes. Writing this in pure Camel requires Camel, Java and integration knowledge; with a Kamelet this is all packed into one. Here is the example:

apiVersion: camel.apache.org/v1alpha1
kind: Kamelet
metadata:
  name: earthquake-source
  annotations:
    camel.apache.org/kamelet.icon: "data:image/svg+xml;base64..." # truncated (1)
    camel.apache.org/provider: "Apache Software Foundation"
  labels:
    camel.apache.org/kamelet.type: "source"
    camel.apache.org/requires.runtime: "camel-quarkus" # (2)
spec:
  definition:
    title: Earthquake Source
    description: |-
      Get data about current earthquake events happening in the world using the USGS API
    properties:
      period:
        title: Period between polls
        description: The interval between fetches to the earthquake API in milliseconds
        type: integer
        default: 60000
      lookAhead:
        title: Look-ahead minutes
        description: The amount of minutes to look ahead when starting the integration afresh
        type: integer
        default: 120
  types: # (3)
    out:
      mediaType: application/json
  dependencies: # (4)
    - camel-quarkus:caffeine
    - camel-quarkus:http
  template:
    from:
      uri: "timer:earthquake"
      parameters:
        period: "{{period}}"
      steps:
        - set-header:
            name: CamelCaffeineAction
            constant: GET
        - tod: "caffeine-cache:cache-${routeId}?key=lastUpdate"
        - choice:
            when:
              - simple: "${header.CamelCaffeineActionHasResult}"
                steps:
                  - set-property:
                      name: lastUpdate
                      simple: "${body}"
            otherwise:
              steps:
                - set-property:
                    name: lastUpdate
                    simple: "${date-with-timezone:now-{{lookAhead}}m:UTC:yyyy-MM-dd'T'HH:mm:ss.SSS}"
        - set-header:
            name: CamelHttpMethod
            constant: GET
        - tod: "https://earthquake.usgs.gov/fdsnws/event/1/query?format=geojson&updatedafter=${exchangeProperty.lastUpdate}&orderby=time-asc"
        - unmarshal:
            json: {}
        - set-property:
            name: generated
            simple: "${body[metadata][generated]}"
        - set-property:
            name: lastUpdate
            simple: "${date-with-timezone:exchangeProperty.generated:UTC:yyyy-MM-dd'T'HH:mm:ss.SSS}"
        - claim-check:
            operation: Push
        - set-body:
            exchange-property: lastUpdate
        - set-header:
            name: CamelCaffeineAction
            constant: PUT
        - tod: "caffeine-cache:cache-${routeId}?key=lastUpdate"
        - claim-check:
            operation: Pop
        - split:
            jsonpath: "$.features[*]"
            steps:
              - marshal:
                  json: {}
              - to: "kamelet:sink"

The above shows us how powerful Kamelets are because they define:

  • The route (Flow of the integration)
  • Parameters (Which can be used when calling the Kamelet)
  • Dependencies
  • Types
  • Metadata (documentation, icons etc.)

The example also shows that all this power adds more complexity and new terminology. Terminology from the example:

  • apiVersion
  • kind
  • metadata
  • annotation
  • spec
  • definition
  • template

Besides, in practice Kamelets are still relatively tied to Camel K, and not all features can be used in, for example, route templates. The example is written in the YAML format, which is far from what most Camel users are familiar with.

Comparison of Kamelet with Step

To some extent, we could say that a whole Kamelet is a step in DIL (so a DIL step has nothing to do with the steps in the previous Kamelet example). We could also say that a Kamelet template is comparable to a block. The Camel route that uses the Kamelet can be compared to a DIL flow.
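As a hedged sketch in the Camel Java DSL (the parameter value and target endpoint are illustrative), a route using the earthquake Kamelet above would be roughly what DIL calls a flow, with the Kamelet itself acting as a single step:

// Inside a RouteBuilder's configure(): the Kamelet is consumed as a source endpoint
from("kamelet:earthquake-source?period=30000")
    .log("Earthquake event: ${body}")
    .to("file:data/earthquakes");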

But a route, even one that calls Kamelets, is still relatively bound to Java. It’s not really clear that it encapsulates other routes through the Kamelets. Besides, templates are really just parameterized routes, not reusable building blocks for creating steps.

In DIL a route, but also a (route) template, a (route) configuration, a message, a connection and so on, are core components. These core components can be referenced by blocks that together form a step. The idea is not only that core components can be reused by multiple blocks, but also that the parameters guide which kind is deployed and run.

Comparison of (Route) Template with Blocks

Blocks are, so to say, templates on steroids. I think the current route template (which is a subset of the functionality of Kamelets) could be further developed into blocks. In the following blog some ideas were gathered:

What about a general-purpose language?

We saw that it’s hard to write data integrations with a general-purpose language. They are targeted at applications and, with the help of for example objects, they focus on the control flow instead of the data flow.

But why is this? The reason they are not the right tool is that these languages don’t tackle the fallacies of distributed computing. In distributed computing we have to deal with the network (latency, bandwidth, security and so on), scattered data and decoupled processes. Most programming languages don’t have built-in constructs to handle concurrency, transactions and messaging.

There is only one language which takes distributed computing (or cloud computing) as its core design principle, which is the language Ballerina:

A lot of things that in other languages were added later through libraries, frameworks and tooling are built into the Ballerina language.

It’s a monumental effort initiated by WSO2. Because it’s general purpose, it comes with a lot of features unrelated to data integration, which (in my opinion) makes the syntax less intuitive for data flows and harder to learn (also because its syntax is imperative like Java or C#, instead of declarative).

Camel as a programing language

As a last chapter, I would like to discuss the possibility for Camel to become an independent general-purpose programming language. This would be a possibility from around Camel version 5. I think that by combining the ideas of the several DSLs currently in the Camel framework, Kamelets and DIL, one could end up with a very powerful (general-purpose) language that could compete with Ballerina in the field of cloud computing, but with a declarative, flow-oriented syntax instead of a procedural one.

What would be needed to realize this:

A metamodel

A language reference (metamodel) that defines the language independently of its representations. A consistent way to define integrations where it’s clear how concepts are related. Everything is done within this language and no other languages are needed.

Representations

The language specification can be represented by multiple formats:

  1. A visual format (Karavan, Kaoto, Dovetail)
  2. A file format (xml, yaml and json).
  3. A programming language (Java, Groovy, Kotlin)

JSON, for example, is currently missing, but it is the lingua franca of web and cloud computing (think of JavaScript, MongoDB and so on).

Syntax

In every syntax one could write not only things like routes, but also constructs on higher levels like integrations, flows and steps. Thus, developers have one unified experience to do everything without needing Java.

Standalone

The language should be embeddable into, for example, Java (like one can currently use Camel), but it should also run completely standalone. For this you would have files with the extension .cl (Camel language); .clx would mean represented as XML, while .clk would mean represented in Kotlin syntax. A standalone project could combine these files into one ‘Camel’ program that runs on the JVM (or natively with GraalVM).

Software Development Kit

Like Java has the JDK, Camel could have its own SDK. This could contain OpenJDK, but also add additional tooling like JBang (run in a REPL/run locally/Camel CLI) and Kamel (run remotely). It could also provide the basic jar files (think of camel engine, camel support etc.) which act as the base library to make a Camel program.

Catalogs and libraries

The metadata and much of the other Kamelet machinery can be moved outside the route template (blocks) and placed in a separate configuration file (step configuration). The programmer can program multiple reusable blocks. With one or more of those blocks and the configuration, a step can be created. These can be packed as a JAR file and used in other programs or visual tools.

Extensions

Currently, the Camel framework is used to extend, for example, programming in Java. In the Camel Language things would be reversed. The primary coding is done in the Camel language (you only write in the language, without the need for classes), but you can extend it with Java code (through beans and processors).

Thus yet another language?

DIL was inspired by most of the approaches mentioned in this blog, whether languages, middleware, product suites, frameworks or models. They are all valid and sound approaches that are hard to top. But sometimes the best view you get is after the hardest climb.

DIL can be seen as an experiment that builds on the work that has already been done with Kamelets, Assimbly, Dovetail and other integration approaches. It doesn’t mean that DIL is a more valid approach, but it’s an effort to generate new ideas for alternative approaches. Hopefully, this inspires new ways of creating integrations and makes its components first-class citizens in high-code or low-code solutions. In the end, we all want to connect the world.

More reading

LiLa

Highway
