rmoff’s random ramblings

✨ Data Engineering, Kafka, and other random geekery 🤓

Using Open Sea Map data in Kibana maps

Kibana’s map functionality is a powerful way to visualise data that has a location element in it. I was recently working with data about ships at sea, and whilst the built-in Road map is very good, it doesn’t show much maritime detail. Kibana’s map visualisation has the option to pull in additional visual information from other places (known as tile servers). I found a list of tile servers, which had details of OpenSeaMap which includes:

Continue Reading

Loading delimited data into Kafka - quick & dirty (but effective)

Whilst Apache Kafka is an event streaming platform designed for, well, streams of events, it’s perfectly valid to use it as a store of data which perhaps changes only occasionally (or even never). I’m thinking here of reference data (lookup data) that’s used to enrich regular streams of events.

You might well get your reference data from a database where it resides and do so effectively using CDC - but sometimes it comes down to those pesky CSV files that we all know and love/hate. Simple, awful, but effective. I wrote previously about loading CSV data into Kafka from files that are updated frequently, but here I want to look at CSV files that are not changing. Kafka Connect simplifies getting data into (and out of) Kafka, but even Kafka Connect becomes a bit of an overhead when you just have a single file that you want to load into a topic and then never deal with again. I spent this afternoon wrangling a couple of CSV-ish files, and building on my previous article about neat tricks you can do in bash with data, I have some more to share with you here :)

Continue Reading

📼 ksqlDB HOWTO - A mini video series 📼

Some people learn through doing - and for that there’s a bunch of good ksqlDB tutorials here and here. Others may prefer to watch and listen first, before getting hands on. And for that, I humbly offer you this little series of videos all about ksqlDB. They’re all based on a set of demo scripts that you can run for yourself and try out.

🚨 Make sure you subscribe to my YouTube channel so that you don’t miss more videos like these!

Continue Reading

Performing a GROUP BY on data in bash

One of the fun things about working with data over the years is learning how to use the tools of the day - but also learning to fall back on the tools that are always there for you. One of those is bash and its wonderful library of shell tools.

There’s an even better way than I’ve described here, and it’s called visidata. I’ve written about it more over here.

I’ve been playing around with a new data source recently, and needed to understand more about its structure. Within a single stream there were multiple message types.

Continue Reading

Running as root on Docker images that don’t use root

tl;dr: specify the --user root argument:

docker exec --interactive \
            --tty \
            --user root \
            --workdir / \
            container-name bash

Continue Reading

Running a self-managed Kafka Connect worker for Confluent Cloud

Confluent Cloud is not only a fully-managed Apache Kafka service, but also provides important additional pieces for building applications and pipelines including managed connectors, Schema Registry, and ksqlDB. Managed Connectors are run for you (hence, managed!) within Confluent Cloud - you just specify the technology that you want to integrate with, whether into or out of Kafka, and Confluent Cloud does the rest.

Continue Reading

Creating topics with Kafka Connect

When Kafka Connect ingests data from a source system into Kafka it writes it to a topic. If you have set auto.create.topics.enable = true on your broker then the topic will be created when written to. If auto.create.topics.enable = false (as it is on Confluent Cloud and many self-managed environments, for good reasons) then you can tell Kafka Connect to create those topics first. This was added in Apache Kafka 2.6 (Confluent Platform 6.0) - before that you had to create the topics yourself, otherwise the connector would fail.

Continue Reading

Kafka Connect - Deep Dive into Single Message Transforms

KIP-66 was added in Apache Kafka 0.10.2 and brought new functionality called Single Message Transforms (SMT). Using SMT you can modify the data and its characteristics as it passes through a Kafka Connect pipeline, without needing additional stream processors. For things like manipulating fields, changing topic names, conditionally dropping messages, and more, SMT are a perfect solution. If you get to things like aggregation, joining streams, and lookups then SMT may not be the best for you and you should head over to Kafka Streams or ksqlDB instead.

Continue Reading

🎄 Twelve Days of SMT 🎄 - Day 12: Community Transformations

Apache Kafka ships with many Single Message Transformations (SMT) included - but the great thing about it being an open API is that people can, and do, write their own transformations. Many of these are shared with the wider community, and in this final installment of the series I’m going to look at some of the transformations written by Jeremy Custenborder and available in kafka-connect-transform-common which can be downloaded and installed from Confluent Hub (or built from source, if you like that kind of thing).

Continue Reading

🎄 Twelve Days of SMT 🎄 - Day 11: Predicate and Filter

Apache Kafka 2.6 included KIP-585 which adds support for defining predicates against which transforms are conditionally executed, as well as a Filter Single Message Transform to drop messages - which in combination means that you can conditionally drop messages.

As part of Apache Kafka, Kafka Connect ships with pre-built Single Message Transforms and Predicates, but you can also write your own. The API for each is documented: Transformation / Predicate. The predicates that ship with Apache Kafka are:

RecordIsTombstone - The value part of the message is null (denoting a tombstone message)
HasHeaderKey - Matches if a header exists with the name given
TopicNameMatches - Matches based on topic
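
To make the "predicate plus Filter" combination concrete, here is a rough Python sketch of the behaviour. It is an illustration only and not Kafka Connect code: records are modelled as plain dicts, and every function name here (`record_is_tombstone`, `topic_name_matches`, `has_header_key`, `filter_transform`) is made up for the example.

```python
import re


def record_is_tombstone(record):
    # A tombstone is a record whose value is null.
    return record["value"] is None


def topic_name_matches(pattern):
    # Build a predicate that is true when the whole topic name matches the regex.
    compiled = re.compile(pattern)
    return lambda record: compiled.fullmatch(record["topic"]) is not None


def has_header_key(name):
    # Build a predicate that is true when a header with this name exists.
    return lambda record: name in record.get("headers", {})


def filter_transform(records, predicate, negate=False):
    # Drop every record for which the predicate holds
    # (or, with negate=True, every record for which it does not).
    return [r for r in records if predicate(r) == negate]


if __name__ == "__main__":
    sample = [
        {"topic": "orders", "value": {"id": 1}, "headers": {}},
        {"topic": "orders", "value": None, "headers": {}},
    ]
    print(filter_transform(sample, record_is_tombstone))
```

The `negate` flag in the sketch mirrors the idea of inverting a predicate, so that you drop everything except the matching records.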

Continue Reading

🎄 Twelve Days of SMT 🎄 - Day 10: ReplaceField

The ReplaceField Single Message Transform has three modes of operation on fields of data passing through Kafka Connect:

Include only the fields specified in the list (whitelist)
Include all fields except the ones specified (blacklist)
Rename field(s) (renames)
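
The three modes can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not the transform itself: the message value is a plain dict, and `replace_field` and its parameter names are hypothetical, chosen for this example.

```python
def replace_field(value, include=None, exclude=None, renames=None):
    """Return a copy of `value` with fields filtered and/or renamed."""
    renames = renames or {}
    result = {}
    for name, field_value in value.items():
        if exclude and name in exclude:
            continue  # explicitly excluded
        if include and name not in include:
            continue  # an include list is set and this field is not on it
        # Filtering is done on the original name; renaming happens afterwards.
        result[renames.get(name, name)] = field_value
    return result


if __name__ == "__main__":
    message = {"id": 42, "name": "Rick", "cc_num": "1234"}
    print(replace_field(message, exclude={"cc_num"}, renames={"name": "full_name"}))
```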

Continue Reading

Scheduling Hugo Builds on GitHub pages with GitHub Actions

Over the years I’ve used various blogging platforms; after a brief dalliance with Blogger I started for real with the near-inevitable WordPress.com. From there I decided it would be fun to self-host using Ghost, and then almost exactly two years ago to the day decided it definitely was not fun to spend time patching and upgrading my blog platform instead of writing blog articles, so headed over to my current platform of choice: Hugo hosted on GitHub Pages. This has worked extremely well for me during that time, doing everything I want from it until recently.

Continue Reading

🎄 Twelve Days of SMT 🎄 - Day 9: Cast

The Cast Single Message Transform lets you change the data type of fields in a Kafka message, supporting numeric, string, and boolean types.

Continue Reading

🎄 Twelve Days of SMT 🎄 - Day 8: TimestampConverter

The TimestampConverter Single Message Transform lets you work with timestamp fields in Kafka messages. You can convert a string into a native Timestamp type (or Date or Time), as well as Unix epoch - and the same in reverse too.

This is really useful to make sure that data ingested into Kafka is correctly stored as a Timestamp (if it is one), and also enables you to write a Timestamp out to a sink connector in a string format that you choose.

Continue Reading

🎄 Twelve Days of SMT 🎄 - Day 7: TimestampRouter

Just like the RegexRouter, the TimestampRouter can be used to modify the topic name of messages as they pass through Kafka Connect. Since the topic name is usually the basis for the naming of the object to which messages are written in a sink connector, this is a great way to achieve time-based partitioning of those objects if required. For example, instead of streaming messages from Kafka to an Elasticsearch index called cars, they can be routed to monthly indices e.g. cars_2020-10, cars_2020-11, cars_2020-12, etc.

The TimestampRouter takes two arguments: the format of the final topic name to generate, and the format of the timestamp to put in the topic name (based on SimpleDateFormat). Note that in SimpleDateFormat the calendar year is lowercase yyyy; uppercase YYYY is the week year, which can give the wrong year for dates around the new year.

"transforms"                                     : "addTimestampToTopic",
"transforms.addTimestampToTopic.type"            : "org.apache.kafka.connect.transforms.TimestampRouter",
"transforms.addTimestampToTopic.topic.format"    : "${topic}_${timestamp}",
"transforms.addTimestampToTopic.timestamp.format": "yyyy-MM-dd"

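As a rough illustration of how the two arguments combine, here is a small Python sketch. It is not the transform's implementation: `timestamp_router` is a hypothetical name, and it uses Python's strftime directives (`%Y-%m`) rather than SimpleDateFormat patterns, so the format strings are not interchangeable with the connector config.

```python
from datetime import datetime, timezone


def timestamp_router(topic, timestamp_ms,
                     topic_format="${topic}_${timestamp}",
                     timestamp_format="%Y-%m"):
    """Build a new topic name from the original name and the record timestamp."""
    # Interpret the epoch-millisecond timestamp in UTC.
    ts = datetime.fromtimestamp(timestamp_ms / 1000, tz=timezone.utc)
    formatted = ts.strftime(timestamp_format)
    # Substitute the two placeholders used in the topic format.
    return topic_format.replace("${topic}", topic).replace("${timestamp}", formatted)


if __name__ == "__main__":
    when = datetime(2020, 11, 15, 12, 0, tzinfo=timezone.utc)
    print(timestamp_router("cars", int(when.timestamp() * 1000)))
```

With a monthly timestamp format, every message for a given month ends up with the same topic name, which is what gives you one target object per month.
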
Continue Reading

🎄 Twelve Days of SMT 🎄 - Day 6: InsertField II

We kicked off this series by seeing on day 1 how to use InsertField to add in the timestamp to a message passing through the Kafka Connect sink connector. Today we’ll see how to use the same Single Message Transform to add in a static field value, as well as the name of the Kafka topic, partition, and offset from which the message has been read.

"transforms"                                : "insertStaticField1",
"transforms.insertStaticField1.type"        : "org.apache.kafka.connect.transforms.InsertField$Value",
"transforms.insertStaticField1.static.field": "sourceSystem",
"transforms.insertStaticField1.static.value": "NeverGonna"

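The effect on a message can be sketched in Python as below. This is a simplified model, assuming a record is a dict with `topic`, `partition`, `offset` and `value` entries; `insert_field` and its parameter names are hypothetical and only loosely echo the transform's config keys.

```python
def insert_field(record, static_field=None, static_value=None,
                 topic_field=None, partition_field=None, offset_field=None):
    """Return a copy of the record with extra fields added to its value."""
    value = dict(record["value"])  # copy so the input record is not mutated
    if static_field is not None:
        value[static_field] = static_value
    if topic_field is not None:
        value[topic_field] = record["topic"]
    if partition_field is not None:
        value[partition_field] = record["partition"]
    if offset_field is not None:
        value[offset_field] = record["offset"]
    return {**record, "value": value}


if __name__ == "__main__":
    rec = {"topic": "customers", "partition": 0, "offset": 17, "value": {"id": 1}}
    print(insert_field(rec, static_field="sourceSystem", static_value="NeverGonna"))
```
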
Continue Reading

🎄 Twelve Days of SMT 🎄 - Day 5: MaskField

If you want to mask fields of data as you ingest from a source into Kafka, or write to a sink from Kafka with Kafka Connect, the MaskField Single Message Transform is perfect for you. It retains the field whilst replacing its value.

To use the Single Message Transform you specify the field to mask, and its replacement value. To mask the contents of a field called cc_num you would use:

"transforms"                               : "maskCC",
"transforms.maskCC.type"                   : "org.apache.kafka.connect.transforms.MaskField$Value",
"transforms.maskCC.fields"                 : "cc_num",
"transforms.maskCC.replacement"            : "****-****-****-****"

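In Python terms, the behaviour looks roughly like this. It is a sketch under assumptions rather than the transform's code: the message value is a plain dict and `mask_field` is a hypothetical helper name.

```python
def mask_field(value, fields, replacement):
    """Return a copy of `value` where each listed field keeps its name
    but has its contents swapped for `replacement`."""
    return {
        name: (replacement if name in fields else field_value)
        for name, field_value in value.items()
    }


if __name__ == "__main__":
    payment = {"customer": "Rick", "cc_num": "1234-5678-9012-3456"}
    print(mask_field(payment, {"cc_num"}, "****-****-****-****"))
```
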
Continue Reading

🎄 Twelve Days of SMT 🎄 - Day 4: RegExRouter

If you want to change the topic name to which a source connector writes, or the object name that’s created on a target by a sink connector, the RegexRouter is exactly what you need.

To use the Single Message Transform you specify the pattern in the topic name to match, and its replacement. To drop a prefix of test- from a topic you would use:

"transforms"                             : "dropTopicPrefix",
"transforms.dropTopicPrefix.type"        : "org.apache.kafka.connect.transforms.RegexRouter",
"transforms.dropTopicPrefix.regex"       : "test-(.*)",
"transforms.dropTopicPrefix.replacement" : "$1"

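To see what that pattern and replacement do to a topic name, here is a hedged Python approximation. `regex_route` is a made-up name, and note that Python writes the first capture group as `\1` where the Java-style config above uses `$1`. The sketch assumes the pattern has to match the whole topic name for the rename to apply.

```python
import re


def regex_route(topic, pattern=r"test-(.*)", replacement=r"\1"):
    """Rename the topic if the whole name matches the pattern, else leave it."""
    if re.fullmatch(pattern, topic) is None:
        return topic  # no match: the topic name passes through unchanged
    return re.sub(pattern, replacement, topic, count=1)


if __name__ == "__main__":
    for name in ("test-orders", "prod-orders"):
        print(name, "->", regex_route(name))
```
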
Continue Reading

🎄 Twelve Days of SMT 🎄 - Day 3: Flatten

The Flatten Single Message Transform (SMT) is useful when you need to collapse a nested message down to a flat structure.

To use the Single Message Transform you only need to reference it; there’s no additional configuration required:

"transforms"                    : "flatten",
"transforms.flatten.type"       : "org.apache.kafka.connect.transforms.Flatten$Value"

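Here is a small Python sketch of what collapsing a nested message means in practice. It is an illustration only: the message is modelled as nested dicts, `flatten` is a hypothetical helper, and joining the nested names with a `.` is an assumption made for the example.

```python
def flatten(value, delimiter=".", prefix=""):
    """Collapse nested dicts into a single level, joining names with `delimiter`."""
    flat = {}
    for name, field_value in value.items():
        key = f"{prefix}{delimiter}{name}" if prefix else name
        if isinstance(field_value, dict):
            # Recurse into the nested structure, carrying the path so far.
            flat.update(flatten(field_value, delimiter, key))
        else:
            flat[key] = field_value
    return flat


if __name__ == "__main__":
    nested = {"id": 1, "address": {"city": "Leeds", "geo": {"lat": 53.8}}}
    print(flatten(nested))
```
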
Continue Reading

🎄 Twelve Days of SMT 🎄 - Day 2: ValueToKey and ExtractField

Setting the key of a Kafka message is important as it ensures correct logical processing when consumed across multiple partitions, as well as being a requirement when joining to messages in other topics. When using Kafka Connect the connector may already set the key, which is great. If not, you can use these two Single Message Transforms (SMT) to set it as part of the pipeline based on a field in the value part of the message.

To use the ValueToKey Single Message Transform specify the name of the field (id) that you want to copy from the value to the key:

"transforms"                    : "copyIdToKey",
"transforms.copyIdToKey.type"   : "org.apache.kafka.connect.transforms.ValueToKey",
"transforms.copyIdToKey.fields" : "id"

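The way the two transforms work together can be sketched in Python. This is a simplified model and not Kafka Connect code: records are dicts with `key` and `value` entries, and `value_to_key` and `extract_field_from_key` are hypothetical names for the two steps.

```python
def value_to_key(record, fields):
    """Copy the named fields from the value into a new, structured key."""
    key = {name: record["value"][name] for name in fields}
    return {**record, "key": key}


def extract_field_from_key(record, field):
    """Replace a structured key with the bare contents of one of its fields."""
    return {**record, "key": record["key"][field]}


if __name__ == "__main__":
    rec = {"key": None, "value": {"id": 42, "name": "Rick"}}
    step1 = value_to_key(rec, ["id"])              # key is now {"id": 42}
    step2 = extract_field_from_key(step1, "id")    # key is now 42
    print(step1["key"], step2["key"])
```

The first step on its own leaves the key as a structure holding the field; the second step is what reduces it to a primitive value.
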
Continue Reading

Robin Moffatt

Robin Moffatt is a Principal DevEx Engineer at Decodable. He likes writing about himself in the third person, eating good breakfasts, and drinking good beer.