
Architecture

How Splunk Analysis and Kafka Connect Works


General Introduction

  • Splunk Connect for Kafka is a sink connector that allows a Splunk software administrator to subscribe to a Kafka topic and stream the data to the Splunk HTTP Event Collector. After the Splunk platform indexes the events, you can then directly analyze the data or use it as a contextual data feed to correlate with other Kafka-related data in the Splunk platform.
  • There are primarily two methods for getting metrics into Splunk when using agents such as etcd and collectd: a direct TCP/UDP input, or the HTTP Event Collector (HEC). Using HEC is considered a best practice because of the resiliency and scalability of the HEC endpoint and the ease of horizontally scaling the collection tier. A minimal example of sending an event to HEC is sketched below.
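
For context on why HEC scales so well, each event is simply an authenticated HTTPS POST, so any number of senders can push to the same endpoint or to a load balancer in front of it. A minimal sketch, with placeholder host and token values:

# Send a single test event to a Splunk HTTP Event Collector endpoint.
# <SPLUNK_HOST> and <HEC_TOKEN> are placeholders for your own environment.
curl -k https://<SPLUNK_HOST>:8088/services/collector/event \
  -H "Authorization: Splunk <HEC_TOKEN>" \
  -d '{"event": "test event from the collection tier", "sourcetype": "manual"}'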

[Figure: Universal Splunk Forwarders]

[Figure: Splunk Use-Case Modelling]

Splunk Connect For Kafka

Splunk Connect for Kafka introduces a scalable approach to tap into the growing volume of data flowing into Kafka. With the largest Kafka clusters processing over one trillion messages per day and Splunk deployments reaching petabytes ingested per day, this scalability is critical.

Overall, the connector provides:

  1. High scalability, allowing linear scaling, limited only by the hardware supplied to the Kafka Connect environment
  2. High reliability by ensuring at-least-once delivery of data to Splunk
  3. Ease of data onboarding and simple configuration with Kafka Connect framework and Splunk’s HTTP Event Collector

Procedural Flow

          Step 1: Start Zookeeper Server

Kafka uses ZooKeeper, so you first need to start a ZooKeeper server if you don't already have one. You can use the convenience script packaged with Kafka to get a quick-and-dirty single-node ZooKeeper instance.

> bin/zookeeper-server-start.sh config/zookeeper.properties
 [2013-04-22 15:01:37,495] INFO Reading configuration from: config/zookeeper.properties   (org.apache.zookeeper.server.quorum.QuorumPeerConfig)
...

            Step 2: Start Kafka Cluster

Once the ZooKeeper server is running, the Kafka brokers can start and run without interruption:

> bin/kafka-server-start.sh config/server.properties
 [2013-04-22 15:01:47,028] INFO Verifying properties (kafka.utils.VerifiableProperties)
 [2013-04-22 15:01:47,051] INFO Property socket.send.buffer.bytes is overridden to 1048576   (kafka.utils.VerifiableProperties)
...

            Step 3: Create a topic

Let’s create a topic named “test” with a single partition and only one replica:

> bin/kafka-topics.sh --create --bootstrap-server localhost:9092 --replication-factor 1 --partitions 1 --topic test 

We can now see that topic if we run the list topic command:

> bin/kafka-topics.sh --list --bootstrap-server localhost:9092
test

Alternatively, instead of creating topics manually, you can configure your brokers to auto-create topics when a non-existent topic is published to, as sketched below.
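
If you rely on auto-creation, the relevant broker settings live in config/server.properties. The values below are only an illustrative sketch; auto-created topics pick up the broker defaults for partitions and replication:

# config/server.properties (broker configuration)
# Allow topics to be created automatically on first use.
auto.create.topics.enable=true
# Defaults applied to auto-created topics (illustrative values).
num.partitions=1
default.replication.factor=1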

How it Works

As a sink connector, Splunk Connect for Kafka takes advantage of the Kafka Connect framework to horizontally scale workers to push data from Kafka topics to Splunk Enterprise or Splunk Cloud. Once the plugin is installed and configured on a Kafka Connect cluster, new tasks will run to consume records from the selected Kafka topics and send them to your Splunk indexers through HTTP Event Collector, either directly or through a load balancer.

Key configurations include:

  1. Load Balancing by specifying a list of indexers, or by using a load balancer
  2. Indexer Acknowledgements for guaranteed at-least-once delivery (if using a load balancer, sticky sessions must be enabled)
  3. Index Routing using the connector configuration to specify a 1-to-1 mapping of topics to indexes, or using props.conf on the indexers for record-level routing (see the sketch after this list)
  4. Metrics using collectd, raw mode, and the collectd_http pretrained sourcetype
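
As a sketch of the index-routing option in item 3, the connector accepts a comma-separated splunk.indexes list that maps positionally, one-to-one, onto the topics list; the same keys would appear in the connector-creation request shown later in this post. The topic and index names here are placeholders:

"topics": "web_logs,app_logs",
"splunk.indexes": "web_index,app_index"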

Where can I get it?

  1. The latest release is available on Splunkbase
  2. The Kafka Connect project is also open sourced and available on GitHub

Deployment Considerations

  1. Clone the repository from https://github.com/splunk/kafka-connect-splunk
  2. Verify that a Java 8 JRE or JDK is installed.
  3. Run mvn package. This builds the JAR in the target/ directory, named splunk-kafka-connect-[VERSION].jar (the full sequence is sketched after this list).
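
Taken together, the build steps above look roughly like this (the exact JAR version in the file name will vary by release):

# Clone and build the connector; the packaged JAR is written to target/.
git clone https://github.com/splunk/kafka-connect-splunk.git
cd kafka-connect-splunk
mvn package
ls target/splunk-kafka-connect-*.jar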

First, we need to get our hands on the packaged JAR file (see above) and install it across all Kafka Connect cluster nodes that will run the Splunk connector; a sketch of this step follows below. Kafka Connect has two modes of operation: standalone mode and distributed mode. To start Kafka Connect, we run the following command from the $KAFKA_CONNECT_HOME directory, accompanied by the worker properties file connect-distributed.properties. Ensure that your Kafka cluster and ZooKeeper are running.
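
As a minimal sketch of the installation step, assuming /opt/connectors is the directory your plugin.path will point to (any path works as long as the two match), copy the JAR onto every Kafka Connect worker node:

# On every Kafka Connect worker node: place the built JAR where plugin.path will point.
# /opt/connectors is a placeholder path; substitute your own location.
mkdir -p /opt/connectors/splunk-kafka-connect
cp target/splunk-kafka-connect-*.jar /opt/connectors/splunk-kafka-connect/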

Establish Connection

Run $KAFKA_HOME/bin/connect-distributed.sh $KAFKA_HOME/config/connect-distributed.properties to start Kafka Connect.

Kafka ships with a few default properties files; however, the Splunk connector requires the worker properties below to function correctly. You can either modify connect-distributed.properties or create a new properties file.

Note the two options that need some extra configuration:

  1. bootstrap.servers – a comma-separated list of the host:port addresses of your Kafka brokers.
  2. plugin.path – to make the JAR visible to Kafka Connect, set this to the folder where the connector JAR was installed in the earlier section.

Configuration Properties

#These settings may already be configured if you have deployed a connector in your Kafka Connect Environment
bootstrap.servers=<BOOTSTRAP_SERVERS>
plugin.path=<PLUGIN_PATH>
#Required
key.converter=org.apache.kafka.connect.storage.StringConverter
value.converter=org.apache.kafka.connect.storage.StringConverter
key.converter.schemas.enable=false
value.converter.schemas.enable=false
internal.key.converter=org.apache.kafka.connect.json.JsonConverter
internal.value.converter=org.apache.kafka.connect.json.JsonConverter
internal.key.converter.schemas.enable=false
internal.value.converter.schemas.enable=false
offset.flush.interval.ms=10000
#Recommended
group.id=kafka-connect-splunk-hec-sink

Run the following command to create the connector tasks. Adjust topics to set the Kafka topics to be ingested, splunk.indexes to set the destination Splunk indexes, splunk.hec.token to set your HTTP Event Collector (HEC) token, and splunk.hec.uri to the URI of your destination Splunk HEC endpoint.

Splunk Indexing using HEC Endpoint

curl <hostname>:8083/connectors -X POST -H "Content-Type: application/json" -d '{
  "name": "kafka-connect-splunk",
  "config": {
    "connector.class": "com.splunk.kafka.connect.SplunkSinkConnector",
    "tasks.max": "1",
    "topics": "topic_name",
    "splunk.hec.uri": "https://localhost:8088",
    "splunk.hec.token": "1B901D2B-576D-40CD-AF1E-98141B499534",
    "splunk.hec.ack.enabled": "true",
    "splunk.hec.raw": "false",
    "splunk.hec.ssl.validate.certs": "false",
    "splunk.hec.track.data": "true"
  }
}' | jq

Now is a good time to check that Splunk Connect for Kafka has been installed correctly and is ready to be deployed. Verify that data is flowing into your Splunk platform instance by searching against the index specified in the configuration. Use the following commands to check status and to manage connectors and tasks:

Useful Commands

    # List active connectors
    curl http://localhost:8083/connectors

    # Get kafka-connect-splunk connector status
    curl http://localhost:8083/connectors/kafka-connect-splunk/status | jq

    # Get kafka-connect-splunk connector config info
    curl http://localhost:8083/connectors/kafka-connect-splunk/config

    # Delete kafka-connect-splunk connector
    curl http://localhost:8083/connectors/kafka-connect-splunk -X DELETE

    # Get kafka-connect-splunk connector task info
    curl http://localhost:8083/connectors/kafka-connect-splunk/tasks
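
If the connector reports a RUNNING state but no events show up in your searches, one quick check on the Splunk side is the HEC health endpoint (assuming the default HEC port 8088):

    # Confirm the HTTP Event Collector endpoint is up and accepting connections.
    # -k skips client-side certificate validation, analogous to splunk.hec.ssl.validate.certs=false above.
    curl -k https://localhost:8088/services/collector/health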

 

