Serverless Architecture in practice: sentiment analysis on Twitter with KEDA
Photo by Jeremy Bezanger on Unsplash
Introduction
Serverless architectures have gained such popularity recently, especially because of Cloud Providers off-the-shelf solutions that are cost-efficient and easy to start with. But how does it work? What is the technology behind? How can I build my own serverless architecture if I wanted to?
I will try to demystify the concept and provide you with a concret example you can reproduce. Through this we will build a serverless application on top of Kubernetes, that search for tweets given a specific pattern and conduct a sentiment analysis on it.
Don’t worry, it is easier that it looks. At the end, you will understand how it works behind the scene and you will be able to reproduce your own serverless application. Let’s start with the basics first.
What is a Serverless Architecture?
The term is probably a bit misleading, and some people may think this is a way to run computer code without servers. Unfortunately, every time you run a program, compute capacity will be involved at some point. It still requires servers, power, network connectivity, … Serverless architectures could be more or less described as a way to mask compute complexity, so developers can focus on what they do best: coding.
The ideal use case is when you are dealing with a seasonal workload, one with predictive usage on business days, and/or with pikes during short period of time like black Fridays or Christmas for example. A piece of code will be triggered when it is needed and will consume resources only when the service is provided and generating revenues. For example, a piece of code will be executed only when users make a call to an API endpoint, will incur compute costs when users are active and using your service but not when they are sleeping at night. Generally, on top of that, cloud providers are offering highly scalable infrastructures to absorb surges in workload. At the end, when using this king of architectures you can benefit from both high scalability and cost-efficiency.
The Serverless model is very suitable to event-driven architectures and fits very well the pay-as-you-go model offered by cloud providers. You can start coding on your laptop and use the same code to process millions of events in production the same day.
Why KEDA?
Cloud Providers have their own tooling (AWS Lambda, Azure functions, …) that makes it very easy to use but is also masking the real complexity behind. The CloudNative foundation is providing opensource solutions to build your own Serverless architecture. The most mature is KNative, but in this article, I will use another one named KEDA
KEDA is lightweight solution also part of the CloudNative ecosystem, that runs on top of Kubernetes and provides ready-to-use connectors to most popular event triggering services, either opensource or from cloud providers/third party vendors. It is easy to install and very quick to start with, and I wanted to experiment. I picked one of the demo projects provided on their website and here we are.
Sentiment analysis on Twitter
To start with a cool example, we will perform a sentiment analysis on Twitter. The goal is to search for tweets containing a certain pattern, like “banana” or “watermelon”, and then process a sentiment analysis on it. It read through the content and attribute a grade from 0 to 1, whether the tweet is rather sympathetic or antipathetic. Then it will record the grade and geolocation, and pass those to a PowerBI data stream so we can see the results on a world map and visualize which country is a banana or watermelon aficionado.
In order to do this, we will rely on a message broker. Any message broker solution could have suit our demo, but to remain cloud agnostic, we will deploy Kafka in our Kubernetes cluster. Thus, the tweets will be retrieved and pushed into a Kafka topic waiting to be processed.
As we usually say that a diagram is worth ten thousand words, here is what it looks like :
At the heart of the engine will be KEDA. It is a very light auto-scaler part of the Cloud Native Foundation ecosystem. It can listen to any kind of events and scale Kubernetes deployments. In our case, it will monitor the size of the Kafka queue, the bigger it is the more replicas it will scale. What is very interesting here is when there is no message in the queue to process, KEDA will scale the deployment at 0, meaning it will put the processing function to sleep.
Requirements
If you want to follow every steps of this article, you will need the following :
- a Kubernetes cluster and the kubectl CLI
- the helm CLI
- the Azure functions core tools
- a container registry, go to Docker HUB if you do not have one
- the git CLI
- the code of the functions to deploy
- an elevated Twitter developer account with Oauth v1 activated
If you don’t have any Kubernetes cluster you can order a free trial account to any cloud provider available on the market, and start one. Here is how to start on Microsoft Azure :
# login to azure
~ » az login# create a resource group in the region you want, here in Zurich
~ » az group create --name demo --location switzerlandnorth# create your AKS cluster
~ » az aks create --resource-group demo --name aksdemo --node-count 2 --generate-ssh-keys# once created, update your kubeconfig file
~ » az aks get-credentials --resource-group demo --name aksdemo
Regarding the container registry, if you don’t have one yet, the quickest and easiest way is create an account on Docker HUB and create a registry on it.
You will also need to clone this repository where you will find the deployments :
~ » git clone https://github.com/kedacore/sample-typescript-kafka-azure-function
~ » cd sample-typescript-kafka-azure-function
How to install KEDA
The installation is quite straightforward, here are the steps to follow to deploy it in one minute.
You will first need to add Kedacore to your Helm repository list :
~ » helm repo add kedacore https://kedacore.github.io/charts
~ » helm repo update
Then create a namespace dedicated to it, you can change the name if you want :
~ » kubectl create ns keda
And finally install it :
~ » helm install keda kedacore/keda --namespace keda
That’s it, there is not much to do here. If you want to know more about KEDA, you can go and visit the documentation.
How to install and configure Kafka
The same way we did with KEDA, we will add the Confluent to our helm repositories :
~ » helm repo add confluentinc https://confluentinc.github.io/cp-helm-charts
~ » helm repo update
Then install it, we will have to specify some values here :
~ » helm install kafka --set cp-schema-registry.enabled=false,cp-kafka-rest.enabled=false,cp-kafka-connect.enabled=false,dataLogDirStorageClass=default,dataDirStorageClass=default,storageClass=default confluentinc/cp-helm-charts
Monitor your deployments with the following command :
~ » kubectl get deployment
Wait for Kafka to be up and running, then deploy a small container containing a Kafka client to create a new topic.
~ » kubectl apply -f deploy/kafka-client
~ » kubectl exec -it kafka-client -- /bin/bash
While in the container, run the following commands. Change the topic name to fit your use case:
~ » kafka-topics \
--zookeeper kafka-cp-zookeeper-headless:2181 \
--topic twitter \
--create \
--partitions 5 \
--replication-factor 1 \
--if-not-exists~ » exit
Build and deploy function
This step will require a container registry, make sure you are logged onto one before processing with the following. You will also need the Azure functions core tools, which is an open source model for serverless application deployments. It is basically a wrapper for the docker build, kubectl apply commands and can be used to run the function locally, deploy your function onto Azure or deploy it in your own Kubernetes cluster.
Thus, here is how to proceed :
~ » func kubernetes deploy --name twitter-function --registry myregistry --typescript
It is quite simple, isn’t it?
Retrieve tweets and send them to Kafka
As I previously said, you will need a developer Twitter account at this stage. To create one, go to https://developer.twitter.com/ and create one.
You will then be invited to create a project, I called mine demo and created an application inside. It will give you a API key and secret. Keep them somewhere close we will need those in a moment.
At the same time, you should request for an elevated account, otherwise you will not be able to pull any message. It is free of charge, you will be prompted for information about your project and then in a matter of minutes the access are granted. You should then have a sign next to your project saying that you are now in elevated mode.
Finally, you would have to generate an Oauthv1 access token and secret. By default now, Oauthv2 is preferred, so if you want to use version 1 you edit the User authentication settings and tick the Oauth 1.0a box. The save button will not be cliquable until you provide a callback URI and a website URL. This is the page your users will be redirected to once the Oauth process is completed. In our case, this information is not important and you can input whatever comes to your mind.
Now you should be able to generate your Oauthv1 access token and secret.
Modify the following file with your own values :
~ » vim deploy/twitter-to-kafka.yaml
Replace :
- CONSUMER_KEY by your API key
- CONSUMER_SECRET by your API secret
- ACCESS_TOKEN by your OAuthv1 access token
- ACCESS_TOKEN_SECRET by your OAuthv1 access token secret
- SEARCH_TERM by the search pattern of your choice
Then, apply the manifest :
~ » kubectl apply -f deploy/twitter-to-kafka.yaml
If you look into the logs, you can verify the messages are processed. Look for the name of one of the pods that is currently running :
~ » kubectl logs -f $(kubectl get pods | grep twitter-to-kafka | awk '{print $&}')
To see if the tweets are processed correctly, look into the logs of the twitter-function:
~ » kubectl logs -f $(kubectl get pods | grep twitter-function | awk '{print $&}')
You should see something like this in the log file. This example is with the pattern set to avengers :
info: Function.KafkaTwitterTrigger.User[0]
Tweet analyzed
Tweet text: RT @ballerguy: Yeah avengers endgame was good but I found out my boyfriend is a movie clapper so at what cost
Sentiment: 0.09523809523809523
info: Function.KafkaTwitterTrigger[0]
Executed 'Functions.KafkaTwitterTrigger' (Succeeded, Id=67cc49a3-0e13-4fa8-b605-a041ce37420a)
info: Host.Triggers.Kafka[0]
Stored commit offset twitter / [3] / 37119
You can also watch if the horizontal autoscaler is working, the number of replicas should be moving accordingly with the number of tweets. Obviously, if you chose a very popular pattern, the number of replicas will change more often and could grow very high.
~ » kubectl get hpa
By default, KEDA will trigger the creation of a new replicas every time the Kafka topic lag reach 5. To adjust this value, you can change the logThreshold value in the ScaledObject manifest.
apiVersion: keda.k8s.io/v1alpha1
kind: ScaledObject
metadata:
name: twitter-function
namespace: default
labels:
deploymentName: twitter-function
spec:
scaleTargetRef:
deploymentName: twitter-function
triggers:
- type: kafka
metadata:
type: kafkaTrigger
direction: in
name: event
topic: twitter
brokerList: kafka-cp-kafka-headless.default.svc.cluster.local:9092
consumerGroup: functions
dataType: binary
lagThreshold: '5'
You can also adpat the minimum and maximum number of replicas accordingly with your resources, then add the minReplicaCount and maxReplicaCount value. You can also change the pollingInterval and cooldownPeriod to react more quickly if your search pattern is in the top tweets list.
apiVersion: keda.k8s.io/v1alpha1
kind: ScaledObject
metadata:
name: twitter-function
namespace: default
labels:
deploymentName: twitter-function
spec:
scaleTargetRef:
deploymentName: twitter-function
pollingInterval: 30 # Optional. Default: 30 seconds
cooldownPeriod: 300 # Optional. Default: 300 seconds
minReplicaCount: 0 # Optional. Default: 0
maxReplicaCount: 100 # Optional. Default: 100
triggers:
- type: kafka
metadata:
type: kafkaTrigger
direction: in
name: event
topic: twitter
brokerList: kafka-cp-kafka-headless.default.svc.cluster.local:9092
consumerGroup: functions
dataType: binary
lagThreshold: '5'
And here you are, you have built a serverless application. Not too difficult, wasn’t it?
What next?
Well, you have plenty of choices. For sure adding a data visualization layer would be a plus. For example, you can send the result to a PowerBI data stream then build a dashboard to map the countries where the tweets are originated in relation to the overall rating there. The function implement this it will be part of another article I will write.
Conclusion
Well done, you should now be able to replicate and adapt this to other use cases than tweet processing.
Remember that serverless application doesn’t suit every needs. Even though it is promoted by AWS through the combinaision of API Gateway and Lambda functions, serverless is not very well suited to request/reply use cases. Regarding KEDA, it needs to be tweaked if you intend to process longer than 30 minutes batches, because it was not designed for lightweight applications. Generally speaking, the technology must fit into a design that matches your architectural drivers, so choose carefully when to apply serverless.
So that’s it, please feel free to send me your questions or feedback on this topic. Keep posted, I’ll come soon with a new article regarding data visualization.