Strategies To Handle Errors While Using Kinesis With Lambda

This blog discusses strategies for handling Lambda errors when the function is triggered by a Kinesis stream.
Aman Kumar
Oct 19, 2022 | 7 min read

Introduction

An event-driven architecture uses events to trigger and communicate between decoupled services and is common in modern applications built with microservices. This post is intended for readers who are building event-driven architectures with microservices. It will help you understand:

  • How AWS Lambda errors and retries work
  • The properties of a Lambda event source mapping and how to use them
  • How to handle different Lambda error scenarios

Let's take the example of a microservice built on a serverless platform. Consider a simple architecture in which a Lambda function is invoked by events that a producer publishes to a Kinesis stream.

Diagram Of Architecture

lambda_kinesis_architecture

Lambda Error Scenarios

Lambda's retry behavior, and how it processes events after a failure, is a common source of confusion. Because of the "in order" guarantee, a failed batch is retried until it succeeds or its data records expire. Let's take a look at the different types of errors that can occur.

  • Lambda Invocation Error:

    An invocation error occurs in the following cases:

    • When there is an issue with the request being made.
    • When the caller does not have permission to invoke the Lambda function.
    • When the available concurrency is exhausted because the function's instances are already serving other requests.

    In the above scenarios, lambda returns an error and a status code.

  • Lambda Function Error:

    A function error occurs in the following cases:

    • When the function code throws an exception.
    • When the function runs out of time while processing a request, or when the runtime detects a syntax error.

    In the above scenarios, Lambda accepts the request and reports the function's failure in the response.
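The difference between the two error types can be sketched in a few lines of Python. This is a minimal illustration, not an official client: `classify_invoke_result` is a hypothetical helper, and it assumes the caller already has the HTTP status code of the Invoke call and the value of the function-error indicator (the `X-Amz-Function-Error` header, exposed as `FunctionError` in a boto3 response), if any.

```python
from typing import Optional


def classify_invoke_result(status_code: int, function_error: Optional[str] = None) -> str:
    """Classify the outcome of a Lambda Invoke call (hypothetical helper)."""
    # Invocation errors: the request is rejected before the function runs,
    # for example with a 4xx status for bad requests, missing permissions,
    # or throttling.
    if status_code >= 400:
        return "invocation error"
    # Function errors: the call itself is accepted, but the response is
    # flagged with a function-error indicator such as "Unhandled".
    if function_error:
        return "function error"
    return "success"


print(classify_invoke_result(429))               # throttled request
print(classify_invoke_result(200, "Unhandled"))  # handler raised an exception
print(classify_invoke_result(200))               # normal completion
```

With a Kinesis trigger, the event source mapping makes these calls on our behalf, so the distinction mostly matters when reading logs and metrics.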

Event Source Mapping (ESM)

An event source mapping is a Lambda resource that reads from an event source and invokes a Lambda function. We can use it to process items from a queue or a stream. In our case, we will use an ESM to process items from a Kinesis stream.

Properties Of ESM

  • Stream – The stream from which it will read the data/records.
  • Batch size – Number of records to send to the function in each batch.
  • Batch window – Specifies the time for which ESM will wait to gather the records before invoking the function.
  • Starting position – Specifies the point in the stream from which the ESM starts processing records.
    • Latest – Process new records that are added to the stream.
    • Trim horizon – Process all records that are present in the stream.
    • At timestamp – Process records starting from a specific time.
  • On-Failure Destination – A destination where we can send the details about the batch that can’t be processed. It can be an SQS (Simple Queue Service) queue or an SNS (Simple Notification Service) topic.
  • Retry attempts – The maximum number of times that lambda retries processing the batch when the function returns an error.
  • Split batch on error – Splits the batch into two before retrying when the function is unable to process the entire batch.
  • Report batch item failure – A function response type that lets the function return the sequence number of the faulty record in a batch.
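The properties above map onto the parameters of the `create_event_source_mapping` call in boto3. The following is a minimal sketch that only builds the parameter dictionary: `build_esm_config` is a hypothetical helper, the ARNs and function name are placeholders, and the "At timestamp" starting position (which also needs `StartingPositionTimestamp`) is left out for brevity.

```python
from typing import Any, Dict, Optional


def build_esm_config(
    stream_arn: str,
    function_name: str,
    batch_size: int = 5,
    batch_window_seconds: int = 0,
    starting_position: str = "LATEST",
    max_retry_attempts: int = -1,
    bisect_on_error: bool = False,
    report_item_failures: bool = False,
    on_failure_arn: Optional[str] = None,
) -> Dict[str, Any]:
    """Build keyword arguments for create_event_source_mapping (hypothetical helper)."""
    if starting_position not in {"LATEST", "TRIM_HORIZON"}:
        raise ValueError("this sketch only supports LATEST and TRIM_HORIZON")
    config: Dict[str, Any] = {
        "EventSourceArn": stream_arn,                            # Stream
        "FunctionName": function_name,
        "BatchSize": batch_size,                                 # Batch size
        "MaximumBatchingWindowInSeconds": batch_window_seconds,  # Batch window
        "StartingPosition": starting_position,                   # Starting position
        "MaximumRetryAttempts": max_retry_attempts,              # Retry attempts (-1 = infinite)
        "BisectBatchOnFunctionError": bisect_on_error,           # Split batch on error
    }
    if report_item_failures:
        # Report batch item failure
        config["FunctionResponseTypes"] = ["ReportBatchItemFailures"]
    if on_failure_arn:
        # On-failure destination (SQS queue or SNS topic)
        config["DestinationConfig"] = {"OnFailure": {"Destination": on_failure_arn}}
    return config


# Placeholder ARNs and names, for illustration only.
config = build_esm_config(
    stream_arn="arn:aws:kinesis:us-east-1:123456789012:stream/example-stream",
    function_name="example-function",
    max_retry_attempts=1,
    bisect_on_error=True,
    report_item_failures=True,
    on_failure_arn="arn:aws:sqs:us-east-1:123456789012:example-dlq",
)
print(config)
```

With AWS credentials in place, the dictionary could be passed on as `boto3.client("lambda").create_event_source_mapping(**config)`.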

Scenarios:

In this section, we will talk about a specific scenario in which the batch contains a poison pill (a single malformed record that can block processing on an entire shard).

Suppose a record in the batch is corrupted in such a way that it cannot be processed. Let us see how different ESM configurations behave in this scenario. We assume that the code handling the event is idempotent. For this case, we fix the batchSize to 5 and assume that the 4th record is the poison pill. The batch is represented below:

records_with_poison_pill

Configuration 1:

  • BatchWindow: 5

  • MaxRetryAttempts: -1 (infinite)

    In this scenario, Lambda starts processing the batch, processes the first three records successfully, and fails at the 4th. Because the retry attempts are left at -1 (infinite), Lambda retries the entire batch until the records expire. As a result, records 1 to 3 are processed again on every retry, but because our code is idempotent, this does not have a major impact on our resources. As you can see below, the complete batch is retried indefinitely.

    infinite-retry

Configuration 2:

  • BatchWindow: 5

  • MaxRetryAttempts: 1

    In this scenario, Lambda successfully processes the first three records and fails at the 4th. Since maxRetryAttempts is set to 1, Lambda retries the entire batch one more time. As a result, the batch is attempted twice in total, and after the second failure it is discarded.

    twice-retry
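The whole-batch retry described in Configurations 1 and 2 can be simulated locally. This is a simplified model under stated assumptions, not Lambda's implementation: `run_with_whole_batch_retries` and `fail_on_fourth` are hypothetical names, records are plain integers, and the infinite setting (-1) is not modelled.

```python
from collections import Counter
from typing import Callable, Sequence


def run_with_whole_batch_retries(
    records: Sequence[int],
    handler: Callable[[int], None],
    max_retry_attempts: int,
) -> Counter:
    """Count how often each record reaches the handler (simplified model)."""
    touched: Counter = Counter()
    # One initial attempt plus the configured number of retries.
    for _attempt in range(1 + max_retry_attempts):
        try:
            for record in records:
                touched[record] += 1
                handler(record)
        except Exception:
            continue  # the whole batch is retried from its first record
        break  # the batch succeeded, so no further retries are needed
    return touched


def fail_on_fourth(record: int) -> None:
    """Stand-in handler: record 4 is the poison pill."""
    if record == 4:
        raise ValueError("poison pill")


counts = run_with_whole_batch_retries([1, 2, 3, 4, 5], fail_on_fourth, max_retry_attempts=1)
print(dict(counts))
```

In this model, records 1 to 4 each reach the handler twice, while record 5 is never reached because every attempt stops at the poison pill.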

Configuration 3:

  • BatchWindow: 5

  • MaxRetryAttempts: 1

  • Bisect: true

    In this case, Lambda successfully processes the first three records and fails at the 4th. Since we have configured the split-batch-on-error option of the ESM, Lambda bisects the batch into two smaller batches and processes each of them. It bisects repeatedly until the poison pill is isolated and then retries it until the maximum number of retry attempts is reached.

    bisect-and-retry

This is a good method for isolating the poison pill, but it has one drawback: it splits the batch into two, and if the poison pill is in the second half of the batch, the records in the first half are reprocessed even though they already succeeded. We can avoid this reprocessing by configuring one more property of the ESM.
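The bisect-and-retry behavior, including the reprocessing drawback, can be illustrated with a small simulation. It is only a sketch under assumptions: the split point (the middle of the batch), the retry accounting, and the names `process_with_bisect`, `try_batch` and `poison_at_four` are hypothetical and are not taken from Lambda's actual implementation.

```python
from collections import Counter
from typing import Callable, Sequence


def try_batch(batch: Sequence[int], handler: Callable[[int], None], touched: Counter) -> bool:
    """Run the handler over one batch and report whether the batch succeeded."""
    try:
        for record in batch:
            touched[record] += 1
            handler(record)
        return True
    except Exception:
        return False


def process_with_bisect(
    batch: Sequence[int],
    handler: Callable[[int], None],
    max_retry_attempts: int,
    touched: Counter,
) -> None:
    """Simplified model: split a failed batch in half until the bad record is alone."""
    if try_batch(batch, handler, touched):
        return
    if len(batch) > 1:
        mid = len(batch) // 2
        # Halves are handled in order, so the first half runs again even if
        # all of its records already succeeded in the failed attempt.
        process_with_bisect(batch[:mid], handler, max_retry_attempts, touched)
        process_with_bisect(batch[mid:], handler, max_retry_attempts, touched)
        return
    # A single failing record: retry it, then give up and move on.
    for _retry in range(max_retry_attempts):
        if try_batch(batch, handler, touched):
            return


def poison_at_four(record: int) -> None:
    """Stand-in handler: record 4 is the poison pill."""
    if record == 4:
        raise ValueError("poison pill")


touched: Counter = Counter()
process_with_bisect([1, 2, 3, 4, 5], poison_at_four, max_retry_attempts=1, touched=touched)
print(dict(touched))
```

In this model, record 5 is eventually processed once, but records 1 and 2 are handled twice and record 3 three times, which is the reprocessing cost described above.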

Configuration 4:

  • BatchWindow: 5

  • MaxRetryAttempts: 1

  • Bisect: true

  • batchItemFailures: Enabled

    In this scenario, Lambda successfully processes the first three records and fails at the 4th. We have configured both split-batch-on-error and batchItemFailures on the ESM. The latter requires a change to the function code: the event handler wraps the record processing in a try-catch block and, when an exception occurs, returns the sequence number of the failed record. This optimizes the batch bisection: Lambda now bisects the batch at the failed record, so the records that were already processed successfully do not need to be reprocessed.

    report-bisect-retry
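The function code change mentioned in Configuration 4 could look like the following Python handler. It is a minimal sketch, assuming the records carry base64-encoded JSON; `process_record` and its "id" check are hypothetical business logic, while the `batchItemFailures` / `itemIdentifier` response shape is the one Lambda expects when the ESM has `ReportBatchItemFailures` enabled.

```python
import base64
import json
from typing import Any, Dict


def process_record(payload: Dict[str, Any]) -> None:
    """Hypothetical business logic: reject payloads that have no 'id' field."""
    if "id" not in payload:
        raise ValueError("record has no id")


def handler(event: Dict[str, Any], context: Any) -> Dict[str, Any]:
    """Process Kinesis records in order and report the first failed record."""
    for record in event["Records"]:
        sequence_number = record["kinesis"]["sequenceNumber"]
        try:
            # Kinesis delivers the record data base64-encoded.
            payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
            process_record(payload)
        except Exception:
            # Stop at the first failure and return its sequence number so
            # that the records before it are not retried.
            return {"batchItemFailures": [{"itemIdentifier": sequence_number}]}
    # An empty list tells Lambda that the whole batch succeeded.
    return {"batchItemFailures": []}
```

The handler returns early on purpose: because records in a shard are processed in order, everything after the first failed record has to be retried anyway.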

For all of the above configurations, we can configure one more property of the ESM: an on-failure destination (refer to the image below). When a batch cannot be processed, Lambda sends the details about it to an SQS queue or an SNS topic (based on the configuration), so that we do not lose the information.

lambda-kinesis-destination
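A consumer of the on-failure destination might read the failure details as sketched below. The field names (`requestContext.condition`, `KinesisBatchInfo.shardId`, `startSequenceNumber`, `endSequenceNumber`, `batchSize`) reflect my understanding of the message Lambda sends for Kinesis sources and should be verified against a real message; `summarize_failure` and the sample values are hypothetical.

```python
import json
from typing import Any, Dict


def summarize_failure(message_body: str) -> Dict[str, Any]:
    """Extract the fields needed to locate a failed batch (hypothetical helper)."""
    failure = json.loads(message_body)
    batch = failure.get("KinesisBatchInfo", {})
    return {
        # Why the batch was given up on, e.g. "RetryAttemptsExhausted".
        "reason": failure.get("requestContext", {}).get("condition"),
        "shard_id": batch.get("shardId"),
        "first_sequence_number": batch.get("startSequenceNumber"),
        "last_sequence_number": batch.get("endSequenceNumber"),
        "record_count": batch.get("batchSize"),
    }


# Made-up sample message with shortened sequence numbers, for illustration only.
sample_body = json.dumps({
    "requestContext": {"condition": "RetryAttemptsExhausted", "approximateInvokeCount": 2},
    "KinesisBatchInfo": {
        "shardId": "shardId-000000000000",
        "startSequenceNumber": "4959",
        "endSequenceNumber": "4963",
        "batchSize": 5,
    },
})
print(summarize_failure(sample_body))
```

As far as I know, the message describes where the batch sits in the stream rather than carrying the record payloads, so the records themselves would have to be read back from the stream before they expire.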

Event-Driven Architecture At Medly

We at Medly use microservices to build event-driven architecture. Different Medly teams are also considering the configurations suggested in this blog for handling lambda errors in the case of poison pills.

Conclusion

Every configuration has pros and cons, and no fixed error handling mechanism fits all. There are many factors to consider when determining the error handling configuration, including Lambda invocations, poison pill probability, upstream service failure probability, etc.

We hope you find this blog helpful and that it gives you a better understanding of how to handle Lambda errors with the Kinesis trigger. I would also like to extend my sincere gratitude to Niraj Palecha for his guidance and assistance in writing this blog.

Stay tuned as we keep bringing you more interesting tech-related blogs. If you want to learn more about Medly, please visit our website at Medly.com.