Kinesis

About

  • It's a managed, serverless, proprietary offering from AWS.

  • It makes it easy to collect, process and analyze streaming data in real time (~200 ms latency with shared consumers, ~70 ms with enhanced fan-out).

  • Real-time data can be application logs, metrics, website clickstreams, or IoT telemetry data.

  • There are 4 services that make up Kinesis:

    • Kinesis Data Stream

    • Kinesis Data Firehose

    • Kinesis Data Analytics

    • Kinesis Video Streams

  • Supports replay capability.

Kinesis Data Stream

About

  • Helps to capture, process and store data streams.

  • Primary use case is big data ingestion and real-time processing.

  • A stream is made up of multiple shards, and shards must be provisioned ahead of time.

  • Shards define the stream capacity in terms of ingestion and consumption rates.

Kinesis Record

  • A Kinesis Record consists of a Partition Key, a Sequence Number (unique per partition key per shard), and a Data Blob of up to 1 MB.

  • The Partition Key decides which shard the record will be stored in.

  • Hence the partition key should be highly distributed, to avoid the hot partition problem.

Producers

  • Producers send data (Kinesis Records) to Kinesis Data Streams. Producers can be applications, the Kinesis Producer Library (KPL), or the Kinesis Agent. At a low level, they all rely on the AWS SDK.

  • A producer produces Kinesis Records that get ingested into the Kinesis Data Stream.

  • Producers can send records (write throughput) at a rate of 1 MB/sec or 1,000 records/sec per shard.

  • The PutRecord API is used to insert a single record into Kinesis.

  • Use batching with the PutRecords API to reduce cost and increase throughput (see the sketch after this list).

  • The Partition Key of a Kinesis Record goes through a hash function, which determines the shard the record goes to.

  • Overproducing records leads to a ProvisionedThroughputExceeded exception. To avoid this problem:

    • Use a highly distributed partition key OR

    • Retry with exponential backoff OR

    • Scale out by increasing shards
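
A minimal producer sketch in Python with boto3, combining PutRecords batching with exponential-backoff retries of the throttled entries. The stream name, payload shape, and device_id partition key are illustrative assumptions:

```python
import json
import time

import boto3

kinesis = boto3.client("kinesis")

def put_batch(records, stream_name="demo-stream", max_retries=5):
    """Send records with PutRecords, retrying throttled entries with backoff."""
    entries = [
        # A highly distributed partition key spreads records across shards.
        {"Data": json.dumps(r).encode(), "PartitionKey": r["device_id"]}
        for r in records
    ]
    for attempt in range(max_retries):
        resp = kinesis.put_records(StreamName=stream_name, Records=entries)
        if resp["FailedRecordCount"] == 0:
            return
        # PutRecords preserves order, so zip pairs each request entry with
        # its result; keep only the entries that failed (e.g. throttled).
        entries = [e for e, r in zip(entries, resp["Records"]) if "ErrorCode" in r]
        time.sleep(0.1 * 2 ** attempt)  # exponential backoff
    raise RuntimeError(f"{len(entries)} records still failing after {max_retries} retries")
```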

Consumers

  • Consumers read data from the Kinesis Data Stream, e.g. via the AWS SDK or the Kinesis Client Library (KCL).

  • There are various consumption modes,

    • Shared (classic) Fan-out Consumer

      • 2 MB/sec per shard, shared across all consumers.

      • This uses the GetRecords() API (see the polling sketch after this section).

      • Uses Pull model.

      • Useful when there is a low number of consuming applications.

      • Maximum of 5 GetRecords() API calls/sec per shard.

      • Latency is approx 200ms.

      • Less costly option

      • A call returns up to 10 MB (after which the shard throttles for 5 seconds) or up to 10,000 records.

    • Enhanced Fan-out Consumer (EFO)

      • 2 MB/sec per shard per consumer.

      • This uses SubscribeToShard() API.

      • Uses Push model over HTTP/2 connection.

      • Latency is approx 70ms.

      • Higher cost

      • Suited for multiple consuming applications on the same stream.

      • Soft Limit of 5 Consumer applications per data stream by default.

  • Consumers can be Kinesis Data Firehose, Lambda, applications using the KCL, Kinesis Data Analytics, etc.

  • Retention period can be 1 day to 365 days.

  • Once data is inserted in KDS, it can't be deleted (it is immutable until it expires).

  • It is possible to replay the data.

  • Data inserted with the same partition key goes to the same shard, and thus maintains key-based ordering.
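
A minimal shared (classic) consumer sketch in Python with boto3, polling a single shard of an assumed stream. A real application would normally use the KCL to coordinate multiple shards and workers:

```python
import time

import boto3

kinesis = boto3.client("kinesis")
stream = "demo-stream"  # assumed stream name

# Read from the first shard, starting at the oldest available record.
shard_id = kinesis.list_shards(StreamName=stream)["Shards"][0]["ShardId"]
iterator = kinesis.get_shard_iterator(
    StreamName=stream,
    ShardId=shard_id,
    ShardIteratorType="TRIM_HORIZON",
)["ShardIterator"]

while iterator:
    resp = kinesis.get_records(ShardIterator=iterator, Limit=1000)
    for record in resp["Records"]:
        print(record["PartitionKey"], record["Data"])
    iterator = resp.get("NextShardIterator")  # None once the shard is closed
    time.sleep(0.2)  # stay under the 5 GetRecords() calls/sec/shard limit
```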

Lambda as Consumer Integration

  • Supports both Classic and Enhanced Fan-out consumers.

  • Read records in batches.

  • Batch size and Batch window can be configured.

  • In case of errors, Lambda retries until it succeeds or the data expires.

  • Can process up to 10 batches per shard simultaneously.
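
A minimal sketch of what the Lambda handler side looks like: Kinesis delivers a batch of records whose payloads arrive base64-encoded. The JSON payload shape is an assumption:

```python
import base64
import json

def handler(event, context):
    # Each invocation receives one batch of records from one shard.
    for record in event["Records"]:
        payload = base64.b64decode(record["kinesis"]["data"])
        print(record["kinesis"]["partitionKey"], json.loads(payload))
```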

Kinesis Client Library (KCL)

  • A Java library that helps read records from a Kinesis Data Stream, with distributed applications sharing the read workload.

  • Each shard is to be read by only one KCL instance.

    • 4 shards -> maximum 4 KCL instances

    • 6 shards -> maximum 6 KCL instances

  • Progress is checkpointed into DynamoDB (needs IAM Access).

  • Workers track each other and share the work among shards using DynamoDB.

  • KCL can run on EC2 instances, Elastic Beanstalk, on-premises servers, and so on.

  • Records are read in order at the shard level.

  • Versions:

    • KCL 1.x (supports shared fan-out consumers)

    • KCL 2.x (supports shared and enhanced fan-out consumers)

  • When there are more shards than KCL instances, the KCL detects this and splits the work among the instances (coordinated through the DynamoDB table).

Capacity modes

  • Provisioned mode (historic mode)

    • You choose the number of shards provisioned, and scale manually or using the API.

    • Each shard gets 1 MB/sec (or 1,000 records/sec) of input.

    • Each shard gets 2 MB/sec of output, shared across classic consumers or per enhanced fan-out consumer.

    • You pay per shard provisioned per hour.

  • On-Demand mode

    • No need to provision or manage capacity.

    • Default capacity provisioned is 4 MB/sec or 4,000 records/sec of write throughput.

    • Scales automatically based on observed throughput peak during last 30 days.

    • At max it offers a write capacity of 200 MiB/sec and 200,000 records/sec, and a read capacity of 400 MiB/sec. Supports up to 20 consumers when using the EFO consumption mode.

    • Pricing: pay per stream per hour plus data in/out per GB.
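
A minimal sketch of creating a stream in each capacity mode with boto3 (stream names are illustrative):

```python
import boto3

kinesis = boto3.client("kinesis")

# Provisioned mode: you choose the shard count up front.
kinesis.create_stream(
    StreamName="demo-provisioned",
    ShardCount=4,
    StreamModeDetails={"StreamMode": "PROVISIONED"},
)

# On-demand mode: no shard count; capacity scales automatically.
kinesis.create_stream(
    StreamName="demo-on-demand",
    StreamModeDetails={"StreamMode": "ON_DEMAND"},
)
```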

Kinesis Operations

  • There are two resharding operations:

Shard Splitting

  • Used to split a shard into two.

  • This operation is typically used to increase the capacity of the stream.

  • Used to divide a hot shard and increase capacity; as capacity increases, so does cost.

  • The old shard is closed and its data expires as per the retention policy, after which the shard is deleted.

  • There is no direct auto-scaling support in Kinesis; scaling has to be managed manually, or an auto-scaling setup has to be architected separately.

  • A shard can't be split into more than two shards in a single operation (see the resharding sketch after Shard Merging).

Shard Merging

  • Just as shards can be split, they can also be merged when multiple shards have low traffic.

  • Only two shards with adjacent hash key ranges (cold shards) can be merged in a single operation.

  • This operation reduces cost and capacity.

  • Once merged, the old shards are deleted after their data expires as per the retention policy.
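
A minimal resharding sketch with boto3 covering both operations: split a shard at the midpoint of its hash key range, and merge two adjacent shards. The stream name and shard IDs are illustrative assumptions:

```python
import boto3

kinesis = boto3.client("kinesis")
stream = "demo-stream"

# Split: NewStartingHashKey must fall strictly inside the parent's range.
shard = kinesis.list_shards(StreamName=stream)["Shards"][0]
lo = int(shard["HashKeyRange"]["StartingHashKey"])
hi = int(shard["HashKeyRange"]["EndingHashKey"])
kinesis.split_shard(
    StreamName=stream,
    ShardToSplit=shard["ShardId"],
    NewStartingHashKey=str((lo + hi) // 2),
)

# Merge: the two shards must cover adjacent hash key ranges.
kinesis.merge_shards(
    StreamName=stream,
    ShardToMerge="shardId-000000000001",        # hypothetical IDs
    AdjacentShardToMerge="shardId-000000000002",
)
```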

Security

  • Control access/authorization using IAM policies.

  • Encryption in flight using HTTPS endpoints.

  • Encryption at rest using KMS.

  • Can implement encryption/decryption of data on client side.

  • VPC Endpoints are available for Kinesis, so it can be accessed from within a VPC without going over the public internet.

  • Can monitor all API calls using CloudTrail.

Kinesis Data Firehose (earlier known as Delivery Streams)

About

  • Fully managed, near-real-time, auto-scaling, serverless service used to deliver data streams into AWS data stores.

  • Data stream producers can be a Kinesis Data Stream, CloudWatch Logs, or AWS IoT.

  • No replay capability and no data storage.

  • It can optionally do a data transformation.

  • Only need to pay for data going through Firehose.

  • Once data is received from producers, it can be written in batches into destinations.

  • Destinations can be:

    • AWS Destinations

      • S3

      • Redshift (it first writes data to S3, then uses a COPY command)

      • OpenSearch

    • Third-Party Partner Integrations

      • New Relic

      • MongoDB

      • DataDog

      • Dynatrace

      • Splunk, etc.

    • Custom Destinations

      • HTTP Endpoint

  • All delivered data, or only the data that failed delivery, can be sent to an S3 bucket as a backup.

  • When sending data to a destination, the data can be encrypted and compressed.

  • Near-real-time: data is written with a buffer interval of 60 to 900 seconds, with a default of 300 seconds.

  • Buffer size is a minimum of 1 MB and a maximum of 128 MB, with a default of 5 MB.

  • Supports many data formats, conversions, transformations, compression.

  • Supports custom data transformation using Lambda (a sketch follows).
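
A minimal sketch of a Firehose transformation Lambda. Firehose invokes it with a batch of base64-encoded records, and each record must be returned with its recordId, a result status, and the re-encoded transformed data. The JSON payload shape is an assumption:

```python
import base64
import json

def handler(event, context):
    output = []
    for record in event["records"]:
        payload = json.loads(base64.b64decode(record["data"]))
        payload["processed"] = True  # example transformation
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",  # or "Dropped" / "ProcessingFailed"
            "data": base64.b64encode(json.dumps(payload).encode()).decode(),
        })
    return {"records": output}
```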

Kinesis Data Analytics

About

  • Fully managed, auto-scaling service; pay as you go, only for data going through Kinesis Data Analytics.

  • Analyze data streams with SQL or Apache Flink.

For SQL Applications

  • It can read from Kinesis Data Streams or Kinesis Data Firehose.

  • Possible to join reference data from S3 to enrich data in real time.

  • Data can be output to the following sinks:

    • Kinesis Data Streams

    • Kinesis Data Firehose

      • From there it can be sent to S3, OpenSearch, Redshift, or other destinations as mentioned before.

  • Use cases include:

    • Time-series analytics

    • Real-time metrics

    • Real-time dashboards

For Apache Flink

  • Use Flink to process and analyze streaming data.

  • Data can be sourced from Kinesis Data Stream or Amazon MSK.

  • With this service one can run any Flink Application on a managed cluster on AWS.

  • It provides automatic provisioning of compute resources, parallel computation and automatic scaling.

  • Application backups (implemented as checkpoints and snapshots).

  • It does not support reading from Kinesis Data Firehose (use Kinesis Data Analytics for SQL instead).

Kinesis Video Streams

About

  • Capture, process and store video streams.
