sankalp's blog

19th September

CDC: Loved this video on change data capture

For CDC, you can connect Flink to the relevant DB via a connector; the connector reads the database's change log (e.g. the MySQL binlog) and streams those changes into Flink.
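a rough sketch of what that wiring can look like with PyFlink's SQL API, assuming the flink mysql-cdc connector jar is on the classpath (the hostname, credentials and the orders table below are just placeholders):

```python
# Sketch: reading a MySQL table's change log into Flink via the flink-cdc connector.
# Assumes the flink-sql-connector-mysql-cdc jar is on Flink's classpath;
# host/db/table names and credentials below are placeholders.
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Register a table backed by the MySQL binlog (CDC source).
t_env.execute_sql("""
    CREATE TABLE orders_cdc (
        order_id BIGINT,
        amount DECIMAL(10, 2),
        PRIMARY KEY (order_id) NOT ENFORCED
    ) WITH (
        'connector' = 'mysql-cdc',
        'hostname' = 'localhost',
        'port' = '3306',
        'username' = 'flink',
        'password' = 'secret',
        'database-name' = 'shop',
        'table-name' = 'orders'
    )
""")

# Every insert/update/delete on shop.orders now arrives as a changelog row.
t_env.execute_sql("SELECT * FROM orders_cdc").print()
```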

Jordan introduces Kafka and Flink into the architecture to address the potential performance limitations of HBase, specifically its write throughput, and to improve message delivery reliability. Here's a breakdown of their roles:

Kafka: acts as a durable, high-throughput buffer in front of HBase, absorbing the incoming message writes and holding them until they can be consumed.

Flink: consumes the messages from Kafka, handles the processing and delivery logic, and writes to HBase at a rate it can keep up with.

In summary:

This architecture effectively balances the strengths of each component, allowing for high message throughput and reliable delivery while maintaining the read optimization benefits of HBase.

18th September: finally learnt consistent hashing, watched whatsapp system design

16th, 17th September: leetcode

15th September

studying chapter 3 (storage and retrieval) of ddia

hash index

let's say you are appending rows to a log file. instead of scanning the entire file on every read, you can keep a hash index, which is simply a hashmap / key-value store in memory. you record each key and its byte offset (i.e. the byte in the file where the record starts). then when reading, you look up that offset and jump straight to that location.
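a toy sketch of this in Python (the file name and comma-separated record format are made up; real stores use a binary format):

```python
# Toy hash index over an append-only log file: the in-memory dict maps each key
# to the byte offset where its latest record starts, so reads can seek directly.
import os

class LogWithHashIndex:
    def __init__(self, path):
        self.path = path
        self.index = {}          # key -> byte offset of the most recent record
        open(path, "a").close()  # make sure the file exists

    def put(self, key, value):
        with open(self.path, "ab") as f:
            f.seek(0, os.SEEK_END)
            offset = f.tell()                       # offset where this record begins
            f.write(f"{key},{value}\n".encode())
        self.index[key] = offset                    # overwrite: latest record wins

    def get(self, key):
        offset = self.index.get(key)
        if offset is None:
            return None
        with open(self.path, "rb") as f:
            f.seek(offset)                          # jump straight to the record
            line = f.readline().decode().rstrip("\n")
        return line.split(",", 1)[1]

db = LogWithHashIndex("data.log")
db.put("user:1", "alice")
db.put("user:1", "alice_v2")   # old record stays in the log; index points at the new one
print(db.get("user:1"))        # -> alice_v2
```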

Bitcask

The above is what Bitcask does. It offers high performance for reads and writes, subject to the requirement that all keys fit in the available RAM.

If you only ever append to a log file, you will eventually run out of disk space. Solution: close the segment when it reaches a certain size and start writing to a new one, then perform compaction (i.e. throw away duplicate keys in the log, keeping only the most recent update for each key).
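a minimal compaction sketch over the same toy record format as above:

```python
# Sketch of compaction: scan a closed segment front to back, keep only the latest
# value per key, and write a new, smaller segment.
def compact(segment_path, compacted_path):
    latest = {}                                  # key -> most recent value
    with open(segment_path) as f:
        for line in f:
            key, value = line.rstrip("\n").split(",", 1)
            latest[key] = value                  # later records overwrite earlier ones
    with open(compacted_path, "w") as f:
        for key, value in latest.items():
            f.write(f"{key},{value}\n")

# e.g. compact("data.log", "data.compacted.log")
```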

append-only logs are an efficient write strategy because appending to a file is the simplest possible write operation (purely sequential I/O).

indexes are additional data structures, derived from the primary data, that help optimize reads.

the main con of indexes is that they slow down writes

any kind of index usually slows down writes, because the index also needs to be updated every time data is written. well-chosen indexes speed up read queries, but every index slows down writes. for this reason, databases don't usually index everything by default; they require you, the application developer, to choose indexes manually, using your knowledge of the application's typical query patterns.

14th September

Watched a bit on NoSQL vs SQL. Watched a video on choosing DB design by Jordan.

Watching a video on column-oriented storage

main difference is how data is stored on disk. In column-oriented storage, all the values for a column are stored together (one file or contiguous region per column). Can be useful for analytics when you want all the data for a particular field (min, max, avg, etc.); the data is closer / contiguous, so scans are faster.
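a tiny illustration of the difference, with the same (made-up) records laid out row-wise vs column-wise:

```python
# Same three records laid out row-wise vs column-wise. A scan that only needs
# "amount" touches one contiguous block in the columnar layout.
rows = [
    {"id": 1, "country": "IN", "amount": 40},
    {"id": 2, "country": "US", "amount": 75},
    {"id": 3, "country": "IN", "amount": 60},
]

# Row-oriented: each record's fields are stored together.
row_store = [(r["id"], r["country"], r["amount"]) for r in rows]

# Column-oriented: each column's values are stored together.
column_store = {
    "id":      [r["id"] for r in rows],
    "country": [r["country"] for r in rows],
    "amount":  [r["amount"] for r in rows],
}

# Analytics query: average amount only has to read one column.
print(sum(column_store["amount"]) / len(column_store["amount"]))  # ~58.33
```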

you can also do column compression; it works best when the values in a column are similar (e.g. a small set of ints or strings). he gave examples using bitmap encoding, then run-length encoding of the bitmaps (storing the lengths of alternating runs of 0s and 1s).
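a toy version of those two steps in Python (made-up column data):

```python
# Bitmap encoding for a low-cardinality column, then run-length encoding of each
# bitmap (store the lengths of the alternating runs of 0s and 1s).
from itertools import groupby

country = ["IN", "US", "IN", "IN", "US", "IN"]

# One bitmap per distinct value: 1 where the row has that value, else 0.
bitmaps = {
    value: [1 if c == value else 0 for c in country]
    for value in set(country)
}

def run_length_encode(bits):
    # Emit (bit, run length) pairs, e.g. [1, 0, 0, 1, 1, 1] -> [(1, 1), (0, 2), (1, 3)]
    return [(bit, len(list(run))) for bit, run in groupby(bits)]

for value, bits in bitmaps.items():
    print(value, bits, "->", run_length_encode(bits))
```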

Benefits of column compression

  1. less data to send over the network to a server
  2. can keep more data in memory and in the CPU cache

Predicate pushdown

Predicate pushdown involves moving the filtering operation (the predicate) from the query processor down to the storage layer or even to the data source itself.

In simple words, some metadata like min/max (and sometimes avg) is stored with each file or row group. So for a query like "select data from table where column > 60", all files whose max is <= 60 can be skipped entirely.
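a toy sketch of that skipping logic, with made-up per-file min/max metadata:

```python
# Sketch of min/max pruning: per-file metadata lets the query skip files that
# cannot possibly contain rows matching "column > 60".
files = {
    "part-0.parquet": {"min": 10, "max": 55, "rows": [10, 42, 55]},
    "part-1.parquet": {"min": 30, "max": 90, "rows": [30, 61, 90]},
    "part-2.parquet": {"min": 70, "max": 99, "rows": [70, 85, 99]},
}

threshold = 60
result = []
for name, meta in files.items():
    if meta["max"] <= threshold:       # no row here can satisfy "> 60": skip the file
        print(f"skipping {name}")
        continue
    result.extend(v for v in meta["rows"] if v > threshold)  # still filter the rest

print(result)  # [61, 90, 70, 85, 99]
```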

Downside of column oriented storage

[screenshot from "Column Oriented Storage (with Parquet!)", Systems Design Interview: 0 to 1 with Ex-Google SWE, 12:10]


Watching this

partition: the "queue". an ordered, immutable sequence of messages that we append to, like a log file. these are physical things, like separate files.

topic: a logical grouping of partitions. a topic can have multiple partitions.
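a toy in-memory model of a topic with key-based routing to partitions (it mirrors Kafka's default hash(key) % num_partitions behaviour, but is just a sketch with made-up messages):

```python
# Toy model of a topic: N partitions, each an append-only list, with messages
# routed by key so all messages for one key land in (and stay ordered within)
# the same partition.
NUM_PARTITIONS = 3
topic = [[] for _ in range(NUM_PARTITIONS)]   # one "log" per partition

def send(key, value):
    partition = hash(key) % NUM_PARTITIONS    # same key -> same partition
    topic[partition].append((key, value))
    return partition

for i in range(5):
    send("chat:42", f"message {i}")           # all of chat:42's messages stay ordered

print(topic)
```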

[screenshot from "Kafka Deep Dive w/ an Ex-Meta Staff Engineer", 8:22]

Kafka - a distributed streaming platform that serves three key functions: publishing and subscribing to streams of records, storing them durably, and processing them as they arrive.

Key features:

partitioning: topics are divided into partitions for parallelism and scalability

retention: configurable data retention policies allow kafka to store data for a specified period

What Are Exactly-Once Semantics?

Exactly-Once Semantics (EOS) is a guarantee provided by distributed data processing systems that ensures each record or message is processed exactly one time—no more, no less. This means that:

No Duplicates: A message won't be processed multiple times, preventing duplicate entries.

No Missed Messages: Every message sent to the system will be processed, ensuring no data loss.

In high-throughput systems like whatsapp, which handle billions of messages, maintaining data integrity is crucial.
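one simplified building block for this is an idempotent consumer that remembers which message ids it has already applied, so a redelivered message is not applied twice (real Kafka EOS also uses idempotent producers and transactions; this is just a sketch with made-up ids):

```python
# Idempotent consumer sketch: dedupe by message id so at-least-once delivery
# still produces exactly-once effects on the application state.
processed_ids = set()
inbox_counts = {}

def handle(message_id, user, text):
    if message_id in processed_ids:        # duplicate delivery: ignore it
        return
    processed_ids.add(message_id)
    inbox_counts[user] = inbox_counts.get(user, 0) + 1   # the actual "effect"

handle("m-1", "alice", "hi")
handle("m-1", "alice", "hi")               # retried by the producer/broker
handle("m-2", "alice", "hello again")
print(inbox_counts)                        # {'alice': 2}, not 3
```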

Checkpointing is Flink's mechanism for achieving fault tolerance and exactly-once processing guarantees. It involves periodically taking snapshots of the application's state and of the read positions in the input streams (like Kafka topic offsets) to allow recovery in case of failures.

How Does Checkpointing Work?

Snapshot Creation: Flink periodically captures the state of all operators (tasks) in the data processing pipeline. This includes any state stored in memory, such as windowed aggregations or keyed state.

Barrier Alignment: During checkpointing, Flink injects barriers into the data streams. These barriers ensure a consistent point across all input streams where the snapshot is taken.

State Persistence: The snapshots are stored in a durable storage system, such as HDFS, S3, or another reliable distributed filesystem.

Operator State Management: Each operator (e.g., map, filter, sink) records its state as part of the snapshot. This includes information necessary to resume processing without duplication or loss.

Failure Recovery: If a failure occurs, Flink uses the latest successful checkpoint to restore the state of the application. Processing resumes from this checkpoint, ensuring that each message is processed exactly once.
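a minimal sketch of turning this on in a PyFlink job (the interval, mode, and checkpoint path are placeholders, and it assumes a reasonably recent Flink version where these config calls exist):

```python
# Sketch: enabling checkpointing for exactly-once processing in a PyFlink job.
from pyflink.datastream import StreamExecutionEnvironment, CheckpointingMode

env = StreamExecutionEnvironment.get_execution_environment()

# Take a snapshot of all operator state every 10 seconds.
env.enable_checkpointing(10_000)

# Barriers are aligned so each record is counted in exactly one checkpoint.
env.get_checkpoint_config().set_checkpointing_mode(CheckpointingMode.EXACTLY_ONCE)

# Persist snapshots to durable storage so a restarted job can restore from them
# (the bucket path is a placeholder).
env.get_checkpoint_config().set_checkpoint_storage_dir("s3://my-bucket/flink-checkpoints")
```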