19th September
CDC: Loved this video on change data capture
For CDC, you can connect Flink via connectors to the relevant DB; the connector reads the DB's change log (e.g. the write-ahead log / binlog) and picks up every committed change from it.
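A minimal sketch of the idea in Python (not the real Flink CDC connector API; the file name and event format here are made up): conceptually, a CDC connector just tails the database's append-only change log and emits each committed change as an event.

```python
import json
import time

def tail_change_log(path):
    """Follow an append-only change log, yielding one change event at a time.

    This simulates what a CDC connector does: instead of polling tables,
    it reads the database's own log of committed changes, in order.
    """
    with open(path, "r") as f:
        while True:
            line = f.readline()
            if not line:
                time.sleep(0.5)  # nothing new yet; poll the log again
                continue
            yield json.loads(line)  # e.g. {"op": "INSERT", "table": "chat_members", ...}

# A downstream job (Kafka producer, Flink operator, ...) would react to each event:
# for event in tail_change_log("changes.jsonl"):
#     route(event)
```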
WhatsApp / FB Messenger message processing using Kafka and Flink
Good question! Jordan introduces Kafka and Flink into the architecture to address the potential performance limitations of HBase, specifically regarding its write throughput, and to improve message delivery reliability. Here's a breakdown of their roles:
Kafka:
- High Write Throughput: Kafka is a distributed message broker designed to handle high volumes of data in real-time. By buffering messages in Kafka first, we avoid the potential bottleneck of directly writing every message to HBase, which might be slow due to its single-leader replication model.
- Durability and Reliability: Kafka is highly durable and fault-tolerant. Messages are replicated across multiple brokers, ensuring that even if a broker fails, messages are not lost. This improves the reliability of message delivery, as even if HBase is temporarily unavailable, messages are safely stored in Kafka until Flink can process them.
Flink:
- Stream Processing: Flink is a powerful stream processing framework that enables complex data processing tasks in real-time. It consumes messages from Kafka and performs necessary operations before writing to HBase.
- Idempotency: Flink helps ensure idempotent writes to HBase by assigning a unique identifier (UUID) to each message. This way, if a message is accidentally processed multiple times (due to network issues or Flink restarts), HBase can use the UUID to identify duplicates and prevent duplicate rows (see the sketch after the summary below).
- Message Routing: Flink reads changes in the "Chat Members" table (Change Data Capture) and routes messages to the appropriate chat servers. This ensures that each user only receives messages for the chats they are a member of.
In summary:
- Kafka acts as a high-throughput, durable buffer for incoming messages, decoupling the message ingestion process from the slower write operations in HBase.
- Flink provides stream processing capabilities to handle message routing, ensure idempotency, and perform any additional processing before storing the messages in HBase.
This architecture effectively balances the strengths of each component, allowing for high message throughput and reliable delivery while maintaining the read optimization benefits of HBase.
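A toy sketch of the idempotency point above. The `IdempotentSink` class is made up for illustration; it stands in for an HBase table where the message's UUID is the row key, so a replayed message overwrites the same row instead of creating a duplicate.

```python
import uuid

class IdempotentSink:
    """Toy stand-in for an HBase table where the message UUID is the row key."""

    def __init__(self):
        self.rows = {}  # row key -> message body

    def write(self, message):
        # Writing the same UUID twice just overwrites the same row,
        # so retries and replays can't create duplicates.
        self.rows[message["id"]] = message["body"]

sink = IdempotentSink()
msg = {"id": str(uuid.uuid4()), "body": "hello"}
sink.write(msg)
sink.write(msg)  # replayed after a Flink restart: still only one row
assert len(sink.rows) == 1
```

Note the UUID has to be assigned once, upstream, and travel with the message; if it were regenerated on each retry, the dedup would break.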
18th September
Finally learnt consistent hashing; watched WhatsApp system design.
16th, 17th September -> leetcode
15th September
Studying chapter 3 (Storage and Retrieval) of DDIA.
hash index
Let's say you are appending rows to a log file. Instead of scanning the entire file on every read, you can use a hash index, which is simply a hashmap / key-value store in memory. You record each key and its byte offset (i.e. the byte in the file where its record starts). Then when reading, you can look up that offset and jump straight to that location.
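A minimal sketch in Python (the class name and "key,value" record format are made up for illustration): an append-only log with an in-memory hash index mapping each key to the byte offset of its latest record.

```python
class HashIndexedLog:
    """Append-only log file with an in-memory hash index of key -> byte offset."""

    def __init__(self, path):
        self.path = path
        self.index = {}           # key -> offset of the latest record for that key
        open(path, "ab").close()  # make sure the log file exists

    def put(self, key, value):
        with open(self.path, "ab") as f:
            offset = f.tell()     # byte where this record will start
            f.write(f"{key},{value}\n".encode())
        self.index[key] = offset  # the index only remembers the newest offset

    def get(self, key):
        with open(self.path, "rb") as f:
            f.seek(self.index[key])  # jump straight to the record, no full scan
            line = f.readline().decode().rstrip("\n")
        return line.split(",", 1)[1]

db = HashIndexedLog("data.log")
db.put("user:1", "alice")
db.put("user:1", "alicia")  # newer value appended; index now points at it
print(db.get("user:1"))     # -> alicia
```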
Bitcask
The above is what Bitcask does. It offers high performance for reads and writes, subject to the requirement that all keys fit in the available RAM.
If you are only ever appending to one log file, you will eventually run out of disk space. Solution -> break the log into segments: when a segment reaches a certain size, start writing to a new one. Then perform compaction (i.e. throw away duplicate keys in the log, keeping only the most recent update for each key).
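A tiny sketch of compaction over one segment, representing each record as a "key,value" string for illustration:

```python
def compact(segment_lines):
    """Keep only the latest value per key; later lines win (the log is append-only)."""
    latest = {}
    for line in segment_lines:
        key, value = line.split(",", 1)
        latest[key] = value  # a later write for the same key overwrites the earlier one
    return [f"{k},{v}" for k, v in latest.items()]

segment = ["cat,1", "dog,2", "cat,3", "cat,4"]
print(compact(segment))  # ['cat,4', 'dog,2']: duplicate updates for "cat" thrown away
```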
Logs are a very efficient write strategy because appending to a file is the simplest write operation, and sequential writes are fast.
indexes are additional data structures derived from the data to help in optimized reads.
Con of indexes: they slow down writes.
Any kind of index usually slows down writes, because the index also needs to be updated every time data is written. Well-chosen indexes speed up read queries, but every index slows down writes. For this reason, databases don't usually index everything by default, but require you, the application developer, to choose indexes manually, using your knowledge of the application's typical query patterns.
14th September
Watched a bit on NoSQL vs SQL. Watched a video by Jordan on choosing DB design.
Watching a video on wide column storage
Main difference is how data is stored on disk. In column-oriented storage, all the data for a single column is stored together (contiguously, one file per column). Can be useful for analytics when you want all the data for a particular field: min, max, avg, etc. The data is closer / contiguous, so reads are faster.
You can also do column compression; the data within a column needs to be similar (e.g. all ints). He gave an example using bitmap encoding followed by run-length encoding (storing the running counts of alternating 0s and 1s).
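A small sketch of that bitmap -> run-length encoding idea (function names are mine):

```python
def bitmap(column, value):
    """Bitmap for one distinct value: 1 where the column equals it, else 0."""
    return [1 if v == value else 0 for v in column]

def run_length_encode(bits):
    """Store the lengths of alternating runs, starting with the count of leading 0s."""
    runs, current, count = [], 0, 0
    for bit in bits:
        if bit == current:
            count += 1
        else:
            runs.append(count)
            current, count = bit, 1
    runs.append(count)
    return runs

column = [3, 3, 7, 3, 7, 7, 7, 3]
bits = bitmap(column, 7)        # [0, 0, 1, 0, 1, 1, 1, 0]
print(run_length_encode(bits))  # [2, 1, 1, 3, 1]: two 0s, one 1, one 0, three 1s, one 0
```

With long runs (common when the column is sorted), those counts compress far better than the raw bitmap.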
Benefits of column compression
- less data to send over the network to a server
- can fit more of the data in the CPU cache
Predicate pushdown
Predicate pushdown involves moving the filtering operation (the predicate) from the query processor down to the storage layer or even to the data source itself.
In simple words, some metadata, like the min/max (or avg) of each column, is stored per file. So for a query like "select data from table where column > 60", all files whose max for that column is <= 60 can be skipped entirely.
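A toy sketch of that skipping logic (the file layout here is invented for illustration):

```python
# Per-file metadata: min/max of the column, plus the rows themselves.
files = [
    {"name": "part-0", "min": 10, "max": 55, "rows": [12, 40, 55]},
    {"name": "part-1", "min": 20, "max": 90, "rows": [20, 61, 90]},
    {"name": "part-2", "min": 5,  "max": 60, "rows": [5, 33, 60]},
]

def scan_where_greater_than(files, threshold):
    """Push the predicate down: skip any file whose max can't satisfy it."""
    results = []
    for f in files:
        if f["max"] <= threshold:
            continue  # no row in this file can match, so we never read it
        results.extend(v for v in f["rows"] if v > threshold)
    return results

print(scan_where_greater_than(files, 60))  # [61, 90]; part-0 and part-2 were skipped
```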
Downside of column oriented storage
- Every column must have the same sort order
- Each write needs to go to multiple different places (every column file)
Watching this
partition: the "queue". An ordered, immutable sequence of messages that we append to, like a log file. These are physical things, like separate files.
topic: a logical grouping of partitions. A topic can have multiple partitions.
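A toy sketch of how messages get routed to partitions by key (real Kafka's default partitioner hashes the serialized key with murmur2, but the idea is the same):

```python
import hashlib

NUM_PARTITIONS = 3
topic = [[] for _ in range(NUM_PARTITIONS)]  # a topic = a group of partition "logs"

def publish(key, message):
    """Hash the key to pick a partition, then append (like appending to a log file)."""
    p = int(hashlib.md5(key.encode()).hexdigest(), 16) % NUM_PARTITIONS
    topic[p].append(message)

publish("chat:42", "hi")
publish("chat:42", "how are you?")  # same key -> same partition -> order preserved
publish("chat:7", "hello")
print([len(p) for p in topic])
```

Ordering is only guaranteed within a partition, which is why all messages for one chat should share a key.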
Kafka - distributed streaming platform that serves three key functions
- Publish and subscribe to streams of records (similar to a message queue)
- Store streams of records in a fault-tolerant, durable way
- Process streams of records as they occur
Key Features
- partitioning: topics are divided into partitions for parallelism and scalability
- scalability: Kafka can handle large volumes of data with high throughput and low latency
- durability: data is written to disk and replicated across the cluster to prevent data loss
- retention: configurable data retention policies allow Kafka to store data for a specified period
What Are Exactly-Once Semantics?
Exactly-Once Semantics (EOS) is a guarantee provided by distributed data processing systems that ensures each record or message is processed exactly one time—no more, no less. This means that:
No Duplicates: A message won't be processed multiple times, preventing duplicate entries.
No Missed Processes: Every message sent to the system will be processed, ensuring no data loss.
In high-throughput systems like WhatsApp, handling billions of messages, maintaining data integrity is crucial:
- Avoiding duplicates => you don't want identical messages appearing twice
- Ensuring data consistency => Users expect consistent and reliable messaging experiences without data loss or corruption
- Analytics pov => duplicate or lost data can skew results
Apache Flink's checkpointing mechanism
Checkpointing is Flink's mechanism for achieving fault tolerance and exactly-once processing guarantees. It involves periodically taking snapshots of the application's state and of the positions in the data streams (like Kafka offsets), to allow recovery in case of failures.
How Does Checkpointing Work?
Snapshot Creation: Flink periodically captures the state of all operators (tasks) in the data processing pipeline. This includes any state stored in memory, such as windowed aggregations or keyed state.
Barrier Alignment: During checkpointing, Flink injects barriers into the data streams. These barriers ensure a consistent point across all input streams where the snapshot is taken.
State Persistence: The snapshots are stored in a durable storage system, such as HDFS, S3, or any other reliable distributed filesystem.
Operator State Management: Each operator (e.g., map, filter, sink) records its state as part of the snapshot. This includes information necessary to resume processing without duplication or loss.
Failure Recovery: If a failure occurs, Flink uses the latest successful checkpoint to restore the state of the application. Processing resumes from this checkpoint, ensuring that each message is processed exactly once.
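A toy model of the snapshot-and-replay idea in Python (not Flink's actual implementation; the class is made up). The key point: operator state and the input offset are snapshotted together, so on recovery the stream is replayed from exactly the point the restored state corresponds to.

```python
import copy

class ToyPipeline:
    def __init__(self, stream):
        self.stream = stream       # stands in for a Kafka partition
        self.offset = 0            # position in the input stream
        self.state = {"count": 0}  # operator state, e.g. a running aggregate
        self.checkpoint = None

    def take_checkpoint(self):
        # Snapshot state AND offset together; Flink would persist this to HDFS/S3.
        self.checkpoint = (self.offset, copy.deepcopy(self.state))

    def recover(self):
        # Roll back to the last consistent snapshot and replay from its offset.
        self.offset, self.state = self.checkpoint[0], copy.deepcopy(self.checkpoint[1])

    def process_next(self):
        self.state["count"] += self.stream[self.offset]
        self.offset += 1

p = ToyPipeline([1, 2, 3, 4])
p.process_next(); p.process_next()
p.take_checkpoint()  # saved: offset=2, count=3
p.process_next()     # count=6... and then the job crashes
p.recover()          # restore offset=2, count=3
p.process_next(); p.process_next()
print(p.state["count"])  # 10: every element counted exactly once despite the failure
```

Barrier alignment (above) is what makes the state/offset pair consistent across many parallel input streams; with a single stream, as here, it is trivially consistent.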