These multiple-choice questions explore key concepts in Big Data processing paradigms, including real-time processing with tools such as Apache Storm and Flink, streaming data processing with Kafka and Spark Streaming, and batch processing with MapReduce and Hive. They are ideal for understanding scalable data pipelines.
1. What is the primary goal of real-time processing in Big Data?
Correct Answer: b) Immediate data ingestion and analysis
Explanation:
Real-time processing focuses on low-latency handling of data as it arrives, enabling instant insights and decisions.
2. Which framework is designed for distributed real-time computation?
Correct Answer: b) Apache Storm
Explanation:
Apache Storm processes unbounded streams of data in real-time, supporting topologies for continuous computation.
3. In streaming data, what does 'exactly-once' semantics guarantee?
Correct Answer: b) Each record is processed once without loss or duplication
Explanation:
Exactly-once semantics ensures that, even after failures and retries, each input record affects the results exactly once: nothing is lost and nothing is counted twice.
4. What is Apache Kafka primarily used for in streaming?
Correct Answer: b) Distributed event streaming and messaging
Explanation:
Kafka acts as a scalable pub-sub system for handling real-time data feeds with durability.
5. Batch processing in Big Data is best suited for:
Correct Answer: b) Large-scale historical data analysis
Explanation:
Batch processing handles massive datasets offline, optimizing for throughput over latency.
6. Which component in Storm represents a stream processing pipeline?
Correct Answer: c) Topology
Explanation:
A Storm topology is a graph of spouts and bolts defining the data flow for real-time processing.
7. In Spark Streaming, data is processed in:
Correct Answer: b) Micro-batches
Explanation:
Spark Streaming discretizes streams into small batches (DStreams) for near-real-time processing.
8. What is the role of a Kafka Producer?
Correct Answer: b) Publishes messages to topics
Explanation:
Producers send data records to Kafka topics, which are then replicated across brokers.
9. MapReduce is a model for:
Correct Answer: b) Distributed batch processing
Explanation:
MapReduce breaks jobs into map and reduce phases for parallel processing of large datasets.
10. Apache Flink supports which processing type natively?
Correct Answer: c) Both batch and stream
Explanation:
Flink unifies batch (bounded streams) and stream processing with stateful computations.
11. In streaming, what is a 'window'?
Correct Answer: b) Time or count-based aggregation interval
Explanation:
Windows group streaming data for computations like sums over sliding or tumbling periods.
12. Hive is primarily for batch processing using:
Correct Answer: a) SQL queries on HDFS
Explanation:
Hive translates SQL to MapReduce or Tez jobs for batch analysis of structured data.
13. What does a Kafka Consumer do?
Correct Answer: b) Subscribes to and processes topics
Explanation:
Consumers poll messages from topics, often in groups for load balancing.
14. In real-time processing, latency is measured in:
Correct Answer: b) Milliseconds to seconds
Explanation:
Low latency (ms-s) distinguishes real-time from batch (minutes-hours).
15. Apache Samza processes streams using:
Correct Answer: b) Kafka for storage and YARN for execution
Explanation:
Samza uses Kafka for messaging and for the changelog topics that back up local state, and YARN to run its distributed tasks.
16. Batch processing often uses which architecture?
Correct Answer: b) MapReduce on Hadoop
Explanation:
Hadoop's MapReduce is the classic batch framework for fault-tolerant processing.
17. What is a 'spout' in Storm?
Correct Answer: b) Stream source
Explanation:
Spouts emit tuples into the topology, sourcing data from queues or APIs.
18. In streaming data, 'at-least-once' delivery means:
Correct Answer: a) Possible duplicates
Explanation:
At-least-once ensures delivery but may retry, causing duplicates.
19. Apache Beam is a unified model for:
Correct Answer: b) Batch and streaming
Explanation:
Beam's portable pipelines run on runners like Flink or Dataflow for both paradigms.
20. What is the Mapper phase in batch processing?
Correct Answer: b) Processes input into key-value pairs
Explanation:
Mappers filter and transform raw input independently.
21. Kafka topics are divided into:
Correct Answer: b) Partitions
Explanation:
Partitions enable parallelism and scalability in Kafka streams.
22. In Flink, 'checkpoints' are used for:
Correct Answer: b) Fault tolerance via state snapshots
Explanation:
Checkpoints periodically save state for exactly-once recovery.
23. Pig Latin is a scripting language for:
Correct Answer: b) Batch ETL on Hadoop
Explanation:
Pig simplifies data transformation for MapReduce jobs.
24. What is 'backpressure' in streaming?
Correct Answer: b) Handling overload by throttling producers
Explanation:
Backpressure signals upstream to slow down, preventing downstream failures.
25. The Reduce phase in MapReduce performs:
Correct Answer: b) Aggregation on grouped data
Explanation:
Reducers receive shuffled data grouped by key for summarization.
26. Apache Trident is an extension for Storm providing:
Correct Answer: b) Exactly-once semantics
Explanation:
Trident adds higher-level abstractions with transactional guarantees.
27. In streaming, tumbling windows are:
Correct Answer: b) Non-overlapping fixed intervals
Explanation:
Tumbling windows process data in discrete, non-overlapping time slots.
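To make the idea concrete, here is a minimal sketch of tumbling-window aggregation in plain Python. The function name and the `(timestamp, value)` event shape are assumptions made for illustration; real engines such as Flink or Spark expose their own window APIs.

```python
from collections import defaultdict

def tumbling_window_sums(events, window_size):
    """Sum values per non-overlapping window of `window_size` time units.

    `events` is an iterable of (timestamp, value) pairs (hypothetical shape).
    """
    sums = defaultdict(int)
    for timestamp, value in events:
        # Every timestamp falls into exactly one window, identified by its start.
        window_start = (timestamp // window_size) * window_size
        sums[window_start] += value
    return dict(sorted(sums.items()))

events = [(1, 10), (4, 5), (10, 7), (12, 3), (25, 1)]
result = tumbling_window_sums(events, window_size=10)
print(result)  # {0: 15, 10: 10, 20: 1}
```

Because the windows never overlap, each event contributes to exactly one sum.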
28. YARN in Hadoop manages resources for:
Correct Answer: b) Batch and interactive workloads
Explanation:
YARN decouples resource management from job scheduling for multi-tenancy.
29. What is a Kafka Broker?
Correct Answer: b) Server in the Kafka cluster
Explanation:
Brokers store and manage topic partitions, handling replication.
30. Sliding windows in streaming have:
Correct Answer: b) Overlapping intervals
Explanation:
Sliding windows advance by a slide duration, overlapping for smoother aggregations.
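The overlap can be sketched by computing which windows a single event belongs to. This is an illustrative helper, not any engine's API; it assumes windows of the form [start, start + size), starts aligned to multiples of the slide, and time beginning at 0.

```python
def sliding_window_starts(timestamp, size, slide):
    """Return the start times of all windows [start, start + size) that contain `timestamp`."""
    starts = []
    # Latest aligned window start that is not after the timestamp.
    start = (timestamp // slide) * slide
    while start > timestamp - size:
        if start >= 0:  # assumption: no windows start before time 0
            starts.append(start)
        start -= slide
    return sorted(starts)

print(sliding_window_starts(12, size=10, slide=5))   # [5, 10]
print(sliding_window_starts(12, size=10, slide=10))  # [10]
```

With a slide smaller than the size, one event lands in several windows; when the slide equals the size, the result degenerates to a tumbling window.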
31. Apache Tez optimizes batch processing with:
Correct Answer: a) DAG execution
Explanation:
Tez executes complex workflows as DAGs, reducing MapReduce overhead.
32. In Storm, a 'bolt' is:
Correct Answer: b) Processing unit
Explanation:
Bolts transform, filter, or aggregate streams from spouts.
33. Consumer groups in Kafka enable:
Correct Answer: b) Load-balanced parallel consumption
Explanation:
Groups allow multiple consumers to share partitions for scalability.
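A rough sketch of how partitions might be spread over the members of a group follows. It uses a simple round-robin rule chosen for illustration; it is not the code of Kafka's built-in assignors, and the consumer names are hypothetical.

```python
def assign_partitions(partitions, consumers):
    """Distribute partitions over consumers round-robin (illustrative rule only)."""
    ordered = sorted(consumers)
    assignment = {consumer: [] for consumer in ordered}
    for index, partition in enumerate(sorted(partitions)):
        owner = ordered[index % len(ordered)]
        assignment[owner].append(partition)
    return assignment

two_members = assign_partitions(range(6), ["c1", "c2"])
three_members = assign_partitions(range(6), ["c1", "c2", "c3"])
print(two_members)    # {'c1': [0, 2, 4], 'c2': [1, 3, 5]}
print(three_members)  # {'c1': [0, 3], 'c2': [1, 4], 'c3': [2, 5]}
```

Each partition has exactly one owner within the group, and adding a member changes the mapping, which is the effect a rebalance has.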
34. What is the Shuffle phase in MapReduce?
Correct Answer: b) Grouping and transferring data to reducers
Explanation:
Shuffle sorts and sends mapped output by key across the network.
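The map, shuffle, and reduce phases can be sketched in a single process with a word count. The function names are illustrative assumptions; a real Hadoop job distributes these steps across machines.

```python
from collections import defaultdict

def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle_phase(pairs):
    """Group values by key and sort the keys, as the shuffle does before reducing."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(sorted(grouped.items()))

def reduce_phase(key, values):
    """Aggregate all values that share a key."""
    return key, sum(values)

lines = ["big data big pipelines", "data streams"]
mapped = [pair for line in lines for pair in map_phase(line)]
grouped = shuffle_phase(mapped)
counts = dict(reduce_phase(key, values) for key, values in grouped.items())
print(counts)  # {'big': 2, 'data': 2, 'pipelines': 1, 'streams': 1}
```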
35. Flink's state management supports:
Correct Answer: b) Keyed and operator state for computations
Explanation:
State backends such as RocksDB store keyed state for consistent processing.
36. Session windows in streaming are based on:
Correct Answer: b) Gaps in data activity
Explanation:
Session windows group data with inactivity timeouts for variable durations.
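A minimal sketch of gap-based sessionization is shown below. It assumes a convention in which a new session starts when the gap between consecutive events exceeds `gap`; engines differ on the exact boundary handling, and the function name is hypothetical.

```python
def session_windows(timestamps, gap):
    """Group event timestamps into sessions separated by more than `gap` of inactivity."""
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= gap:
            sessions[-1].append(ts)   # still within the current session
        else:
            sessions.append([ts])     # inactivity gap exceeded: open a new session
    return sessions

sessions = session_windows([1, 2, 3, 10, 11, 30], gap=5)
print(sessions)  # [[1, 2, 3], [10, 11], [30]]
```

Unlike tumbling or sliding windows, the resulting windows have no fixed length; their duration depends on the activity pattern.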
37. Apache Sqoop is for batch transfer between:
Correct Answer: b) RDBMS and Hadoop
Explanation:
Sqoop uses MapReduce for efficient bulk import/export to HDFS.
38. What is 'watermarking' in streaming?
Correct Answer: b) Handling late data with event time
Explanation:
Watermarks indicate progress in event time, closing windows for out-of-order data.
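The following sketch shows one common watermark strategy, bounded out-of-orderness, in simplified form. The watermark trails the largest timestamp seen so far by a fixed bound, and an event is treated as late if its timestamp is already behind the watermark. The function name and the strictness of the comparison are assumptions made for illustration.

```python
def classify_by_watermark(timestamps, max_out_of_orderness):
    """Split events (in arrival order) into on-time and late using a trailing watermark."""
    max_seen = float("-inf")
    on_time, late = [], []
    for ts in timestamps:
        # Watermark: "no event older than this is expected any more".
        watermark = max_seen - max_out_of_orderness
        if ts < watermark:
            late.append(ts)
        else:
            on_time.append(ts)
        max_seen = max(max_seen, ts)
    return on_time, late

on_time, late = classify_by_watermark([10, 12, 11, 20, 13, 19], max_out_of_orderness=5)
print(on_time)  # [10, 12, 11, 20, 19]
print(late)     # [13]
```

Event 11 arrives out of order but within the bound, so it is still on time; event 13 arrives after the watermark has passed 15 and is late.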
39. In batch processing, fault tolerance is achieved via:
Correct Answer: b) Data replication and task retry
Explanation:
Hadoop retries failed tasks and uses replicated input for reliability.
40. Kafka Streams API is for:
Correct Answer: b) Lightweight stream processing
Explanation:
It builds topologies for transformations directly on Kafka topics.
41. What is a 'tuple' in Storm?
Correct Answer: a) Data unit in streams
Explanation:
Tuples are named lists of values flowing through the topology.
42. In Flink, DataStream API handles:
Correct Answer: b) Unbounded streams
Explanation:
DataStream processes continuous, potentially infinite data flows.
43. The Combiner in MapReduce runs on:
Correct Answer: b) Mapper output locally
Explanation:
Combiners pre-aggregate to reduce shuffle data volume.
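The effect on shuffle volume can be sketched by counting the records a mapper would emit with and without local pre-aggregation. The helper name is an illustrative assumption.

```python
from collections import Counter

def map_with_combiner(split):
    """Map one input split and pre-aggregate its (word, 1) pairs locally."""
    return list(Counter(split.split()).items())

split = "a b a a b c"
without_combiner = [(word, 1) for word in split.split()]
with_combiner = map_with_combiner(split)

print(len(without_combiner))  # 6 records would cross the network
print(with_combiner)          # [('a', 3), ('b', 2), ('c', 1)]
```

This only works because summation is associative and commutative; an operation such as a plain average cannot be combined this way without carrying extra state.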
44. What is 'offset' in Kafka?
Correct Answer: b) Sequential ID in a partition
Explanation:
Offsets track consumer progress in log partitions.
45. Apache Spark's Structured Streaming uses:
Correct Answer: b) DataFrames for declarative streams
Explanation:
It models streams as unbounded tables with SQL-like operations.
46. In batch processing, 'job' refers to:
Correct Answer: b) Complete workflow from input to output
Explanation:
A job encompasses the full MapReduce execution.
47. Storm's Trident supports:
Correct Answer: a) Micro-batch processing
Explanation:
Trident batches tuples for transactional and stateful operations.
48. What is 'chaining' in stream processing?
Correct Answer: a) Linking multiple operations
Explanation:
Chaining composes transformations into a pipeline for efficiency.
49. Hive's partitioning in batch processing:
Correct Answer: b) Organizes data by columns for faster queries
Explanation:
Partitioning prunes irrelevant data during scans.
50. Kafka's log compaction retains:
Correct Answer: b) Latest value per key
Explanation:
Compaction enables changelog semantics for stateful apps.
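A simplified sketch of what compaction leaves behind is shown below. It assumes records are `(key, value)` pairs and treats a `None` value as a tombstone that removes the key outright; Kafka itself keeps tombstones for a configurable period before dropping them.

```python
def compact(log):
    """Keep only the most recent record per key, preserving original offsets."""
    latest = {}
    for offset, (key, value) in enumerate(log):
        latest[key] = (offset, value)   # later records overwrite earlier ones
    survivors = [
        (offset, key, value)
        for key, (offset, value) in latest.items()
        if value is not None            # simplification: tombstoned keys vanish immediately
    ]
    return sorted(survivors)

log = [("u1", "a"), ("u2", "b"), ("u1", "c"), ("u2", None), ("u3", "d")]
compacted = compact(log)
print(compacted)  # [(2, 'u1', 'c'), (4, 'u3', 'd')]
```

Replaying the compacted log rebuilds the latest state per key, which is what makes it usable as a changelog.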
51. In Flink, 'side outputs' are for:
Correct Answer: b) Emitting late data separately
Explanation:
Side outputs handle out-of-order or special events without blocking.
52. What is 'speculative execution' in batch processing?
Correct Answer: a) Running backup tasks for slow ones
Explanation:
It launches duplicates to mitigate stragglers in MapReduce.
53. Spark Streaming's DStream is:
Correct Answer: a) Sequence of RDDs
Explanation:
DStreams represent discretized streams as RDD chains.
54. In streaming, 'event time' vs 'processing time':
Correct Answer: b) Event time is generation timestamp
Explanation:
Event time reflects source time, handling delays unlike processing time.
55. Apache Oozie coordinates:
Correct Answer: b) Batch workflows
Explanation:
Oozie orchestrates Hadoop jobs like Hive and Pig in DAGs.
56. What is a 'sink' in stream processing?
Correct Answer: b) Output destination
Explanation:
Sinks write processed data to stores like HDFS or databases.
57. In MapReduce, InputFormat defines:
Correct Answer: b) How to read input data
Explanation:
InputFormat splits files and provides RecordReaders for records.
58. Kafka's replication factor ensures:
Correct Answer: b) Data durability across brokers
Explanation:
It copies partitions to multiple brokers for fault tolerance.
59. Flink's Table API provides:
Correct Answer: a) SQL-like operations on streams
Explanation:
It unifies relational queries over dynamic tables from streams.
60. Batch processing's strength is:
Correct Answer: b) High throughput for large volumes
Explanation:
It excels at processing terabytes efficiently, though with higher latency.
61. Storm's 'anchoring' ensures:
Correct Answer: b) Tuple acknowledgments for at-least-once
Explanation:
Anchoring tracks tuple lineages for failure recovery.
62. In streaming, 'join' types include:
Correct Answer: a) Stream-stream, stream-table
Explanation:
Joins enrich streams with reference data or merge co-streams.
63. Apache Flume is for:
Correct Answer: b) Streaming log collection
Explanation:
Flume aggregates and moves log data reliably into HDFS.
64. What is 'skew' in batch processing?
Correct Answer: b) Uneven data leading to hotspots
Explanation:
Skew causes some tasks to process more data, slowing jobs.
65. Kafka Connect integrates:
Correct Answer: a) Streams with external systems
Explanation:
Connect uses connectors for scalable data import/export.
66. In Flink, 'broadcast state' is for:
Correct Answer: a) Sharing configuration across keyed streams
Explanation:
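Broadcast state replicates low-volume data, such as rules or configuration, to every parallel task, where the keyed stream can read it but not modify it.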
67. Pig's LOAD statement reads data from:
Correct Answer: b) HDFS or local files
Explanation:
LOAD uses loaders for various formats into relations.
68. Streaming 'triggers' control:
Correct Answer: a) Window emission frequency
Explanation:
Triggers fire computations based on time, count, or conditions.
69. In MapReduce, the default Partitioner uses:
Correct Answer: b) Key hash for reducer assignment
Explanation:
HashPartitioner distributes by key hash modulo numReducers.
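The rule can be sketched as follows. Hadoop's implementation uses the Java key's `hashCode()`; this sketch substitutes `zlib.crc32` purely because it is stable across Python runs, so the concrete partition numbers will not match Hadoop's.

```python
import zlib

def partition_for(key, num_reducers):
    """Pick a reducer index from a key hash, mirroring (hash & MAX_INT) % numReducers."""
    key_hash = zlib.crc32(key.encode("utf-8"))   # stand-in for key.hashCode()
    # Masking keeps the value non-negative, as the Java code must do for signed ints.
    return (key_hash & 0x7FFFFFFF) % num_reducers

keys = ["user-1", "user-2", "user-3", "user-1"]
partitions = [partition_for(key, num_reducers=4) for key in keys]
print(partitions)
```

The same key always maps to the same reducer, which is what guarantees that all values for a key are reduced together.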
70. What is 'replay' in streaming?
Correct Answer: a) Reprocessing from stored logs
Explanation:
Replay enables re-computation for recovery or corrections.
71. Apache NiFi automates:
Correct Answer: b) Data flows in streaming pipelines
Explanation:
NiFi provides visual design for routing and transforming data.
72. In batch, 'combiners' are optional for:
Correct Answer: a) Reducing shuffle
Explanation:
They locally aggregate before shuffle, like mini-reducers.
73. Kafka's Zookeeper manages:
Correct Answer: a) Cluster metadata and leader election
Explanation:
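In ZooKeeper-based deployments, ZooKeeper stores topic and partition metadata and coordinates the brokers; newer Kafka releases replace it with the built-in KRaft quorum.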
74. Flink SQL supports:
Correct Answer: a) Stream and batch queries
Explanation:
It uses continuous queries for dynamic tables.
75. What is 'eviction' in streaming state?
Correct Answer: a) Removing old state
Explanation:
Eviction policies like time-to-live manage memory for windowed state.
76. Hive on Tez improves batch queries by:
Correct Answer: a) DAG optimization
Explanation:
Tez runs a HiveQL query as a single DAG instead of a chain of MapReduce jobs, which means fewer stages, less intermediate I/O, and better performance.
77. In Storm, 'guaranteed processing' uses:
Correct Answer: b) Acker bolts for reliability
Explanation:
Ackers track tuple trees for at-least-once delivery.
78. Streaming 'connectors' in Beam are for:
Correct Answer: a) Source/sink integrations
Explanation:
Connectors link pipelines to systems like Kafka or GCS.
79. MapReduce's fault tolerance relies on:
Correct Answer: b) Restarting failed tasks from lineage
Explanation:
Deterministic tasks allow re-execution from input splits.
80. What is 'fan-out' in streaming?
Correct Answer: b) Broadcasting to multiple consumers
Explanation:
Fan-out duplicates streams for parallel processing or routing.
81. Apache Pulsar is a:
Correct Answer: a) Messaging system with multi-tenancy
Explanation:
Pulsar separates compute from storage for scalable streaming.
82. In batch, 'vectorization' speeds up by:
Correct Answer: b) SIMD on columns
Explanation:
Vectorized execution processes batches of column values at a time, so the CPU can apply SIMD instructions instead of handling one row per call.
83. Kafka's 'idempotent producer' prevents:
Correct Answer: a) Duplicates
Explanation:
The broker uses a producer ID and per-partition sequence numbers to discard retried sends, so each record is written to a partition once.
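The deduplication idea can be sketched with a toy partition log that remembers the last sequence number accepted from each producer. The class and method names are hypothetical, and the real broker logic also rejects gaps in the sequence, which this sketch omits.

```python
class PartitionLog:
    """Toy partition that drops retried writes based on (producer_id, sequence)."""

    def __init__(self):
        self.records = []
        self.last_sequence = {}   # producer_id -> last accepted sequence number

    def append(self, producer_id, sequence, value):
        if sequence <= self.last_sequence.get(producer_id, -1):
            return False          # already written: this is a duplicate retry
        self.last_sequence[producer_id] = sequence
        self.records.append(value)
        return True

log = PartitionLog()
results = [
    log.append("p1", 0, "a"),
    log.append("p1", 1, "b"),
    log.append("p1", 1, "b"),     # retry after a lost acknowledgment
]
print(results)       # [True, True, False]
print(log.records)   # ['a', 'b']
```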
84. Flink's 'process function' allows:
Correct Answer: a) Low-level stream access with timers
Explanation:
Process functions provide event-time control and side effects.
85. What are 'counters' in MapReduce?
Correct Answer: a) Global metrics tracking
Explanation:
Counters aggregate job statistics across tasks.
86. Streaming 'deduplication' removes:
Correct Answer: b) Duplicates using keys/windows
Explanation:
It ensures uniqueness within time or key scopes.
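A minimal sketch of key-based deduplication follows. It keeps every seen key in memory, which is an assumption that only holds for a bounded scope such as one window; the record shape and function name are illustrative.

```python
def deduplicate(records, key_fn):
    """Return records in order, keeping only the first occurrence of each key."""
    seen = set()
    unique = []
    for record in records:
        key = key_fn(record)
        if key in seen:
            continue              # duplicate delivery: drop it
        seen.add(key)
        unique.append(record)
    return unique

records = [
    {"id": 1, "v": "a"},
    {"id": 2, "v": "b"},
    {"id": 1, "v": "a"},          # redelivered under at-least-once
]
unique = deduplicate(records, key_fn=lambda r: r["id"])
print(unique)  # [{'id': 1, 'v': 'a'}, {'id': 2, 'v': 'b'}]
```

In a long-running stream the `seen` set would need expiry (for example per window or with a time-to-live), otherwise it grows without bound.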
87. Apache Heron is a:
Correct Answer: a) Storm replacement for real-time
Explanation:
Heron improves Storm with better scheduling and metrics.
88. In batch, 'bloom filters' optimize:
Correct Answer: a) Joins by skipping non-matches
Explanation:
They probabilistically test membership to reduce I/O.
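Here is a small Bloom filter sketch to show the membership test. The sizes, the use of SHA-256 with a seed prefix, and the class name are illustrative choices, not what any particular engine uses.

```python
import hashlib

class BloomFilter:
    """Probabilistic set: no false negatives, occasional false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits

    def _positions(self, item):
        # Derive several bit positions by hashing the item with different seeds.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for position in self._positions(item):
            self.bits[position] = True

    def might_contain(self, item):
        return all(self.bits[position] for position in self._positions(item))

small_side_keys = ["k1", "k2", "k3"]
bloom = BloomFilter()
for key in small_side_keys:
    bloom.add(key)

large_side = [("k1", 10), ("k9", 20), ("k3", 30)]
candidates = [row for row in large_side if bloom.might_contain(row[0])]
print(candidates)
```

Rows whose key is definitely absent from the small side are skipped before the join; any false positives that slip through are removed by the join itself.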
89. Kafka's 'transactions' enable:
Correct Answer: a) Exactly-once across topics
Explanation:
Transactions atomically produce to multiple partitions.
90. In Spark Structured Streaming, 'output mode' controls:
Correct Answer: a) Append, complete, update
Explanation:
Modes define how results are emitted for different queries.
91. What is 'fork' in stream processing?
Correct Answer: b) Split one stream to multiple paths
Explanation:
A fork sends one stream down several paths for parallel or conditional routing.
92. Apache Airflow schedules:
Correct Answer: b) Batch workflows
Explanation:
Airflow uses DAGs for orchestrating complex batch pipelines.
93. In Flink, 'CEP' stands for:
Correct Answer: a) Complex Event Processing
Explanation:
CEP detects patterns in event streams for anomaly detection.
94. MapReduce's 'secondary sort' ensures:
Correct Answer: b) Value order within keys
Explanation:
It sorts both key and value for ordered reducers.
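The pattern can be sketched in one process: sort on a composite (key, value) pair, then group on the natural key only, so each group's values arrive already ordered. The sensor data is made up for illustration; in Hadoop this is done with a custom partitioner and comparators.

```python
from itertools import groupby

records = [("sensor-b", 3), ("sensor-a", 9), ("sensor-a", 2), ("sensor-b", 1)]

# Sort by the composite key (natural key, value)...
ordered = sorted(records, key=lambda record: (record[0], record[1]))

# ...but group by the natural key alone, as the grouping comparator would.
grouped = {
    key: [value for _, value in group]
    for key, group in groupby(ordered, key=lambda record: record[0])
}
print(grouped)  # {'sensor-a': [2, 9], 'sensor-b': [1, 3]}
```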
95. Streaming 'rate limiting' prevents:
Correct Answer: a) Overload
Explanation:
It caps ingestion rates for system stability.
96. Kinesis is Amazon's:
Correct Answer: a) Streaming service
Explanation:
Kinesis captures and processes real-time data at scale.
97. In batch, 'skew join' is mitigated by:
Correct Answer: a) Broadcast small tables
Explanation:
Broadcast avoids skew by sending small sides to all nodes.
98. What is 'schema registry' in streaming?
Correct Answer: a) Central schema management for topics
Explanation:
It enforces and evolves schemas for Avro/Protobuf in Kafka.
99. Storm's 'fields grouping' routes by:
Correct Answer: b) Specific fields
Explanation:
It hashes selected fields for consistent routing.
100. Flink's 'Async I/O' allows:
Correct Answer: a) Non-blocking external calls
Explanation:
Async functions enrich streams with async DB lookups.
101. What is 'groupByKey' in batch processing?
Correct Answer: a) Shuffles by key
Explanation:
It groups values per key, often expensive due to shuffle.
102. In streaming, 'materialized views' are:
Correct Answer: a) Precomputed for fast queries
Explanation:
They cache incremental results for low-latency access.
103. Apache Kafka supports which durability level?
Correct Answer: a) acks=0,1,all
Explanation:
Acks configure write acknowledgments for throughput vs. durability.
104. Spark Streaming integrates with:
Correct Answer: a) Kafka, Flume, TCP
Explanation:
Receivers ingest data from these sources and turn it into DStreams.
105. In batch, 'caching' in Spark uses:
Correct Answer: a) MEMORY_ONLY
Explanation:
Persist levels control storage for reused RDDs.
106. What is 'coalescing' in streaming?
Correct Answer: a) Merging small batches
Explanation:
Coalesce reduces partitions for efficiency.
107. Apache Apex supports:
Correct Answer: a) Unified stream and batch
Explanation:
Apex uses YARN for resilient, stateful processing.
108. MapReduce counters track:
Correct Answer: a) Custom metrics
Explanation:
User-defined counters monitor job progress.
109. In streaming, 'enrichment' means:
Correct Answer: a) Adding context via joins
Explanation:
Enrich streams with external data for deeper insights.
110. Kafka's 'consumer rebalance' occurs when:
Correct Answer: a) Group membership changes
Explanation:
Rebalance redistributes partitions among consumers.
111. Flink ML supports:
Correct Answer: a) Distributed learning on streams
Explanation:
It trains models incrementally from continuous data.
112. What is 'sampling' in batch processing?
Correct Answer: b) Subset for approximation
Explanation:
Sampling reduces compute for large datasets.
113. Storm's 'shuffle grouping' routes:
Correct Answer: a) Round-robin
Explanation:
Shuffle grouping distributes tuples randomly but evenly across the bolt's tasks for load balancing.
114. In streaming, 'fault tolerance' uses:
Correct Answer: a) Replication and snapshots
Explanation:
It recovers state from backups on failure.
115. Apache Spark's batch mode uses:
Correct Answer: a) RDDs for transformations
Explanation:
Core Spark processes finite datasets with actions.
116. Kafka Streams' 'KTable' represents:
Correct Answer: a) Changelog stream
Explanation:
KTables model updatable tables from compacted topics.
117. What is 'partitioning' in batch?
Correct Answer: a) Data distribution for parallelism
Explanation:
It splits work across nodes for scalability.
118. Flink's 'savepoints' are for:
Correct Answer: a) Manual state backups
Explanation:
Savepoints allow upgrades and migrations.
119. In MapReduce, 'distributed cache' shares:
Correct Answer: a) Read-only files across nodes
Explanation:
It avoids shipping large jars or data per task.
120. Streaming 'metrics' monitor:
Correct Answer: a) Throughput, latency
Explanation:
Metrics help tune and detect issues in pipelines.
121. Apache Gearpump is for:
Correct Answer: a) Lightweight stream processing
Explanation:
Gearpump uses actor model for low-latency streams.
122. Batch 'compression' reduces:
Correct Answer: a) Storage and I/O
Explanation:
Formats like Snappy compress intermediate data.
123. What is 'dead letter queue' in streaming?
Correct Answer: a) For failed messages
Explanation:
DLQ stores unprocessable records for later inspection.
124. Storm's 'global grouping' sends to:
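Correct Answer: A single task of the bolt
Explanation:
Global grouping routes the entire stream to one task of the bolt, the one with the lowest task ID. Sending every tuple to all tasks is all grouping.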
125. Flink's 'keyBy' partitions by:
Correct Answer: a) Hash of key
Explanation:
KeyBy groups for stateful keyed operations.
126. In batch, 'union' combines:
Correct Answer: a) Disjoint RDDs
Explanation:
Union creates a new RDD from multiple source RDDs and does not remove duplicates.
127. Streaming 'serialization' uses:
Correct Answer: a) Efficient formats like Avro
Explanation:
It minimizes network overhead for records.
128. Quarkus for streaming provides:
Correct Answer: a) Reactive extensions
Explanation:
Quarkus integrates Kafka for non-blocking streams.
129. MapReduce 'job tracker' in Hadoop 1.x managed:
Correct Answer: a) Resources and scheduling
Explanation:
It coordinated jobs; replaced by YARN in Hadoop 2.
130. In streaming, 'latency' is the time from:
Correct Answer: a) Event to output
Explanation:
End-to-end latency measures processing delay.
131. Akka Streams is for:
Correct Answer: a) Actor-based streaming
Explanation:
Akka Streams builds backpressure-aware flows on top of actors in Scala/Java.
132. Batch 'indexing' accelerates:
Correct Answer: a) Joins and filters
Explanation:
Indexes, such as bitmap indexes, speed up selections by avoiding full scans.
133. Kafka 'mirror maker' does:
Correct Answer: a) Cluster replication
Explanation:
It copies topics across geo-distributed clusters.
134. Flink's 'batch execution' treats data as:
Correct Answer: a) Bounded streams
Explanation:
Unified API processes finite data similarly to streams.
135. What is 'reduceByKey' in batch?
Correct Answer: a) Aggregates by key with shuffle
Explanation:
It combines values per key efficiently.
136. Streaming 'health checks' monitor:
Correct Answer: a) Pipeline status
Explanation:
They alert on backlogs or failures.
137. Apache Ignite for streaming offers:
Correct Answer: a) In-memory stream processing
Explanation:
Ignite accelerates streams with SQL and caching.
138. In batch, 'spill to disk' happens when:
Correct Answer: a) Memory overflows
Explanation:
Spill maintains correctness during shuffles.
139. What is 'assigning' in Kafka consumers?
Correct Answer: a) Manual partition assignment
Explanation:
Calling assign() bypasses the group's automatic partition assignment, so the application decides exactly which partitions it reads.
140. Storm's 'all grouping' is like:
Correct Answer: a) Broadcast
Explanation:
All sends to every bolt instance.
141. Flink's 'DataSet' API is for:
Correct Answer: a) Batch processing
Explanation:
DataSet handles bounded datasets with transformations.
142. Batch 'sampling' methods include:
Correct Answer: a) Reservoir, stratified
Explanation:
They select representative subsets.
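Reservoir sampling can be sketched in a few lines. This follows the classic single-pass approach in which every item ends up in the sample with equal probability; the function name and the optional `seed` parameter are assumptions for illustration.

```python
import random

def reservoir_sample(stream, k, seed=None):
    """Pick k items uniformly at random from an iterable of unknown length in one pass."""
    rng = random.Random(seed)
    reservoir = []
    for index, item in enumerate(stream):
        if index < k:
            reservoir.append(item)          # fill the reservoir first
        else:
            slot = rng.randint(0, index)    # inclusive on both ends
            if slot < k:
                reservoir[slot] = item      # replace with probability k / (index + 1)
    return reservoir

sample = reservoir_sample(range(1000), k=5, seed=42)
print(sample)
```

The sample size stays fixed at `k` regardless of how much data flows past, which is why the method suits datasets too large to hold in memory.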
143. In streaming, 'throttling' limits:
Correct Answer: a) Input rate
Explanation:
Throttling caps sources for stability.
144. Reactive Streams provides:
Correct Answer: a) Backpressure protocol
Explanation:
It standardizes async stream processing.
145. MapReduce 'input splits' are:
Correct Answer: a) Logical data chunks
Explanation:
Splits define mapper inputs, not always block-aligned.
146. Streaming 'alerting' uses:
Correct Answer: a) Thresholds on metrics
Explanation:
It notifies on anomalies like high latency.
147. Eclipse Vert.x for streaming is:
Correct Answer: a) Reactive toolkit
Explanation:
Vert.x handles event-driven streams non-blockingly.
148. In batch, 'cogroup' performs:
Correct Answer: a) Group-wise join
Explanation:
Cogroup iterates over grouped key-value pairs.
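A sketch of the operation on two small key-value datasets follows. The function is an illustrative single-process stand-in for what Spark or Pig perform with a shuffle.

```python
from collections import defaultdict

def cogroup(left, right):
    """For each key, collect the values from both inputs into a pair of lists."""
    grouped = defaultdict(lambda: ([], []))
    for key, value in left:
        grouped[key][0].append(value)
    for key, value in right:
        grouped[key][1].append(value)
    return dict(grouped)

orders = [("a", 1), ("b", 2)]
labels = [("a", "x"), ("c", "y")]
result = cogroup(orders, labels)
print(result)  # {'a': ([1], ['x']), 'b': ([2], []), 'c': ([], ['y'])}
```

Keys missing from one side still appear with an empty list, so inner, outer, and semi joins can all be derived from the cogrouped result.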
149. Kafka 'retention' policy deletes:
Correct Answer: a) Old logs after time/size
Explanation:
It bounds storage for topics.
150. Flink's 'timer service' in process functions:
Correct Answer: a) Schedules callbacks
Explanation:
Timers fire on event or processing time.
151. What is 'distinct' in batch?
Correct Answer: a) Removes duplicates
Explanation:
Distinct shuffles the data so that identical values meet in the same partition, then keeps one copy of each.
152. Streaming 'monitoring' tools include:
Correct Answer: a) Prometheus, Grafana
Explanation:
They visualize stream metrics.
153. Ratpack for streaming:
Correct Answer: a) Reactive web/streams
Explanation:
Ratpack uses Netty for async processing.
154. In batch, 'sortBy' orders:
Correct Answer: a) Globally by key
Explanation:
It shuffles the data into sorted key ranges to produce a total order across partitions.
155. Kafka 'connectors' are:
Correct Answer: a) Plugins for sources/sinks
Explanation:
They standardize integrations.
156. Storm's 'custom grouping' allows:
Correct Answer: a) User-defined routing
Explanation:
It implements logic for tuple distribution.
157. Flink's 'env' is:
Correct Answer: a) Execution environment
Explanation:
StreamExecutionEnvironment configures jobs.
158. Batch 'flatMap' returns:
Correct Answer: a) Variable elements per input
Explanation:
It explodes or flattens collections.
159. In streaming, 'scalability' via:
Correct Answer: a) Horizontal scaling
Explanation:
Add nodes for more throughput.
160. RxJava for streaming:
Correct Answer: a) Reactive observables
Explanation:
RxJava handles async sequences with backpressure.


