These 160 multiple-choice questions explore key concepts in Big Data processing paradigms, including real-time processing with tools such as Apache Storm and Flink, streaming data processing with Kafka and Spark Streaming, and batch processing with MapReduce and Hive. Ideal for understanding scalable data pipelines.
160 Big Data Real-time Processing, Streaming Data, and Batch Processing - MCQs
✅ Correct Answer: b) Immediate data ingestion and analysis
📝 Explanation:
Real-time processing focuses on low-latency handling of data as it arrives, enabling instant insights and decisions.
✅ Correct Answer: b) Apache Storm
📝 Explanation:
Apache Storm processes unbounded streams of data in real-time, supporting topologies for continuous computation.
✅ Correct Answer: b) Each record is processed once without loss or duplication
📝 Explanation:
Exactly-once semantics ensures fault-tolerant processing where each input produces precisely one output.
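As a minimal pure-Python sketch of the idea (one common way to approximate exactly-once on top of at-least-once delivery, assuming every record carries a unique, stable id; all names here are illustrative):

```python
# Deduplicate an at-least-once stream to get effectively-once output.
processed_ids = set()  # in a real system this lives in durable state

def process_exactly_once(record, sink):
    if record["id"] in processed_ids:
        return  # duplicate delivery from a retry; skip it
    sink.append(record["value"] * 2)  # the actual computation
    processed_ids.add(record["id"])

sink = []
stream = [{"id": 1, "value": 10}, {"id": 2, "value": 20},
          {"id": 1, "value": 10}]  # id 1 delivered twice by a retry
for rec in stream:
    process_exactly_once(rec, sink)
print(sink)  # [20, 40] -- the retry produced no duplicate output
```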
✅ Correct Answer: b) Distributed event streaming and messaging
📝 Explanation:
Kafka acts as a scalable pub-sub system for handling real-time data feeds with durability.
✅ Correct Answer: b) Large-scale historical data analysis
📝 Explanation:
Batch processing handles massive datasets offline, optimizing for throughput over latency.
✅ Correct Answer: c) Topology
📝 Explanation:
A Storm topology is a graph of spouts and bolts defining the data flow for real-time processing.
✅ Correct Answer: b) Micro-batches
📝 Explanation:
Spark Streaming discretizes streams into small batches (DStreams) for near-real-time processing.
✅ Correct Answer: b) Publishes messages to topics
📝 Explanation:
Producers send data records to Kafka topics, which are then replicated across brokers.
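For illustration, a minimal producer using the third-party kafka-python client, assuming a broker at localhost:9092 and a topic named events (both assumptions):

```python
import json
from kafka import KafkaProducer  # third-party: pip install kafka-python

# Connect to a local broker; serialize values as JSON bytes.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# send() is asynchronous; the record is appended to the 'events' topic.
producer.send("events", {"user": "alice", "action": "click"})
producer.flush()  # block until buffered records are delivered
```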
✅ Correct Answer: b) Distributed batch processing
📝 Explanation:
MapReduce breaks jobs into map and reduce phases for parallel processing of large datasets.
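A toy word count that mimics the map, shuffle, and reduce phases in plain Python (single-process, for intuition only):

```python
from collections import defaultdict

docs = ["big data batch", "batch processing", "big batch"]

# Map phase: emit (key, value) pairs independently per input record.
mapped = [(word, 1) for doc in docs for word in doc.split()]

# Shuffle phase: group all values by key (done over the network in Hadoop).
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce phase: aggregate each key's grouped values.
counts = {key: sum(values) for key, values in groups.items()}
print(counts)  # {'big': 2, 'data': 1, 'batch': 3, 'processing': 1}
```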
✅ Correct Answer: c) Both batch and stream
📝 Explanation:
Flink unifies batch (bounded streams) and stream processing with stateful computations.
✅ Correct Answer: b) Time or count-based aggregation interval
📝 Explanation:
Windows group streaming data for computations like sums over sliding or tumbling periods.
✅ Correct Answer: a) SQL queries on HDFS
📝 Explanation:
Hive translates SQL to MapReduce or Tez jobs for batch analysis of structured data.
✅ Correct Answer: b) Subscribes to and processes topics
📝 Explanation:
Consumers poll messages from topics, often in groups for load balancing.
✅ Correct Answer: b) Milliseconds to seconds
📝 Explanation:
Low latency (milliseconds to seconds) distinguishes real-time processing from batch, which typically runs in minutes to hours.
✅ Correct Answer: b) Kafka for storage and YARN for execution
📝 Explanation:
Samza leverages Kafka's changelog for state and YARN for distributed tasks.
✅ Correct Answer: b) MapReduce on Hadoop
📝 Explanation:
Hadoop's MapReduce is the classic batch framework for fault-tolerant processing.
✅ Correct Answer: b) Stream source
📝 Explanation:
Spouts emit tuples into the topology, sourcing data from queues or APIs.
✅ Correct Answer: a) Possible duplicates
📝 Explanation:
At-least-once ensures delivery but may retry, causing duplicates.
✅ Correct Answer: b) Batch and streaming
📝 Explanation:
Beam's portable pipelines run on runners like Flink or Dataflow for both paradigms.
✅ Correct Answer: b) Processes input into key-value pairs
📝 Explanation:
Mappers filter and transform raw input independently.
✅ Correct Answer: b) Partitions
📝 Explanation:
Partitions enable parallelism and scalability in Kafka streams.
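The routing idea behind partitions, sketched in plain Python (the same hash-mod scheme underlies MapReduce's HashPartitioner and Kafka's default partitioning; real systems use a stable hash such as murmur2 rather than Python's hash()):

```python
def assign_partition(key: str, num_partitions: int) -> int:
    # Records with the same key always land in the same partition,
    # preserving per-key ordering while spreading load across partitions.
    return hash(key) % num_partitions

for key in ["user-1", "user-2", "user-1", "user-3"]:
    print(key, "->", assign_partition(key, 4))
```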
✅ Correct Answer: b) Fault tolerance via state snapshots
📝 Explanation:
Checkpoints periodically save state for exactly-once recovery.
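A stripped-down sketch of the checkpoint/restore cycle, assuming state that can be copied cheaply (real systems snapshot asynchronously and write to durable storage):

```python
import copy

state = {"count": 0}
checkpoint = copy.deepcopy(state)  # periodic snapshot of operator state

def on_event(value):
    state["count"] += value

def restore():
    global state
    state = copy.deepcopy(checkpoint)  # roll back to the last snapshot

on_event(1)
on_event(2)
checkpoint = copy.deepcopy(state)  # checkpoint taken at count == 3
on_event(4)                        # ...then a failure occurs
restore()                          # recovery resumes from the snapshot
print(state)  # {'count': 3}; events after the checkpoint are replayed
```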
✅ Correct Answer: b) Batch ETL on Hadoop
📝 Explanation:
Pig simplifies data transformation for MapReduce jobs.
✅ Correct Answer: b) Handling overload by throttling producers
📝 Explanation:
Backpressure signals upstream to slow down, preventing downstream failures.
✅ Correct Answer: b) Aggregation on grouped data
📝 Explanation:
Reducers receive shuffled data grouped by key for summarization.
✅ Correct Answer: b) Exactly-once semantics
📝 Explanation:
Trident adds higher-level abstractions with transactional guarantees.
✅ Correct Answer: b) Non-overlapping fixed intervals
📝 Explanation:
Tumbling windows process data in discrete, non-overlapping time slots.
✅ Correct Answer: b) Batch and interactive workloads
📝 Explanation:
YARN decouples resource management from job scheduling for multi-tenancy.
✅ Correct Answer: b) Server in the Kafka cluster
📝 Explanation:
Brokers store and manage topic partitions, handling replication.
✅ Correct Answer: b) Overlapping intervals
📝 Explanation:
Sliding windows advance by a slide duration, overlapping for smoother aggregations.
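A pure-Python comparison of tumbling and sliding window sums over timestamped values (window size and slide are in the same units as the timestamps; purely illustrative):

```python
events = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]  # (timestamp, value)

def tumbling_sums(events, size):
    # Non-overlapping windows: each event belongs to exactly one window.
    sums = {}
    for ts, v in events:
        start = (ts // size) * size
        sums[start] = sums.get(start, 0) + v
    return sums

def sliding_sums(events, size, slide):
    # Overlapping windows: an event can belong to several windows.
    last = max(ts for ts, _ in events)
    return {
        start: sum(v for ts, v in events if start <= ts < start + size)
        for start in range(0, last + 1, slide)
    }

print(tumbling_sums(events, size=2))          # {0: 3, 2: 7, 4: 5}
print(sliding_sums(events, size=3, slide=1))  # windows [0,3), [1,4), ...
```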
✅ Correct Answer: a) DAG execution
📝 Explanation:
Tez executes complex workflows as DAGs, reducing MapReduce overhead.
✅ Correct Answer: b) Processing unit
📝 Explanation:
Bolts transform, filter, or aggregate streams from spouts.
✅ Correct Answer: b) Load-balanced parallel consumption
📝 Explanation:
Groups allow multiple consumers to share partitions for scalability.
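The partition-sharing idea in miniature: each partition is owned by exactly one member of the group, so adding consumers (up to the partition count) raises parallelism. A hypothetical round-robin assignment in plain Python:

```python
def assign(partitions, consumers):
    # Each partition goes to exactly one consumer in the group;
    # a consumer may own several partitions.
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

print(assign([0, 1, 2, 3, 4, 5], ["c1", "c2", "c3"]))
# {'c1': [0, 3], 'c2': [1, 4], 'c3': [2, 5]}
```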
✅ Correct Answer: b) Grouping and transferring data to reducers
📝 Explanation:
Shuffle sorts and sends mapped output by key across the network.
✅ Correct Answer: b) Keyed and operator state for computations
📝 Explanation:
A state backend such as RocksDB stores keyed state for consistent processing.
✅ Correct Answer: b) Gaps in data activity
📝 Explanation:
Session windows group data with inactivity timeouts for variable durations.
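A sketch of session windowing: events group together until a pause longer than the inactivity gap splits them (timestamps and gap in the same units; illustrative only):

```python
def sessionize(timestamps, gap):
    # Start a new session whenever the pause since the previous
    # event exceeds the inactivity gap. Assumes a sorted, non-empty list.
    sessions, current = [], [timestamps[0]]
    for prev, ts in zip(timestamps, timestamps[1:]):
        if ts - prev > gap:
            sessions.append(current)
            current = []
        current.append(ts)
    sessions.append(current)
    return sessions

print(sessionize([1, 2, 3, 10, 11, 30], gap=5))
# [[1, 2, 3], [10, 11], [30]] -- three sessions of varying length
```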
✅ Correct Answer: b) RDBMS and Hadoop
📝 Explanation:
Sqoop uses MapReduce for efficient bulk import/export to HDFS.
✅ Correct Answer: b) Handling late data with event time
📝 Explanation:
Watermarks indicate progress in event time, closing windows for out-of-order data.
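A toy watermark under the assumption that events arrive at most MAX_DELAY late: the watermark trails the highest event time seen, and a window [start, end) can be finalized once the watermark passes end. All names are illustrative:

```python
max_event_time = 0
MAX_DELAY = 5  # assumed bound on out-of-orderness

def watermark():
    # "No event with timestamp <= watermark is still expected."
    return max_event_time - MAX_DELAY

def on_event(event_time):
    global max_event_time
    max_event_time = max(max_event_time, event_time)
    print(f"event t={event_time}, watermark={watermark()}")

for t in [3, 7, 5, 12]:  # t=5 arrives out of order but is still on time
    on_event(t)
# A window [0, 5) may be emitted once watermark() >= 5, i.e. after t=12.
```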
✅ Correct Answer: b) Data replication and task retry
📝 Explanation:
Hadoop retries failed tasks and uses replicated input for reliability.
✅ Correct Answer: b) Lightweight stream processing
📝 Explanation:
It builds topologies for transformations directly on Kafka topics.
✅ Correct Answer: a) Data unit in streams
📝 Explanation:
Tuples are named lists of values flowing through the topology.
✅ Correct Answer: b) Unbounded streams
📝 Explanation:
DataStream processes continuous, potentially infinite data flows.
✅ Correct Answer: b) Mapper output locally
📝 Explanation:
Combiners pre-aggregate to reduce shuffle data volume.
✅ Correct Answer: b) Sequential ID in a partition
📝 Explanation:
Offsets track consumer progress in log partitions.
✅ Correct Answer: b) DataFrames for declarative streams
📝 Explanation:
It models streams as unbounded tables with SQL-like operations.
✅ Correct Answer: b) Complete workflow from input to output
📝 Explanation:
A job encompasses the full MapReduce execution.
✅ Correct Answer: a) Micro-batch processing
📝 Explanation:
Trident batches tuples for transactional and stateful operations.
✅ Correct Answer: a) Linking multiple operations
📝 Explanation:
Chaining composes transformations into a pipeline for efficiency.
✅ Correct Answer: b) Organizes data by columns for faster queries
📝 Explanation:
Partitioning prunes irrelevant data during scans.
✅ Correct Answer: b) Latest value per key
📝 Explanation:
Compaction enables changelog semantics for stateful apps.
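The intuition in plain Python: replaying a compacted log is equivalent to keeping only the last value written for each key:

```python
log = [("user-1", "v1"), ("user-2", "v1"), ("user-1", "v2"), ("user-2", "v3")]

def compact(log):
    # Later writes win: only the latest value per key survives compaction.
    latest = {}
    for key, value in log:
        latest[key] = value
    return latest

print(compact(log))  # {'user-1': 'v2', 'user-2': 'v3'}
```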
✅ Correct Answer: b) Emitting late data separately
📝 Explanation:
Side outputs handle out-of-order or special events without blocking.
✅ Correct Answer: a) Running backup tasks for slow ones
📝 Explanation:
It launches duplicates to mitigate stragglers in MapReduce.
✅ Correct Answer: a) Sequence of RDDs
📝 Explanation:
DStreams represent discretized streams as RDD chains.
✅ Correct Answer: b) Event time is generation timestamp
📝 Explanation:
Event time reflects source time, handling delays unlike processing time.
✅ Correct Answer: b) Batch workflows
📝 Explanation:
Oozie orchestrates Hadoop jobs like Hive and Pig in DAGs.
✅ Correct Answer: b) Output destination
📝 Explanation:
Sinks write processed data to stores like HDFS or databases.
✅ Correct Answer: b) How to read input data
📝 Explanation:
InputFormat splits files and provides RecordReaders for records.
✅ Correct Answer: b) Data durability across brokers
📝 Explanation:
It copies partitions to multiple brokers for fault tolerance.
✅ Correct Answer: a) SQL-like operations on streams
📝 Explanation:
It unifies relational queries over dynamic tables from streams.
✅ Correct Answer: b) High throughput for large volumes
📝 Explanation:
It excels at processing terabytes efficiently, though with higher latency.
✅ Correct Answer: b) Tuple acknowledgments for at-least-once
📝 Explanation:
Anchoring tracks tuple lineages for failure recovery.
✅ Correct Answer: a) Stream-stream, stream-table
📝 Explanation:
Joins enrich streams with reference data or merge co-streams.
✅ Correct Answer: b) Streaming log collection
📝 Explanation:
Flume aggregates and moves log data reliably into HDFS.
✅ Correct Answer: b) Uneven data leading to hotspots
📝 Explanation:
Skew causes some tasks to process more data, slowing jobs.
✅ Correct Answer: a) Streams with external systems
📝 Explanation:
Connect uses connectors for scalable data import/export.
✅ Correct Answer: a) Sharing configuration across keyed streams
📝 Explanation:
It broadcasts read-only data to all tasks for joins.
✅ Correct Answer: b) HDFS or local files
📝 Explanation:
LOAD uses loader functions to read data in various formats into Pig relations.
✅ Correct Answer: a) Window emission frequency
📝 Explanation:
Triggers fire computations based on time, count, or conditions.
✅ Correct Answer: b) Key hash for reducer assignment
📝 Explanation:
HashPartitioner distributes by key hash modulo numReducers.
✅ Correct Answer: a) Reprocessing from stored logs
📝 Explanation:
Replay enables re-computation for recovery or corrections.
✅ Correct Answer: b) Data flows in streaming pipelines
📝 Explanation:
NiFi provides visual design for routing and transforming data.
✅ Correct Answer: a) Reducing shuffle
📝 Explanation:
They locally aggregate before shuffle, like mini-reducers.
✅ Correct Answer: a) Cluster metadata and leader election
📝 Explanation:
Zookeeper coordinates brokers for topics and partitions.
✅ Correct Answer: a) Stream and batch queries
📝 Explanation:
It uses continuous queries for dynamic tables.
✅ Correct Answer: a) Removing old state
📝 Explanation:
Eviction policies like time-to-live manage memory for windowed state.
✅ Correct Answer: a) DAG optimization
📝 Explanation:
Tez compiles HiveQL into DAGs with fewer stages, improving performance over chained MapReduce jobs.
✅ Correct Answer: b) Acker bolts for reliability
📝 Explanation:
Ackers track tuple trees for at-least-once delivery.
✅ Correct Answer: a) Source/sink integrations
📝 Explanation:
Connectors link pipelines to systems like Kafka or GCS.
✅ Correct Answer: b) Restarting failed tasks from lineage
📝 Explanation:
Deterministic tasks allow re-execution from input splits.
✅ Correct Answer: b) Broadcasting to multiple consumers
📝 Explanation:
Fan-out duplicates streams for parallel processing or routing.
✅ Correct Answer: a) Messaging system with multi-tenancy
📝 Explanation:
Pulsar separates compute from storage for scalable streaming.
✅ Correct Answer: b) SIMD on columns
📝 Explanation:
Vectorized execution applies SIMD CPU instructions to batches of column values at once.
✅ Correct Answer: a) Duplicates
📝 Explanation:
It uses sequence numbers for exactly-once writes.
✅ Correct Answer: a) Low-level stream access with timers
📝 Explanation:
Process functions provide event-time control and side effects.
✅ Correct Answer: a) Global metrics tracking
📝 Explanation:
Counters aggregate job statistics across tasks.
✅ Correct Answer: b) Duplicates using keys/windows
📝 Explanation:
It ensures uniqueness within time or key scopes.
✅ Correct Answer: a) Storm replacement for real-time
📝 Explanation:
Heron improves Storm with better scheduling and metrics.
✅ Correct Answer: a) Joins by skipping non-matches
📝 Explanation:
They probabilistically test membership to reduce I/O.
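A tiny Bloom filter sketch (two hash positions and a fixed bit array; real implementations derive both from the target false-positive rate):

```python
class BloomFilter:
    def __init__(self, size=64):
        self.size = size
        self.bits = [False] * size

    def _positions(self, item):
        # Two cheap positions derived from Python's hash();
        # production filters use independent, stable hash functions.
        h = hash(item)
        return [h % self.size, (h // self.size) % self.size]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        # False means definitely absent; True means *possibly* present.
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
bf.add("key-42")
print(bf.might_contain("key-42"))  # True
print(bf.might_contain("key-99"))  # almost certainly False -> skip the probe
```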
✅ Correct Answer: a) Exactly-once across topics
📝 Explanation:
Transactions atomically produce to multiple partitions.
✅ Correct Answer: a) Append, complete, update
📝 Explanation:
Modes define how results are emitted for different queries.
✅ Correct Answer: b) Split one stream to multiple paths
📝 Explanation:
Fork duplicates for parallel or conditional routing.
✅ Correct Answer: b) Batch workflows
📝 Explanation:
Airflow uses DAGs for orchestrating complex batch pipelines.
✅ Correct Answer: a) Complex Event Processing
📝 Explanation:
CEP detects patterns in event streams for anomaly detection.
✅ Correct Answer: b) Value order within keys
📝 Explanation:
It sorts both key and value for ordered reducers.
✅ Correct Answer: a) Overload
📝 Explanation:
It caps ingestion rates for system stability.
✅ Correct Answer: a) Streaming service
📝 Explanation:
Kinesis captures and processes real-time data at scale.
✅ Correct Answer: a) Broadcast small tables
📝 Explanation:
Broadcast avoids skew by sending small sides to all nodes.
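The broadcast-join idea in plain Python: build a hash map from the small side once, then stream the large side past it, so the large side is never shuffled and no single key becomes a hotspot:

```python
small = [(1, "bronze"), (2, "silver"), (3, "gold")]       # dimension table
large = [(2, "order-a"), (3, "order-b"), (2, "order-c")]  # fact stream

# "Broadcast": every worker receives its own copy of this map.
lookup = dict(small)

# Each worker joins its slice of the large side locally.
joined = [(k, order, lookup[k]) for k, order in large if k in lookup]
print(joined)
# [(2, 'order-a', 'silver'), (3, 'order-b', 'gold'), (2, 'order-c', 'silver')]
```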
✅ Correct Answer: a) Central schema management for topics
📝 Explanation:
It enforces and evolves schemas for Avro/Protobuf in Kafka.
✅ Correct Answer: b) Specific fields
📝 Explanation:
It hashes selected fields for consistent routing.
✅ Correct Answer: a) Non-blocking external calls
📝 Explanation:
Async functions enrich streams with async DB lookups.
✅ Correct Answer: a) Shuffles by key
📝 Explanation:
It groups values per key, often expensive due to shuffle.
✅ Correct Answer: a) Precomputed for fast queries
📝 Explanation:
They cache incremental results for low-latency access.
✅ Correct Answer: a) acks=0,1,all
📝 Explanation:
Acks configure write acknowledgments for throughput vs. durability.
✅ Correct Answer: a) Kafka, Flume, TCP
📝 Explanation:
Receivers ingest data from sources such as Kafka, Flume, or TCP sockets to feed DStreams.
✅ Correct Answer: a) MEMORY_ONLY
📝 Explanation:
Persist levels control storage for reused RDDs.
✅ Correct Answer: a) Merging small batches
📝 Explanation:
Coalesce reduces partitions for efficiency.
✅ Correct Answer: a) Unified stream and batch
📝 Explanation:
Apex uses YARN for resilient, stateful processing.
✅ Correct Answer: a) Custom metrics
📝 Explanation:
User-defined counters monitor job progress.
✅ Correct Answer: a) Adding context via joins
📝 Explanation:
Enrichment joins streams with external data to add context for deeper insights.
✅ Correct Answer: a) Group membership changes
📝 Explanation:
Rebalance redistributes partitions among consumers.
✅ Correct Answer: a) Distributed learning on streams
📝 Explanation:
It trains models incrementally from continuous data.
✅ Correct Answer: b) Subset for approximation
📝 Explanation:
Sampling reduces compute for large datasets.
✅ Correct Answer: a) Round-robin
📝 Explanation:
Shuffle evenly distributes tuples for load balancing.
✅ Correct Answer: a) Replication and snapshots
📝 Explanation:
It recovers state from backups on failure.
✅ Correct Answer: a) RDDs for transformations
📝 Explanation:
Core Spark processes finite datasets with actions.
✅ Correct Answer: a) Changelog stream
📝 Explanation:
KTables model updatable tables from compacted topics.
✅ Correct Answer: a) Data distribution for parallelism
📝 Explanation:
It splits work across nodes for scalability.
✅ Correct Answer: a) Manual state backups
📝 Explanation:
Savepoints allow upgrades and migrations.
✅ Correct Answer: a) Read-only files across nodes
📝 Explanation:
It avoids shipping large jars or data per task.
✅ Correct Answer: a) Throughput, latency
📝 Explanation:
Metrics help tune and detect issues in pipelines.
✅ Correct Answer: a) Lightweight stream processing
📝 Explanation:
Gearpump uses the actor model for low-latency stream processing.
✅ Correct Answer: a) Storage and I/O
📝 Explanation:
Codecs such as Snappy compress intermediate data, cutting storage and I/O.
✅ Correct Answer: a) For failed messages
📝 Explanation:
DLQ stores unprocessable records for later inspection.
✅ Correct Answer: a) All bolts
📝 Explanation:
Global broadcasts to every downstream instance.
✅ Correct Answer: a) Hash of key
📝 Explanation:
keyBy partitions the stream by key hash, enabling stateful keyed operations.
✅ Correct Answer: a) Disjoint RDDs
📝 Explanation:
Union creates a new RDD from multiple sources.
✅ Correct Answer: a) Efficient formats like Avro
📝 Explanation:
It minimizes network overhead for records.
✅ Correct Answer: a) Reactive extensions
📝 Explanation:
Quarkus integrates Kafka for non-blocking streams.
✅ Correct Answer: a) Resources and scheduling
📝 Explanation:
It coordinated jobs and cluster resources in Hadoop 1 and was replaced by YARN in Hadoop 2.
✅ Correct Answer: a) Event to output
📝 Explanation:
End-to-end latency measures processing delay.
✅ Correct Answer: a) Actor-based streaming
📝 Explanation:
Akka Streams provides backpressure-aware flows in Scala and Java.
✅ Correct Answer: a) Joins and filters
📝 Explanation:
Indexes such as bitmap indexes speed up selective queries.
✅ Correct Answer: a) Cluster replication
📝 Explanation:
It copies topics across geo-distributed clusters.
✅ Correct Answer: a) Bounded streams
📝 Explanation:
Unified API processes finite data similarly to streams.
✅ Correct Answer: a) Aggregates by key with shuffle
📝 Explanation:
It combines values per key efficiently.
✅ Correct Answer: a) Pipeline status
📝 Explanation:
They alert on backlogs or failures.
✅ Correct Answer: a) In-memory stream processing
📝 Explanation:
Ignite accelerates streams with SQL and caching.
✅ Correct Answer: a) Memory overflows
📝 Explanation:
Spilling writes in-memory buffers to disk when they fill, preserving correctness during shuffles.
✅ Correct Answer: a) Manual partition assignment
📝 Explanation:
Manual assignment overrides automatic rebalancing for custom partition-to-consumer mapping.
✅ Correct Answer: a) Broadcast
📝 Explanation:
The 'all' grouping sends every tuple to each instance of the target bolt.
✅ Correct Answer: a) Batch processing
📝 Explanation:
DataSet handles bounded datasets with transformations.
✅ Correct Answer: a) Reservoir, stratified
📝 Explanation:
They select representative subsets.
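Reservoir sampling keeps a uniform sample of size k from a stream of unknown length in one pass; a standard sketch (Algorithm R):

```python
import random

def reservoir_sample(stream, k):
    # Keep the first k items, then replace survivors with decreasing
    # probability so every item seen is retained with probability k / n.
    sample = []
    for n, item in enumerate(stream, start=1):
        if n <= k:
            sample.append(item)
        else:
            j = random.randrange(n)  # uniform in [0, n)
            if j < k:
                sample[j] = item
    return sample

print(reservoir_sample(range(1_000_000), k=5))
```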
✅ Correct Answer: a) Input rate
📝 Explanation:
Throttling caps sources for stability.
✅ Correct Answer: a) Backpressure protocol
📝 Explanation:
It standardizes async stream processing.
✅ Correct Answer: a) Logical data chunks
📝 Explanation:
Splits define mapper inputs, not always block-aligned.
✅ Correct Answer: a) Thresholds on metrics
📝 Explanation:
It notifies on anomalies like high latency.
✅ Correct Answer: a) Reactive toolkit
📝 Explanation:
Vert.x handles event-driven streams with non-blocking I/O.
✅ Correct Answer: a) Group-wise join
📝 Explanation:
Cogroup iterates over grouped key-value pairs.
✅ Correct Answer: a) Old logs after time/size
📝 Explanation:
It bounds storage for topics.
✅ Correct Answer: a) Schedules callbacks
📝 Explanation:
Timers fire on event or processing time.
✅ Correct Answer: a) Removes duplicates
📝 Explanation:
Distinct shuffles data to eliminate duplicates, keeping only unique values.
✅ Correct Answer: a) Prometheus, Grafana
📝 Explanation:
They visualize stream metrics.
✅ Correct Answer: a) Reactive web/streams
📝 Explanation:
Ratpack uses Netty for async processing.
✅ Correct Answer: a) Globally by key
📝 Explanation:
It shuffles data to produce a globally sorted order by key.
✅ Correct Answer: a) Plugins for sources/sinks
📝 Explanation:
They standardize integrations.
✅ Correct Answer: a) User-defined routing
📝 Explanation:
It implements logic for tuple distribution.
✅ Correct Answer: a) Execution environment
📝 Explanation:
StreamExecutionEnvironment configures jobs.
✅ Correct Answer: a) Variable elements per input
📝 Explanation:
It explodes or flattens collections.
✅ Correct Answer: a) Horizontal scaling
📝 Explanation:
Horizontal scaling adds nodes to increase throughput.
✅ Correct Answer: a) Reactive observables
📝 Explanation:
RxJava handles async sequences with backpressure.