These 160 multiple-choice questions explore key concepts in Big Data processing paradigms, including real-time processing with tools such as Apache Storm and Flink, streaming data processing with Kafka and Spark Streaming, and batch processing with MapReduce and Hive. Ideal for understanding scalable data pipelines.
160 Big Data Real-time Processing, Streaming Data, and Batch Processing - MCQs
✅ Correct Answer: b) Immediate data ingestion and analysis
📝 Explanation:
Real-time processing focuses on low-latency handling of data as it arrives, enabling instant insights and decisions.
✅ Correct Answer: b) Apache Storm
📝 Explanation:
Apache Storm processes unbounded streams of data in real-time, supporting topologies for continuous computation.
✅ Correct Answer: b) Each record is processed once without loss or duplication
📝 Explanation:
Exactly-once semantics ensures fault-tolerant processing where each input produces precisely one output.
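As a minimal pure-Python sketch of the idea (one common way to approximate exactly-once on top of at-least-once delivery, assuming every record carries a unique, stable id; all names here are illustrative):

```python
# Deduplicate an at-least-once stream to get effectively-once output.
processed_ids = set()  # in a real system this lives in durable state

def process_exactly_once(record, sink):
    if record["id"] in processed_ids:
        return  # duplicate delivery from a retry; skip it
    sink.append(record["value"] * 2)  # the actual computation
    processed_ids.add(record["id"])

sink = []
stream = [{"id": 1, "value": 10}, {"id": 2, "value": 20},
          {"id": 1, "value": 10}]  # id 1 delivered twice by a retry
for rec in stream:
    process_exactly_once(rec, sink)
print(sink)  # [20, 40] -- the retry produced no duplicate output
```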
✅ Correct Answer: b) Distributed event streaming and messaging
📝 Explanation:
Kafka acts as a scalable pub-sub system for handling real-time data feeds with durability.
✅ Correct Answer: b) Large-scale historical data analysis
📝 Explanation:
Batch processing handles massive datasets offline, optimizing for throughput over latency.
✅ Correct Answer: c) Topology
📝 Explanation:
A Storm topology is a graph of spouts and bolts defining the data flow for real-time processing.
✅ Correct Answer: b) Micro-batches
📝 Explanation:
Spark Streaming discretizes streams into small batches (DStreams) for near-real-time processing.
✅ Correct Answer: b) Publishes messages to topics
📝 Explanation:
Producers send data records to Kafka topics, which are then replicated across brokers.
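For illustration, a minimal producer using the third-party kafka-python client, assuming a broker at localhost:9092 and a topic named events (both assumptions):

```python
import json
from kafka import KafkaProducer  # third-party: pip install kafka-python

# Connect to a local broker; serialize values as JSON bytes.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# send() is asynchronous; the record is appended to the 'events' topic.
producer.send("events", {"user": "alice", "action": "click"})
producer.flush()  # block until buffered records are delivered
```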
✅ Correct Answer: b) Distributed batch processing
📝 Explanation:
MapReduce breaks jobs into map and reduce phases for parallel processing of large datasets.
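A toy word count that mimics the map, shuffle, and reduce phases in plain Python (single-process, for intuition only):

```python
from collections import defaultdict

docs = ["big data batch", "batch processing", "big batch"]

# Map phase: emit (key, value) pairs independently per input record.
mapped = [(word, 1) for doc in docs for word in doc.split()]

# Shuffle phase: group all values by key (done over the network in Hadoop).
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce phase: aggregate each key's grouped values.
counts = {key: sum(values) for key, values in groups.items()}
print(counts)  # {'big': 2, 'data': 1, 'batch': 3, 'processing': 1}
```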
✅ Correct Answer: c) Both batch and stream
📝 Explanation:
Flink unifies batch (bounded streams) and stream processing with stateful computations.
✅ Correct Answer: b) Time or count-based aggregation interval
📝 Explanation:
Windows group streaming data for computations like sums over sliding or tumbling periods.
✅ Correct Answer: a) SQL queries on HDFS
📝 Explanation:
Hive translates SQL to MapReduce or Tez jobs for batch analysis of structured data.
✅ Correct Answer: b) Subscribes to and processes topics
📝 Explanation:
Consumers poll messages from topics, often in groups for load balancing.
✅ Correct Answer: b) Milliseconds to seconds
📝 Explanation:
Low latency (milliseconds to seconds) distinguishes real-time processing from batch, which typically runs in minutes to hours.
✅ Correct Answer: b) Kafka for storage and YARN for execution
📝 Explanation:
Samza leverages Kafka's changelog for state and YARN for distributed tasks.
✅ Correct Answer: b) MapReduce on Hadoop
📝 Explanation:
Hadoop's MapReduce is the classic batch framework for fault-tolerant processing.
✅ Correct Answer: b) Stream source
📝 Explanation:
Spouts emit tuples into the topology, sourcing data from queues or APIs.
✅ Correct Answer: a) Possible duplicates
📝 Explanation:
At-least-once ensures delivery but may retry, causing duplicates.
✅ Correct Answer: b) Batch and streaming
📝 Explanation:
Beam's portable pipelines run on runners like Flink or Dataflow for both paradigms.
✅ Correct Answer: b) Processes input into key-value pairs
📝 Explanation:
Mappers filter and transform raw input independently.
✅ Correct Answer: b) Partitions
📝 Explanation:
Partitions enable parallelism and scalability in Kafka streams.
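The routing idea behind partitions, sketched in plain Python (the same hash-mod scheme underlies MapReduce's HashPartitioner and Kafka's default partitioning; real systems use a stable hash such as murmur2 rather than Python's hash()):

```python
def assign_partition(key: str, num_partitions: int) -> int:
    # Records with the same key always land in the same partition,
    # preserving per-key ordering while spreading load across partitions.
    return hash(key) % num_partitions

for key in ["user-1", "user-2", "user-1", "user-3"]:
    print(key, "->", assign_partition(key, 4))
```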
✅ Correct Answer: b) Fault tolerance via state snapshots
📝 Explanation:
Checkpoints periodically save state for exactly-once recovery.
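A stripped-down sketch of the checkpoint/restore cycle, assuming state that can be copied cheaply (real systems snapshot asynchronously and write to durable storage):

```python
import copy

state = {"count": 0}
checkpoint = copy.deepcopy(state)  # periodic snapshot of operator state

def on_event(value):
    state["count"] += value

def restore():
    global state
    state = copy.deepcopy(checkpoint)  # roll back to the last snapshot

on_event(1)
on_event(2)
checkpoint = copy.deepcopy(state)  # checkpoint taken at count == 3
on_event(4)                        # ...then a failure occurs
restore()                          # recovery resumes from the snapshot
print(state)  # {'count': 3}; events after the checkpoint are replayed
```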
✅ Correct Answer: b) Batch ETL on Hadoop
📝 Explanation:
Pig simplifies data transformation for MapReduce jobs.
✅ Correct Answer: b) Handling overload by throttling producers
📝 Explanation:
Backpressure signals upstream to slow down, preventing downstream failures.
✅ Correct Answer: b) Aggregation on grouped data
📝 Explanation:
Reducers receive shuffled data grouped by key for summarization.
✅ Correct Answer: b) Exactly-once semantics
📝 Explanation:
Trident adds higher-level abstractions with transactional guarantees.
✅ Correct Answer: b) Non-overlapping fixed intervals
📝 Explanation:
Tumbling windows process data in discrete, non-overlapping time slots.
✅ Correct Answer: b) Batch and interactive workloads
📝 Explanation:
YARN decouples resource management from job scheduling for multi-tenancy.
✅ Correct Answer: b) Server in the Kafka cluster
📝 Explanation:
Brokers store and manage topic partitions, handling replication.
✅ Correct Answer: b) Overlapping intervals
📝 Explanation:
Sliding windows advance by a slide duration, overlapping for smoother aggregations.
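A pure-Python comparison of tumbling and sliding window sums over timestamped values (window size and slide are in the same units as the timestamps; purely illustrative):

```python
events = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]  # (timestamp, value)

def tumbling_sums(events, size):
    # Non-overlapping windows: each event belongs to exactly one window.
    sums = {}
    for ts, v in events:
        start = (ts // size) * size
        sums[start] = sums.get(start, 0) + v
    return sums

def sliding_sums(events, size, slide):
    # Overlapping windows: an event can belong to several windows.
    last = max(ts for ts, _ in events)
    return {
        start: sum(v for ts, v in events if start <= ts < start + size)
        for start in range(0, last + 1, slide)
    }

print(tumbling_sums(events, size=2))          # {0: 3, 2: 7, 4: 5}
print(sliding_sums(events, size=3, slide=1))  # windows [0,3), [1,4), ...
```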
✅ Correct Answer: a) DAG execution
📝 Explanation:
Tez executes complex workflows as DAGs, reducing MapReduce overhead.
✅ Correct Answer: b) Processing unit
📝 Explanation:
Bolts transform, filter, or aggregate streams from spouts.
✅ Correct Answer: b) Load-balanced parallel consumption
📝 Explanation:
Groups allow multiple consumers to share partitions for scalability.
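The partition-sharing idea in miniature: each partition is owned by exactly one member of the group, so adding consumers (up to the partition count) raises parallelism. A hypothetical round-robin assignment in plain Python:

```python
def assign(partitions, consumers):
    # Each partition goes to exactly one consumer in the group;
    # a consumer may own several partitions.
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

print(assign([0, 1, 2, 3, 4, 5], ["c1", "c2", "c3"]))
# {'c1': [0, 3], 'c2': [1, 4], 'c3': [2, 5]}
```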
✅ Correct Answer: b) Grouping and transferring data to reducers
📝 Explanation:
Shuffle sorts and sends mapped output by key across the network.
✅ Correct Answer: b) Keyed and operator state for computations
📝 Explanation:
A state backend such as RocksDB stores keyed state for consistent processing.
✅ Correct Answer: b) Gaps in data activity
📝 Explanation:
Session windows group data with inactivity timeouts for variable durations.
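A sketch of session windowing: events group together until a pause longer than the inactivity gap splits them (timestamps and gap in the same units; illustrative only):

```python
def sessionize(timestamps, gap):
    # Start a new session whenever the pause since the previous
    # event exceeds the inactivity gap. Assumes a sorted, non-empty list.
    sessions, current = [], [timestamps[0]]
    for prev, ts in zip(timestamps, timestamps[1:]):
        if ts - prev > gap:
            sessions.append(current)
            current = []
        current.append(ts)
    sessions.append(current)
    return sessions

print(sessionize([1, 2, 3, 10, 11, 30], gap=5))
# [[1, 2, 3], [10, 11], [30]] -- three sessions of varying length
```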
✅ Correct Answer: b) RDBMS and Hadoop
📝 Explanation:
Sqoop uses MapReduce for efficient bulk import/export to HDFS.
✅ Correct Answer: b) Handling late data with event time
📝 Explanation:
Watermarks indicate progress in event time, closing windows for out-of-order data.
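A toy watermark under the assumption that events arrive at most MAX_DELAY late: the watermark trails the highest event time seen, and a window [start, end) can be finalized once the watermark passes end. All names are illustrative:

```python
max_event_time = 0
MAX_DELAY = 5  # assumed bound on out-of-orderness

def watermark():
    # "No event with timestamp <= watermark is still expected."
    return max_event_time - MAX_DELAY

def on_event(event_time):
    global max_event_time
    max_event_time = max(max_event_time, event_time)
    print(f"event t={event_time}, watermark={watermark()}")

for t in [3, 7, 5, 12]:  # t=5 arrives out of order but is still on time
    on_event(t)
# A window [0, 5) may be emitted once watermark() >= 5, i.e. after t=12.
```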
✅ Correct Answer: b) Data replication and task retry
📝 Explanation:
Hadoop retries failed tasks and uses replicated input for reliability.
✅ Correct Answer: b) Lightweight stream processing
📝 Explanation:
It builds topologies for transformations directly on Kafka topics.
✅ Correct Answer: a) Data unit in streams
📝 Explanation:
Tuples are named lists of values flowing through the topology.
✅ Correct Answer: b) Unbounded streams
📝 Explanation:
DataStream processes continuous, potentially infinite data flows.
✅ Correct Answer: b) Mapper output locally
📝 Explanation:
Combiners pre-aggregate to reduce shuffle data volume.
✅ Correct Answer: b) Sequential ID in a partition
📝 Explanation:
Offsets track consumer progress in log partitions.
✅ Correct Answer: b) DataFrames for declarative streams
📝 Explanation:
It models streams as unbounded tables with SQL-like operations.
✅ Correct Answer: b) Complete workflow from input to output
📝 Explanation:
A job encompasses the full MapReduce execution.
✅ Correct Answer: a) Micro-batch processing
📝 Explanation:
Trident batches tuples for transactional and stateful operations.
✅ Correct Answer: a) Linking multiple operations
📝 Explanation:
Chaining composes transformations into a pipeline for efficiency.
✅ Correct Answer: b) Organizes data by columns for faster queries
📝 Explanation:
Partitioning prunes irrelevant data during scans.
✅ Correct Answer: b) Latest value per key
📝 Explanation:
Compaction enables changelog semantics for stateful apps.
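The intuition in plain Python: replaying a compacted log is equivalent to keeping only the last value written for each key:

```python
log = [("user-1", "v1"), ("user-2", "v1"), ("user-1", "v2"), ("user-2", "v3")]

def compact(log):
    # Later writes win: only the latest value per key survives compaction.
    latest = {}
    for key, value in log:
        latest[key] = value
    return latest

print(compact(log))  # {'user-1': 'v2', 'user-2': 'v3'}
```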
✅ Correct Answer: b) Emitting late data separately
📝 Explanation:
Side outputs handle out-of-order or special events without blocking.
✅ Correct Answer: a) Running backup tasks for slow ones
📝 Explanation:
It launches duplicates to mitigate stragglers in MapReduce.
✅ Correct Answer: a) Sequence of RDDs
📝 Explanation:
DStreams represent discretized streams as RDD chains.
✅ Correct Answer: b) Event time is generation timestamp
📝 Explanation:
Event time reflects source time, handling delays unlike processing time.
✅ Correct Answer: b) Batch workflows
📝 Explanation:
Oozie orchestrates Hadoop jobs like Hive and Pig in DAGs.
✅ Correct Answer: b) Output destination
📝 Explanation:
Sinks write processed data to stores like HDFS or databases.
✅ Correct Answer: b) How to read input data
📝 Explanation:
InputFormat splits files and provides RecordReaders for records.
✅ Correct Answer: b) Data durability across brokers
📝 Explanation:
It copies partitions to multiple brokers for fault tolerance.
✅ Correct Answer: a) SQL-like operations on streams
📝 Explanation:
It unifies relational queries over dynamic tables from streams.
✅ Correct Answer: b) High throughput for large volumes
📝 Explanation:
It excels at processing terabytes efficiently, though with higher latency.
✅ Correct Answer: b) Tuple acknowledgments for at-least-once
📝 Explanation:
Anchoring tracks tuple lineages for failure recovery.
✅ Correct Answer: a) Stream-stream, stream-table
📝 Explanation:
Joins enrich streams with reference data or merge co-streams.
✅ Correct Answer: b) Streaming log collection
📝 Explanation:
Flume aggregates and moves log data reliably into HDFS.
✅ Correct Answer: b) Uneven data leading to hotspots
📝 Explanation:
Skew causes some tasks to process more data, slowing jobs.
✅ Correct Answer: a) Streams with external systems
📝 Explanation:
Connect uses connectors for scalable data import/export.
✅ Correct Answer: a) Sharing configuration across keyed streams
📝 Explanation:
It broadcasts read-only data to all tasks for joins.
✅ Correct Answer: b) HDFS or local files
📝 Explanation:
LOAD uses loader functions to read data in various formats into Pig relations.
✅ Correct Answer: a) Window emission frequency
📝 Explanation:
Triggers fire computations based on time, count, or conditions.
✅ Correct Answer: b) Key hash for reducer assignment
📝 Explanation:
HashPartitioner distributes by key hash modulo numReducers.
✅ Correct Answer: a) Reprocessing from stored logs
📝 Explanation:
Replay enables re-computation for recovery or corrections.
✅ Correct Answer: b) Data flows in streaming pipelines
📝 Explanation:
NiFi provides visual design for routing and transforming data.
✅ Correct Answer: a) Reducing shuffle
📝 Explanation:
They locally aggregate before shuffle, like mini-reducers.
✅ Correct Answer: a) Cluster metadata and leader election
📝 Explanation:
Zookeeper coordinates brokers for topics and partitions.
✅ Correct Answer: a) Stream and batch queries
📝 Explanation:
It uses continuous queries for dynamic tables.
✅ Correct Answer: a) Removing old state
📝 Explanation:
Eviction policies like time-to-live manage memory for windowed state.
✅ Correct Answer: a) DAG optimization
📝 Explanation:
Tez compiles HiveQL into DAGs with fewer stages, improving performance over chained MapReduce jobs.
✅ Correct Answer: b) Acker bolts for reliability
📝 Explanation:
Ackers track tuple trees for at-least-once delivery.
✅ Correct Answer: a) Source/sink integrations
📝 Explanation:
Connectors link pipelines to systems like Kafka or GCS.
✅ Correct Answer: b) Restarting failed tasks from lineage
📝 Explanation:
Deterministic tasks allow re-execution from input splits.
✅ Correct Answer: b) Broadcasting to multiple consumers
📝 Explanation:
Fan-out duplicates streams for parallel processing or routing.
✅ Correct Answer: a) Messaging system with multi-tenancy
📝 Explanation:
Pulsar separates compute from storage for scalable streaming.
✅ Correct Answer: b) SIMD on columns
📝 Explanation:
Vectorized execution applies SIMD CPU instructions to batches of column values at once.
✅ Correct Answer: a) Duplicates
📝 Explanation:
It uses sequence numbers for exactly-once writes.
✅ Correct Answer: a) Low-level stream access with timers
📝 Explanation:
Process functions provide event-time control and side effects.
✅ Correct Answer: a) Global metrics tracking
📝 Explanation:
Counters aggregate job statistics across tasks.
✅ Correct Answer: b) Duplicates using keys/windows
📝 Explanation:
It ensures uniqueness within time or key scopes.
✅ Correct Answer: a) Storm replacement for real-time
📝 Explanation:
Heron improves Storm with better scheduling and metrics.
✅ Correct Answer: a) Joins by skipping non-matches
📝 Explanation:
They probabilistically test membership to reduce I/O.
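A tiny Bloom filter sketch (two hash positions and a fixed bit array; real implementations derive both from the target false-positive rate):

```python
class BloomFilter:
    def __init__(self, size=64):
        self.size = size
        self.bits = [False] * size

    def _positions(self, item):
        # Two cheap positions derived from Python's hash();
        # production filters use independent, stable hash functions.
        h = hash(item)
        return [h % self.size, (h // self.size) % self.size]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        # False means definitely absent; True means *possibly* present.
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
bf.add("key-42")
print(bf.might_contain("key-42"))  # True
print(bf.might_contain("key-99"))  # almost certainly False -> skip the probe
```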
✅ Correct Answer: a) Exactly-once across topics
📝 Explanation:
Transactions atomically produce to multiple partitions.
✅ Correct Answer: a) Append, complete, update
📝 Explanation:
Modes define how results are emitted for different queries.
✅ Correct Answer: b) Split one stream to multiple paths
📝 Explanation:
Fork duplicates for parallel or conditional routing.
✅ Correct Answer: b) Batch workflows
📝 Explanation:
Airflow uses DAGs for orchestrating complex batch pipelines.
✅ Correct Answer: a) Complex Event Processing
📝 Explanation:
CEP detects patterns in event streams for anomaly detection.
✅ Correct Answer: b) Value order within keys
📝 Explanation:
It sorts both key and value for ordered reducers.
✅ Correct Answer: a) Overload
📝 Explanation:
It caps ingestion rates for system stability.
✅ Correct Answer: a) Streaming service
📝 Explanation:
Kinesis captures and processes real-time data at scale.
✅ Correct Answer: a) Broadcast small tables
📝 Explanation:
Broadcast avoids skew by sending small sides to all nodes.
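The broadcast-join idea in plain Python: build a hash map from the small side once, then stream the large side past it, so the large side is never shuffled and no single key becomes a hotspot:

```python
small = [(1, "bronze"), (2, "silver"), (3, "gold")]       # dimension table
large = [(2, "order-a"), (3, "order-b"), (2, "order-c")]  # fact stream

# "Broadcast": every worker receives its own copy of this map.
lookup = dict(small)

# Each worker joins its slice of the large side locally.
joined = [(k, order, lookup[k]) for k, order in large if k in lookup]
print(joined)
# [(2, 'order-a', 'silver'), (3, 'order-b', 'gold'), (2, 'order-c', 'silver')]
```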
✅ Correct Answer: a) Central schema management for topics
📝 Explanation:
It enforces and evolves schemas for Avro/Protobuf in Kafka.
✅ Correct Answer: b) Specific fields
📝 Explanation:
It hashes selected fields for consistent routing.
✅ Correct Answer: a) Non-blocking external calls
📝 Explanation:
Async functions enrich streams with async DB lookups.
✅ Correct Answer: a) Shuffles by key
📝 Explanation:
It groups values per key, often expensive due to shuffle.
✅ Correct Answer: a) Precomputed for fast queries
📝 Explanation:
They cache incremental results for low-latency access.
✅ Correct Answer: a) acks=0,1,all
📝 Explanation:
Acks configure write acknowledgments for throughput vs. durability.
✅ Correct Answer: a) Kafka, Flume, TCP
📝 Explanation:
Receivers ingest data from sources such as Kafka, Flume, or TCP sockets to feed DStreams.
✅ Correct Answer: a) MEMORY_ONLY
📝 Explanation:
Persist levels control storage for reused RDDs.
✅ Correct Answer: a) Merging small batches
📝 Explanation:
Coalesce reduces partitions for efficiency.
✅ Correct Answer: a) Unified stream and batch
📝 Explanation:
Apex uses YARN for resilient, stateful processing.
✅ Correct Answer: a) Custom metrics
📝 Explanation:
User-defined counters monitor job progress.
✅ Correct Answer: a) Adding context via joins
📝 Explanation:
Enrichment joins streams with external data to add context for deeper insights.
✅ Correct Answer: a) Group membership changes
📝 Explanation:
Rebalance redistributes partitions among consumers.
✅ Correct Answer: a) Distributed learning on streams
📝 Explanation:
It trains models incrementally from continuous data.
✅ Correct Answer: b) Subset for approximation
📝 Explanation:
Sampling reduces compute for large datasets.
✅ Correct Answer: a) Round-robin
📝 Explanation:
Shuffle evenly distributes tuples for load balancing.
✅ Correct Answer: a) Replication and snapshots
📝 Explanation:
It recovers state from backups on failure.
✅ Correct Answer: a) RDDs for transformations
📝 Explanation:
Core Spark processes finite datasets with actions.
✅ Correct Answer: a) Changelog stream
📝 Explanation:
KTables model updatable tables from compacted topics.
✅ Correct Answer: a) Data distribution for parallelism
📝 Explanation:
It splits work across nodes for scalability.
✅ Correct Answer: a) Manual state backups
📝 Explanation:
Savepoints allow upgrades and migrations.
✅ Correct Answer: a) Read-only files across nodes
📝 Explanation:
It avoids shipping large jars or data per task.
✅ Correct Answer: a) Throughput, latency
📝 Explanation:
Metrics help tune and detect issues in pipelines.
✅ Correct Answer: a) Lightweight stream processing
📝 Explanation:
Gearpump uses the actor model for low-latency stream processing.
✅ Correct Answer: a) Storage and I/O
📝 Explanation:
Codecs such as Snappy compress intermediate data, cutting storage and I/O.
✅ Correct Answer: a) For failed messages
📝 Explanation:
DLQ stores unprocessable records for later inspection.
✅ Correct Answer: a) All bolts
📝 Explanation:
Global broadcasts to every downstream instance.
✅ Correct Answer: a) Hash of key
📝 Explanation:
keyBy partitions the stream by key hash, enabling stateful keyed operations.
✅ Correct Answer: a) Disjoint RDDs
📝 Explanation:
Union creates a new RDD from multiple sources.
✅ Correct Answer: a) Efficient formats like Avro
📝 Explanation:
It minimizes network overhead for records.
✅ Correct Answer: a) Reactive extensions
📝 Explanation:
Quarkus integrates Kafka for non-blocking streams.
✅ Correct Answer: a) Resources and scheduling
📝 Explanation:
It coordinated jobs and cluster resources in Hadoop 1 and was replaced by YARN in Hadoop 2.
✅ Correct Answer: a) Event to output
📝 Explanation:
End-to-end latency measures processing delay.
✅ Correct Answer: a) Actor-based streaming
📝 Explanation:
Akka Streams provides backpressure-aware flows in Scala and Java.
✅ Correct Answer: a) Joins and filters
📝 Explanation:
Indexes such as bitmap indexes speed up selective queries.
✅ Correct Answer: a) Cluster replication
📝 Explanation:
It copies topics across geo-distributed clusters.
✅ Correct Answer: a) Bounded streams
📝 Explanation:
Unified API processes finite data similarly to streams.
✅ Correct Answer: a) Aggregates by key with shuffle
📝 Explanation:
It combines values per key efficiently.
✅ Correct Answer: a) Pipeline status
📝 Explanation:
They alert on backlogs or failures.
✅ Correct Answer: a) In-memory stream processing
📝 Explanation:
Ignite accelerates streams with SQL and caching.
✅ Correct Answer: a) Memory overflows
📝 Explanation:
Spilling writes in-memory buffers to disk when they fill, preserving correctness during shuffles.
✅ Correct Answer: a) Manual partition assignment
📝 Explanation:
Manual assignment overrides automatic rebalancing for custom partition-to-consumer mapping.
✅ Correct Answer: a) Broadcast
📝 Explanation:
The 'all' grouping sends every tuple to each instance of the target bolt.
✅ Correct Answer: a) Batch processing
📝 Explanation:
DataSet handles bounded datasets with transformations.
✅ Correct Answer: a) Reservoir, stratified
📝 Explanation:
They select representative subsets.
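Reservoir sampling keeps a uniform sample of size k from a stream of unknown length in one pass; a standard sketch (Algorithm R):

```python
import random

def reservoir_sample(stream, k):
    # Keep the first k items, then replace survivors with decreasing
    # probability so every item seen is retained with probability k / n.
    sample = []
    for n, item in enumerate(stream, start=1):
        if n <= k:
            sample.append(item)
        else:
            j = random.randrange(n)  # uniform in [0, n)
            if j < k:
                sample[j] = item
    return sample

print(reservoir_sample(range(1_000_000), k=5))
```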
✅ Correct Answer: a) Input rate
📝 Explanation:
Throttling caps sources for stability.
✅ Correct Answer: a) Backpressure protocol
📝 Explanation:
It standardizes async stream processing.
✅ Correct Answer: a) Logical data chunks
📝 Explanation:
Splits define mapper inputs, not always block-aligned.
✅ Correct Answer: a) Thresholds on metrics
📝 Explanation:
It notifies on anomalies like high latency.
✅ Correct Answer: a) Reactive toolkit
📝 Explanation:
Vert.x handles event-driven streams with non-blocking I/O.
✅ Correct Answer: a) Group-wise join
📝 Explanation:
Cogroup iterates over grouped key-value pairs.
✅ Correct Answer: a) Old logs after time/size
📝 Explanation:
It bounds storage for topics.
✅ Correct Answer: a) Schedules callbacks
📝 Explanation:
Timers fire on event or processing time.
✅ Correct Answer: a) Removes duplicates
📝 Explanation:
Distinct shuffles data to eliminate duplicates, keeping only unique values.
✅ Correct Answer: a) Prometheus, Grafana
📝 Explanation:
They visualize stream metrics.
✅ Correct Answer: a) Reactive web/streams
📝 Explanation:
Ratpack uses Netty for async processing.
✅ Correct Answer: a) Globally by key
📝 Explanation:
It shuffles data to produce a globally sorted order by key.
✅ Correct Answer: a) Plugins for sources/sinks
📝 Explanation:
They standardize integrations.
✅ Correct Answer: a) User-defined routing
📝 Explanation:
It implements logic for tuple distribution.
✅ Correct Answer: a) Execution environment
📝 Explanation:
StreamExecutionEnvironment configures jobs.
✅ Correct Answer: a) Variable elements per input
📝 Explanation:
It explodes or flattens collections.
✅ Correct Answer: a) Horizontal scaling
📝 Explanation:
Horizontal scaling adds nodes to increase throughput.
✅ Correct Answer: a) Reactive observables
📝 Explanation:
RxJava handles async sequences with backpressure.