130 Big Data Storage and Data Processing MCQs

Category: 1000 Big Data Technologies MCQ | Published: November 1, 2025 | Posted by: MCQs Generator

130 multiple-choice questions designed to test and deepen understanding of Big Data storage mechanisms, including distributed file systems, NoSQL databases, and data lakes, alongside data processing paradigms like batch processing, stream processing, and frameworks such as Hadoop MapReduce and Apache Spark.

1. What is the primary storage system in the Hadoop ecosystem?

a) HBase

b) HDFS

c) Hive

d) Pig

✅ Correct Answer: b) HDFS

📝 Explanation:

Hadoop Distributed File System (HDFS) is designed for storing large datasets across distributed clusters with high fault tolerance via data replication.

2. In HDFS, what is the default replication factor for data blocks?

a) 1

b) 2

c) 3

d) 4

✅ Correct Answer: c) 3

📝 Explanation:

The default replication factor of 3 ensures data durability by maintaining three copies of each block across different nodes.

3. Which NoSQL database is column-oriented and part of the Hadoop ecosystem?

a) MongoDB

b) Cassandra

c) HBase

d) Redis

✅ Correct Answer: c) HBase

📝 Explanation:

HBase is a distributed, scalable, big data store modeled after Google's Bigtable, providing random access to HDFS data.

4. What is a Data Lake in Big Data storage?

a) A structured relational database

b) A centralized repository for raw data in native format

c) A real-time processing engine

d) A batch ETL tool

✅ Correct Answer: b) A centralized repository for raw data in native format

📝 Explanation:

Data Lakes store raw, unprocessed data from various sources, supporting schema-on-read for flexible analysis.

5. Which storage format is columnar and optimized for analytical queries in Big Data?

a) Avro

b) JSON

c) Parquet

d) CSV

✅ Correct Answer: c) Parquet

📝 Explanation:

Apache Parquet is a columnar storage format that supports efficient compression and encoding for complex data types.

6. In MapReduce, what is the role of the Mapper?

a) To aggregate data

b) To process input data into key-value pairs

c) To sort data

d) To store output

✅ Correct Answer: b) To process input data into key-value pairs

📝 Explanation:

The Mapper phase filters and transforms input data into intermediate key-value pairs for parallel processing.

7. What is the default block size in HDFS?

a) 64 MB

b) 128 MB

c) 256 MB

d) 512 MB

✅ Correct Answer: b) 128 MB

📝 Explanation:

The default block size of 128 MB (Hadoop 2.x and later; Hadoop 1.x used 64 MB) optimizes for large file storage and sequential access in distributed environments.
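
As a rough worked example of how block size and replication interact, the sketch below estimates the number of blocks and the raw storage consumed by a single file, assuming the 128 MB block size and replication factor of 3 quoted in the questions above. The helper name `hdfs_footprint` is hypothetical and not part of any Hadoop API.

```python
import math

BLOCK_SIZE_MB = 128      # default block size quoted above
REPLICATION_FACTOR = 3   # default replication factor quoted above

def hdfs_footprint(file_size_mb, block_size_mb=BLOCK_SIZE_MB,
                   replication=REPLICATION_FACTOR):
    """Return (number of blocks, raw storage in MB) for one file.

    Hypothetical helper for illustration. The last block is not padded,
    so raw storage is the file size times the replication factor.
    """
    blocks = math.ceil(file_size_mb / block_size_mb)
    raw_storage_mb = file_size_mb * replication
    return blocks, raw_storage_mb

blocks, raw_mb = hdfs_footprint(1000)
print(blocks, raw_mb)  # 8 blocks, 3000 MB of raw storage
```

A 1000 MB file therefore occupies 8 blocks (7 full ones and a partial one), and the cluster stores 3000 MB in total across its replicas.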

8. Which processing framework uses Directed Acyclic Graphs (DAGs) for execution?

a) MapReduce

b) Spark

c) Tez

d) Both b and c

✅ Correct Answer: d) Both b and c

📝 Explanation:

Both Apache Spark and Tez use DAGs to optimize execution plans beyond the rigid MapReduce model.

9. What is Apache Cassandra known for in Big Data storage?

a) Document storage

b) Wide-column store with high availability

c) Key-value caching

d) Graph relationships

✅ Correct Answer: b) Wide-column store with high availability

📝 Explanation:

Cassandra is a distributed NoSQL database designed for handling large amounts of data across commodity servers with no single point of failure.

10. In Spark, what is an RDD?

a) A relational database

b) Resilient Distributed Dataset

c) Remote Data Driver

d) Random Data Distributor

✅ Correct Answer: b) Resilient Distributed Dataset

📝 Explanation:

RDDs are immutable, partitioned collections of records that provide fault tolerance through lineage.

11. Which storage solution is best for hierarchical data in Big Data?

a) Key-value stores like Redis

b) Document stores like MongoDB

c) Column-family stores like Cassandra

d) Graph databases like Neo4j

✅ Correct Answer: b) Document stores like MongoDB

📝 Explanation:

Document databases store data in flexible, JSON-like documents, ideal for semi-structured hierarchical data.

12. What is the purpose of the Shuffle phase in MapReduce?

a) To map keys

b) To group and sort data for reducers

c) To reduce output

d) To replicate data

✅ Correct Answer: b) To group and sort data for reducers

📝 Explanation:

The Shuffle phase transfers mapped data from mappers to reducers, grouping by key for aggregation.

13. Which file format supports schema evolution in Big Data storage?

a) Parquet

b) ORC

c) Avro

d) SequenceFile

✅ Correct Answer: c) Avro

📝 Explanation:

Avro's compact binary format includes embedded schema information, allowing evolution without data rewriting.

14. What is Apache Hive used for in data processing?

a) Real-time querying

b) SQL-like querying on HDFS data

c) Graph processing

d) Machine learning

✅ Correct Answer: b) SQL-like querying on HDFS data

📝 Explanation:

Hive provides HiveQL for declarative querying and analysis of large datasets in HDFS.

15. In HDFS, the NameNode is responsible for:

a) Storing data blocks

b) Managing metadata and namespace

c) Processing MapReduce jobs

d) Replicating data

✅ Correct Answer: b) Managing metadata and namespace

📝 Explanation:

The NameNode maintains the file system namespace and metadata in memory for fast access.

16. Which processing model handles both batch and stream data in a unified way?

a) Lambda Architecture

b) Kappa Architecture

c) MapReduce

d) Tez

✅ Correct Answer: b) Kappa Architecture

📝 Explanation:

Kappa uses stream processing for everything, replaying historical data from logs for batch-like computations.

17. What is the role of DataNodes in HDFS?

a) Metadata management

b) Storing and retrieving data blocks

c) Job scheduling

d) User authentication

✅ Correct Answer: b) Storing and retrieving data blocks

📝 Explanation:

DataNodes manage storage on individual machines, handling read/write requests for blocks.

18. Which NoSQL type is best for caching and session storage?

a) Document

b) Column-family

c) Key-value

d) Graph

✅ Correct Answer: c) Key-value

📝 Explanation:

Key-value stores like Redis provide fast lookups for simple data types, ideal for caching.

19. In Spark, what enables in-memory processing?

a) DAG execution

b) RDD caching

c) MapReduce integration

d) HDFS blocks

✅ Correct Answer: b) RDD caching

📝 Explanation:

Spark's RDDs can be persisted in memory, reducing disk I/O for iterative algorithms.

20. What is ORC file format optimized for?

a) Row storage

b) Hive queries with compression and indexing

c) Streaming ingestion

d) Transaction processing

✅ Correct Answer: b) Hive queries with compression and indexing

📝 Explanation:

Optimized Row Columnar (ORC) format supports predicate pushdown and advanced compression for analytics.

21. What is the Reduce phase in MapReduce responsible for?

a) Input splitting

b) Aggregating shuffled data

c) Data replication

d) Block storage

✅ Correct Answer: b) Aggregating shuffled data

📝 Explanation:

Reducers perform final aggregation on grouped key-value pairs to produce output.
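
The Map, Shuffle, and Reduce phases covered in the questions above can be sketched in plain Python as a word count. This is a single-process illustration only, assuming a tiny in-memory input; the function names are hypothetical and it does not use the Hadoop API.

```python
from collections import defaultdict

def mapper(line):
    """Map: emit (word, 1) for every word in one input line."""
    for word in line.lower().split():
        yield word, 1

def shuffle(pairs):
    """Shuffle: group values by key and hand keys over in sorted order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reducer(key, values):
    """Reduce: aggregate all values for one key."""
    return key, sum(values)

lines = ["big data big clusters", "data lakes store raw data"]
mapped = [pair for line in lines for pair in mapper(line)]
result = dict(reducer(key, values) for key, values in shuffle(mapped))
print(result)
```

In a real cluster each phase runs on many machines in parallel; here the three functions only show how data changes shape between phases.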

22. Which tool is used for data ingestion into HDFS?

a) Sqoop

b) Flume

c) Both a and b

d) Hive

✅ Correct Answer: c) Both a and b

📝 Explanation:

Sqoop imports from relational DBs, while Flume handles streaming data like logs.

23. What is schema-on-read in Big Data storage?

a) Enforcing schema on write

b) Applying schema during query

c) No schema needed

d) Schema on ingestion

✅ Correct Answer: b) Applying schema during query

📝 Explanation:

Schema-on-read allows flexible ingestion of raw data, interpreting structure at analysis time.

24. Apache Flink is primarily used for:

a) Batch processing only

b) Unified batch and stream processing

c) Storage management

d) Graph databases

✅ Correct Answer: b) Unified batch and stream processing

📝 Explanation:

Flink treats batch as bounded streams, offering low-latency and stateful computations.

25. What is rack awareness in HDFS?

a) Placing replicas across racks for fault tolerance

b) Single rack storage

c) Data compression

d) Block splitting

✅ Correct Answer: a) Placing replicas across racks for fault tolerance

📝 Explanation:

Rack awareness optimizes data locality and ensures replicas are not lost in rack failures.

26. Which database is graph-oriented for Big Data?

a) MongoDB

b) Neo4j

c) HBase

d) Redis

✅ Correct Answer: b) Neo4j

📝 Explanation:

Neo4j is a graph database that stores data as nodes and relationships for complex queries.

27. In Spark Streaming, data is processed using:

a) Micro-batches

b) Continuous streams

c) Batch jobs only

d) MapReduce

✅ Correct Answer: a) Micro-batches

📝 Explanation:

Spark Streaming discretizes streams into small batches for near-real-time processing.
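
The idea of discretizing a stream can be illustrated with a small generator that cuts an iterator into consecutive batches. Note the assumption: Spark Streaming forms batches by time interval, whereas this sketch batches by record count to stay deterministic. The name `micro_batches` is hypothetical.

```python
from itertools import islice

def micro_batches(stream, batch_size):
    """Yield consecutive lists of up to batch_size records from a stream."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch

events = (f"event-{i}" for i in range(7))   # stand-in for an unbounded source
for batch in micro_batches(events, batch_size=3):
    print(len(batch), batch)
```

Each yielded list can then be processed like an ordinary small batch job, which is the essence of the micro-batch model.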

28. What is the purpose of Combiners in MapReduce?

a) To reduce network traffic by local aggregation

b) To split input

c) To store data

d) To schedule jobs

✅ Correct Answer: a) To reduce network traffic by local aggregation

📝 Explanation:

Combiners act as mini-reducers on mapper output to minimize data shuffled to reducers.
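
A minimal sketch of the effect, assuming a word-count style job: the combiner sums counts on one mapper's output, so fewer records cross the network. The function names are hypothetical and this is not Hadoop code.

```python
from collections import Counter

def map_words(line):
    """Mapper: emit (word, 1) per word."""
    return [(word, 1) for word in line.split()]

def combine(pairs):
    """Combiner: sum counts per key on the mapper's own output."""
    local = Counter()
    for key, value in pairs:
        local[key] += value
    return list(local.items())

split = "error error warn error info warn"   # one mapper's input split
raw_pairs = map_words(split)
combined_pairs = combine(raw_pairs)

print(len(raw_pairs), "records without a combiner")   # 6
print(len(combined_pairs), "records with a combiner") # 3
```

This shortcut is only safe when the reduce function is associative and commutative (such as a sum), because the framework may run the combiner zero, one, or several times.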

29. Which storage is used for time-series data in Big Data?

a) Relational DB

b) InfluxDB or OpenTSDB

c) Document store

d) Key-value cache

✅ Correct Answer: b) InfluxDB or OpenTSDB

📝 Explanation:

Time-series databases like InfluxDB optimize for high-ingestion rates and timestamped data.

30. Apache Tez improves upon MapReduce by:

a) Using DAGs for efficient execution

b) In-memory processing

c) Stream handling

d) NoSQL integration

✅ Correct Answer: a) Using DAGs for efficient execution

📝 Explanation:

Tez allows complex workflows as DAGs, reducing job stages and latency.

31. What is a Partition in Spark?

a) A full dataset

b) A unit of parallelism for RDDs

c) A storage block

d) A query result

✅ Correct Answer: b) A unit of parallelism for RDDs

📝 Explanation:

Partitions divide data across the cluster, enabling parallel computation.

32. Which format is sequence-based in Hadoop?

a) Parquet

b) Avro

c) SequenceFile

d) ORC

✅ Correct Answer: c) SequenceFile

📝 Explanation:

SequenceFile is a flat file format for key-value pairs, used for intermediate MapReduce data.

33. What is the serving layer in Lambda Architecture?

a) Raw storage

b) Merging batch and real-time views

c) Stream ingestion

d) Batch computation

✅ Correct Answer: b) Merging batch and real-time views

📝 Explanation:

The serving layer provides low-latency access by combining precomputed batch views with recent updates.

34. Apache Pig is used for:

a) SQL querying

b) Data transformation scripting

c) Real-time processing

d) Graph analytics

✅ Correct Answer: b) Data transformation scripting

📝 Explanation:

Pig Latin scripts simplify complex data flows on Hadoop for ETL processes.

35. In HDFS Federation, multiple NameNodes manage:

a) Single namespace

b) Separate namespaces

c) Data blocks only

d) Replication

✅ Correct Answer: b) Separate namespaces

📝 Explanation:

Federation scales HDFS by allowing multiple NameNodes for different namespaces on shared DataNodes.

36. What is Storm used for in data processing?

a) Batch analytics

b) Real-time stream processing

c) Storage

d) Machine learning

✅ Correct Answer: b) Real-time stream processing

📝 Explanation:

Apache Storm processes unbounded data streams with low latency for applications like fraud detection.

37. Which database supports ACID transactions in Big Data?

a) Cassandra

b) MongoDB

c) NewSQL like CockroachDB

d) Redis

✅ Correct Answer: c) NewSQL like CockroachDB

📝 Explanation:

NewSQL databases provide distributed scalability with full SQL and ACID guarantees.

38. In Spark SQL, data is processed using:

a) DataFrames

b) RDDs only

c) Maps

d) Lists

✅ Correct Answer: a) DataFrames

📝 Explanation:

DataFrames offer structured APIs with optimizations like Catalyst for SQL queries.

39. What is sharding in Big Data storage?

a) Data replication

b) Horizontal partitioning across nodes

c) Vertical scaling

d) Caching

✅ Correct Answer: b) Horizontal partitioning across nodes

📝 Explanation:

Sharding distributes data subsets to balance load and improve scalability in NoSQL systems.

40. Apache Kafka is primarily a:

a) Storage database

b) Distributed streaming platform

c) Batch processor

d) Query engine

✅ Correct Answer: b) Distributed streaming platform

📝 Explanation:

Kafka handles high-throughput event streaming, serving as a message broker for pipelines.

41. What is lazy evaluation in Spark?

a) Immediate computation

b) Recording transformations until an action

c) Eager caching

d) Synchronous execution

✅ Correct Answer: b) Recording transformations until an action

📝 Explanation:

With lazy evaluation, Spark records transformations in a DAG and computes results only when an action triggers execution, which lets it optimize the whole plan first.

42. Which tool orchestrates workflows in Hadoop?

a) Oozie

b) Sqoop

c) Flume

d) Hive

✅ Correct Answer: a) Oozie

📝 Explanation:

Apache Oozie schedules and manages Hadoop job workflows, including dependencies.

43. What is a hot spot in data partitioning?

a) Evenly distributed data

b) Overloaded partition due to skew

c) Cold storage area

d) Archived data

✅ Correct Answer: b) Overloaded partition due to skew

📝 Explanation:

Data skew causes hot spots, leading to uneven load and performance bottlenecks.

44. Apache Beam provides:

a) Unified model for batch and stream

b) Storage only

c) Graph processing

d) NoSQL queries

✅ Correct Answer: a) Unified model for batch and stream

📝 Explanation:

Beam is a portable API for defining pipelines executable on multiple runners like Flink or Spark.

45. In HBase, data is stored in:

a) Rows only

b) Column families

c) Documents

d) Graphs

✅ Correct Answer: b) Column families

📝 Explanation:

HBase organizes data into column families for sparse, wide tables with dynamic columns.

46. What is the Combiner in MapReduce similar to?

a) Reducer

b) Mapper

c) Partitioner

d) Shuffler

✅ Correct Answer: a) Reducer

📝 Explanation:

Combiners run locally after mapping to pre-aggregate data, like a reducer.

47. Which storage supports vector databases for Big Data?

a) Pinecone

b) Cassandra

c) HDFS

d) Parquet

✅ Correct Answer: a) Pinecone

📝 Explanation:

Vector databases like Pinecone store embeddings for similarity searches in ML applications.

48. Spark's MLlib is for:

a) Machine learning algorithms

b) SQL processing

c) Streaming

d) Graph processing

✅ Correct Answer: a) Machine learning algorithms

📝 Explanation:

MLlib provides scalable ML tools like classification, regression, and clustering.

49. What is eventual consistency in distributed storage?

a) Immediate consistency

b) Consistency after updates propagate

c) No consistency

d) Strong consistency

✅ Correct Answer: b) Consistency after updates propagate

📝 Explanation:

Eventual consistency prioritizes availability, ensuring reads eventually reflect writes.

50. Apache Samza processes data using:

a) Kafka streams

b) HDFS batches

c) MapReduce

d) SQL

✅ Correct Answer: a) Kafka streams

📝 Explanation:

Samza is a stream processing framework integrated with Kafka for fault-tolerant processing.

51. What is bloom filter used for in storage?

a) Data compression

b) Probabilistic membership testing

c) Encryption

d) Indexing

✅ Correct Answer: b) Probabilistic membership testing

📝 Explanation:

Bloom filters quickly test whether an element may be in a set. They never produce false negatives, but they can return false positives at a small, tunable rate.
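
A toy Bloom filter makes the one-sided error concrete: an added item is always reported as possibly present, while an absent item is usually, but not always, reported as absent. The class below is a minimal sketch; the bit-array size, number of hashes, and the salted SHA-256 scheme are illustrative assumptions, not what any particular database uses.

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: k hash positions per item in an m-bit array."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits

    def _positions(self, item):
        # Derive k positions by salting one SHA-256 hash with the index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        # False means definitely absent; True means possibly present.
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
for row_key in ["user:1", "user:2", "user:3"]:
    bf.add(row_key)

print(bf.might_contain("user:2"))   # True: added items are never missed
```

Storage engines use this property to skip files that definitely do not contain a requested key.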

52. In YARN, containers are used for:

a) Data storage

b) Resource allocation for tasks

c) Metadata

d) Replication

✅ Correct Answer: b) Resource allocation for tasks

📝 Explanation:

YARN allocates CPU/memory via containers to applications like MapReduce or Spark.

53. Which format is optimized for Delta Lake?

a) Parquet with ACID support

b) Avro

c) JSON

d) CSV

✅ Correct Answer: a) Parquet with ACID support

📝 Explanation:

Delta Lake uses Parquet files with transaction logs for reliable data lakes.

54. What is the Partitioner in MapReduce?

a) Groups data by key

b) Decides reducer assignment

c) Compresses data

d) Splits input

✅ Correct Answer: b) Decides reducer assignment

📝 Explanation:

The default Partitioner hashes each key to choose a reducer, so all records with the same key go to the same reducer and load is spread across reducers.
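
Hash partitioning can be sketched as "stable hash of the key, modulo the number of reducers". The example below is an illustration in Python, not Hadoop's Java implementation; it assumes `zlib.crc32` as the hash because Python's built-in `hash()` for strings varies between processes.

```python
import zlib

def partition(key, num_reducers):
    """Pick a reducer index from a stable hash of the key."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

keys = ["apple", "banana", "cherry", "apple", "banana"]
assignment = {key: partition(key, 4) for key in keys}
print(assignment)
```

Because the function is deterministic, every occurrence of "apple" lands on the same reducer, which is what allows that reducer to see all values for the key.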

55. Apache Accumulo is a:

a) Key-value store with cell-level security

b) Document database

c) Graph store

d) Column store without security

✅ Correct Answer: a) Key-value store with cell-level security

📝 Explanation:

Accumulo provides fine-grained access control at the cell level for sensitive data.

56. Spark's GraphX is for:

a) Graph processing

b) SQL

c) Streaming

d) ML

✅ Correct Answer: a) Graph processing

📝 Explanation:

GraphX extends RDDs for graph analytics like PageRank and connected components.

57. What is compaction in NoSQL storage?

a) Data deletion

b) Merging small files for efficiency

c) Replication

d) Sharding

✅ Correct Answer: b) Merging small files for efficiency

📝 Explanation:

Compaction reduces storage overhead and improves read performance by rewriting data.

58. Apache NiFi is for:

a) Data flow automation

b) Batch processing

c) Storage

d) Querying

✅ Correct Answer: a) Data flow automation

📝 Explanation:

NiFi automates data routing and transformation between systems with visual design.

59. What is a Tombstone in Cassandra?

a) Deleted record marker

b) Active record

c) Index entry

d) Replication log

✅ Correct Answer: a) Deleted record marker

📝 Explanation:

Tombstones mark deletions for eventual consistency, preventing resurrection of old data.

60. In Spark, transformations are:

a) Lazy

b) Eager

c) Blocking

d) Synchronous

✅ Correct Answer: a) Lazy

📝 Explanation:

Transformations define the DAG but don't compute until an action is called.

61. What is Iceberg in Big Data storage?

a) Table format for data lakes

b) Compression tool

c) Stream processor

d) Query engine

✅ Correct Answer: a) Table format for data lakes

📝 Explanation:

Apache Iceberg is an open table format that adds schema evolution and time travel to data lake tables.

62. The InputFormat in MapReduce handles:

a) Output writing

b) Input splitting and record reading

c) Reduction

d) Shuffling

✅ Correct Answer: b) Input splitting and record reading

📝 Explanation:

InputFormat divides input into splits and provides RecordReaders for key-value pairs.

63. Which is a wide-column store?

a) MongoDB

b) Bigtable

c) Redis

d) Neo4j

✅ Correct Answer: b) Bigtable

📝 Explanation:

Bigtable-inspired stores like HBase support sparse data with many columns per row.

64. Apache Apex is for:

a) Stream and batch processing

b) Storage

c) Graph

d) ML

✅ Correct Answer: a) Stream and batch processing

📝 Explanation:

Apex provides a unified engine for real-time and batch data processing.

65. What is denormalization in NoSQL?

a) Normalizing data

b) Duplicating data for read performance

c) Encrypting data

d) Compressing data

✅ Correct Answer: b) Duplicating data for read performance

📝 Explanation:

Denormalization trades storage for faster queries in distributed systems.

66. In Spark, actions trigger:

a) Transformations

b) Computation

c) Caching

d) Partitioning

✅ Correct Answer: b) Computation

📝 Explanation:

Actions like collect() or count() execute the lazy DAG and return results.
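
The split between lazy transformations and eager actions can be imitated in a few lines of plain Python. The `LazyDataset` class below is a hypothetical toy, not the Spark API: `map` and `filter` only record steps, and `collect` runs them.

```python
class LazyDataset:
    """Toy stand-in for an RDD: transformations are recorded, not run."""

    def __init__(self, data, steps=()):
        self._data = data
        self._steps = steps          # recorded lineage of transformations

    def map(self, fn):
        return LazyDataset(self._data, self._steps + (("map", fn),))

    def filter(self, fn):
        return LazyDataset(self._data, self._steps + (("filter", fn),))

    def collect(self):
        """Action: only now is the recorded pipeline executed."""
        rows = iter(self._data)
        for kind, fn in self._steps:
            rows = map(fn, rows) if kind == "map" else filter(fn, rows)
        return list(rows)

calls = []

def square(x):
    calls.append(x)      # side effect lets us observe when work happens
    return x * x

pipeline = LazyDataset(range(5)).map(square).filter(lambda v: v % 2 == 0)
print(len(calls))          # 0: nothing has run yet
print(pipeline.collect())  # [0, 4, 16]
print(len(calls))          # 5: the action triggered the computation
```

The recorded steps also hint at how lineage works: the result can always be recomputed from the source data and the list of transformations.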

67. What is Hudi for data lakes?

a) Upserts and incremental processing

b) Batch only

c) Storage only

d) Querying

✅ Correct Answer: a) Upserts and incremental processing

📝 Explanation:

Apache Hudi enables update/delete operations and time-travel in data lakes.

68. The OutputFormat in MapReduce writes:

a) Input data

b) Reducer output to storage

c) Intermediate data

d) Metadata

✅ Correct Answer: b) Reducer output to storage

📝 Explanation:

OutputFormat and RecordWriter handle final data commit to HDFS or other sinks.

69. Which is a key-value store?

a) DynamoDB

b) MongoDB

c) Cassandra

d) Neo4j

✅ Correct Answer: a) DynamoDB

📝 Explanation:

DynamoDB is Amazon's managed NoSQL key-value and document store.

70. Apache Giraph is for:

a) Graph processing on Hadoop

b) Streaming

c) Storage

d) SQL

✅ Correct Answer: a) Graph processing on Hadoop

📝 Explanation:

Giraph implements the Pregel model for large-scale graph computations such as social network analysis.

71. What is gossip protocol in storage systems?

a) Failure detection and data dissemination

b) Encryption

c) Compression

d) Indexing

✅ Correct Answer: a) Failure detection and data dissemination

📝 Explanation:

Gossip protocols enable decentralized communication in systems like Cassandra.

72. In Spark, broadcast variables are for:

a) Sharing read-only data efficiently

b) Writing data

c) Partitioning

d) Shuffling

✅ Correct Answer: a) Sharing read-only data efficiently

📝 Explanation:

Broadcast variables cache a read-only value across nodes, avoiding repeated shipping.

73. Apache Kudu is designed for:

a) Fast analytics on changing data

b) Static storage

c) Graph data

d) Documents

✅ Correct Answer: a) Fast analytics on changing data

📝 Explanation:

Kudu supports low-latency random access and updates alongside analytics workloads.

74. What is anti-entropy in distributed storage?

a) Repairing replica inconsistencies

b) Data deletion

c) Compression

d) Sharding

✅ Correct Answer: a) Repairing replica inconsistencies

📝 Explanation:

Anti-entropy mechanisms like Merkle trees detect and fix divergent replicas.
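
A small sketch of the Merkle-tree comparison, assuming each replica holds the same ordered list of key-value strings: if the root hashes differ, at least one entry has diverged. The helper names are hypothetical, and real systems compare subtrees to locate the differing range rather than only the root.

```python
import hashlib

def h(data):
    """SHA-256 hex digest of a string."""
    return hashlib.sha256(data.encode()).hexdigest()

def merkle_root(values):
    """Hash leaves, then hash pairs upward until one root remains.

    Assumes a non-empty list of values.
    """
    level = [h(v) for v in values]
    while len(level) > 1:
        if len(level) % 2:              # duplicate the last hash on odd levels
            level.append(level[-1])
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

replica_a = ["k1=v1", "k2=v2", "k3=v3", "k4=v4"]
replica_b = ["k1=v1", "k2=v2", "k3=STALE", "k4=v4"]

in_sync = merkle_root(replica_a) == merkle_root(replica_b)
print(in_sync)  # False: the replicas need repair
```

Comparing one hash per replica is far cheaper than shipping all the data, which is why this structure suits background repair.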

75. Apache Crunch is a:

a) Processing pipeline library

b) Storage format

c) Query language

d) Scheduler

✅ Correct Answer: a) Processing pipeline library

📝 Explanation:

Crunch simplifies MapReduce pipelines with high-level abstractions.

76. What is LSM-tree in storage?

a) Log-structured merge-tree for writes

b) B-tree alternative

c) Hash index

d) Graph structure

✅ Correct Answer: a) Log-structured merge-tree for writes

📝 Explanation:

LSM-trees optimize for high write throughput by buffering writes in memory, flushing them to disk as sorted files, and merging those files later in the background.
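
The write path of an LSM-tree can be sketched with a dictionary as the in-memory buffer and sorted lists standing in for on-disk files. `TinyLSM` is a hypothetical toy with an artificially small flush threshold; it omits the write-ahead log, Bloom filters, and leveled layout that real engines add.

```python
class TinyLSM:
    """Toy LSM-tree: an in-memory memtable plus immutable sorted runs."""

    def __init__(self, memtable_limit=2):
        self.memtable = {}
        self.runs = []                 # newest run last
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value     # writes only touch memory
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # Write the memtable out as one sorted, immutable run.
        self.runs.append(sorted(self.memtable.items()))
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for run in reversed(self.runs):          # newest data wins
            for k, v in run:
                if k == key:
                    return v
        return None

    def compact(self):
        # Merge all runs into one, keeping the newest value per key.
        merged = {}
        for run in self.runs:                     # oldest to newest
            merged.update(run)
        self.runs = [sorted(merged.items())]

db = TinyLSM()
for key, value in [("a", 1), ("b", 2), ("a", 3), ("c", 4)]:
    db.put(key, value)
print(len(db.runs))   # 2 runs before compaction
db.compact()
print(db.runs)        # [[('a', 3), ('b', 2), ('c', 4)]]
```

Reads may have to check several runs, which is the cost paid for cheap sequential writes; compaction keeps that cost bounded.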

77. In Spark, accumulators are for:

a) Aggregating values across tasks

b) Broadcasting

c) Partitioning

d) Caching

✅ Correct Answer: a) Aggregating values across tasks

📝 Explanation:

Accumulators provide a way to update a variable in parallel, useful for counters.

78. Apache Drill supports:

a) Schema-free SQL on diverse sources

b) Batch only

c) Storage

d) Graph

✅ Correct Answer: a) Schema-free SQL on diverse sources

📝 Explanation:

Drill queries NoSQL, files, and cloud storage without predefined schemas.

79. What is hinted handoff in Cassandra?

a) Temporary storage for failed writes

b) Permanent storage

c) Read repair

d) Compaction

✅ Correct Answer: a) Temporary storage for failed writes

📝 Explanation:

Hinted handoff queues writes for unavailable nodes, delivering when they recover.

80. Apache Mahout is for:

a) Scalable machine learning

b) Streaming

c) Storage

d) Workflows

✅ Correct Answer: a) Scalable machine learning

📝 Explanation:

Mahout provides algorithms for recommendation, clustering, and classification on large datasets.

81. What is read repair in distributed storage?

a) Fixing inconsistencies during reads

b) Write optimization

c) Replication

d) Sharding

✅ Correct Answer: a) Fixing inconsistencies during reads

📝 Explanation:

Read repair synchronizes replicas when a read involves multiple inconsistent copies.

82. In Spark, Catalyst optimizer does what?

a) Query optimization

b) Data partitioning

c) Caching

d) Shuffling

✅ Correct Answer: a) Query optimization

📝 Explanation:

Catalyst uses rule-based and cost-based optimizations for Spark SQL plans.

83. Apache Phoenix provides:

a) SQL interface over HBase

b) Stream processing

c) Graph queries

d) Batch ETL

✅ Correct Answer: a) SQL interface over HBase

📝 Explanation:

Phoenix enables ANSI SQL on HBase with low-latency access via JDBC.

84. What is consistent hashing in storage?

a) Even data distribution for scaling

b) Random hashing

c) No hashing

d) Static partitioning

✅ Correct Answer: a) Even data distribution for scaling

📝 Explanation:

Consistent hashing minimizes data movement when nodes are added/removed.
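
The claim about minimal data movement can be checked with a small hash ring. This is a sketch under assumptions: MD5 as the ring hash, 100 virtual points per node, and hypothetical node names; production systems differ in these details.

```python
import bisect
import hashlib

def ring_hash(value):
    """Map a string to a position on the ring (a large integer)."""
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, nodes, vnodes=100):
        # Each node gets several virtual points to smooth the distribution.
        self.ring = sorted(
            (ring_hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self.points = [point for point, _ in self.ring]

    def node_for(self, key):
        # First point clockwise from the key's hash, wrapping around.
        idx = bisect.bisect(self.points, ring_hash(key)) % len(self.ring)
        return self.ring[idx][1]

keys = [f"key-{i}" for i in range(1000)]
before = HashRing(["node-a", "node-b", "node-c"])
after = HashRing(["node-a", "node-b", "node-c", "node-d"])

moved = [k for k in keys if before.node_for(k) != after.node_for(k)]
print(f"{len(moved)} of {len(keys)} keys moved after adding node-d")
```

Every key that moves ends up on the new node, and the rest stay where they were. With naive `hash(key) % n` partitioning, by contrast, changing `n` would remap most keys.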

85. Apache Parquet supports:

a) Columnar storage with nesting

b) Row storage

c) Key-value

d) Graph

✅ Correct Answer: a) Columnar storage with nesting

📝 Explanation:

Parquet efficiently stores nested data structures for analytics.

86. What is vectorization in processing?

a) Processing multiple records at once

b) Scalar only

c) Batch splitting

d) Single record

✅ Correct Answer: a) Processing multiple records at once

📝 Explanation:

Vectorization uses SIMD instructions for faster columnar processing.

87. Apache Zeppelin's primary use?

a) Interactive notebooks for data

b) Storage

c) Streaming

d) Graph

✅ Correct Answer: a) Interactive notebooks for data

📝 Explanation:

Zeppelin supports visualization and execution for Spark, Hive, etc.

88. What is quorum in Cassandra?

a) Majority replicas for consistency

b) All replicas

c) One replica

d) No replicas

✅ Correct Answer: a) Majority replicas for consistency

📝 Explanation:

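A QUORUM read or write must be acknowledged by a majority of replicas (floor(RF/2) + 1). Because Cassandra's consistency is tunable per request, using QUORUM for both reads and writes guarantees that the two sets overlap in at least one replica.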
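
The arithmetic behind quorum consistency is short enough to write out. The sketch below assumes a replication factor of 3 and uses the common rule that reads see the latest write when the read and write replica counts sum to more than the replication factor; the function names are hypothetical.

```python
def quorum(replication_factor):
    """Majority of replicas: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1

def overlaps(read_replicas, write_replicas, replication_factor):
    """True when every read set must intersect every write set (R + W > RF)."""
    return read_replicas + write_replicas > replication_factor

rf = 3
q = quorum(rf)
print(q)                      # 2
print(overlaps(q, q, rf))     # True: QUORUM reads see QUORUM writes
print(overlaps(1, 1, rf))     # False: reading and writing one replica may be stale
```

With RF = 3, a quorum of 2 tolerates one unavailable replica while still guaranteeing overlap.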

89. In Spark, Tungsten optimizes:

a) Memory and CPU usage

b) Disk I/O

c) Network

d) All

✅ Correct Answer: a) Memory and CPU usage

📝 Explanation:

Project Tungsten targets memory and CPU efficiency through off-heap memory management, cache-aware computation, and whole-stage code generation.

90. Apache Solr is for:

a) Search and indexing

b) Storage

c) Processing

d) ML

✅ Correct Answer: a) Search and indexing

📝 Explanation:

Solr is a search platform built on Lucene for full-text search.

91. What is leveling in LSM-trees?

a) Sorted merging levels

b) Random levels

c) No levels

d) Single level

✅ Correct Answer: a) Sorted merging levels

📝 Explanation:

Leveling compacts by merging into sorted runs at each level.

92. Apache Calcite is a:

a) Query optimizer framework

b) Storage engine

c) Stream processor

d) Graph library

✅ Correct Answer: a) Query optimizer framework

📝 Explanation:

Calcite provides SQL parsing and optimization for various backends.

93. What is tiering in storage?

a) Moving data between storage tiers

b) Single tier

c) No movement

d) Random tier

✅ Correct Answer: a) Moving data between storage tiers

📝 Explanation:

Tiering places hot data on fast storage, cold on cheaper.

94. In Spark, Delta Lake adds:

a) ACID transactions to data lakes

b) No transactions

c) Batch only

d) Stream only

✅ Correct Answer: a) ACID transactions to data lakes

📝 Explanation:

Delta Lake brings reliability to open formats like Parquet.

95. Apache Lucene is the core of:

a) Full-text search

b) Storage

c) Processing

d) ML

✅ Correct Answer: a) Full-text search

📝 Explanation:

Lucene provides inverted indexing for fast text retrieval.

96. What is snapshot isolation in storage?

a) Consistent view at a point in time

b) Real-time only

c) No isolation

d) Dirty reads

✅ Correct Answer: a) Consistent view at a point in time

📝 Explanation:

Snapshots allow reads without locking, seeing committed data.

97. Apache Arrow is for:

a) In-memory columnar format

b) Row format

c) Key-value

d) Graph

✅ Correct Answer: a) In-memory columnar format

📝 Explanation:

Arrow enables zero-copy data sharing between systems.

98. What is write-ahead logging (WAL)?

a) Logging changes before commit

b) Post-commit log

c) No logging

d) Read log

✅ Correct Answer: a) Logging changes before commit

📝 Explanation:

WAL ensures durability by writing each change to a persistent log before it is applied and acknowledged, so the change can be replayed after a crash.

99. In Spark, Adaptive Query Execution (AQE) does:

a) Runtime plan optimization

b) Static only

c) No optimization

d) Batch only

✅ Correct Answer: a) Runtime plan optimization

📝 Explanation:

AQE adjusts plans based on runtime statistics for better performance.

100. Apache Geode is for:

a) In-memory data grid

b) Disk storage

c) Stream

d) Graph

✅ Correct Answer: a) In-memory data grid

📝 Explanation:

Geode provides distributed caching and processing for low-latency apps.

101. What is foreign key in NoSQL?

a) Not native, emulated via app logic

b) Enforced like SQL

c) No keys

d) Primary only

✅ Correct Answer: a) Not native, emulated via app logic

📝 Explanation:

NoSQL prioritizes denormalization over joins, handling references in code.

102. Apache Ignite is a:

a) Distributed database and cache

b) Single node

c) Stream only

d) Batch only

✅ Correct Answer: a) Distributed database and cache

📝 Explanation:

Ignite supports SQL, transactions, and in-memory computing across clusters.

103. What is predicate pushdown?

a) Filtering at storage level

b) Post-processing filter

c) No filtering

d) Join only

✅ Correct Answer: a) Filtering at storage level

📝 Explanation:

Pushdown reduces data transfer by applying filters early in the pipeline.
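
One common form of pushdown relies on per-chunk min/max statistics, which columnar formats such as Parquet and ORC record. The sketch below is a toy model of that idea with made-up data and a hypothetical `scan_with_pushdown` helper; it is not a reader for either format.

```python
# Each "row group" carries min/max statistics for the year column.
row_groups = [
    {"min_year": 2018, "max_year": 2019, "rows": [(2018, "a"), (2019, "b")]},
    {"min_year": 2020, "max_year": 2021, "rows": [(2020, "c"), (2021, "d")]},
    {"min_year": 2022, "max_year": 2023, "rows": [(2022, "e"), (2023, "f")]},
]

def scan_with_pushdown(groups, year):
    """Return matching rows and how many row groups were actually read."""
    matches, groups_read = [], 0
    for group in groups:
        # Skip the whole group when its statistics rule out any match.
        if not (group["min_year"] <= year <= group["max_year"]):
            continue
        groups_read += 1
        matches.extend(row for row in group["rows"] if row[0] == year)
    return matches, groups_read

print(scan_with_pushdown(row_groups, 2021))  # ([(2021, 'd')], 1)
```

Only one of the three row groups is read, which is where the savings in I/O and data transfer come from.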

104. In Spark, Project Tungsten focuses on:

a) Performance via code generation

b) Storage

c) Networking

d) Security

✅ Correct Answer: a) Performance via code generation

📝 Explanation:

Tungsten generates JVM bytecode for efficient execution.

105. Voldemort (Project Voldemort) is:

a) Distributed key-value store

b) Document store

c) Column store

d) Graph store

✅ Correct Answer: a) Distributed key-value store

📝 Explanation:

Voldemort provides consistent hashing and partitioning for scalability.

106. What is log compaction in Kafka?

a) Retaining latest value per key

b) Full log retention

c) No compaction

d) Delete only

✅ Correct Answer: a) Retaining latest value per key

📝 Explanation:

Compaction keeps the most recent message for each key, enabling changelog use.
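
The retention rule can be sketched as "last write per key wins". The example below is a simplified model, not Kafka code: it treats a `None` value as a delete marker and assumes the marker has already passed its retention period, so the key disappears entirely. The `compact` function is hypothetical.

```python
def compact(log):
    """Keep only the latest record per key, preserving log order.

    A value of None acts as a delete marker and removes the key,
    assuming the marker itself has already expired.
    """
    latest = {}
    for offset, (key, value) in enumerate(log):
        latest[key] = (offset, value)
    survivors = sorted((off, key, val) for key, (off, val) in latest.items())
    return [(key, val) for _, key, val in survivors if val is not None]

log = [("user1", "a@x.com"), ("user2", "b@x.com"),
       ("user1", "a@y.com"), ("user2", None), ("user3", "c@x.com")]
print(compact(log))  # [('user1', 'a@y.com'), ('user3', 'c@x.com')]
```

A consumer that replays the compacted log rebuilds the current state of every key without reading the full history.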

107. Presto is a:

a) Distributed SQL query engine

b) Storage

c) Batch processor

d) ML library

✅ Correct Answer: a) Distributed SQL query engine

📝 Explanation:

Presto queries multiple data sources with low latency.

108. What is CAP theorem implication for storage?

a) Trade-offs in Consistency, Availability, Partition tolerance

b) All three always

c) No trade-offs

d) Availability only

✅ Correct Answer: a) Trade-offs in Consistency, Availability, Partition tolerance

📝 Explanation:

CAP states that when a network partition occurs, a distributed system must give up either consistency or availability; it cannot guarantee all three properties at once.

109. In Spark, DataFrames are:

a) Distributed collections of rows with named columns

b) Unstructured

c) Key-value only

d) Graphs

✅ Correct Answer: a) Distributed collections of rows with named columns

📝 Explanation:

DataFrames provide schema-based access with optimizations over RDDs.

110. Riak is:

a) Distributed NoSQL key-value store

b) Relational

c) Document

d) Graph

✅ Correct Answer: a) Distributed NoSQL key-value store

📝 Explanation:

Riak uses consistent hashing and vector clocks for conflict resolution.

111. What is exactly-once semantics in processing?

a) No duplicates or losses

b) At-least-once

c) At-most-once

d) No semantics

✅ Correct Answer: a) No duplicates or losses

📝 Explanation:

Exactly-once ensures each input produces one output despite failures.

112. Apache HAWQ is:

a) SQL on Hadoop

b) Stream

c) Storage

d) Graph

✅ Correct Answer: a) SQL on Hadoop

📝 Explanation:

HAWQ is a Hadoop-native SQL query engine that provides PostgreSQL-compatible queries on HDFS data.

113. What is data durability in storage?

a) Persistence despite failures

b) Speed

c) Variety

d) Volume

✅ Correct Answer: a) Persistence despite failures

📝 Explanation:

Durability ensures committed data survives crashes via replication or WAL.

114. In Spark, Dataset API unifies:

a) Structured and semi-structured data

b) Unstructured only

c) Storage

d) Networking

✅ Correct Answer: a) Structured and semi-structured data

📝 Explanation:

Datasets combine the optimizations of DataFrames with compile-time type safety; in Scala and Java, a DataFrame is a Dataset of Row objects.

115. Aerospike is:

a) Flash-optimized NoSQL database

b) Relational

c) Document

d) Graph

✅ Correct Answer: a) Flash-optimized NoSQL database

📝 Explanation:

Aerospike is a key-value store whose hybrid memory architecture keeps indexes in RAM and data on flash (SSD) to approach in-memory speed.

116. What is idempotency in processing?

a) Repeatable without side effects

b) Non-repeatable

c) Error-prone

d) Single run

✅ Correct Answer: a) Repeatable without side effects

📝 Explanation:

An idempotent operation produces the same result no matter how many times it is applied, which allows safe retries in distributed systems.
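
The difference shows up as soon as a message is delivered more than once. The sketch below contrasts a non-idempotent increment with an idempotent assignment, using hypothetical function names and an in-memory dictionary in place of a real data store.

```python
def apply_increment(balances, account, amount):
    """Not idempotent: each retry adds the amount again."""
    balances[account] = balances.get(account, 0) + amount

def apply_set(balances, account, new_balance):
    """Idempotent: retries leave the same final state."""
    balances[account] = new_balance

incremented, assigned = {"acct": 100}, {"acct": 100}
for _ in range(3):                     # simulate a message delivered three times
    apply_increment(incremented, "acct", 50)
    apply_set(assigned, "acct", 150)

print(incremented["acct"])  # 250: duplicates changed the result
print(assigned["acct"])     # 150: duplicates were harmless
```

Pairing at-least-once delivery with idempotent writes is one common way to get effectively exactly-once results.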

117. Apache Hive supports which execution engines?

a) MapReduce, Tez, Spark

b) MapReduce only

c) Tez only

d) None

✅ Correct Answer: a) MapReduce, Tez, Spark

📝 Explanation:

Hive can run queries on several backends; the engine is chosen with the hive.execution.engine property (mr, tez, or spark).

118. What is hybrid storage?

a) Mix of SSD and HDD

b) Single type

c) Cloud only

d) No storage

✅ Correct Answer: a) Mix of SSD and HDD

📝 Explanation:

Hybrid storage keeps hot data on fast SSDs and cold data on cheaper HDDs.

119. In Spark, barrier execution mode is for:

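a) Gang-scheduled, synchronized tasks within a stage

b) Async only

c) Batch

d) No sync

✅ Correct Answer: a) Gang-scheduled, synchronized tasks within a stage

📝 Explanation:

Barrier mode launches all tasks of a stage together and lets them synchronize with each other, which suits distributed ML training; if one task fails, the whole stage is retried.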

120. Tarantool is:

a) In-memory database with Lua

b) Disk-based

c) Stream

d) Graph

✅ Correct Answer: a) In-memory database with Lua

📝 Explanation:

Tarantool combines an in-memory database with a Lua application server, so stored procedures run next to the data.

121. What is data locality in processing?

a) Processing data where stored

b) Remote processing

c) No locality

d) Cloud only

✅ Correct Answer: a) Processing data where stored

📝 Explanation:

Locality minimizes network transfer by moving computation to data.

122. Apache Kylin is for:

a) OLAP on Hadoop

b) OLTP

c) Stream

d) Graph

✅ Correct Answer: a) OLAP on Hadoop

📝 Explanation:

Kylin precomputes cubes for fast multidimensional analysis.
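
The idea of a precomputed cube can be sketched as aggregating a measure over every combination of dimensions ahead of time. This is a toy illustration, not Kylin's storage layout; the `build_cube` function and the sample sales rows are assumptions.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical fact rows with two dimensions and one measure.
sales = [
    {"region": "EU", "product": "A", "amount": 10},
    {"region": "EU", "product": "B", "amount": 5},
    {"region": "US", "product": "A", "amount": 7},
]
dimensions = ("region", "product")


def build_cube(rows, dims, measure):
    """Precompute SUM(measure) for every subset of the dimensions."""
    cube = {}
    for size in range(len(dims) + 1):
        for group in combinations(dims, size):
            cuboid = defaultdict(int)
            for row in rows:
                cuboid[tuple(row[d] for d in group)] += row[measure]
            cube[group] = dict(cuboid)
    return cube


cube = build_cube(sales, dimensions, "amount")

# A query such as "total amount per region" becomes a lookup, not a scan.
print(cube[("region",)])
```

The trade-off is build time and storage: the number of cuboids doubles with each added dimension.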

123. What is multi-version concurrency control (MVCC)?

a) Snapshot reads without locks

b) Locking all

c) No concurrency

d) Single version

✅ Correct Answer: a) Snapshot reads without locks

📝 Explanation:

MVCC allows concurrent transactions with consistent views via versions.
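
A minimal sketch of snapshot reads over versioned values follows. It assumes a single logical clock and ignores transactions, aborts, and garbage collection; the `MVCCStore` class is hypothetical.

```python
class MVCCStore:
    """Keeps every committed version so readers never block writers."""

    def __init__(self):
        self.versions = {}  # key -> list of (commit_ts, value), ascending
        self.clock = 0

    def write(self, key, value):
        self.clock += 1
        self.versions.setdefault(key, []).append((self.clock, value))
        return self.clock

    def snapshot(self):
        # A reader remembers the clock value at the time it starts.
        return self.clock

    def read(self, key, snapshot_ts):
        # Return the newest version committed at or before the snapshot.
        visible = [v for ts, v in self.versions.get(key, []) if ts <= snapshot_ts]
        return visible[-1] if visible else None


mvcc = MVCCStore()
mvcc.write("x", "old")
snap = mvcc.snapshot()   # a reader starts here
mvcc.write("x", "new")   # a later write does not disturb that reader

print(mvcc.read("x", snap), mvcc.read("x", mvcc.snapshot()))
```

The reader that started before the second write keeps seeing "old", without taking any lock.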

124. In Spark, Structured Streaming uses:

a) DataFrame API for streams

b) RDD only

c) Map only

d) No API

✅ Correct Answer: a) DataFrame API for streams

📝 Explanation:

It models streams as infinite tables for declarative processing.

125. ScyllaDB is compatible with:

a) Cassandra API

b) MongoDB

c) SQL

d) GraphQL

✅ Correct Answer: a) Cassandra API

📝 Explanation:

ScyllaDB is a C++ reimplementation of Cassandra that targets higher throughput and lower latency.

126. What is backpressure in stream processing?

a) Slowing producers on overload

b) Speeding up

c) No control

d) Buffering all

✅ Correct Answer: a) Slowing producers on overload

📝 Explanation:

Backpressure prevents system collapse by throttling input rates.
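
A bounded buffer is the simplest form of backpressure: when it is full, the producer is told to slow down instead of the system buffering without limit. The sketch below uses Python's standard `queue` module; the `offer` helper and the buffer size are assumptions for illustration.

```python
import queue

buffer = queue.Queue(maxsize=3)  # small capacity to force overload
accepted, rejected = [], []


def offer(item):
    """Try to enqueue; a full buffer signals the producer to back off."""
    try:
        buffer.put_nowait(item)
        accepted.append(item)
        return True
    except queue.Full:
        rejected.append(item)
        return False


# The producer outpaces the consumer: items 3 and 4 are pushed back.
for i in range(5):
    offer(i)

# Once the consumer drains one item, the producer can retry.
drained = buffer.get_nowait()
retry_ok = offer(rejected[0])
print(accepted, rejected, retry_ok)
```

A blocking `put` would have the same effect by pausing the producer rather than returning a failure.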

127. Apache Doris is:

a) MPP OLAP database

b) OLTP

c) Key-value

d) Document

✅ Correct Answer: a) MPP OLAP database

📝 Explanation:

Doris supports high-concurrency queries with real-time updates.

128. What is join strategy in processing?

a) How tables combine data

b) No join

c) Storage only

d) Filter only

✅ Correct Answer: a) How tables combine data

📝 Explanation:

Strategies like broadcast or shuffle hash optimize large joins.
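
A broadcast join can be sketched as building a hash table from the small table and probing it with each row of the large one, so the large table is never shuffled. This is a single-process illustration, not Spark code; the function name and sample rows are hypothetical.

```python
def broadcast_hash_join(large_rows, small_rows, key):
    """Inner join: hash the small side once, then stream the large side."""
    lookup = {}
    for row in small_rows:
        lookup.setdefault(row[key], []).append(row)

    joined = []
    for row in large_rows:
        for match in lookup.get(row[key], []):
            joined.append({**match, **row})
    return joined


orders = [
    {"order_id": 1, "country": "DE"},
    {"order_id": 2, "country": "FR"},
    {"order_id": 3, "country": "XX"},  # no matching country row
]
countries = [
    {"country": "DE", "name": "Germany"},
    {"country": "FR", "name": "France"},
]

result = broadcast_hash_join(orders, countries, "country")
print(result)
```

In a cluster, the small table's hash map would be copied to every worker; this only pays off when that table fits in memory.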

129. In Spark, whole-stage codegen:

a) Compiles stages to bytecode

b) Interprets

c) No code

d) Partial only

✅ Correct Answer: a) Compiles stages to bytecode

📝 Explanation:

Codegen fuses a stage's operators into one generated function, cutting virtual function calls and speeding up execution.

130. ClickHouse is optimized for:

a) Real-time analytics

b) Transactions

c) Graphs

d) Caches

✅ Correct Answer: a) Real-time analytics

📝 Explanation:

ClickHouse uses columnar storage for sub-second queries on billions of rows.

131. What is checkpointing in processing?

a) State snapshots for recovery

b) No recovery

c) Logging only

d) Buffering

✅ Correct Answer: a) State snapshots for recovery

📝 Explanation:

Checkpoints enable fault tolerance by restoring from saved state.
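
Checkpoint-based recovery can be sketched as saving the operator state together with the input offset, then resuming from that pair after a crash. The `CountingJob` class and the event list below are hypothetical, and the input is assumed to be replayable.

```python
import copy


class CountingJob:
    """Counts words from a replayable input, tracking its read offset."""

    def __init__(self):
        self.offset = 0
        self.counts = {}

    def process(self, events, upto):
        while self.offset < upto:
            word = events[self.offset]
            self.counts[word] = self.counts.get(word, 0) + 1
            self.offset += 1

    def checkpoint(self):
        # Snapshot state and offset together so they stay consistent.
        return copy.deepcopy({"offset": self.offset, "counts": self.counts})

    def restore(self, snap):
        self.offset = snap["offset"]
        self.counts = copy.deepcopy(snap["counts"])


events = ["a", "b", "a", "c", "a"]

job = CountingJob()
job.process(events, 3)
saved = job.checkpoint()
job.process(events, 4)  # this progress is lost when the job crashes

restarted = CountingJob()
restarted.restore(saved)
restarted.process(events, len(events))  # replay from the saved offset
print(restarted.counts)
```

Restoring the offset along with the counts is what prevents the replayed events from being counted twice.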

132. Apache Pinot is for:

a) Real-time analytics on event data

b) Batch

c) Storage only

d) ML

✅ Correct Answer: a) Real-time analytics on event data

📝 Explanation:

Pinot ingests streams and serves low-latency queries for user-facing apps.

133. What is columnar projection?

a) Selecting only needed columns

b) All columns

c) Row projection

d) No projection

✅ Correct Answer: a) Selecting only needed columns

📝 Explanation:

Projection reduces I/O by reading only required columns in columnar stores.
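
With a column-per-array layout, projection means untouched columns are never read. The sketch below models a columnar table as a dict of lists; the `read_column` and `project` helpers and the sample data are assumptions.

```python
# Hypothetical columnar table: one list per column.
columns = {
    "user_id": [1, 2, 3],
    "country": ["DE", "FR", "DE"],
    "payload": ["x" * 100, "y" * 100, "z" * 100],  # wide column
}

reads = []  # records which columns were actually touched


def read_column(table, name):
    reads.append(name)
    return table[name]


def project(table, names):
    """Return rows containing only the requested columns."""
    selected = [read_column(table, name) for name in names]
    return list(zip(*selected))


rows = project(columns, ["user_id", "country"])
print(rows, reads)
```

The wide "payload" column is skipped entirely, which is where the I/O saving comes from in formats such as Parquet.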

134. In Spark, dynamic partition pruning:

a) Skips irrelevant partitions at runtime

b) Static only

c) No pruning

d) Full scan

✅ Correct Answer: a) Skips irrelevant partitions at runtime

📝 Explanation:

At runtime, Spark applies the filter from the dimension side of a join to the partitioned fact table, so partitions with no matching keys are never scanned.
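
The effect can be imitated in plain Python: filter the small dimension table first, then scan only the fact partitions whose key survived. This is not Spark code; the table contents and the `holiday_sales` function are hypothetical.

```python
# Hypothetical fact table, partitioned by date.
fact_partitions = {
    "2024-01-01": [("s1", 10), ("s2", 20)],
    "2024-01-02": [("s1", 5)],
    "2024-01-03": [("s3", 7)],
}
# Hypothetical dimension table with a selective filter column.
dim_dates = [
    {"date": "2024-01-01", "is_holiday": True},
    {"date": "2024-01-02", "is_holiday": False},
    {"date": "2024-01-03", "is_holiday": True},
]


def holiday_sales():
    # Step 1: evaluate the dimension filter to get the join keys.
    keep = {d["date"] for d in dim_dates if d["is_holiday"]}
    # Step 2: scan only the fact partitions that match those keys.
    scanned = [p for p in fact_partitions if p in keep]
    total = sum(amount for p in scanned for _, amount in fact_partitions[p])
    return total, scanned


total, scanned = holiday_sales()
print(total, scanned)
```

The partition for 2024-01-02 is never read, even though the query text filters on the dimension table rather than on the partition column.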

135. Apache Druid is for:

a) Time-series analytics

b) Transactions

c) Graphs

d) Documents

✅ Correct Answer: a) Time-series analytics

📝 Explanation:

Druid ingests streams and supports fast aggregations on time-based data.

136. What is data skipping in storage?

a) Skipping irrelevant blocks via metadata

b) Full scan

c) No skip

d) Random access

✅ Correct Answer: a) Skipping irrelevant blocks via metadata

📝 Explanation:

Skipping uses min/max stats to bypass blocks not matching queries.
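
Min/max skipping can be sketched by keeping a (min, max) pair per block and checking it before reading the block. The block contents and the `scan_equal` function below are made up for illustration.

```python
# Hypothetical data blocks and their per-block min/max metadata.
blocks = [[1, 5, 9], [12, 15, 18], [21, 22, 30]]
zone_map = [(min(block), max(block)) for block in blocks]


def scan_equal(value):
    """Find rows equal to value, reading only blocks that could hold it."""
    scanned = 0
    hits = []
    for block, (low, high) in zip(blocks, zone_map):
        if value < low or value > high:
            continue  # metadata proves the block has no match
        scanned += 1
        hits.extend(v for v in block if v == value)
    return hits, scanned


print(scan_equal(15))
print(scan_equal(10))
```

Skipping works best when data is sorted or clustered on the filtered column, because block ranges then overlap less. A block can still be read without yielding a match, as with `scan_equal(13)`.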

137. Apache Kyuubi is:

a) Multi-tenant SQL gateway

b) Storage

c) Stream

d) Graph

✅ Correct Answer: a) Multi-tenant SQL gateway

📝 Explanation:

Kyuubi provides secure, scalable SQL on Spark for big data.

138. What is zone mapping in storage?

a) Metadata for fast filtering

b) Full index

c) No metadata

d) Row map

✅ Correct Answer: a) Metadata for fast filtering

📝 Explanation:

Zone maps store value ranges per block for predicate skipping.
