138 multiple-choice questions designed to test and deepen understanding of Big Data storage mechanisms, including distributed file systems, NoSQL databases, and data lakes, alongside data processing paradigms such as batch processing and stream processing, and frameworks such as Hadoop MapReduce and Apache Spark.
1. What is the primary storage system in the Hadoop ecosystem?
✅ Correct Answer: b) HDFS
📝 Explanation:
Hadoop Distributed File System (HDFS) is designed for storing large datasets across distributed clusters with high fault tolerance via data replication.
2. In HDFS, what is the default replication factor for data blocks?
✅ Correct Answer: c) 3
📝 Explanation:
The default replication factor of 3 ensures data durability by maintaining three copies of each block across different nodes.
3. Which NoSQL database is column-oriented and part of the Hadoop ecosystem?
✅ Correct Answer: c) HBase
📝 Explanation:
HBase is a distributed, scalable, big data store modeled after Google's Bigtable, providing random access to HDFS data.
4. What is a Data Lake in Big Data storage?
✅ Correct Answer: b) A centralized repository for raw data in native format
📝 Explanation:
Data Lakes store raw, unprocessed data from various sources, supporting schema-on-read for flexible analysis.
5. Which storage format is columnar and optimized for analytical queries in Big Data?
✅ Correct Answer: c) Parquet
📝 Explanation:
Apache Parquet is a columnar storage format that supports efficient compression and encoding for complex data types.
6. In MapReduce, what is the role of the Mapper?
✅ Correct Answer: b) To process input data into key-value pairs
📝 Explanation:
The Mapper phase filters and transforms input data into intermediate key-value pairs for parallel processing.
7. What is the default block size in HDFS?
✅ Correct Answer: b) 128 MB
📝 Explanation:
The default block size of 128 MB optimizes for large file storage and sequential access in distributed environments.
8. Which processing framework uses Directed Acyclic Graphs (DAGs) for execution?
✅ Correct Answer: d) Both b and c
📝 Explanation:
Both Apache Spark and Tez use DAGs to optimize execution plans beyond the rigid MapReduce model.
9. What is Apache Cassandra known for in Big Data storage?
✅ Correct Answer: b) Wide-column store with high availability
📝 Explanation:
Cassandra is a distributed NoSQL database designed for handling large amounts of data across commodity servers with no single point of failure.
10. In Spark, what is an RDD?
✅ Correct Answer: b) Resilient Distributed Dataset
📝 Explanation:
RDDs are immutable, partitioned collections of records that provide fault tolerance through lineage.
11. Which storage solution is best for hierarchical data in Big Data?
✅ Correct Answer: b) Document stores like MongoDB
📝 Explanation:
Document databases store data in flexible, JSON-like documents, ideal for semi-structured hierarchical data.
12. What is the purpose of the Shuffle phase in MapReduce?
✅ Correct Answer: b) To group and sort data for reducers
📝 Explanation:
The Shuffle phase transfers mapped data from mappers to reducers, grouping by key for aggregation.
13. Which file format supports schema evolution in Big Data storage?
✅ Correct Answer: c) Avro
📝 Explanation:
Avro's compact binary format includes embedded schema information, allowing evolution without data rewriting.
14. What is Apache Hive used for in data processing?
✅ Correct Answer: b) SQL-like querying on HDFS data
📝 Explanation:
Hive provides HiveQL for declarative querying and analysis of large datasets in HDFS.
15. In HDFS, the NameNode is responsible for:
✅ Correct Answer: b) Managing metadata and namespace
📝 Explanation:
The NameNode maintains the file system namespace and metadata in memory for fast access.
16. Which processing model handles both batch and stream data in a unified way?
✅ Correct Answer: b) Kappa Architecture
📝 Explanation:
Kappa uses stream processing for everything, replaying historical data from logs for batch-like computations.
17. What is the role of DataNodes in HDFS?
✅ Correct Answer: b) Storing and retrieving data blocks
📝 Explanation:
DataNodes manage storage on individual machines, handling read/write requests for blocks.
18. Which NoSQL type is best for caching and session storage?
✅ Correct Answer: c) Key-value
📝 Explanation:
Key-value stores like Redis provide fast lookups for simple data types, ideal for caching.
19. In Spark, what enables in-memory processing?
✅ Correct Answer: b) RDD caching
📝 Explanation:
Spark's RDDs can be persisted in memory, reducing disk I/O for iterative algorithms.
20. What is ORC file format optimized for?
✅ Correct Answer: b) Hive queries with compression and indexing
📝 Explanation:
Optimized Row Columnar (ORC) format supports predicate pushdown and advanced compression for analytics.
21. What is the Reduce phase in MapReduce responsible for?
✅ Correct Answer: b) Aggregating shuffled data
📝 Explanation:
Reducers perform final aggregation on grouped key-value pairs to produce output.
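To make the three phases concrete, here is a minimal pure-Python sketch of map, shuffle, and reduce for a word count (illustrative only; a real MapReduce job runs these phases in parallel across a cluster, and the function names here are invented):

```python
from collections import defaultdict

def mapper(line):
    # Map phase: emit (word, 1) for each word in the input line.
    for word in line.split():
        yield (word.lower(), 1)

def shuffle(mapped_pairs):
    # Shuffle phase: group intermediate values by key.
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups

def reducer(key, values):
    # Reduce phase: aggregate the grouped values for one key.
    return (key, sum(values))

lines = ["big data big ideas", "data lakes hold big data"]
mapped = [pair for line in lines for pair in mapper(line)]
reduced = dict(reducer(k, v) for k, v in shuffle(mapped).items())
print(reduced)  # {'big': 3, 'data': 3, 'ideas': 1, 'lakes': 1, 'hold': 1}
```

The shuffle step is the expensive part in practice, because grouping by key requires moving data between machines.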
22. Which tool is used for data ingestion into HDFS?
✅ Correct Answer: c) Both a and b
📝 Explanation:
Sqoop imports from relational DBs, while Flume handles streaming data like logs.
23. What is schema-on-read in Big Data storage?
✅ Correct Answer: b) Applying schema during query
📝 Explanation:
Schema-on-read allows flexible ingestion of raw data, interpreting structure at analysis time.
24. Apache Flink is primarily used for:
✅ Correct Answer: b) Unified batch and stream processing
📝 Explanation:
Flink treats batch as bounded streams, offering low-latency and stateful computations.
25. What is rack awareness in HDFS?
✅ Correct Answer: a) Placing replicas across racks for fault tolerance
📝 Explanation:
Rack awareness optimizes data locality and ensures replicas are not lost in rack failures.
26. Which database is graph-oriented for Big Data?
✅ Correct Answer: b) Neo4j
📝 Explanation:
Neo4j is a graph database that stores data as nodes and relationships for complex queries.
27. In Spark Streaming, data is processed using:
✅ Correct Answer: a) Micro-batches
📝 Explanation:
Spark Streaming discretizes streams into small batches for near-real-time processing.
28. What is the purpose of Combiners in MapReduce?
✅ Correct Answer: a) To reduce network traffic by local aggregation
📝 Explanation:
Combiners act as mini-reducers on mapper output to minimize data shuffled to reducers.
29. Which storage is used for time-series data in Big Data?
✅ Correct Answer: b) InfluxDB or OpenTSDB
📝 Explanation:
Time-series databases like InfluxDB optimize for high-ingestion rates and timestamped data.
30. Apache Tez improves upon MapReduce by:
✅ Correct Answer: a) Using DAGs for efficient execution
📝 Explanation:
Tez allows complex workflows as DAGs, reducing job stages and latency.
31. What is a Partition in Spark?
✅ Correct Answer: b) A unit of parallelism for RDDs
📝 Explanation:
Partitions divide data across the cluster, enabling parallel computation.
32. Which format is sequence-based in Hadoop?
✅ Correct Answer: c) SequenceFile
📝 Explanation:
SequenceFile is a flat file format for key-value pairs, used for intermediate MapReduce data.
33. What is the serving layer in Lambda Architecture?
✅ Correct Answer: b) Merging batch and real-time views
📝 Explanation:
The serving layer provides low-latency access by combining precomputed batch views with recent updates.
34. Apache Pig is used for:
✅ Correct Answer: b) Data transformation scripting
📝 Explanation:
Pig Latin scripts simplify complex data flows on Hadoop for ETL processes.
35. In HDFS Federation, multiple NameNodes manage:
✅ Correct Answer: b) Separate namespaces
📝 Explanation:
Federation scales HDFS by allowing multiple NameNodes for different namespaces on shared DataNodes.
36. What is Storm used for in data processing?
✅ Correct Answer: b) Real-time stream processing
📝 Explanation:
Apache Storm processes unbounded data streams with low latency for applications like fraud detection.
37. Which database supports ACID transactions in Big Data?
✅ Correct Answer: c) NewSQL like CockroachDB
📝 Explanation:
NewSQL databases provide distributed scalability with full SQL and ACID guarantees.
38. In Spark SQL, data is processed using:
✅ Correct Answer: a) DataFrames
📝 Explanation:
DataFrames offer structured APIs with optimizations like Catalyst for SQL queries.
39. What is sharding in Big Data storage?
✅ Correct Answer: b) Horizontal partitioning across nodes
📝 Explanation:
Sharding distributes data subsets to balance load and improve scalability in NoSQL systems.
40. Apache Kafka is primarily a:
✅ Correct Answer: b) Distributed streaming platform
📝 Explanation:
Kafka handles high-throughput event streaming, serving as a message broker for pipelines.
41. What is lazy evaluation in Spark?
✅ Correct Answer: b) Recording transformations until an action
📝 Explanation:
Lazy evaluation builds an optimized DAG before computing results on trigger.
42. Which tool orchestrates workflows in Hadoop?
✅ Correct Answer: a) Oozie
📝 Explanation:
Apache Oozie schedules and manages Hadoop job workflows, including dependencies.
43. What is a hot spot in data partitioning?
✅ Correct Answer: b) Overloaded partition due to skew
📝 Explanation:
Data skew causes hot spots, leading to uneven load and performance bottlenecks.
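A tiny sketch of how skew creates a hot spot (the key names and the 90/10 split are invented for illustration; `zlib.crc32` stands in for a partitioner's hash function):

```python
import zlib
from collections import Counter

# 90% of the records share one key: classic skew.
keys = ["celebrity"] * 900 + [f"user{i}" for i in range(100)]

# Hash-partition the records across 4 partitions.
load = Counter(zlib.crc32(k.encode()) % 4 for k in keys)

# Whichever partition the hot key hashes to receives at least 900 records,
# while the others share the remaining 100.
print(sorted(load.values(), reverse=True))
```

Mitigations in practice include salting the hot key or using skew-aware join strategies.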
44. Apache Beam provides:
✅ Correct Answer: a) Unified model for batch and stream
📝 Explanation:
Beam is a portable API for defining pipelines executable on multiple runners like Flink or Spark.
45. In HBase, data is stored in:
✅ Correct Answer: b) Column families
📝 Explanation:
HBase organizes data into column families for sparse, wide tables with dynamic columns.
46. What is the Combiner in MapReduce similar to?
✅ Correct Answer: a) Reducer
📝 Explanation:
Combiners run locally after mapping to pre-aggregate data, like a reducer.
47. Which storage supports vector databases for Big Data?
✅ Correct Answer: a) Pinecone
📝 Explanation:
Vector databases like Pinecone store embeddings for similarity searches in ML applications.
48. Spark's MLlib is for:
✅ Correct Answer: a) Machine learning algorithms
📝 Explanation:
MLlib provides scalable ML tools like classification, regression, and clustering.
49. What is eventual consistency in distributed storage?
✅ Correct Answer: b) Consistency after updates propagate
📝 Explanation:
Eventual consistency prioritizes availability, ensuring reads eventually reflect writes.
50. Apache Samza processes data using:
✅ Correct Answer: a) Kafka streams
📝 Explanation:
Samza is a stream processing framework integrated with Kafka for fault-tolerant processing.
51. What is bloom filter used for in storage?
✅ Correct Answer: b) Probabilistic membership testing
📝 Explanation:
Bloom filters answer set-membership queries with no false negatives and a small, tunable false-positive rate, letting stores skip disk lookups for absent keys.
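A minimal pure-Python Bloom filter sketch (sizes, hash count, and the salted-SHA-256 hashing scheme are arbitrary choices for illustration, not how any particular store implements it):

```python
import hashlib

class BloomFilter:
    def __init__(self, size=1024, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = [False] * size

    def _positions(self, item):
        # Derive several bit positions from salted SHA-256 digests.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        # False means definitely absent; True means possibly present.
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
bf.add("hdfs")
print(bf.might_contain("hdfs"))   # True: added items are never missed
bf.might_contain("spark")         # almost certainly False, but could be a false positive
```

The "definitely absent" guarantee is what lets HBase and Cassandra skip reading SSTables that cannot contain a key.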
52. In YARN, containers are used for:
✅ Correct Answer: b) Resource allocation for tasks
📝 Explanation:
YARN allocates CPU/memory via containers to applications like MapReduce or Spark.
53. Which format is optimized for Delta Lake?
✅ Correct Answer: a) Parquet with ACID support
📝 Explanation:
Delta Lake uses Parquet files with transaction logs for reliable data lakes.
54. What is the Partitioner in MapReduce?
✅ Correct Answer: b) Decides reducer assignment
📝 Explanation:
Partitioner hashes keys to distribute data evenly across reducers.
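The idea can be shown in a couple of lines (this mirrors the spirit of Hadoop's default HashPartitioner, but the code itself is an invented sketch):

```python
import zlib

def partition(key, num_reducers):
    # hash(key) mod R decides which reducer receives the key.
    # zlib.crc32 is used because Python's built-in hash() is salted per process.
    return zlib.crc32(key.encode()) % num_reducers

# Every occurrence of a key lands on the same reducer, so grouping is correct.
assignments = {k: partition(k, 3) for k in ["apple", "banana", "cherry"]}
print(assignments)
```

Custom partitioners replace this function when the default hash distributes keys poorly.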
55. Apache Accumulo is a:
✅ Correct Answer: a) Key-value store with cell-level security
📝 Explanation:
Accumulo provides fine-grained access control at the cell level for sensitive data.
56. Spark's GraphX is for:
✅ Correct Answer: a) Graph processing
📝 Explanation:
GraphX extends RDDs for graph analytics like PageRank and connected components.
57. What is compaction in NoSQL storage?
✅ Correct Answer: b) Merging small files for efficiency
📝 Explanation:
Compaction reduces storage overhead and improves read performance by rewriting data.
58. Apache NiFi is for:
✅ Correct Answer: a) Data flow automation
📝 Explanation:
NiFi automates data routing and transformation between systems with visual design.
59. What is a Tombstone in Cassandra?
✅ Correct Answer: a) Deleted record marker
📝 Explanation:
Tombstones mark deletions for eventual consistency, preventing resurrection of old data.
60. In Spark, transformations are:
✅ Correct Answer: a) Lazy
📝 Explanation:
Transformations define the DAG but don't compute until an action is called.
61. What is Iceberg in Big Data storage?
✅ Correct Answer: a) Table format for data lakes
📝 Explanation:
Apache Iceberg provides schema evolution and time travel for open table formats.
62. The InputFormat in MapReduce handles:
✅ Correct Answer: b) Input splitting and record reading
📝 Explanation:
InputFormat divides input into splits and provides RecordReaders for key-value pairs.
63. Which is a wide-column store?
✅ Correct Answer: b) Bigtable
📝 Explanation:
Bigtable-inspired stores like HBase support sparse data with many columns per row.
64. Apache Apex is for:
✅ Correct Answer: a) Stream and batch processing
📝 Explanation:
Apex provides a unified engine for real-time and batch data processing.
65. What is denormalization in NoSQL?
✅ Correct Answer: b) Duplicating data for read performance
📝 Explanation:
Denormalization trades storage for faster queries in distributed systems.
66. In Spark, actions trigger:
✅ Correct Answer: b) Computation
📝 Explanation:
Actions like collect() or count() execute the lazy DAG and return results.
67. What is Hudi for data lakes?
✅ Correct Answer: a) Upserts and incremental processing
📝 Explanation:
Apache Hudi enables update/delete operations and time-travel in data lakes.
68. The OutputFormat in MapReduce writes:
✅ Correct Answer: b) Reducer output to storage
📝 Explanation:
OutputFormat and RecordWriter handle final data commit to HDFS or other sinks.
69. Which is a key-value store?
✅ Correct Answer: a) DynamoDB
📝 Explanation:
DynamoDB is Amazon's managed NoSQL key-value and document store.
70. Apache Giraph is for:
✅ Correct Answer: a) Graph processing on Hadoop
📝 Explanation:
Giraph implements Pregel for large-scale graph computations like social networks.
71. What is gossip protocol in storage systems?
✅ Correct Answer: a) Failure detection and data dissemination
📝 Explanation:
Gossip protocols enable decentralized communication in systems like Cassandra.
72. In Spark, broadcast variables are for:
✅ Correct Answer: a) Sharing read-only data efficiently
📝 Explanation:
Broadcast variables cache a read-only value across nodes, avoiding repeated shipping.
73. Apache Kudu is designed for:
✅ Correct Answer: a) Fast analytics on changing data
📝 Explanation:
Kudu supports low-latency random access and updates alongside analytics workloads.
74. What is anti-entropy in distributed storage?
✅ Correct Answer: a) Repairing replica inconsistencies
📝 Explanation:
Anti-entropy mechanisms like Merkle trees detect and fix divergent replicas.
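Anti-entropy can be sketched with a toy Merkle tree: two replicas exchange root hashes, and a mismatch reveals divergence without shipping the full dataset (the key/value strings here are invented, and real systems compare subtrees to localize the difference):

```python
import hashlib

def h(data):
    return hashlib.sha256(data).hexdigest()

def merkle_root(leaves):
    # Hash each leaf, then pairwise-combine levels until one root remains.
    level = [h(leaf.encode()) for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])   # duplicate the last hash on odd levels
        level = [h((level[i] + level[i + 1]).encode())
                 for i in range(0, len(level), 2)]
    return level[0]

replica_a = ["k1=v1", "k2=v2", "k3=v3", "k4=v4"]
replica_b = ["k1=v1", "k2=v2", "k3=STALE", "k4=v4"]

# Equal roots would prove the replicas agree; here they differ.
print(merkle_root(replica_a) == merkle_root(replica_b))  # False
```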
75. Apache Crunch is a:
✅ Correct Answer: a) Processing pipeline library
📝 Explanation:
Crunch simplifies MapReduce pipelines with high-level abstractions.
76. What is LSM-tree in storage?
✅ Correct Answer: a) Log-structured merge-tree for writes
📝 Explanation:
LSM-trees optimize for high write throughput by batching to disk and merging later.
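A toy in-memory LSM sketch (the memtable limit and class name are invented; real engines write runs to disk as SSTables and compact them in the background):

```python
import bisect

class TinyLSM:
    """Writes go to an in-memory memtable, which is flushed to
    immutable sorted runs; reads check newest data first."""
    def __init__(self, memtable_limit=4):
        self.memtable = {}
        self.runs = []          # newest first; each run is a sorted list of (key, value)
        self.limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.limit:
            self.flush()

    def flush(self):
        # Batch the memtable out as one immutable sorted run.
        self.runs.insert(0, sorted(self.memtable.items()))
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for run in self.runs:   # newer runs shadow older ones
            i = bisect.bisect_left(run, (key,))
            if i < len(run) and run[i][0] == key:
                return run[i][1]
        return None

db = TinyLSM()
for i in range(10):
    db.put(f"k{i}", i)
db.put("k0", 99)                # overwrite lands in newer data, shadowing the old value
print(db.get("k0"))             # 99
```

Compaction (question 57) is the process that later merges these runs back together.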
77. In Spark, accumulators are for:
✅ Correct Answer: a) Aggregating values across tasks
📝 Explanation:
Accumulators provide a way to update a variable in parallel, useful for counters.
78. Apache Drill supports:
✅ Correct Answer: a) Schema-free SQL on diverse sources
📝 Explanation:
Drill queries NoSQL, files, and cloud storage without predefined schemas.
79. What is hinted handoff in Cassandra?
✅ Correct Answer: a) Temporary storage for failed writes
📝 Explanation:
Hinted handoff queues writes for unavailable nodes, delivering when they recover.
80. Apache Mahout is for:
✅ Correct Answer: a) Scalable machine learning
📝 Explanation:
Mahout provides algorithms for recommendation, clustering, and classification on large datasets.
81. What is read repair in distributed storage?
✅ Correct Answer: a) Fixing inconsistencies during reads
📝 Explanation:
Read repair synchronizes replicas when a read involves multiple inconsistent copies.
82. In Spark, Catalyst optimizer does what?
✅ Correct Answer: a) Query optimization
📝 Explanation:
Catalyst uses rule-based and cost-based optimizations for Spark SQL plans.
83. Apache Phoenix provides:
✅ Correct Answer: a) SQL interface over HBase
📝 Explanation:
Phoenix enables ANSI SQL on HBase with low-latency access via JDBC.
84. What is consistent hashing in storage?
✅ Correct Answer: a) Even data distribution for scaling
📝 Explanation:
Consistent hashing minimizes data movement when nodes are added/removed.
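A small consistent-hash ring sketch with virtual nodes (node names, the vnode count, and SHA-256 as the ring hash are all illustrative choices):

```python
import bisect
import hashlib

class HashRing:
    def __init__(self, nodes, vnodes=64):
        # Each node gets several virtual points to smooth the distribution.
        self.ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes for i in range(vnodes)
        )
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _hash(value):
        return int(hashlib.sha256(value.encode()).hexdigest(), 16)

    def node_for(self, key):
        # Walk clockwise to the first virtual node after the key's hash.
        idx = bisect.bisect(self.keys, self._hash(key)) % len(self.ring)
        return self.ring[idx][1]

ring3 = HashRing(["node-a", "node-b", "node-c"])
ring2 = HashRing(["node-a", "node-b"])   # same ring after node-c departs
# Only keys that node-c owned change owners; everything else stays put.
```

This is why adding or removing one node moves roughly 1/N of the keys instead of reshuffling everything, as a plain `hash(key) % N` scheme would.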
85. Apache Parquet supports:
✅ Correct Answer: a) Columnar storage with nesting
📝 Explanation:
Parquet efficiently stores nested data structures for analytics.
86. What is vectorization in processing?
✅ Correct Answer: a) Processing multiple records at once
📝 Explanation:
Vectorization uses SIMD instructions for faster columnar processing.
87. Apache Zeppelin's primary use?
✅ Correct Answer: a) Interactive notebooks for data
📝 Explanation:
Zeppelin supports visualization and execution for Spark, Hive, etc.
88. What is quorum in Cassandra?
✅ Correct Answer: a) Majority replicas for consistency
📝 Explanation:
Quorum writes/reads ensure tunable consistency by contacting majority nodes.
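The overlap guarantee is easy to simulate: with N=3, W=2, R=2, any read set must intersect the write set because R + W > N (the replica names and version scheme below are invented for illustration):

```python
import random

def write(replicas, w, key, value, version):
    # A write succeeds once W replicas acknowledge it.
    for node in random.sample(list(replicas), w):
        replicas[node][key] = (version, value)

def read(replicas, r, key):
    # A read contacts R replicas and returns the highest-versioned value.
    answers = [replicas[n].get(key, (0, None))
               for n in random.sample(list(replicas), r)]
    return max(answers)[1]

replicas = {"a": {}, "b": {}, "c": {}}
write(replicas, w=2, key="x", value="v1", version=1)

# Any 2-of-3 read set overlaps the 2-of-3 write set, so the latest value wins.
print(read(replicas, r=2, key="x"))  # 'v1', no matter which replicas were sampled
```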
89. In Spark, Tungsten optimizes:
✅ Correct Answer: d) All
📝 Explanation:
Tungsten provides whole-stage codegen and efficient serialization.
90. Apache Solr is for:
✅ Correct Answer: a) Search and indexing
📝 Explanation:
Solr is a search platform built on Lucene for full-text search.
91. What is leveling in LSM-trees?
✅ Correct Answer: a) Sorted merging levels
📝 Explanation:
Leveling compacts by merging into sorted runs at each level.
92. Apache Calcite is a:
✅ Correct Answer: a) Query optimizer framework
📝 Explanation:
Calcite provides SQL parsing and optimization for various backends.
93. What is tiering in storage?
✅ Correct Answer: a) Moving data between storage tiers
📝 Explanation:
Tiering places hot data on fast storage, cold on cheaper.
94. In Spark, Delta Lake adds:
✅ Correct Answer: a) ACID transactions to data lakes
📝 Explanation:
Delta Lake brings reliability to open formats like Parquet.
95. Apache Lucene is the core of:
✅ Correct Answer: a) Full-text search
📝 Explanation:
Lucene provides inverted indexing for fast text retrieval.
96. What is snapshot isolation in storage?
✅ Correct Answer: a) Consistent view at a point in time
📝 Explanation:
Snapshots allow reads without locking, seeing committed data.
97. Apache Arrow is for:
✅ Correct Answer: a) In-memory columnar format
📝 Explanation:
Arrow enables zero-copy data sharing between systems.
98. What is write-ahead logging (WAL)?
✅ Correct Answer: a) Logging changes before commit
📝 Explanation:
WAL ensures durability by persisting changes to the log before acknowledging the write, so committed data can be replayed after a crash.
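A minimal WAL sketch: append to the log, fsync, then update in-memory state; recovery is just replaying the log (the JSON-lines format and class name are invented; real systems also checkpoint and truncate the log):

```python
import json
import os
import tempfile

class WalStore:
    """Every change hits the log durably before the in-memory state,
    so a crash can be recovered by replaying the log from the start."""
    def __init__(self, path):
        self.path = path
        self.state = {}

    def put(self, key, value):
        with open(self.path, "a") as log:
            log.write(json.dumps({"key": key, "value": value}) + "\n")
            log.flush()
            os.fsync(log.fileno())   # durable before we acknowledge
        self.state[key] = value

    @classmethod
    def recover(cls, path):
        store = cls(path)
        with open(path) as log:
            for line in log:
                entry = json.loads(line)
                store.state[entry["key"]] = entry["value"]
        return store

path = os.path.join(tempfile.mkdtemp(), "wal.log")
store = WalStore(path)
store.put("x", 1)
store.put("x", 2)
recovered = WalStore.recover(path)   # simulate a restart
print(recovered.state)               # {'x': 2}
```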
99. In Spark, Adaptive Query Execution (AQE) does:
✅ Correct Answer: a) Runtime plan optimization
📝 Explanation:
AQE adjusts plans based on runtime statistics for better performance.
100. Apache Geode is for:
✅ Correct Answer: a) In-memory data grid
📝 Explanation:
Geode provides distributed caching and processing for low-latency apps.
101. What is foreign key in NoSQL?
✅ Correct Answer: a) Not native, emulated via app logic
📝 Explanation:
NoSQL prioritizes denormalization over joins, handling references in code.
102. Apache Ignite is a:
✅ Correct Answer: a) Distributed database and cache
📝 Explanation:
Ignite supports SQL, transactions, and in-memory computing across clusters.
103. What is predicate pushdown?
✅ Correct Answer: a) Filtering at storage level
📝 Explanation:
Pushdown reduces data transfer by applying filters early in the pipeline.
104. In Spark, Project Tungsten focuses on:
✅ Correct Answer: a) Performance via code generation
📝 Explanation:
Tungsten generates JVM bytecode for efficient execution.
105. Apache Voldemort is:
✅ Correct Answer: a) Distributed key-value store
📝 Explanation:
Voldemort provides consistent hashing and partitioning for scalability.
106. What is log compaction in Kafka?
✅ Correct Answer: a) Retaining latest value per key
📝 Explanation:
Compaction keeps the most recent message for each key, enabling changelog use.
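The effect of compaction can be sketched in a few lines (keys and values here are invented; Kafka performs this incrementally on log segments, not in one pass):

```python
def compact(log):
    # Keep only the latest record per key, ordered by each survivor's offset.
    latest = {}
    for offset, (key, value) in enumerate(log):
        latest[key] = (offset, value)
    return [(key, value)
            for key, (offset, value) in sorted(latest.items(),
                                               key=lambda kv: kv[1][0])]

log = [("user1", "a"), ("user2", "b"), ("user1", "c"),
       ("user3", "d"), ("user2", "e")]
print(compact(log))  # [('user1', 'c'), ('user3', 'd'), ('user2', 'e')]
```

Because the latest value per key survives, a compacted topic can serve as a changelog from which consumers rebuild full state.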
107. Apache Presto is a:
✅ Correct Answer: a) Distributed SQL query engine
📝 Explanation:
Presto queries multiple data sources with low latency.
108. What is CAP theorem implication for storage?
✅ Correct Answer: a) Trade-offs in Consistency, Availability, Partition tolerance
📝 Explanation:
CAP states that during a network partition a distributed system must choose between consistency and availability; partition tolerance itself cannot be sacrificed in practice.
109. In Spark, DataFrames are:
✅ Correct Answer: a) Typed collections of rows
📝 Explanation:
DataFrames provide schema-based access with optimizations over RDDs.
110. Apache Riak is:
✅ Correct Answer: a) Distributed NoSQL key-value store
📝 Explanation:
Riak uses consistent hashing and vector clocks for conflict resolution.
111. What is exactly-once semantics in processing?
✅ Correct Answer: a) No duplicates or losses
📝 Explanation:
Exactly-once ensures each input produces one output despite failures.
112. Apache HAWQ is:
✅ Correct Answer: a) SQL on Hadoop
📝 Explanation:
HAWQ provides PostgreSQL-compatible SQL queries on data stored in HDFS.
113. What is data durability in storage?
✅ Correct Answer: a) Persistence despite failures
📝 Explanation:
Durability ensures committed data survives crashes via replication or WAL.
114. In Spark, Dataset API unifies:
✅ Correct Answer: a) Structured and semi-structured data
📝 Explanation:
Datasets provide type-safe access like DataFrames but with stronger typing.
115. Apache Aerospike is:
✅ Correct Answer: a) Flash-optimized NoSQL database
📝 Explanation:
Aerospike combines key-value with in-memory speed using hybrid memory.
116. What is idempotency in processing?
✅ Correct Answer: a) Repeatable without side effects
📝 Explanation:
Idempotent operations allow safe retries in distributed systems.
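A common way to make a non-idempotent operation (like a deposit) safe to retry is to track operation IDs; this sketch is illustrative, with invented names:

```python
processed = set()
balance = {"acct": 0}

def apply_deposit(op_id, amount):
    # Deduplicate by operation ID so a retried request has no extra effect.
    if op_id in processed:
        return balance["acct"]
    processed.add(op_id)
    balance["acct"] += amount
    return balance["acct"]

apply_deposit("op-1", 100)
apply_deposit("op-1", 100)   # retried after a timeout: no double credit
print(balance["acct"])       # 100
```

At-least-once delivery plus idempotent application is a practical route to effectively-once results.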
117. Apache Hive supports which execution engines?
✅ Correct Answer: a) MapReduce, Tez, Spark
📝 Explanation:
Hive can use multiple backends for query execution.
118. What is hybrid storage?
✅ Correct Answer: a) Mix of SSD and HDD
📝 Explanation:
Hybrid uses fast SSD for hot data, cost-effective HDD for cold.
119. In Spark, barrier execution mode is for:
✅ Correct Answer: a) Synchronous stages in streaming
📝 Explanation:
Barrier execution launches all tasks in a stage together and lets them coordinate, supporting gang scheduling for distributed ML workloads instead of the usual independent-task model.
120. Apache Tarantool is:
✅ Correct Answer: a) In-memory database with Lua
📝 Explanation:
Tarantool combines an in-memory database with a Lua application server, supporting stored procedures and messaging.
121. What is data locality in processing?
✅ Correct Answer: a) Processing data where stored
📝 Explanation:
Locality minimizes network transfer by moving computation to data.
122. Apache Kylin is for:
✅ Correct Answer: a) OLAP on Hadoop
📝 Explanation:
Kylin precomputes OLAP cubes for fast multidimensional analysis.
123. What is multi-version concurrency control (MVCC)?
✅ Correct Answer: a) Snapshot reads without locks
📝 Explanation:
MVCC allows concurrent transactions with consistent views via versions.
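A toy MVCC store makes the mechanism concrete: each write appends a new version, and a reader sees only versions committed at or before its snapshot timestamp (class and field names invented):

```python
import itertools

class MvccStore:
    """Each write creates a new version; readers see the latest version
    at or before their snapshot timestamp, without taking locks."""
    def __init__(self):
        self.versions = {}             # key -> [(commit_ts, value), ...]
        self.clock = itertools.count(1)

    def write(self, key, value):
        ts = next(self.clock)
        self.versions.setdefault(key, []).append((ts, value))
        return ts

    def read(self, key, snapshot_ts):
        visible = [(ts, v) for ts, v in self.versions.get(key, [])
                   if ts <= snapshot_ts]
        return max(visible)[1] if visible else None

store = MvccStore()
t1 = store.write("x", "old")
t2 = store.write("x", "new")
print(store.read("x", snapshot_ts=t1))  # 'old': the t1 snapshot never sees t2
print(store.read("x", snapshot_ts=t2))  # 'new'
```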
124. In Spark, Structured Streaming uses:
✅ Correct Answer: a) DataFrame API for streams
📝 Explanation:
It models streams as infinite tables for declarative processing.
125. Apache ScyllaDB is compatible with:
✅ Correct Answer: a) Cassandra API
📝 Explanation:
ScyllaDB is a C++ reimplementation of Cassandra designed for higher throughput and lower latency on the same API.
126. What is backpressure in stream processing?
✅ Correct Answer: a) Slowing producers on overload
📝 Explanation:
Backpressure prevents system collapse by throttling input rates.
127. Apache Doris is:
✅ Correct Answer: a) MPP OLAP database
📝 Explanation:
Doris supports high-concurrency queries with real-time updates.
128. What is join strategy in processing?
✅ Correct Answer: a) How tables combine data
📝 Explanation:
Strategies like broadcast or shuffle hash optimize large joins.
129. In Spark, whole-stage codegen:
✅ Correct Answer: a) Compiles stages to bytecode
📝 Explanation:
Codegen reduces virtual function calls for faster execution.
130. Apache ClickHouse is optimized for:
✅ Correct Answer: a) Real-time analytics
📝 Explanation:
ClickHouse uses columnar storage for sub-second queries on billions of rows.
131. What is checkpointing in processing?
✅ Correct Answer: a) State snapshots for recovery
📝 Explanation:
Checkpoints enable fault tolerance by restoring from saved state.
132. Apache Pinot is for:
✅ Correct Answer: a) Real-time analytics on event data
📝 Explanation:
Pinot ingests streams and serves low-latency queries for user-facing apps.
133. What is columnar projection?
✅ Correct Answer: a) Selecting only needed columns
📝 Explanation:
Projection reduces I/O by reading only required columns in columnar stores.
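The I/O difference between row and columnar layouts can be shown with plain lists (the tiny table is invented):

```python
# Row layout: a query must read whole records even for one field.
rows = [{"id": 1, "name": "a", "amount": 10},
        {"id": 2, "name": "b", "amount": 20}]

# Columnar layout: each column's values are stored contiguously.
columns = {"id": [1, 2], "name": ["a", "b"], "amount": [10, 20]}

# Projection: SELECT sum(amount) touches exactly one column.
print(sum(columns["amount"]))  # 30
```

On disk, this means formats like Parquet and ORC read only the byte ranges for the projected columns.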
134. In Spark, dynamic partition pruning:
✅ Correct Answer: a) Skips irrelevant partitions at runtime
📝 Explanation:
Pruning uses join stats to avoid scanning unnecessary data.
135. Apache Druid is for:
✅ Correct Answer: a) Timeseries analytics
📝 Explanation:
Druid ingests streams and supports fast aggregations on time-based data.
136. What is data skipping in storage?
✅ Correct Answer: a) Skipping irrelevant blocks via metadata
📝 Explanation:
Skipping uses min/max stats to bypass blocks not matching queries.
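A minimal sketch of min/max skipping (the block layout and stats are invented; this is also the idea behind the zone maps in question 138):

```python
# Each block carries min/max stats for the column it stores.
blocks = [
    {"min": 1,  "max": 9,  "values": [3, 9, 1, 7]},
    {"min": 10, "max": 19, "values": [12, 10, 19]},
    {"min": 20, "max": 29, "values": [25, 20]},
]

def scan_greater_than(blocks, threshold):
    # Skip any block whose max cannot satisfy the predicate.
    scanned, hits = 0, []
    for block in blocks:
        if block["max"] <= threshold:
            continue            # skipped entirely via metadata
        scanned += 1
        hits.extend(v for v in block["values"] if v > threshold)
    return scanned, hits

scanned, hits = scan_greater_than(blocks, 15)
print(scanned, hits)  # 2 [19, 25, 20]: the first block was never read
```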
137. Apache Kyuubi is:
✅ Correct Answer: a) Multi-tenant SQL gateway
📝 Explanation:
Kyuubi provides secure, scalable SQL on Spark for big data.
138. What is zone mapping in storage?
✅ Correct Answer: a) Metadata for fast filtering
📝 Explanation:
Zone maps store value ranges per block for predicate skipping.