1. What does the 'V' in Big Data's 3Vs stand for primarily?
Correct Answer: b) Volume
Explanation:
Volume refers to the massive amounts of data generated. It is the first of the 3Vs (alongside Velocity and Variety) and a core characteristic of Big Data, requiring scalable storage solutions like HDFS.
2. Which architecture pattern processes both batch and real-time data streams?
Correct Answer: b) Lambda Architecture
Explanation:
Lambda Architecture combines a batch layer for historical data with a speed layer for real-time data, aiming for fault-tolerant and scalable processing.
3. In Hadoop, what is the primary role of the NameNode?
Correct Answer: b) Metadata management
Explanation:
The NameNode manages the file system's namespace and regulates access to files by clients, storing metadata in memory.
4. What is the default block size in HDFS?
Correct Answer: b) 128 MB
Explanation:
HDFS uses a default block size of 128 MB to balance storage efficiency and processing overhead in distributed environments.
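The arithmetic behind block splitting can be sketched as follows. This is a minimal illustration, assuming the 128 MB default; the helper name hdfs_block_count is hypothetical and not part of any Hadoop API.

```python
import math

BLOCK_SIZE_MB = 128  # HDFS default block size cited in the explanation above


def hdfs_block_count(file_size_mb: float, block_size_mb: int = BLOCK_SIZE_MB) -> int:
    """Number of blocks a file of the given size is split into (illustrative helper)."""
    if file_size_mb <= 0:
        return 0
    return math.ceil(file_size_mb / block_size_mb)


print(hdfs_block_count(1024))  # a 1 GB file is split into 8 blocks
print(hdfs_block_count(300))   # a 300 MB file is split into 3 blocks; the last holds 44 MB
```

Larger blocks mean fewer block entries for the NameNode to track, which is one side of the trade-off the explanation mentions.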
5. Which component in Hadoop 2.x manages cluster resources and job scheduling?
Correct Answer: c) YARN
Explanation:
Yet Another Resource Negotiator (YARN) splits resource management and job scheduling/monitoring into separate daemons, decoupling both from the MapReduce engine and enabling multi-tenancy in Hadoop clusters.
6. In MapReduce, what does the Map phase do?
Correct Answer: c) Processes key-value pairs independently
Explanation:
The Map phase takes input data and converts it into intermediate key-value pairs, processed in parallel across nodes.
7. What is the purpose of the Reduce phase in MapReduce?
Correct Answer: b) Merge and aggregate mapped data
Explanation:
The Reduce phase receives grouped key-value pairs from the Map phase and performs aggregation to produce final output.
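The Map and Reduce phases from questions 6 and 7 can be sketched with a word count in plain Python. This is a single-process toy, not Hadoop code; the function names map_phase, shuffle, and reduce_phase are illustrative assumptions.

```python
from collections import defaultdict


def map_phase(line):
    """Emit an intermediate (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group intermediate values by key, as the framework does between the phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Aggregate all values for one key into a single output record."""
    return key, sum(values)


lines = ["big data big ideas", "data beats ideas"]
# Each line could be mapped on a different node; here they run sequentially.
intermediate = [pair for line in lines for pair in map_phase(line)]
result = dict(reduce_phase(k, v) for k, v in shuffle(intermediate).items())
print(result)
```

Each call to map_phase depends only on its own line, which is why the Map phase parallelizes so easily.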
8. Which file format is optimized for Hive and supports schema evolution?
Correct Answer: a) Avro
Explanation:
Avro is a compact binary format with built-in schema, ideal for streaming data and allowing evolution without rewriting files.
9. What is a Data Lake in Big Data architecture?
Correct Answer: b) A repository for raw, unprocessed data in native format
Explanation:
Data Lakes store vast amounts of raw data from various sources, enabling schema-on-read for flexible analytics.
10. In Spark, what is the role of the Driver Program?
Correct Answer: b) Coordinates the execution of the application
Explanation:
The Driver runs the main() function and creates the SparkContext, directing the overall flow and dividing work among Executors.
11. Which Spark component manages data sharing and caching?
Correct Answer: b) RDD
Explanation:
Resilient Distributed Datasets (RDDs) are immutable, partitioned collections that can be cached in memory and shared across operations; lineage tracking lets Spark recompute lost partitions for fault tolerance.
12. What is the main advantage of using Apache Kafka in Big Data pipelines?
Correct Answer: b) High-throughput, fault-tolerant messaging
Explanation:
Kafka acts as a distributed streaming platform, handling real-time data feeds with durability and scalability.
13. In NoSQL databases for Big Data, which type is best for hierarchical data?
Correct Answer: b) Document
Explanation:
Document stores like MongoDB handle semi-structured data in JSON-like documents, suitable for flexible schemas.
14. What does CAP theorem state in distributed Big Data systems?
Correct Answer: a) Consistency, Availability, Partition tolerance - pick two
Explanation:
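The CAP theorem states that a distributed system cannot guarantee consistency, availability, and partition tolerance all at once. In practice, when a network partition occurs, the system must choose between consistency and availability.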
15. Which tool is used for ETL processes in Hadoop ecosystem?
Correct Answer: a) Pig
Explanation:
Apache Pig provides a high-level scripting language (Pig Latin) for complex data transformations in MapReduce jobs.
16. What is the fault tolerance mechanism in HDFS?
Correct Answer: a) Data replication
Explanation:
HDFS replicates data blocks across multiple DataNodes (default 3 replicas) to ensure availability and recovery from failures.
17. In Lambda Architecture, what is the 'batch layer' responsible for?
Correct Answer: b) Computing arbitrary views on immutable master dataset
Explanation:
The batch layer processes the entire dataset periodically to generate precomputed views for accurate, comprehensive results.
18. Which protocol does HDFS use for data transfer?
Correct Answer: b) TCP/IP
Explanation:
HDFS relies on TCP/IP for reliable data transfer between clients, NameNode, and DataNodes in the cluster.
19. What is Spark's Directed Acyclic Graph (DAG) used for?
Correct Answer: b) Optimizing execution plans
Explanation:
DAG in Spark represents the logical execution plan, allowing optimizations and scheduling of stages for efficient computation.
20. In Big Data, what is a 'sharded' architecture?
Correct Answer: b) Horizontal partitioning of data across nodes
Explanation:
Sharding distributes data subsets across multiple database instances to improve scalability and performance.
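One common way to assign records to shards is a stable hash of the key. The sketch below assumes four shards and uses SHA-256 purely as a convenient stable hash; the function name shard_for and the sample keys are hypothetical.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard index with a hash that is stable across processes."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


NUM_SHARDS = 4  # assumed shard count for the example
shards = {index: [] for index in range(NUM_SHARDS)}
for user_id in ["u1", "u2", "u3", "u4", "u5", "u6"]:
    shards[shard_for(user_id, NUM_SHARDS)].append(user_id)

print(shards)
```

Python's built-in hash() is avoided here because string hashes are randomized per process, so the same key could land on a different shard after a restart.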
21. Which Apache project provides SQL-like querying on Hadoop?
Correct Answer: b) Hive
Explanation:
Hive offers HiveQL, a SQL dialect, to query and analyze large datasets stored in HDFS via MapReduce or Tez.
22. What is the role of ZooKeeper in Big Data architectures?
Correct Answer: b) Distributed coordination and configuration
Explanation:
ZooKeeper provides a centralized service for maintaining configuration information and naming in distributed systems like Hadoop.
23. In Kappa Architecture, what replaces the batch layer?
Correct Answer: b) Stream processing
Explanation:
Kappa simplifies Lambda by using a single stream processing layer for both real-time and historical data via log replay.
24. What is HBase in the Hadoop ecosystem?
Correct Answer: b) Distributed, scalable, big data store
Explanation:
HBase is a column-oriented NoSQL database modeled after Google's Bigtable, providing random, real-time read/write access to data stored on HDFS.
25. Which feature of Spark enables in-memory computation?
Correct Answer: a) RDD persistence
Explanation:
Spark allows RDDs to be cached in memory, reducing recomputation and speeding up iterative algorithms significantly.
26. What is the purpose of Apache Flume?
Correct Answer: b) Log data collection and aggregation
Explanation:
Flume is designed for efficiently collecting, aggregating, and moving large amounts of log data from sources to HDFS.
27. In Big Data, what is 'schema-on-read'?
Correct Answer: b) Applying schema when data is queried
Explanation:
Schema-on-read, used in Data Lakes, allows flexible ingestion of raw data and interpretation at query time.
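Schema-on-read can be illustrated with raw JSON lines that are stored untouched and only typed when a reader queries them. This is a minimal sketch; the sample records, the SCHEMA mapping, and read_with_schema are all assumptions made for the example.

```python
import json

# Raw records are ingested as-is; fields may be missing, extra, or stored as strings.
RAW_EVENTS = [
    '{"user": "ana", "amount": "12.50", "ts": "2024-01-01"}',
    '{"user": "bo", "amount": "7", "extra": "ignored"}',
]

# Hypothetical schema chosen by the reader at query time: field name -> type caster.
SCHEMA = {"user": str, "amount": float}


def read_with_schema(raw_lines, schema):
    """Parse raw JSON lines, keeping and casting only the fields the schema names."""
    for line in raw_lines:
        record = json.loads(line)
        yield {
            field: cast(record[field])
            for field, cast in schema.items()
            if field in record
        }


rows = list(read_with_schema(RAW_EVENTS, SCHEMA))
print(rows)
```

A different reader could apply another schema to the same raw lines without any rewrite of the stored data.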
28. Which YARN component negotiates resources from the ResourceManager?
Correct Answer: b) ApplicationMaster
Explanation:
The ApplicationMaster is application-specific and requests resources (containers) from the ResourceManager for task execution.
29. What is Apache Tez?
Correct Answer: b) An execution engine for DAGs on Hadoop
Explanation:
Tez generalizes MapReduce by executing arbitrary DAGs, improving performance for Hive and Pig jobs.
30. In Spark Streaming, what is a DStream?
Correct Answer: b) A continuous stream of RDDs
Explanation:
Discretized Stream (DStream) represents a stream as a sequence of RDDs, enabling micro-batch processing.
31. What is the default replication factor in HDFS?
Correct Answer: c) 3
Explanation:
The default replication factor of 3 ensures data availability by storing three copies across different nodes or racks.
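The cost and benefit of replication reduce to simple arithmetic. The sketch below assumes the default factor of 3; both helper names are hypothetical.

```python
def raw_storage_tb(logical_tb: float, replication_factor: int = 3) -> float:
    """Raw cluster capacity consumed by a dataset once every block is replicated."""
    return logical_tb * replication_factor


def tolerated_replica_losses(replication_factor: int = 3) -> int:
    """How many copies of a block can be lost while one copy still survives."""
    return replication_factor - 1


print(raw_storage_tb(10))          # 10 TB of data occupies 30 TB of raw disk
print(tolerated_replica_losses())  # 2 copies of a block can be lost
```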
32. Which tool is used for transferring bulk data between Hadoop and relational databases?
Correct Answer: b) Sqoop
Explanation:
Sqoop (SQL-to-Hadoop) uses MapReduce to import/export data efficiently between RDBMS and HDFS.
33. What is a 'hot spot' in Big Data partitioning?
Correct Answer: a) Overloaded partition
Explanation:
Hot spots occur when data skew causes uneven load on nodes, degrading performance in distributed systems.
34. In Cassandra, what ensures data consistency?
Correct Answer: a) Quorum reads/writes
Explanation:
Cassandra uses tunable consistency levels like quorum (majority of replicas) to balance availability and consistency.
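The quorum arithmetic can be sketched as follows. A read is guaranteed to overlap the latest acknowledged write when the number of replicas read plus the number written exceeds the replication factor. The function names are illustrative and not part of any Cassandra driver.

```python
def quorum(replication_factor: int) -> int:
    """Majority of replicas: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1


def read_overlaps_write(read_replicas: int, write_replicas: int, replication_factor: int) -> bool:
    """True when every read set must share at least one replica with every write set."""
    return read_replicas + write_replicas > replication_factor


rf = 3
q = quorum(rf)
print(q)                               # 2
print(read_overlaps_write(q, q, rf))   # True: quorum reads plus quorum writes overlap
print(read_overlaps_write(1, 1, rf))   # False: ONE/ONE may return stale data
```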
35. What is the serving layer in Lambda Architecture?
Correct Answer: b) Merges batch and real-time views
Explanation:
The serving layer indexes and combines precomputed batch views with real-time updates for low-latency queries.
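For additive metrics such as counts, the merge at query time can be as simple as summing the two views. This is a minimal sketch with made-up page-view numbers; merged_view is a hypothetical helper, and real serving layers may merge differently for non-additive results.

```python
def merged_view(batch_view, realtime_view):
    """Combine precomputed batch counts with counts from events since the last batch run."""
    merged = dict(batch_view)
    for key, count in realtime_view.items():
        merged[key] = merged.get(key, 0) + count
    return merged


batch_view = {"page_a": 1000, "page_b": 250}   # computed by the batch layer
realtime_view = {"page_a": 12, "page_c": 3}    # maintained by the speed layer
print(merged_view(batch_view, realtime_view))
```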
36. Which Spark library is for structured data processing?
Correct Answer: c) Spark SQL
Explanation:
Spark SQL provides a DataFrame API and SQL interface for processing structured and semi-structured data.
37. What is Apache Mahout used for in Big Data?
Correct Answer: b) Scalable machine learning
Explanation:
Mahout offers libraries for collaborative filtering, clustering, and classification on large datasets.
38. In HDFS, what is a 'rack'?
Correct Answer: b) A group of machines in the same data center
Explanation:
Rack awareness lets HDFS place replicas on different racks, so data survives the loss of an entire rack, while keeping some replicas on the same rack to limit cross-rack network traffic.
39. What is the primary storage in a Data Warehouse?
Correct Answer: b) Structured, schema-on-write tables
Explanation:
Data Warehouses enforce schema-on-write for optimized querying on cleaned, integrated data.
40. Which protocol is used for secure communication in Hadoop?
Correct Answer: b) Kerberos
Explanation:
Kerberos provides strong authentication for Hadoop clusters, verifying the identity of users and services; authorization is enforced separately, for example through HDFS permissions and ACLs.
41. What is Apache Storm used for?
Correct Answer: b) Real-time stream processing
Explanation:
Storm processes unbounded streams of data in real-time, handling tasks like log analysis and fraud detection.
42. In MapReduce, what handles intermediate data spill to disk?
Correct Answer: c) Spill mechanism
Explanation:
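When the map-side in-memory buffer fills past its threshold, MapReduce sorts the buffered records and spills them to disk; the spill files are merged before the shuffle delivers them to the reducers.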
43. What is a 'partition' in Spark?
Correct Answer: b) A logical chunk of an RDD
Explanation:
Partitions are the basic units of parallelism in Spark, distributed across the cluster for computation.
44. Which Big Data tool is for workflow scheduling?
Correct Answer: a) Oozie
Explanation:
Oozie coordinates complex workflows of Hadoop jobs, including MapReduce, Pig, and Hive.
45. What is eventual consistency in Big Data systems?
Correct Answer: b) Agreement after some time
Explanation:
Eventual consistency allows temporary inconsistencies between replicas but guarantees that, if no new updates arrive, all replicas converge to the same value; it prioritizes availability over immediate consistency.
46. In Parquet format, what is columnar storage beneficial for?
Correct Answer: b) Column-wise queries and compression
Explanation:
Parquet's columnar format reduces I/O for analytics queries and enables better compression ratios.
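The benefit of a columnar layout can be shown with a toy in-memory pivot. This sketch is not Parquet's on-disk format; it only illustrates why reading one column is cheaper and why similar adjacent values compress well. The sample rows and both helper names are assumptions.

```python
rows = [
    {"id": 1, "country": "DE", "amount": 10.0},
    {"id": 2, "country": "DE", "amount": 5.5},
    {"id": 3, "country": "FR", "amount": 7.0},
]


def to_columns(records):
    """Pivot row-oriented records into one list per column."""
    return {name: [record[name] for record in records] for name in records[0]}


def run_length_encode(values):
    """Collapse runs of repeated values into [value, count] pairs."""
    encoded = []
    for value in values:
        if encoded and encoded[-1][0] == value:
            encoded[-1][1] += 1
        else:
            encoded.append([value, 1])
    return encoded


columns = to_columns(rows)
total = sum(columns["amount"])                       # touches only the 'amount' column
country_rle = run_length_encode(columns["country"])  # repeated values shrink to runs
print(total, country_rle)
```

An aggregate over amount never has to read the id or country values, which is the I/O saving the explanation refers to.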
47. What is the ResourceManager's role in YARN?
Correct Answer: b) Arbitrates resources among applications
Explanation:
ResourceManager globally manages cluster resources, allocating them via the Scheduler to ApplicationMasters.
48. Which architecture uses micro-batches for streaming?
Correct Answer: b) Spark Streaming
Explanation:
Spark Streaming divides streams into small batches (e.g., 1-second intervals) processed as RDDs.
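Micro-batching can be sketched by bucketing timestamped events into fixed intervals. This is a plain-Python illustration rather than Spark code; the function name micro_batches and the sample events are assumptions, and intervals with no events are simply omitted here for brevity.

```python
from collections import defaultdict


def micro_batches(events, interval_seconds=1.0):
    """Group (timestamp, value) events into consecutive fixed-width time windows."""
    batches = defaultdict(list)
    for timestamp, value in events:
        batches[int(timestamp // interval_seconds)].append(value)
    return [batches[index] for index in sorted(batches)]


events = [(0.1, "a"), (0.7, "b"), (1.2, "c"), (2.5, "d"), (2.9, "e")]
print(micro_batches(events))
```

Each inner list stands in for the small batch that would be processed as one unit.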
49. What is Apache Flink's key feature?
Correct Answer: b) Unified batch and stream processing
Explanation:
Flink treats batches as bounded streams, providing low-latency and exactly-once semantics for both.
50. In HBase, what is a 'RegionServer'?
Correct Answer: b) Hosts regions of tables
Explanation:
RegionServers manage and serve data regions, handling reads, writes, and compactions for HBase tables.
51. What is data partitioning strategy in Big Data for load balancing?
Correct Answer: d) All of the above
Explanation:
Various strategies like hash, round-robin, and range partitioning distribute data evenly to prevent hotspots.
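Two of the strategies named above, round-robin and range partitioning, can be sketched in a few lines; hash partitioning would instead apply a stable hash of the key modulo the partition count. The split points and function names below are hypothetical.

```python
import bisect


def round_robin_partition(record_index: int, num_partitions: int) -> int:
    """Cycle through partitions in order, ignoring the record's content."""
    return record_index % num_partitions


def range_partition(key: str, boundaries: list) -> int:
    """Pick a partition from sorted split points; keys below boundaries[0] go to 0."""
    return bisect.bisect_right(boundaries, key)


boundaries = ["h", "p"]  # assumed split points: keys < "h", "h" <= keys < "p", keys >= "p"
print(range_partition("apple", boundaries))   # 0
print(range_partition("mango", boundaries))   # 1
print(range_partition("zebra", boundaries))   # 2
print([round_robin_partition(i, 3) for i in range(5)])  # [0, 1, 2, 0, 1]
```

Range partitioning keeps neighboring keys together, which helps range scans but can create hot spots if the keys are skewed; round-robin spreads load evenly but loses key locality.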
52. Which tool visualizes Big Data workflows?
Correct Answer: a) Zeppelin
Explanation:
Apache Zeppelin is a web-based notebook for interactive data analytics, supporting Spark, Hive, and more.
53. What is the 'speed layer' in Lambda Architecture?
Correct Answer: b) Real-time data processing for recent updates
Explanation:
The speed layer computes recent data increments quickly, complementing the slower batch layer.
54. In Spark, what is lazy evaluation?
Correct Answer: b) Transformations recorded but not computed until action
Explanation:
Lazy evaluation builds the DAG of transformations, optimizing the plan before executing on an action call.
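The idea can be sketched without Spark: transformations only record a step, and nothing runs until an action is called. LazyPipeline below is a toy stand-in for an RDD, not the Spark API, and it performs no plan optimization.

```python
class LazyPipeline:
    """Toy stand-in for an RDD: transformations are recorded, actions execute them."""

    def __init__(self, data, steps=()):
        self._data = data
        self._steps = tuple(steps)

    def map(self, fn):
        # Transformation: return a new pipeline with one more recorded step.
        return LazyPipeline(self._data, self._steps + (("map", fn),))

    def filter(self, fn):
        return LazyPipeline(self._data, self._steps + (("filter", fn),))

    def collect(self):
        """Action: run every recorded step in order and return the results."""
        items = iter(self._data)
        for kind, fn in self._steps:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return list(items)


calls = []


def double(x):
    calls.append(x)  # record that the function actually ran
    return x * 2


pipeline = LazyPipeline([1, 2, 3, 4]).map(double).filter(lambda x: x > 4)
print(calls)               # [] because nothing has run yet
print(pipeline.collect())  # [6, 8]
print(calls)               # [1, 2, 3, 4]
```

Because the full chain of steps is known before anything executes, a real engine has the chance to reorder or fuse steps first.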
55. What is Apache NiFi for?
Correct Answer: a) Data flow automation
Explanation:
NiFi automates data routing, transformation, and mediation between systems, with flows designed and monitored through a visual web interface.
56. Which consistency model does DynamoDB use?
Correct Answer: b) Eventual
Explanation:
DynamoDB reads are eventually consistent by default, which favors high availability; strongly consistent reads can be requested per operation.
57. What is a 'checkpoint' in stream processing?
Correct Answer: b) State snapshot for fault recovery
Explanation:
Checkpoints periodically save application state so a job can resume from the last snapshot after a failure; combined with a replayable source, this is the basis for exactly-once state semantics.
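A checkpoint can be sketched as a snapshot of the job's state together with its input position. The toy below assumes a replayable in-memory event list and a JSON file as the checkpoint store; CountingJob and its methods are hypothetical.

```python
import json
import os
import tempfile


class CountingJob:
    """Toy stateful stream job: counts events and tracks the next input offset."""

    def __init__(self):
        self.offset = 0
        self.count = 0

    def process(self, events, stop_after=None):
        """Consume events from the current offset up to stop_after (or the end)."""
        for _ in events[self.offset:stop_after]:
            self.count += 1
            self.offset += 1

    def checkpoint(self, path):
        """Persist state and input position together so they stay consistent."""
        with open(path, "w") as f:
            json.dump({"offset": self.offset, "count": self.count}, f)

    @classmethod
    def restore(cls, path):
        job = cls()
        with open(path) as f:
            state = json.load(f)
        job.offset, job.count = state["offset"], state["count"]
        return job


events = list("abcdefgh")
path = os.path.join(tempfile.mkdtemp(), "checkpoint.json")

job = CountingJob()
job.process(events, stop_after=5)
job.checkpoint(path)                    # snapshot taken after 5 events

recovered = CountingJob.restore(path)   # simulate a crash and a restart
recovered.process(events)               # replay resumes at offset 5
print(recovered.count)                  # 8: every event counted exactly once
```

Saving the offset alongside the count is what prevents events from being counted twice after recovery.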
58. In Hadoop, what is 'federation'?
Correct Answer: b) Multiple independent NameNodes for namespaces
Explanation:
HDFS Federation scales by allowing multiple NameNodes to manage separate namespaces on the same DataNodes.
59. What is ORC file format optimized for?
Correct Answer: b) Hive queries with compression and predicate pushdown
Explanation:
Optimized Row Columnar (ORC) format enhances Hive performance through columnar storage and advanced indexing.
60. Which component in YARN monitors node health?
Correct Answer: b) NodeManager
Explanation:
NodeManager runs on each worker node, launching and monitoring containers and reporting node health and resource usage to the ResourceManager.
61. What is 'backpressure' in stream processing architectures?
Correct Answer: b) Mechanism to handle overload by slowing producers
Explanation:
Backpressure prevents system overload by signaling upstream sources to reduce data emission rates.
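A bounded buffer is one simple way to surface backpressure: when it is full, the producer is told to wait instead of piling up more data. The sketch below uses Python's standard queue module; the buffer size and the try_produce helper are assumptions.

```python
import queue

buffer = queue.Queue(maxsize=3)  # bounded buffer between producer and consumer


def try_produce(item) -> bool:
    """Return False instead of enqueuing when the consumer has fallen behind."""
    try:
        buffer.put_nowait(item)
        return True
    except queue.Full:
        return False  # signal the producer to slow down or retry later


accepted = [try_produce(i) for i in range(5)]
print(accepted)              # [True, True, True, False, False]

buffer.get_nowait()          # the consumer drains one item
print(try_produce("retry"))  # True: capacity is available again
```

A blocking put() would achieve the same effect by pausing the producer until space frees up.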
62. In GraphX, what is a Property Graph?
Correct Answer: b) Graph with properties on vertices and edges
Explanation:
Spark GraphX represents graphs as RDDs of vertices and edges with associated properties for analytics.
63. What is Apache Phoenix?
Correct Answer: b) SQL layer over HBase
Explanation:
Phoenix provides a JDBC-compliant SQL interface for low-latency queries on HBase data.
64. In Big Data, what is 'data lineage'?
Correct Answer: b) Tracking data flow and transformations
Explanation:
Data lineage records the origin, movement, and processing history of data for auditing and debugging.
65. Which scalability type adds more nodes?
Correct Answer: b) Horizontal
Explanation:
Horizontal scalability (scale-out) distributes load across additional machines, key for Big Data systems.
66. What is Apache Drill for?
Correct Answer: b) Schema-free SQL queries on diverse data sources
Explanation:
Drill enables interactive queries across NoSQL, files, and cloud storage without predefined schemas.
67. In Kafka, what is a 'topic'?
Correct Answer: b) A category or feed name for messages
Explanation:
Topics in Kafka are partitioned logs where producers publish and consumers subscribe to streams.
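The notion of a topic as a set of partitioned, append-only logs can be sketched in memory. This toy is not the Kafka client API; the Topic class, its methods, and the SHA-256 key hash are stand-ins chosen for the example.

```python
import hashlib


class Topic:
    """Toy in-memory topic: a fixed number of append-only partition logs."""

    def __init__(self, name, num_partitions):
        self.name = name
        self.partitions = [[] for _ in range(num_partitions)]

    def publish(self, key, value):
        """Append to the partition chosen by a stable key hash; return (partition, offset)."""
        digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
        index = int(digest, 16) % len(self.partitions)
        self.partitions[index].append(value)
        return index, len(self.partitions[index]) - 1

    def read(self, partition, offset):
        """Return all records in one partition from the given offset onward."""
        return self.partitions[partition][offset:]


topic = Topic("orders", num_partitions=3)
p1, o1 = topic.publish("customer-42", "created")
p2, o2 = topic.publish("customer-42", "paid")
print(p1 == p2, o1, o2)   # True 0 1: same key, same partition, increasing offsets
print(topic.read(p1, 0))  # ['created', 'paid']
```

Records with the same key stay in order because they share a partition, and a consumer tracks its own position by offset.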
68. What is 'idempotency' in Big Data processing?
Correct Answer: b) Repeatable operations without side effects
Explanation:
Idempotent operations ensure that retries or duplicates produce the same result, aiding fault tolerance.
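One way to make a non-idempotent update safe under retries is to deduplicate on a unique event id. The sketch below assumes every event carries such an id; IdempotentSink is a hypothetical class.

```python
class IdempotentSink:
    """Applies each event at most once, keyed by a unique event id."""

    def __init__(self):
        self.balance = 0
        self._seen = set()

    def apply(self, event_id, amount):
        if event_id in self._seen:
            return False  # duplicate delivery: ignore it
        self._seen.add(event_id)
        self.balance += amount
        return True


sink = IdempotentSink()
deliveries = [("e1", 10), ("e2", 5), ("e1", 10)]  # e1 is redelivered after a retry
for event_id, amount in deliveries:
    sink.apply(event_id, amount)
print(sink.balance)  # 15, not 25
```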
69. Which tool manages Hadoop cluster deployment?
Correct Answer: c) Both a and b
Explanation:
Ambari and Cloudera Manager automate installation, configuration, and monitoring of Hadoop ecosystems.
70. What is a 'view' in Big Data serving layers?
Correct Answer: b) Precomputed query results
Explanation:
Views are materialized aggregates or joins stored for fast access in query serving systems.


