1. What does the 'V' in Big Data's 3Vs stand for primarily?
✅ Correct Answer: b) Volume
📝 Explanation:
Volume refers to the massive amounts of data generated, which is a core characteristic of Big Data, requiring scalable storage solutions like HDFS.
2. Which architecture pattern processes both batch and real-time data streams?
✅ Correct Answer: b) Lambda Architecture
📝 Explanation:
Lambda Architecture combines a batch layer for historical data with a speed layer for real-time data, ensuring fault-tolerant and scalable processing.
3. In Hadoop, what is the primary role of the NameNode?
✅ Correct Answer: b) Metadata management
📝 Explanation:
The NameNode manages the file system's namespace and regulates access to files by clients, storing metadata in memory.
4. What is the default block size in HDFS?
✅ Correct Answer: b) 128 MB
📝 Explanation:
HDFS uses a default block size of 128 MB to balance storage efficiency and processing overhead in distributed environments.
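As a rough illustration of how the block size shapes storage layout, the sketch below counts the blocks a file would occupy at the 128 MB default. The function name and the file sizes are made up for the example.

```python
import math

BLOCK_SIZE_MB = 128  # HDFS default block size cited above

def block_count(file_size_mb: float, block_size_mb: int = BLOCK_SIZE_MB) -> int:
    """Number of HDFS blocks needed to hold a file of the given size."""
    if file_size_mb <= 0:
        return 0
    return math.ceil(file_size_mb / block_size_mb)

print(block_count(1024))  # a 1 GB file
print(block_count(300))   # the last block is only partly filled
```

Fewer, larger blocks mean less metadata for the NameNode to hold, which is one reason the default is much larger than a typical local file system block.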
5. Which component in Hadoop 2.x manages cluster resources and job scheduling?
✅ Correct Answer: c) YARN
📝 Explanation:
Yet Another Resource Negotiator (YARN) separates cluster-wide resource management from per-application scheduling and monitoring, enabling multi-tenancy in Hadoop clusters.
6. In MapReduce, what does the Map phase do?
✅ Correct Answer: c) Processes key-value pairs independently
📝 Explanation:
The Map phase takes input data and converts it into intermediate key-value pairs, processed in parallel across nodes.
7. What is the purpose of the Reduce phase in MapReduce?
✅ Correct Answer: b) Merge and aggregate mapped data
📝 Explanation:
The Reduce phase receives grouped key-value pairs from the Map phase and performs aggregation to produce final output.
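The Map and Reduce phases described in questions 6 and 7 can be sketched in plain Python as a word count. This is a single-process toy, not Hadoop code; the function names and input lines are illustrative only.

```python
from collections import defaultdict

def map_phase(line):
    """Emit an intermediate (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Group intermediate values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Aggregate all values for one key into a final result."""
    return key, sum(values)

lines = ["big data big cluster", "data lake"]
intermediate = [pair for line in lines for pair in map_phase(line)]
result = dict(reduce_phase(k, v) for k, v in shuffle(intermediate).items())
print(result)
```

Each call to map_phase depends only on its own line, which is what lets a real cluster run the Map phase in parallel across nodes.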
8. Which file format is optimized for Hive and supports schema evolution?
✅ Correct Answer: a) Avro
📝 Explanation:
Avro is a compact, row-oriented binary format that stores its schema with the data, making it well suited to streaming data and to schema evolution without rewriting existing files.
9. What is a Data Lake in Big Data architecture?
✅ Correct Answer: b) A repository for raw, unprocessed data in native format
📝 Explanation:
Data Lakes store vast amounts of raw data from various sources, enabling schema-on-read for flexible analytics.
10. In Spark, what is the role of the Driver Program?
✅ Correct Answer: b) Coordinates the execution of the application
📝 Explanation:
The Driver runs the main() function and creates the SparkContext, directing the overall flow and dividing work among Executors.
11. Which Spark component manages data sharing and caching?
✅ Correct Answer: b) RDD
📝 Explanation:
Resilient Distributed Datasets (RDDs) are immutable, partitioned collections that can be cached and reused across operations, and that recover from failures through lineage tracking.
12. What is the main advantage of using Apache Kafka in Big Data pipelines?
✅ Correct Answer: b) High-throughput, fault-tolerant messaging
📝 Explanation:
Kafka acts as a distributed streaming platform, handling real-time data feeds with durability and scalability.
13. In NoSQL databases for Big Data, which type is best for hierarchical data?
✅ Correct Answer: b) Document
📝 Explanation:
Document stores like MongoDB handle semi-structured data in JSON-like documents, suitable for flexible schemas.
14. What does CAP theorem state in distributed Big Data systems?
✅ Correct Answer: a) Consistency, Availability, Partition tolerance - pick two
📝 Explanation:
The CAP theorem states that a distributed system cannot guarantee all three properties at once: when a network partition occurs, it must give up either consistency or availability.
15. Which tool is used for ETL processes in Hadoop ecosystem?
✅ Correct Answer: a) Pig
📝 Explanation:
Apache Pig provides a high-level scripting language (Pig Latin) for complex data transformations, which it compiles into MapReduce jobs.
16. What is the fault tolerance mechanism in HDFS?
✅ Correct Answer: a) Data replication
📝 Explanation:
HDFS replicates data blocks across multiple DataNodes (default 3 replicas) to ensure availability and recovery from failures.
17. In Lambda Architecture, what is the 'batch layer' responsible for?
✅ Correct Answer: b) Computing arbitrary views on immutable master dataset
📝 Explanation:
The batch layer processes the entire dataset periodically to generate precomputed views for accurate, comprehensive results.
18. Which protocol does HDFS use for data transfer?
✅ Correct Answer: b) TCP/IP
📝 Explanation:
HDFS relies on TCP/IP for reliable data transfer between clients, NameNode, and DataNodes in the cluster.
19. What is Spark's Directed Acyclic Graph (DAG) used for?
✅ Correct Answer: b) Optimizing execution plans
📝 Explanation:
DAG in Spark represents the logical execution plan, allowing optimizations and scheduling of stages for efficient computation.
20. In Big Data, what is a 'sharded' architecture?
✅ Correct Answer: b) Horizontal partitioning of data across nodes
📝 Explanation:
Sharding distributes data subsets across multiple database instances to improve scalability and performance.
21. Which Apache project provides SQL-like querying on Hadoop?
✅ Correct Answer: b) Hive
📝 Explanation:
Hive offers HiveQL, a SQL dialect, to query and analyze large datasets stored in HDFS via MapReduce or Tez.
22. What is the role of ZooKeeper in Big Data architectures?
✅ Correct Answer: b) Distributed coordination and configuration
📝 Explanation:
ZooKeeper provides a centralized service for maintaining configuration information and naming in distributed systems like Hadoop.
23. In Kappa Architecture, what replaces the batch layer?
✅ Correct Answer: b) Stream processing
📝 Explanation:
Kappa simplifies Lambda by using a single stream processing layer for both real-time and historical data via log replay.
24. What is HBase in the Hadoop ecosystem?
✅ Correct Answer: b) Distributed, scalable, big data store
📝 Explanation:
HBase is a NoSQL column-oriented database modeled after Google's Bigtable, providing random read/write access to data stored on HDFS.
25. Which feature of Spark enables in-memory computation?
✅ Correct Answer: a) RDD persistence
📝 Explanation:
Spark allows RDDs to be cached in memory, reducing recomputation and speeding up iterative algorithms significantly.
26. What is the purpose of Apache Flume?
✅ Correct Answer: b) Log data collection and aggregation
📝 Explanation:
Flume is designed for efficiently collecting, aggregating, and moving large amounts of log data from sources to HDFS.
27. In Big Data, what is 'schema-on-read'?
✅ Correct Answer: b) Applying schema when data is queried
📝 Explanation:
Schema-on-read, used in Data Lakes, allows flexible ingestion of raw data and interpretation at query time.
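A minimal sketch of schema-on-read, assuming raw JSON lines as the stored format. The records, field names, and defaults are hypothetical; the point is that types and missing fields are resolved only when the data is read.

```python
import json

# Raw records are stored exactly as they arrived; no schema is enforced on write.
raw_records = [
    '{"user": "ana", "amount": "12.50", "country": "PT"}',
    '{"user": "bo", "amount": "7"}',
]

# Hypothetical schema chosen by the reader at query time: field -> (type, default).
schema = {"user": (str, ""), "amount": (float, 0.0), "country": (str, "unknown")}

def read_with_schema(line, schema):
    """Parse one raw line and coerce its fields to the reader's schema."""
    record = json.loads(line)
    return {
        field: cast(record[field]) if field in record else default
        for field, (cast, default) in schema.items()
    }

rows = [read_with_schema(line, schema) for line in raw_records]
total = sum(row["amount"] for row in rows)
print(rows, total)
```

A different consumer could read the same raw lines with a different schema, which is the flexibility the explanation refers to.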
28. Which YARN component negotiates resources from the ResourceManager?
✅ Correct Answer: b) ApplicationMaster
📝 Explanation:
The ApplicationMaster is application-specific and requests resources (containers) from the ResourceManager for task execution.
29. What is Apache Tez?
✅ Correct Answer: b) An execution engine for DAGs on Hadoop
📝 Explanation:
Tez generalizes MapReduce by executing arbitrary DAGs, improving performance for Hive and Pig jobs.
30. In Spark Streaming, what is a DStream?
✅ Correct Answer: b) A continuous stream of RDDs
📝 Explanation:
Discretized Stream (DStream) represents a stream as a sequence of RDDs, enabling micro-batch processing.
31. What is the default replication factor in HDFS?
✅ Correct Answer: c) 3
📝 Explanation:
The default replication factor of 3 ensures data availability by storing three copies across different nodes or racks.
32. Which tool is used for transferring bulk data between Hadoop and relational databases?
✅ Correct Answer: b) Sqoop
📝 Explanation:
Sqoop (SQL-to-Hadoop) uses MapReduce to import/export data efficiently between RDBMS and HDFS.
33. What is a 'hot spot' in Big Data partitioning?
✅ Correct Answer: a) Overloaded partition
📝 Explanation:
Hot spots occur when data skew causes uneven load on nodes, degrading performance in distributed systems.
34. In Cassandra, what ensures data consistency?
✅ Correct Answer: a) Quorum reads/writes
📝 Explanation:
Cassandra uses tunable consistency levels like quorum (majority of replicas) to balance availability and consistency.
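The arithmetic behind quorum consistency can be sketched as follows. The helper names are illustrative and are not part of any Cassandra API: a quorum is a majority of replicas, and reads are guaranteed to see the latest write when the read and write replica counts together exceed the replication factor.

```python
def quorum(replication_factor: int) -> int:
    """Smallest number of replicas that forms a majority."""
    return replication_factor // 2 + 1

def overlaps(read_replicas: int, write_replicas: int, replication_factor: int) -> bool:
    """True when every read set must intersect every write set (R + W > RF)."""
    return read_replicas + write_replicas > replication_factor

rf = 3
q = quorum(rf)
print(q)                   # replicas needed for a quorum with RF = 3
print(overlaps(q, q, rf))  # quorum reads combined with quorum writes
print(overlaps(1, 1, rf))  # single-replica reads and writes
```

Lower levels respond faster and tolerate more failed replicas, at the cost of possibly reading stale data.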
35. What is the serving layer in Lambda Architecture?
✅ Correct Answer: b) Merges batch and real-time views
📝 Explanation:
The serving layer indexes and combines precomputed batch views with real-time updates for low-latency queries.
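As a toy sketch of how a serving layer might answer a query, the example below adds a batch view to a real-time view. The page-view counts and the query function are hypothetical.

```python
from collections import Counter

# Hypothetical page-view counts: the batch view covers everything up to the
# last batch run; the real-time view covers only events that arrived since.
batch_view = Counter({"/home": 1000, "/pricing": 250})
realtime_view = Counter({"/home": 12, "/signup": 3})

def query(page: str) -> int:
    """Answer a query by combining the batch and real-time views."""
    return batch_view[page] + realtime_view[page]

print(query("/home"), query("/signup"), query("/pricing"))
```

When the next batch run completes, its view absorbs the recent events and the corresponding real-time entries can be discarded.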
36. Which Spark library is for structured data processing?
✅ Correct Answer: c) Spark SQL
📝 Explanation:
Spark SQL provides a DataFrame API and SQL interface for processing structured and semi-structured data.
37. What is Apache Mahout used for in Big Data?
✅ Correct Answer: b) Scalable machine learning
📝 Explanation:
Mahout offers libraries for collaborative filtering, clustering, and classification on large datasets.
38. In HDFS, what is a 'rack'?
✅ Correct Answer: b) A group of machines in the same data center
📝 Explanation:
Rack awareness in HDFS places replicas across racks to improve fault tolerance and bandwidth.
39. What is the primary storage in a Data Warehouse?
✅ Correct Answer: b) Structured, schema-on-write tables
📝 Explanation:
Data Warehouses enforce schema-on-write for optimized querying on cleaned, integrated data.
40. Which protocol is used for secure communication in Hadoop?
✅ Correct Answer: b) Kerberos
📝 Explanation:
Kerberos provides strong, ticket-based authentication for Hadoop clusters, so that users and services must prove their identity before accessing cluster resources.
41. What is Apache Storm used for?
✅ Correct Answer: b) Real-time stream processing
📝 Explanation:
Storm processes unbounded streams of data in real-time, handling tasks like log analysis and fraud detection.
42. In MapReduce, what handles intermediate data spill to disk?
✅ Correct Answer: c) Spill mechanism
📝 Explanation:
When a map task's in-memory buffer fills, MapReduce sorts its contents and spills them to disk; the spill files are then merged before the shuffle.
43. What is a 'partition' in Spark?
✅ Correct Answer: b) A logical chunk of an RDD
📝 Explanation:
Partitions are the basic units of parallelism in Spark, distributed across the cluster for computation.
44. Which Big Data tool is for workflow scheduling?
✅ Correct Answer: a) Oozie
📝 Explanation:
Oozie coordinates complex workflows of Hadoop jobs, including MapReduce, Pig, and Hive.
45. What is eventual consistency in Big Data systems?
✅ Correct Answer: b) Agreement after some time
📝 Explanation:
Eventual consistency allows temporary inconsistencies but guarantees that replicas converge to the same value once updates stop, prioritizing availability.
46. In Parquet format, what is columnar storage beneficial for?
✅ Correct Answer: b) Column-wise queries and compression
📝 Explanation:
Parquet's columnar format reduces I/O for analytics queries and enables better compression ratios.
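The sketch below contrasts a row layout with a column layout in plain Python. It does not use Parquet itself, and run-length encoding appears only as one simple example of a scheme that benefits from storing similar values next to each other.

```python
# The same three records in a row-oriented and a column-oriented layout.
rows = [
    {"id": 1, "country": "DE", "amount": 10.0},
    {"id": 2, "country": "DE", "amount": 5.5},
    {"id": 3, "country": "FR", "amount": 4.5},
]
columns = {name: [row[name] for row in rows] for name in rows[0]}

def run_length_encode(values):
    """Collapse runs of equal adjacent values into [value, count] pairs."""
    encoded = []
    for value in values:
        if encoded and encoded[-1][0] == value:
            encoded[-1][1] += 1
        else:
            encoded.append([value, 1])
    return encoded

# An aggregate over one column touches only that column's values.
total = sum(columns["amount"])
encoded_country = run_length_encode(columns["country"])
print(total, encoded_country)
```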
47. What is the ResourceManager's role in YARN?
✅ Correct Answer: b) Arbitrates resources among applications
📝 Explanation:
ResourceManager globally manages cluster resources, allocating them via the Scheduler to ApplicationMasters.
48. Which architecture uses micro-batches for streaming?
✅ Correct Answer: b) Spark Streaming
📝 Explanation:
Spark Streaming divides streams into small batches (e.g., 1-second intervals) processed as RDDs.
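Micro-batching can be sketched without Spark by grouping timestamped events into fixed intervals. The function name, the events, and the choice to skip empty intervals are assumptions made for this example.

```python
from collections import defaultdict

def micro_batches(events, interval_seconds=1.0):
    """Group (timestamp, value) events into consecutive fixed-length batches."""
    batches = defaultdict(list)
    for timestamp, value in events:
        batch_index = int(timestamp // interval_seconds)
        batches[batch_index].append(value)
    # Intervals with no events are skipped in this sketch.
    return [batches[i] for i in sorted(batches)]

events = [(0.1, "a"), (0.7, "b"), (1.2, "c"), (2.5, "d"), (2.9, "e")]
batches = micro_batches(events)
print(batches)
```

Each inner list plays the role of one small batch that would be processed as an RDD.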
49. What is Apache Flink's key feature?
✅ Correct Answer: b) Unified batch and stream processing
📝 Explanation:
Flink treats batches as bounded streams, providing low latency and exactly-once semantics for both.
50. In HBase, what is a 'RegionServer'?
✅ Correct Answer: b) Hosts regions of tables
📝 Explanation:
RegionServers manage and serve data regions, handling reads, writes, and compactions for HBase tables.
51. What is data partitioning strategy in Big Data for load balancing?
✅ Correct Answer: d) All of the above
📝 Explanation:
Various strategies like hash, round-robin, and range partitioning distribute data evenly to prevent hotspots.
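The three strategies named above can be sketched as small partition-assignment functions. The partition count, the CRC32 hash, and the range boundaries are arbitrary choices for illustration; real systems use their own hash functions and split points.

```python
import bisect
import zlib

NUM_PARTITIONS = 4

def hash_partition(key: str) -> int:
    """Stable hash of the key, modulo the partition count."""
    return zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS

def round_robin_partition(record_index: int) -> int:
    """Cycle through partitions in arrival order, ignoring the key."""
    return record_index % NUM_PARTITIONS

# Hypothetical split points: keys < "g" go to 0, < "n" to 1, < "t" to 2, rest to 3.
RANGE_BOUNDARIES = ["g", "n", "t"]

def range_partition(key: str) -> int:
    """Place the key in the range whose boundaries enclose it."""
    return bisect.bisect_right(RANGE_BOUNDARIES, key)

for i, key in enumerate(["apple", "mango", "zebra"]):
    print(key, hash_partition(key), round_robin_partition(i), range_partition(key))
```

Range partitioning keeps neighboring keys together, which helps range scans but can create a hot spot when many keys fall in one range; hash and round-robin spread load more evenly at the cost of ordered access.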
52. Which tool visualizes Big Data workflows?
✅ Correct Answer: a) Zeppelin
📝 Explanation:
Apache Zeppelin is a web-based notebook for interactive data analytics, supporting Spark, Hive, and more.
53. What is the 'speed layer' in Lambda Architecture?
✅ Correct Answer: b) Real-time data processing for recent updates
📝 Explanation:
The speed layer computes recent data increments quickly, complementing the slower batch layer.
54. In Spark, what is lazy evaluation?
✅ Correct Answer: b) Transformations recorded but not computed until action
📝 Explanation:
Lazy evaluation builds the DAG of transformations, optimizing the plan before executing on an action call.
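Lazy evaluation can be imitated with a toy class that records transformations and runs them only when an action is called. LazyDataset is a made-up stand-in, not Spark's API, and it performs no plan optimization.

```python
class LazyDataset:
    """Toy stand-in for an RDD: transformations are recorded, not executed."""

    def __init__(self, data, plan=()):
        self._data = data
        self._plan = plan  # tuple of ("map" | "filter", function) steps

    def map(self, fn):
        return LazyDataset(self._data, self._plan + (("map", fn),))

    def filter(self, fn):
        return LazyDataset(self._data, self._plan + (("filter", fn),))

    def collect(self):
        """Action: run every recorded step and return the results."""
        items = list(self._data)
        for kind, fn in self._plan:
            if kind == "map":
                items = [fn(x) for x in items]
            else:
                items = [x for x in items if fn(x)]
        return items

calls = []

def traced_square(x):
    calls.append(x)  # records when the function actually runs
    return x * x

pipeline = LazyDataset(range(5)).map(traced_square).filter(lambda x: x % 2 == 0)
calls_before_action = len(calls)  # no transformation has run yet
result = pipeline.collect()       # the action triggers the whole plan
print(calls_before_action, result)
```

Because the full plan is known before anything runs, a real engine has the chance to reorder or combine steps before execution.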
55. What is Apache NiFi for?
✅ Correct Answer: a) Data flow automation
📝 Explanation:
NiFi automates data routing, transformation, and mediation between systems through a visual, web-based interface for designing and monitoring data flows.
56. Which consistency model does DynamoDB use?
✅ Correct Answer: b) Eventual
📝 Explanation:
DynamoDB reads are eventually consistent by default, which favors high availability, with an option for strongly consistent reads.
57. What is a 'checkpoint' in stream processing?
✅ Correct Answer: b) State snapshot for fault recovery
📝 Explanation:
Checkpoints periodically save application state so that processing can resume from the last snapshot after a failure, which is a building block for exactly-once processing.
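A minimal sketch of checkpoint-based recovery, assuming the input is a replayable log addressed by offset. The process function, the event values, and the simulated crash are all hypothetical.

```python
import copy

events = [5, 3, 8, 2, 7, 1]  # hypothetical input log, addressable by offset

def process(events, state, start_offset, checkpoint_every, fail_at=None):
    """Sum events from start_offset, snapshotting state every few events."""
    checkpoint = {"offset": start_offset, "state": copy.deepcopy(state)}
    for offset in range(start_offset, len(events)):
        if offset == fail_at:
            return None, checkpoint  # simulated crash: in-memory state is lost
        state["total"] += events[offset]
        if (offset + 1) % checkpoint_every == 0:
            checkpoint = {"offset": offset + 1, "state": copy.deepcopy(state)}
    return state, checkpoint

# The first run crashes at offset 3, after a checkpoint was taken at offset 2.
_, saved = process(events, {"total": 0}, 0, checkpoint_every=2, fail_at=3)
# Recovery restores the snapshot and replays the log from the saved offset.
final_state, _ = process(events, saved["state"], saved["offset"], checkpoint_every=2)
print(saved["offset"], final_state["total"])
```

The event at offset 2 was processed before the crash but after the last snapshot, so it is replayed on recovery; because the restored state predates it, it is still counted only once.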
58. In Hadoop, what is 'federation'?
✅ Correct Answer: b) Multiple independent NameNodes for namespaces
📝 Explanation:
HDFS Federation scales by allowing multiple NameNodes to manage separate namespaces on the same DataNodes.
59. What is ORC file format optimized for?
✅ Correct Answer: b) Hive queries with compression and predicate pushdown
📝 Explanation:
Optimized Row Columnar (ORC) format enhances Hive performance through columnar storage and advanced indexing.
60. Which component in YARN monitors node health?
✅ Correct Answer: b) NodeManager
📝 Explanation:
NodeManager runs on each worker node, launching containers and reporting node health to the ResourceManager.
61. What is 'backpressure' in stream processing architectures?
✅ Correct Answer: b) Mechanism to handle overload by slowing producers
📝 Explanation:
Backpressure prevents system overload by signaling upstream sources to reduce data emission rates.
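One simple way to picture backpressure is a bounded buffer that refuses new items when full, which tells the producer to hold back. The sketch below is single-threaded and deterministic; the buffer size and helper name are arbitrary.

```python
import queue

buffer = queue.Queue(maxsize=3)  # bounded buffer between producer and consumer
accepted, deferred = [], []

def try_produce(item):
    """Offer an item; a full buffer tells the producer to hold it back."""
    try:
        buffer.put_nowait(item)
        accepted.append(item)
        return True
    except queue.Full:
        deferred.append(item)
        return False

for item in range(5):  # the producer outpaces the consumer
    try_produce(item)

drained = buffer.get_nowait()            # the consumer catches up by one item
retry_ok = try_produce(deferred.pop(0))  # the producer retries a held-back item
print(accepted, deferred, drained, retry_ok)
```

Real stream processors propagate this signal upstream through the pipeline rather than dropping data or exhausting memory.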
62. In GraphX, what is a Property Graph?
✅ Correct Answer: b) Graph with properties on vertices and edges
📝 Explanation:
Spark GraphX represents graphs as RDDs of vertices and edges with associated properties for analytics.
63. What is Apache Phoenix?
✅ Correct Answer: b) SQL layer over HBase
📝 Explanation:
Phoenix provides a JDBC-compliant SQL interface for low-latency queries on HBase data.
64. In Big Data, what is 'data lineage'?
✅ Correct Answer: b) Tracking data flow and transformations
📝 Explanation:
Data lineage records the origin, movement, and processing history of data for auditing and debugging.
65. Which scalability type adds more nodes?
✅ Correct Answer: b) Horizontal
📝 Explanation:
Horizontal scalability (scale-out) distributes load across additional machines, key for Big Data systems.
66. What is Apache Drill for?
✅ Correct Answer: b) Schema-free SQL queries on diverse data sources
📝 Explanation:
Drill enables interactive queries across NoSQL, files, and cloud storage without predefined schemas.
67. In Kafka, what is a 'topic'?
✅ Correct Answer: b) A category or feed name for messages
📝 Explanation:
Topics in Kafka are partitioned logs where producers publish and consumers subscribe to streams.
68. What is 'idempotency' in Big Data processing?
✅ Correct Answer: b) Repeatable operations without side effects
📝 Explanation:
Idempotent operations ensure that retries or duplicates produce the same result, aiding fault tolerance.
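The difference between an idempotent and a non-idempotent update can be sketched as follows. Deduplicating by event ID is only one possible technique, and the account and event names are hypothetical.

```python
def apply_increment(balances, account, amount):
    """Not idempotent: every replay changes the result again."""
    balances[account] = balances.get(account, 0) + amount

def apply_once(balances, seen_ids, event_id, account, amount):
    """Idempotent: a replayed event ID is recognised and skipped."""
    if event_id in seen_ids:
        return
    seen_ids.add(event_id)
    balances[account] = balances.get(account, 0) + amount

naive, safe, seen = {}, {}, set()
for _ in range(2):  # the same event is delivered twice
    apply_increment(naive, "acct-1", 50)
    apply_once(safe, seen, "evt-42", "acct-1", 50)

print(naive, safe)
```

With at-least-once delivery, the second version gives the same balance no matter how many times the event is retried.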
69. Which tool manages Hadoop cluster deployment?
✅ Correct Answer: c) Both a and b
📝 Explanation:
Ambari and Cloudera Manager automate installation, configuration, and monitoring of Hadoop ecosystems.
70. What is a 'view' in Big Data serving layers?
✅ Correct Answer: b) Precomputed query results
📝 Explanation:
Views are materialized aggregates or joins stored for fast access in query serving systems.