1. IBM and ________ have announced a major initiative to use Hadoop to support university courses in distributed computer programming.
✅ Correct Answer: d) Google
📝 Explanation:
In 2007, Google and IBM announced a joint university initiative to address Internet-scale computing challenges, using Hadoop clusters to teach distributed programming.
2. Point out the correct statement.
✅ Correct Answer: b) Hadoop stores data in HDFS and supports data compression/decompression
📝 Explanation:
Data compression in Hadoop is handled by codecs such as bzip2, gzip, and LZO; the appropriate codec depends on the trade-off between compression ratio, speed, and whether the compressed format is splittable.
3. What license is Hadoop distributed under?
✅ Correct Answer: a) Apache License 2.0
📝 Explanation:
Hadoop is Open Source, released under Apache 2 license.
4. Sun also has the Hadoop Live CD ________ project, which allows running a fully functional Hadoop cluster using a live CD.
✅ Correct Answer: b) OpenSolaris
📝 Explanation:
The OpenSolaris Hadoop LiveCD project built a bootable CD-ROM image.
5. Which of the following genres does Hadoop produce?
✅ Correct Answer: a) Distributed file system
📝 Explanation:
The Hadoop Distributed File System (HDFS) is designed to store very large data sets reliably, and to stream those data sets at high bandwidth to the user.
6. What was Hadoop written in?
✅ Correct Answer: c) Java (programming language)
📝 Explanation:
The Hadoop framework itself is mostly written in the Java programming language, with some native code in C and command-line utilities written as shell scripts.
7. Which of the following platforms does Hadoop run on?
✅ Correct Answer: c) Cross-platform
📝 Explanation:
Hadoop is cross-platform: it runs on any operating system with a suitable Java runtime, though production clusters most commonly run on Linux.
8. Hadoop achieves reliability by replicating the data across multiple hosts and hence does not require ________ storage on hosts.
✅ Correct Answer: a) RAID
📝 Explanation:
With the default replication value, 3, data is stored on three nodes: two on the same rack, and one on a different rack.
9. Above the file systems comes the ________ engine, which consists of one Job Tracker, to which client applications submit MapReduce jobs.
✅ Correct Answer: a) MapReduce
📝 Explanation:
The MapReduce engine distributes work across the cluster: clients submit jobs to the JobTracker, which schedules the constituent tasks on TaskTrackers close to the data.
10. The Hadoop list includes the HBase database, the Apache Mahout ________ system, and matrix operations.
✅ Correct Answer: a) Machine learning
📝 Explanation:
The Apache Mahout project’s goal is to build a scalable machine learning tool.
11. Which of the following is a characteristic of HDFS?
✅ Correct Answer: d) All of the mentioned
📝 Explanation:
HDFS is designed to run on commodity hardware, provides high throughput access to application data, and is built using Java.
12. Point out the correct statement.
✅ Correct Answer: d) All of the mentioned
📝 Explanation:
HDFS is designed to run on commodity hardware, built using Java, and is open source.
13. Which of the following is a feature of HDFS?
✅ Correct Answer: d) All of the mentioned
📝 Explanation:
HDFS features include splitting files into blocks, replicating blocks, and having a fixed block size.
14. Which of the following is a benefit of HDFS?
✅ Correct Answer: d) All of the mentioned
📝 Explanation:
HDFS is economical, highly scalable, and highly available.
15. Point out the wrong statement.
✅ Correct Answer: d) HDFS is not fault tolerant
📝 Explanation:
HDFS is fault tolerant, so the statement 'HDFS is not fault tolerant' is wrong.
16. Which of these is not a feature of HDFS?
✅ Correct Answer: c) Low Latency Access
📝 Explanation:
HDFS is optimized for high-throughput, batch access to large datasets; it deliberately trades away low-latency access to individual records.
17. Which of these is a characteristic of HDFS NameNode?
✅ Correct Answer: a) Manages the file system namespace
📝 Explanation:
The NameNode manages the file system namespace and regulates access to files by clients.
18. What is the default block size in HDFS?
✅ Correct Answer: b) 64 MB
📝 Explanation:
The default block size was 64 MB in Hadoop 1.x; Hadoop 2.x and later default to 128 MB.
19. Which command is used to copy files from local file system to HDFS?
✅ Correct Answer: b) hadoop fs -copyFromLocal
📝 Explanation:
The command 'hadoop fs -copyFromLocal' is used to copy files from the local file system to HDFS.
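The same copy can also be done programmatically with the Java FileSystem API. The sketch below is illustrative only; the class name and paths are hypothetical, and it assumes the cluster configuration files are on the classpath.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Programmatic equivalent of `hadoop fs -copyFromLocal <src> <dst>`.
public class CopyToHdfs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();               // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);                   // handle to the configured HDFS
        fs.copyFromLocalFile(new Path("/tmp/data.txt"),         // local source (hypothetical)
                             new Path("/user/hadoop/data.txt")); // HDFS destination (hypothetical)
        fs.close();
    }
}
```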
20. What is the purpose of the Secondary NameNode in HDFS?
✅ Correct Answer: c) Performs periodic checkpoints of the NameNode's metadata
📝 Explanation:
The Secondary NameNode is responsible for performing periodic checkpoints of the NameNode's metadata to allow for recovery in case of a failure.
21. Which of these is not a Hadoop file format?
✅ Correct Answer: d) AvroFile
📝 Explanation:
Avro is a data serialization system, not a Hadoop file format. TextFile, SequenceFile, and RCFile are Hadoop-specific file formats.
22. Which of these is not a Hadoop file format?
✅ Correct Answer: d) JSONFile
📝 Explanation:
JSONFile is not a Hadoop-specific file format. SequenceFile, RCFile, and ORCFile are Hadoop-specific file formats.
23. Which of these is not a Hadoop file format?
✅ Correct Answer: None of the mentioned
📝 Explanation:
SequenceFile, RCFile, ORCFile, and ParquetFile are all Hadoop-specific file formats.
24. What is Hadoop primarily used for?
✅ Correct Answer: a) Big data processing
📝 Explanation:
Hadoop is an open-source framework for distributed storage and processing of large datasets across clusters of computers.
25. Which core component of Hadoop is responsible for data storage?
✅ Correct Answer: c) HDFS
📝 Explanation:
HDFS (Hadoop Distributed File System) is the primary storage system used by Hadoop.
26. What type of architecture does Hadoop use to process large data sets?
✅ Correct Answer: c) Master-slave
📝 Explanation:
Hadoop uses a master-slave architecture: master daemons (the NameNode and JobTracker/ResourceManager) coordinate metadata and scheduling, while slave daemons (DataNodes and TaskTrackers/NodeManagers) store the data and execute the tasks.
27. Hadoop can process data that is:
✅ Correct Answer: d) All of the above
📝 Explanation:
Hadoop is versatile and can handle structured, unstructured, and semi-structured data.
28. Which feature of Hadoop makes it suitable for processing large volumes of data?
✅ Correct Answer: a) Fault tolerance
📝 Explanation:
Hadoop's fault tolerance allows it to continue processing even if some nodes fail.
29. What mechanism does Hadoop use to ensure data is not lost in case of a node failure?
✅ Correct Answer: c) Data replication
📝 Explanation:
Hadoop replicates data across multiple nodes to ensure availability and fault tolerance.
30. Which programming model is primarily used by Hadoop to process large data sets?
✅ Correct Answer: d) MapReduce
📝 Explanation:
MapReduce is the core programming model for processing large datasets in Hadoop.
31. Which command is used to view the contents of a directory in HDFS?
✅ Correct Answer: a) hadoop fs -ls
📝 Explanation:
The 'ls' command lists the contents of directories in HDFS.
32. Which component in Hadoop's architecture is responsible for processing data?
✅ Correct Answer: c) JobTracker
📝 Explanation:
The JobTracker manages the execution of MapReduce jobs.
33. What role does the NameNode play in Hadoop Architecture?
✅ Correct Answer: a) Manages the cluster's storage resources
📝 Explanation:
The NameNode is the master server that manages the file system namespace and metadata.
34. In Hadoop, what is the function of a DataNode?
✅ Correct Answer: a) Stores data blocks
📝 Explanation:
DataNodes store the actual data in blocks and report to the NameNode.
35. Which type of file system does Hadoop use?
✅ Correct Answer: a) Distributed
📝 Explanation:
Hadoop uses the Hadoop Distributed File System (HDFS).
36. How does the Hadoop framework handle hardware failures?
✅ Correct Answer: b) Re-routing tasks
📝 Explanation:
Hadoop re-routes tasks to other nodes in case of failure.
37. What mechanism allows Hadoop to scale processing capacity?
✅ Correct Answer: a) Adding more nodes to the network
📝 Explanation:
Hadoop scales horizontally by adding more nodes.
38. How do you list all nodes in a Hadoop cluster using the command line?
✅ Correct Answer: a) hadoop dfsadmin -report
📝 Explanation:
The hadoop dfsadmin -report command prints filesystem statistics and the status of every DataNode in the cluster.
39. Which command can you use to check the health of the Hadoop file system?
✅ Correct Answer: b) hadoop fsck
📝 Explanation:
hadoop fsck checks the health of files in HDFS.
40. What is the purpose of the hadoop balancer command?
✅ Correct Answer: b) To balance the storage usage across the DataNodes
📝 Explanation:
The balancer evens out the distribution of data blocks across DataNodes.
41. What should you check first if the NameNode is not starting?
✅ Correct Answer: a) Configuration files
📝 Explanation:
Misconfigured files are a common reason for startup failures.
42. When a DataNode is reported as down, what is the first action to take?
✅ Correct Answer: b) Check network connectivity to the DataNode
📝 Explanation:
Network issues are often the cause of a node appearing down.
43. What is a fundamental characteristic of HDFS?
✅ Correct Answer: a) Fault tolerance
📝 Explanation:
HDFS is designed to be fault tolerant through data replication.
44. Which of these is a feature of MapReduce?
✅ Correct Answer: a) Automatic parallelization and distribution
📝 Explanation:
MapReduce automatically parallelizes the execution of the task across a large number of servers in the cluster, distributing the data and the computational logic.
45. Which of these is a key component of MapReduce?
✅ Correct Answer: a) JobTracker and TaskTracker
📝 Explanation:
In the original Hadoop MapReduce implementation, JobTracker and TaskTracker are key components that manage job scheduling and task execution.
46. Which of the following is the primary function of the Map phase in MapReduce?
✅ Correct Answer: c) To map input key-value pairs to intermediate key-value pairs
📝 Explanation:
The Map phase processes input key-value pairs and produces intermediate key-value pairs, which are then shuffled and sorted before the Reduce phase.
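For illustration, here is a minimal word-count style mapper using the org.apache.hadoop.mapreduce API; the class name TokenMapper and the tokenizing logic are hypothetical, not part of the original question.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Maps each input record (byte offset, line of text) to intermediate (word, 1) pairs.
public class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);  // emit an intermediate key-value pair
            }
        }
    }
}
```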
47. Which of these is NOT a phase in MapReduce?
✅ Correct Answer: d) Merge
📝 Explanation:
Shuffle and Sort are the intermediate steps between Map and Reduce; merging happens internally during the shuffle, but Merge is not a distinct phase of the MapReduce model.
48. Which of the following best describes the purpose of the Reduce phase?
✅ Correct Answer: b) To aggregate the mapped data
📝 Explanation:
The Reduce phase takes the intermediate data produced by the Map phase, groups it by key, and applies a reduce function to aggregate the values for each key.
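A matching reducer sketch (the SumReducer name is hypothetical) that takes the mapper's intermediate pairs, grouped by key, and sums the values:

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Aggregates each intermediate group (word, [1, 1, ...]) into a single (word, total) record.
public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));  // one aggregated record per key
    }
}
```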
49. Which of these classes is used to write the output of a MapReduce job?
✅ Correct Answer: c) FileOutputFormat
📝 Explanation:
FileOutputFormat is used to specify the output location for the MapReduce job. It defines how the output should be written to the file system.
50. Which of these classes is used to read the input for a MapReduce job?
✅ Correct Answer: b) FileInputFormat
📝 Explanation:
FileInputFormat is used to specify the input location for the MapReduce job. It defines how the input should be read from the file system.
51. Which of these is a generic API for MapReduce in Hadoop?
✅ Correct Answer: d) Job
📝 Explanation:
Job is a generic API for MapReduce in Hadoop. It provides a high-level interface to configure and run MapReduce jobs.
52. Which of these classes is used to specify the mapper class in a MapReduce job?
✅ Correct Answer: a) setMapperClass()
📝 Explanation:
setMapperClass() is used to specify the mapper class in a MapReduce job. It defines the class that will perform the map operation.
53. Which of these classes is used to specify the reducer class in a MapReduce job?
✅ Correct Answer: b) setReducerClass()
📝 Explanation:
setReducerClass() is used to specify the reducer class in a MapReduce job. It defines the class that will perform the reduce operation.
54. Which of these classes is used to specify the input format class in a MapReduce job?
✅ Correct Answer: c) setInputFormatClass()
📝 Explanation:
setInputFormatClass() is used to specify the input format class in a MapReduce job. It defines how the input data should be formatted.
55. Which of these classes is used to specify the output format class in a MapReduce job?
✅ Correct Answer: d) setOutputFormatClass()
📝 Explanation:
setOutputFormatClass() is used to specify the output format class in a MapReduce job. It defines how the output data should be formatted.
56. Which of these methods is used to set the number of reduce tasks in a MapReduce job?
✅ Correct Answer: b) setNumReduceTasks()
📝 Explanation:
setNumReduceTasks() is used to set the number of reduce tasks in a MapReduce job. It specifies how many reduce tasks should be executed.
57. Which of these methods is used to set the number of map tasks in a MapReduce job?
✅ Correct Answer: a) setNumMapTasks()
📝 Explanation:
setNumMapTasks() comes from the older org.apache.hadoop.mapred.JobConf API and only suggests a number of map tasks; the actual count is ultimately determined by the number of input splits.
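Questions 49–57 all exercise the same job-configuration API. The driver sketch below ties them together, reusing the hypothetical TokenMapper and SumReducer classes from the earlier sketches; the number of map tasks is not set explicitly because it follows from the input splits.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count"); // generic Job API (Q51)
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(TokenMapper.class);              // mapper class (Q52)
        job.setCombinerClass(SumReducer.class);             // optional map-side pre-aggregation (see Q61)
        job.setReducerClass(SumReducer.class);              // reducer class (Q53)
        job.setInputFormatClass(TextInputFormat.class);     // input format (Q54)
        job.setOutputFormatClass(TextOutputFormat.class);   // output format (Q55)
        job.setNumReduceTasks(2);                           // number of reduce tasks (Q56)

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));    // where input is read from (Q50)
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // where output is written (Q49)
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```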
58. What action should you take if you notice that the HDFS capacity is unexpectedly decreasing?
✅ Correct Answer: a) Check for under-replicated blocks
📝 Explanation:
Under-replicated blocks can cause capacity issues as Hadoop tries to replicate them.
59. Which operation is NOT a typical function of the Reduce phase in MapReduce?
✅ Correct Answer: d) Filtering records based on a condition
📝 Explanation:
Filtering is typically done in the Map phase; Reduce focuses on aggregation.
60. How does the MapReduce framework typically divide the processing of data?
✅ Correct Answer: c) Data is split into blocks, which are processed in parallel
📝 Explanation:
Input data is split into blocks and processed in parallel by mappers.
61. What is the role of the Combiner function in a MapReduce job?
✅ Correct Answer: b) To reduce the amount of data transferred between the Map and Reduce tasks
📝 Explanation:
Combiners perform local aggregation to minimize network traffic.
62. In which scenario would you configure multiple reducers in a MapReduce job?
✅ Correct Answer: d) All of the above
📝 Explanation:
Multiple reducers allow for parallelism, handle large data, and produce partitioned output.
63. What determines the number of mappers to be run in a MapReduce job?
✅ Correct Answer: a) The size of the input data
📝 Explanation:
The number of mappers is determined by the input split size.
64. What happens if a mapper fails during the execution of a MapReduce job?
✅ Correct Answer: b) Only the failed mapper tasks are retried
📝 Explanation:
Hadoop retries only the failed tasks to ensure fault tolerance.
65. Which MapReduce method is called once at the end of the task?
✅ Correct Answer: c) cleanup()
📝 Explanation:
The cleanup() method is called once at the end of each task.
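A small sketch of the task lifecycle (the LineCountMapper class is hypothetical): map() runs per record, while cleanup() runs exactly once when the task finishes, making it a natural place to flush accumulated state.

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Accumulates a line count in map() and emits a single total from cleanup(),
// which the framework calls exactly once at the end of the task.
public class LineCountMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    private long lines = 0;

    @Override
    protected void map(LongWritable key, Text value, Context context) {
        lines++;  // no per-record output
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        context.write(new Text("lines"), new LongWritable(lines));  // runs once at task end
    }
}
```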
66. How do you specify the number of reduce tasks for a Hadoop job?
✅ Correct Answer: a) Set the mapred.reduce.tasks parameter in the job configuration
📝 Explanation:
The number of reducers is set via job configuration: mapred.reduce.tasks in the old API, mapreduce.job.reduces in newer releases, or programmatically with Job.setNumReduceTasks().
67. What is the purpose of the Partitioner class in MapReduce?
✅ Correct Answer: d) To control which key-value pairs go to which reducer
📝 Explanation:
The Partitioner determines the reducer for each key.
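A minimal custom Partitioner sketch (the AlphabetPartitioner class and the two-reducer assumption are hypothetical); it would be registered on the job with job.setPartitionerClass(AlphabetPartitioner.class).

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Routes keys beginning with a-m to reducer 0 and everything else to reducer 1.
public class AlphabetPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        if (numPartitions <= 1 || key.getLength() == 0) {
            return 0;  // single reducer or empty key: nothing to route
        }
        char first = Character.toLowerCase(key.toString().charAt(0));
        return (first >= 'a' && first <= 'm') ? 0 : 1;
    }
}
```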
68. What does the WritableComparable interface in Hadoop define?
✅ Correct Answer: a) Data types that can be compared and written in Hadoop
📝 Explanation:
WritableComparable allows objects to be serialized and compared for sorting.
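A hedged sketch of a composite key implementing WritableComparable (the YearTempKey class is hypothetical): write()/readFields() handle serialization, and compareTo() drives the shuffle-phase sort.

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

// A composite key Hadoop can both serialize and sort during the shuffle.
public class YearTempKey implements WritableComparable<YearTempKey> {
    private int year;
    private int temperature;

    public YearTempKey() { }  // required no-arg constructor for deserialization

    public YearTempKey(int year, int temperature) {
        this.year = year;
        this.temperature = temperature;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeInt(year);
        out.writeInt(temperature);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        year = in.readInt();
        temperature = in.readInt();
    }

    @Override
    public int compareTo(YearTempKey other) {  // ascending by year, then temperature
        int cmp = Integer.compare(year, other.year);
        return cmp != 0 ? cmp : Integer.compare(temperature, other.temperature);
    }
}
```

In practice such a key should also override hashCode() and equals() so the default HashPartitioner routes equal keys to the same reducer.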
69. What common issue should be checked first when a MapReduce job is running slower than expected?
✅ Correct Answer: b) Inadequate memory allocation
📝 Explanation:
Insufficient memory can cause spilling to disk and slow performance.
70. What is an effective way to resolve data skew during the reduce phase of a MapReduce job?
✅ Correct Answer: a) Adjusting the number of reducers
📝 Explanation:
Adjusting the number of reducers, often together with a custom partitioner or key salting, helps spread skewed keys more evenly across the reduce tasks.
71. What is the primary function of the Resource Manager in YARN?
✅ Correct Answer: a) Managing cluster resources
📝 Explanation:
The ResourceManager is the master daemon that arbitrates resources among applications.
72. How does YARN improve the scalability of Hadoop?
✅ Correct Answer: a) By separating job management and resource management
📝 Explanation:
YARN decouples resource management from job scheduling/monitoring.
73. What role does the NodeManager play in a YARN cluster?
✅ Correct Answer: c) It manages the resources on a single node
📝 Explanation:
NodeManager is per-machine and responsible for containers and monitoring.
74. Which YARN component is responsible for monitoring the health of the cluster nodes?
✅ Correct Answer: b) NodeManager
📝 Explanation:
NodeManagers monitor their node's resource usage and report to the ResourceManager.
75. In YARN, what does the ApplicationMaster do?
✅ Correct Answer: a) Manages the lifecycle of an application
📝 Explanation:
ApplicationMaster negotiates resources and executes tasks for an application.
76. How does YARN handle the failure of an ApplicationMaster?
✅ Correct Answer: b) It automatically restarts the ApplicationMaster
📝 Explanation:
YARN restarts the ApplicationMaster on failure to ensure fault tolerance.
77. Which command is used to list all running applications in YARN?
✅ Correct Answer: a) yarn application -list
📝 Explanation:
This command lists all running YARN applications.
78. How can you kill an application in YARN using the command line?
✅ Correct Answer: a) yarn application -kill
📝 Explanation:
This command terminates a running application by ID.
79. What command would you use to check the logs for a specific YARN application?
✅ Correct Answer: a) yarn logs -applicationId
📝 Explanation:
This command aggregates and prints logs for the specified application.
80. What should be your first step if a YARN application fails to start?
✅ Correct Answer: a) Check the application logs for errors
📝 Explanation:
Logs provide the most direct insight into startup failures.
81. If you notice that applications in YARN are frequently being killed due to insufficient memory, what should you adjust?
✅ Correct Answer: a) Increase the container memory settings in YARN
📝 Explanation:
Adjusting container sizes allows more memory per application.
82. What is Hive primarily used for in the Hadoop ecosystem?
✅ Correct Answer: a) Data warehousing operations
📝 Explanation:
Hive enables SQL-like querying on large datasets in HDFS.
83. Which tool in the Hadoop ecosystem is best suited for real-time data processing?
✅ Correct Answer: d) Storm
📝 Explanation:
Storm is designed for real-time stream processing.
84. How does Pig differ from SQL in terms of data processing?
✅ Correct Answer: a) Pig processes data in a procedural manner, while SQL is declarative
📝 Explanation:
Pig uses a procedural language for data flow, unlike declarative SQL.
85. What is the primary function of Apache Flume?
✅ Correct Answer: b) Data ingestion into Hadoop
📝 Explanation:
Flume collects, aggregates, and moves log data into HDFS.
86. In the Hadoop ecosystem, what is the role of Oozie?
✅ Correct Answer: a) Job scheduling
📝 Explanation:
Oozie is a workflow scheduler for Hadoop jobs.
87. How does HBase provide fast access to large datasets?
✅ Correct Answer: a) By using a column-oriented storage format
📝 Explanation:
HBase is a column-family NoSQL database built on HDFS.
88. Which command in HBase is used to scan all records from a specific table?
✅ Correct Answer: a) scan 'table_name'
📝 Explanation:
The scan command retrieves all or a range of records from a table.
89. How do you create a new table in Hive?
✅ Correct Answer: a) CREATE TABLE table_name (columns)
📝 Explanation:
This is the standard HiveQL command for creating tables.
90. What is the primary command to view the status of a job in Oozie?
✅ Correct Answer: a) oozie job -info job_id
📝 Explanation:
This command displays detailed information about a job's status.
91. What functionality does the sqoop merge command provide?
✅ Correct Answer: d) Merging updates from an RDBMS into an existing Hadoop dataset
📝 Explanation:
Sqoop merge combines incremental imports with existing data.
92. What should you verify first if a Sqoop import fails?
✅ Correct Answer: a) The database connection settings
📝 Explanation:
Connection issues are the most common cause of import failures.
93. If a Hive query runs significantly slower than expected, what should be checked first?
✅ Correct Answer: a) The structure of the tables and indexes
📝 Explanation:
Poor table design can lead to inefficient queries.
94. What is Hive mainly used for in the Hadoop ecosystem?
✅ Correct Answer: a) Data warehousing
📝 Explanation:
Hive provides data summarization and ad-hoc querying.
95. How does Hive handle data storage?
✅ Correct Answer: b) It utilizes HDFS
📝 Explanation:
Hive stores data in HDFS using directories and files.
96. What type of data models does Hive support?
✅ Correct Answer: d) Structured, unstructured, and semi-structured data
📝 Explanation:
Through SerDes and file formats such as ORC and Parquet, Hive can work with structured, semi-structured, and unstructured data, although it is best suited to structured, tabular data.
97. Which Hive component is responsible for converting SQL queries into MapReduce jobs?
✅ Correct Answer: c) Hive Driver
📝 Explanation:
The Driver receives queries and creates execution plans.
98. How does partitioning in Hive improve query performance?
✅ Correct Answer: a) By decreasing the size of data scans
📝 Explanation:
Partitioning allows Hive to skip irrelevant partitions during queries.
99. What is the correct HiveQL command to list all tables in the database?
✅ Correct Answer: a) SHOW TABLES
📝 Explanation:
SHOW TABLES lists all tables in the current database.
100. How do you add a new column to an existing Hive table?
✅ Correct Answer: a) ALTER TABLE table_name ADD COLUMNS (new_column type)
📝 Explanation:
This command adds columns to a table without affecting existing data.
101. In Hive, which command would you use to change the data type of a column in a table?
✅ Correct Answer: a) ALTER TABLE table_name CHANGE COLUMN old_column new_column new_type
📝 Explanation:
CHANGE COLUMN allows modifying column names and types.
102. How can you optimize a Hive query to limit the number of MapReduce jobs it generates?
✅ Correct Answer: a) Use multi-table inserts whenever possible
📝 Explanation:
Multi-table inserts reduce the number of jobs by combining operations.
103. What is a common fix if a Hive query returns incorrect results?
✅ Correct Answer: c) Check and correct the query logic
📝 Explanation:
Incorrect results are usually due to errors in the query itself.
104. What should you check if a Hive job is running longer than expected without errors?
✅ Correct Answer: b) The configuration parameters for resource allocation
📝 Explanation:
Resource allocation affects job execution time.
105. What is Pig primarily used for in the Hadoop ecosystem?
✅ Correct Answer: a) Data transformations
📝 Explanation:
Pig is a high-level platform for creating programs that run on Hadoop.
106. What makes Pig different from traditional SQL in processing data?
✅ Correct Answer: a) Pig processes data iteratively and allows multiple outputs from a single query.
📝 Explanation:
Pig's procedural nature allows for complex data flows.
107. In Pig, what is the difference between 'STORE' and 'DUMP'?
✅ Correct Answer: a) 'STORE' writes the output to the filesystem, while 'DUMP' displays the output on the screen.
📝 Explanation:
DUMP is for viewing results locally, STORE for persisting to HDFS.
108. How does Pig handle schema-less data?
✅ Correct Answer: a) By inferring the schema at runtime.
📝 Explanation:
Pig is schema-flexible and infers types during execution.
109. How can Pig scripts be optimized to handle large datasets more efficiently?
✅ Correct Answer: c) By minimizing data read operations.
📝 Explanation:
Reducing I/O operations improves performance on large data.
110. What Pig command is used to load data from a file?
✅ Correct Answer: a) LOAD 'data.txt' AS (line);
📝 Explanation:
LOAD reads data into a relation with an optional schema.
111. How do you group data by a specific column in Pig?
✅ Correct Answer: a) GROUP data BY column;
📝 Explanation:
GROUP creates groups of data based on a key.
112. What Pig function aggregates data to find the total?
✅ Correct Answer: a) SUM(data.column);
📝 Explanation:
SUM computes the sum of values in a group.
113. How do you filter rows in Pig that match a specific condition?
✅ Correct Answer: a) FILTER data BY condition;
📝 Explanation:
FILTER removes tuples that do not match the condition.
114. What is the first thing you should check if a Pig script fails due to an out-of-memory error?
✅ Correct Answer: a) The data sizes being processed.
📝 Explanation:
Large data can exceed memory limits.
115. If a Pig script is unexpectedly slow, what should be checked first to improve performance?
✅ Correct Answer: b) The amount of data being processed.
📝 Explanation:
The volume of data being processed is the most common cause of slow Pig scripts; check the input size first, then reduce it early with FILTER and projection before expensive joins or groupings.
116. What is the primary storage model used by HBase?
✅ Correct Answer: b) Column-oriented
📝 Explanation:
HBase uses a column-family based storage model.
117. How does HBase handle scalability?
✅ Correct Answer: a) Through horizontal scaling by adding more nodes
📝 Explanation:
HBase scales by distributing regions across RegionServers.
118. Which of the following is true about Hadoop's design?
✅ Correct Answer: b) It assumes that hardware failures are the norm
📝 Explanation:
Hadoop is built to handle frequent hardware failures gracefully.
119. What is the default replication factor in HDFS?
✅ Correct Answer: c) 3
📝 Explanation:
The default replication factor is 3 for fault tolerance.
120. In MapReduce, what is the purpose of the shuffle phase?
✅ Correct Answer: b) To sort and group intermediate data by key
📝 Explanation:
Shuffle transfers and sorts mapper outputs for reducers.
121. Which Hadoop ecosystem tool is used for data serialization?
✅ Correct Answer: a) Avro
📝 Explanation:
Avro is a data serialization system for Hadoop.
122. What is YARN?
✅ Correct Answer: a) Yet Another Resource Negotiator
📝 Explanation:
YARN stands for Yet Another Resource Negotiator.
123. In Hive, what is a SerDe?
✅ Correct Answer: a) Serializer/Deserializer
📝 Explanation:
SerDe handles serialization and deserialization in Hive.
124. What is the main goal of Hadoop's data locality?
✅ Correct Answer: a) To minimize network traffic
📝 Explanation:
Data locality moves computation to data to avoid network I/O.
125. Which file format in Hadoop is optimized for OLAP workloads?
✅ Correct Answer: c) Parquet
📝 Explanation:
Parquet is columnar storage optimized for analytical queries.
126. What is the role of Zookeeper in Hadoop?
✅ Correct Answer: a) Distributed coordination service
📝 Explanation:
Zookeeper provides coordination for distributed applications.
127. In Pig, what does the FOREACH operator do?
✅ Correct Answer: b) Applies expressions to each tuple
📝 Explanation:
FOREACH generates a new relation by applying expressions.
128. What is a RegionServer in HBase?
✅ Correct Answer: a) Manages regions of tables
📝 Explanation:
RegionServers serve data for read and write requests.
129. Which Sqoop option is used for incremental imports?
✅ Correct Answer: a) --incremental
📝 Explanation:
This option allows importing only new or updated rows.
130. What is the default port for the NameNode web UI?
✅ Correct Answer: a) 50070
📝 Explanation:
Port 50070 is for the NameNode's web interface in Hadoop 1.x; 9870 in 3.x.
131. In MapReduce, what is speculation?
✅ Correct Answer: a) Running duplicate tasks to mitigate slow tasks
📝 Explanation:
Speculative execution launches duplicates of slow tasks.
132. What is the purpose of the /tmp directory in HDFS?
✅ Correct Answer: a) Temporary storage for intermediate files
📝 Explanation:
/tmp is used for temporary data during job execution.
133. Which tool is used for monitoring Hadoop clusters?
✅ Correct Answer: a) Ambari
📝 Explanation:
Ambari provides a web-based UI for provisioning and monitoring.
134. What is a Bloom filter in HBase?
✅ Correct Answer: a) A probabilistic data structure for membership testing
📝 Explanation:
Bloom filters reduce disk seeks for non-existent keys.
135. In Hive, what is bucketing?
✅ Correct Answer: a) Dividing data into buckets based on a hash of a column
📝 Explanation:
Bucketing improves join performance by distributing data evenly.
136. What is the maximum number of characters in a Hadoop block name?
✅ Correct Answer: a) 128
📝 Explanation:
Block IDs are 128-bit numbers.
137. Which is not a valid Hadoop daemon?
✅ Correct Answer: d) QueryNode
📝 Explanation:
QueryNode is not a Hadoop daemon.
138. What does DFS stand for in HDFS?
✅ Correct Answer: a) Distributed File System
📝 Explanation:
HDFS is Hadoop's Distributed File System.
139. In YARN, what is a container?
✅ Correct Answer: a) A unit of resource allocation
📝 Explanation:
Containers encapsulate resources like CPU and memory for tasks.
140. What is the purpose of the InputFormat in MapReduce?
✅ Correct Answer: a) To define how input data is split and read
📝 Explanation:
InputFormat provides the input splits and RecordReader.
141. Which compression codec is splittable in Hadoop?
✅ Correct Answer: b) Bzip2
📝 Explanation:
Bzip2 supports splitting for parallel processing.
142. What is the default sort order in Hadoop?
✅ Correct Answer: a) Ascending
📝 Explanation:
Keys are sorted in ascending order by default.
143. In HBase, what is a column family?
✅ Correct Answer: a) A group of related columns
📝 Explanation:
Column families group columns that are stored together.
144. What is Tez in Hadoop?
✅ Correct Answer: a) An execution engine for DAGs
📝 Explanation:
Tez generalizes MapReduce by executing jobs as directed acyclic graphs (DAGs) of tasks, avoiding unnecessary intermediate writes to HDFS.
145. Which is a NoSQL database in Hadoop ecosystem?
✅ Correct Answer: a) HBase
📝 Explanation:
HBase is a distributed, scalable, big data store.
146. What is the command to start the Hadoop DFS daemon?
✅ Correct Answer: a) start-dfs.sh
📝 Explanation:
This script starts the HDFS daemons.
147. What is rack awareness in Hadoop?
✅ Correct Answer: a) Placing replicas in different racks for fault tolerance
📝 Explanation:
Rack awareness improves data availability.
148. Which language is used to write Hive queries?
✅ Correct Answer: a) HiveQL
📝 Explanation:
HiveQL is similar to SQL for querying Hive tables.
149. What is the purpose of the fair scheduler in Hadoop?
✅ Correct Answer: a) To allocate resources fairly among users
📝 Explanation:
Fair Scheduler ensures equitable resource distribution.
150. In Pig, what is a bag?
✅ Correct Answer: a) A collection of tuples
📝 Explanation:
Bags are multi-sets in Pig data model.
151. What is the maximum number of map tasks per job in Hadoop?
✅ Correct Answer: a) No limit
📝 Explanation:
The number of map tasks is determined by input splits.
152. Which is used for machine learning in Hadoop?
✅ Correct Answer: a) Mahout
📝 Explanation:
Mahout provides scalable machine learning algorithms.
153. What is the block report interval in HDFS?
✅ Correct Answer: a) 6 hours
📝 Explanation:
DataNodes send block reports every 6 hours.
154. In MapReduce, what is a counter?
✅ Correct Answer: a) A way to track job progress
📝 Explanation:
Counters collect statistics during job execution.
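A brief sketch of a mapper that uses a custom counter enum (the ValidatingMapper class and Quality enum are hypothetical); the framework aggregates counter values across all tasks and reports them with the job status.

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Tracks good vs. malformed input records with counters that are summed across all map tasks.
public class ValidatingMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    enum Quality { GOOD, MALFORMED }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split(",");
        if (fields.length < 2) {
            context.getCounter(Quality.MALFORMED).increment(1);  // record bad input, skip it
            return;
        }
        context.getCounter(Quality.GOOD).increment(1);
        context.write(new Text(fields[0]), new LongWritable(1));
    }
}
```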
155. What is the default input format in MapReduce?
✅ Correct Answer: a) TextInputFormat
📝 Explanation:
TextInputFormat is the default; it reads input line by line, using the byte offset as the key and the line contents as the value.
156. Which tool is used for log aggregation in Hadoop?
✅ Correct Answer: d) All of the above
📝 Explanation:
All of the listed tools can be used to collect and aggregate logs into HDFS.
157. What is a split in MapReduce?
✅ Correct Answer: a) A chunk of the input file for a mapper
📝 Explanation:
Input splits define the input for each mapper.
158. In HBase, what is the master node called?
✅ Correct Answer: a) HMaster
📝 Explanation:
HMaster manages the HBase cluster.
159. What is the purpose of the --direct option in Sqoop?
✅ Correct Answer: a) To use direct connectors for faster imports
📝 Explanation:
Direct mode uses database-specific tools for efficiency.
160. Which is a graph processing framework in Hadoop?
✅ Correct Answer: d) All of the above
📝 Explanation:
All of the listed frameworks support graph processing on top of Hadoop.
161. What is the heartbeat interval for DataNodes?
✅ Correct Answer: a) 3 seconds
📝 Explanation:
DataNodes send heartbeats every 3 seconds.
162. In Hive, what is dynamic partitioning?
✅ Correct Answer: a) Automatic partition creation based on data
📝 Explanation:
Dynamic partitioning creates partitions on the fly.
163. What is the role of the OutputCommitter in MapReduce?
✅ Correct Answer: a) To commit the output of the job
📝 Explanation:
OutputCommitter handles the commit of task outputs.
164. Which is not a valid replication factor in HDFS?
✅ Correct Answer: d) 0
📝 Explanation:
Replication factor must be at least 1.