1. IBM and ________ have announced a major initiative to use Hadoop to support university courses in distributed computer programming.
Correct Answer: d) Google
Explanation:
Google and IBM announced a university initiative to address Internet-scale computing challenges.
2. Point out the correct statement.
Correct Answer: b) Hadoop stores data in HDFS and supports data compression/decompression
Explanation:
Data compression can be achieved using compression algorithms like bzip2, gzip, LZO, etc. Different algorithms can be used in different scenarios based on their capabilities.
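As a rough illustration of how codecs trade size for speed, the sketch below compresses the same made-up sample with gzip and bzip2 from Python's standard library. It runs locally and does not touch Hadoop's codec classes; LZO is omitted because it is not in the standard library.

```python
import bz2
import gzip

# Repetitive sample data compresses well; real data will vary.
data = b"hadoop stores data in hdfs\n" * 1000

gz = gzip.compress(data)
bz = bz2.compress(data)

print(f"original: {len(data)} bytes")
print(f"gzip:     {len(gz)} bytes")
print(f"bzip2:    {len(bz)} bytes")

# Both codecs are lossless: decompressing returns the original bytes.
assert gzip.decompress(gz) == data
assert bz2.decompress(bz) == data
```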
3. What license is Hadoop distributed under?
Correct Answer: a) Apache License 2.0
Explanation:
Hadoop is open source, released under the Apache License 2.0.
4. Sun also has the Hadoop Live CD ________ project, which allows running a fully functional Hadoop cluster using a live CD.
Correct Answer: b) OpenSolaris
Explanation:
The OpenSolaris Hadoop LiveCD project built a bootable CD-ROM image.
5. Which of the following genres does Hadoop produce?
Correct Answer: a) Distributed file system
Explanation:
The Hadoop Distributed File System (HDFS) is designed to store very large data sets reliably, and to stream those data sets at high bandwidth to the user.
6. What was Hadoop written in?
Correct Answer: c) Java (programming language)
Explanation:
The Hadoop framework itself is mostly written in the Java programming language, with some native code in C and command-line utilities written as shell scripts.
7. Which of the following platforms does Hadoop run on?
Correct Answer: c) Cross-platform
Explanation:
Hadoop is written mostly in Java, so it runs on any operating system with a supported Java runtime.
8. Hadoop achieves reliability by replicating the data across multiple hosts and hence does not require ________ storage on hosts.
Correct Answer: a) RAID
Explanation:
With the default replication value, 3, data is stored on three nodes: two on the same rack, and one on a different rack.
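The placement described above can be sketched as a small simulation. This is a simplified model, not HDFS code: the topology, the node names, and the place_replicas function are hypothetical, and it assumes the commonly documented policy of putting the first replica on the writer's node and the other two on two different nodes of one remote rack.

```python
import random

def place_replicas(topology, writer_node, rng=random):
    """Pick three nodes for one block: two on one rack, one on another.

    topology maps rack name -> list of node names (hypothetical layout).
    """
    node_rack = {n: r for r, nodes in topology.items() for n in nodes}
    local_rack = node_rack[writer_node]
    # First replica: the node that is writing the block.
    first = writer_node
    # Second replica: any node on a different rack.
    remote_rack = rng.choice([r for r in topology if r != local_rack])
    second = rng.choice(topology[remote_rack])
    # Third replica: a different node on the same rack as the second.
    third = rng.choice([n for n in topology[remote_rack] if n != second])
    return [first, second, third]

topology = {
    "rack1": ["node1", "node2"],
    "rack2": ["node3", "node4"],
}
replicas = place_replicas(topology, "node1")
print(replicas)
```

Losing any single node, or even a whole rack, still leaves at least one copy of the block, which is why RAID on individual hosts is not required.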
9. Above the file systems comes the ________ engine, which consists of one Job Tracker, to which client applications submit MapReduce jobs.
Correct Answer: a) MapReduce
Explanation:
The MapReduce engine distributes work around the cluster: client applications submit jobs to the JobTracker, which pushes tasks out to TaskTracker nodes.
10. The Hadoop list includes the HBase database, the Apache Mahout ________ system, and matrix operations.
Correct Answer: a) Machine learning
Explanation:
The Apache Mahout project’s goal is to build a scalable machine learning tool.
11. Which of the following is a characteristic of HDFS?
Correct Answer: d) All of the mentioned
Explanation:
HDFS is designed to run on commodity hardware, provides high throughput access to application data, and is built using Java.
12. Point out the correct statement.
Correct Answer: d) All of the mentioned
Explanation:
HDFS is designed to run on commodity hardware, built using Java, and is open source.
13. Which of the following is a feature of HDFS?
Correct Answer: d) All of the mentioned
Explanation:
HDFS features include splitting files into blocks, replicating blocks, and having a fixed block size.
14. Which of the following is a benefit of HDFS?
Correct Answer: d) All of the mentioned
Explanation:
HDFS is economical, highly scalable, and highly available.
15. Point out the wrong statement.
Correct Answer: d) HDFS is not fault tolerant
Explanation:
HDFS is fault tolerant, so the statement 'HDFS is not fault tolerant' is wrong.
16. Which of these is not a feature of HDFS?
Correct Answer: c) Low Latency Access
Explanation:
HDFS is optimized for high-throughput streaming access to large datasets and accepts higher latency in exchange; it is not designed for low-latency access.
17. Which of these is a characteristic of HDFS NameNode?
Correct Answer: a) Manages the file system namespace
Explanation:
The NameNode manages the file system namespace and regulates access to files by clients.
18. What is the default block size in HDFS?
Correct Answer: b) 64 MB
Explanation:
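The default block size in HDFS is 64 MB in Hadoop 1.x; Hadoop 2.x and later default to 128 MB. The value is configurable per cluster and per file.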
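To see what the block size means in practice, the arithmetic below counts the blocks needed for a hypothetical 1000 MB file with 64 MB blocks, and with 128 MB blocks (the default in later Hadoop releases) for comparison. The helper name block_count is made up for this sketch.

```python
MB = 1024 * 1024

def block_count(file_size, block_size):
    """Number of blocks needed to hold file_size bytes (ceiling division)."""
    return -(-file_size // block_size)

file_size = 1000 * MB  # hypothetical file
for block_size in (64 * MB, 128 * MB):
    n = block_count(file_size, block_size)
    last = file_size - (n - 1) * block_size
    print(f"{block_size // MB} MB blocks: {n} blocks, last block {last // MB} MB")
```

The last block holds only the remaining bytes; it is not padded to the full block size.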
19. Which command is used to copy files from local file system to HDFS?
Correct Answer: b) hadoop fs -copyFromLocal
Explanation:
The command 'hadoop fs -copyFromLocal' is used to copy files from the local file system to HDFS.
20. What is the purpose of the Secondary NameNode in HDFS?
Correct Answer: c) Performs periodic checkpoints of the NameNode's metadata
Explanation:
The Secondary NameNode is responsible for performing periodic checkpoints of the NameNode's metadata to allow for recovery in case of a failure.
21. Which of these is not a Hadoop file format?
Correct Answer: d) AvroFile
Explanation:
Avro is a data serialization system, not a Hadoop file format. TextFile, SequenceFile, and RCFile are Hadoop-specific file formats.
22. Which of these is not a Hadoop file format?
Correct Answer: d) JSONFile
Explanation:
JSONFile is not a Hadoop-specific file format. SequenceFile, RCFile, and ORCFile are Hadoop-specific file formats.
23. Which of these is not a Hadoop file format?
Correct Answer: None of the mentioned
Explanation:
SequenceFile, RCFile, ORCFile, and ParquetFile are all Hadoop-specific file formats.
24. What is Hadoop primarily used for?
Correct Answer: a) Big data processing
Explanation:
Hadoop is an open-source framework for distributed storage and processing of large datasets across clusters of computers.
25. Which core component of Hadoop is responsible for data storage?
Correct Answer: c) HDFS
Explanation:
HDFS (Hadoop Distributed File System) is the primary storage system used by Hadoop.
26. What type of architecture does Hadoop use to process large data sets?
Correct Answer: c) Master-slave
Explanation:
Hadoop uses a master-slave architecture where the master manages the cluster and slaves perform the actual data processing.
27. Hadoop can process data that is:
Correct Answer: d) All of the above
Explanation:
Hadoop is versatile and can handle structured, unstructured, and semi-structured data.
28. Which feature of Hadoop makes it suitable for processing large volumes of data?
Correct Answer: a) Fault tolerance
Explanation:
Hadoop's fault tolerance allows it to continue processing even if some nodes fail.
29. What mechanism does Hadoop use to ensure data is not lost in case of a node failure?
Correct Answer: c) Data replication
Explanation:
Hadoop replicates data across multiple nodes to ensure availability and fault tolerance.
30. Which programming model is primarily used by Hadoop to process large data sets?
Correct Answer: d) MapReduce
Explanation:
MapReduce is the core programming model for processing large datasets in Hadoop.
31. Which command is used to view the contents of a directory in HDFS?
Correct Answer: a) hadoop fs -ls
Explanation:
The 'ls' command lists the contents of directories in HDFS.
32. Which component in Hadoop's architecture is responsible for processing data?
Correct Answer: c) JobTracker
Explanation:
The JobTracker manages the execution of MapReduce jobs.
33. What role does the NameNode play in Hadoop Architecture?
Correct Answer: a) Manages the cluster's storage resources
Explanation:
The NameNode is the master server that manages the file system namespace and metadata.
34. In Hadoop, what is the function of a DataNode?
Correct Answer: a) Stores data blocks
Explanation:
DataNodes store the actual data in blocks and report to the NameNode.
35. Which type of file system does Hadoop use?
Correct Answer: a) Distributed
Explanation:
Hadoop uses the Hadoop Distributed File System (HDFS).
36. How does the Hadoop framework handle hardware failures?
Correct Answer: b) Re-routing tasks
Explanation:
Hadoop re-routes tasks to other nodes in case of failure.
37. What mechanism allows Hadoop to scale processing capacity?
Correct Answer: a) Adding more nodes to the network
Explanation:
Hadoop scales horizontally by adding more nodes.
38. How do you list all nodes in a Hadoop cluster using the command line?
Correct Answer: a) hadoop dfsadmin -report
Explanation:
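The dfsadmin -report command prints cluster status, including the live and dead DataNodes. In current releases it is invoked as hdfs dfsadmin -report; the hadoop dfsadmin form is deprecated.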
39. Which command can you use to check the health of the Hadoop file system?
Correct Answer: b) hadoop fsck
Explanation:
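hadoop fsck checks the health of files in HDFS and reports problems such as missing, corrupt, or under-replicated blocks. In current releases the same tool is invoked as hdfs fsck.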
40. What is the purpose of the hadoop balancer command?
Correct Answer: b) To balance the storage usage across the DataNodes
Explanation:
The balancer evens out the distribution of data blocks across DataNodes.
41. What should you check first if the NameNode is not starting?
Correct Answer: a) Configuration files
Explanation:
Misconfigured files are a common reason for startup failures.
42. When a DataNode is reported as down, what is the first action to take?
Correct Answer: b) Check network connectivity to the DataNode
Explanation:
Network issues are often the cause of a node appearing down.
43. What is a fundamental characteristic of HDFS?
Correct Answer: a) Fault tolerance
Explanation:
HDFS is designed to be fault tolerant through data replication.
44. Which of these is a feature of MapReduce?
Correct Answer: a) Automatic parallelization and distribution
Explanation:
MapReduce automatically parallelizes the execution of the task across a large number of servers in the cluster, distributing the data and the computational logic.
45. Which of these is a key component of MapReduce?
Correct Answer: a) JobTracker and TaskTracker
Explanation:
In the original Hadoop MapReduce implementation, JobTracker and TaskTracker are key components that manage job scheduling and task execution.
46. Which of the following is the primary function of the Map phase in MapReduce?
Correct Answer: c) To map input key-value pairs to intermediate key-value pairs
Explanation:
The Map phase processes input key-value pairs and produces intermediate key-value pairs, which are then shuffled and sorted before the Reduce phase.
47. Which of these is NOT a phase in MapReduce?
Correct Answer: d) Merge
Explanation:
Shuffle and Sort are the intermediate steps between Map and Reduce. Merging does happen inside the framework (for example, when sorted spill files are combined), but Merge is not a distinct phase of MapReduce.
48. Which of the following best describes the purpose of the Reduce phase?
Correct Answer: b) To aggregate the mapped data
Explanation:
The Reduce phase takes the intermediate data produced by the Map phase, groups it by key, and applies a reduce function to aggregate the values for each key.
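The map, shuffle, and reduce steps can be sketched in a few lines of plain Python. This is a single-process word count meant only to show the data flow; the function names and sample lines are illustrative and are not the Hadoop API.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit an intermediate (word, 1) pair for every word in a line."""
    for word in line.split():
        yield word.lower(), 1

def shuffle(pairs):
    """Shuffle and sort: group intermediate values by key, keys in order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reduce_phase(key, values):
    """Reduce: aggregate all values that share a key."""
    return key, sum(values)

lines = ["the quick brown fox", "the lazy dog", "the fox"]
intermediate = [pair for line in lines for pair in map_phase(line)]
result = [reduce_phase(k, v) for k, v in shuffle(intermediate)]
print(result)
```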
49. Which of these classes is used to write the output of a MapReduce job?
Correct Answer: c) FileOutputFormat
Explanation:
FileOutputFormat is used to specify the output location for the MapReduce job. It defines how the output should be written to the file system.
50. Which of these classes is used to read the input for a MapReduce job?
Correct Answer: b) FileInputFormat
Explanation:
FileInputFormat is used to specify the input location for the MapReduce job. It defines how the input should be read from the file system.
51. Which of these is a generic API for MapReduce in Hadoop?
Correct Answer: d) Job
Explanation:
Job is a generic API for MapReduce in Hadoop. It provides a high-level interface to configure and run MapReduce jobs.
52. Which of these methods is used to specify the mapper class in a MapReduce job?
Correct Answer: a) setMapperClass()
Explanation:
setMapperClass() is used to specify the mapper class in a MapReduce job. It defines the class that will perform the map operation.
53. Which of these methods is used to specify the reducer class in a MapReduce job?
Correct Answer: b) setReducerClass()
Explanation:
setReducerClass() is used to specify the reducer class in a MapReduce job. It defines the class that will perform the reduce operation.
54. Which of these methods is used to specify the input format class in a MapReduce job?
Correct Answer: c) setInputFormatClass()
Explanation:
setInputFormatClass() is used to specify the input format class in a MapReduce job. It defines how the input data should be formatted.
55. Which of these methods is used to specify the output format class in a MapReduce job?
Correct Answer: d) setOutputFormatClass()
Explanation:
setOutputFormatClass() is used to specify the output format class in a MapReduce job. It defines how the output data should be formatted.
56. Which of these methods is used to set the number of reduce tasks in a MapReduce job?
Correct Answer: b) setNumReduceTasks()
Explanation:
setNumReduceTasks() is used to set the number of reduce tasks in a MapReduce job. It specifies how many reduce tasks should be executed.
57. Which of these methods is used to set the number of map tasks in a MapReduce job?
Correct Answer: a) setNumMapTasks()
Explanation:
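setNumMapTasks() is used to set the number of map tasks in a MapReduce job. It belongs to JobConf in the older mapred API and is only a hint to the framework; the actual number of map tasks is determined by the number of input splits.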
58. What action should you take if you notice that the HDFS capacity is unexpectedly decreasing?
Correct Answer: a) Check for under-replicated blocks
Explanation:
When blocks are under-replicated, the NameNode schedules new copies to restore the target replication factor, and those copies consume additional capacity.
59. Which operation is NOT a typical function of the Reduce phase in MapReduce?
Correct Answer: d) Filtering records based on a condition
Explanation:
Filtering is typically done in the Map phase; Reduce focuses on aggregation.
60. How does the MapReduce framework typically divide the processing of data?
Correct Answer: c) Data is split into blocks, which are processed in parallel
Explanation:
Input data is divided into input splits (by default, roughly one per HDFS block), and each split is processed by its own mapper in parallel.
61. What is the role of the Combiner function in a MapReduce job?
Correct Answer: b) To reduce the amount of data transferred between the Map and Reduce tasks
Explanation:
Combiners perform local aggregation to minimize network traffic.
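The effect of a combiner can be shown by counting the records that would cross the network with and without one. The sketch below is a local simulation with made-up input, where each string stands for the input of one mapper; it is not Hadoop code.

```python
from collections import Counter

def map_words(line):
    """Emit one (word, 1) pair per word."""
    return [(word, 1) for word in line.split()]

def combine(pairs):
    """Local aggregation on one mapper's output before the shuffle."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return list(totals.items())

mapper_inputs = ["a b a a", "b b a"]  # one string per mapper (made-up data)

without = [p for line in mapper_inputs for p in map_words(line)]
with_combiner = [p for line in mapper_inputs for p in combine(map_words(line))]

print(len(without), "records shuffled without a combiner")
print(len(with_combiner), "records shuffled with a combiner")
```

The final totals per key are the same either way. This works because addition is associative and commutative; a combiner is only safe for reduce functions with that property.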
62. In which scenario would you configure multiple reducers in a MapReduce job?
Correct Answer: d) All of the above
Explanation:
Multiple reducers allow for parallelism, handle large data, and produce partitioned output.
63. What determines the number of mappers to be run in a MapReduce job?
Correct Answer: a) The size of the input data
Explanation:
The number of mappers equals the number of input splits, which depends on the total input size, the number of input files, and the configured split size.
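As a back-of-the-envelope check, the number of map tasks can be estimated from the input files and the split size. The sketch assumes one mapper per split, a split size equal to a 128 MB block, and files that are split independently; it ignores the small slack real input formats allow on a file's last split. The file sizes are hypothetical.

```python
MB = 1024 * 1024

def num_splits(file_sizes, split_size):
    """Estimate map tasks: each file is split on its own, one task per split."""
    return sum(-(-size // split_size) for size in file_sizes)

files = [300 * MB, 50 * MB, 128 * MB]  # hypothetical input files
print(num_splits(files, 128 * MB))
```

Because each file yields at least one split, many small files produce many short-lived mappers.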
64. What happens if a mapper fails during the execution of a MapReduce job?
Correct Answer: b) Only the failed mapper tasks are retried
Explanation:
Hadoop retries only the failed tasks to ensure fault tolerance.
65. Which MapReduce method is called once at the end of the task?
Correct Answer: c) cleanup()
Explanation:
The cleanup() method is called once at the end of each task.
66. How do you specify the number of reduce tasks for a Hadoop job?
Correct Answer: a) Set the mapred.reduce.tasks parameter in the job configuration
Explanation:
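The number of reducers is set via job configuration parameters. mapred.reduce.tasks is the older property name; newer releases use mapreduce.job.reduces, and the same value can be set in code with setNumReduceTasks().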
67. What is the purpose of the Partitioner class in MapReduce?
Correct Answer: d) To control which key-value pairs go to which reducer
Explanation:
The Partitioner determines the reducer for each key.
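A hash partitioner can be sketched as "hash the key, then take it modulo the number of reducers", which is the idea behind Hadoop's default HashPartitioner. In the Python version below, zlib.crc32 is a stand-in for Java's hashCode(), and the partition function and sample keys are illustrative.

```python
import zlib

def partition(key, num_reducers):
    """Return the reducer index for a key (hash partitioning)."""
    # zlib.crc32 stands in for Java's hashCode(); it is stable across runs,
    # unlike Python's built-in hash() for strings.
    return zlib.crc32(key.encode("utf-8")) % num_reducers

keys = ["apple", "banana", "apple", "cherry", "banana"]
assignment = {}
for key in keys:
    assignment.setdefault(partition(key, 3), []).append(key)
print(assignment)
```

Every occurrence of a key lands on the same reducer, which is what lets that reducer see all the values for the key.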
68. What does the WritableComparable interface in Hadoop define?
Correct Answer: a) Data types that can be compared and written in Hadoop
Explanation:
WritableComparable allows objects to be serialized and compared for sorting.
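In Java, WritableComparable combines Writable (write and readFields) with Comparable (compareTo). The Python class below is only an analogy of that contract, using a hypothetical composite key named IntPairKey: it can serialize itself to bytes, read itself back, and be ordered for sorting.

```python
import io
import struct
from functools import total_ordering

@total_ordering
class IntPairKey:
    """Hypothetical composite key: serializable and sortable."""

    def __init__(self, first=0, second=0):
        self.first, self.second = first, second

    def write(self, out):
        # Serialize as two big-endian 32-bit signed integers.
        out.write(struct.pack(">ii", self.first, self.second))

    def read_fields(self, inp):
        # Deserialize the same 8 bytes back into the two fields.
        self.first, self.second = struct.unpack(">ii", inp.read(8))

    def __eq__(self, other):
        return (self.first, self.second) == (other.first, other.second)

    def __lt__(self, other):
        # Order by first field, then by second.
        return (self.first, self.second) < (other.first, other.second)

    def __repr__(self):
        return f"IntPairKey({self.first}, {self.second})"

buf = io.BytesIO()
IntPairKey(2, 7).write(buf)
buf.seek(0)
copy = IntPairKey()
copy.read_fields(buf)
print(copy)
print(sorted([IntPairKey(2, 7), IntPairKey(1, 9), IntPairKey(2, 3)]))
```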
69. What common issue should be checked first when a MapReduce job is running slower than expected?
Correct Answer: b) Inadequate memory allocation
Explanation:
Insufficient memory can cause spilling to disk and slow performance.
70. What is an effective way to resolve data skew during the reduce phase of a MapReduce job?
Correct Answer: a) Adjusting the number of reducers
Explanation:
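Increasing the number of reducers can spread skewed data more evenly when the skew comes from many keys sharing a partition. A single very frequent key still goes to one reducer, so severe skew may also need a custom partitioner or key salting.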
71. What is the primary function of the Resource Manager in YARN?
Correct Answer: a) Managing cluster resources
Explanation:
The ResourceManager is the master daemon that arbitrates resources among applications.
72. How does YARN improve the scalability of Hadoop?
Correct Answer: a) By separating job management and resource management
Explanation:
YARN decouples resource management from job scheduling/monitoring.
73. What role does the NodeManager play in a YARN cluster?
Correct Answer: c) It manages the resources on a single node
Explanation:
NodeManager is per-machine and responsible for containers and monitoring.
74. Which YARN component is responsible for monitoring the health of the cluster nodes?
Correct Answer: b) NodeManager
Explanation:
NodeManagers monitor their node's resource usage and report to the ResourceManager.
75. In YARN, what does the ApplicationMaster do?
Correct Answer: a) Manages the lifecycle of an application
Explanation:
ApplicationMaster negotiates resources and executes tasks for an application.
76. How does YARN handle the failure of an ApplicationMaster?
Correct Answer: b) It automatically restarts the ApplicationMaster
Explanation:
YARN restarts the ApplicationMaster on failure to ensure fault tolerance.
77. Which command is used to list all running applications in YARN?
Correct Answer: a) yarn application -list
Explanation:
This command lists all running YARN applications.
78. How can you kill an application in YARN using the command line?
Correct Answer: a) yarn application -kill
Explanation:
This command terminates a running application by ID.
79. What command would you use to check the logs for a specific YARN application?
Correct Answer: a) yarn logs -applicationId
Explanation:
This command aggregates and prints logs for the specified application.
80. What should be your first step if a YARN application fails to start?
Correct Answer: a) Check the application logs for errors
Explanation:
Logs provide the most direct insight into startup failures.
81. If you notice that applications in YARN are frequently being killed due to insufficient memory, what should you adjust?
Correct Answer: a) Increase the container memory settings in YARN
Explanation:
Adjusting container sizes allows more memory per application.
82. What is Hive primarily used for in the Hadoop ecosystem?
Correct Answer: a) Data warehousing operations
Explanation:
Hive enables SQL-like querying on large datasets in HDFS.
83. Which tool in the Hadoop ecosystem is best suited for real-time data processing?
Correct Answer: d) Storm
Explanation:
Storm is designed for real-time stream processing.
84. How does Pig differ from SQL in terms of data processing?
Correct Answer: a) Pig processes data in a procedural manner, while SQL is declarative
Explanation:
Pig uses a procedural language for data flow, unlike declarative SQL.
85. What is the primary function of Apache Flume?
Correct Answer: b) Data ingestion into Hadoop
Explanation:
Flume collects, aggregates, and moves log data into HDFS.
86. In the Hadoop ecosystem, what is the role of Oozie?
Correct Answer: a) Job scheduling
Explanation:
Oozie is a workflow scheduler for Hadoop jobs.
87. How does HBase provide fast access to large datasets?
Correct Answer: a) By using a column-oriented storage format
Explanation:
HBase is a column-family NoSQL database built on HDFS.
88. Which command in HBase is used to scan all records from a specific table?
Correct Answer: a) scan 'table_name'
Explanation:
The scan command retrieves all or a range of records from a table.
89. How do you create a new table in Hive?
Correct Answer: a) CREATE TABLE table_name (columns)
Explanation:
This is the standard HiveQL command for creating tables.
90. What is the primary command to view the status of a job in Oozie?
Correct Answer: a) oozie job -info job_id
Explanation:
This command displays detailed information about a job's status.
91. What functionality does the sqoop merge command provide?
Correct Answer: d) Merging updates from an RDBMS into an existing Hadoop dataset
Explanation:
Sqoop merge combines incremental imports with existing data.
92. What should you verify first if a Sqoop import fails?
Correct Answer: a) The database connection settings
Explanation:
Connection issues are the most common cause of import failures.
93. If a Hive query runs significantly slower than expected, what should be checked first?
Correct Answer: a) The structure of the tables and indexes
Explanation:
Poor table design can lead to inefficient queries.
94. What is Hive mainly used for in the Hadoop ecosystem?
Correct Answer: a) Data warehousing
Explanation:
Hive provides data summarization and ad-hoc querying.
95. How does Hive handle data storage?
Correct Answer: b) It utilizes HDFS
Explanation:
Hive stores data in HDFS using directories and files.
96. What type of data models does Hive support?
Correct Answer: d) Structured, unstructured, and semi-structured data
Explanation:
Hive supports various data formats including ORC, Parquet, etc.
97. Which Hive component is responsible for converting SQL queries into MapReduce jobs?
Correct Answer: c) Hive Driver
Explanation:
The Driver receives queries and creates execution plans.
98. How does partitioning in Hive improve query performance?
Correct Answer: a) By decreasing the size of data scans
Explanation:
Partitioning allows Hive to skip irrelevant partitions during queries.
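Partition pruning can be modelled as skipping whole directories whose partition value does not match the query filter. The sketch below is a toy in-memory model with made-up data and a hypothetical partition column dt; it is not how Hive is implemented, only an illustration of why fewer rows get scanned.

```python
# Hypothetical table partitioned by date: one "directory" per partition value.
partitions = {
    "dt=2024-01-01": [("alice", 10), ("bob", 5)],
    "dt=2024-01-02": [("alice", 7)],
    "dt=2024-01-03": [("carol", 3), ("bob", 8)],
}

def query(partitions, dt):
    """Sum the amounts for one date, scanning only the matching partition."""
    rows_scanned = 0
    total = 0
    for name, rows in partitions.items():
        if name != f"dt={dt}":
            continue  # pruned: this partition's rows are never read
        for _user, amount in rows:
            rows_scanned += 1
            total += amount
    return total, rows_scanned

total, scanned = query(partitions, "2024-01-03")
all_rows = sum(len(rows) for rows in partitions.values())
print(f"total={total}, scanned {scanned} of {all_rows} rows")
```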
99. What is the correct HiveQL command to list all tables in the database?
Correct Answer: a) SHOW TABLES
Explanation:
SHOW TABLES lists all tables in the current database.
100. How do you add a new column to an existing Hive table?
Correct Answer: a) ALTER TABLE table_name ADD COLUMNS (new_column type)
Explanation:
This command adds columns to a table without affecting existing data.
101. In Hive, which command would you use to change the data type of a column in a table?
Correct Answer: a) ALTER TABLE table_name CHANGE COLUMN old_column new_column new_type
Explanation:
CHANGE COLUMN allows modifying column names and types.
102. How can you optimize a Hive query to limit the number of MapReduce jobs it generates?
Correct Answer: a) Use multi-table inserts whenever possible
Explanation:
Multi-table inserts reduce the number of jobs by combining operations.
103. What is a common fix if a Hive query returns incorrect results?
Correct Answer: c) Check and correct the query logic
Explanation:
Incorrect results are usually due to errors in the query itself.
104. What should you check if a Hive job is running longer than expected without errors?
Correct Answer: b) The configuration parameters for resource allocation
Explanation:
Resource allocation affects job execution time.
105. What is Pig primarily used for in the Hadoop ecosystem?
Correct Answer: a) Data transformations
Explanation:
Pig is a high-level platform for creating programs that run on Hadoop.
106. What makes Pig different from traditional SQL in processing data?
Correct Answer: a) Pig processes data iteratively and allows multiple outputs from a single query.
Explanation:
Pig's procedural nature allows for complex data flows.
107. In Pig, what is the difference between 'STORE' and 'DUMP'?
Correct Answer: a) 'STORE' writes the output to the filesystem, while 'DUMP' displays the output on the screen.
Explanation:
DUMP is for viewing results locally, STORE for persisting to HDFS.
108. How does Pig handle schema-less data?
Correct Answer: a) By inferring the schema at runtime.
Explanation:
Pig is schema-flexible and infers types during execution.
109. How can Pig scripts be optimized to handle large datasets more efficiently?
Correct Answer: c) By minimizing data read operations.
Explanation:
Reducing I/O operations improves performance on large data.
110. What Pig command is used to load data from a file?
Correct Answer: a) LOAD 'data.txt' AS (line);
Explanation:
LOAD reads data into a relation with an optional schema.
111. How do you group data by a specific column in Pig?
Correct Answer: a) GROUP data BY column;
Explanation:
GROUP creates groups of data based on a key.
112. What Pig function aggregates data to find the total?
Correct Answer: a) SUM(data.column);
Explanation:
SUM computes the sum of values in a group.
113. How do you filter rows in Pig that match a specific condition?
Correct Answer: a) FILTER data BY condition;
Explanation:
FILTER removes tuples that do not match the condition.
114. What is the first thing you should check if a Pig script fails due to an out-of-memory error?
Correct Answer: a) The data sizes being processed.
Explanation:
Large data can exceed memory limits.
115. If a Pig script is unexpectedly slow, what should be checked first to improve performance?
Correct Answer: b) The amount of data being processed.
Explanation:
Large datasets naturally take longer; optimize accordingly.
116. What is the primary storage model used by HBase?
Correct Answer: b) Column-oriented
Explanation:
HBase uses a column-family based storage model.
117. How does HBase handle scalability?
Correct Answer: a) Through horizontal scaling by adding more nodes
Explanation:
HBase scales by distributing regions across RegionServers.
118. Which of the following is true about Hadoop's design?
Correct Answer: b) It assumes that hardware failures are the norm
Explanation:
Hadoop is built to handle frequent hardware failures gracefully.
119. What is the default replication factor in HDFS?
Correct Answer: c) 3
Explanation:
The default replication factor is 3 for fault tolerance.
120. In MapReduce, what is the purpose of the shuffle phase?
Correct Answer: b) To sort and group intermediate data by key
Explanation:
Shuffle transfers and sorts mapper outputs for reducers.
121. Which Hadoop ecosystem tool is used for data serialization?
Correct Answer: a) Avro
Explanation:
Avro is a data serialization system for Hadoop.
122. What is YARN?
Correct Answer: a) Yet Another Resource Negotiator
Explanation:
YARN stands for Yet Another Resource Negotiator.
123. In Hive, what is a SerDe?
Correct Answer: a) Serializer/Deserializer
Explanation:
SerDe handles serialization and deserialization in Hive.
124. What is the main goal of Hadoop's data locality?
Correct Answer: a) To minimize network traffic
Explanation:
Data locality moves computation to data to avoid network I/O.
125. Which file format in Hadoop is optimized for OLAP workloads?
Correct Answer: c) Parquet
Explanation:
Parquet is columnar storage optimized for analytical queries.
126. What is the role of Zookeeper in Hadoop?
Correct Answer: a) Distributed coordination service
Explanation:
Zookeeper provides coordination for distributed applications.
127. In Pig, what does the FOREACH operator do?
Correct Answer: b) Applies expressions to each tuple
Explanation:
FOREACH generates a new relation by applying expressions.
128. What is a RegionServer in HBase?
Correct Answer: a) Manages regions of tables
Explanation:
RegionServers serve data for read and write requests.
129. Which Sqoop option is used for incremental imports?
Correct Answer: a) --incremental
Explanation:
This option allows importing only new or updated rows.
130. What is the default port for the NameNode web UI?
Correct Answer: a) 50070
Explanation:
Port 50070 is the default for the NameNode's web interface in Hadoop 1.x and 2.x; Hadoop 3.x changed the default to 9870.
131. In MapReduce, what is speculation?
Correct Answer: a) Running duplicate tasks to mitigate slow tasks
Explanation:
Speculative execution launches duplicates of slow tasks.
132. What is the purpose of the /tmp directory in HDFS?
Correct Answer: a) Temporary storage for intermediate files
Explanation:
/tmp is used for temporary data during job execution.
133. Which tool is used for monitoring Hadoop clusters?
Correct Answer: a) Ambari
Explanation:
Ambari provides a web-based UI for provisioning and monitoring.
134. What is a Bloom filter in HBase?
Correct Answer: a) A probabilistic data structure for membership testing
Explanation:
Bloom filters reduce disk seeks for non-existent keys.
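A minimal Bloom filter makes the "definitely absent or possibly present" behaviour concrete. The class below is a generic sketch, not HBase's implementation: the bit-array size, the number of hashes, and the use of salted SHA-256 are arbitrary choices for illustration.

```python
import hashlib

class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity

    def _positions(self, key):
        # Derive several bit positions by salting one hash function.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos] for pos in self._positions(key))

bf = BloomFilter()
for row_key in ("row-001", "row-002", "row-003"):
    bf.add(row_key)

print(bf.might_contain("row-002"))
```

A key that was added always tests as possibly present. A key that was never added usually tests as absent, which is the case where a store can skip reading a file; occasional false positives only cost an unnecessary read.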
135. In Hive, what is bucketing?
Correct Answer: a) Dividing data into buckets based on a hash of a column
Explanation:
Bucketing improves join performance by distributing data evenly.
136. What is the maximum number of characters in a Hadoop block name?
Correct Answer: a) 128
Explanation:
Block IDs are 128-bit numbers.
137. Which is not a valid Hadoop daemon?
Correct Answer: d) QueryNode
Explanation:
QueryNode is not a Hadoop daemon.
138. What does DFS stand for in HDFS?
Correct Answer: a) Distributed File System
Explanation:
HDFS is Hadoop's Distributed File System.
139. In YARN, what is a container?
Correct Answer: a) A unit of resource allocation
Explanation:
Containers encapsulate resources like CPU and memory for tasks.
140. What is the purpose of the InputFormat in MapReduce?
Correct Answer: a) To define how input data is split and read
Explanation:
InputFormat provides the input splits and RecordReader.
141. Which compression codec is splittable in Hadoop?
Correct Answer: b) Bzip2
Explanation:
Bzip2 supports splitting for parallel processing.
142. What is the default sort order in Hadoop?
Correct Answer: a) Ascending
Explanation:
Keys are sorted in ascending order by default.
143. In HBase, what is a column family?
Correct Answer: a) A group of related columns
Explanation:
Column families group columns that are stored together.
144. What is Tez in Hadoop?
Correct Answer: a) An execution engine for DAGs
Explanation:
Tez generalizes the MapReduce model: it runs a job as a directed acyclic graph (DAG) of tasks instead of a chain of separate MapReduce jobs.
145. Which is a NoSQL database in Hadoop ecosystem?
Correct Answer: a) HBase
Explanation:
HBase is a distributed, scalable, big data store.
146. What is the command to start the Hadoop DFS daemon?
Correct Answer: a) start-dfs.sh
Explanation:
This script starts the HDFS daemons.
147. What is rack awareness in Hadoop?
Correct Answer: a) Placing replicas in different racks for fault tolerance
Explanation:
Rack awareness improves data availability.
148. Which language is used to write Hive queries?
Correct Answer: a) HiveQL
Explanation:
HiveQL is similar to SQL for querying Hive tables.
149. What is the purpose of the fair scheduler in Hadoop?
Correct Answer: a) To allocate resources fairly among users
Explanation:
Fair Scheduler ensures equitable resource distribution.
150. In Pig, what is a bag?
Correct Answer: a) A collection of tuples
Explanation:
In the Pig data model, a bag is an unordered collection of tuples that may contain duplicates (a multiset).
151. What is the maximum number of map tasks per job in Hadoop?
Correct Answer: a) No limit
Explanation:
The number of map tasks is determined by input splits.
152. Which is used for machine learning in Hadoop?
Correct Answer: a) Mahout
Explanation:
Mahout provides scalable machine learning algorithms.
153. What is the block report interval in HDFS?
Correct Answer: a) 6 hours
Explanation:
DataNodes send block reports every 6 hours.
154. In MapReduce, what is a counter?
Correct Answer: a) A way to track job progress
Explanation:
Counters collect statistics during job execution.
155. What is the default input format in MapReduce?
Correct Answer: a) TextInputFormat
Explanation:
TextInputFormat treats each line as one record: the key is the byte offset of the line within the file, and the value is the contents of the line.
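With TextInputFormat, each line becomes one record whose key is the line's byte offset in the file and whose value is the line's text. The generator below mimics that behaviour on an in-memory byte string; text_records is a made-up name and the code does not use Hadoop.

```python
def text_records(data: bytes):
    """Yield (byte_offset, line) pairs, one per line of the input."""
    offset = 0
    for raw in data.splitlines(keepends=True):
        # The value excludes the line terminator; the offset counts it.
        yield offset, raw.rstrip(b"\r\n").decode("utf-8")
        offset += len(raw)

sample = b"first line\nsecond\nthird line\n"
for key, value in text_records(sample):
    print(key, value)
```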
156. Which tool is used for log aggregation in Hadoop?
Correct Answer: d) All of the above
Explanation:
These tools can aggregate logs into HDFS.
157. What is a split in MapReduce?
Correct Answer: a) A chunk of the input file for a mapper
Explanation:
Input splits define the input for each mapper.
158. In HBase, what is the master node called?
Correct Answer: a) HMaster
Explanation:
HMaster manages the HBase cluster.
159. What is the purpose of the --direct option in Sqoop?
Correct Answer: a) To use direct connectors for faster imports
Explanation:
Direct mode uses database-specific tools for efficiency.
160. Which is a graph processing framework in Hadoop?
Correct Answer: d) All of the above
Explanation:
These support graph processing.
161. What is the heartbeat interval for DataNodes?
Correct Answer: a) 3 seconds
Explanation:
DataNodes send heartbeats every 3 seconds.
162. In Hive, what is dynamic partitioning?
Correct Answer: a) Automatic partition creation based on data
Explanation:
Dynamic partitioning creates partitions on the fly.
163. What is the role of the OutputCommitter in MapReduce?
Correct Answer: a) To commit the output of the job
Explanation:
OutputCommitter handles the commit of task outputs.
164. Which is not a valid replication factor in HDFS?
Correct Answer: d) 0
Explanation:
Replication factor must be at least 1.


