127 industry-level multiple-choice questions on data cleaning, handling missing values, outliers, encoding, scaling, and preprocessing pipelines, modeled after real data scientist and analyst interviews at FAANG, fintech, and consulting firms.
1. Which of the following is not a key aspect of data quality in preprocessing?
Correct Answer: d) Database size
Explanation:
Data quality focuses on accuracy, completeness, and consistency, not the size of the database.
2. What is a common method to handle missing class labels in a dataset?
Correct Answer: a) Ignoring the tuple
Explanation:
Ignoring tuples with missing class labels is a simple strategy when the impact is minimal.
3. In data cleaning, what technique uses statistical methods to fill missing values?
Correct Answer: b) Decision tree induction
Explanation:
Decision trees can predict probable values for missing data based on other attributes.
4. What is the primary purpose of data preprocessing in data analysis?
Correct Answer: b) To prepare data for effective analysis
Explanation:
Preprocessing transforms raw data into a format suitable for analysis and modeling.
5. Which technique is used to standardize feature ranges in preprocessing?
Correct Answer: b) Normalization
Explanation:
Normalization rescales features to a common range, such as 0 to 1, so that features with large numeric ranges do not dominate the model.
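A minimal sketch of min-max normalization to the 0 to 1 range mentioned above. The function name `min_max_scale` and the sample ages are illustrative assumptions, not part of the quiz.

```python
def min_max_scale(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so the smallest maps to new_min and the largest to new_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant feature has no range to rescale; map everything to new_min.
        return [new_min for _ in values]
    span = hi - lo
    return [new_min + (v - lo) * (new_max - new_min) / span for v in values]


ages = [18, 30, 42, 66]  # illustrative data
scaled_ages = min_max_scale(ages)
print(scaled_ages)  # [0.0, 0.25, 0.5, 1.0]
```

In practice the minimum and maximum would be learned on the training split only and reused on new data.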
6. What does handling missing values prevent in data analysis?
Correct Answer: b) Biased model training
Explanation:
Unaddressed missing values can skew results and lead to inaccurate predictions.
7. Which method imputes missing values using the average of similar instances?
Correct Answer: b) K-NN imputation
Explanation:
K-Nearest Neighbors finds similar data points to estimate missing values.
8. In data cleaning, what is 'binning' primarily used for?
Correct Answer: a) Smoothing noisy data
Explanation:
Binning groups data into bins and replaces values with bin averages to reduce noise.
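Smoothing by bin means, as described above, can be sketched as follows. This assumes equal-frequency bins of a fixed size; the helper name `smooth_by_bin_means` and the price list are hypothetical.

```python
def smooth_by_bin_means(values, bin_size):
    """Sort the values, split them into bins of bin_size, and replace each value by its bin mean."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_values = ordered[start:start + bin_size]
        mean = sum(bin_values) / len(bin_values)
        smoothed.extend([mean] * len(bin_values))
    return smoothed


prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]  # illustrative data
smoothed_prices = smooth_by_bin_means(prices, bin_size=3)
print(smoothed_prices)  # three bins with means 9.0, 22.0 and 29.0
```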
9. What is the risk of not removing duplicates in preprocessing?
Correct Answer: d) All of the above
Explanation:
Duplicates can bias models, inflate variance estimates, and slow processing.
10. Which encoding converts categorical data into binary vectors?
Correct Answer: b) One-hot encoding
Explanation:
One-hot encoding creates dummy variables for each category without ordinal assumptions.
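A small standard-library sketch of one-hot encoding. The function name and the color values are assumptions made for illustration.

```python
def one_hot_encode(values):
    """Return the sorted category list and one binary row per input value."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


colors = ["red", "green", "blue", "green"]  # illustrative data
categories, encoded = one_hot_encode(colors)
print(categories)  # ['blue', 'green', 'red']
print(encoded)     # [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 0]]
```

Each row contains exactly one 1, so no ordering between categories is implied.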
11. What is 'data wrangling' in the context of preprocessing?
Correct Answer: b) Transforming messy data into clean format
Explanation:
Data wrangling involves cleaning and restructuring data for analysis.
12. Which technique detects outliers using statistical thresholds?
Correct Answer: a) Z-score method
Explanation:
The Z-score method commonly flags points more than 3 standard deviations from the mean as outliers.
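The Z-score rule can be sketched with the standard library. The function name, the readings, and the use of the population standard deviation are assumptions of this example. The sample has 20 points because in very small samples a single extreme value inflates the standard deviation enough to hide itself.

```python
import statistics


def zscore_outliers(values, threshold=3.0):
    """Return the values whose absolute Z-score exceeds the threshold."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation (an assumption)
    if std == 0:
        return []
    return [v for v in values if abs(v - mean) / std > threshold]


# Illustrative data: 19 readings near 10 and one extreme value.
readings = [10, 11, 9, 10, 12, 10, 11, 9, 10, 11,
            10, 9, 11, 10, 12, 10, 9, 11, 10, 60]
outliers = zscore_outliers(readings)
print(outliers)  # [60]
```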
13. What is the main goal of feature scaling in preprocessing?
Correct Answer: b) To make features comparable
Explanation:
Scaling ensures no feature dominates due to differing units or ranges.
14. In handling categorical data, when is label encoding appropriate?
Correct Answer: b) For ordinal data
Explanation:
Label encoding assigns an integer to each category, which implies an order, so it suits ordinal categories.
15. What does 'data integration' involve in preprocessing?
Correct Answer: a) Combining multiple data sources
Explanation:
Data integration merges datasets from different sources into a unified view.
16. Which imputation method is best for non-numerical data?
Correct Answer: b) Mode imputation
Explanation:
Mode imputation uses the most frequent value for categorical missing data.
17. What is a potential issue with mean imputation?
Correct Answer: a) Reduces variance
Explanation:
Mean imputation can underestimate variability in the dataset.
18. Which method groups data into buckets for smoothing?
Correct Answer: b) Binning
Explanation:
Binning sorts data into intervals and smooths by boundary or mean values.
19. What is 'data reduction' in preprocessing?
Correct Answer: b) Reducing data volume while preserving information
Explanation:
Data reduction techniques like PCA minimize data size without losing key insights.
20. When should you use median imputation over mean?
Correct Answer: b) For skewed distributions
Explanation:
Median is robust to outliers and skewness, unlike the mean.
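To illustrate why the median is preferred for skewed data, here is a hedged sketch that fills one missing income both ways. The `impute` helper, the use of `None` for missing values, and the numbers are hypothetical.

```python
import statistics


def impute(values, strategy="median"):
    """Replace None entries with the median or the mean of the observed values."""
    observed = [v for v in values if v is not None]
    if strategy == "median":
        fill = statistics.median(observed)
    else:
        fill = statistics.fmean(observed)
    return [fill if v is None else v for v in values]


incomes = [30, 32, 35, None, 38, 40, 500]  # illustrative, right-skewed by one large value
median_filled = impute(incomes, "median")
mean_filled = impute(incomes, "mean")
print(median_filled[3])  # 36.5
print(mean_filled[3])    # 112.5
```

The single value of 500 pulls the mean far above the typical income, while the median stays close to the bulk of the data.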
21. What does PCA stand for in dimensionality reduction?
Correct Answer: a) Principal Component Analysis
Explanation:
PCA transforms data into principal components to reduce dimensions.
22. Which step checks for data consistency across sources?
Correct Answer: b) Data integration
Explanation:
Integration resolves inconsistencies when merging multiple data sources.
23. What is a common way to handle outliers in preprocessing?
Correct Answer: b) Cap or floor values
Explanation:
Capping (winsorizing) limits extreme values to chosen thresholds, such as the 1st and 99th percentiles or the IQR fences.
24. In Python, which function detects missing values in Pandas?
Correct Answer: a) isnull()
Explanation:
Pandas' isnull() returns a boolean mask for missing values.
25. What is 'forward fill' in handling missing data?
Correct Answer: a) Filling with previous value
Explanation:
Forward fill propagates the last valid observation forward.
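A short pandas sketch of forward fill, assuming pandas and NumPy are installed. The series values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Illustrative sensor readings with gaps.
readings = pd.Series([20.5, np.nan, np.nan, 21.0, np.nan])

missing_count = int(readings.isna().sum())  # number of missing values
filled = readings.ffill()                   # carry the last valid observation forward

print(missing_count)    # 3
print(filled.tolist())  # [20.5, 20.5, 20.5, 21.0, 21.0]
```

A leading gap would stay missing, because there is no earlier observation to propagate.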
26. Which normalization brings data to zero mean and unit variance?
Correct Answer: b) Z-score normalization
Explanation:
Z-score uses mean and standard deviation for standardization.
27. What issue arises from inconsistent data formats?
Correct Answer: b) Parsing errors
Explanation:
Inconsistent formats like dates can cause loading or computation failures.
28. Which technique merges datasets on common keys?
Correct Answer: b) Joining
Explanation:
Joining combines tables based on matching keys like IDs.
29. What is 'data discretization' used for?
Correct Answer: a) Continuous to categorical conversion
Explanation:
Discretization bins continuous values into discrete intervals.
30. In outlier detection, what does IQR stand for?
Correct Answer: a) Interquartile Range
Explanation:
The IQR method flags values below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR, where IQR = Q3 − Q1.
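The IQR rule can be sketched with the standard library. The function name and data are assumptions, and the quartiles use the "inclusive" method of `statistics.quantiles`; other quartile conventions give slightly different fences.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return values below Q1 - k*IQR or above Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


data = [12, 13, 14, 15, 15, 16, 17, 18, 19, 45]  # illustrative data
print(iqr_outliers(data))  # [45]
```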
31. What is a disadvantage of deleting rows with missing values?
Correct Answer: a) Data loss
Explanation:
Deletion reduces the sample size and can bias results when the values are not missing completely at random.
32. Which encoding preserves category frequencies?
Correct Answer: b) Frequency encoding
Explanation:
Frequency encoding replaces categories with their occurrence counts.
33. What does 'data transformation' include?
Correct Answer: a) Normalization and aggregation
Explanation:
Transformation alters data structure, like normalizing or aggregating.
34. How do you handle multicollinearity in preprocessing?
Correct Answer: a) Remove correlated features
Explanation:
Removing highly correlated features reduces redundancy and instability.
35. What is 'noise' in data cleaning?
Correct Answer: a) Random errors or variances
Explanation:
Noise refers to irrelevant or incorrect data points distorting patterns.
36. Which library in Python is used for data manipulation?
Correct Answer: b) Pandas
Explanation:
Pandas provides DataFrames for efficient data cleaning and transformation.
37. What is 'backward fill' for missing data?
Correct Answer: a) Filling with next value
Explanation:
Backward fill uses the next valid observation to fill gaps.
38. Which method is robust to outliers in scaling?
Correct Answer: b) Robust scaling
Explanation:
Robust scaling centers on the median and divides by the IQR, both of which are barely affected by extreme values.
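A minimal sketch of robust scaling, (x − median) / IQR, using only the standard library. The helper name, the data, and the "inclusive" quartile method are assumptions of this example.

```python
import statistics


def robust_scale(values):
    """Center on the median and divide by the interquartile range."""
    median = statistics.median(values)
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    if iqr == 0:
        # No spread in the middle half of the data; return zeros instead of dividing by zero.
        return [0.0 for _ in values]
    return [(v - median) / iqr for v in values]


values = [1, 2, 3, 4, 5, 6, 7, 8, 1000]  # illustrative data with one extreme value
scaled = robust_scale(values)
print(scaled[:5])  # [-1.0, -0.75, -0.5, -0.25, 0.0]
```

The extreme value is still large after scaling, but it does not distort the scale applied to the other points.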
39. What causes data inconsistency?
Correct Answer: a) Different naming conventions
Explanation:
Synonyms or varying abbreviations across sources lead to inconsistencies.
40. In merging datasets, what is an inner join?
Correct Answer: a) Only matching records
Explanation:
Inner join returns rows with matching keys in both datasets.
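A pandas sketch of an inner join, with a left join shown for contrast. It assumes pandas is installed; the table and column names are hypothetical.

```python
import pandas as pd

# Illustrative tables sharing the key column "customer_id".
customers = pd.DataFrame({"customer_id": [1, 2, 3], "name": ["Ana", "Ben", "Cy"]})
orders = pd.DataFrame({"customer_id": [2, 3, 4], "amount": [50, 75, 20]})

# Inner join: only keys present in both tables survive.
inner = customers.merge(orders, on="customer_id", how="inner")

# Left join: every customer is kept; unmatched rows get NaN in "amount".
left = customers.merge(orders, on="customer_id", how="left")

print(inner["customer_id"].tolist())  # [2, 3]
print(len(left))                      # 3
```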
41. What is entropy-based discretization?
Correct Answer: a) Supervised binning using class info
Explanation:
It uses information gain to create bins that maximize class separation.
42. How is outlier impact assessed?
Correct Answer: d) All of the above
Explanation:
Visual tools like box plots and scatters help identify outliers.
43. When is listwise deletion used?
Correct Answer: b) When data loss is acceptable
Explanation:
Listwise deletion removes every row that has any missing value, which is acceptable only when the remaining sample is large enough.
44. What does binary encoding do to categories?
Correct Answer: b) Converts to binary bits
Explanation:
Binary encoding writes each category's integer code in binary, so n categories need only about log2(n) columns instead of the n columns that one-hot encoding requires.
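A hedged sketch of binary encoding using only the standard library. The function name and city list are illustrative, and numbering categories from 0 in sorted order is an assumption; library implementations may assign codes differently.

```python
def binary_encode(values):
    """Map each category to the bits of its integer index (indices start at 0)."""
    categories = sorted(set(values))
    # Number of bits needed to represent the largest index.
    width = max(1, (len(categories) - 1).bit_length())
    index = {c: i for i, c in enumerate(categories)}
    return [[int(bit) for bit in format(index[v], f"0{width}b")] for v in values]


cities = ["Austin", "Boston", "Chicago", "Denver", "El Paso"]  # illustrative data
codes = binary_encode(cities)
print(len(codes[0]))  # 3 columns, versus 5 for one-hot encoding
```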
45. What is aggregation in transformation?
Correct Answer: a) Summarizing data
Explanation:
Aggregation computes summaries like means or counts from groups.
46. How to detect multicollinearity?
Correct Answer: c) Both a and b
Explanation:
High pairwise correlations or a VIF above about 5 (some practitioners use 10) indicate multicollinearity.
47. What smoothing method uses regression?
Correct Answer: b) Regression smoothing
Explanation:
Regression smoothing fits a function to the data and replaces noisy values with the fitted values.
48. Which Pandas method removes duplicates?
Correct Answer: a) drop_duplicates()
Explanation:
drop_duplicates() eliminates repeated rows based on specified columns.
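A short pandas sketch of `drop_duplicates()`, both on whole rows and on a subset of columns. It assumes pandas is installed; the column names and values are made up.

```python
import pandas as pd

# Illustrative data: row 2 repeats row 0 exactly; row 3 repeats only the email.
df = pd.DataFrame({
    "email": ["a@x.com", "b@x.com", "a@x.com", "a@x.com"],
    "plan": ["free", "pro", "free", "pro"],
})

exact = df.drop_duplicates()                                   # compares all columns
by_email = df.drop_duplicates(subset=["email"], keep="first")  # compares "email" only

print(len(exact))     # 3
print(len(by_email))  # 2
```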
49. What is interpolation for time series missing data?
Correct Answer: a) Linear estimation between points
Explanation:
Interpolation estimates values using surrounding data points.
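Linear interpolation between known points can be sketched as follows. The sketch assumes evenly spaced observations and uses `None` for missing values; the function name and series are hypothetical.

```python
def interpolate_linear(values):
    """Fill interior None gaps on a straight line between the nearest known neighbours."""
    filled = list(values)
    known = [i for i, v in enumerate(filled) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (filled[right] - filled[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = filled[left] + step * (i - left)
    return filled  # leading or trailing gaps are left as None


series = [10.0, None, None, 16.0, None, 20.0]  # illustrative data
print(interpolate_linear(series))  # [10.0, 12.0, 14.0, 16.0, 18.0, 20.0]
```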
50. What is L1 normalization?
Correct Answer: a) Sum to 1
Explanation:
L1 normalization divides each vector by the sum of its absolute values, so the absolute values sum to 1.
51. What is entity resolution in cleaning?
Correct Answer: a) Merging similar records
Explanation:
It identifies and merges duplicates across datasets.
52. What join includes all left records?
Correct Answer: b) Left join
Explanation:
A left join keeps every row from the left table and adds matching rows from the right table; unmatched rows get nulls.
53. What is chi-merge discretization?
Correct Answer: a) Supervised using chi-square
Explanation:
ChiMerge uses chi-square tests to determine bin boundaries.
54. What threshold for Z-score outlier?
Correct Answer: b) 3
Explanation:
Values beyond 3 standard deviations are typically outliers.
55. When to use pairwise deletion?
Correct Answer: a) For correlations
Explanation:
Pairwise uses available pairs, maximizing data for specific analyses.
56. What is target encoding?
Correct Answer: a) Replace with mean target
Explanation:
Target encoding uses the mean of the target for each category.
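A minimal sketch of target encoding with the standard library. The function name, cities, and churn labels are illustrative assumptions. In a real pipeline the category means would be computed on the training split only, to avoid leaking the target.

```python
from collections import defaultdict


def target_encode(categories, targets):
    """Replace each category with the mean target value observed for that category."""
    sums, counts = defaultdict(float), defaultdict(int)
    for category, y in zip(categories, targets):
        sums[category] += y
        counts[category] += 1
    means = {c: sums[c] / counts[c] for c in sums}
    return [means[c] for c in categories], means


cities = ["NY", "NY", "LA", "LA", "LA", "SF"]  # illustrative data
churned = [1, 0, 1, 1, 0, 0]
encoded, means = target_encode(cities, churned)
print(means["NY"], means["SF"])  # 0.5 0.0
```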
57. What is windowing in smoothing?
Correct Answer: a) Local averaging
Explanation:
Windowing averages values in a sliding window to smooth noise.
58. What does df.dropna() do in Pandas?
Correct Answer: b) Removes rows with missing
Explanation:
dropna() deletes rows or columns containing NaN values.
59. What is spline interpolation?
Correct Answer: a) Piecewise polynomial fitting
Explanation:
Splines use smooth polynomials between points for interpolation.
60. What is L2 normalization?
Correct Answer: a) Euclidean norm to 1
Explanation:
L2 normalization divides each vector by the square root of the sum of its squared components.
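A small sketch of L2 normalization, with L1 alongside for comparison. The function names and the example vector are assumptions.

```python
import math


def l1_normalize(vec):
    """Scale the vector so its absolute values sum to 1."""
    norm = sum(abs(x) for x in vec)
    return [x / norm for x in vec] if norm else list(vec)


def l2_normalize(vec):
    """Scale the vector so its Euclidean length is 1."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


v = [3.0, -4.0]  # illustrative vector with Euclidean norm 5
print(l2_normalize(v))  # [0.6, -0.8]
```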
61. What is fuzzy matching?
Correct Answer: a) Approximate string matching
Explanation:
Fuzzy matching handles typos or variations in entity names.
62. What is a full outer join?
Correct Answer: a) All records from both
Explanation:
Full outer join includes all rows, filling non-matches with nulls.
63. What is equal-width discretization?
Correct Answer: a) Equal bin sizes
Explanation:
Equal-width divides range into uniform intervals.
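Equal-width discretization can be sketched as follows. The function name and ages are hypothetical, and assigning the maximum value to the last bin is a convention chosen for this example.

```python
def equal_width_bins(values, k):
    """Assign each value to one of k intervals of identical width; returns bin indices 0..k-1."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    if width == 0:
        return [0 for _ in values]
    # The maximum lands exactly on the last edge, so clamp it into the final bin.
    return [min(int((v - lo) / width), k - 1) for v in values]


ages = [0, 5, 10, 20, 25, 30]  # illustrative data
bins = equal_width_bins(ages, k=3)
print(bins)  # [0, 0, 1, 2, 2, 2]
```

The bins cover equal ranges but can hold very different numbers of points, which is the main difference from equal-frequency binning.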
64. What is Mahalanobis distance for outliers?
Correct Answer: a) Accounts for covariance
Explanation:
It measures distance considering variable correlations.
65. What is hot-deck imputation?
Correct Answer: a) Random similar donor value
Explanation:
Hot-deck selects from observed values in similar cases.
66. What is cold-deck imputation?
Correct Answer: a) External donor values
Explanation:
Cold-deck uses values from another dataset or time.
67. What is feature engineering in preprocessing?
Correct Answer: a) Creating new features
Explanation:
It derives informative variables from raw data.
68. What is log transformation used for?
Correct Answer: a) Handling skewness
Explanation:
Log reduces right-skewness in positive data.
69. What is Box-Cox transformation?
Correct Answer: a) Stabilizes variance
Explanation:
It finds optimal power to make data more normal.
70. What is data profiling?
Correct Answer: a) Summarizing data characteristics
Explanation:
Profiling assesses quality, structure, and content.
71. What is schema matching?
Correct Answer: a) Aligning attributes across sources
Explanation:
It resolves differences in data schemas during integration.
72. What is equal-frequency discretization?
Correct Answer: a) Equal counts per bin
Explanation:
Quantile binning ensures similar sample sizes in bins.
73. What is Isolation Forest for outliers?
Correct Answer: a) Anomaly isolation via trees
Explanation:
Outliers are isolated in fewer random splits than normal points, so they have shorter average path lengths in the trees.
74. What is multiple imputation?
Correct Answer: a) Creates several filled datasets
Explanation:
It accounts for imputation uncertainty by analyzing each filled dataset separately and pooling the results.
75. What is polynomial feature generation?
Correct Answer: a) Higher-order interactions
Explanation:
It creates features like x^2 or x*y for non-linearity.
76. What is Yeo-Johnson transformation?
Correct Answer: a) For negative values too
Explanation:
Extension of Box-Cox handling negative and zero values.
77. What is data validation in cleaning?
Correct Answer: a) Ensuring accuracy and consistency
Explanation:
Validation checks rules like range or format compliance.
78. What is record linkage?
Correct Answer: a) Matching across datasets
Explanation:
It links records referring to the same entity.
79. What is unsupervised discretization?
Correct Answer: a) No class labels used
Explanation:
Methods like equal-width don't rely on target variables.
80. What is Local Outlier Factor (LOF)?
Correct Answer: a) Density-based outlier score
Explanation:
LOF compares local density to neighbors.
81. What is MICE imputation?
Correct Answer: a) Multiple Imputation by Chained Equations
Explanation:
Iteratively regresses each variable that has missing values on the other variables.
82. What is interaction feature?
Correct Answer: a) Product of two features
Explanation:
Captures combined effects, like age * income.
83. What is quantile transformation?
Correct Answer: a) Maps to uniform distribution
Explanation:
It ranks the data and maps the ranks to a uniform or normal distribution.
84. What is data auditing?
Correct Answer: a) Systematic quality review
Explanation:
Auditing identifies patterns of errors or anomalies.
85. What is object consolidation?
Correct Answer: a) Merging duplicate entities
Explanation:
Part of integration resolving duplicates across sources.
86. What is clustering-based discretization?
Correct Answer: a) Groups similar values
Explanation:
Uses clustering to form natural bins.
87. What is DBSCAN for outliers?
Correct Answer: a) Density-based clustering flags noise
Explanation:
Points not in clusters are outliers in DBSCAN.
88. What is Kalman filter imputation?
Correct Answer: a) For time series states
Explanation:
Predicts missing values using state-space models.
89. What is lagged feature?
Correct Answer: a) Previous time step value
Explanation:
Used in time series for autoregressive features.
90. What is power transformation?
Correct Answer: a) General family for normality
Explanation:
Includes Box-Cox and Yeo-Johnson for stabilizing variance.
91. What is referential integrity check?
Correct Answer: a) Valid foreign keys
Explanation:
Ensures links between tables are valid.
92. What is data deduplication?
Correct Answer: a) Removing exact duplicates
Explanation:
Identifies and eliminates identical records.
93. What is entropy-based binning?
Correct Answer: a) Minimizes intra-bin impurity
Explanation:
Uses information entropy for supervised discretization.
94. What is one-class SVM for outliers?
Correct Answer: a) Learns normal boundary
Explanation:
Flags points outside the learned normal region.
95. What is EM algorithm for imputation?
Correct Answer: a) Expectation-Maximization
Explanation:
Alternates between estimating the missing values from the current parameters and re-estimating the parameters.
96. What is rolling window feature?
Correct Answer: a) Moving statistics
Explanation:
Computes aggregates over time windows.
97. What is arcsinh transformation?
Correct Answer: a) For heavy-tailed data
Explanation:
The inverse hyperbolic sine behaves like a log for large magnitudes but is also defined for zero and negative values.
98. What is domain validation?
Correct Answer: a) Business rule checks
Explanation:
Ensures data fits domain-specific logic.
99. What is survivorship bias in cleaning?
Correct Answer: a) Ignoring failed entities
Explanation:
Avoid it by including entities that failed or dropped out, not only those that survived.
100. What is k-bins discretization?
Correct Answer: a) Optimal bins via dynamic programming
Explanation:
Minimizes error with k bins.
101. What is elliptic envelope for outliers?
Correct Answer: a) Gaussian mixture based
Explanation:
Assumes Gaussian-distributed inliers and fits a robust covariance estimate (Minimum Covariance Determinant).
102. What is random forest imputation?
Correct Answer: a) Tree-based predictions
Explanation:
Uses random forests to predict missing values from the other features.
103. What is Fourier transform feature?
Correct Answer: a) Frequency domain for time series
Explanation:
Extracts periodic components.
104. What is square root transformation?
Correct Answer: a) For count data skewness
Explanation:
Stabilizes variance in Poisson-like count data.
105. What is completeness check?
Correct Answer: a) No missing required fields
Explanation:
Verifies all mandatory data is present.
106. What is selection bias in data cleaning?
Correct Answer: a) Non-random sampling
Explanation:
Address it by understanding how the data were sampled and which groups were excluded.
107. What is CAIM discretization?
Correct Answer: a) Class-Attribute Interdependence Maximization
Explanation:
Supervised method maximizing dependency.
108. What is COPOD for outliers?
Correct Answer: a) Copula-based
Explanation:
Unsupervised method that uses empirical copulas to estimate how extreme each point is.
109. What is Bayesian imputation?
Correct Answer: a) Probabilistic filling
Explanation:
Incorporates prior distributions for estimates.
110. What is wavelet transform feature?
Correct Answer: a) Multi-resolution analysis
Explanation:
Decomposes signals into time-frequency components.
111. What is reciprocal transformation?
Correct Answer: a) 1/x for left-skew
Explanation:
Inverts values to handle negative skew.
112. What is uniqueness check?
Correct Answer: a) No unintended duplicates
Explanation:
Ensures primary keys are unique.
113. What is temporal bias in cleaning?
Correct Answer: a) Time-period specific data
Explanation:
Balance data across periods.
114. What is MDLP discretization?
Correct Answer: a) Minimum Description Length Principle
Explanation:
Supervised stopping criterion for binning.
115. What is KNN for outliers?
Correct Answer: a) Distance to neighbors
Explanation:
High distance indicates isolation.
116. What is matrix factorization imputation?
Correct Answer: a) Low-rank approximation
Explanation:
Fills sparse matrices like in recommender systems.
117. What is embedding feature for text?
Correct Answer: a) Dense vector representations
Explanation:
Word2Vec or BERT captures semantic meaning.
118. What is cube root transformation?
Correct Answer: a) Milder than log for skewness
Explanation:
x^(1/3) for moderate right-skew.
119. What is timeliness check?
Correct Answer: a) Data currency
Explanation:
Verifies data is up-to-date.
120. What is confirmation bias in cleaning?
Correct Answer: a) Retaining supporting data
Explanation:
Avoid it by fixing objective cleaning criteria before looking at the results.
121. What is FUSINTER discretization?
Correct Answer: a) Fuzzy unsupervised
Explanation:
Handles overlapping bins with fuzziness.
122. What is angle-based outlier detection?
Correct Answer: a) ABOD using angles
Explanation:
Effective in high dimensions, where the variance of angles to other points stays informative while distances become less meaningful.
123. What is deep learning imputation?
Correct Answer: a) Autoencoder-based
Explanation:
Learns latent representations for filling.
124. What is PCA feature for images?
Correct Answer: a) Eigenfaces
Explanation:
Reduces dimensionality in face recognition.
125. What is exponential transformation?
Correct Answer: a) For left-skew to right
Explanation:
e^x stretches higher values more than lower ones, which spreads out the compressed upper end of left-skewed data.
126. What is accuracy check?
Correct Answer: a) Matches reality
Explanation:
Verifies data correctness against sources.
127. What is spatial bias in data?
Correct Answer: a) Geographic imbalance
Explanation:
Clean by sampling across regions.


