127 industry-level multiple-choice questions on data cleaning, handling missing values, outliers, encoding, scaling, and preprocessing pipelines, modeled after real data scientist and analyst interviews at FAANG, fintech, and consulting firms.
1. Which of the following is not a key aspect of data quality in preprocessing?
✅ Correct Answer: d) Database size
📝 Explanation:
Data quality focuses on accuracy, completeness, and consistency, not the size of the database.
2. What is a common method to handle missing class labels in a dataset?
✅ Correct Answer: a) Ignoring the tuple
📝 Explanation:
Ignoring tuples with missing class labels is a simple strategy when the impact is minimal.
3. In data cleaning, what technique uses statistical methods to fill missing values?
✅ Correct Answer: b) Decision tree induction
📝 Explanation:
Decision trees can predict probable values for missing data based on other attributes.
4. What is the primary purpose of data preprocessing in data analysis?
✅ Correct Answer: b) To prepare data for effective analysis
📝 Explanation:
Preprocessing transforms raw data into a format suitable for analysis and modeling.
5. Which technique is used to standardize feature ranges in preprocessing?
✅ Correct Answer: b) Normalization
📝 Explanation:
Normalization rescales features to a common range, such as 0 to 1, so that features with large magnitudes do not dominate distance-based or gradient-based models.
6. What does handling missing values prevent in data analysis?
✅ Correct Answer: b) Biased model training
📝 Explanation:
Unaddressed missing values can skew results and lead to inaccurate predictions.
7. Which method imputes missing values using the average of similar instances?
✅ Correct Answer: b) K-NN imputation
📝 Explanation:
K-Nearest Neighbors finds similar data points to estimate missing values.
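A minimal sketch of K-NN imputation, assuming scikit-learn is available and using made-up toy data. The missing entry is replaced by the average of that column over the two most similar rows, where similarity is measured on the columns that are observed.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical toy data: the second column is missing in the second row.
X = np.array([
    [1.0, 2.0],
    [2.0, np.nan],
    [3.0, 6.0],
    [10.0, 20.0],
])

# Average the 2 nearest rows, with distance computed on observed columns only.
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)

# The nearest rows to [2, nan] are [1, 2] and [3, 6], so the fill is (2 + 6) / 2.
print(X_filled[1])
```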
8. In data cleaning, what is 'binning' primarily used for?
✅ Correct Answer: a) Smoothing noisy data
📝 Explanation:
Binning groups data into bins and replaces values with bin averages to reduce noise.
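As an illustration of smoothing by bin means, here is a small pandas sketch on made-up prices. The choice of three equal-frequency bins is an assumption for the example.

```python
import pandas as pd

# Hypothetical sorted price data.
prices = pd.Series([4, 8, 15, 21, 21, 24, 25, 28, 34])

# Split into 3 equal-frequency bins (3 values each); labels are 0, 1, 2.
bins = pd.qcut(prices, q=3, labels=False)

# Replace every value with the mean of its bin.
smoothed = prices.groupby(bins).transform("mean")
print(smoothed.tolist())
```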
9. What is the risk of not removing duplicates in preprocessing?
✅ Correct Answer: d) All of the above
📝 Explanation:
Duplicates can bias models, inflate variance estimates, and slow processing.
10. Which encoding converts categorical data into binary vectors?
✅ Correct Answer: b) One-hot encoding
📝 Explanation:
One-hot encoding creates dummy variables for each category without ordinal assumptions.
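A short pandas sketch of one-hot encoding on a made-up `color` column. Each row ends up with exactly one 1 across the new columns.

```python
import pandas as pd

# Hypothetical categorical column.
df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# One indicator column per category; dtype=int gives 0/1 instead of booleans.
encoded = pd.get_dummies(df, columns=["color"], dtype=int)
print(encoded)
```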
11. What is 'data wrangling' in the context of preprocessing?
✅ Correct Answer: b) Transforming messy data into clean format
📝 Explanation:
Data wrangling involves cleaning and restructuring data for analysis.
12. Which technique detects outliers using statistical thresholds?
✅ Correct Answer: a) Z-score method
📝 Explanation:
Z-score identifies points more than 3 standard deviations from the mean as outliers.
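The rule can be sketched with NumPy as follows. The data are invented and the cutoff of 3 is the conventional choice, not a fixed requirement. Note that in very small samples no point can reach a z-score of 3, so the example uses 20 values.

```python
import numpy as np

# Hypothetical measurements with one extreme value at the end.
values = np.array([10, 11, 9, 10, 12, 10, 9, 11, 10, 10,
                   11, 9, 10, 12, 10, 9, 11, 10, 10, 100], dtype=float)

# z = (x - mean) / standard deviation
z_scores = (values - values.mean()) / values.std()

# Flag points more than 3 standard deviations from the mean.
outliers = values[np.abs(z_scores) > 3]
print(outliers)
```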
13. What is the main goal of feature scaling in preprocessing?
✅ Correct Answer: b) To make features comparable
📝 Explanation:
Scaling ensures no feature dominates due to differing units or ranges.
14. In handling categorical data, when is label encoding appropriate?
✅ Correct Answer: b) For ordinal data
📝 Explanation:
Label encoding maps each category to an integer, which implies an order, so it suits ordinal categories.
15. What does 'data integration' involve in preprocessing?
✅ Correct Answer: a) Combining multiple data sources
📝 Explanation:
Data integration merges datasets from different sources into a unified view.
16. Which imputation method is best for non-numerical data?
✅ Correct Answer: b) Mode imputation
📝 Explanation:
Mode imputation uses the most frequent value for categorical missing data.
17. What is a potential issue with mean imputation?
✅ Correct Answer: a) Reduces variance
📝 Explanation:
Mean imputation can underestimate variability in the dataset.
18. Which method groups data into buckets for smoothing?
✅ Correct Answer: b) Binning
📝 Explanation:
Binning sorts data into intervals and smooths by boundary or mean values.
19. What is 'data reduction' in preprocessing?
✅ Correct Answer: b) Reducing data volume while preserving information
📝 Explanation:
Data reduction techniques like PCA minimize data size without losing key insights.
20. When should you use median imputation over mean?
✅ Correct Answer: b) For skewed distributions
📝 Explanation:
Median is robust to outliers and skewness, unlike the mean.
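A small pandas sketch of the difference, using made-up incomes with one extreme value. The mean is pulled toward the extreme value, so the mean-imputed entry looks nothing like a typical row, while the median stays close to the bulk of the data.

```python
import numpy as np
import pandas as pd

# Hypothetical skewed column with one missing entry and one extreme value.
income = pd.Series([30.0, 32.0, 35.0, np.nan, 31.0, 500.0])

filled_mean = income.fillna(income.mean())      # mean of observed values is 125.6
filled_median = income.fillna(income.median())  # median of observed values is 32.0

print(filled_mean[3], filled_median[3])
```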
21. What does PCA stand for in dimensionality reduction?
✅ Correct Answer: a) Principal Component Analysis
📝 Explanation:
PCA transforms data into principal components to reduce dimensions.
22. Which step checks for data consistency across sources?
✅ Correct Answer: b) Data integration
📝 Explanation:
Integration resolves inconsistencies when merging multiple data sources.
23. What is a common way to handle outliers in preprocessing?
✅ Correct Answer: b) Cap or floor values
📝 Explanation:
Capping (winsorizing) clips extreme values to chosen thresholds, such as the 1st and 99th percentiles or the IQR fences.
24. In Python, which function detects missing values in Pandas?
✅ Correct Answer: a) isnull()
📝 Explanation:
Pandas' isnull() returns a boolean mask for missing values.
25. What is 'forward fill' in handling missing data?
✅ Correct Answer: a) Filling with previous value
📝 Explanation:
Forward fill propagates the last valid observation forward.
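In pandas this is `Series.ffill()`. A minimal sketch on made-up sensor readings:

```python
import numpy as np
import pandas as pd

# Hypothetical readings with gaps.
readings = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan])

# Each gap takes the most recent valid value before it.
filled = readings.ffill()
print(filled.tolist())
```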
26. Which normalization brings data to zero mean and unit variance?
✅ Correct Answer: b) Z-score normalization
📝 Explanation:
Z-score uses mean and standard deviation for standardization.
27. What issue arises from inconsistent data formats?
✅ Correct Answer: b) Parsing errors
📝 Explanation:
Inconsistent formats like dates can cause loading or computation failures.
28. Which technique merges datasets on common keys?
✅ Correct Answer: b) Joining
📝 Explanation:
Joining combines tables based on matching keys like IDs.
29. What is 'data discretization' used for?
✅ Correct Answer: a) Continuous to categorical conversion
📝 Explanation:
Discretization bins continuous values into discrete intervals.
30. In outlier detection, what does IQR stand for?
✅ Correct Answer: a) Interquartile Range
📝 Explanation:
The IQR method flags values below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR, where IQR = Q3 - Q1.
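A NumPy sketch of the IQR rule on invented data. The fences sit 1.5 × IQR below the first quartile and above the third quartile; the 1.5 multiplier is the usual convention.

```python
import numpy as np

# Hypothetical data with one extreme value.
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 50], dtype=float)

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = data[(data < lower_fence) | (data > upper_fence)]
print(lower_fence, upper_fence, outliers)
```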
31. What is a disadvantage of deleting rows with missing values?
✅ Correct Answer: a) Data loss
📝 Explanation:
Deletion reduces sample size, potentially biasing the dataset.
32. Which encoding preserves category frequencies?
✅ Correct Answer: b) Frequency encoding
📝 Explanation:
Frequency encoding replaces categories with their occurrence counts.
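A minimal pandas sketch of frequency encoding on a made-up `cities` column:

```python
import pandas as pd

# Hypothetical categorical column.
cities = pd.Series(["NY", "LA", "NY", "SF", "NY", "LA"])

# Count each category, then map every row to its category's count.
counts = cities.value_counts()
encoded = cities.map(counts)
print(encoded.tolist())
```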
33. What does 'data transformation' include?
✅ Correct Answer: a) Normalization and aggregation
📝 Explanation:
Transformation alters data structure, like normalizing or aggregating.
34. How do you handle multicollinearity in preprocessing?
✅ Correct Answer: a) Remove correlated features
📝 Explanation:
Removing highly correlated features reduces redundancy and instability.
35. What is 'noise' in data cleaning?
✅ Correct Answer: a) Random errors or variances
📝 Explanation:
Noise refers to irrelevant or incorrect data points distorting patterns.
36. Which library in Python is used for data manipulation?
✅ Correct Answer: b) Pandas
📝 Explanation:
Pandas provides DataFrames for efficient data cleaning and transformation.
37. What is 'backward fill' for missing data?
✅ Correct Answer: a) Filling with next value
📝 Explanation:
Backward fill uses the next valid observation to fill gaps.
38. Which method is robust to outliers in scaling?
✅ Correct Answer: b) Robust scaling
📝 Explanation:
Robust scaling centers on the median and divides by the IQR, so extreme values have little influence on the scale.
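A NumPy sketch of robust scaling on invented data. The outlier is still present after scaling, but it does not distort how the typical values are scaled.

```python
import numpy as np

# Hypothetical feature with one extreme value.
x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

median = np.median(x)
q1, q3 = np.percentile(x, [25, 75])

# (x - median) / IQR
scaled = (x - median) / (q3 - q1)
print(scaled.tolist())
```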
39. What causes data inconsistency?
✅ Correct Answer: a) Different naming conventions
📝 Explanation:
Synonyms or varying abbreviations across sources lead to inconsistencies.
40. In merging datasets, what is an inner join?
✅ Correct Answer: a) Only matching records
📝 Explanation:
Inner join returns rows with matching keys in both datasets.
41. What is entropy-based discretization?
✅ Correct Answer: a) Supervised binning using class info
📝 Explanation:
It uses information gain to create bins that maximize class separation.
42. How is outlier impact assessed?
✅ Correct Answer: d) All of the above
📝 Explanation:
Visual tools like box plots and scatter plots help identify outliers.
43. When is listwise deletion used?
✅ Correct Answer: b) When data loss is acceptable
📝 Explanation:
Listwise deletes entire rows with any missing values if sample size allows.
44. What does binary encoding do to categories?
✅ Correct Answer: b) Converts to binary bits
📝 Explanation:
Binary encoding writes each category's integer code in base 2, so k categories need about log2(k) columns instead of the k columns one-hot encoding uses.
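To make the column count concrete, here is a NumPy sketch that assumes categories are first mapped to integer codes 0 to k-1 (some libraries start at 1). Eight categories need 3 bit columns, compared with 8 columns for one-hot encoding.

```python
import math
import numpy as np

# Hypothetical categories, coded 0..7 in order of appearance.
categories = ["a", "b", "c", "d", "e", "f", "g", "h"]
codes = np.arange(len(categories))

# Number of bit columns needed for k categories.
n_bits = math.ceil(math.log2(len(categories)))

# Extract bits from most significant to least significant.
shifts = np.arange(n_bits - 1, -1, -1)
bits = (codes[:, None] >> shifts) & 1
print(bits)
```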
45. What is aggregation in transformation?
✅ Correct Answer: a) Summarizing data
📝 Explanation:
Aggregation computes summaries like means or counts from groups.
46. How to detect multicollinearity?
✅ Correct Answer: c) Both a and b
📝 Explanation:
High pairwise correlations or a VIF above roughly 5 (some practitioners use 10) indicate multicollinearity.
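Both checks can be sketched with NumPy alone. This uses the identity that the VIFs are the diagonal of the inverse correlation matrix. The three features are invented, with `x2` built to be almost a multiple of `x1`.

```python
import numpy as np

# Hypothetical features: x2 is nearly 2 * x1, x3 is unrelated.
x1 = np.arange(10, dtype=float)
x2 = 2 * x1 + np.array([0.1, -0.1] * 5)
x3 = np.array([3, 1, 4, 1, 5, 9, 2, 6, 5, 3], dtype=float)

X = np.column_stack([x1, x2, x3])

# Pairwise correlation matrix (columns are variables).
corr = np.corrcoef(X, rowvar=False)

# VIF_j equals the j-th diagonal entry of the inverse correlation matrix.
vif = np.diag(np.linalg.inv(corr))
print(np.round(corr, 3))
print(vif)
```

Here `x1` and `x2` get very large VIFs while `x3` stays near 1, so dropping one of the first two would be the usual remedy.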
47. What smoothing method uses regression?
✅ Correct Answer: b) Regression smoothing
📝 Explanation:
Regression fits a model to local data for noise reduction.
48. Which Pandas method removes duplicates?
✅ Correct Answer: a) drop_duplicates()
📝 Explanation:
drop_duplicates() eliminates repeated rows based on specified columns.
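A short sketch on a made-up table, showing the default behavior (all columns must match) and the `subset` argument:

```python
import pandas as pd

# Hypothetical records: rows 0 and 1 are exact duplicates.
df = pd.DataFrame({
    "id": [1, 1, 2, 2],
    "email": ["a@example.com", "a@example.com", "b@example.com", "b@example.org"],
})

# Default: a row is a duplicate only if every column matches.
exact = df.drop_duplicates()

# subset: treat rows with the same id as duplicates, keeping the first.
by_id = df.drop_duplicates(subset=["id"], keep="first")
print(len(exact), len(by_id))
```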
49. What is interpolation for time series missing data?
✅ Correct Answer: a) Linear estimation between points
📝 Explanation:
Interpolation estimates values using surrounding data points.
50. What is L1 normalization?
✅ Correct Answer: a) Sum to 1
📝 Explanation:
L1 normalizes by dividing by the sum of absolute values.
51. What is entity resolution in cleaning?
✅ Correct Answer: a) Merging similar records
📝 Explanation:
It identifies and merges duplicates across datasets.
52. What join includes all left records?
✅ Correct Answer: b) Left join
📝 Explanation:
Left join keeps all from left table, matching from right.
53. What is chi-merge discretization?
✅ Correct Answer: a) Supervised using chi-square
📝 Explanation:
ChiMerge uses chi-square tests to determine bin boundaries.
54. What threshold for Z-score outlier?
✅ Correct Answer: b) 3
📝 Explanation:
Values beyond 3 standard deviations are typically outliers.
55. When to use pairwise deletion?
✅ Correct Answer: a) For correlations
📝 Explanation:
Pairwise uses available pairs, maximizing data for specific analyses.
56. What is target encoding?
✅ Correct Answer: a) Replace with mean target
📝 Explanation:
Target encoding uses the mean of the target for each category.
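A minimal pandas sketch with invented data. In practice the category means are usually computed on training folds only, since encoding with the full target can leak label information; that step is omitted here for brevity.

```python
import pandas as pd

# Hypothetical data: binary target "churn" per customer city.
df = pd.DataFrame({
    "city": ["NY", "NY", "LA", "LA", "LA"],
    "churn": [1, 0, 1, 1, 0],
})

# Replace each category with the mean target value of that category.
df["city_te"] = df.groupby("city")["churn"].transform("mean")
print(df)
```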
57. What is windowing in smoothing?
✅ Correct Answer: a) Local averaging
📝 Explanation:
Windowing averages values in a sliding window to smooth noise.
58. What does df.dropna() do in Pandas?
✅ Correct Answer: b) Removes rows with missing
📝 Explanation:
By default, dropna() drops every row that contains a NaN; passing axis=1 drops columns instead.
59. What is spline interpolation?
✅ Correct Answer: a) Piecewise polynomial fitting
📝 Explanation:
Splines use smooth polynomials between points for interpolation.
60. What is L2 normalization?
✅ Correct Answer: a) Euclidean norm to 1
📝 Explanation:
L2 divides by the square root of sum of squares.
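A two-line NumPy sketch, using the 3-4-5 triangle so the result is easy to check by hand:

```python
import numpy as np

v = np.array([3.0, 4.0])

# Divide by the Euclidean norm: sqrt(3^2 + 4^2) = 5.
unit = v / np.linalg.norm(v)
print(unit)  # components 0.6 and 0.8, with length 1
```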
61. What is fuzzy matching?
✅ Correct Answer: a) Approximate string matching
📝 Explanation:
Fuzzy matching handles typos or variations in entity names.
62. What is a full outer join?
✅ Correct Answer: a) All records from both
📝 Explanation:
Full outer join includes all rows, filling non-matches with nulls.
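The join types can be compared side by side with `DataFrame.merge`. The two small tables below are made up; only ids 2 and 3 appear in both.

```python
import pandas as pd

# Hypothetical tables sharing an "id" key.
left = pd.DataFrame({"id": [1, 2, 3], "name": ["Ann", "Bob", "Cy"]})
right = pd.DataFrame({"id": [2, 3, 4], "score": [20, 30, 40]})

inner = left.merge(right, on="id", how="inner")     # ids 2, 3
left_join = left.merge(right, on="id", how="left")  # ids 1, 2, 3
outer = left.merge(right, on="id", how="outer")     # ids 1, 2, 3, 4

# Non-matching cells in the outer join are filled with NaN.
print(outer)
```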
63. What is equal-width discretization?
✅ Correct Answer: a) Equal bin sizes
📝 Explanation:
Equal-width divides range into uniform intervals.
64. What is Mahalanobis distance for outliers?
✅ Correct Answer: a) Accounts for covariance
📝 Explanation:
It measures distance considering variable correlations.
65. What is hot-deck imputation?
✅ Correct Answer: a) Random similar donor value
📝 Explanation:
Hot-deck selects from observed values in similar cases.
66. What is cold-deck imputation?
✅ Correct Answer: a) External donor values
📝 Explanation:
Cold-deck uses values from another dataset or time.
67. What is feature engineering in preprocessing?
✅ Correct Answer: a) Creating new features
📝 Explanation:
It derives informative variables from raw data.
68. What is log transformation used for?
✅ Correct Answer: a) Handling skewness
📝 Explanation:
Log reduces right-skewness in positive data.
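A small sketch of the effect. The data are constructed as exponentials of evenly spaced numbers (an assumption chosen so the outcome is exact): the raw values are right-skewed, and taking the log makes them symmetric.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed, strictly positive data: e^0, e^1, ..., e^4.
raw = pd.Series(np.exp([0.0, 1.0, 2.0, 3.0, 4.0]))

# The log undoes the exponential, giving evenly spaced values 0..4.
logged = np.log(raw)

print(raw.skew(), logged.skew())
```

For data that contain zeros, `np.log1p` (log of 1 + x) is a common substitute.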
69. What is Box-Cox transformation?
✅ Correct Answer: a) Stabilizes variance
📝 Explanation:
It finds optimal power to make data more normal.
70. What is data profiling?
✅ Correct Answer: a) Summarizing data characteristics
📝 Explanation:
Profiling assesses quality, structure, and content.
71. What is schema matching?
✅ Correct Answer: a) Aligning attributes across sources
📝 Explanation:
It resolves differences in data schemas during integration.
72. What is equal-frequency discretization?
✅ Correct Answer: a) Equal counts per bin
📝 Explanation:
Quantile binning ensures similar sample sizes in bins.
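The contrast with equal-width binning shows up clearly on skewed data. In this pandas sketch (made-up values, three bins assumed), `pd.cut` gives uniform intervals with uneven counts, while `pd.qcut` gives two values per bin.

```python
import pandas as pd

# Hypothetical skewed values.
values = pd.Series([1, 5, 10, 20, 50, 100])

# Equal-width: three intervals of the same length over [1, 100].
equal_width = pd.cut(values, bins=3, labels=False)

# Equal-frequency: three bins with the same number of values.
equal_freq = pd.qcut(values, q=3, labels=False)

print(equal_width.tolist(), equal_freq.tolist())
```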
73. What is Isolation Forest for outliers?
✅ Correct Answer: a) Anomaly isolation via trees
📝 Explanation:
Outliers are separated from the rest of the data in fewer random splits than normal points, so short average path lengths signal anomalies.
74. What is multiple imputation?
✅ Correct Answer: a) Creates several filled datasets
📝 Explanation:
It accounts for imputation uncertainty by analyzing each filled dataset separately and pooling the results.
75. What is polynomial feature generation?
✅ Correct Answer: a) Higher-order interactions
📝 Explanation:
It creates features like x^2 or x*y for non-linearity.
76. What is Yeo-Johnson transformation?
✅ Correct Answer: a) For negative values too
📝 Explanation:
Extension of Box-Cox handling negative and zero values.
77. What is data validation in cleaning?
✅ Correct Answer: a) Ensuring accuracy and consistency
📝 Explanation:
Validation checks rules like range or format compliance.
78. What is record linkage?
✅ Correct Answer: a) Matching across datasets
📝 Explanation:
It links records referring to the same entity.
79. What is unsupervised discretization?
✅ Correct Answer: a) No class labels used
📝 Explanation:
Methods like equal-width don't rely on target variables.
80. What is Local Outlier Factor (LOF)?
✅ Correct Answer: a) Density-based outlier score
📝 Explanation:
LOF compares local density to neighbors.
81. What is MICE imputation?
✅ Correct Answer: a) Multiple Imputation by Chained Equations
📝 Explanation:
Iteratively regresses each variable with missing values on the other variables.
82. What is interaction feature?
✅ Correct Answer: a) Product of two features
📝 Explanation:
Captures combined effects, like age * income.
83. What is quantile transformation?
✅ Correct Answer: a) Maps to uniform distribution
📝 Explanation:
It ranks the data and maps the ranks to a uniform or normal distribution.
84. What is data auditing?
✅ Correct Answer: a) Systematic quality review
📝 Explanation:
Auditing identifies patterns of errors or anomalies.
85. What is object consolidation?
✅ Correct Answer: a) Merging duplicate entities
📝 Explanation:
Part of integration resolving duplicates across sources.
86. What is clustering-based discretization?
✅ Correct Answer: a) Groups similar values
📝 Explanation:
Uses clustering to form natural bins.
87. What is DBSCAN for outliers?
✅ Correct Answer: a) Density-based clustering flags noise
📝 Explanation:
Points not in clusters are outliers in DBSCAN.
88. What is Kalman filter imputation?
✅ Correct Answer: a) For time series states
📝 Explanation:
Predicts missing values using state-space models.
89. What is lagged feature?
✅ Correct Answer: a) Previous time step value
📝 Explanation:
Used in time series for autoregressive features.
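In pandas a lag is a `shift`. A minimal sketch on made-up daily sales:

```python
import pandas as pd

# Hypothetical daily sales.
sales = pd.Series([10, 12, 15, 11])

# Lag-1 feature: yesterday's value aligned with today's row.
lag_1 = sales.shift(1)
print(lag_1.tolist())  # the first entry has no predecessor, so it is NaN
```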
90. What is power transformation?
✅ Correct Answer: a) General family for normality
📝 Explanation:
Includes Box-Cox and Yeo-Johnson for stabilizing variance.
91. What is referential integrity check?
✅ Correct Answer: a) Valid foreign keys
📝 Explanation:
Ensures links between tables are valid.
92. What is data deduplication?
✅ Correct Answer: a) Removing exact duplicates
📝 Explanation:
Identifies and eliminates identical records.
93. What is entropy-based binning?
✅ Correct Answer: a) Minimizes intra-bin impurity
📝 Explanation:
Uses information entropy for supervised discretization.
94. What is one-class SVM for outliers?
✅ Correct Answer: a) Learns normal boundary
📝 Explanation:
Flags points outside the learned normal region.
95. What is EM algorithm for imputation?
✅ Correct Answer: a) Expectation-Maximization
📝 Explanation:
Alternates between estimating the missing values given the current parameters and re-estimating the parameters given the filled data.
96. What is rolling window feature?
✅ Correct Answer: a) Moving statistics
📝 Explanation:
Computes aggregates over time windows.
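A short pandas sketch of a 3-step moving average on made-up values. The window length is an arbitrary choice for the example.

```python
import pandas as pd

# Hypothetical time-ordered values.
series = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# Mean over the current and two previous observations.
rolling_mean = series.rolling(window=3).mean()
print(rolling_mean.tolist())  # first two entries are NaN (window not yet full)
```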
97. What is arcsinh transformation?
✅ Correct Answer: a) For heavy-tailed data
📝 Explanation:
The inverse hyperbolic sine behaves like a log for large values but is also defined at zero and for negative values.
98. What is domain validation?
✅ Correct Answer: a) Business rule checks
📝 Explanation:
Ensures data fits domain-specific logic.
99. What is survivorship bias in cleaning?
✅ Correct Answer: a) Ignoring failed entities
📝 Explanation:
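Survivorship bias arises when only entities that "survived" remain in the data; mitigate it by including failed or discontinued entities from the historical record.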
100. What is k-bins discretization?
✅ Correct Answer: a) Optimal bins via dynamic programming
📝 Explanation:
Minimizes error with k bins.
101. What is elliptic envelope for outliers?
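✅ Correct Answer: a) Gaussian distribution based
📝 Explanation:
Assumes roughly Gaussian data, fits a robust covariance estimate (Minimum Covariance Determinant), and flags points with a large Mahalanobis distance.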
102. What is random forest imputation?
✅ Correct Answer: a) Tree-based predictions
📝 Explanation:
Uses random forests to predict missing values from the other features.
103. What is Fourier transform feature?
✅ Correct Answer: a) Frequency domain for time series
📝 Explanation:
Extracts periodic components.
104. What is square root transformation?
✅ Correct Answer: a) For count data skewness
📝 Explanation:
Stabilizes variance and reduces right skew in Poisson-like count data.
105. What is completeness check?
✅ Correct Answer: a) No missing required fields
📝 Explanation:
Verifies all mandatory data is present.
106. What is selection bias in data cleaning?
✅ Correct Answer: a) Non-random sampling
📝 Explanation:
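Selection bias arises when the sample is not drawn at random from the population; address it by understanding and correcting for the sampling method.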
107. What is CAIM discretization?
✅ Correct Answer: a) Class-Attribute Interdependence Maximization
📝 Explanation:
Supervised method maximizing dependency.
108. What is COPOD for outliers?
✅ Correct Answer: a) Copula-based
📝 Explanation:
Unsupervised using copulas for dependence.
109. What is Bayesian imputation?
✅ Correct Answer: a) Probabilistic filling
📝 Explanation:
Incorporates prior distributions for estimates.
110. What is wavelet transform feature?
✅ Correct Answer: a) Multi-resolution analysis
📝 Explanation:
Decomposes signals into time-frequency components.
111. What is reciprocal transformation?
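✅ Correct Answer: a) 1/x for strong right-skew
📝 Explanation:
The reciprocal compresses large values more aggressively than the log, so it targets severe positive skew in strictly positive data; note that it reverses the ordering of values.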
112. What is uniqueness check?
✅ Correct Answer: a) No unintended duplicates
📝 Explanation:
Ensures primary keys are unique.
113. What is temporal bias in cleaning?
✅ Correct Answer: a) Time-period specific data
📝 Explanation:
Temporal bias arises when the data covers only certain time periods; balance or reweight the data across periods.
114. What is MDLP discretization?
✅ Correct Answer: a) Minimum Description Length Principle
📝 Explanation:
Supervised stopping criterion for binning.
115. What is KNN for outliers?
✅ Correct Answer: a) Distance to neighbors
📝 Explanation:
High distance indicates isolation.
116. What is matrix factorization imputation?
✅ Correct Answer: a) Low-rank approximation
📝 Explanation:
Fills sparse matrices like in recommender systems.
117. What is embedding feature for text?
✅ Correct Answer: a) Dense vector representations
📝 Explanation:
Word2Vec or BERT captures semantic meaning.
118. What is cube root transformation?
✅ Correct Answer: a) Milder than log for skewness
📝 Explanation:
x^(1/3) for moderate right-skew.
119. What is timeliness check?
✅ Correct Answer: a) Data currency
📝 Explanation:
Verifies data is up-to-date.
120. What is confirmation bias in cleaning?
✅ Correct Answer: a) Retaining supporting data
📝 Explanation:
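Confirmation bias here means keeping only the data that supports a hypothesis; avoid it by fixing objective cleaning criteria in advance.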
121. What is FUSINTER discretization?
✅ Correct Answer: a) Supervised bottom-up merging
📝 Explanation:
FUSINTER starts from fine-grained intervals and greedily merges adjacent ones using a class-based criterion.
122. What is angle-based outlier detection?
✅ Correct Answer: a) ABOD using angles
📝 Explanation:
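ABOD scores a point by the variance of the angles it forms with pairs of other points; this holds up better than distances in high dimensions, though the exact method is computationally expensive.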
123. What is deep learning imputation?
✅ Correct Answer: a) Autoencoder-based
📝 Explanation:
Learns latent representations for filling.
124. What is PCA feature for images?
✅ Correct Answer: a) Eigenfaces
📝 Explanation:
Reduces dimensionality in face recognition.
125. What is exponential transformation?
✅ Correct Answer: a) For left-skew to right
📝 Explanation:
e^x stretches higher values more than lower ones, which can offset left (negative) skew.
126. What is accuracy check?
✅ Correct Answer: a) Matches reality
📝 Explanation:
Verifies data correctness against sources.
127. What is spatial bias in data?
✅ Correct Answer: a) Geographic imbalance
📝 Explanation:
Mitigate it by sampling or reweighting across regions so that no geography is over-represented.


