120 Data Cleaning and Preprocessing in Data Analysis - MCQs

Category: 1000 Data Analysis MCQ

Published: November 8, 2025

Posted by: MCQs Generator

120 industry-level multiple-choice questions on data cleaning, handling missing values, outliers, encoding, scaling, and preprocessing pipelines—modeled after real data scientist and analyst interviews at FAANG, fintech, and consulting firms.

1. Which of the following is not a key aspect of data quality in preprocessing?

a) Accuracy

b) Completeness

c) Consistency

d) Database size

Correct Answer: d) Database size

Explanation:

Data quality focuses on accuracy, completeness, and consistency, not the size of the database.

2. What is a common method to handle missing class labels in a dataset?

a) Ignoring the tuple

b) Generating a duplicate tuple

c) Deleting the entire dataset

d) Increasing the dataset size

Correct Answer: a) Ignoring the tuple

Explanation:

Ignoring tuples with missing class labels is a simple strategy when the impact is minimal.

3. In data cleaning, what technique uses statistical methods to fill missing values?

a) Spooling

b) Decision tree induction

c) Numerosity reduction

d) Aggregation

Correct Answer: b) Decision tree induction

Explanation:

Decision trees can predict probable values for missing data based on other attributes.

4. What is the primary purpose of data preprocessing in data analysis?

a) To increase data volume

b) To prepare data for effective analysis

c) To delete all data

d) To visualize data directly

Correct Answer: b) To prepare data for effective analysis

Explanation:

Preprocessing transforms raw data into a format suitable for analysis and modeling.

5. Which technique is used to standardize feature ranges in preprocessing?

a) Binning

b) Normalization

c) Clustering

d) Regression

Correct Answer: b) Normalization

Explanation:

Normalization rescales features to a common range, such as 0 to 1, so that features with large numeric ranges do not dominate those with small ones.
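
As a rough illustration of the idea, min-max normalization can be sketched in plain Python as below. The function name `min_max_scale` and the sample incomes are hypothetical, chosen only for this example.

```python
def min_max_scale(values, lo=0.0, hi=1.0):
    """Rescale values linearly so the smallest maps to lo and the largest to hi."""
    v_min, v_max = min(values), max(values)
    if v_max == v_min:
        # A constant feature has no spread; map everything to lo.
        return [lo for _ in values]
    span = v_max - v_min
    return [lo + (v - v_min) * (hi - lo) / span for v in values]

# Hypothetical feature with a large numeric range
incomes = [20_000, 35_000, 50_000, 80_000]
scaled = min_max_scale(incomes)
print(scaled)
```

Because the minimum and maximum drive the result, a single extreme value can squeeze the remaining values into a narrow band.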

6. What does handling missing values prevent in data analysis?

a) Data expansion

b) Biased model training

c) Data visualization

d) Feature creation

Correct Answer: b) Biased model training

Explanation:

Unaddressed missing values can skew results and lead to inaccurate predictions.

7. Which method imputes missing values using the average of similar instances?

a) Mean imputation

b) K-NN imputation

c) Deletion

d) Mode imputation

Correct Answer: b) K-NN imputation

Explanation:

K-Nearest Neighbors finds similar data points to estimate missing values.
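
A minimal sketch of K-NN imputation follows, assuming a single numeric column has gaps and every other column is complete and numeric. The helper `knn_impute` and the toy rows are hypothetical; production code would usually scale the features first.

```python
import math

def knn_impute(rows, target_idx, k=2):
    """Fill None in column target_idx with the mean of the k nearest complete rows.

    Distance is Euclidean over the remaining columns (assumed complete).
    """
    complete = [r for r in rows if r[target_idx] is not None]
    filled = []
    for row in rows:
        if row[target_idx] is not None:
            filled.append(list(row))
            continue

        def dist(other):
            # Skip the target column, which is missing in `row`.
            return math.sqrt(sum(
                (a - b) ** 2
                for i, (a, b) in enumerate(zip(row, other))
                if i != target_idx
            ))

        neighbors = sorted(complete, key=dist)[:k]
        estimate = sum(n[target_idx] for n in neighbors) / len(neighbors)
        new_row = list(row)
        new_row[target_idx] = estimate
        filled.append(new_row)
    return filled

# Columns: age, years of experience, salary (toy data)
data = [
    [25, 2, 40.0],
    [26, 3, 44.0],
    [45, 20, 90.0],
    [27, 3, None],
]
result = knn_impute(data, target_idx=2, k=2)
print(result[3])
```

The last row is closest to the first two rows, so its salary is estimated from them rather than from the much older employee.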

8. In data cleaning, what is 'binning' primarily used for?

a) Smoothing noisy data

b) Removing duplicates

c) Encoding categories

d) Scaling features

Correct Answer: a) Smoothing noisy data

Explanation:

Binning groups data into bins and replaces values with bin averages to reduce noise.
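
Smoothing by bin means can be sketched as follows, assuming equal-frequency bins over the sorted values. The function name and the sample prices are illustrative only.

```python
def smooth_by_bin_means(values, bin_size):
    """Sort values, split them into bins of bin_size, and replace each value by its bin mean."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_values = ordered[start:start + bin_size]
        mean = sum(bin_values) / len(bin_values)
        smoothed.extend([mean] * len(bin_values))
    return smoothed

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
smoothed = smooth_by_bin_means(prices, bin_size=3)
print(smoothed)
```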

9. What is the risk of not removing duplicates in preprocessing?

a) Overfitting in models

b) Underestimation of variance

c) Increased computational cost

d) All of the above

Correct Answer: d) All of the above

Explanation:

Duplicates can bias models toward the repeated records, make variance estimates too small by overstating the effective sample size, and slow processing.

10. Which encoding converts categorical data into binary vectors?

a) Label encoding

b) One-hot encoding

c) Target encoding

d) Frequency encoding

Correct Answer: b) One-hot encoding

Explanation:

One-hot encoding creates dummy variables for each category without ordinal assumptions.
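
A small dependency-free sketch of one-hot encoding is shown below. Ordering the columns alphabetically is an assumption of this example, not a requirement of the technique.

```python
def one_hot_encode(values):
    """Return the sorted category list and one binary vector per input value."""
    categories = sorted(set(values))
    index = {cat: i for i, cat in enumerate(categories)}
    encoded = []
    for v in values:
        vec = [0] * len(categories)
        vec[index[v]] = 1  # exactly one position is "hot"
        encoded.append(vec)
    return categories, encoded

colors = ["red", "green", "blue", "green"]
categories, encoded = one_hot_encode(colors)
print(categories)
print(encoded)
```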

11. What is 'data wrangling' in the context of preprocessing?

a) Data deletion

b) Transforming messy data into clean format

c) Data visualization

d) Model training

Correct Answer: b) Transforming messy data into clean format

Explanation:

Data wrangling involves cleaning and restructuring data for analysis.

12. Which technique detects outliers using statistical thresholds?

a) Z-score method

b) K-means clustering

c) PCA

d) Binning

Correct Answer: a) Z-score method

Explanation:

The Z-score method flags points that lie more than a chosen number of standard deviations from the mean, commonly 3, as outliers.
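
The Z-score rule can be sketched with the standard library as below. The threshold of 3 follows the convention mentioned in the explanation, and the sensor readings are made up for illustration.

```python
import statistics

def zscore_outliers(values, threshold=3.0):
    """Return values whose absolute Z-score exceeds the threshold."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation
    if std == 0:
        return []  # all values identical, nothing can be an outlier
    return [v for v in values if abs(v - mean) / std > threshold]

readings = [10, 11, 9, 10, 12, 10, 11, 9, 10, 11,
            10, 9, 11, 10, 12, 10, 9, 11, 10, 95]
outliers = zscore_outliers(readings)
print(outliers)
```

One caveat: the outlier itself inflates the mean and standard deviation, so in very small samples no point may reach a Z-score of 3.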

13. What is the main goal of feature scaling in preprocessing?

a) To reduce dimensions

b) To make features comparable

c) To encode text

d) To handle missing data

Correct Answer: b) To make features comparable

Explanation:

Scaling ensures no feature dominates due to differing units or ranges.

14. In handling categorical data, when is label encoding appropriate?

a) For nominal data

b) For ordinal data

c) For text data

d) For numerical data

Correct Answer: b) For ordinal data

Explanation:

Label encoding assigns numbers based on order, suitable for ordinal categories.

15. What does 'data integration' involve in preprocessing?

a) Combining multiple data sources

b) Removing noise

c) Scaling values

d) Encoding features

Correct Answer: a) Combining multiple data sources

Explanation:

Data integration merges datasets from different sources into a unified view.

16. Which imputation method is best for non-numerical data?

a) Mean imputation

b) Mode imputation

c) K-NN imputation

d) Regression imputation

Correct Answer: b) Mode imputation

Explanation:

Mode imputation uses the most frequent value for categorical missing data.

17. What is a potential issue with mean imputation?

a) Reduces variance

b) Increases data size

c) Deletes rows

d) Encodes categories

Correct Answer: a) Reduces variance

Explanation:

Mean imputation can underestimate variability in the dataset.

18. Which method groups data into buckets for smoothing?

a) Regression

b) Binning

c) Clustering

d) Normalization

Correct Answer: b) Binning

Explanation:

Binning sorts data into intervals and smooths by boundary or mean values.

19. What is 'data reduction' in preprocessing?

a) Increasing dataset size

b) Reducing data volume while preserving information

c) Deleting all data

d) Visualizing data

Correct Answer: b) Reducing data volume while preserving information

Explanation:

Data reduction techniques like PCA minimize data size without losing key insights.

20. When should you use median imputation over mean?

a) For symmetric distributions

b) For skewed distributions

c) For categorical data

d) For ordinal data

Correct Answer: b) For skewed distributions

Explanation:

Median is robust to outliers and skewness, unlike the mean.
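
To see why the median is preferred for skewed data, the sketch below fills the same gap both ways. The `impute` helper and the salary figures are hypothetical.

```python
import statistics

def impute(values, strategy="median"):
    """Replace None with the median or mean of the observed values."""
    observed = [v for v in values if v is not None]
    if strategy == "median":
        fill = statistics.median(observed)
    else:
        fill = statistics.fmean(observed)
    return [fill if v is None else v for v in values]

# One very large salary skews the distribution to the right
salaries = [30, 32, 35, None, 38, 40, 500]
mean_filled = impute(salaries, "mean")
median_filled = impute(salaries, "median")
print(mean_filled[3], median_filled[3])
```

The mean-based fill is pulled far above every typical salary, while the median-based fill stays in the middle of the bulk of the data.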

21. What does PCA stand for in dimensionality reduction?

a) Principal Component Analysis

b) Primary Cluster Aggregation

c) Principal Correlation Adjustment

d) Primary Component Allocation

Correct Answer: a) Principal Component Analysis

Explanation:

PCA transforms data into principal components to reduce dimensions.

22. Which step checks for data consistency across sources?

a) Data cleaning

b) Data integration

c) Data transformation

d) Data reduction

Correct Answer: b) Data integration

Explanation:

Integration resolves inconsistencies when merging multiple data sources.

23. What is a common way to handle outliers in preprocessing?

a) Ignore them

b) Cap or floor values

c) Increase dataset size

d) Encode as categories

Correct Answer: b) Cap or floor values

Explanation:

Capping (winsorizing) replaces extreme values with threshold values, such as chosen percentiles or the IQR fences.

24. In Python, which function detects missing values in Pandas?

a) isnull()

b) dropna()

c) fillna()

d) describe()

Correct Answer: a) isnull()

Explanation:

Pandas' isnull() returns a boolean mask for missing values.

25. What is 'forward fill' in handling missing data?

a) Filling with previous value

b) Filling with next value

c) Filling with mean

d) Deleting rows

Correct Answer: a) Filling with previous value

Explanation:

Forward fill propagates the last valid observation forward.

26. Which normalization brings data to zero mean and unit variance?

a) Min-max scaling

b) Z-score normalization

c) Decimal scaling

d) L1 normalization

Correct Answer: b) Z-score normalization

Explanation:

Z-score uses mean and standard deviation for standardization.

27. What issue arises from inconsistent data formats?

a) Easy analysis

b) Parsing errors

c) Faster processing

d) Better visualization

Correct Answer: b) Parsing errors

Explanation:

Inconsistent formats like dates can cause loading or computation failures.

28. Which technique merges datasets on common keys?

a) Concatenation

b) Joining

c) Binning

d) Imputation

Correct Answer: b) Joining

Explanation:

Joining combines tables based on matching keys like IDs.

29. What is 'data discretization' used for?

a) Continuous to categorical conversion

b) Noise removal

c) Duplicate detection

d) Scaling

Correct Answer: a) Continuous to categorical conversion

Explanation:

Discretization bins continuous values into discrete intervals.

30. In outlier detection, what does IQR stand for?

a) Interquartile Range

b) Integrated Query Response

c) Internal Quality Ratio

d) Interval Quality Reduction

Correct Answer: a) Interquartile Range

Explanation:

The IQR method flags values below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR, where IQR = Q3 - Q1.
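
The IQR rule can be sketched as below. Quartile definitions vary between tools; this example assumes the "inclusive" method of Python's `statistics.quantiles`, so other software may give slightly different fences.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR] together with the fences."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    flagged = [v for v in values if v < lower or v > upper]
    return flagged, (lower, upper)

data = [12, 13, 14, 15, 15, 16, 17, 18, 60]
outliers, fences = iqr_outliers(data)
print(outliers, fences)
```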

31. What is a disadvantage of deleting rows with missing values?

a) Data loss

b) Increased accuracy

c) Faster processing

d) Better scaling

Correct Answer: a) Data loss

Explanation:

Deletion reduces sample size, potentially biasing the dataset.

32. Which encoding preserves category frequencies?

a) One-hot encoding

b) Frequency encoding

c) Label encoding

d) Binary encoding

Correct Answer: b) Frequency encoding

Explanation:

Frequency encoding replaces categories with their occurrence counts.

33. What does 'data transformation' include?

a) Normalization and aggregation

b) Only deletion

c) Visualization

d) Model training

Correct Answer: a) Normalization and aggregation

Explanation:

Transformation alters data structure, like normalizing or aggregating.

34. How do you handle multicollinearity in preprocessing?

a) Remove correlated features

b) Add more data

c) Ignore it

d) Scale only

Correct Answer: a) Remove correlated features

Explanation:

Removing highly correlated features reduces redundancy and instability.

35. What is 'noise' in data cleaning?

a) Random errors or variances

b) Missing values

c) Duplicates

d) Outliers

Correct Answer: a) Random errors or variances

Explanation:

Noise is random error or variance in a measured variable that obscures the underlying pattern.

36. Which library in Python is used for data manipulation?

a) Matplotlib

b) Pandas

c) Scikit-learn

d) NumPy

Correct Answer: b) Pandas

Explanation:

Pandas provides DataFrames for efficient data cleaning and transformation.

37. What is 'backward fill' for missing data?

a) Filling with next value

b) Filling with previous value

c) Filling with mean

d) Deleting

Correct Answer: a) Filling with next value

Explanation:

Backward fill uses the next valid observation to fill gaps.
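
Forward fill and backward fill are mirror images of each other, which the following sketch makes explicit. `None` stands in for a missing value here.

```python
def forward_fill(values):
    """Propagate the last seen non-missing value forward."""
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled

def backward_fill(values):
    """Propagate the next non-missing value backward (forward fill on the reversed list)."""
    return forward_fill(values[::-1])[::-1]

series = [None, 5, None, None, 8, None]
ffilled = forward_fill(series)
bfilled = backward_fill(series)
print(ffilled)
print(bfilled)
```

Forward fill cannot fill a leading gap and backward fill cannot fill a trailing one, because no earlier or later observation exists to copy.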

38. Which method is robust to outliers in scaling?

a) Min-max scaling

b) Robust scaling

c) Z-score

d) Log scaling

Correct Answer: b) Robust scaling

Explanation:

Robust scaling centers on the median and divides by the IQR, both of which are little affected by extreme values.

39. What causes data inconsistency?

a) Different naming conventions

b) Uniform formats

c) Single source

d) Clean data

Correct Answer: a) Different naming conventions

Explanation:

Synonyms or varying abbreviations across sources lead to inconsistencies.

40. In merging datasets, what is an inner join?

a) Only matching records

b) All records from left

c) All records from right

d) All records combined

Correct Answer: a) Only matching records

Explanation:

Inner join returns rows with matching keys in both datasets.

41. What is entropy-based discretization?

a) Supervised binning using class info

b) Unsupervised binning

c) Noise removal

d) Scaling

Correct Answer: a) Supervised binning using class info

Explanation:

It uses information gain to create bins that maximize class separation.

42. How is outlier impact assessed?

a) Box plots

b) Histograms

c) Scatter plots

d) All of the above

Correct Answer: d) All of the above

Explanation:

Visual tools like box plots and scatters help identify outliers.

43. When is listwise deletion used?

a) For few missing values

b) When data loss is acceptable

c) For categorical data

d) Always

Correct Answer: b) When data loss is acceptable

Explanation:

Listwise deletion drops every row that has any missing value, so it fits only when the remaining sample is still large enough.

44. What does binary encoding do to categories?

a) Assigns integers

b) Converts to binary bits

c) One-hot expands

d) Frequency maps

Correct Answer: b) Converts to binary bits

Explanation:

Binary encoding writes each category's integer code in binary, so n categories need only about log2(n) columns instead of the n columns one-hot encoding uses.
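
A simplified sketch of binary encoding is given below. It assigns integer codes starting at 0 in alphabetical order, which is an assumption of this example; library implementations may number categories differently.

```python
def binary_encode(values):
    """Encode each category as the bits of its integer code."""
    categories = sorted(set(values))
    # Bits needed to represent the largest code (at least one column).
    n_bits = max(1, (len(categories) - 1).bit_length())
    codes = {cat: i for i, cat in enumerate(categories)}
    encoded = [
        [int(bit) for bit in format(codes[v], f"0{n_bits}b")]
        for v in values
    ]
    return encoded, n_bits

cities = ["Paris", "Tokyo", "Lima", "Cairo", "Oslo"]
encoded, n_bits = binary_encode(cities)
print(n_bits)
print(encoded)
```

Five categories fit in three bit columns here, whereas one-hot encoding would need five.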

45. What is aggregation in transformation?

a) Summarizing data

b) Splitting data

c) Encoding

d) Imputing

Correct Answer: a) Summarizing data

Explanation:

Aggregation computes summaries like means or counts from groups.

46. How to detect multicollinearity?

a) Correlation matrix

b) VIF calculation

c) Both a and b

d) None

Correct Answer: c) Both a and b

Explanation:

High pairwise correlations or a large variance inflation factor (a common rule of thumb is VIF above 5 or 10) indicate multicollinearity.

47. What smoothing method uses regression?

a) Moving average

b) Regression smoothing

c) Binning

d) All

Correct Answer: b) Regression smoothing

Explanation:

Regression fits a model to local data for noise reduction.

48. Which Pandas method removes duplicates?

a) drop_duplicates()

b) unique()

c) value_counts()

d) replace()

Correct Answer: a) drop_duplicates()

Explanation:

drop_duplicates() eliminates repeated rows based on specified columns.

49. What is interpolation for time series missing data?

a) Linear estimation between points

b) Mean fill

c) Mode fill

d) Deletion

Correct Answer: a) Linear estimation between points

Explanation:

Interpolation estimates values using surrounding data points.
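
Linear interpolation can be sketched as follows, assuming the observations are evenly spaced in time. The function name and temperatures are illustrative.

```python
def linear_interpolate(values):
    """Fill interior None gaps on a straight line between the surrounding known points."""
    filled = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    for left, right in zip(known, known[1:]):
        gap = right - left
        for i in range(left + 1, right):
            weight = (i - left) / gap
            filled[i] = values[left] + weight * (values[right] - values[left])
    return filled

temps = [10.0, None, 14.0, None, None, None, 22.0]
interpolated = linear_interpolate(temps)
print(interpolated)
```

Gaps at the very start or end of the series are left untouched, since there is no point on one side to interpolate from.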

50. What is L1 normalization?

a) Sum to 1

b) Divide by max

c) Zero mean

d) Unit variance

Correct Answer: a) Sum to 1

Explanation:

L1 normalizes by dividing by the sum of absolute values.

51. What is entity resolution in cleaning?

a) Merging similar records

b) Removing noise

c) Scaling

d) Encoding

Correct Answer: a) Merging similar records

Explanation:

It identifies and merges duplicates across datasets.

52. What join includes all left records?

a) Inner join

b) Left join

c) Right join

d) Outer join

Correct Answer: b) Left join

Explanation:

Left join keeps all from left table, matching from right.

53. What is chi-merge discretization?

a) Supervised using chi-square

b) Unsupervised

c) Noise based

d) Scale based

Correct Answer: a) Supervised using chi-square

Explanation:

ChiMerge uses chi-square tests to determine bin boundaries.

54. What threshold for Z-score outlier?

a) 2

b) 3

c) 1

d) 4

Correct Answer: b) 3

Explanation:

Values beyond 3 standard deviations are typically outliers.

55. When to use pairwise deletion?

a) For correlations

b) For full dataset analysis

c) Always

d) For imputation

Correct Answer: a) For correlations

Explanation:

Pairwise uses available pairs, maximizing data for specific analyses.

56. What is target encoding?

a) Replace with mean target

b) Binary conversion

c) Frequency

d) One-hot

Correct Answer: a) Replace with mean target

Explanation:

Target encoding uses the mean of the target for each category.
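
A bare-bones sketch of target encoding is shown below, with hypothetical city and churn columns. It computes the means on the full data for brevity; in practice the means are usually computed on training data only, often with smoothing, to limit target leakage.

```python
from collections import defaultdict

def target_encode(categories, targets):
    """Replace each category with the mean target value observed for it."""
    sums, counts = defaultdict(float), defaultdict(int)
    for cat, target in zip(categories, targets):
        sums[cat] += target
        counts[cat] += 1
    means = {cat: sums[cat] / counts[cat] for cat in sums}
    return [means[cat] for cat in categories], means

city = ["A", "A", "B", "B", "B", "C"]
churned = [1, 0, 1, 1, 0, 0]
encoded, means = target_encode(city, churned)
print(means)
```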

57. What is windowing in smoothing?

a) Local averaging

b) Global scaling

c) Binning

d) Regression

Correct Answer: a) Local averaging

Explanation:

Windowing averages values in a sliding window to smooth noise.
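
The sliding-window idea can be sketched as a centered moving average. This example assumes an odd window size and shrinks the window at the edges; other edge policies are equally valid.

```python
def moving_average(values, window=3):
    """Centered moving average; edge positions use the neighbours that exist."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        chunk = values[lo:hi]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed

signal = [3, 3, 12, 3, 3]  # a single noisy spike
smoothed = moving_average(signal, window=3)
print(smoothed)
```

The spike of 12 is spread across its neighbours, which lowers the peak at the cost of blurring nearby values.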

58. What does df.dropna() do in Pandas?

a) Fills missing

b) Removes rows with missing

c) Detects missing

d) Counts missing

Correct Answer: b) Removes rows with missing

Explanation:

dropna() deletes rows or columns containing NaN values.

59. What is spline interpolation?

a) Piecewise polynomial fitting

b) Linear

c) Nearest neighbor

d) Constant

Correct Answer: a) Piecewise polynomial fitting

Explanation:

Splines use smooth polynomials between points for interpolation.

60. What is L2 normalization?

a) Euclidean norm to 1

b) Manhattan to 1

c) Max value

d) Mean zero

Correct Answer: a) Euclidean norm to 1

Explanation:

L2 divides by the square root of sum of squares.
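
L1 and L2 normalization differ only in the norm used as the divisor, as this sketch shows. Zero vectors are returned unchanged here, which is a choice made for this example.

```python
import math

def l1_normalize(vec):
    """Divide by the sum of absolute values so the absolute values sum to 1."""
    norm = sum(abs(x) for x in vec)
    return [x / norm for x in vec] if norm else list(vec)

def l2_normalize(vec):
    """Divide by the Euclidean length so the vector has length 1."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)

v = [3.0, -4.0]
l1 = l1_normalize(v)
l2 = l2_normalize(v)
print(l1)
print(l2)
```

With negative entries, it is the absolute values of an L1-normalized vector that sum to 1, not the raw entries.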

61. What is fuzzy matching?

a) Approximate string matching

b) Exact matching

c) Numerical scaling

d) Binning

Correct Answer: a) Approximate string matching

Explanation:

Fuzzy matching handles typos or variations in entity names.
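
One way to sketch fuzzy matching with only the standard library is `difflib.SequenceMatcher`. The 0.8 threshold and the company names are assumptions for illustration; real pipelines tune the threshold and often use dedicated string-distance libraries.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Similarity ratio in [0, 1] after basic case and whitespace normalization."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

def best_match(name, candidates, threshold=0.8):
    """Return the most similar candidate, or None if nothing reaches the threshold."""
    score, match = max((similarity(name, c), c) for c in candidates)
    return match if score >= threshold else None

companies = ["Microsoft Corporation", "Alphabet Inc.", "Amazon.com, Inc."]
match = best_match("Microsft Corporation", companies)  # typo: missing "o"
print(match)
```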

62. What is a full outer join?

a) All records from both

b) Only left

c) Only matching

d) Only right

Correct Answer: a) All records from both

Explanation:

Full outer join includes all rows, filling non-matches with nulls.
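
The inner, left, and full outer joins covered in this quiz can be compared side by side with a small sketch. It assumes key values are unique within each table, and the `join` helper and sample rows are hypothetical.

```python
def join(left, right, key, how="inner"):
    """Join two lists of dicts on `key`; how is 'inner', 'left', or 'outer'."""
    right_by_key = {row[key]: row for row in right}
    left_keys = {row[key] for row in left}
    left_cols = {c for row in left for c in row if c != key}
    right_cols = {c for row in right for c in row if c != key}
    result = []
    for row in left:
        match = right_by_key.get(row[key])
        if match is not None:
            result.append({**row, **match})
        elif how in ("left", "outer"):
            # Keep the unmatched left row, filling right-hand columns with None.
            result.append({**row, **{c: None for c in right_cols}})
    if how == "outer":
        for row in right:
            if row[key] not in left_keys:
                # Keep the unmatched right row, filling left-hand columns with None.
                result.append({**{c: None for c in left_cols}, **row})
    return result

customers = [{"id": 1, "name": "Ana"}, {"id": 2, "name": "Ben"}]
orders = [{"id": 2, "total": 50}, {"id": 3, "total": 75}]

inner_rows = join(customers, orders, "id", "inner")
left_rows = join(customers, orders, "id", "left")
outer_rows = join(customers, orders, "id", "outer")
print(len(inner_rows), len(left_rows), len(outer_rows))
```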

63. What is equal-width discretization?

a) Equal bin sizes

b) Equal frequency

c) Supervised

d) Noise based

Correct Answer: a) Equal bin sizes

Explanation:

Equal-width divides range into uniform intervals.

64. What is Mahalanobis distance for outliers?

a) Accounts for covariance

b) Euclidean only

c) Manhattan

d) Simple threshold

Correct Answer: a) Accounts for covariance

Explanation:

It measures distance considering variable correlations.

65. What is hot-deck imputation?

a) Random similar donor value

b) Mean fill

c) Mode

d) Regression

Correct Answer: a) Random similar donor value

Explanation:

Hot-deck selects from observed values in similar cases.

66. What is cold-deck imputation?

a) External donor values

b) Internal random

c) Mean

d) Delete

Correct Answer: a) External donor values

Explanation:

Cold-deck uses values from another dataset or time.

67. What is feature engineering in preprocessing?

a) Creating new features

b) Deleting features

c) Scaling only

d) Encoding only

Correct Answer: a) Creating new features

Explanation:

It derives informative variables from raw data.

68. What is log transformation used for?

a) Handling skewness

b) Categorical encoding

c) Missing fill

d) Duplicate removal

Correct Answer: a) Handling skewness

Explanation:

Log reduces right-skewness in positive data.

69. What is Box-Cox transformation?

a) Stabilizes variance

b) Normalizes only

c) Discretizes

d) Clusters

Correct Answer: a) Stabilizes variance

Explanation:

It estimates a power parameter (lambda) that makes strictly positive data closer to normal, which also tends to stabilize variance.

70. What is data profiling?

a) Summarizing data characteristics

b) Cleaning data

c) Modeling

d) Visualizing

Correct Answer: a) Summarizing data characteristics

Explanation:

Profiling assesses quality, structure, and content.

71. What is schema matching?

a) Aligning attributes across sources

b) Numerical scaling

c) Binning

d) Imputation

Correct Answer: a) Aligning attributes across sources

Explanation:

It resolves differences in data schemas during integration.

72. What is equal-frequency discretization?

a) Equal counts per bin

b) Equal widths

c) Supervised

d) Random

Correct Answer: a) Equal counts per bin

Explanation:

Quantile binning ensures similar sample sizes in bins.

73. What is Isolation Forest for outliers?

a) Anomaly isolation via trees

b) Clustering

c) Regression

d) PCA

Correct Answer: a) Anomaly isolation via trees

Explanation:

Random splits separate outliers from the rest in fewer steps than normal points, so short average path lengths signal anomalies.

74. What is multiple imputation?

a) Creates several filled datasets

b) Single mean fill

c) Deletion

d) Mode only

Correct Answer: a) Creates several filled datasets

Explanation:

It accounts for imputation uncertainty by analyzing each filled dataset separately and pooling the results.

75. What is polynomial feature generation?

a) Higher-order interactions

b) Linear only

c) Categorical

d) Text

Correct Answer: a) Higher-order interactions

Explanation:

It creates features like x^2 or x*y for non-linearity.

76. What is Yeo-Johnson transformation?

a) For negative values too

b) Positive only

c) Discretization

d) Encoding

Correct Answer: a) For negative values too

Explanation:

Extension of Box-Cox handling negative and zero values.

77. What is data validation in cleaning?

a) Ensuring accuracy and consistency

b) Visualization

c) Modeling

d) Storage

Correct Answer: a) Ensuring accuracy and consistency

Explanation:

Validation checks rules like range or format compliance.

78. What is record linkage?

a) Matching across datasets

b) Within dataset duplicates

c) Scaling

d) Binning

Correct Answer: a) Matching across datasets

Explanation:

It links records referring to the same entity.

79. What is unsupervised discretization?

a) No class labels used

b) Uses target

c) Supervised only

d) Regression based

Correct Answer: a) No class labels used

Explanation:

Methods like equal-width don't rely on target variables.

80. What is Local Outlier Factor (LOF)?

a) Density-based outlier score

b) Distance only

c) Global threshold

d) Simple Z-score

Correct Answer: a) Density-based outlier score

Explanation:

LOF compares local density to neighbors.

81. What is MICE imputation?

a) Multiple Imputation by Chained Equations

b) Single chain

c) Mean only

d) Delete

Correct Answer: a) Multiple Imputation by Chained Equations

Explanation:

Each variable with missing values is modeled in turn from the other variables, and the cycle repeats until the imputations stabilize.

82. What is interaction feature?

a) Product of two features

b) Single feature

c) Categorical

d) Target

Correct Answer: a) Product of two features

Explanation:

Captures combined effects, like age * income.

83. What is quantile transformation?

a) Maps to uniform distribution

b) Normalizes to Gaussian

c) Logs

d) Boxes

Correct Answer: a) Maps to uniform distribution

Explanation:

It maps each value through its rank (empirical CDF) to a target distribution, most commonly uniform; a normal target is also an option.

84. What is data auditing?

a) Systematic quality review

b) Random check

c) Visualization

d) Modeling

Correct Answer: a) Systematic quality review

Explanation:

Auditing identifies patterns of errors or anomalies.

85. What is object consolidation?

a) Merging duplicate entities

b) Splitting

c) Encoding

d) Scaling

Correct Answer: a) Merging duplicate entities

Explanation:

Part of integration resolving duplicates across sources.

86. What is clustering-based discretization?

a) Groups similar values

b) Equal width

c) Frequency

d) Chi-square

Correct Answer: a) Groups similar values

Explanation:

Uses clustering to form natural bins.

87. What is DBSCAN for outliers?

a) Density-based clustering flags noise

b) K-means

c) Hierarchical

d) Gaussian

Correct Answer: a) Density-based clustering flags noise

Explanation:

Points not in clusters are outliers in DBSCAN.

88. What is Kalman filter imputation?

a) For time series states

b) Static mean

c) Random

d) Mode

Correct Answer: a) For time series states

Explanation:

Predicts missing values using state-space models.

89. What is lagged feature?

a) Previous time step value

b) Future

c) Average

d) Sum

Correct Answer: a) Previous time step value

Explanation:

Used in time series for autoregressive features.

90. What is power transformation?

a) General family for normality

b) Log only

c) Square root

d) Reciprocal

Correct Answer: a) General family for normality

Explanation:

Includes Box-Cox and Yeo-Johnson for stabilizing variance.

91. What is referential integrity check?

a) Valid foreign keys

b) Data types

c) Ranges

d) Formats

Correct Answer: a) Valid foreign keys

Explanation:

Ensures links between tables are valid.

92. What is data deduplication?

a) Removing exact duplicates

b) Fuzzy only

c) Encoding

d) Scaling

Correct Answer: a) Removing exact duplicates

Explanation:

Identifies and eliminates identical records.

93. What is entropy-based binning?

a) Minimizes intra-bin impurity

b) Equal width

c) Frequency

d) Unsupervised

Correct Answer: a) Minimizes intra-bin impurity

Explanation:

Uses information entropy for supervised discretization.

94. What is one-class SVM for outliers?

a) Learns normal boundary

b) Supervised classification

c) Clustering

d) Regression

Correct Answer: a) Learns normal boundary

Explanation:

Flags points outside the learned normal region.

95. What is EM algorithm for imputation?

a) Expectation-Maximization

b) Simple mean

c) KNN

d) Delete

Correct Answer: a) Expectation-Maximization

Explanation:

It alternates between estimating the missing values given the current parameters and re-estimating the parameters given the filled data.

96. What is rolling window feature?

a) Moving statistics

b) Static

c) Lagged only

d) Future

Correct Answer: a) Moving statistics

Explanation:

Computes aggregates over time windows.

97. What is arcsinh transformation?

a) For heavy-tailed data

b) Log

c) Square

d) Identity

Correct Answer: a) For heavy-tailed data

Explanation:

The inverse hyperbolic sine behaves like a log for large values but is also defined for zero and negative values.

98. What is domain validation?

a) Business rule checks

b) Syntax only

c) Length

d) Type

Correct Answer: a) Business rule checks

Explanation:

Ensures data fits domain-specific logic.

99. What is survivorship bias in cleaning?

a) Ignoring failed entities

b) Duplicates

c) Missing

d) Noise

Correct Answer: a) Ignoring failed entities

Explanation:

Avoid it by keeping records for entities that failed or dropped out, not only those that survived to the present.

100. What is k-bins discretization?

a) Optimal bins via dynamic programming

b) Equal

c) Random

d) Density

Correct Answer: a) Optimal bins via dynamic programming

Explanation:

Chooses k bin boundaries that minimize within-bin error, which can be solved exactly in one dimension with dynamic programming.

101. What is elliptic envelope for outliers?

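a) Robust Gaussian covariance fit

b) Simple threshold

c) KNN

d) Tree

Correct Answer: a) Robust Gaussian covariance fit

Explanation:

Assumes the inliers follow a single Gaussian and fits a robust covariance estimate (minimum covariance determinant); points with a large Mahalanobis distance from that fit are flagged.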

102. What is random forest imputation?

a) Tree-based predictions

b) Linear

c) Mean

d) Mode

Correct Answer: a) Tree-based predictions

Explanation:

Trains random forests on the observed rows to predict each variable's missing values from the other features.

103. What is Fourier transform feature?

a) Frequency domain for time series

b) Spatial

c) Categorical

d) Text

Correct Answer: a) Frequency domain for time series

Explanation:

Extracts periodic components.

104. What is square root transformation?

a) For count data skewness

b) Log

c) Power 3

d) Reciprocal

Correct Answer: a) For count data skewness

Explanation:

Stabilizes variance in Poisson-like count data, where the variance grows with the mean.

105. What is completeness check?

a) No missing required fields

b) Format

c) Range

d) Consistency

Correct Answer: a) No missing required fields

Explanation:

Verifies all mandatory data is present.

106. What is selection bias in data cleaning?

a) Non-random sampling

b) Duplicates

c) Outliers

d) Noise

Correct Answer: a) Non-random sampling

Explanation:

Address it by understanding how the sample was collected and checking whether it represents the target population.

107. What is CAIM discretization?

a) Class-Attribute Interdependence Maximization

b) Equal

c) Width

d) Frequency

Correct Answer: a) Class-Attribute Interdependence Maximization

Explanation:

Supervised method maximizing dependency.

108. What is COPOD for outliers?

a) Copula-based

b) Distance

c) Density

d) Tree

Correct Answer: a) Copula-based

Explanation:

Unsupervised using copulas for dependence.

109. What is Bayesian imputation?

a) Probabilistic filling

b) Deterministic

c) Mean

d) KNN

Correct Answer: a) Probabilistic filling

Explanation:

Incorporates prior distributions for estimates.

110. What is wavelet transform feature?

a) Multi-resolution analysis

b) Frequency only

c) Time only

d) Static

Correct Answer: a) Multi-resolution analysis

Explanation:

Decomposes signals into time-frequency components.

111. What is reciprocal transformation?

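a) 1/x for strong right-skew

b) Sqrt

c) Log

d) Square

Correct Answer: a) 1/x for strong right-skew

Explanation:

Taking 1/x compresses large values strongly, so it is used for severe right (positive) skew in positive data; it also reverses the ordering of the values.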

112. What is uniqueness check?

a) No unintended duplicates

b) Missing

c) Range

d) Type

Correct Answer: a) No unintended duplicates

Explanation:

Ensures primary keys are unique.

113. What is temporal bias in cleaning?

a) Time-period specific data

b) Spatial

c) Selection

d) Confirmation

Correct Answer: a) Time-period specific data

Explanation:

Address it by balancing the data across time periods.

114. What is MDLP discretization?

a) Minimum Description Length Principle

b) Equal width

c) Frequency

d) Clustering

Correct Answer: a) Minimum Description Length Principle

Explanation:

Supervised stopping criterion for binning.

115. What is KNN for outliers?

a) Distance to neighbors

b) Density

c) Copula

d) Tree

Correct Answer: a) Distance to neighbors

Explanation:

High distance indicates isolation.

116. What is matrix factorization imputation?

a) Low-rank approximation

b) Regression

c) Tree

d) Mean

Correct Answer: a) Low-rank approximation

Explanation:

Fills sparse matrices like in recommender systems.

117. What is embedding feature for text?

a) Dense vector representations

b) Bag of words

c) TF-IDF

d) Count

Correct Answer: a) Dense vector representations

Explanation:

Word2Vec or BERT captures semantic meaning.

118. What is cube root transformation?

a) Milder than log for skewness

b) Stronger

c) Reciprocal

d) Square

Correct Answer: a) Milder than log for skewness

Explanation:

x^(1/3) for moderate right-skew.

119. What is timeliness check?

a) Data currency

b) Accuracy

c) Completeness

d) Consistency

Correct Answer: a) Data currency

Explanation:

Verifies data is up-to-date.

120. What is confirmation bias in cleaning?

a) Retaining supporting data

b) Random

c) Temporal

d) Spatial

Correct Answer: a) Retaining supporting data

Explanation:

Avoid it by fixing objective inclusion and exclusion criteria before looking at the results.

121. What is FUSINTER discretization?

a) Fuzzy unsupervised

b) Supervised

c) Equal

d) Chi

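Correct Answer: b) Supervised

Explanation:

FUSINTER is a supervised, bottom-up method that merges adjacent intervals using a class-based criterion; despite the name, it is not a fuzzy technique.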

122. What is angle-based outlier detection?

a) ABOD using angles

b) Distance

c) Density

d) Variance

Correct Answer: a) ABOD using angles

Explanation:

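Angles between difference vectors suffer less from the curse of dimensionality than distances, although the exact method is computationally expensive.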

123. What is deep learning imputation?

a) Autoencoder-based

b) Linear

c) Tree

d) KNN

Correct Answer: a) Autoencoder-based

Explanation:

Learns latent representations for filling.

124. What is PCA feature for images?

a) Eigenfaces

b) Pixel counts

c) Colors

d) Sizes

Correct Answer: a) Eigenfaces

Explanation:

Reduces dimensionality in face recognition.

125. What is exponential transformation?

a) For left-skew to right

b) Right to left

c) Normal

d) Uniform

Correct Answer: a) For left-skew to right

Explanation:

e^x stretches larger values more than smaller ones, which lengthens the right tail and can offset left (negative) skew.

126. What is accuracy check?

a) Matches reality

b) Format

c) Uniqueness

d) Timeliness

Correct Answer: a) Matches reality

Explanation:

Verifies data correctness against sources.

127. What is spatial bias in data?

a) Geographic imbalance

b) Time

c) Confirmation

d) Selection

Correct Answer: a) Geographic imbalance

Explanation:

Address it by sampling across regions so that no area is over- or under-represented.
