120 Data Cleaning and Preprocessing in Data Analysis - MCQs

120 industry-level multiple-choice questions on data cleaning, handling missing values, outliers, encoding, scaling, and preprocessing pipelines—modeled after real data scientist and analyst interviews at FAANG, fintech, and consulting firms.

1. Which of the following is not a key aspect of data quality in preprocessing?

a) Accuracy
b) Completeness
c) Consistency
d) Database size
✅ Correct Answer: d) Database size
📝 Explanation:
Data quality focuses on accuracy, completeness, and consistency, not the size of the database.

2. What is a common method to handle missing class labels in a dataset?

a) Ignoring the tuple
b) Generating a duplicate tuple
c) Deleting the entire dataset
d) Increasing the dataset size
✅ Correct Answer: a) Ignoring the tuple
📝 Explanation:
Ignoring tuples with missing class labels is a simple strategy when the impact is minimal.

3. In data cleaning, what technique uses statistical methods to fill missing values?

a) Spooling
b) Decision tree induction
c) Numerosity reduction
d) Aggregation
✅ Correct Answer: b) Decision tree induction
📝 Explanation:
Decision trees can predict probable values for missing data based on other attributes.

4. What is the primary purpose of data preprocessing in data analysis?

a) To increase data volume
b) To prepare data for effective analysis
c) To delete all data
d) To visualize data directly
✅ Correct Answer: b) To prepare data for effective analysis
📝 Explanation:
Preprocessing transforms raw data into a format suitable for analysis and modeling.

5. Which technique is used to standardize feature ranges in preprocessing?

a) Binning
b) Normalization
c) Clustering
d) Regression
✅ Correct Answer: b) Normalization
📝 Explanation:
Normalization rescales features to a common range, such as 0 to 1, so that features with larger magnitudes do not dominate.

6. What does handling missing values prevent in data analysis?

a) Data expansion
b) Biased model training
c) Data visualization
d) Feature creation
✅ Correct Answer: b) Biased model training
📝 Explanation:
Unaddressed missing values can skew results and lead to inaccurate predictions.

7. Which method imputes missing values using the average of similar instances?

a) Mean imputation
b) K-NN imputation
c) Deletion
d) Mode imputation
✅ Correct Answer: b) K-NN imputation
📝 Explanation:
K-Nearest Neighbors finds similar data points to estimate missing values.

8. In data cleaning, what is 'binning' primarily used for?

a) Smoothing noisy data
b) Removing duplicates
c) Encoding categories
d) Scaling features
✅ Correct Answer: a) Smoothing noisy data
📝 Explanation:
Binning groups data into bins and replaces values with bin averages to reduce noise.
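
Smoothing by bin means can be sketched as follows. This is a minimal illustration that assumes equal-frequency bins over sorted values; the function name and the sample prices are hypothetical.

```python
import numpy as np

def smooth_by_bin_means(values, bin_size):
    """Sort values, split them into bins of bin_size, replace each value by its bin mean."""
    ordered = np.sort(np.asarray(values, dtype=float))
    smoothed = ordered.copy()
    for start in range(0, len(ordered), bin_size):
        chunk = ordered[start:start + bin_size]
        smoothed[start:start + bin_size] = chunk.mean()
    return smoothed

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]  # illustrative data
smoothed = smooth_by_bin_means(prices, bin_size=3)
# Bins (4, 8, 15), (21, 21, 24), (25, 28, 34) become 9, 22, and 29.
```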

9. What is the risk of not removing duplicates in preprocessing?

a) Overfitting in models
b) Underestimation of variance
c) Increased computational cost
d) All of the above
✅ Correct Answer: d) All of the above
📝 Explanation:
Duplicates can bias models toward repeated records, understate variance estimates, and slow processing.

10. Which encoding converts categorical data into binary vectors?

a) Label encoding
b) One-hot encoding
c) Target encoding
d) Frequency encoding
✅ Correct Answer: b) One-hot encoding
📝 Explanation:
One-hot encoding creates dummy variables for each category without ordinal assumptions.
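
A short sketch of one-hot encoding with pandas, assuming a hypothetical `color` column:

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# One indicator column per category; dtype=int yields 0/1 instead of booleans.
encoded = pd.get_dummies(df, columns=["color"], dtype=int)
```

Each row has exactly one indicator set, and no order is implied between the categories.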

11. What is 'data wrangling' in the context of preprocessing?

a) Data deletion
b) Transforming messy data into clean format
c) Data visualization
d) Model training
✅ Correct Answer: b) Transforming messy data into clean format
📝 Explanation:
Data wrangling involves cleaning and restructuring data for analysis.

12. Which technique detects outliers using statistical thresholds?

a) Z-score method
b) K-means clustering
c) PCA
d) Binning
✅ Correct Answer: a) Z-score method
📝 Explanation:
The Z-score method commonly flags points more than 3 standard deviations from the mean as outliers.
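
The Z-score rule might look like this in NumPy. It is a sketch that assumes the population standard deviation and a threshold of 3; the sample values are made up.

```python
import numpy as np

def zscore_outliers(values, threshold=3.0):
    """Return the values whose absolute z-score exceeds the threshold."""
    x = np.asarray(values, dtype=float)
    z = (x - x.mean()) / x.std()  # population std (ddof=0)
    return x[np.abs(z) > threshold]

values = [10, 11, 9, 10, 12, 10, 9, 11, 10, 10,
          11, 9, 10, 12, 10, 9, 11, 10, 10, 100]
outliers = zscore_outliers(values)
```

With very small samples the largest possible z-score is bounded, so a threshold of 3 may never trigger.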

13. What is the main goal of feature scaling in preprocessing?

a) To reduce dimensions
b) To make features comparable
c) To encode text
d) To handle missing data
✅ Correct Answer: b) To make features comparable
📝 Explanation:
Scaling ensures no feature dominates due to differing units or ranges.

14. In handling categorical data, when is label encoding appropriate?

a) For nominal data
b) For ordinal data
c) For text data
d) For numerical data
✅ Correct Answer: b) For ordinal data
📝 Explanation:
Label encoding assigns numbers based on order, suitable for ordinal categories.

15. What does 'data integration' involve in preprocessing?

a) Combining multiple data sources
b) Removing noise
c) Scaling values
d) Encoding features
✅ Correct Answer: a) Combining multiple data sources
📝 Explanation:
Data integration merges datasets from different sources into a unified view.

16. Which imputation method is best for non-numerical data?

a) Mean imputation
b) Mode imputation
c) K-NN imputation
d) Regression imputation
✅ Correct Answer: b) Mode imputation
📝 Explanation:
Mode imputation uses the most frequent value for categorical missing data.

17. What is a potential issue with mean imputation?

a) Reduces variance
b) Increases data size
c) Deletes rows
d) Encodes categories
✅ Correct Answer: a) Reduces variance
📝 Explanation:
Mean imputation can underestimate variability in the dataset.

18. Which method groups data into buckets for smoothing?

a) Regression
b) Binning
c) Clustering
d) Normalization
✅ Correct Answer: b) Binning
📝 Explanation:
Binning sorts data into intervals and smooths by boundary or mean values.

19. What is 'data reduction' in preprocessing?

a) Increasing dataset size
b) Reducing data volume while preserving information
c) Deleting all data
d) Visualizing data
✅ Correct Answer: b) Reducing data volume while preserving information
📝 Explanation:
Data reduction techniques like PCA minimize data size without losing key insights.

20. When should you use median imputation over mean?

a) For symmetric distributions
b) For skewed distributions
c) For categorical data
d) For ordinal data
✅ Correct Answer: b) For skewed distributions
📝 Explanation:
Median is robust to outliers and skewness, unlike the mean.
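
A small comparison of mean and median imputation on a skewed column; the income figures are illustrative only.

```python
import numpy as np
import pandas as pd

income = pd.Series([30, 32, 35, 38, 1000, np.nan])  # one extreme value, one gap

mean_filled = income.fillna(income.mean())      # fill value pulled up by 1000
median_filled = income.fillna(income.median())  # fill value stays near the bulk
```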

21. What does PCA stand for in dimensionality reduction?

a) Principal Component Analysis
b) Primary Cluster Aggregation
c) Principal Correlation Adjustment
d) Primary Component Allocation
✅ Correct Answer: a) Principal Component Analysis
📝 Explanation:
PCA transforms data into principal components to reduce dimensions.

22. Which step checks for data consistency across sources?

a) Data cleaning
b) Data integration
c) Data transformation
d) Data reduction
✅ Correct Answer: b) Data integration
📝 Explanation:
Integration resolves inconsistencies when merging multiple data sources.

23. What is a common way to handle outliers in preprocessing?

a) Ignore them
b) Cap or floor values
c) Increase dataset size
d) Encode as categories
✅ Correct Answer: b) Cap or floor values
📝 Explanation:
Capping (winsorizing) limits extreme values to thresholds such as the 1st and 99th percentiles or the IQR fences.

24. In Python, which function detects missing values in Pandas?

a) isnull()
b) dropna()
c) fillna()
d) describe()
✅ Correct Answer: a) isnull()
📝 Explanation:
Pandas' isnull() returns a boolean mask for missing values.

25. What is 'forward fill' in handling missing data?

a) Filling with previous value
b) Filling with next value
c) Filling with mean
d) Deleting rows
✅ Correct Answer: a) Filling with previous value
📝 Explanation:
Forward fill propagates the last valid observation forward.

26. Which normalization brings data to zero mean and unit variance?

a) Min-max scaling
b) Z-score normalization
c) Decimal scaling
d) L1 normalization
✅ Correct Answer: b) Z-score normalization
📝 Explanation:
Z-score uses mean and standard deviation for standardization.
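
For comparison, min-max scaling and z-score standardization can be written directly in NumPy. This is a sketch on a toy array, not a replacement for a fitted scaler.

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])

# Min-max scaling: maps the smallest value to 0 and the largest to 1.
min_max = (x - x.min()) / (x.max() - x.min())

# Z-score standardization: zero mean and unit (population) standard deviation.
z = (x - x.mean()) / x.std()
```

In a modeling workflow the minimum, maximum, mean, and standard deviation would normally be estimated on training data and reused on new data.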

27. What issue arises from inconsistent data formats?

a) Easy analysis
b) Parsing errors
c) Faster processing
d) Better visualization
✅ Correct Answer: b) Parsing errors
📝 Explanation:
Inconsistent formats like dates can cause loading or computation failures.

28. Which technique merges datasets on common keys?

a) Concatenation
b) Joining
c) Binning
d) Imputation
✅ Correct Answer: b) Joining
📝 Explanation:
Joining combines tables based on matching keys like IDs.

29. What is 'data discretization' used for?

a) Continuous to categorical conversion
b) Noise removal
c) Duplicate detection
d) Scaling
✅ Correct Answer: a) Continuous to categorical conversion
📝 Explanation:
Discretization bins continuous values into discrete intervals.

30. In outlier detection, what does IQR stand for?

a) Interquartile Range
b) Integrated Query Response
c) Internal Quality Ratio
d) Interval Quality Reduction
✅ Correct Answer: a) Interquartile Range
📝 Explanation:
The IQR method flags values below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR, where IQR = Q3 - Q1.
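
The IQR fences can be computed as below. The sketch assumes NumPy's default linear percentile interpolation and the conventional multiplier of 1.5; the data is illustrative.

```python
import numpy as np

def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR] together with the fences."""
    x = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return x[(x < lower) | (x > upper)], (lower, upper)

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 50]
outliers, (lower, upper) = iqr_outliers(data)
```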

31. What is a disadvantage of deleting rows with missing values?

a) Data loss
b) Increased accuracy
c) Faster processing
d) Better scaling
✅ Correct Answer: a) Data loss
📝 Explanation:
Deletion reduces sample size, potentially biasing the dataset.

32. Which encoding preserves category frequencies?

a) One-hot encoding
b) Frequency encoding
c) Label encoding
d) Binary encoding
✅ Correct Answer: b) Frequency encoding
📝 Explanation:
Frequency encoding replaces categories with their occurrence counts.

33. What does 'data transformation' include?

a) Normalization and aggregation
b) Only deletion
c) Visualization
d) Model training
✅ Correct Answer: a) Normalization and aggregation
📝 Explanation:
Transformation alters data structure, like normalizing or aggregating.

34. How do you handle multicollinearity in preprocessing?

a) Remove correlated features
b) Add more data
c) Ignore it
d) Scale only
✅ Correct Answer: a) Remove correlated features
📝 Explanation:
Removing highly correlated features reduces redundancy and instability.

35. What is 'noise' in data cleaning?

a) Random errors or variances
b) Missing values
c) Duplicates
d) Outliers
✅ Correct Answer: a) Random errors or variances
📝 Explanation:
Noise is random error or variance in a measured variable that obscures the underlying pattern.

36. Which library in Python is used for data manipulation?

a) Matplotlib
b) Pandas
c) Scikit-learn
d) NumPy
✅ Correct Answer: b) Pandas
📝 Explanation:
Pandas provides DataFrames for efficient data cleaning and transformation.

37. What is 'backward fill' for missing data?

a) Filling with next value
b) Filling with previous value
c) Filling with mean
d) Deleting
✅ Correct Answer: a) Filling with next value
📝 Explanation:
Backward fill uses the next valid observation to fill gaps.
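
Forward fill and backward fill can be compared side by side in pandas; the series here is a made-up example.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan])

forward = s.ffill()   # carries the last valid value forward: 1, 1, 1, 4, 4
backward = s.bfill()  # pulls the next valid value back: 1, 4, 4, 4, NaN
```

The trailing gap stays missing under backward fill because no later value exists.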

38. Which method is robust to outliers in scaling?

a) Min-max scaling
b) Robust scaling
c) Z-score
d) Log scaling
✅ Correct Answer: b) Robust scaling
📝 Explanation:
Robust scaling centers on the median and divides by the IQR, so extreme values have little influence on the scale.

39. What causes data inconsistency?

a) Different naming conventions
b) Uniform formats
c) Single source
d) Clean data
✅ Correct Answer: a) Different naming conventions
📝 Explanation:
Synonyms or varying abbreviations across sources lead to inconsistencies.

40. In merging datasets, what is an inner join?

a) Only matching records
b) All records from left
c) All records from right
d) All records combined
✅ Correct Answer: a) Only matching records
📝 Explanation:
Inner join returns rows with matching keys in both datasets.

41. What is entropy-based discretization?

a) Supervised binning using class info
b) Unsupervised binning
c) Noise removal
d) Scaling
✅ Correct Answer: a) Supervised binning using class info
📝 Explanation:
It uses information gain to create bins that maximize class separation.

42. How is outlier impact assessed?

a) Box plots
b) Histograms
c) Scatter plots
d) All of the above
✅ Correct Answer: d) All of the above
📝 Explanation:
Visual tools like box plots and scatters help identify outliers.

43. When is listwise deletion used?

a) For few missing values
b) When data loss is acceptable
c) For categorical data
d) Always
✅ Correct Answer: b) When data loss is acceptable
📝 Explanation:
Listwise deletion drops every row that has any missing value, so it suits cases where the remaining sample is still large enough.

44. What does binary encoding do to categories?

a) Assigns integers
b) Converts to binary bits
c) One-hot expands
d) Frequency maps
✅ Correct Answer: b) Converts to binary bits
📝 Explanation:
Binary encoding writes each category's integer code in binary, so n categories need about log2(n) columns instead of the n columns one-hot encoding uses.
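
A minimal pure-Python sketch of binary encoding, which needs about ceil(log2(n)) bit columns for n categories. The helper name and the zero-based integer codes are assumptions; some libraries start codes at 1.

```python
import math

def binary_encode(categories):
    """Map each distinct category to a fixed-width list of bits."""
    levels = sorted(set(categories))
    width = max(1, math.ceil(math.log2(len(levels))))
    codes = {}
    for index, level in enumerate(levels):
        codes[level] = [int(bit) for bit in format(index, f"0{width}b")]
    return codes

cities = ["Paris", "Tokyo", "Lima", "Cairo", "Oslo"]
codes = binary_encode(cities)  # 5 categories fit in 3 bit columns
```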

45. What is aggregation in transformation?

a) Summarizing data
b) Splitting data
c) Encoding
d) Imputing
✅ Correct Answer: a) Summarizing data
📝 Explanation:
Aggregation computes summaries like means or counts from groups.

46. How to detect multicollinearity?

a) Correlation matrix
b) VIF calculation
c) Both a and b
d) None
✅ Correct Answer: c) Both a and b
📝 Explanation:
High pairwise correlations or a VIF above a rule-of-thumb cutoff (commonly 5 or 10) indicate multicollinearity.
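
One way to compute VIF without extra libraries is to read it off the diagonal of the inverse correlation matrix, which equals 1 / (1 - R²) for each feature regressed on the others. The sketch below uses synthetic data with a fixed seed; the function name is hypothetical.

```python
import numpy as np

def vif_from_correlation(X):
    """VIF per column, taken from the diagonal of the inverse correlation matrix."""
    corr = np.corrcoef(np.asarray(X, dtype=float), rowvar=False)
    return np.diag(np.linalg.inv(corr))

rng = np.random.default_rng(0)        # fixed seed, synthetic data
x1 = rng.normal(size=200)
x2 = x1 + 0.1 * rng.normal(size=200)  # nearly a copy of x1
x3 = rng.normal(size=200)             # unrelated feature
vif = vif_from_correlation(np.column_stack([x1, x2, x3]))
```

The first two columns should show a high VIF, while the unrelated third column stays close to 1.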

47. What smoothing method uses regression?

a) Moving average
b) Regression smoothing
c) Binning
d) All
✅ Correct Answer: b) Regression smoothing
📝 Explanation:
Regression fits a model to local data for noise reduction.

48. Which Pandas method removes duplicates?

a) drop_duplicates()
b) unique()
c) value_counts()
d) replace()
✅ Correct Answer: a) drop_duplicates()
📝 Explanation:
drop_duplicates() eliminates repeated rows based on specified columns.
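
A brief sketch of `drop_duplicates()` on a hypothetical table, first across all columns and then on a subset:

```python
import pandas as pd

df = pd.DataFrame({
    "email": ["a@x.com", "b@x.com", "a@x.com", "a@x.com"],
    "plan": ["free", "pro", "free", "pro"],
})

exact = df.drop_duplicates()                                  # identical rows only
by_email = df.drop_duplicates(subset=["email"], keep="first")  # one row per email
```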

49. What is interpolation for time series missing data?

a) Linear estimation between points
b) Mean fill
c) Mode fill
d) Deletion
✅ Correct Answer: a) Linear estimation between points
📝 Explanation:
Interpolation estimates values using surrounding data points.
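
Linear interpolation in pandas might look like this, assuming evenly spaced readings (the values are illustrative):

```python
import numpy as np
import pandas as pd

readings = pd.Series([10.0, np.nan, np.nan, 16.0, np.nan, 20.0])

# Each gap is filled on the straight line between its neighbouring valid points.
filled = readings.interpolate(method="linear")
```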

50. What is L1 normalization?

a) Sum to 1
b) Divide by max
c) Zero mean
d) Unit variance
✅ Correct Answer: a) Sum to 1
📝 Explanation:
L1 normalizes by dividing by the sum of absolute values.

51. What is entity resolution in cleaning?

a) Merging similar records
b) Removing noise
c) Scaling
d) Encoding
✅ Correct Answer: a) Merging similar records
📝 Explanation:
It identifies and merges duplicates across datasets.

52. What join includes all left records?

a) Inner join
b) Left join
c) Right join
d) Outer join
✅ Correct Answer: b) Left join
📝 Explanation:
Left join keeps all from left table, matching from right.

53. What is chi-merge discretization?

a) Supervised using chi-square
b) Unsupervised
c) Noise based
d) Scale based
✅ Correct Answer: a) Supervised using chi-square
📝 Explanation:
ChiMerge uses chi-square tests to determine bin boundaries.

54. What Z-score threshold is commonly used to flag outliers?

a) 2
b) 3
c) 1
d) 4
✅ Correct Answer: b) 3
📝 Explanation:
Values beyond 3 standard deviations are typically outliers.

55. When to use pairwise deletion?

a) For correlations
b) For full dataset analysis
c) Always
d) For imputation
✅ Correct Answer: a) For correlations
📝 Explanation:
Pairwise uses available pairs, maximizing data for specific analyses.

56. What is target encoding?

a) Replace with mean target
b) Binary conversion
c) Frequency
d) One-hot
✅ Correct Answer: a) Replace with mean target
📝 Explanation:
Target encoding uses the mean of the target for each category.
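
Target encoding can be sketched with a group-by mean. The column names are hypothetical, and in practice the means are usually computed on training folds only to avoid leaking the target.

```python
import pandas as pd

train = pd.DataFrame({
    "city": ["A", "A", "B", "B", "B", "C"],
    "churned": [1, 0, 1, 1, 0, 0],
})

# Mean of the target per category, then mapped back onto each row.
city_means = train.groupby("city")["churned"].mean()
train["city_encoded"] = train["city"].map(city_means)
```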

57. What is windowing in smoothing?

a) Local averaging
b) Global scaling
c) Binning
d) Regression
✅ Correct Answer: a) Local averaging
📝 Explanation:
Windowing averages values in a sliding window to smooth noise.

58. What does df.dropna() do in Pandas?

a) Fills missing
b) Removes rows with missing
c) Detects missing
d) Counts missing
✅ Correct Answer: b) Removes rows with missing
📝 Explanation:
dropna() deletes rows or columns containing NaN values.
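
Detecting and dropping missing values typically go together. A small sketch on a made-up DataFrame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25, np.nan, 31], "city": ["Oslo", "Lima", None]})

missing_per_column = df.isnull().sum()   # count of missing values per column
complete_rows = df.dropna()              # keep rows with no missing values
age_known = df.dropna(subset=["age"])    # only require "age" to be present
```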

59. What is spline interpolation?

a) Piecewise polynomial fitting
b) Linear
c) Nearest neighbor
d) Constant
✅ Correct Answer: a) Piecewise polynomial fitting
📝 Explanation:
Splines use smooth polynomials between points for interpolation.

60. What is L2 normalization?

a) Euclidean norm to 1
b) Manhattan to 1
c) Max value
d) Mean zero
✅ Correct Answer: a) Euclidean norm to 1
📝 Explanation:
L2 divides by the square root of sum of squares.
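
L1 and L2 normalization of a single vector, shown on a toy example:

```python
import numpy as np

v = np.array([3.0, -4.0])

l1 = v / np.abs(v).sum()          # absolute values now sum to 1
l2 = v / np.sqrt((v ** 2).sum())  # Euclidean length is now 1: [0.6, -0.8]
```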

61. What is fuzzy matching?

a) Approximate string matching
b) Exact matching
c) Numerical scaling
d) Binning
✅ Correct Answer: a) Approximate string matching
📝 Explanation:
Fuzzy matching handles typos or variations in entity names.

62. What is a full outer join?

a) All records from both
b) Only left
c) Only matching
d) Only right
✅ Correct Answer: a) All records from both
📝 Explanation:
Full outer join includes all rows, filling non-matches with nulls.
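
The join types can be compared with `DataFrame.merge`; the two tables below are hypothetical.

```python
import pandas as pd

customers = pd.DataFrame({"id": [1, 2, 3], "name": ["Ana", "Ben", "Cy"]})
orders = pd.DataFrame({"id": [2, 3, 4], "amount": [20, 30, 40]})

inner = customers.merge(orders, on="id", how="inner")  # ids 2 and 3
left = customers.merge(orders, on="id", how="left")    # ids 1, 2, 3; amount missing for 1
outer = customers.merge(orders, on="id", how="outer")  # ids 1 to 4; gaps filled with NaN
```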

63. What is equal-width discretization?

a) Equal bin sizes
b) Equal frequency
c) Supervised
d) Noise based
✅ Correct Answer: a) Equal bin sizes
📝 Explanation:
Equal-width divides range into uniform intervals.

64. What is Mahalanobis distance for outliers?

a) Accounts for covariance
b) Euclidean only
c) Manhattan
d) Simple threshold
✅ Correct Answer: a) Accounts for covariance
📝 Explanation:
It measures distance considering variable correlations.
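
A sketch of Mahalanobis distance in NumPy, using synthetic correlated data with a fixed seed. The appended point (2, -2) is unremarkable on each axis alone but breaks the correlation, so it should receive the largest distance.

```python
import numpy as np

def mahalanobis_distances(X):
    """Distance of each row from the sample mean, scaled by the inverse covariance."""
    X = np.asarray(X, dtype=float)
    diff = X - X.mean(axis=0)
    inv_cov = np.linalg.inv(np.cov(X, rowvar=False))
    return np.sqrt(np.einsum("ij,jk,ik->i", diff, inv_cov, diff))

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = x + 0.1 * rng.normal(size=200)  # strongly correlated with x
X = np.vstack([np.column_stack([x, y]), [2.0, -2.0]])
d = mahalanobis_distances(X)
```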

65. What is hot-deck imputation?

a) Random similar donor value
b) Mean fill
c) Mode
d) Regression
✅ Correct Answer: a) Random similar donor value
📝 Explanation:
Hot-deck selects from observed values in similar cases.

66. What is cold-deck imputation?

a) External donor values
b) Internal random
c) Mean
d) Delete
✅ Correct Answer: a) External donor values
📝 Explanation:
Cold-deck uses values from another dataset or time.

67. What is feature engineering in preprocessing?

a) Creating new features
b) Deleting features
c) Scaling only
d) Encoding only
✅ Correct Answer: a) Creating new features
📝 Explanation:
It derives informative variables from raw data.

68. What is log transformation used for?

a) Handling skewness
b) Categorical encoding
c) Missing fill
d) Duplicate removal
✅ Correct Answer: a) Handling skewness
📝 Explanation:
Log reduces right-skewness in positive data.

69. What is Box-Cox transformation?

a) Stabilizes variance
b) Normalizes only
c) Discretizes
d) Clusters
✅ Correct Answer: a) Stabilizes variance
📝 Explanation:
It finds optimal power to make data more normal.

70. What is data profiling?

a) Summarizing data characteristics
b) Cleaning data
c) Modeling
d) Visualizing
✅ Correct Answer: a) Summarizing data characteristics
📝 Explanation:
Profiling assesses quality, structure, and content.

71. What is schema matching?

a) Aligning attributes across sources
b) Numerical scaling
c) Binning
d) Imputation
✅ Correct Answer: a) Aligning attributes across sources
📝 Explanation:
It resolves differences in data schemas during integration.

72. What is equal-frequency discretization?

a) Equal counts per bin
b) Equal widths
c) Supervised
d) Random
✅ Correct Answer: a) Equal counts per bin
📝 Explanation:
Quantile binning ensures similar sample sizes in bins.
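
Equal-width and equal-frequency discretization differ most on skewed data. A sketch with `pd.cut` and `pd.qcut` on illustrative ages:

```python
import pandas as pd

ages = pd.Series([18, 19, 20, 21, 22, 23, 40, 80])

# Equal width: four intervals of the same length, so most values share bin 0.
equal_width = pd.cut(ages, bins=4, labels=False)

# Equal frequency: quartile boundaries, so each bin holds two values.
equal_freq = pd.qcut(ages, q=4, labels=False)
```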

73. What is Isolation Forest for outliers?

a) Anomaly isolation via trees
b) Clustering
c) Regression
d) PCA
✅ Correct Answer: a) Anomaly isolation via trees
📝 Explanation:
Outliers are isolated by fewer random splits than normal points, so short average path lengths indicate anomalies.

74. What is multiple imputation?

a) Creates several filled datasets
b) Single mean fill
c) Deletion
d) Mode only
✅ Correct Answer: a) Creates several filled datasets
📝 Explanation:
It accounts for imputation uncertainty by analyzing each filled dataset and pooling the results.

75. What is polynomial feature generation?

a) Higher-order interactions
b) Linear only
c) Categorical
d) Text
✅ Correct Answer: a) Higher-order interactions
📝 Explanation:
It creates features like x^2 or x*y for non-linearity.

76. What is Yeo-Johnson transformation?

a) For negative values too
b) Positive only
c) Discretization
d) Encoding
✅ Correct Answer: a) For negative values too
📝 Explanation:
Extension of Box-Cox handling negative and zero values.

77. What is data validation in cleaning?

a) Ensuring accuracy and consistency
b) Visualization
c) Modeling
d) Storage
✅ Correct Answer: a) Ensuring accuracy and consistency
📝 Explanation:
Validation checks rules like range or format compliance.

78. What is record linkage?

a) Matching across datasets
b) Within dataset duplicates
c) Scaling
d) Binning
✅ Correct Answer: a) Matching across datasets
📝 Explanation:
It links records referring to the same entity.

79. What is unsupervised discretization?

a) No class labels used
b) Uses target
c) Supervised only
d) Regression based
✅ Correct Answer: a) No class labels used
📝 Explanation:
Methods like equal-width don't rely on target variables.

80. What is Local Outlier Factor (LOF)?

a) Density-based outlier score
b) Distance only
c) Global threshold
d) Simple Z-score
✅ Correct Answer: a) Density-based outlier score
📝 Explanation:
LOF compares local density to neighbors.

81. What is MICE imputation?

a) Multiple Imputation by Chained Equations
b) Single chain
c) Mean only
d) Delete
✅ Correct Answer: a) Multiple Imputation by Chained Equations
📝 Explanation:
Each variable with missing values is modeled by regression on the other variables, and the cycle repeats until the imputations stabilize.

82. What is an interaction feature?

a) Product of two features
b) Single feature
c) Categorical
d) Target
✅ Correct Answer: a) Product of two features
📝 Explanation:
Captures combined effects, like age * income.

83. What is quantile transformation?

a) Maps to uniform distribution
b) Normalizes to Gaussian
c) Logs
d) Boxes
✅ Correct Answer: a) Maps to uniform distribution
📝 Explanation:
It maps values through their empirical ranks to a uniform distribution; mapping to a normal distribution is a common variant.

84. What is data auditing?

a) Systematic quality review
b) Random check
c) Visualization
d) Modeling
✅ Correct Answer: a) Systematic quality review
📝 Explanation:
Auditing identifies patterns of errors or anomalies.

85. What is object consolidation?

a) Merging duplicate entities
b) Splitting
c) Encoding
d) Scaling
✅ Correct Answer: a) Merging duplicate entities
📝 Explanation:
Part of integration resolving duplicates across sources.

86. What is clustering-based discretization?

a) Groups similar values
b) Equal width
c) Frequency
d) Chi-square
✅ Correct Answer: a) Groups similar values
📝 Explanation:
Uses clustering to form natural bins.

87. What is DBSCAN for outliers?

a) Density-based clustering flags noise
b) K-means
c) Hierarchical
d) Gaussian
✅ Correct Answer: a) Density-based clustering flags noise
📝 Explanation:
Points not in clusters are outliers in DBSCAN.

88. What is Kalman filter imputation?

a) For time series states
b) Static mean
c) Random
d) Mode
✅ Correct Answer: a) For time series states
📝 Explanation:
Predicts missing values using state-space models.

89. What is a lagged feature?

a) Previous time step value
b) Future
c) Average
d) Sum
✅ Correct Answer: a) Previous time step value
📝 Explanation:
Used in time series for autoregressive features.

90. What is power transformation?

a) General family for normality
b) Log only
c) Square root
d) Reciprocal
✅ Correct Answer: a) General family for normality
📝 Explanation:
Includes Box-Cox and Yeo-Johnson for stabilizing variance.

91. What is referential integrity check?

a) Valid foreign keys
b) Data types
c) Ranges
d) Formats
✅ Correct Answer: a) Valid foreign keys
📝 Explanation:
Ensures links between tables are valid.

92. What is data deduplication?

a) Removing exact duplicates
b) Fuzzy only
c) Encoding
d) Scaling
✅ Correct Answer: a) Removing exact duplicates
📝 Explanation:
Identifies and eliminates identical records.

93. What is entropy-based binning?

a) Minimizes intra-bin impurity
b) Equal width
c) Frequency
d) Unsupervised
✅ Correct Answer: a) Minimizes intra-bin impurity
📝 Explanation:
Uses information entropy for supervised discretization.

94. What is one-class SVM for outliers?

a) Learns normal boundary
b) Supervised classification
c) Clustering
d) Regression
✅ Correct Answer: a) Learns normal boundary
📝 Explanation:
Flags points outside the learned normal region.

95. What is EM algorithm for imputation?

a) Expectation-Maximization
b) Simple mean
c) KNN
d) Delete
✅ Correct Answer: a) Expectation-Maximization
📝 Explanation:
Iteratively estimates parameters and missings.

96. What is a rolling window feature?

a) Moving statistics
b) Static
c) Lagged only
d) Future
✅ Correct Answer: a) Moving statistics
📝 Explanation:
Computes aggregates over time windows.
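
Lagged and rolling-window features can both be derived with pandas; the `units` column is a hypothetical daily series.

```python
import pandas as pd

sales = pd.DataFrame({"units": [10, 12, 9, 15, 11]})

sales["lag_1"] = sales["units"].shift(1)                           # previous step's value
sales["rolling_mean_3"] = sales["units"].rolling(window=3).mean()  # mean of the last 3 steps
```

The first rows are missing because no earlier observations exist to fill the lag or the window.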

97. What is arcsinh transformation?

a) For heavy-tailed data
b) Log
c) Square
d) Identity
✅ Correct Answer: a) For heavy-tailed data
📝 Explanation:
The inverse hyperbolic sine behaves like a log for large values but is also defined at zero and for negative values.

98. What is domain validation?

a) Business rule checks
b) Syntax only
c) Length
d) Type
✅ Correct Answer: a) Business rule checks
📝 Explanation:
Ensures data fits domain-specific logic.

99. What is survivorship bias in cleaning?

a) Ignoring failed entities
b) Duplicates
c) Missing
d) Noise
✅ Correct Answer: a) Ignoring failed entities
📝 Explanation:
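Survivorship bias arises when only surviving entities are kept; avoid it by including failed or discontinued entities in the historical data.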

100. What is k-bins discretization?

a) Optimal bins via dynamic programming
b) Equal
c) Random
d) Density
✅ Correct Answer: a) Optimal bins via dynamic programming
📝 Explanation:
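Chooses k bin boundaries that minimize within-bin error, which can be solved exactly in one dimension with dynamic programming. Note that scikit-learn's KBinsDiscretizer instead offers uniform, quantile, and k-means strategies.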

101. What is elliptic envelope for outliers?

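a) Robust Gaussian covariance fit
b) Simple threshold
c) KNN
d) Tree
✅ Correct Answer: a) Robust Gaussian covariance fit
📝 Explanation:
Assumes roughly Gaussian inliers, fits a robust covariance estimate (minimum covariance determinant), and flags points with a large Mahalanobis distance.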

102. What is random forest imputation?

a) Tree-based predictions
b) Linear
c) Mean
d) Mode
✅ Correct Answer: a) Tree-based predictions
📝 Explanation:
Uses forests to predict missings from features.

103. What is Fourier transform feature?

a) Frequency domain for time series
b) Spatial
c) Categorical
d) Text
✅ Correct Answer: a) Frequency domain for time series
📝 Explanation:
Extracts periodic components.

104. What is square root transformation?

a) For count data skewness
b) Log
c) Power 3
d) Reciprocal
✅ Correct Answer: a) For count data skewness
📝 Explanation:
Stabilizes variance and reduces right-skew in Poisson-like count data.

105. What is completeness check?

a) No missing required fields
b) Format
c) Range
d) Consistency
✅ Correct Answer: a) No missing required fields
📝 Explanation:
Verifies all mandatory data is present.

106. What is selection bias in data cleaning?

a) Non-random sampling
b) Duplicates
c) Outliers
d) Noise
✅ Correct Answer: a) Non-random sampling
📝 Explanation:
Selection bias arises when records are not sampled at random; address it by understanding how the sample was drawn before filtering.

107. What is CAIM discretization?

a) Class-Attribute Interdependence Maximization
b) Equal
c) Width
d) Frequency
✅ Correct Answer: a) Class-Attribute Interdependence Maximization
📝 Explanation:
Supervised method maximizing dependency.

108. What is COPOD for outliers?

a) Copula-based
b) Distance
c) Density
d) Tree
✅ Correct Answer: a) Copula-based
📝 Explanation:
Unsupervised using copulas for dependence.

109. What is Bayesian imputation?

a) Probabilistic filling
b) Deterministic
c) Mean
d) KNN
✅ Correct Answer: a) Probabilistic filling
📝 Explanation:
Incorporates prior distributions for estimates.

110. What is wavelet transform feature?

a) Multi-resolution analysis
b) Frequency only
c) Time only
d) Static
✅ Correct Answer: a) Multi-resolution analysis
📝 Explanation:
Decomposes signals into time-frequency components.

111. What is reciprocal transformation?

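a) 1/x for strong right-skew
b) Sqrt
c) Log
d) Square
✅ Correct Answer: a) 1/x for strong right-skew
📝 Explanation:
For positive data, 1/x compresses large values strongly and so reduces severe positive skew; it also reverses the ordering of values.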

112. What is uniqueness check?

a) No unintended duplicates
b) Missing
c) Range
d) Type
✅ Correct Answer: a) No unintended duplicates
📝 Explanation:
Ensures primary keys are unique.

113. What is temporal bias in cleaning?

a) Time-period specific data
b) Spatial
c) Selection
d) Confirmation
✅ Correct Answer: a) Time-period specific data
📝 Explanation:
Data drawn from a single time period may not represent others; balance or stratify records across periods.

114. What is MDLP discretization?

a) Minimum Description Length Principle
b) Equal width
c) Frequency
d) Clustering
✅ Correct Answer: a) Minimum Description Length Principle
📝 Explanation:
Supervised stopping criterion for binning.

115. What is KNN for outliers?

a) Distance to neighbors
b) Density
c) Copula
d) Tree
✅ Correct Answer: a) Distance to neighbors
📝 Explanation:
High distance indicates isolation.

116. What is matrix factorization imputation?

a) Low-rank approximation
b) Regression
c) Tree
d) Mean
✅ Correct Answer: a) Low-rank approximation
📝 Explanation:
Fills sparse matrices like in recommender systems.

117. What is embedding feature for text?

a) Dense vector representations
b) Bag of words
c) TF-IDF
d) Count
✅ Correct Answer: a) Dense vector representations
📝 Explanation:
Word2Vec or BERT captures semantic meaning.

118. What is cube root transformation?

a) Milder than log for skewness
b) Stronger
c) Reciprocal
d) Square
✅ Correct Answer: a) Milder than log for skewness
📝 Explanation:
x^(1/3) for moderate right-skew.

119. What is timeliness check?

a) Data currency
b) Accuracy
c) Completeness
d) Consistency
✅ Correct Answer: a) Data currency
📝 Explanation:
Verifies data is up-to-date.

120. What is confirmation bias in cleaning?

a) Retaining supporting data
b) Random
c) Temporal
d) Spatial
✅ Correct Answer: a) Retaining supporting data
📝 Explanation:
Confirmation bias keeps only data that supports an expectation; avoid it by fixing objective cleaning rules in advance.

121. What is FUSINTER discretization?

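a) Fuzzy unsupervised
b) Supervised
c) Equal
d) Chi
✅ Correct Answer: b) Supervised
📝 Explanation:
FUSINTER is a supervised, bottom-up method that merges adjacent intervals using an entropy-based criterion computed from class labels; it does not rely on fuzzy bins.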

122. What is angle-based outlier detection?

a) ABOD using angles
b) Distance
c) Density
d) Variance
✅ Correct Answer: a) ABOD using angles
📝 Explanation:
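Scores points by the variance of angles to pairs of other points, which stays informative in high dimensions where distances become less meaningful.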

123. What is deep learning imputation?

a) Autoencoder-based
b) Linear
c) Tree
d) KNN
✅ Correct Answer: a) Autoencoder-based
📝 Explanation:
Learns latent representations for filling.

124. What is PCA feature for images?

a) Eigenfaces
b) Pixel counts
c) Colors
d) Sizes
✅ Correct Answer: a) Eigenfaces
📝 Explanation:
Reduces dimensionality in face recognition.

125. What is exponential transformation?

a) For reducing left-skew
b) For reducing right-skew
c) Normal
d) Uniform
✅ Correct Answer: a) For reducing left-skew
📝 Explanation:
e^x stretches higher values more than lower ones, which can offset left (negative) skew.

126. What is accuracy check?

a) Matches reality
b) Format
c) Uniqueness
d) Timeliness
✅ Correct Answer: a) Matches reality
📝 Explanation:
Verifies data correctness against sources.

127. What is spatial bias in data?

a) Geographic imbalance
b) Time
c) Confirmation
d) Selection
✅ Correct Answer: a) Geographic imbalance
📝 Explanation:
Spatial bias arises when some regions are over-represented; reduce it by sampling or reweighting across regions.