MCQs Generator

120 Data Cleaning and Preprocessing in Data Analysis - MCQs

120 industry-level multiple-choice questions on data cleaning, handling missing values, outliers, encoding, scaling, and preprocessing pipelines—modeled after real data scientist and analyst interviews at FAANG, fintech, and consulting firms.

1. Which of the following is not a key aspect of data quality in preprocessing?

a) Accuracy
b) Completeness
c) Consistency
d) Database size
✅ Correct Answer: d) Database size
📝 Explanation:
Data quality focuses on accuracy, completeness, and consistency, not the size of the database.

2. What is a common method to handle missing class labels in a dataset?

a) Ignoring the tuple
b) Generating a duplicate tuple
c) Deleting the entire dataset
d) Increasing the dataset size
✅ Correct Answer: a) Ignoring the tuple
📝 Explanation:
Ignoring tuples with missing class labels is a simple strategy when the impact is minimal.

3. In data cleaning, what technique uses statistical methods to fill missing values?

a) Spooling
b) Decision tree induction
c) Numerosity reduction
d) Aggregation
✅ Correct Answer: b) Decision tree induction
📝 Explanation:
Decision trees can predict probable values for missing data based on other attributes.

4. What is the primary purpose of data preprocessing in data analysis?

a) To increase data volume
b) To prepare data for effective analysis
c) To delete all data
d) To visualize data directly
✅ Correct Answer: b) To prepare data for effective analysis
📝 Explanation:
Preprocessing transforms raw data into a format suitable for analysis and modeling.

5. Which technique is used to standardize feature ranges in preprocessing?

a) Binning
b) Normalization
c) Clustering
d) Regression
✅ Correct Answer: b) Normalization
📝 Explanation:
Normalization rescales features to a common range, such as 0 to 1, so that features with large magnitudes do not dominate.

6. What does handling missing values prevent in data analysis?

a) Data expansion
b) Biased model training
c) Data visualization
d) Feature creation
✅ Correct Answer: b) Biased model training
📝 Explanation:
Unaddressed missing values can skew results and lead to inaccurate predictions.

7. Which method imputes missing values using the average of similar instances?

a) Mean imputation
b) K-NN imputation
c) Deletion
d) Mode imputation
✅ Correct Answer: b) K-NN imputation
📝 Explanation:
K-Nearest Neighbors finds similar data points to estimate missing values.

8. In data cleaning, what is 'binning' primarily used for?

a) Smoothing noisy data
b) Removing duplicates
c) Encoding categories
d) Scaling features
✅ Correct Answer: a) Smoothing noisy data
📝 Explanation:
Binning groups data into bins and replaces values with bin averages to reduce noise.
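
Smoothing by bin means can be sketched in a few lines of plain Python. This is a minimal illustration that assumes equal-frequency bins of a fixed size; the function name `smooth_by_bin_means` and the sample `prices` are hypothetical.

```python
def smooth_by_bin_means(values, bin_size):
    """Sort values, split them into equal-frequency bins, and
    replace every value with the mean of its bin."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_values = ordered[start:start + bin_size]
        bin_mean = sum(bin_values) / len(bin_values)
        smoothed.extend([bin_mean] * len(bin_values))
    return smoothed


# Hypothetical sample data: nine sorted prices, three per bin.
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
smoothed_prices = smooth_by_bin_means(prices, bin_size=3)
print(smoothed_prices)
```

Each group of three values collapses to its mean (9, 22, and 29 here), which removes small fluctuations inside a bin.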

9. What is the risk of not removing duplicates in preprocessing?

a) Overfitting in models
b) Underestimation of variance
c) Increased computational cost
d) All of the above
✅ Correct Answer: d) All of the above
📝 Explanation:
Duplicates can bias models toward repeated records, make variance and standard errors look smaller than they are, and slow processing.

10. Which encoding converts categorical data into binary vectors?

a) Label encoding
b) One-hot encoding
c) Target encoding
d) Frequency encoding
✅ Correct Answer: b) One-hot encoding
📝 Explanation:
One-hot encoding creates dummy variables for each category without ordinal assumptions.
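
A rough sketch of one-hot encoding in plain Python follows. The helper `one_hot_encode` and the sample `colors` are hypothetical; in practice a library routine would usually be used instead.

```python
def one_hot_encode(values):
    """Return the sorted category list and one binary vector per value."""
    categories = sorted(set(values))
    encoded = [
        [1 if value == category else 0 for category in categories]
        for value in values
    ]
    return categories, encoded


# Hypothetical nominal feature with three categories.
colors = ["red", "green", "blue", "green"]
categories, encoded = one_hot_encode(colors)
print(categories)
for color, vector in zip(colors, encoded):
    print(color, vector)
```

Every vector contains exactly one 1, so no ordering between categories is implied.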

11. What is 'data wrangling' in the context of preprocessing?

a) Data deletion
b) Transforming messy data into clean format
c) Data visualization
d) Model training
✅ Correct Answer: b) Transforming messy data into clean format
📝 Explanation:
Data wrangling involves cleaning and restructuring data for analysis.

12. Which technique detects outliers using statistical thresholds?

a) Z-score method
b) K-means clustering
c) PCA
d) Binning
✅ Correct Answer: a) Z-score method
📝 Explanation:
Z-score identifies points more than 3 standard deviations from the mean as outliers.
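
The Z-score rule can be sketched with the standard library. This assumes the population standard deviation and a threshold of 3; the function name `zscore_outliers` and the `readings` are hypothetical.

```python
from statistics import mean, pstdev


def zscore_outliers(values, threshold=3.0):
    """Return the values whose absolute Z-score exceeds the threshold."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (assumption)
    if sigma == 0:
        return []  # all values identical, nothing can be flagged
    return [v for v in values if abs(v - mu) / sigma > threshold]


# Hypothetical sensor readings with one extreme value.
readings = [10, 11, 9, 10, 12, 10, 11, 9, 10, 11, 10, 9, 95]
flagged = zscore_outliers(readings)
print(flagged)
```

Note that the outlier itself inflates the mean and standard deviation, so in very small samples no point may reach the threshold at all.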

13. What is the main goal of feature scaling in preprocessing?

a) To reduce dimensions
b) To make features comparable
c) To encode text
d) To handle missing data
✅ Correct Answer: b) To make features comparable
📝 Explanation:
Scaling ensures no feature dominates due to differing units or ranges.

14. In handling categorical data, when is label encoding appropriate?

a) For nominal data
b) For ordinal data
c) For text data
d) For numerical data
✅ Correct Answer: b) For ordinal data
📝 Explanation:
Label encoding assigns numbers based on order, suitable for ordinal categories.

15. What does 'data integration' involve in preprocessing?

a) Combining multiple data sources
b) Removing noise
c) Scaling values
d) Encoding features
✅ Correct Answer: a) Combining multiple data sources
📝 Explanation:
Data integration merges datasets from different sources into a unified view.

16. Which imputation method is best for non-numerical data?

a) Mean imputation
b) Mode imputation
c) K-NN imputation
d) Regression imputation
✅ Correct Answer: b) Mode imputation
📝 Explanation:
Mode imputation uses the most frequent value for categorical missing data.

17. What is a potential issue with mean imputation?

a) Reduces variance
b) Increases data size
c) Deletes rows
d) Encodes categories
✅ Correct Answer: a) Reduces variance
📝 Explanation:
Mean imputation can underestimate variability in the dataset.
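
The variance shrinkage is easy to see on a toy example. The sketch below uses hypothetical values and assumes two missing entries are filled with the observed mean.

```python
from statistics import mean, pvariance

# Hypothetical observed values; two further entries are missing.
observed = [2.0, 4.0, 6.0, 8.0]
n_missing = 2

# Mean imputation: every missing entry receives the observed mean.
fill_value = mean(observed)
imputed = observed + [fill_value] * n_missing

var_before = pvariance(observed)
var_after = pvariance(imputed)
print(var_before, var_after)
```

The imputed points sit exactly at the mean, so they add nothing to the sum of squared deviations while increasing the count, which lowers the variance.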

18. Which method groups data into buckets for smoothing?

a) Regression
b) Binning
c) Clustering
d) Normalization
✅ Correct Answer: b) Binning
📝 Explanation:
Binning sorts data into intervals and smooths by boundary or mean values.

19. What is 'data reduction' in preprocessing?

a) Increasing dataset size
b) Reducing data volume while preserving information
c) Deleting all data
d) Visualizing data
✅ Correct Answer: b) Reducing data volume while preserving information
📝 Explanation:
Data reduction techniques like PCA minimize data size without losing key insights.

20. When should you use median imputation over mean?

a) For symmetric distributions
b) For skewed distributions
c) For categorical data
d) For ordinal data
✅ Correct Answer: b) For skewed distributions
📝 Explanation:
Median is robust to outliers and skewness, unlike the mean.

21. What does PCA stand for in dimensionality reduction?

a) Principal Component Analysis
b) Primary Cluster Aggregation
c) Principal Correlation Adjustment
d) Primary Component Allocation
✅ Correct Answer: a) Principal Component Analysis
📝 Explanation:
PCA transforms data into principal components to reduce dimensions.

22. Which step checks for data consistency across sources?

a) Data cleaning
b) Data integration
c) Data transformation
d) Data reduction
✅ Correct Answer: b) Data integration
📝 Explanation:
Integration resolves inconsistencies when merging multiple data sources.

23. What is a common way to handle outliers in preprocessing?

a) Ignore them
b) Cap or floor values
c) Increase dataset size
d) Encode as categories
✅ Correct Answer: b) Cap or floor values
📝 Explanation:
Capping (winsorizing) limits extreme values to chosen thresholds, such as the 1st and 99th percentiles or the IQR fences.
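
Capping and flooring amount to clipping each value into a fixed interval. In this sketch the thresholds 10 and 100 are arbitrary, hypothetical limits; in practice they would come from percentiles or domain rules.

```python
def cap_values(values, lower, upper):
    """Clip every value into the closed interval [lower, upper]."""
    return [min(max(v, lower), upper) for v in values]


# Hypothetical incomes with one very low and one very high value.
incomes = [5, 30, 32, 35, 400]
capped = cap_values(incomes, lower=10, upper=100)
print(capped)
```

Unlike deletion, capping keeps every row; only the extreme magnitudes change.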

24. In Python, which function detects missing values in Pandas?

a) isnull()
b) dropna()
c) fillna()
d) describe()
✅ Correct Answer: a) isnull()
📝 Explanation:
Pandas' isnull() returns a boolean mask for missing values.

25. What is 'forward fill' in handling missing data?

a) Filling with previous value
b) Filling with next value
c) Filling with mean
d) Deleting rows
✅ Correct Answer: a) Filling with previous value
📝 Explanation:
Forward fill propagates the last valid observation forward.
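
Forward fill can be sketched without any library. Here `None` stands in for a missing value, and the helper name `forward_fill` is hypothetical.

```python
def forward_fill(values):
    """Replace each None with the most recent non-missing value."""
    filled = []
    last_valid = None
    for value in values:
        if value is not None:
            last_valid = value
        filled.append(last_valid)
    return filled


# Hypothetical series with gaps, including one at the very start.
series = [None, 3, None, None, 7, None]
ffilled = forward_fill(series)
print(ffilled)
```

A gap at the start stays missing because there is no earlier observation to carry forward.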

26. Which normalization brings data to zero mean and unit variance?

a) Min-max scaling
b) Z-score normalization
c) Decimal scaling
d) L1 normalization
✅ Correct Answer: b) Z-score normalization
📝 Explanation:
Z-score uses mean and standard deviation for standardization.
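
The two most common scalers can be compared side by side. This sketch assumes a non-constant feature (otherwise both would divide by zero) and uses the population standard deviation; the function names and `heights_cm` are hypothetical.

```python
from statistics import mean, pstdev


def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def z_score_scale(values):
    """Shift to zero mean and divide by the population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


# Hypothetical feature measured in centimeters.
heights_cm = [150, 160, 170, 180, 190]
scaled_01 = min_max_scale(heights_cm)
standardized = z_score_scale(heights_cm)
print(scaled_01)
print(standardized)
```

Min-max scaling bounds the output, while Z-score standardization leaves it unbounded but centered at zero with unit variance.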

27. What issue arises from inconsistent data formats?

a) Easy analysis
b) Parsing errors
c) Faster processing
d) Better visualization
✅ Correct Answer: b) Parsing errors
📝 Explanation:
Inconsistent formats like dates can cause loading or computation failures.

28. Which technique merges datasets on common keys?

a) Concatenation
b) Joining
c) Binning
d) Imputation
✅ Correct Answer: b) Joining
📝 Explanation:
Joining combines tables based on matching keys like IDs.

29. What is 'data discretization' used for?

a) Continuous to categorical conversion
b) Noise removal
c) Duplicate detection
d) Scaling
✅ Correct Answer: a) Continuous to categorical conversion
📝 Explanation:
Discretization bins continuous values into discrete intervals.

30. In outlier detection, what does IQR stand for?

a) Interquartile Range
b) Integrated Query Response
c) Internal Quality Ratio
d) Interval Quality Reduction
✅ Correct Answer: a) Interquartile Range
📝 Explanation:
The IQR method flags values below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR, where IQR = Q3 - Q1.
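
The IQR fences can be computed with the standard library. Quartile conventions differ between tools; this sketch assumes the "inclusive" method of `statistics.quantiles`, and the helper `iqr_outliers` and sample `data` are hypothetical.

```python
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR] and the fences."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    outliers = [v for v in values if v < lower or v > upper]
    return outliers, (lower, upper)


# Hypothetical sample with one value far above the rest.
data = [12, 13, 14, 15, 16, 17, 18, 19, 60]
outliers, fences = iqr_outliers(data)
print(outliers, fences)
```

Because the quartiles barely move when a single extreme value is added, this rule is less affected by the outlier than the Z-score rule.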

31. What is a disadvantage of deleting rows with missing values?

a) Data loss
b) Increased accuracy
c) Faster processing
d) Better scaling
✅ Correct Answer: a) Data loss
📝 Explanation:
Deletion reduces sample size, potentially biasing the dataset.

32. Which encoding preserves category frequencies?

a) One-hot encoding
b) Frequency encoding
c) Label encoding
d) Binary encoding
✅ Correct Answer: b) Frequency encoding
📝 Explanation:
Frequency encoding replaces categories with their occurrence counts.

33. What does 'data transformation' include?

a) Normalization and aggregation
b) Only deletion
c) Visualization
d) Model training
✅ Correct Answer: a) Normalization and aggregation
📝 Explanation:
Transformation alters data structure, like normalizing or aggregating.

34. How do you handle multicollinearity in preprocessing?

a) Remove correlated features
b) Add more data
c) Ignore it
d) Scale only
✅ Correct Answer: a) Remove correlated features
📝 Explanation:
Removing highly correlated features reduces redundancy and instability.

35. What is 'noise' in data cleaning?

a) Random errors or variances
b) Missing values
c) Duplicates
d) Outliers
✅ Correct Answer: a) Random errors or variances
📝 Explanation:
Noise is random error or variance in a measured variable that obscures the underlying pattern.

36. Which library in Python is used for data manipulation?

a) Matplotlib
b) Pandas
c) Scikit-learn
d) NumPy
✅ Correct Answer: b) Pandas
📝 Explanation:
Pandas provides DataFrames for efficient data cleaning and transformation.

37. What is 'backward fill' for missing data?

a) Filling with next value
b) Filling with previous value
c) Filling with mean
d) Deleting
✅ Correct Answer: a) Filling with next value
📝 Explanation:
Backward fill uses the next valid observation to fill gaps.

38. Which method is robust to outliers in scaling?

a) Min-max scaling
b) Robust scaling
c) Z-score
d) Log scaling
✅ Correct Answer: b) Robust scaling
📝 Explanation:
Robust scaling centers on the median and scales by the IQR, so extreme values have little influence.

39. What causes data inconsistency?

a) Different naming conventions
b) Uniform formats
c) Single source
d) Clean data
✅ Correct Answer: a) Different naming conventions
📝 Explanation:
Synonyms or varying abbreviations across sources lead to inconsistencies.

40. In merging datasets, what is an inner join?

a) Only matching records
b) All records from left
c) All records from right
d) All records combined
✅ Correct Answer: a) Only matching records
📝 Explanation:
Inner join returns rows with matching keys in both datasets.

41. What is entropy-based discretization?

a) Supervised binning using class info
b) Unsupervised binning
c) Noise removal
d) Scaling
✅ Correct Answer: a) Supervised binning using class info
📝 Explanation:
It uses information gain to create bins that maximize class separation.

42. How is outlier impact assessed?

a) Box plots
b) Histograms
c) Scatter plots
d) All of the above
✅ Correct Answer: d) All of the above
📝 Explanation:
Box plots, histograms, and scatter plots all help reveal outliers and their influence.

43. When is listwise deletion used?

a) For few missing values
b) When data loss is acceptable
c) For categorical data
d) Always
✅ Correct Answer: b) When data loss is acceptable
📝 Explanation:
Listwise deletion drops every row that contains any missing value, so it suits cases where the remaining sample is still large enough.

44. What does binary encoding do to categories?

a) Assigns integers
b) Converts to binary bits
c) One-hot expands
d) Frequency maps
✅ Correct Answer: b) Converts to binary bits
📝 Explanation:
Binary encoding writes each category's integer code in binary, so n categories need roughly log2(n) columns instead of the n columns one-hot encoding uses.
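
A simplified sketch of binary encoding is shown below. It assumes categories are numbered from 0 in sorted order, which is an illustrative choice; library implementations may number them differently. The helper `binary_encode` and the `cities` list are hypothetical.

```python
from math import ceil, log2


def binary_encode(values):
    """Map each category to its integer code written as fixed-width bits."""
    categories = sorted(set(values))
    # Number of bit columns needed to distinguish all categories.
    width = max(1, ceil(log2(len(categories))))
    index = {category: i for i, category in enumerate(categories)}
    return [
        [int(bit) for bit in format(index[v], f"0{width}b")]
        for v in values
    ]


# Hypothetical feature with five distinct categories.
cities = ["Paris", "Tokyo", "Lima", "Cairo", "Oslo", "Tokyo"]
bits = binary_encode(cities)
for city, row in zip(cities, bits):
    print(city, row)
```

Five categories fit into three bit columns here, whereas one-hot encoding would need five.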

45. What is aggregation in transformation?

a) Summarizing data
b) Splitting data
c) Encoding
d) Imputing
✅ Correct Answer: a) Summarizing data
📝 Explanation:
Aggregation computes summaries like means or counts from groups.

46. How can multicollinearity be detected?

a) Correlation matrix
b) VIF calculation
c) Both a and b
d) None
✅ Correct Answer: c) Both a and b
📝 Explanation:
High pairwise correlations or a VIF above roughly 5 to 10 suggest multicollinearity.

47. What smoothing method uses regression?

a) Moving average
b) Regression smoothing
c) Binning
d) All
✅ Correct Answer: b) Regression smoothing
📝 Explanation:
Regression fits a model to local data for noise reduction.

48. Which Pandas method removes duplicates?

a) drop_duplicates()
b) unique()
c) value_counts()
d) replace()
✅ Correct Answer: a) drop_duplicates()
📝 Explanation:
drop_duplicates() eliminates repeated rows based on specified columns.
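
A short pandas sketch shows the difference between exact-row deduplication and deduplication on a key. The DataFrame contents and column names are hypothetical.

```python
import pandas as pd

# Hypothetical orders table: row 2 repeats row 1 exactly,
# row 4 shares a customer_id with row 3 but has a different amount.
df = pd.DataFrame({
    "customer_id": [1, 2, 2, 3, 3],
    "amount": [10, 20, 20, 30, 35],
})

# Drop rows that are identical across all columns.
deduped_exact = df.drop_duplicates()

# Keep only the first row seen for each customer_id.
deduped_by_key = df.drop_duplicates(subset=["customer_id"], keep="first")

print(deduped_exact)
print(deduped_by_key)
```

Deduplicating on a subset is stricter: it discards the second row for customer 3 even though its amount differs.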

49. What is interpolation for time series missing data?

a) Linear estimation between points
b) Mean fill
c) Mode fill
d) Deletion
✅ Correct Answer: a) Linear estimation between points
📝 Explanation:
Interpolation estimates values using surrounding data points.
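
Linear interpolation over evenly spaced observations can be sketched as follows. The sketch assumes equal time steps and uses `None` for missing values; the helper `interpolate_linear` and the `temps` series are hypothetical.

```python
def interpolate_linear(values):
    """Fill interior None gaps on a straight line between known neighbors."""
    filled = list(values)
    known = [i for i, v in enumerate(filled) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (filled[right] - filled[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = filled[left] + step * (i - left)
    return filled


# Hypothetical hourly temperatures with two gaps.
temps = [10.0, None, None, 16.0, None, 20.0]
interpolated = interpolate_linear(temps)
print(interpolated)
```

Gaps before the first or after the last known value are left untouched, since there is no second point to interpolate toward.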

50. What is L1 normalization?

a) Sum to 1
b) Divide by max
c) Zero mean
d) Unit variance
✅ Correct Answer: a) Sum to 1
📝 Explanation:
L1 normalizes by dividing by the sum of absolute values.

51. What is entity resolution in cleaning?

a) Merging similar records
b) Removing noise
c) Scaling
d) Encoding
✅ Correct Answer: a) Merging similar records
📝 Explanation:
It identifies and merges duplicates across datasets.

52. Which join keeps all records from the left table?

a) Inner join
b) Left join
c) Right join
d) Outer join
✅ Correct Answer: b) Left join
📝 Explanation:
Left join keeps all from left table, matching from right.

53. What is chi-merge discretization?

a) Supervised using chi-square
b) Unsupervised
c) Noise based
d) Scale based
✅ Correct Answer: a) Supervised using chi-square
📝 Explanation:
ChiMerge uses chi-square tests to determine bin boundaries.

54. What Z-score threshold is commonly used to flag outliers?

a) 2
b) 3
c) 1
d) 4
✅ Correct Answer: b) 3
📝 Explanation:
Values beyond 3 standard deviations are typically outliers.

55. When is pairwise deletion used?

a) For correlations
b) For full dataset analysis
c) Always
d) For imputation
✅ Correct Answer: a) For correlations
📝 Explanation:
Pairwise uses available pairs, maximizing data for specific analyses.

56. What is target encoding?

a) Replace with mean target
b) Binary conversion
c) Frequency
d) One-hot
✅ Correct Answer: a) Replace with mean target
📝 Explanation:
Target encoding uses the mean of the target for each category.
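
Plain mean target encoding can be sketched as below. This is the unsmoothed version; the helper `target_encode` and the sample `plans` and `churned` lists are hypothetical.

```python
from collections import defaultdict


def target_encode(categories, targets):
    """Replace each category with the mean target observed for it."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for category, target in zip(categories, targets):
        totals[category] += target
        counts[category] += 1
    means = {c: totals[c] / counts[c] for c in totals}
    return [means[c] for c in categories], means


# Hypothetical subscription plans and a binary churn target.
plans = ["A", "B", "A", "B", "A"]
churned = [1, 0, 0, 0, 1]
encoded_plans, plan_means = target_encode(plans, churned)
print(plan_means)
print(encoded_plans)
```

Because the encoding uses the target itself, the means are usually computed on training data only (often with smoothing or cross-fitting) to limit leakage.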

57. What is windowing in smoothing?

a) Local averaging
b) Global scaling
c) Binning
d) Regression
✅ Correct Answer: a) Local averaging
📝 Explanation:
Windowing averages values in a sliding window to smooth noise.

58. What does df.dropna() do in Pandas?

a) Fills missing
b) Removes rows with missing
c) Detects missing
d) Counts missing
✅ Correct Answer: b) Removes rows with missing
📝 Explanation:
dropna() deletes rows or columns containing NaN values.

59. What is spline interpolation?

a) Piecewise polynomial fitting
b) Linear
c) Nearest neighbor
d) Constant
✅ Correct Answer: a) Piecewise polynomial fitting
📝 Explanation:
Splines use smooth polynomials between points for interpolation.

60. What is L2 normalization?

a) Euclidean norm to 1
b) Manhattan to 1
c) Max value
d) Mean zero
✅ Correct Answer: a) Euclidean norm to 1
📝 Explanation:
L2 divides by the square root of sum of squares.
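
L1 and L2 normalization differ only in the norm used as the divisor. The sketch assumes a non-zero vector; the function names and the sample vector are hypothetical.

```python
from math import sqrt


def l1_normalize(vector):
    """Divide by the sum of absolute values (Manhattan norm)."""
    norm = sum(abs(x) for x in vector)
    return [x / norm for x in vector]


def l2_normalize(vector):
    """Divide by the square root of the sum of squares (Euclidean norm)."""
    norm = sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


# Hypothetical two-dimensional sample.
v = [3.0, -4.0]
v_l1 = l1_normalize(v)
v_l2 = l2_normalize(v)
print(v_l1)
print(v_l2)
```

After L1 normalization the absolute values sum to 1; after L2 normalization the vector has unit Euclidean length.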

61. What is fuzzy matching?

a) Approximate string matching
b) Exact matching
c) Numerical scaling
d) Binning
✅ Correct Answer: a) Approximate string matching
📝 Explanation:
Fuzzy matching handles typos or variations in entity names.
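
Python's standard `difflib` module offers one simple way to do approximate matching. This is only a sketch: the company names are hypothetical sample data, and the 0.6 cutoff is an arbitrary choice that would need tuning on real data.

```python
from difflib import SequenceMatcher, get_close_matches


def similarity(a, b):
    """Case-insensitive similarity ratio between 0 and 1."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


# Hypothetical list of canonical entity names.
canonical = ["Microsoft Corporation", "Alphabet Inc.", "Amazon.com, Inc."]

# Find the closest canonical name for a misspelled, abbreviated entry.
best = get_close_matches("Microsft Corp", canonical, n=1, cutoff=0.6)
name_score = similarity("Jon Smith", "John Smith")
print(best, name_score)
```

Dedicated libraries use other measures such as edit distance or token-based scores; the idea of scoring candidates and applying a cutoff is the same.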

62. What is a full outer join?

a) All records from both
b) Only left
c) Only matching
d) Only right
✅ Correct Answer: a) All records from both
📝 Explanation:
Full outer join includes all rows, filling non-matches with nulls.
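
Inner, left, and full outer joins can be compared on two small tables with pandas `merge`. The table contents and column names are hypothetical.

```python
import pandas as pd

# Hypothetical tables: customer 1 has no order, order 4 has no customer.
customers = pd.DataFrame({"customer_id": [1, 2, 3], "name": ["Ana", "Ben", "Cy"]})
orders = pd.DataFrame({"customer_id": [2, 3, 4], "total": [50, 75, 20]})

# Only keys present in both tables.
inner = customers.merge(orders, on="customer_id", how="inner")

# Every customer, with missing order data where there is no match.
left = customers.merge(orders, on="customer_id", how="left")

# Every key from either table.
outer = customers.merge(orders, on="customer_id", how="outer")

print(len(inner), len(left), len(outer))
```

The unmatched cells in the left and outer results are filled with missing values, which then need the same handling as any other missing data.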

63. What is equal-width discretization?

a) Equal bin sizes
b) Equal frequency
c) Supervised
d) Noise based
✅ Correct Answer: a) Equal bin sizes
📝 Explanation:
Equal-width divides range into uniform intervals.

64. What is Mahalanobis distance for outliers?

a) Accounts for covariance
b) Euclidean only
c) Manhattan
d) Simple threshold
✅ Correct Answer: a) Accounts for covariance
📝 Explanation:
It measures distance considering variable correlations.

65. What is hot-deck imputation?

a) Random similar donor value
b) Mean fill
c) Mode
d) Regression
✅ Correct Answer: a) Random similar donor value
📝 Explanation:
Hot-deck selects from observed values in similar cases.

66. What is cold-deck imputation?

a) External donor values
b) Internal random
c) Mean
d) Delete
✅ Correct Answer: a) External donor values
📝 Explanation:
Cold-deck uses values from another dataset or time.

67. What is feature engineering in preprocessing?

a) Creating new features
b) Deleting features
c) Scaling only
d) Encoding only
✅ Correct Answer: a) Creating new features
📝 Explanation:
It derives informative variables from raw data.

68. What is log transformation used for?

a) Handling skewness
b) Categorical encoding
c) Missing fill
d) Duplicate removal
✅ Correct Answer: a) Handling skewness
📝 Explanation:
Log reduces right-skewness in positive data.

69. What is Box-Cox transformation?

a) Stabilizes variance
b) Normalizes only
c) Discretizes
d) Clusters
✅ Correct Answer: a) Stabilizes variance
📝 Explanation:
Box-Cox estimates a power parameter that stabilizes variance and makes strictly positive data more nearly normal.

70. What is data profiling?

a) Summarizing data characteristics
b) Cleaning data
c) Modeling
d) Visualizing
✅ Correct Answer: a) Summarizing data characteristics
📝 Explanation:
Profiling assesses quality, structure, and content.

71. What is schema matching?

a) Aligning attributes across sources
b) Numerical scaling
c) Binning
d) Imputation
✅ Correct Answer: a) Aligning attributes across sources
📝 Explanation:
It resolves differences in data schemas during integration.

72. What is equal-frequency discretization?

a) Equal counts per bin
b) Equal widths
c) Supervised
d) Random
✅ Correct Answer: a) Equal counts per bin
📝 Explanation:
Quantile binning ensures similar sample sizes in bins.

73. What is Isolation Forest for outliers?

a) Anomaly isolation via trees
b) Clustering
c) Regression
d) PCA
✅ Correct Answer: a) Anomaly isolation via trees
📝 Explanation:
Anomalies are isolated in fewer random splits than normal points, so they have shorter average path lengths in the trees.

74. What is multiple imputation?

a) Creates several filled datasets
b) Single mean fill
c) Deletion
d) Mode only
✅ Correct Answer: a) Creates several filled datasets
📝 Explanation:
It reflects imputation uncertainty by analyzing each completed dataset and pooling the results.

75. What is polynomial feature generation?

a) Higher-order interactions
b) Linear only
c) Categorical
d) Text
✅ Correct Answer: a) Higher-order interactions
📝 Explanation:
It creates features like x^2 or x*y for non-linearity.

76. What is Yeo-Johnson transformation?

a) For negative values too
b) Positive only
c) Discretization
d) Encoding
✅ Correct Answer: a) For negative values too
📝 Explanation:
Extension of Box-Cox handling negative and zero values.

77. What is data validation in cleaning?

a) Ensuring accuracy and consistency
b) Visualization
c) Modeling
d) Storage
✅ Correct Answer: a) Ensuring accuracy and consistency
📝 Explanation:
Validation checks rules like range or format compliance.

78. What is record linkage?

a) Matching across datasets
b) Within dataset duplicates
c) Scaling
d) Binning
✅ Correct Answer: a) Matching across datasets
📝 Explanation:
It links records referring to the same entity.

79. What is unsupervised discretization?

a) No class labels used
b) Uses target
c) Supervised only
d) Regression based
✅ Correct Answer: a) No class labels used
📝 Explanation:
Methods like equal-width don't rely on target variables.

80. What is Local Outlier Factor (LOF)?

a) Density-based outlier score
b) Distance only
c) Global threshold
d) Simple Z-score
✅ Correct Answer: a) Density-based outlier score
📝 Explanation:
LOF compares local density to neighbors.

81. What is MICE imputation?

a) Multiple Imputation by Chained Equations
b) Single chain
c) Mean only
d) Delete
✅ Correct Answer: a) Multiple Imputation by Chained Equations
📝 Explanation:
MICE iteratively regresses each variable that has missing values on the other variables.

82. What is an interaction feature?

a) Product of two features
b) Single feature
c) Categorical
d) Target
✅ Correct Answer: a) Product of two features
📝 Explanation:
Captures combined effects, like age * income.

83. What is quantile transformation?

a) Maps to uniform distribution
b) Normalizes to Gaussian
c) Logs
d) Boxes
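✅ Correct Answer: a) Maps to uniform distribution
📝 Explanation:
It maps values through their empirical quantiles to a target distribution: uniform by default in scikit-learn's QuantileTransformer, or normal if requested.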

84. What is data auditing?

a) Systematic quality review
b) Random check
c) Visualization
d) Modeling
✅ Correct Answer: a) Systematic quality review
📝 Explanation:
Auditing identifies patterns of errors or anomalies.

85. What is object consolidation?

a) Merging duplicate entities
b) Splitting
c) Encoding
d) Scaling
✅ Correct Answer: a) Merging duplicate entities
📝 Explanation:
Part of integration resolving duplicates across sources.

86. What is clustering-based discretization?

a) Groups similar values
b) Equal width
c) Frequency
d) Chi-square
✅ Correct Answer: a) Groups similar values
📝 Explanation:
Uses clustering to form natural bins.

87. What is DBSCAN for outliers?

a) Density-based clustering flags noise
b) K-means
c) Hierarchical
d) Gaussian
✅ Correct Answer: a) Density-based clustering flags noise
📝 Explanation:
Points not in clusters are outliers in DBSCAN.

88. What is Kalman filter imputation?

a) For time series states
b) Static mean
c) Random
d) Mode
✅ Correct Answer: a) For time series states
📝 Explanation:
Predicts missing values using state-space models.

89. What is a lagged feature?

a) Previous time step value
b) Future
c) Average
d) Sum
✅ Correct Answer: a) Previous time step value
📝 Explanation:
Used in time series for autoregressive features.

90. What is power transformation?

a) General family for normality
b) Log only
c) Square root
d) Reciprocal
✅ Correct Answer: a) General family for normality
📝 Explanation:
Includes Box-Cox and Yeo-Johnson for stabilizing variance.

91. What is a referential integrity check?

a) Valid foreign keys
b) Data types
c) Ranges
d) Formats
✅ Correct Answer: a) Valid foreign keys
📝 Explanation:
Ensures links between tables are valid.

92. What is data deduplication?

a) Removing exact duplicates
b) Fuzzy only
c) Encoding
d) Scaling
✅ Correct Answer: a) Removing exact duplicates
📝 Explanation:
Identifies and eliminates identical records.

93. What is entropy-based binning?

a) Minimizes intra-bin impurity
b) Equal width
c) Frequency
d) Unsupervised
✅ Correct Answer: a) Minimizes intra-bin impurity
📝 Explanation:
Uses information entropy for supervised discretization.

94. What is one-class SVM for outliers?

a) Learns normal boundary
b) Supervised classification
c) Clustering
d) Regression
✅ Correct Answer: a) Learns normal boundary
📝 Explanation:
Flags points outside the learned normal region.

95. What is EM algorithm for imputation?

a) Expectation-Maximization
b) Simple mean
c) KNN
d) Delete
✅ Correct Answer: a) Expectation-Maximization
📝 Explanation:
EM alternates between estimating the missing values and re-estimating the model parameters until convergence.

96. What is a rolling window feature?

a) Moving statistics
b) Static
c) Lagged only
d) Future
✅ Correct Answer: a) Moving statistics
📝 Explanation:
Computes aggregates over time windows.
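
Lagged and rolling-window features can both be sketched in plain Python. The helpers `lag` and `rolling_mean` and the `sales` series are hypothetical, and `None` marks positions where not enough history exists yet.

```python
def lag(values, k=1):
    """Shift the series forward by k steps (k >= 1), padding with None."""
    return [None] * k + list(values[:-k])


def rolling_mean(values, window):
    """Mean of the current and previous (window - 1) observations."""
    out = []
    for i in range(len(values)):
        if i + 1 < window:
            out.append(None)  # not enough history yet
        else:
            chunk = values[i + 1 - window:i + 1]
            out.append(sum(chunk) / window)
    return out


# Hypothetical daily sales.
sales = [10, 20, 30, 40, 50]
sales_lag1 = lag(sales, k=1)
sales_roll3 = rolling_mean(sales, window=3)
print(sales_lag1)
print(sales_roll3)
```

Both features use only current and past values, which avoids leaking future information into a forecasting model.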

97. What is arcsinh transformation?

a) For heavy-tailed data
b) Log
c) Square
d) Identity
✅ Correct Answer: a) For heavy-tailed data
📝 Explanation:
The inverse hyperbolic sine compresses extremes much like the log but is also defined for zero and negative values.

98. What is domain validation?

a) Business rule checks
b) Syntax only
c) Length
d) Type
✅ Correct Answer: a) Business rule checks
📝 Explanation:
Ensures data fits domain-specific logic.

99. What is survivorship bias in cleaning?

a) Ignoring failed entities
b) Duplicates
c) Missing
d) Noise
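✅ Correct Answer: a) Ignoring failed entities
📝 Explanation:
Survivorship bias arises when only entities that survived a selection process are kept; mitigate it by including the failed or dropped-out ones in the historical data.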

100. What is k-bins discretization?

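a) Partitioning a continuous feature into k intervals
b) Equal
c) Random
d) Density
✅ Correct Answer: a) Partitioning a continuous feature into k intervals
📝 Explanation:
K-bins discretization splits a continuous feature into k bins; scikit-learn's KBinsDiscretizer, for example, offers uniform, quantile, and k-means strategies for placing the bin edges.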

101. What is elliptic envelope for outliers?

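a) Robust Gaussian covariance fit
b) Simple threshold
c) KNN
d) Tree
✅ Correct Answer: a) Robust Gaussian covariance fit
📝 Explanation:
It assumes the inliers follow a single Gaussian, fits a robust covariance estimate (minimum covariance determinant), and flags points with a large Mahalanobis distance.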

102. What is random forest imputation?

a) Tree-based predictions
b) Linear
c) Mean
d) Mode
✅ Correct Answer: a) Tree-based predictions
📝 Explanation:
Uses random forests to predict missing values from the other features.

103. What is Fourier transform feature?

a) Frequency domain for time series
b) Spatial
c) Categorical
d) Text
✅ Correct Answer: a) Frequency domain for time series
📝 Explanation:
Extracts periodic components.

104. What is square root transformation?

a) For count data skewness
b) Log
c) Power 3
d) Reciprocal
✅ Correct Answer: a) For count data skewness
📝 Explanation:
It helps stabilize variance in Poisson-like count data and reduces right skew.

105. What is a completeness check?

a) No missing required fields
b) Format
c) Range
d) Consistency
✅ Correct Answer: a) No missing required fields
📝 Explanation:
Verifies all mandatory data is present.

106. What is selection bias in data cleaning?

a) Non-random sampling
b) Duplicates
c) Outliers
d) Noise
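✅ Correct Answer: a) Non-random sampling
📝 Explanation:
Selection bias arises when the sample is not drawn at random from the population; address it by understanding and documenting how the data was sampled.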

107. What is CAIM discretization?

a) Class-Attribute Interdependence Maximization
b) Equal
c) Width
d) Frequency
✅ Correct Answer: a) Class-Attribute Interdependence Maximization
📝 Explanation:
Supervised method maximizing dependency.

108. What is COPOD for outliers?

a) Copula-based
b) Distance
c) Density
d) Tree
✅ Correct Answer: a) Copula-based
📝 Explanation:
Unsupervised using copulas for dependence.

109. What is Bayesian imputation?

a) Probabilistic filling
b) Deterministic
c) Mean
d) KNN
✅ Correct Answer: a) Probabilistic filling
📝 Explanation:
Incorporates prior distributions for estimates.

110. What is wavelet transform feature?

a) Multi-resolution analysis
b) Frequency only
c) Time only
d) Static
✅ Correct Answer: a) Multi-resolution analysis
📝 Explanation:
Decomposes signals into time-frequency components.

111. What is reciprocal transformation?

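a) 1/x for strong right-skew
b) Sqrt
c) Log
d) Square
✅ Correct Answer: a) 1/x for strong right-skew
📝 Explanation:
The reciprocal compresses large values more strongly than the log, so it is used for severe positive (right) skew in strictly positive data; it also reverses the ordering of values.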

112. What is a uniqueness check?

a) No unintended duplicates
b) Missing
c) Range
d) Type
✅ Correct Answer: a) No unintended duplicates
📝 Explanation:
Ensures primary keys are unique.

113. What is temporal bias in cleaning?

a) Time-period specific data
b) Spatial
c) Selection
d) Confirmation
✅ Correct Answer: a) Time-period specific data
📝 Explanation:
Temporal bias arises when data covers only a specific time period; mitigate it by balancing data across periods.

114. What is MDLP discretization?

a) Minimum Description Length Principle
b) Equal width
c) Frequency
d) Clustering
✅ Correct Answer: a) Minimum Description Length Principle
📝 Explanation:
Supervised stopping criterion for binning.

115. What is KNN for outliers?

a) Distance to neighbors
b) Density
c) Copula
d) Tree
✅ Correct Answer: a) Distance to neighbors
📝 Explanation:
High distance indicates isolation.

116. What is matrix factorization imputation?

a) Low-rank approximation
b) Regression
c) Tree
d) Mean
✅ Correct Answer: a) Low-rank approximation
📝 Explanation:
Fills sparse matrices like in recommender systems.

117. What is embedding feature for text?

a) Dense vector representations
b) Bag of words
c) TF-IDF
d) Count
✅ Correct Answer: a) Dense vector representations
📝 Explanation:
Word2Vec or BERT captures semantic meaning.

118. What is cube root transformation?

a) Milder than log for skewness
b) Stronger
c) Reciprocal
d) Square
✅ Correct Answer: a) Milder than log for skewness
📝 Explanation:
x^(1/3) for moderate right-skew.

119. What is a timeliness check?

a) Data currency
b) Accuracy
c) Completeness
d) Consistency
✅ Correct Answer: a) Data currency
📝 Explanation:
Verifies data is up-to-date.

120. What is confirmation bias in cleaning?

a) Retaining supporting data
b) Random
c) Temporal
d) Spatial
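✅ Correct Answer: a) Retaining supporting data
📝 Explanation:
Confirmation bias means keeping data that supports a hypothesis and discarding data that contradicts it; avoid it by fixing objective cleaning criteria in advance.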

121. What is FUSINTER discretization?

a) Fuzzy unsupervised
b) Supervised
c) Equal
d) Chi
✅ Correct Answer: b) Supervised
📝 Explanation:
FUSINTER is a supervised, bottom-up method that merges adjacent intervals according to a class-based impurity criterion.

122. What is angle-based outlier detection?

a) ABOD using angles
b) Distance
c) Density
d) Variance
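✅ Correct Answer: a) ABOD using angles
📝 Explanation:
ABOD scores points by the variance of the angles they form with pairs of other points, which degrades less than distances in high dimensions; the exact method is computationally expensive.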

123. What is deep learning imputation?

a) Autoencoder-based
b) Linear
c) Tree
d) KNN
✅ Correct Answer: a) Autoencoder-based
📝 Explanation:
Learns latent representations for filling.

124. What is PCA feature for images?

a) Eigenfaces
b) Pixel counts
c) Colors
d) Sizes
✅ Correct Answer: a) Eigenfaces
📝 Explanation:
Reduces dimensionality in face recognition.

125. What is exponential transformation?

a) Reduces left (negative) skew
b) Right to left
c) Normal
d) Uniform
✅ Correct Answer: a) Reduces left (negative) skew
📝 Explanation:
e^x stretches higher values more than lower ones, pulling out the upper tail and reducing left skew.

126. What is an accuracy check?

a) Matches reality
b) Format
c) Uniqueness
d) Timeliness
✅ Correct Answer: a) Matches reality
📝 Explanation:
Verifies data correctness against sources.

127. What is spatial bias in data?

a) Geographic imbalance
b) Time
c) Confirmation
d) Selection
✅ Correct Answer: a) Geographic imbalance
📝 Explanation:
Clean by sampling across regions.