120 industry-level multiple-choice questions on data cleaning, handling missing values, outliers, encoding, scaling, and preprocessing pipelines—modeled after real data scientist and analyst interviews at FAANG, fintech, and consulting firms.
120 Data Cleaning and Preprocessing in Data Analysis - MCQs
✅ Correct Answer: d) Database size
📝 Explanation:
Data quality concerns accuracy, completeness, and consistency, not the size of the database.
✅ Correct Answer: a) Ignoring the tuple
📝 Explanation:
Ignoring a tuple whose class label is missing is the simplest strategy, acceptable when few tuples are affected.
✅ Correct Answer: b) Decision tree induction
📝 Explanation:
Decision trees can predict probable values for missing data based on other attributes.
✅ Correct Answer: b) To prepare data for effective analysis
📝 Explanation:
Preprocessing transforms raw data into a format suitable for analysis and modeling.
✅ Correct Answer: b) Normalization
📝 Explanation:
Normalization scales features to a common range, such as 0 to 1, so that features with large magnitudes do not dominate.
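As a minimal sketch of the 0-to-1 scaling described above, the snippet below applies min-max normalization with the standard library only. The function name `min_max_scale` and the sample incomes are illustrative assumptions, not part of the quiz.

```python
def min_max_scale(values):
    """Rescale a list of numbers linearly onto the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant feature has no spread; map every value to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical feature with a large magnitude.
incomes = [20_000, 35_000, 50_000, 80_000]
scaled_incomes = min_max_scale(incomes)
print(scaled_incomes)
```

The smallest value maps to 0 and the largest to 1, so a single extreme value compresses everything else toward 0.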
✅ Correct Answer: b) Biased model training
📝 Explanation:
Unaddressed missing values can skew results and lead to inaccurate predictions.
✅ Correct Answer: b) K-NN imputation
📝 Explanation:
K-Nearest Neighbors imputation estimates a missing value from the most similar data points.
✅ Correct Answer: a) Smoothing noisy data
📝 Explanation:
Binning groups sorted data into bins and replaces each value with its bin mean (or median or boundary) to reduce noise.
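Smoothing by bin means can be sketched as follows. This version assumes equal-frequency bins of a fixed size; `smooth_by_bin_means` and the price list are hypothetical names and data chosen for illustration.

```python
def smooth_by_bin_means(values, bin_size):
    """Sort the values, split them into bins, and replace each by its bin mean."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_values = ordered[start:start + bin_size]
        mean = sum(bin_values) / len(bin_values)
        # Every member of the bin is replaced by the same smoothed value.
        smoothed.extend([mean] * len(bin_values))
    return smoothed


prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
smoothed_prices = smooth_by_bin_means(prices, bin_size=3)
print(smoothed_prices)
```

Note that the result follows the sorted order of the input, not the original order.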
✅ Correct Answer: d) All of the above
📝 Explanation:
Duplicates can bias models, distort variance estimates, and slow processing.
✅ Correct Answer: b) One-hot encoding
📝 Explanation:
One-hot encoding creates a dummy variable for each category without implying any order.
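A small sketch of one-hot encoding, written without external libraries. The helper `one_hot_encode` and the color values are assumptions for illustration; in practice a library routine would usually be used instead.

```python
def one_hot_encode(values):
    """Return the sorted category list and one 0/1 row per input value."""
    categories = sorted(set(values))
    rows = [
        [1 if value == category else 0 for category in categories]
        for value in values
    ]
    return categories, rows


colors = ["red", "green", "blue", "green"]
categories, encoded = one_hot_encode(colors)
print(categories)
print(encoded)
```

Each row contains exactly one 1, so no category is treated as larger or smaller than another.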
✅ Correct Answer: b) Transforming messy data into clean format
📝 Explanation:
Data wrangling involves cleaning and restructuring data for analysis.
✅ Correct Answer: a) Z-score method
📝 Explanation:
The Z-score method flags points more than 3 standard deviations from the mean as outliers.
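The 3-standard-deviation rule can be sketched with the standard library as below. The function name, the threshold default, and the sample data are illustrative assumptions; the population standard deviation is used here, which is one of several reasonable choices.

```python
import statistics


def zscore_outliers(values, threshold=3.0):
    """Return the values whose absolute z-score exceeds the threshold."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # population standard deviation
    if std == 0:
        return []  # all values identical, nothing can be an outlier
    return [v for v in values if abs((v - mean) / std) > threshold]


# Twenty readings near 50 plus one extreme value.
readings = [50, 52, 48, 51, 49, 50, 53, 47, 50, 51,
            49, 52, 48, 50, 51, 49, 50, 52, 48, 50, 120]
outliers = zscore_outliers(readings)
print(outliers)
```

With very small samples no point can reach a z-score of 3, because the extreme value itself inflates the mean and standard deviation, so the rule needs a reasonable number of observations.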
✅ Correct Answer: b) To make features comparable
📝 Explanation:
Scaling ensures no feature dominates due to differing units or ranges.
✅ Correct Answer: b) For ordinal data
📝 Explanation:
Label encoding assigns an integer to each category; it suits ordinal data when the integers follow the category order.
✅ Correct Answer: a) Combining multiple data sources
📝 Explanation:
Data integration merges datasets from different sources into a unified view.
✅ Correct Answer: b) Mode imputation
📝 Explanation:
Mode imputation fills missing categorical values with the most frequent value.
✅ Correct Answer: a) Reduces variance
📝 Explanation:
Filling gaps with the mean underestimates the variability in the dataset.
✅ Correct Answer: b) Binning
📝 Explanation:
Binning sorts data into intervals and smooths by boundary or mean values.
✅ Correct Answer: b) Reducing data volume while preserving information
📝 Explanation:
Data reduction techniques such as PCA shrink the data while retaining most of its information.
✅ Correct Answer: b) For skewed distributions
📝 Explanation:
The median is robust to outliers and skewness, unlike the mean.
✅ Correct Answer: a) Principal Component Analysis
📝 Explanation:
PCA transforms data into principal components to reduce dimensions.
✅ Correct Answer: b) Data integration
📝 Explanation:
Integration resolves inconsistencies when merging multiple data sources.
✅ Correct Answer: b) Cap or floor values
📝 Explanation:
Capping limits extreme values to chosen thresholds, such as the 1st and 99th percentiles.
✅ Correct Answer: a) isnull()
📝 Explanation:
Pandas' isnull() returns a boolean mask for missing values.
✅ Correct Answer: a) Filling with previous value
📝 Explanation:
Forward fill propagates the last valid observation forward.
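Forward fill can be sketched in plain Python as follows, using `None` to mark a missing value. The function `forward_fill` and the readings are hypothetical and only illustrate the idea.

```python
def forward_fill(values):
    """Replace each None with the most recent non-missing value."""
    filled = []
    last_valid = None
    for value in values:
        if value is not None:
            last_valid = value
        # A leading gap has no earlier observation, so it stays None.
        filled.append(last_valid)
    return filled


readings = [None, 10, None, None, 13, None]
filled_readings = forward_fill(readings)
print(filled_readings)
```

The first element remains missing because nothing precedes it, which is why forward fill is often followed by a second strategy for leading gaps.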
✅ Correct Answer: b) Z-score normalization
📝 Explanation:
Z-score normalization standardizes values using the mean and standard deviation.
✅ Correct Answer: b) Parsing errors
📝 Explanation:
Inconsistent formats, such as mixed date formats, can cause loading or computation failures.
✅ Correct Answer: b) Joining
📝 Explanation:
Joining combines tables based on matching keys like IDs.
✅ Correct Answer: a) Continuous to categorical conversion
📝 Explanation:
Discretization bins continuous values into discrete intervals.
✅ Correct Answer: a) Interquartile Range
📝 Explanation:
The IQR method flags values below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR.
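The usual IQR fences, Q1 - 1.5 * IQR and Q3 + 1.5 * IQR, can be sketched like this. The helper name and data are assumptions, and the "inclusive" quantile method is one of several conventions, so fence values can differ slightly between tools.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return (lower fence, upper fence, values outside the fences)."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    flagged = [v for v in values if v < lower or v > upper]
    return lower, upper, flagged


data = [2, 4, 4, 5, 6, 7, 8, 9, 50]
lower, upper, flagged = iqr_outliers(data)
print(lower, upper, flagged)
```

Because quartiles are barely affected by a single extreme value, this rule works on small samples where a z-score threshold would not.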
✅ Correct Answer: a) Data loss
📝 Explanation:
Deletion reduces sample size, potentially biasing the dataset.
✅ Correct Answer: b) Frequency encoding
📝 Explanation:
Frequency encoding replaces categories with their occurrence counts.
✅ Correct Answer: a) Normalization and aggregation
📝 Explanation:
Transformation alters data structure, for example by normalizing or aggregating.
✅ Correct Answer: a) Remove correlated features
📝 Explanation:
Removing highly correlated features reduces redundancy and instability.
✅ Correct Answer: a) Random errors or variances
📝 Explanation:
Noise is random error or variance in a measured variable that distorts patterns.
✅ Correct Answer: b) Pandas
📝 Explanation:
Pandas provides DataFrames for efficient data cleaning and transformation.
✅ Correct Answer: a) Filling with next value
📝 Explanation:
Backward fill uses the next valid observation to fill gaps.
✅ Correct Answer: b) Robust scaling
📝 Explanation:
Robust scaling centers on the median and scales by the IQR, so extreme values have little influence.
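A minimal sketch of robust scaling, assuming the common definition (value - median) / IQR. The function name, the "inclusive" quartile method, and the data are illustrative choices rather than a fixed standard.

```python
import statistics


def robust_scale(values):
    """Center on the median and divide by the interquartile range."""
    median = statistics.median(values)
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    if iqr == 0:
        return [0.0 for _ in values]  # no spread in the middle half
    return [(v - median) / iqr for v in values]


data = [2, 4, 4, 5, 6, 7, 8, 9, 50]
scaled = robust_scale(data)
print(scaled)
```

The extreme value 50 still ends up far from the rest, but it does not change the median or IQR, so the typical values keep a sensible scale.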
✅ Correct Answer: a) Different naming conventions
📝 Explanation:
Synonyms or varying abbreviations across sources lead to inconsistencies.
✅ Correct Answer: a) Only matching records
📝 Explanation:
An inner join returns only rows with matching keys in both datasets.
✅ Correct Answer: a) Supervised binning using class info
📝 Explanation:
It uses information gain to create bins that maximize class separation.
✅ Correct Answer: d) All of the above
📝 Explanation:
Visual tools such as box plots and scatter plots help identify outliers.
✅ Correct Answer: b) When data loss is acceptable
📝 Explanation:
Listwise deletion removes every row with any missing value, which is acceptable only if the sample size allows.
✅ Correct Answer: b) Converts to binary bits
📝 Explanation:
Binary encoding writes each category's integer code in binary, one column per bit, so n categories need about log2(n) columns instead of the n that one-hot requires.
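To make the column count concrete, here is a rough sketch of binary encoding: each category gets an integer code, and the code is written out as bits. The function `binary_encode`, the zero-based codes, and the plan names are assumptions for illustration; libraries may number categories differently.

```python
def binary_encode(values):
    """Return (number of bit columns, one list of bits per input value)."""
    categories = sorted(set(values))
    # Bits needed to represent the largest code, len(categories) - 1.
    width = max(1, (len(categories) - 1).bit_length())
    codes = {category: index for index, category in enumerate(categories)}
    rows = [
        [int(bit) for bit in format(codes[value], f"0{width}b")]
        for value in values
    ]
    return width, rows


plans = ["basic", "free", "plus", "pro", "team"]
width, rows = binary_encode(plans)
print(width, rows)
```

Five categories fit in 3 bit columns, where one-hot encoding would need 5; the saving grows with the number of categories.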
✅ Correct Answer: a) Summarizing data
📝 Explanation:
Aggregation computes summaries like means or counts from groups.
✅ Correct Answer: c) Both a and b
📝 Explanation:
High pairwise correlations or a variance inflation factor (VIF) above about 5 indicate multicollinearity.
✅ Correct Answer: b) Regression smoothing
📝 Explanation:
Regression smoothing fits a model to the data and uses the fitted values to reduce noise.
✅ Correct Answer: a) drop_duplicates()
📝 Explanation:
drop_duplicates() removes repeated rows, optionally judged on a specified subset of columns.
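A short pandas sketch of both uses of `drop_duplicates()`: on whole rows and on a subset of columns. The `orders` table and its column names are made up for illustration, and the snippet assumes pandas is installed.

```python
import pandas as pd

# Hypothetical orders table with one exact duplicate row (order_id 2).
orders = pd.DataFrame({
    "order_id": [1, 2, 2, 3],
    "customer": ["ana", "ben", "ben", "ana"],
    "amount": [10.0, 25.0, 25.0, 10.0],
})

# Drop rows that are identical in every column.
exact = orders.drop_duplicates()

# Keep only the first row seen for each customer.
by_customer = orders.drop_duplicates(subset=["customer"], keep="first")

print(exact)
print(by_customer)
```

Passing `subset` changes what counts as a duplicate, so it is worth stating explicitly which columns identify a record.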
✅ Correct Answer: a) Linear estimation between points
📝 Explanation:
Linear interpolation estimates a missing value from the surrounding data points.
✅ Correct Answer: a) Sum to 1
📝 Explanation:
L1 normalization divides each value by the sum of absolute values, so the absolute values sum to 1.
✅ Correct Answer: a) Merging similar records
📝 Explanation:
It identifies and merges duplicates across datasets.
✅ Correct Answer: b) Left join
📝 Explanation:
A left join keeps all rows from the left table and the matching rows from the right.
✅ Correct Answer: a) Supervised using chi-square
📝 Explanation:
ChiMerge uses chi-square tests to determine bin boundaries.
✅ Correct Answer: b) 3
📝 Explanation:
Values beyond 3 standard deviations are typically treated as outliers.
✅ Correct Answer: a) For correlations
📝 Explanation:
Pairwise deletion uses all available pairs of values, maximizing the data used for analyses such as correlations.
✅ Correct Answer: a) Replace with mean target
📝 Explanation:
Target encoding uses the mean of the target for each category.
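Target encoding can be sketched as below. The helper `target_encode`, the city labels, and the churn flags are hypothetical; in real use the category means should be computed on training data only, otherwise the target leaks into the feature.

```python
from collections import defaultdict


def target_encode(categories, targets):
    """Replace each category with the mean target value seen for it."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for category, target in zip(categories, targets):
        totals[category] += target
        counts[category] += 1
    means = {category: totals[category] / counts[category] for category in totals}
    return [means[category] for category in categories], means


cities = ["NY", "LA", "NY", "SF", "LA", "NY", "NY"]
churned = [1, 0, 1, 0, 1, 0, 1]
encoded, means = target_encode(cities, churned)
print(means)
print(encoded)
```

Categories with very few rows, such as "SF" here, get noisy means, which is why smoothing toward the global mean is a common refinement.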
✅ Correct Answer: a) Local averaging
📝 Explanation:
Windowing averages values in a sliding window to smooth noise.
✅ Correct Answer: b) Removes rows with missing
📝 Explanation:
dropna() deletes rows or columns containing NaN values.
✅ Correct Answer: a) Piecewise polynomial fitting
📝 Explanation:
Splines use smooth polynomials between points for interpolation.
✅ Correct Answer: a) Euclidean norm to 1
📝 Explanation:
L2 normalization divides each value by the square root of the sum of squares, giving a Euclidean norm of 1.
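The L1 and L2 normalizations described in this quiz can be compared side by side in a small sketch. The function names and the sample vector are illustrative assumptions.

```python
import math


def l1_normalize(vector):
    """Divide by the sum of absolute values so they sum to 1."""
    total = sum(abs(x) for x in vector)
    return [x / total for x in vector] if total else list(vector)


def l2_normalize(vector):
    """Divide by the Euclidean length so the vector has norm 1."""
    length = math.sqrt(sum(x * x for x in vector))
    return [x / length for x in vector] if length else list(vector)


v = [3.0, -4.0]
l1 = l1_normalize(v)
l2 = l2_normalize(v)
print(l1, l2)
```

For this vector the L1 sum is 7 and the L2 length is 5, so the same input yields different scaled values under each norm.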
✅ Correct Answer: a) Approximate string matching
📝 Explanation:
Fuzzy matching handles typos or variations in entity names.
✅ Correct Answer: a) All records from both
📝 Explanation:
A full outer join includes all rows from both tables, filling non-matches with nulls.
✅ Correct Answer: a) Equal bin sizes
📝 Explanation:
Equal-width binning divides the range into intervals of uniform width.
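Equal-width binning can be sketched as follows: the range from minimum to maximum is cut into intervals of the same width, and each value gets the index of its interval. The helper name and the ages are made up for illustration.

```python
def equal_width_bins(values, n_bins):
    """Assign each value a bin index from 0 to n_bins - 1."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        return [0 for _ in values]  # all values identical
    # The maximum would fall just past the last bin, so clamp it in.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


ages = [18, 22, 35, 41, 58, 60, 78]
age_bins = equal_width_bins(ages, n_bins=3)
print(age_bins)
```

Here each bin spans 20 years, but the bins hold 3, 1, and 3 values: equal width does not mean equal counts.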
✅ Correct Answer: a) Accounts for covariance
📝 Explanation:
It measures distance while accounting for correlations between variables.
✅ Correct Answer: a) Random similar donor value
📝 Explanation:
Hot-deck imputation selects from observed values in similar cases.
✅ Correct Answer: a) External donor values
📝 Explanation:
Cold-deck imputation uses values from another dataset or time period.
✅ Correct Answer: a) Creating new features
📝 Explanation:
It derives informative variables from raw data.
✅ Correct Answer: a) Handling skewness
📝 Explanation:
A log transform reduces right-skewness in positive data.
✅ Correct Answer: a) Stabilizes variance
📝 Explanation:
It finds the optimal power to make data more normal.
✅ Correct Answer: a) Summarizing data characteristics
📝 Explanation:
Profiling assesses quality, structure, and content.
✅ Correct Answer: a) Aligning attributes across sources
📝 Explanation:
It resolves differences in data schemas during integration.
✅ Correct Answer: a) Equal counts per bin
📝 Explanation:
Quantile binning ensures similar sample sizes in bins.
✅ Correct Answer: a) Anomaly isolation via trees
📝 Explanation:
It isolates outliers in fewer splits than normal points.
✅ Correct Answer: a) Creates several filled datasets
📝 Explanation:
It accounts for uncertainty by pooling results across multiple imputed datasets.
✅ Correct Answer: a) Higher-order interactions
📝 Explanation:
It creates features like x^2 or x*y to capture non-linearity.
✅ Correct Answer: a) For negative values too
📝 Explanation:
It is an extension of Box-Cox that handles negative and zero values.
✅ Correct Answer: a) Ensuring accuracy and consistency
📝 Explanation:
Validation checks rules like range or format compliance.
✅ Correct Answer: a) Matching across datasets
📝 Explanation:
It links records referring to the same entity.
✅ Correct Answer: a) No class labels used
📝 Explanation:
Methods like equal-width binning do not rely on the target variable.
✅ Correct Answer: a) Density-based outlier score
📝 Explanation:
LOF compares a point's local density with that of its neighbors.
✅ Correct Answer: a) Multiple Imputation by Chained Equations
📝 Explanation:
It runs an iterative regression for each variable with missing values.
✅ Correct Answer: a) Product of two features
📝 Explanation:
It captures combined effects, like age * income.
✅ Correct Answer: a) Maps to uniform distribution
📝 Explanation:
It ranks the data and maps it to a uniform or normal distribution.
✅ Correct Answer: a) Systematic quality review
📝 Explanation:
Auditing identifies patterns of errors or anomalies.
✅ Correct Answer: a) Merging duplicate entities
📝 Explanation:
It is the part of integration that resolves duplicates across sources.
✅ Correct Answer: a) Groups similar values
📝 Explanation:
It uses clustering to form natural bins.
✅ Correct Answer: a) Density-based clustering flags noise
📝 Explanation:
In DBSCAN, points that belong to no cluster are treated as outliers.
✅ Correct Answer: a) For time series states
📝 Explanation:
It predicts missing values using state-space models.
✅ Correct Answer: a) Previous time step value
📝 Explanation:
It is used in time series as an autoregressive feature.
✅ Correct Answer: a) General family for normality
📝 Explanation:
The family includes Box-Cox and Yeo-Johnson, which stabilize variance.
✅ Correct Answer: a) Valid foreign keys
📝 Explanation:
It ensures links between tables are valid.
✅ Correct Answer: a) Removing exact duplicates
📝 Explanation:
It identifies and eliminates identical records.
✅ Correct Answer: a) Minimizes intra-bin impurity
📝 Explanation:
It uses information entropy for supervised discretization.
✅ Correct Answer: a) Learns normal boundary
📝 Explanation:
It flags points outside the learned normal region.
✅ Correct Answer: a) Expectation-Maximization
📝 Explanation:
It iteratively estimates the parameters and the missing values.
✅ Correct Answer: a) Moving statistics
📝 Explanation:
It computes aggregates over sliding time windows.
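A moving-window aggregate can be sketched with a trailing rolling mean. The function `rolling_mean`, the window length, and the sales figures are hypothetical; positions without a full window are left as `None` here, which is one common convention.

```python
def rolling_mean(values, window):
    """Trailing mean over the last `window` observations at each step."""
    result = []
    for i in range(len(values)):
        if i + 1 < window:
            result.append(None)  # not enough history yet
        else:
            chunk = values[i + 1 - window:i + 1]
            result.append(sum(chunk) / window)
    return result


sales = [10, 20, 30, 40, 50]
sales_ma3 = rolling_mean(sales, window=3)
print(sales_ma3)
```

Using only past and current values keeps the feature free of look-ahead, which matters when it feeds a forecasting model.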
✅ Correct Answer: a) For heavy-tailed data
📝 Explanation:
The inverse hyperbolic sine behaves like log for large values and is also defined at zero and for negative values.
✅ Correct Answer: a) Business rule checks
📝 Explanation:
It ensures data fits domain-specific logic.
✅ Correct Answer: a) Ignoring failed entities
📝 Explanation:
Correct it by including all historical data, not only the survivors.
✅ Correct Answer: a) Optimal bins via dynamic programming
📝 Explanation:
It minimizes the error for a fixed number of bins k.
✅ Correct Answer: a) Gaussian mixture based
📝 Explanation:
It fits a minimum covariance determinant estimate.
✅ Correct Answer: a) Tree-based predictions
📝 Explanation:
It uses forests to predict missing values from the other features.
✅ Correct Answer: a) Frequency domain for time series
📝 Explanation:
It extracts periodic components.
✅ Correct Answer: a) For count data skewness
📝 Explanation:
It stabilizes variance in Poisson-like count data.
✅ Correct Answer: a) No missing required fields
📝 Explanation:
It verifies that all mandatory data is present.
✅ Correct Answer: a) Non-random sampling
📝 Explanation:
Address it by understanding the sampling method.
✅ Correct Answer: a) Class-Attribute Interdependence Maximization
📝 Explanation:
It is a supervised method that maximizes the class-attribute dependency.
✅ Correct Answer: a) Copula-based
📝 Explanation:
It is an unsupervised method that uses copulas to model dependence.
✅ Correct Answer: a) Probabilistic filling
📝 Explanation:
It incorporates prior distributions into the estimates.
✅ Correct Answer: a) Multi-resolution analysis
📝 Explanation:
It decomposes signals into time-frequency components.
✅ Correct Answer: a) 1/x for left-skew
📝 Explanation:
Inverts values to handle negative skew.
✅ Correct Answer: a) No unintended duplicates
📝 Explanation:
It ensures primary keys are unique.
✅ Correct Answer: a) Time-period specific data
📝 Explanation:
Balance the data across time periods.
✅ Correct Answer: a) Minimum Description Length Principle
📝 Explanation:
It serves as a supervised stopping criterion for binning.
✅ Correct Answer: a) Distance to neighbors
📝 Explanation:
A large distance to neighbors indicates isolation.
✅ Correct Answer: a) Low-rank approximation
📝 Explanation:
It fills sparse matrices, as in recommender systems.
✅ Correct Answer: a) Dense vector representations
📝 Explanation:
Models such as Word2Vec or BERT capture semantic meaning.
✅ Correct Answer: a) Milder than log for skewness
📝 Explanation:
The cube root, x^(1/3), suits moderate right-skew.
✅ Correct Answer: a) Data currency
📝 Explanation:
It verifies that data is up to date.
✅ Correct Answer: a) Retaining supporting data
📝 Explanation:
Avoid it by applying objective criteria.
✅ Correct Answer: a) Fuzzy unsupervised
📝 Explanation:
It handles overlapping bins with fuzziness.
✅ Correct Answer: a) ABOD using angles
📝 Explanation:
It stays effective in high dimensions by using angles.
✅ Correct Answer: a) Autoencoder-based
📝 Explanation:
It learns latent representations for filling.
✅ Correct Answer: a) Eigenfaces
📝 Explanation:
Eigenfaces apply PCA to reduce dimensionality in face recognition.
✅ Correct Answer: a) For left-skew to right
📝 Explanation:
e^x stretches the higher values, which counteracts left skew.
✅ Correct Answer: a) Matches reality
📝 Explanation:
It verifies data correctness against sources.
✅ Correct Answer: a) Geographic imbalance
📝 Explanation:
Clean by sampling across regions.