Data Analyst Interview Questions & Answers for 2026

Curated questions covering core concepts, practical scenarios, and tradeoffs — suitable for fresher, 2-year, and 5-year experience levels.

Q1. What is the difference between a fact table and a dimension table in a data warehouse?

A fact table stores measurable, quantitative data — the metrics of a business process like sales transactions, page views, or order amounts. Each row represents a business event and contains numerical measures plus foreign keys to dimension tables. A dimension table stores descriptive attributes that contextualise the facts — date, product, customer, location. The star schema joins fact tables to dimension tables; the snowflake schema normalises dimension tables further. Fact tables are typically large (millions of rows), dimension tables are smaller but wider. This separation enables fast OLAP queries via slicing and dicing.
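As a rough illustration of the star schema described above, the sketch below builds one fact table and two dimension tables in an in-memory SQLite database and aggregates a measure by dimension attributes. The table names, column names, and rows (fact_sales, dim_product, dim_date) are hypothetical and chosen only for the example.

```python
import sqlite3

# Hypothetical star schema: one fact table keyed to two dimension tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, iso_date TEXT, month TEXT);
CREATE TABLE fact_sales (
    sale_id    INTEGER PRIMARY KEY,
    product_id INTEGER REFERENCES dim_product(product_id),
    date_id    INTEGER REFERENCES dim_date(date_id),
    quantity   INTEGER,   -- measure
    amount     REAL       -- measure
);
INSERT INTO dim_product VALUES (1, 'Laptop', 'Electronics'), (2, 'Desk', 'Furniture');
INSERT INTO dim_date VALUES (1, '2026-01-15', '2026-01'), (2, '2026-02-03', '2026-02');
INSERT INTO fact_sales VALUES
    (1, 1, 1, 2, 2000.0),
    (2, 2, 1, 1, 300.0),
    (3, 1, 2, 1, 1000.0);
""")

# "Slicing and dicing": sum a fact measure, grouped by dimension attributes.
rows = conn.execute("""
    SELECT p.category, d.month, SUM(f.amount) AS revenue
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_id = f.product_id
    JOIN dim_date AS d ON d.date_id = f.date_id
    GROUP BY p.category, d.month
    ORDER BY p.category, d.month
""").fetchall()

for category, month, revenue in rows:
    print(category, month, revenue)
```

The fact table holds only keys and numbers; every descriptive label in the output comes from a dimension table through a join.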

Q2. What is the difference between GROUP BY and PARTITION BY in SQL?

GROUP BY collapses multiple rows into a single row per group — it reduces the number of rows in the result set. PARTITION BY is used in window functions and divides rows into groups for calculation without collapsing rows — each original row remains in the result with the window function result added as a new column. Example: GROUP BY department gives one row per department with total salary. PARTITION BY department in SUM(salary) OVER (PARTITION BY department) gives every employee row plus their department total in the same result.
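The difference is easiest to see by running both queries on the same rows. The sketch below uses Python's built-in sqlite3 module with a hypothetical employees table; it assumes the bundled SQLite is version 3.25 or later, which is when window functions were added.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employees (name TEXT, department TEXT, salary INTEGER);
INSERT INTO employees VALUES
    ('Asha', 'Sales', 50000),
    ('Ben', 'Sales', 60000),
    ('Chen', 'Engineering', 90000);
""")

# GROUP BY: one output row per department.
grouped = conn.execute("""
    SELECT department, SUM(salary) AS total_salary
    FROM employees
    GROUP BY department
    ORDER BY department
""").fetchall()

# PARTITION BY: every employee row is kept, with the department total attached.
windowed = conn.execute("""
    SELECT name, department, salary,
           SUM(salary) OVER (PARTITION BY department) AS dept_total
    FROM employees
    ORDER BY name
""").fetchall()

print(len(grouped), "rows from GROUP BY")
print(len(windowed), "rows from PARTITION BY")
```

Three input rows produce two rows from GROUP BY and three rows from the window query, which is the row-collapsing behaviour described above.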

Q3. How do you handle missing data in a dataset?

First understand why data is missing: MCAR (Missing Completely At Random: dropping rows does not bias the analysis, though it reduces sample size), MAR (Missing At Random: missingness depends on other observed variables), MNAR (Missing Not At Random: missingness depends on the unobserved value itself, so dropping rows or imputing naively biases the analysis). Options: drop rows/columns with too many nulls, impute with mean/median for numerical data or mode for categorical data, use forward-fill or backward-fill for time series, use model-based imputation (KNN, MICE) for complex cases. Always document the imputation strategy and its impact. In pandas: df.fillna(), df.dropna(); in scikit-learn: SimpleImputer. The choice depends on the missingness mechanism, the percentage missing, and the downstream analysis.
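A minimal sketch of two of the simpler options, median imputation and forward-fill, written with only the standard library so the logic is visible. The helper names and sample values are hypothetical; in pandas the same steps would typically be done with fillna() using the column median and with ffill().

```python
from statistics import median

def impute_median(values):
    """Replace None with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]

def forward_fill(values):
    """Carry the last observed value forward; leading gaps stay None."""
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled

ages = [25, None, 31, 40, None]               # cross-sectional column
daily_price = [None, 10.0, None, None, 12.5]  # time series

missing_share = ages.count(None) / len(ages)  # check how much is missing first
ages_imputed = impute_median(ages)
price_filled = forward_fill(daily_price)

print(missing_share, ages_imputed, price_filled)
```

Forward-fill deliberately leaves the first value empty because there is no earlier observation to carry forward; whether to back-fill or drop it is a separate decision worth documenting.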

Q4. What is the Central Limit Theorem and why does it matter for data analysis?

The Central Limit Theorem states that the distribution of sample means approaches a normal distribution as sample size increases, regardless of the shape of the original population distribution, provided the observations are independent and the population has finite variance. The n >= 30 threshold is a common rule of thumb, not a guarantee: heavily skewed or heavy-tailed data can need much larger samples. This matters because many statistical tests (t-tests, confidence intervals, ANOVA) rely on the sample mean being approximately normal. CLT justifies applying these tests even when the underlying data is skewed or non-normal, as long as the sample is large enough. It also explains why averages of large datasets are more predictable than individual data points, which underlies A/B testing methodology and polling statistics.
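One way to see the theorem at work is a small simulation. The sketch below draws repeated samples from a strongly right-skewed Exponential(1) population (mean 1, standard deviation 1) and compares the distribution of sample means for n = 2 and n = 50. The population, seed, and trial count are arbitrary choices for illustration.

```python
import random
from statistics import mean, stdev

random.seed(42)  # fixed seed so the run is repeatable

def sample_means(n, trials=2000):
    """Means of `trials` samples of size n from a skewed Exponential(1) population."""
    return [mean(random.expovariate(1.0) for _ in range(n)) for _ in range(trials)]

def skewness(xs):
    """Simple moment-based skewness; roughly 0 for a symmetric distribution."""
    m, s = mean(xs), stdev(xs)
    return mean(((x - m) / s) ** 3 for x in xs)

means_n2 = sample_means(2)
means_n50 = sample_means(50)

for label, xs in (("n=2", means_n2), ("n=50", means_n50)):
    print(label, "mean:", round(mean(xs), 3),
          "sd:", round(stdev(xs), 3),
          "skewness:", round(skewness(xs), 3))
```

The sample means stay centred on the population mean, their spread shrinks roughly like 1/sqrt(n), and their skewness moves toward zero as n grows, which is the behaviour the theorem predicts.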

Q5. What is the difference between correlation and causation? Give an example.

Correlation means two variables move together statistically. Causation means one variable directly causes the other to change. Correlation does not imply causation — a classic example: ice cream sales and drowning rates are positively correlated, but ice cream does not cause drowning — both are caused by a third variable (summer heat). In data analysis confounding variables, reverse causation, and coincidental correlations are common. To establish causation you need randomised controlled experiments (A/B tests) or advanced methods like difference-in-differences or instrumental variables. Always ask "what is the confounding variable?" before drawing causal conclusions.
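The ice cream example can be reproduced with simulated data. In the sketch below, temperature drives both series and neither series affects the other; all coefficients and noise levels are invented for illustration. Removing the linear effect of temperature from both series (a simple partial correlation) makes the apparent relationship largely disappear.

```python
import random
from statistics import mean

random.seed(0)  # fixed seed so the run is repeatable

def pearson(xs, ys):
    """Pearson correlation coefficient."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def residuals(ys, xs):
    """What is left of y after a simple linear regression on x."""
    mx, my = mean(xs), mean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return [y - (my + slope * (x - mx)) for x, y in zip(xs, ys)]

# Simulated data: the confounder (temperature) causes both outcomes.
temperature = [random.uniform(10, 35) for _ in range(1000)]
ice_cream = [5.0 * t + random.gauss(0, 20) for t in temperature]
drownings = [0.3 * t + random.gauss(0, 2) for t in temperature]

raw_corr = pearson(ice_cream, drownings)
adjusted_corr = pearson(residuals(ice_cream, temperature),
                        residuals(drownings, temperature))

print("raw correlation:", round(raw_corr, 3))
print("after controlling for temperature:", round(adjusted_corr, 3))
```

Controlling for a confounder this way only works when the confounder is known and measured, which is why randomised experiments remain the stronger evidence for causation.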

Q6. What is an A/B test and how do you determine if a result is statistically significant?

An A/B test randomly divides users into two groups: control (A, existing experience) and treatment (B, new variant), then measures a key metric. Statistical significance is assessed with a hypothesis test: define the null hypothesis (no difference), choose a significance level alpha (usually 0.05), run the experiment until reaching the predetermined sample size, then compute the p-value, which is the probability of seeing a difference at least this large if the null hypothesis were true. If p < alpha the result is statistically significant and the null hypothesis is rejected. Also check practical significance (effect size), because a statistically significant change may be too small to matter. Use a power calculation before starting to determine the required sample size.
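For a conversion-rate metric, one common choice of hypothesis test is the pooled two-proportion z-test. The sketch below is a minimal version under that assumption; the function name and the conversion counts are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided pooled z-test for a difference in conversion rates.

    Returns (absolute lift, z statistic, p-value).
    """
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)  # rate under the null hypothesis
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return p_b - p_a, z, p_value

alpha = 0.05  # chosen before the experiment starts
lift, z, p_value = two_proportion_z_test(conv_a=500, n_a=10_000,
                                         conv_b=580, n_b=10_000)
significant = p_value < alpha

print(f"lift={lift:.4f}  z={z:.2f}  p={p_value:.4f}  significant={significant}")
```

The lift is reported alongside the p-value so that practical significance can be judged separately: here the absolute change is 0.8 percentage points, and whether that matters is a business question, not a statistical one.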

Q7. What are the main differences between a Pandas DataFrame and a SQL table?

Both are tabular two-dimensional data structures but with different paradigms. SQL tables live in a database server, are persistent, optimised for large datasets with indexes, and use declarative SQL queries. Pandas DataFrames live in memory, are ephemeral (lost when the session ends), handle complex Python operations and transformations well, and are better for exploratory analysis and visualisation. SQL handles data of any size with server resources; Pandas is limited by RAM. For a typical analysis workflow: query the database with SQL to extract and filter data, then use Pandas for cleaning, transformation, statistical analysis, and visualisation.

Q8. How would you detect and handle outliers in a dataset?

Detection methods: visualise with box plots and scatter plots, use the IQR method (values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR are outliers), or use the z-score (|z| > 3), which assumes roughly normal data and is itself distorted by extreme values. For high-dimensional data use isolation forest or DBSCAN. Handling options depend on context: drop if clearly erroneous data entry, cap (winsorise) to the 5th/95th percentile if the distribution is skewed, log transform to reduce skewness, or keep them if they represent genuine rare events (fraud detection, high-value customers). Never remove outliers without understanding why they exist and documenting the decision.
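The two rule-based detectors can be sketched with the standard library. The function names and sample order values are hypothetical, and quartile definitions differ slightly between libraries, so the exact fences may not match what pandas or NumPy would report.

```python
from statistics import mean, quantiles, stdev

def iqr_outliers(values, k=1.5):
    """Values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

def zscore_outliers(values, threshold=3.0):
    """Values more than `threshold` sample standard deviations from the mean."""
    m, s = mean(values), stdev(values)
    return [v for v in values if abs(v - m) / s > threshold]

order_values = [12, 14, 15, 15, 16, 17, 18, 19, 20, 250]

print("IQR rule:", iqr_outliers(order_values))
print("z-score rule:", zscore_outliers(order_values))
```

On this small sample the IQR rule flags 250 but the z-score rule flags nothing, because the extreme value inflates both the mean and the standard deviation it is measured against. This is one reason the IQR rule is often preferred for small or skewed datasets.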
