Pandas Interview Questions Every Data Engineer Must Know
- Tejas Agrawal
- Dec 6
- 2 min read
In today’s data-driven world, Pandas has become an essential tool for every Data Engineer. Whether you're building ETL pipelines, transforming large datasets, or preparing data for analytics and machine learning, Pandas plays a critical role in delivering clean, structured, and high-quality data.

1. What makes Pandas different from NumPy?
Answer: Pandas is built on top of NumPy but provides labeled, tabular, and time-series data structures like DataFrame and Series. It supports:
Missing value handling
Grouping
Joins
Indexing
Complex data manipulation
Pandas is preferred for structured data, whereas NumPy is ideal for numerical arrays.
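To make the contrast concrete, here is a minimal sketch that puts the same values in a NumPy array and in a Pandas Series. The employee labels and salary figures are made up for illustration.

```python
import numpy as np
import pandas as pd

# A plain NumPy array: homogeneous values, accessed by integer position.
arr = np.array([50000.0, 62000.0, np.nan])

# The same values as a Series with labels (hypothetical employee names).
salaries = pd.Series(arr, index=["asha", "ben", "chen"], name="salary")

# Label-based access, which a bare ndarray does not offer.
ben_salary = salaries["ben"]

# NaN propagates through ndarray.mean(), while Series.mean() skips
# missing values by default (skipna=True).
numpy_mean = arr.mean()
pandas_mean = salaries.mean()

print(ben_salary, numpy_mean, pandas_mean)
```

The missing-value behavior is the part interviewers tend to probe: NumPy returns nan here unless you switch to np.nanmean, while Pandas ignores the gap unless told otherwise.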
2. What is a DataFrame? Why is it powerful?
Answer: A DataFrame is a 2D labeled table, similar to a SQL table or an Excel sheet.
Why powerful?
Heterogeneous data allowed (each column can have its own data type)
Fast filtering & grouping
Built-in I/O (CSV, SQL, Parquet)
Rich indexing features
Essential for ETL and analytics tasks.
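A small sketch of the first two points, using a made-up orders table (the column names and values are assumptions for illustration):

```python
import pandas as pd

# Hypothetical orders table: each column holds a different kind of data.
df = pd.DataFrame(
    {
        "order_id": [101, 102, 103],
        "customer": ["acme", "globex", "acme"],
        "amount": [250.0, 90.5, 120.0],
    }
)

# Each column keeps its own dtype.
dtypes = df.dtypes.astype(str).to_dict()

# Boolean filtering: keep only rows where the condition is True.
large_orders = df[df["amount"] > 100]

print(dtypes)
print(large_orders)
```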
3. What is the difference between .loc and .iloc?
Answer:
loc: Label-based indexing
df.loc["row_label", "column_label"]
iloc: Position-based indexing
df.iloc[3, 2]
loc selects by index and column labels, and its slices include the end label; iloc selects by integer position, and its slices exclude the end position.
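Here is a minimal sketch of both accessors on a tiny frame. The row labels r1 to r3 and the city data are hypothetical.

```python
import pandas as pd

# Hypothetical frame with string row labels.
df = pd.DataFrame(
    {"city": ["Pune", "Delhi", "Mumbai"], "temp_c": [31, 38, 33]},
    index=["r1", "r2", "r3"],
)

# Same cell, reached two ways.
by_label = df.loc["r2", "temp_c"]   # row label "r2", column label "temp_c"
by_position = df.iloc[1, 1]         # second row, second column

# Slicing: .loc includes the end label, .iloc excludes the end position.
loc_slice = df.loc["r1":"r2"]       # rows r1 and r2
iloc_slice = df.iloc[0:1]           # row r1 only

print(by_label, by_position, len(loc_slice), len(iloc_slice))
```

The slicing difference is a common follow-up question, so it is worth remembering alongside the label versus position distinction.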
4. How do you handle missing values in Pandas?
Answer: The most commonly used methods are:
df.isna()
df.dropna()
df.fillna(value)
df.ffill() # forward fill; fillna(method="ffill") is deprecated since pandas 2.1
Data Engineers use these during data cleaning before loading to warehouses or ML pipelines.
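The methods above can be sketched on a small column with gaps. The sensor readings are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sensor data with two missing readings.
df = pd.DataFrame(
    {"sensor": ["a", "a", "a", "a"], "reading": [1.0, np.nan, np.nan, 4.0]}
)

# Count missing values in the column.
missing_count = int(df["reading"].isna().sum())

# Drop rows where "reading" is missing.
dropped = df.dropna(subset=["reading"])

# Replace missing values with a constant.
filled_const = df["reading"].fillna(0.0)

# Forward fill: carry the last valid value forward.
filled_ffill = df["reading"].ffill()

print(missing_count, len(dropped))
print(filled_ffill.tolist())
```

Which strategy is appropriate depends on the data: forward fill suits ordered series such as sensor readings, while dropping or constant filling may be safer elsewhere.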
5. Explain groupby() with a real example.
Answer: groupby() follows the split-apply-combine pattern: it splits the rows into groups, applies a function to each group, and combines the results.
Example:
df.groupby("department")["salary"].mean()
Use case: Department-level aggregations in HR, Sales, and Finance datasets.
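A runnable version of the example, as a sketch on made-up HR data. The second part uses named aggregation to compute several statistics at once; the output column names are my own choice.

```python
import pandas as pd

# Hypothetical HR data.
df = pd.DataFrame(
    {
        "department": ["HR", "HR", "Sales", "Sales", "Sales"],
        "salary": [40000, 50000, 60000, 70000, 80000],
    }
)

# Split by department, apply mean to each group, combine into one Series.
mean_salary = df.groupby("department")["salary"].mean()

# Several aggregations in one pass using named aggregation.
summary = df.groupby("department").agg(
    headcount=("salary", "count"),
    avg_salary=("salary", "mean"),
    max_salary=("salary", "max"),
)

print(mean_salary)
print(summary)
```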
6. How do you merge two DataFrames?
Answer: Using merge() with SQL-like joins:
pd.merge(df1, df2, on="id", how="left")
Supports: inner, left, right, and outer joins. Critical for combining tables in ETL pipelines.
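The sketch below runs the join types on two small, made-up tables. It also uses two optional merge() parameters that are handy in pipelines: validate, which raises an error if the key relationship is not what you expect, and indicator, which adds a _merge column showing where each row came from.

```python
import pandas as pd

# Hypothetical tables sharing an "id" key.
orders = pd.DataFrame({"id": [1, 2, 3], "amount": [100, 200, 300]})
customers = pd.DataFrame({"id": [1, 2, 4], "name": ["Asha", "Ben", "Dee"]})

# Left join: keep every order, attach customer data where it exists.
left = pd.merge(
    orders,
    customers,
    on="id",
    how="left",
    validate="one_to_one",  # fail fast if keys are duplicated on either side
    indicator=True,         # adds a "_merge" column
)

# Orders with no matching customer.
unmatched = left[left["_merge"] == "left_only"]

# Inner keeps only matching keys; outer keeps keys from both sides.
inner = pd.merge(orders, customers, on="id", how="inner")
outer = pd.merge(orders, customers, on="id", how="outer")

print(len(left), len(inner), len(outer))
print(unmatched)
```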
7. What is the difference between apply(), map(), and applymap()?
Answer:
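Function | Works On | Purpose |
map() | Series | Element-wise transformation of each value |
apply() | Series/DataFrame | Element-wise on a Series; row-wise or column-wise on a DataFrame |
applymap() | DataFrame | Element-wise over the whole table (deprecated since pandas 2.1 in favor of DataFrame.map()) |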
Use case: apply() is commonly used for feature engineering.
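A short sketch of all three on a made-up price table. Because applymap() was renamed to DataFrame.map() in pandas 2.1, the last step picks whichever of the two the installed version provides; that fallback is my own workaround, not an official pattern.

```python
import pandas as pd

# Hypothetical price table.
df = pd.DataFrame({"price": [10.0, 20.0], "qty": [1, 3]})

# Series.map: element-wise on a single column.
price_labels = df["price"].map(lambda p: "high" if p > 15 else "low")

# DataFrame.apply with axis=1: the function receives one row at a time.
revenue = df.apply(lambda row: row["price"] * row["qty"], axis=1)

# Element-wise over the whole frame. DataFrame.map exists from pandas 2.1;
# older versions only have applymap.
elementwise = df.map if hasattr(pd.DataFrame, "map") else df.applymap
doubled = elementwise(lambda v: v * 2)

print(price_labels.tolist(), revenue.tolist())
print(doubled)
```

For simple arithmetic like the revenue column, the vectorized form df["price"] * df["qty"] gives the same result and is usually much faster than apply() with axis=1.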
8. How do you improve performance in Pandas?
Answer:
Use vectorized operations (avoid Python loops)
Use categorical types for repeated values
Convert object columns to numeric dtypes where possible (for example with pd.to_numeric())
Use df.itertuples() instead of df.iterrows() when row-by-row iteration cannot be avoided
Use chunk processing for large files
Real life: Helps process millions of rows efficiently.
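A few of these tips can be sketched on synthetic data. The column names, row count, and random values below are assumptions for illustration, and actual memory savings will vary with the data.

```python
import numpy as np
import pandas as pd

# Synthetic data: 100,000 rows with a low-cardinality text column.
n = 100_000
rng = np.random.default_rng(0)
df = pd.DataFrame(
    {
        "country": rng.choice(["IN", "US", "DE"], size=n),
        "price": rng.random(n) * 100,
        "qty": rng.integers(1, 10, size=n),
    }
)

# Vectorized arithmetic: one operation over whole columns, no Python loop.
df["revenue"] = df["price"] * df["qty"]

# Categorical dtype: repeated strings are stored once, plus small integer codes.
text_bytes = df["country"].memory_usage(deep=True)
df["country"] = df["country"].astype("category")
category_bytes = df["country"].memory_usage(deep=True)

# If a loop is truly needed, itertuples yields lightweight namedtuples.
total = 0.0
for row in df.head(1000).itertuples(index=False):
    total += row.price * row.qty

print(text_bytes, category_bytes)
```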
9. What is the difference between pivot() and pivot_table()?
Answer:
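pivot(): Only reshapes; raises an error if an index/column pair appears more than once.
pivot_table(): Allows duplicate index/column pairs and combines them with an aggregation function (mean by default).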
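To see the difference in practice, here is a minimal sketch on made-up sales data where the pair (north, jan) appears twice:

```python
import pandas as pd

# Hypothetical sales data with a duplicate (region, month) pair.
df = pd.DataFrame(
    {
        "region": ["north", "north", "south"],
        "month": ["jan", "jan", "jan"],
        "sales": [100, 50, 70],
    }
)

# pivot() cannot decide which of the two "north/jan" values to keep.
try:
    df.pivot(index="region", columns="month", values="sales")
    pivot_failed = False
except ValueError:
    pivot_failed = True

# pivot_table() resolves the duplicates by aggregating them.
table = df.pivot_table(
    values="sales", index="region", columns="month", aggfunc="sum"
)

print(pivot_failed)
print(table)
```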
Example:
df.pivot_table(values="sales", index="region", columns="month", aggfunc="sum")
10. How do you read large CSV files efficiently in Pandas?
Answer: Best practices:
pd.read_csv(..., chunksize=50000)
pd.read_csv(..., dtype={"col": "category"})
pd.read_csv(..., usecols=[...])
Chunking combined with dtype optimization and column selection keeps memory usage bounded, which is crucial for Data Engineers processing large datasets.
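The three options can be combined in a single read_csv() call. The sketch below uses a tiny in-memory CSV in place of a real file, so the column names, data, and chunk size of 4 are all assumptions for illustration. Each chunk is aggregated on its own and only the running totals are kept in memory.

```python
import io
import pandas as pd

# Stand-in for a large file: a small CSV built in memory.
csv_text = "id,country,amount,notes\n" + "".join(
    f"{i},{'IN' if i % 2 else 'US'},{i * 1.5},ignored\n" for i in range(10)
)

reader = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["country", "amount"],  # skip columns we do not need
    dtype={"country": "category", "amount": "float64"},
    chunksize=4,                    # returns an iterator of DataFrames
)

# Aggregate chunk by chunk, keeping only the running totals.
totals = {}
for chunk in reader:
    partial = chunk.groupby("country", observed=True)["amount"].sum()
    for country, amount in partial.items():
        totals[country] = totals.get(country, 0.0) + amount

print(totals)
```

With a real file you would pass the file path instead of io.StringIO and pick a chunk size that fits comfortably in memory.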


