From EDA to ML Model: The Step-by-Step Guide Most Tutorials Skip

You’ve Finished Your EDA. Now What? A Beginner’s Complete Roadmap to Building Your First ML Model


The moment most data science tutorials forget to talk about.

There’s a particular kind of paralysis that hits you right after you finish Exploratory Data Analysis.

You’ve made your histograms. You’ve drawn your heatmaps. You’ve spotted outliers, identified missing values, and maybe even written a dozen lines of observations in your notebook. And then — silence. The cursor blinks. You stare at your screen.

Now what?


If you’ve ever felt this way, you’re not alone. EDA is the part of data science that every course covers beautifully. But what comes after it? That’s where most beginners quietly get lost — not because they lack skill, but because no one drew them a map.

This guide is that map.

Think of EDA as the diagnosis a doctor makes when you walk into a clinic. They listen to your symptoms, run some initial tests, look at your X-rays. But diagnosing isn’t healing. The real work — the prescription, the surgery, the therapy — comes after. That’s exactly what we’re going to cover here: everything that happens after the diagnosis, all the way up to delivering a working model.

Let’s walk through it, step by step.


Step 1: Document Your EDA Findings (Before You Forget)

Before you write a single line of preprocessing code, stop and write down what you found. This sounds painfully obvious, but most beginners skip it entirely. They hold everything in their head, confident they’ll remember. Two days later, they don’t.

Think of your EDA findings like a shopping list before a big cook. If you walk into the kitchen without your list, you’ll forget the salt, use the wrong oil, and wonder why the dish doesn’t taste right.

What to document:

  • Which columns have missing values, and roughly how much (e.g., “age column is 15% missing”)
  • Which features are skewed or have outliers
  • Which features appear correlated with each other
  • Which features appear to have the strongest relationship with your target variable
  • Any data quality issues you noticed (wrong data types, inconsistent categories like “Male” vs “male”)
  • Anything that surprised you

Keep this in a simple markdown cell or a text file. It becomes your preprocessing checklist — your game plan for the next steps.

“Good data scientists spend more time thinking about their data than writing code.” — Cassie Kozyrkov, Chief Decision Scientist, Google

Step 2: Define (or Confirm) Your Problem Statement Clearly

You might think you already know what you’re trying to predict. But after EDA, you often know more than when you started — and that knowledge should sharpen your problem definition.

Ask yourself:

What exactly am I predicting? Is it a number (like predicting house prices)? That’s a regression problem. Is it a category (like predicting whether an email is spam or not spam)? That’s a classification problem. Is it finding groups in your data with no labels? That’s clustering.

Is the problem actually solvable with the data I have? EDA sometimes reveals that a key feature you assumed would be available is mostly missing, or that there’s a data leakage issue (future information sneaking into past data). Better to catch this now than after hours of modeling.

What does “success” look like? In real life, this is a business question as much as a technical one. A medical diagnostic model that’s wrong 20% of the time might be completely unacceptable. A movie recommendation system that’s wrong 20% of the time might be perfectly fine.

Nail down your metric now. Will you use accuracy? Precision and recall? RMSE? AUC-ROC? Knowing this before modeling keeps you honest.


Step 3: Split Your Data — Train, Validation, and Test

This step is so important it deserves its own section, and it’s one that beginners frequently get wrong.

Imagine you’re studying for an exam. Your training set is your textbook — you learn from it. Your validation set is your practice exams — you use them to tune your study strategy. Your test set is the actual exam — you only look at it once, when you’re done.

If you peek at the actual exam while studying, your score means nothing. Same with data.

The standard split:

  • Training set (~70–80%) — The model learns from this
  • Validation set (~10–15%) — You use this to tune hyperparameters and compare models
  • Test set (~10–20%) — You use this once, at the very end, to report final performance

In scikit-learn:

from sklearn.model_selection import train_test_split

# Hold out the test set first, then split a validation set from the rest
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.15, random_state=42
)

Critical rule: Any preprocessing (imputation, scaling, encoding) must be fitted on the training set only, then applied to validation and test sets. If you fit your scaler on the entire dataset before splitting, you’ve allowed information from the test set to leak into your model. This is data leakage — one of the most common and silent mistakes in data science.

Use scikit-learn Pipeline objects to enforce this cleanly.
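As a minimal sketch of why this helps (the dataset here is synthetic, generated just for illustration), a Pipeline fits the scaler on the training data only and automatically reapplies the same transformation at prediction time:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for your own X and y
X, y = make_classification(n_samples=200, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# The scaler's mean/std are learned from X_train only -- no leakage
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression()),
])
pipe.fit(X_train, y_train)
print(pipe.score(X_test, y_test))
```

Because the scaler lives inside the pipeline, calling `fit` on the full training set and `predict`/`score` on the test set is all it takes to keep the two worlds separate.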

Step 4: Data Preprocessing — Cleaning Up the Kitchen

If EDA was your inspection of the kitchen, preprocessing is the actual cleaning before you start cooking.

This is often the most time-consuming step in any data science project. And it’s where the real skill gap between beginners and experienced practitioners shows.

4a. Handle Missing Values

Data in the wild is almost never complete. You have options:

Drop rows — If a very small percentage of rows have missing values and the missingness seems random, dropping them is perfectly acceptable. Just don’t drop rows indiscriminately.

Impute with the mean or median — For numerical features, filling missing values with the column’s mean (for normally distributed data) or median (for skewed data) is a common and effective approach. The median is your safer bet when you have outliers, because outliers don’t pull the median the way they do the mean.

Impute with the mode — For categorical features, filling with the most frequent category is standard.

Use a placeholder category — Sometimes “missing” is the information. If a customer didn’t fill in their income on a form, that absence might itself be meaningful. Creating a separate category called “Unknown” preserves this signal.

Advanced imputation — Libraries like scikit-learn offer KNNImputer and IterativeImputer for more sophisticated methods. For beginners, start simple.

“Imputation is never perfect — you’re making educated guesses. The goal is to minimize the bias you introduce.” — Max Kuhn & Julia Silge, Tidy Modeling with R
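A quick sketch of median imputation with scikit-learn’s SimpleImputer — the column names and values here are made up, but the pattern (fit on train, transform both) is the important part:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy train/test frames with gaps (illustrative data)
train = pd.DataFrame({"age": [25, np.nan, 40, 35], "income": [50, 60, np.nan, 80]})
test = pd.DataFrame({"age": [np.nan, 30], "income": [70, np.nan]})

# The medians are learned from the training set only,
# then those same values fill the gaps in the test set
imputer = SimpleImputer(strategy="median")
train_filled = imputer.fit_transform(train)
test_filled = imputer.transform(test)
print(imputer.statistics_)  # learned medians: [35.0, 60.0]
```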

4b. Handle Outliers

An outlier is like that one colleague at a party who has very strong opinions and throws off every group conversation.

Your EDA should have flagged these. Now decide what to do:

  • Remove them if they’re clearly data errors (e.g., a person listed as 300 years old)
  • Cap them (called Winsorization) — you replace values beyond a certain percentile with the boundary value. For example, all values above the 99th percentile get set to the 99th percentile value.
  • Keep them if they’re legitimate and meaningful (e.g., a billionaire in a salary dataset — that’s real data)
  • Transform the feature to reduce their impact (more on this next)
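Capping (Winsorization) is a one-liner with numpy — here on synthetic data with two planted outliers:

```python
import numpy as np

# Normally distributed values plus two extreme outliers
rng = np.random.default_rng(42)
values = np.append(rng.normal(100, 10, 1000), [500, 700])

# Compute the cap from the data, then clip everything beyond it
upper = np.percentile(values, 99)
capped = np.clip(values, None, upper)
print(values.max(), "->", capped.max())  # the 700 outlier now sits at the cap
```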

4c. Encode Categorical Variables

Machines don’t understand “Male” and “Female.” They understand numbers. Encoding is the process of converting categories into numbers.

Label Encoding — Assigns each category a number (Male = 0, Female = 1). Simple, but risky: it implies an order (1 > 0) where none exists. Use it only for ordinal categories (e.g., Small = 1, Medium = 2, Large = 3).

One-Hot Encoding — Creates a new binary column for each category. “Color” with values Red, Blue, Green becomes three columns: is_red, is_blue, is_green. This is the standard approach for nominal categories (no order). Beware of the dummy variable trap — always drop one of the resulting columns.

Target Encoding — Replaces each category with the mean of the target variable for that category. Powerful but prone to data leakage if done wrong. Leave this for after you’ve mastered the basics.
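Here’s one-hot encoding in miniature with pandas (`drop_first=True` handles the dummy variable trap by dropping one column):

```python
import pandas as pd

df = pd.DataFrame({"color": ["Red", "Blue", "Green", "Blue"]})

# Three categories become two binary columns; "Blue" is the dropped baseline
encoded = pd.get_dummies(df, columns=["color"], drop_first=True)
print(encoded.columns.tolist())  # ['color_Green', 'color_Red']
```

In a real project you’d more often use scikit-learn’s OneHotEncoder inside a Pipeline, so the set of categories is learned from the training data only.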

4d. Feature Scaling

Imagine you’re comparing the height and weight of people. Height might be in centimeters (160–190 range), while weight is in kilograms (50–100 range). Many algorithms — like K-Nearest Neighbors, SVM, and neural networks — are sensitive to the scale of features. A feature with large values will dominate, even if it’s not more important.

Standardization (Z-score scaling) — Transforms data so it has mean = 0 and standard deviation = 1. This is your general-purpose tool. Use it when you don’t know much about your data’s distribution.

Min-Max Normalization — Scales values to a [0, 1] range. Good when you know your data doesn’t have extreme outliers.

Log Transformation — For highly skewed numerical features, taking the logarithm (e.g., np.log1p(x)) can compress the range dramatically and make the distribution more "normal-shaped." Extremely useful for things like income, population, or price.
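Both ideas in a short sketch — standardization fitted on the training rows only, and a log transform squashing a skewed income column (all numbers are illustrative):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Height (cm) and weight (kg) on very different scales
X_train = np.array([[160.0, 50.0], [175.0, 70.0], [190.0, 100.0]])
X_test = np.array([[170.0, 65.0]])

# Fit the scaler on the training set only, then reuse it on the test set
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
print(X_train_scaled.mean(axis=0))  # ~[0, 0] after standardization

# Log transform compresses a heavily skewed feature
income = np.array([20_000, 35_000, 50_000, 2_000_000])
income_log = np.log1p(income)
print(income_log)  # the 100x spread shrinks to roughly 10 vs 14.5
```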

“Feature engineering is the process of using domain knowledge to create features that make machine learning algorithms work. It is fundamental to the application of machine learning.” — Pedro Domingos, Professor of Computer Science, University of Washington

Step 5: Feature Engineering — Building Better Ingredients

Preprocessing handles what’s broken. Feature engineering creates what’s new.

This is where your domain understanding pays off. A raw dataset is like a bag of groceries. Feature engineering is chopping, marinating, and combining those ingredients into something a model can actually learn from.

Some powerful examples:

  • If you have a date column, extract day_of_week, month, is_weekend, days_since_event — these often carry more predictive signal than a raw timestamp.
  • If you have latitude and longitude, you can calculate distance to a city center, or cluster locations into neighborhoods.
  • If you have two numerical features, sometimes their ratio or product matters more than either one individually (e.g., BMI is weight divided by height squared).
  • For text data, features like word count, presence of certain keywords, or sentiment scores can be tremendously useful before you even touch NLP techniques.

The key question to ask yourself during feature engineering: “What would a human expert look at to make this prediction?” Then encode that intuition as a feature.
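The date example above is a good place to start, since pandas makes it nearly free (the column name and reference date here are invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({"signup_date": pd.to_datetime(["2024-01-06", "2024-03-15"])})

# Calendar features often carry more signal than the raw timestamp
df["day_of_week"] = df["signup_date"].dt.dayofweek        # Monday=0 ... Sunday=6
df["month"] = df["signup_date"].dt.month
df["is_weekend"] = df["day_of_week"] >= 5
df["days_since_event"] = (pd.Timestamp("2024-06-01") - df["signup_date"]).dt.days
print(df[["day_of_week", "is_weekend", "days_since_event"]])
```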


Step 6: Feature Selection — Less Is Often More

After engineering features, you may end up with many. Dozens. Sometimes hundreds. Not all of them are useful, and feeding irrelevant noise into a model hurts performance.

This is the Marie Kondo step of data science. Keep only the features that spark joy — or more precisely, that add predictive value.

Correlation analysis — Remove one of every pair of highly correlated features. If two features say the same thing, you only need one.

Variance thresholding — Features with near-zero variance (almost all the same value) add no information. Drop them.

Univariate feature selection — Test the statistical relationship between each feature and the target. SelectKBest in scikit-learn does this efficiently.

Feature importance from models — Tree-based models like Random Forest give you a feature importance score for free. Train a quick model and let it tell you which features matter most.

Recursive Feature Elimination (RFE) — Trains a model, removes the weakest feature, trains again, repeats. Scikit-learn has a built-in RFE class.

A good rule of thumb: start with fewer features and add more only if performance improves. It’s much easier to add than to debug a bloated model.
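The SelectKBest approach mentioned above can be sketched in a few lines (on synthetic data where only 5 of 20 features are actually informative):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# 20 features, but only 5 carry real signal about the target
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=5, random_state=42
)

# Keep the 5 features with the strongest univariate relationship to y
selector = SelectKBest(score_func=f_classif, k=5)
X_selected = selector.fit_transform(X, y)
print(X_selected.shape)  # (300, 5)
```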


Step 7: Choose and Train Your First Model

Now — finally — you get to train a model.

But don’t start with a neural network. Start simple.

“I always start with linear models and simple trees. If they can’t solve the problem, it’s usually a feature problem, not a model problem.” — Jeremy Howard, fast.ai founder

A practical starting framework:

For regression (predicting a continuous number):

  1. Start with Linear Regression — fast, interpretable, a great baseline
  2. Try Ridge or Lasso Regression if you have many features
  3. Graduate to Random Forest Regressor for better performance

For classification (predicting a category):

  1. Start with Logistic Regression — despite the name, it’s a classification model
  2. Try Decision Tree Classifier for interpretability
  3. Graduate to Random Forest Classifier or Gradient Boosting (XGBoost, LightGBM)

Start with defaults. Don’t tune hyperparameters yet. The goal of the first model is to establish a baseline — something to beat.

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# X_val, y_val: the validation split you created in Step 3
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)
predictions = model.predict(X_val)
print(accuracy_score(y_val, predictions))

Step 8: Evaluate Your Model — Are You Actually Any Good?

Training a model and evaluating a model are two very different things. This is where your choice of metric (from Step 2) becomes critical.

For Classification:

  • Accuracy — What percentage of predictions were correct? Simple, but misleading on imbalanced datasets (if 95% of emails aren’t spam, a model that predicts “not spam” for everything gets 95% accuracy without learning anything).
  • Precision — Of everything you predicted as positive, how many actually were? (Good when false positives are costly — e.g., spam filters)
  • Recall — Of all the actual positives, how many did you catch? (Good when false negatives are costly — e.g., disease detection)
  • F1 Score — The harmonic mean of precision and recall. A good single number when you need to balance both.
  • AUC-ROC — How well can your model distinguish between classes? Great for imbalanced datasets.
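To make precision and recall concrete, here’s a tiny hand-worked example (labels invented for illustration):

```python
from sklearn.metrics import f1_score, precision_score, recall_score

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]  # 4 actual positives
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 0, 1]  # 3 predicted positives, 1 of them wrong

# Precision: of the 3 positive predictions, 2 were right -> 2/3
# Recall: of the 4 actual positives, only 2 were caught -> 1/2
print(precision_score(y_true, y_pred))
print(recall_score(y_true, y_pred))
print(f1_score(y_true, y_pred))  # harmonic mean of the two -> 4/7
```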

For Regression:

  • MAE (Mean Absolute Error) — Average absolute difference between predicted and actual. Easy to interpret in the original units.
  • RMSE (Root Mean Squared Error) — Penalizes large errors more heavily. Use when big mistakes are especially bad.
  • R² (R-squared) — What proportion of the variance in the target does your model explain? 1.0 is perfect.
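And the regression metrics side by side, on made-up predictions so the arithmetic is easy to follow:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 310.0, 360.0])  # errors: 10, 10, 10, 40

mae = mean_absolute_error(y_true, y_pred)           # (10+10+10+40)/4 = 17.5
rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # the 40 dominates: ~21.8
r2 = r2_score(y_true, y_pred)                       # 1 - 1900/50000 = 0.962
print(mae, rmse, r2)
```

Notice how the single large error (40) barely moves MAE but pushes RMSE well above it — that’s the “penalizes large errors” behavior in action.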

Cross-Validation: The Reliability Check

Instead of evaluating on a single validation split, cross-validation splits your training data into K folds and trains/evaluates K times, rotating which fold is used for validation. It gives you a more reliable estimate of model performance.

from sklearn.model_selection import cross_val_score
scores = cross_val_score(model, X_train, y_train, cv=5, scoring='accuracy')
print(f"Mean: {scores.mean():.3f}, Std: {scores.std():.3f}")

If your standard deviation is high, your model is sensitive to which data it sees — a sign you might need more data or a simpler model.


Step 9: Diagnose Bias and Variance — Is Your Model Overfit or Underfit?

This is one of the most important diagnostic skills you’ll develop.

Think of it like studying for an exam.

Underfitting (High Bias) is like a student who barely studied. They don’t do well on the practice tests or the real exam. The model hasn’t learned enough from the training data. Signs: Low training accuracy AND low validation accuracy. Fix: Use a more complex model, add more features, or reduce regularization.

Overfitting (High Variance) is like a student who memorized the textbook word-for-word but can’t answer any question that’s phrased slightly differently. The model learned the training data too well — including its noise. Signs: High training accuracy but significantly lower validation accuracy. Fix: Simplify the model, add more training data, use regularization (L1/L2), use dropout (for neural networks), or use cross-validation more aggressively.

The sweet spot — the Goldilocks zone — is a model with reasonably high accuracy on both training and validation sets, with a small gap between them.
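You can watch overfitting happen by comparing train and validation scores directly. A sketch on synthetic data: an unconstrained decision tree memorizes the training set, while a depth-limited one cannot:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Unconstrained tree: memorizes the training data (high variance)
deep = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
# Depth-limited tree: can't memorize, forced to generalize
shallow = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_train, y_train)

print("deep:    train", deep.score(X_train, y_train),
      "val", deep.score(X_val, y_val))
print("shallow: train", shallow.score(X_train, y_train),
      "val", shallow.score(X_val, y_val))
```

The deep tree hits 100% training accuracy with a visible drop on validation — the classic overfitting gap described above.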


Step 10: Hyperparameter Tuning — Fine-Tuning the Recipe

Now that you have a working model, it’s time to optimize it.

Hyperparameters are the settings you choose before training — the “knobs” on your model. For a Random Forest, these include the number of trees (n_estimators), the maximum tree depth (max_depth), and the minimum samples at a leaf node. The model doesn't learn these; you set them.

Tuning them is like adjusting the temperature and cooking time on your recipe to get the best result.

Grid Search — Exhaustively tries every combination of hyperparameters you specify. Great for small search spaces.

Random Search — Randomly samples from the hyperparameter space. Surprisingly effective and much faster than grid search for large spaces.

Bayesian Optimization — Uses the results of previous trials to intelligently choose the next combination to try. Libraries like Optuna and Hyperopt implement this. More advanced, but worth learning.

from sklearn.model_selection import RandomizedSearchCV

param_dist = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10]
}
search = RandomizedSearchCV(model, param_dist, n_iter=20, cv=5, random_state=42)
search.fit(X_train, y_train)
print(search.best_params_)

Step 11: Final Evaluation on the Test Set

This is the moment of truth. You’ve trained, tuned, and cross-validated. Now you evaluate your final model on the held-out test set — the data your model has never seen.

Do this once. One evaluation. If you evaluate multiple times and keep tweaking based on test set results, you’re essentially fitting to the test set, and your reported performance won’t generalize.

Report your final metrics honestly. A good model that you understand and can explain is worth far more than a slightly better model you got by overfitting to the test set.


Step 12: Interpret and Communicate Your Results

A model that no one understands or trusts rarely gets used.

Interpretability is how you build that trust.

For simple models: Linear and logistic regression give you coefficients directly — the magnitude and sign of each coefficient tells you how each feature influences predictions.

For complex models: Use tools like:

  • SHAP (SHapley Additive exPlanations) — Shows how much each feature contributed to each individual prediction. Arguably the gold standard of model interpretability.
  • LIME (Local Interpretable Model-agnostic Explanations) — Explains individual predictions by approximating the model locally with a simple model.
  • Feature Importance plots — Quick and visual, built into tree-based models.

And then — communicate clearly. Your stakeholders or teammates don’t care about AUC-ROC. They care about what the model does, how confident you are, and what its failure modes look like.

“The best model is not the most accurate model. It’s the one that gets used.” — Widely attributed to practitioners across the industry

Step 13: Save and Deploy Your Model

Your model lives in memory right now. To make it useful beyond your notebook, you need to persist it.

For saving:

import joblib
joblib.dump(model, 'my_model.pkl')
# Later, to load it:
loaded_model = joblib.load('my_model.pkl')

For deployment, beginners have several accessible options:

  • Streamlit — Build a simple web app around your model in pure Python. You can have something live in under an hour.
  • Flask/FastAPI — Build a REST API that accepts data and returns predictions. FastAPI is the modern choice.
  • Gradio — Similar to Streamlit, especially popular for ML demos.
  • Pickle + cloud — Save the model as a pickle file and deploy on platforms like Heroku, AWS Lambda, or Google Cloud Run.

Deployment is a whole field of its own (MLOps), but getting something basic live is an important milestone. It closes the loop — you built something real.


A Quick Reference Roadmap

To tie everything together, here’s the path at a glance:

  1. Document EDA findings — Your preprocessing checklist
  2. Define your problem clearly — Regression, classification, metric
  3. Split data — Train / Validation / Test (fit preprocessors on train only!)
  4. Preprocess data — Missing values, outliers, encoding, scaling
  5. Engineer features — Create new, meaningful signals
  6. Select features — Remove noise and redundancy
  7. Train a baseline model — Start simple
  8. Evaluate honestly — Right metrics, cross-validation
  9. Diagnose overfitting/underfitting — Bias-variance tradeoff
  10. Tune hyperparameters — Grid or random search
  11. Final test set evaluation — Once, honestly
  12. Interpret and communicate — SHAP, feature importance, clear language
  13. Save and deploy — Make it real

Common Mistakes to Avoid

Learn from others’ pain, not your own.

  • Data leakage — Fitting your preprocessor on the full dataset before splitting. Always fit on training data only.
  • Skipping the baseline — Jumping to a complex model without a simple benchmark. You won’t know if the complexity is actually helping.
  • Ignoring class imbalance — Building a classifier on a dataset where 98% of samples are one class without addressing it. Use stratified splits, class weights, or SMOTE.
  • Chasing accuracy on wrong metrics — Optimizing for accuracy when you should be optimizing for recall (or vice versa).
  • Not versioning your experiments — Use tools like MLflow or even a simple spreadsheet to track which model, which hyperparameters, and which results.
  • Treating the test set as validation — Evaluating on the test set multiple times invalidates it.

Conclusion: The Map Is Not the Territory

Reading a guide like this can make the process feel linear and clean. Reality is messier. You’ll preprocess, train, evaluate, realize your features aren’t good enough, go back to preprocessing, retrain, notice your model is overfitting, tune it, evaluate again, and then — maybe — feel satisfied.

That back-and-forth isn’t failure. That is the process.

Every experienced data scientist has a folder of terrible first models. The difference between a beginner and an expert isn’t the number of perfect models they’ve built — it’s the number of broken ones they’ve learned from and improved.

EDA was your way of listening to the data. Everything after EDA is your way of responding to it.

You’ve done the diagnosis. Now it’s time to do the work.

Start messy. Start imperfect. Start today.

Authored by: Shorya Bisht

— Bhuwan Chettri
Editor, CodeToDeploy

CodeToDeploy Is a Tech-Focused Publication Helping Students, Professionals, And Creators Stay Ahead with AI, Coding, Cloud, Digital Tools, And Career Growth Insights.
