Student Performance Prediction Using Linear Regression
Linear regression is a foundational statistical and machine learning technique for modeling the linear relationship between a dependent variable and one or more independent variables. In this simulation study, we use simple linear regression to predict students' final exam scores (the dependent variable) from the number of hours they study (the independent variable).
1. Concept of Simple Linear Regression
The simple linear regression model assumes a linear relationship between the predictor X (Study Hours) and the response Y (Exam Score), expressed as:
Y = β₀ + β₁ X + ε
Where:
- Y → Final exam score (dependent variable)
- X → Number of study hours (independent variable)
- β₀ → Y-intercept (expected score when X = 0)
- β₁ → Slope (expected change in score per additional study hour)
- ε → Random error, ε ~ N(0, σ²)
2. Data Simulation and Preprocessing
Synthetic data are generated in two steps:
- Study hours (X) → sampled from a realistic range of values
- Exam scores (Y) → generated from the true model Y = β₀ + β₁ X + ε, with known β₀, β₁, and σ
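The simulation step can be sketched as follows. This is a minimal illustration, not the study's actual generator: the parameter values (β₀ = 40, β₁ = 5, σ = 4), the 1 to 10 hour range, the uniform sampling, and the clipping of scores to 0–100 are all assumptions chosen for the example, and `simulate_students` is a hypothetical helper name.

```python
import random

def simulate_students(n, beta0=40.0, beta1=5.0, sigma=4.0, seed=42):
    """Return (hours, scores) drawn from Y = beta0 + beta1 * X + eps.

    All default values are illustrative assumptions, not values from the study.
    """
    rng = random.Random(seed)  # seeded so the simulation is reproducible
    # Assumed "realistic" range: between 1 and 10 study hours.
    hours = [rng.uniform(1.0, 10.0) for _ in range(n)]
    scores = []
    for x in hours:
        y = beta0 + beta1 * x + rng.gauss(0.0, sigma)  # true line plus N(0, sigma^2) noise
        # Assumed preprocessing step: keep scores inside the valid 0-100 range.
        scores.append(min(100.0, max(0.0, y)))
    return hours, scores

hours, scores = simulate_students(200)
print(len(hours), len(scores))  # 200 200
```

Clipping keeps the scores valid but slightly distorts the noise near the boundaries, so it is worth keeping the true line well inside the 0–100 range.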
3. Exploratory Data Analysis (EDA)
- Scatter plot of exam score against study hours
- Pearson correlation (r) between X and Y
- Summary statistics for both variables
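The numeric parts of the EDA can be sketched with the standard library alone. The helper names `pearson_r` and `summarize` are hypothetical, and the four data points are made up for illustration. The scatter plot is left out because it needs a plotting library.

```python
import math
import statistics

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def summarize(values):
    """Basic summary statistics (sample standard deviation)."""
    return {
        "n": len(values),
        "mean": statistics.fmean(values),
        "std": statistics.stdev(values),
        "min": min(values),
        "max": max(values),
    }

# Made-up toy data, for illustration only.
hours = [2.0, 4.0, 6.0, 8.0]
scores = [52.0, 58.0, 71.0, 79.0]

print("r =", round(pearson_r(hours, scores), 3))
print("hours:", summarize(hours))
print("scores:", summarize(scores))
```

A value of r close to +1 suggests that a straight line is a reasonable first model for the data.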
4. Model Fitting – Ordinary Least Squares (OLS)
OLS chooses β₀ and β₁ to minimize the Mean Squared Error (equivalently, the sum of squared residuals):
MSE = (1/n) Σ(i=1 to n) (Y_i − Ŷ_i)²
where Ŷ_i = β₀ + β₁ X_i
Closed-form solutions:
β₁ = [Σ(i=1 to n) (X_i − X̄)(Y_i − Ȳ)] / [Σ(i=1 to n) (X_i − X̄)²]
β₀ = Ȳ − β₁ X̄
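The closed-form solutions translate almost line by line into code. The sketch below uses a hypothetical helper `fit_ols` and toy data that lie exactly on the line Y = 40 + 5X (values assumed for illustration), so the recovered coefficients can be checked by hand.

```python
def fit_ols(xs, ys):
    """Return (beta0, beta1) from the closed-form simple OLS solution."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all X values are identical; the slope is undefined")
    beta1 = sxy / sxx              # slope
    beta0 = y_bar - beta1 * x_bar  # intercept
    return beta0, beta1

# Toy data lying exactly on Y = 40 + 5X (illustrative values).
hours = [1.0, 2.0, 3.0, 4.0]
scores = [45.0, 50.0, 55.0, 60.0]

beta0, beta1 = fit_ols(hours, scores)
print(beta0, beta1)  # 40.0 5.0
```

With noisy data the estimates will not match the true parameters exactly, but they should approach them as the sample size grows.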
5. Model Evaluation
| Metric | Formula | Interpretation |
|---|---|---|
| Mean Squared Error (MSE) | MSE = (1/n) Σ(i=1 to n) (Y_i − Ŷ_i)² | Average squared error |
| Root Mean Squared Error (RMSE) | RMSE = √MSE | Error in original units |
| R-squared (R²) | R² = 1 − [Σ(Y_i − Ŷ_i)²] / [Σ(Y_i − Ȳ)²] | Proportion of variance explained |
Data split: 80% training, 20% testing
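A minimal sketch of the 80/20 split and the three metrics is shown below. The helper names `train_test_split` and `evaluate` are hypothetical, the ten data points are made up, and the coefficients 40 and 5 simply stand in for a model that would normally be fitted on the training portion.

```python
import math
import random

def train_test_split(xs, ys, test_fraction=0.2, seed=0):
    """Shuffle indices, then hold out `test_fraction` of the pairs for testing."""
    indices = list(range(len(xs)))
    random.Random(seed).shuffle(indices)
    n_test = max(1, round(len(xs) * test_fraction))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    x_train = [xs[i] for i in train_idx]
    y_train = [ys[i] for i in train_idx]
    x_test = [xs[i] for i in test_idx]
    y_test = [ys[i] for i in test_idx]
    return x_train, y_train, x_test, y_test

def evaluate(y_true, y_pred):
    """Compute MSE, RMSE, and R-squared for paired observations and predictions."""
    n = len(y_true)
    y_bar = sum(y_true) / n
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - y_bar) ** 2 for y in y_true)
    mse = ss_res / n
    # R-squared is undefined when every observed value is identical.
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan")
    return {"mse": mse, "rmse": math.sqrt(mse), "r2": r2}

# Made-up data, for illustration only.
hours = [float(h) for h in range(1, 11)]
scores = [44.0, 51.0, 54.0, 61.0, 66.0, 69.0, 76.0, 79.0, 86.0, 90.0]

x_train, y_train, x_test, y_test = train_test_split(hours, scores)
print(len(x_train), len(x_test))  # 8 2

# Hypothetical coefficients standing in for a model fitted on the training set.
beta0, beta1 = 40.0, 5.0
predictions = [beta0 + beta1 * x for x in x_test]
print(evaluate(y_test, predictions))
```

Evaluating on the held-out 20% gives an estimate of how the model performs on students it has not seen, which the training error alone cannot provide.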
6. Results Interpretation and Prediction
The fitted model gives:
- β₁ → expected score increase per additional study hour
- Ability to predict exam scores for new students
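Both points can be illustrated with a short sketch. The coefficient values below (42.0 and 4.5) are hypothetical placeholders for fitted estimates, and `predict_score` is an illustrative helper, not part of any library.

```python
def predict_score(hours, beta0, beta1):
    """Predicted exam score for a given number of study hours."""
    return beta0 + beta1 * hours

# Hypothetical fitted coefficients, used only to illustrate interpretation.
beta0_hat, beta1_hat = 42.0, 4.5

# Predictions for new students with different study hours.
for h in [3.0, 5.5, 8.0]:
    print(f"{h:.1f} hours -> predicted score {predict_score(h, beta0_hat, beta1_hat):.2f}")

# One extra study hour changes the prediction by exactly the slope.
gain = predict_score(6.0, beta0_hat, beta1_hat) - predict_score(5.0, beta0_hat, beta1_hat)
print(gain)  # 4.5
```

Predictions are most trustworthy inside the range of study hours seen during training; far outside that range the linear assumption may no longer hold.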