
Data Science & Machine Learning Intro

Introduction to Data Science
  • Data Science is the process of extracting useful insights from data.
  • It combines statistics, programming, and domain knowledge.
  • Common steps: collect → clean → analyze → visualize → model → deploy.
  • Tools: Python, NumPy, Pandas, Matplotlib, Scikit-learn.
Data Visualization (Basics)
  • Visualization helps you understand patterns and trends in data.
  • Common charts: line, bar, scatter, histogram, box plot.
  • Libraries: matplotlib and pandas plotting.
import matplotlib.pyplot as plt

x = [1, 2, 3, 4]
y = [10, 15, 12, 20]

plt.plot(x, y)
plt.title("Line Plot")
plt.xlabel("X")
plt.ylabel("Y")
plt.show()
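The other chart types listed above follow the same pattern. A minimal sketch, assuming made-up sample values and a hypothetical output file name (charts.png); the Agg backend is selected only so the script can run without a display:

```python
import matplotlib
matplotlib.use("Agg")  # assumption: render off-screen so no display is needed
import matplotlib.pyplot as plt

# Made-up sample values, for illustration only
categories = ["A", "B", "C"]
counts = [5, 9, 3]
x = [1, 2, 3, 4, 5, 6]
y = [2, 4, 5, 4, 6, 8]

# One row of four axes: bar, scatter, histogram, box plot
fig, axes = plt.subplots(1, 4, figsize=(14, 3))

axes[0].bar(categories, counts)
axes[0].set_title("Bar")

axes[1].scatter(x, y)
axes[1].set_title("Scatter")

axes[2].hist(y, bins=4)
axes[2].set_title("Histogram")

axes[3].boxplot(y)
axes[3].set_title("Box Plot")

fig.tight_layout()
fig.savefig("charts.png")  # hypothetical file name; use plt.show() when working interactively
```

Bar charts compare categories, scatter plots show the relationship between two numeric variables, and histograms and box plots summarize the distribution of a single variable.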
Machine Learning Basics
  • Machine Learning (ML) enables systems to learn from data and make predictions.
  • Types:
      • Supervised (labeled data): regression, classification
      • Unsupervised (unlabeled data): clustering
      • Reinforcement: reward-based learning
  • Common terms: features (X), label/target (y), model, training, testing.
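The difference between supervised and unsupervised learning shows up in what is passed to fit(). A rough sketch, assuming made-up points and using scikit-learn's LogisticRegression and KMeans as stand-ins for "a classifier" and "a clustering model" (neither is prescribed by these notes):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

# Made-up features (X): two well-separated groups of points
X = np.array([[1.0, 1.2], [0.8, 1.0], [1.1, 0.9],
              [5.0, 5.2], [5.1, 4.8], [4.9, 5.0]])
# Labels (y) exist only in the supervised case
y = np.array([0, 0, 0, 1, 1, 1])

# Supervised: the model sees both features and labels during training
clf = LogisticRegression()
clf.fit(X, y)
print("Predicted class:", clf.predict([[5.0, 5.0]]))

# Unsupervised: the model sees only the features and groups them itself
km = KMeans(n_clusters=2, n_init=10, random_state=0)
km.fit(X)
print("Cluster ids:", km.labels_)  # ids are arbitrary (0 and 1 may be swapped)
```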
Linear Regression (Concept)
  • Linear Regression predicts a continuous value.
  • It fits a best-fit straight line y = mx + c, where m is the slope and c is the intercept.
  • Used for predicting price, marks, sales, etc.
# Example: predict marks based on study hours
# y = m*x + c
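To make the formula concrete, m and c can be computed directly with the standard least-squares expressions. A small sketch in NumPy, assuming the same illustrative hours/marks values that appear in the model training example:

```python
import numpy as np

# Illustrative values (study hours and marks)
hours = np.array([1, 2, 3, 4, 5], dtype=float)
marks = np.array([35, 40, 50, 60, 65], dtype=float)

# Least-squares slope and intercept for y = m*x + c
x_mean, y_mean = hours.mean(), marks.mean()
m = np.sum((hours - x_mean) * (marks - y_mean)) / np.sum((hours - x_mean) ** 2)
c = y_mean - m * x_mean

print(f"m = {m:.1f}, c = {c:.1f}")                       # m = 8.0, c = 26.0
print(f"Predicted marks for 6 hours: {m * 6 + c:.1f}")   # 74.0
```

Here each extra hour of study adds about 8 marks to the prediction, and c is the predicted value at zero hours.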
Dataset Handling using Pandas
  • Load a dataset with read_csv(), then inspect it with head() and info().
  • Check missing values using isna().sum().
import pandas as pd

df = pd.read_csv("data.csv")

print(df.head())
df.info()  # info() prints its summary itself and returns None
print(df.isna().sum())
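If no data.csv is at hand, the same calls can be tried on an in-memory CSV. A minimal sketch, assuming hypothetical CSV content with one duplicate row and one missing value, followed by one possible way to clean it (drop_duplicates() and dropna(); filling missing values is an equally valid choice):

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for data.csv
csv_text = """hours,marks
1,35
2,40
2,40
3,
4,60
"""

df = pd.read_csv(io.StringIO(csv_text))
print(df.isna().sum())  # the marks column has 1 missing value

# Remove exact duplicate rows, then drop rows with a missing value
clean = df.drop_duplicates().dropna()
print(clean)
```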
Split Data (Train-Test Split)
  • Split dataset into training and testing sets.
  • Training data is used to learn model parameters.
  • Testing data is used to check model performance.
from sklearn.model_selection import train_test_split

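# X = features, y = target (small illustrative values)
X = [[1], [2], [3], [4], [5]]
y = [35, 40, 50, 60, 65]

# Hold out 20% of the rows for testing; random_state makes the split repeatable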
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
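A quick way to see what test_size and random_state do is to check the sizes of the resulting sets. A small sketch, assuming made-up data with 10 samples:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 10 samples, 1 feature
X = np.arange(10).reshape(-1, 1)
y = np.arange(10) * 2

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(len(X_train), len(X_test))  # 8 2

# The same random_state gives the same split on every run
_, X_test_again, _, _ = train_test_split(X, y, test_size=0.2, random_state=42)
print(np.array_equal(X_test, X_test_again))  # True
```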
Model Training (Linear Regression using scikit-learn)
  • Steps: import → create model → fit → predict → evaluate
  • Library: scikit-learn
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Example dataset
df = pd.DataFrame({
    "hours": [1, 2, 3, 4, 5],
    "marks": [35, 40, 50, 60, 65]
})

X = df[["hours"]]
y = df["marks"]

# Hold out 2 of the 5 rows: R2 is undefined on a single test sample
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=42
)

model = LinearRegression()
model.fit(X_train, y_train)

pred = model.predict(X_test)

print("Predictions:", pred)
print("MSE:", mean_squared_error(y_test, pred))
print("R2:", r2_score(y_test, pred))
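The two metrics can also be computed by hand, which shows what they measure. A short sketch, assuming made-up true values and predictions (not output from the model above):

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up true values and predictions, for illustration only
y_true = np.array([40.0, 50.0, 60.0])
y_pred = np.array([42.0, 50.0, 57.0])

# MSE: average of the squared errors
mse = np.mean((y_true - y_pred) ** 2)

# R2: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

# Each pair of printed values should match
print(mse, mean_squared_error(y_true, y_pred))
print(r2, r2_score(y_true, y_pred))
```

A lower MSE means smaller errors, and an R2 closer to 1 means the model explains more of the variation in the target. With only one test sample, ss_tot is zero, so R2 cannot be computed.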
Mini Workflow: End-to-End ML Steps
  • 1) Load dataset (Pandas)
  • 2) Clean data (missing values, duplicates)
  • 3) Select features (X) and target (y)
  • 4) Split dataset (train/test)
  • 5) Train model (fit)
  • 6) Predict and evaluate
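The six steps above can be sketched in one short script. This is only an illustration: the CSV content is hypothetical (inlined so the script runs without a file), and dropping incomplete rows and a 30% test split are assumptions, not the only reasonable choices:

```python
import io
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Hypothetical CSV content; in practice this would be a file path
csv_text = """hours,marks
1,35
2,40
3,50
4,60
5,65
6,
7,80
8,88
8,88
9,95
10,100
"""

# 1) Load dataset
df = pd.read_csv(io.StringIO(csv_text))

# 2) Clean data: drop duplicate rows and rows with missing values
df = df.drop_duplicates().dropna()

# 3) Select features (X) and target (y)
X = df[["hours"]]
y = df["marks"]

# 4) Split dataset (30% held out so the test set has several rows)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# 5) Train model
model = LinearRegression()
model.fit(X_train, y_train)

# 6) Predict and evaluate
pred = model.predict(X_test)
print("MSE:", mean_squared_error(y_test, pred))
print("R2:", r2_score(y_test, pred))
```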