This repository contains the code and resources for a supervised machine learning project that predicts whether a delivered email will be read, acknowledged, or ignored. The dataset used is data_email_campaign.csv.
Predicting how recipients respond to an email is useful when planning email campaigns. This project uses machine learning techniques to predict the status of each delivered email from the characteristics of the email and its campaign.
The dataset data_email_campaign.csv is included in the 📁 data directory. It contains information about emails sent to customers, including the email type, subject hotness score, campaign type, customer location, and other relevant features.
The data_email_campaign.csv file contains the following columns:
- Email_Id: Email ID of the customer
- Email_Type: Type of email, with two categories, 1 and 2. These can be assumed to correspond to types such as promotional or sales emails
- Subject_Hotness_Score: Score of the email's subject line, based on how good and effective the content is
- Email_Source_Type: Source of the email, such as sales, marketing, or product
- Email_Campaign_Type: Campaign type of the email
- Customer_Location: Categorical variable for the customer's demographic location
- Total_Past_Communications: Total number of previous emails from the same source
- Time_Email_sent_Category: Time of day when the email was sent
- Word_Count: Total number of words in the email
- Total_links: Total number of links in the email
- Total_Images: Total number of images in the email
- Email_Status: Target variable, indicating whether the email was read, acknowledged, or ignored
The project is developed using Python and relies on the following libraries:
- NumPy
- Pandas
- Matplotlib
- Seaborn
- Scikit-learn
The project involves the following steps:
- Data Cleaning and Preparation
- Exploratory Data Analysis
- Visualization and Insights
- Hypothesis Testing
- Feature Engineering & Data Pre-processing
- ML Model Training, Implementation and Evaluation
The first step in this project involves cleaning and preparing the data. This includes checking for missing data, removing duplicates, and converting data types. Some of the specific tasks involved in this step include:
- Handling missing data
- Removing duplicates
- Converting data types
- TimeSeries Analysis
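The cleaning tasks above can be sketched roughly as follows. This is a minimal illustration and not the project's actual notebook code: the toy rows are made up, the `clean` helper is hypothetical, and filling gaps with the median or mode is an assumed strategy.

```python
import numpy as np
import pandas as pd

# Toy stand-in for data_email_campaign.csv; the values are made up for illustration.
raw = pd.DataFrame({
    "Email_Id": ["E1", "E2", "E2", "E3", "E4"],
    "Customer_Location": ["A", None, None, "B", "A"],
    "Total_Past_Communications": [10.0, np.nan, np.nan, 25.0, 31.0],
    "Email_Type": [1, 2, 2, 1, 2],
})


def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Drop duplicate rows, fill missing values, and convert column dtypes."""
    out = df.drop_duplicates().copy()
    # Numeric gap: fill with the column median (an assumed choice).
    median = out["Total_Past_Communications"].median()
    out["Total_Past_Communications"] = out["Total_Past_Communications"].fillna(median)
    # Categorical gap: fill with the most frequent value (an assumed choice).
    mode = out["Customer_Location"].mode().iloc[0]
    out["Customer_Location"] = out["Customer_Location"].fillna(mode)
    # Email_Type is a code rather than a quantity, so store it as a category.
    out["Email_Type"] = out["Email_Type"].astype("category")
    return out.reset_index(drop=True)


cleaned = clean(raw)
print(cleaned)
print(cleaned.dtypes)
```

With the real dataset, `raw` would instead come from `pd.read_csv` pointed at the file in the data directory.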
The next step in the project is to conduct exploratory data analysis. This involves examining the data to understand its distribution, central tendencies, and correlations between variables.
Hypothesis testing is a statistical method used to make inferences about a population based on a sample of data. To perform hypothesis testing on the data_email_campaign.csv dataset, we start with a null hypothesis (H0) and an alternative hypothesis (H1), then use statistical tests to determine whether there is enough evidence to reject the null hypothesis in favor of the alternative hypothesis.
Below is a general step-by-step guide to performing hypothesis testing on a dataset like data_email_campaign.csv:
- Define the Hypotheses
- Choose a Significance Level (α)
- Select the Test
- Perform the Test
- Analyze the Results
- Draw Conclusions
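As one possible instance of the steps above, the sketch below runs a chi-square test of independence between Email_Type and Email_Status. The contingency counts are made up, the choice of test is an assumption rather than the test used in the project, and SciPy is an extra dependency that is not in the library list above.

```python
import numpy as np
from scipy.stats import chi2_contingency

# H0: Email_Status is independent of Email_Type.
# H1: Email_Status depends on Email_Type.

# Made-up contingency table for illustration only.
# Rows: Email_Type (1, 2). Columns: the three Email_Status classes.
observed = np.array([
    [120, 40, 20],
    [90, 60, 30],
])

alpha = 0.05  # significance level

# chi2_contingency returns the statistic, the p-value, the degrees of
# freedom, and the expected counts under H0.
chi2, p_value, dof, expected = chi2_contingency(observed)

reject_h0 = p_value < alpha
print(f"chi2={chi2:.3f}, dof={dof}, p={p_value:.4f}, reject H0: {reject_h0}")
```

For the real data, the observed table could be built with `pd.crosstab` on the two columns.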
- Handling Missing Values
- Handling Outliers
- Label Encoding
- Textual Data Preprocessing
- Feature Manipulation & Selection
- Data Transformation
- Data Scaling
- Dimensionality Reduction
- Data Splitting
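A few of the pre-processing steps listed above (label encoding, data splitting, and data scaling) might look like the sketch below. The rows are made up, and the split ratio and the use of `StandardScaler` are assumptions, not settings confirmed by the project.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Made-up rows that reuse a few column names from the dataset.
df = pd.DataFrame({
    "Customer_Location": ["A", "B", "C", "A", "B", "C", "A", "B"],
    "Word_Count": [440, 504, 962, 610, 947, 416, 116, 1241],
    "Total_links": [8, 5, 5, 16, 4, 11, 4, 6],
    "Email_Status": [0, 0, 1, 0, 0, 1, 2, 2],
})

# Label encoding: map each location string to an integer code.
df["Customer_Location"] = LabelEncoder().fit_transform(df["Customer_Location"])

X = df.drop(columns="Email_Status")
y = df["Email_Status"]

# Data splitting: hold out 25% of the rows for testing (assumed ratio).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Data scaling: fit on the training split only, then apply to both splits,
# so that no information from the test rows leaks into the scaler.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.shape, X_test_scaled.shape)
```

Fitting the scaler after the split, rather than on the full table, keeps the test set untouched until evaluation.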
The dependent variable, Email_Status, is a categorical variable. Hence, classification ML algorithms are used to train the model to predict the dependent variable.
The model is trained with the following ML algorithms:
- Logistic Regression
- Random Forest Classification
- XGBoost Classification
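The sketch below shows how two of these classifiers could be trained and tuned with scikit-learn. It runs on synthetic three-class data from `make_classification` instead of the email dataset, and the hyperparameter grid is a hypothetical example. XGBoost is left out to avoid the extra `xgboost` dependency; its `XGBClassifier` follows the same fit and predict pattern.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic three-class data standing in for the prepared email features.
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=5, n_classes=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Fit each model and record its accuracy on the held-out split.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)

# Hyperparameter tuning with GridSearchCV; this small grid is illustrative only.
grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"max_depth": [3, None]},
    cv=3,
)
grid.fit(X_train, y_train)

print(scores)
print(grid.best_params_)
```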
Sr No. | Model Name | Test Accuracy | Test Recall | Test Precision | Test F1 Score | Test AUC | Train Accuracy | Train Recall | Train Precision | Train F1 Score | Train AUC
---|---|---|---|---|---|---|---|---|---|---|---
0 | Logistic Regression | 0.5424 | 0.5424 | 0.5272 | 0.5173 | 0.7299 | 0.5831 | 0.5831 | 0.6088 | 0.5837 | 0.7666
1 | Logistic Regression + GridSearchCV | 0.5420 | 0.5420 | 0.5269 | 0.5163 | 0.7297 | 0.5828 | 0.5828 | 0.6086 | 0.5831 | 0.7665
2 | Random Forest | 0.9997 | 0.9997 | 0.9997 | 0.9997 | 1.0000 | 0.8087 | 0.8087 | 0.8084 | 0.8083 | 0.9111
3 | Random Forest + GridSearchCV | 0.9997 | 0.9997 | 0.9997 | 0.9997 | 1.0000 | 0.8092 | 0.8092 | 0.8087 | 0.8087 | 0.9119
4 | XGBoost | 0.8087 | 0.8087 | 0.8095 | 0.8026 | 0.9355 | 0.7766 | 0.7766 | 0.7655 | 0.7658 | 0.8951
5 | XGBoost + GridSearchCV | 0.9993 | 0.9993 | 0.9993 | 0.9993 | 1.0000 | 0.8247 | 0.8247 | 0.8212 | 0.8223 | 0.9141
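The five metrics in the table can be computed with scikit-learn roughly as sketched below. The data is synthetic and the `evaluate` helper is hypothetical. Weighted averaging is an assumption: it is consistent with the Accuracy and Recall columns being identical, because weighted recall always equals accuracy, but the averaging mode is not stated in the project.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)
from sklearn.model_selection import train_test_split

# Synthetic three-class data standing in for the prepared email features.
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=5, n_classes=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)


def evaluate(fitted_model, features, labels):
    """Return the five table metrics for one data split."""
    pred = fitted_model.predict(features)
    proba = fitted_model.predict_proba(features)
    return {
        "Accuracy": accuracy_score(labels, pred),
        "Recall": recall_score(labels, pred, average="weighted"),
        "Precision": precision_score(labels, pred, average="weighted", zero_division=0),
        "F1 Score": f1_score(labels, pred, average="weighted"),
        # One-vs-rest AUC computed from the predicted class probabilities.
        "AUC": roc_auc_score(labels, proba, multi_class="ovr", average="weighted"),
    }


train_metrics = evaluate(model, X_train, y_train)
test_metrics = evaluate(model, X_test, y_test)
print("train:", train_metrics)
print("test:", test_metrics)
```

Evaluating both splits with the same helper makes it easy to compare train and test scores side by side and spot overfitting.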