Machine Learning in Action

Credit Card Fraud Detection with Machine Learning in Python with Deployment

Estimated read time: 1:20


    Summary

In a recent video by Tensor Titans, viewers learn how to detect credit card fraud using machine learning. The project, aimed at combating financial fraud, employs a LightGBM model to efficiently classify transactions as fraudulent or legitimate. Through data preprocessing, analysis, and balancing techniques like SMOTE, the workflow builds a robust model that is deployed with a user-friendly Streamlit app. The video provides a detailed walkthrough on creating, evaluating, and deploying a fraud detection system, highlighting the importance of data analysis, model selection, and web interface development.

      Highlights

      • LightGBM utilizes leaf-wise growth for faster and more efficient model training. πŸƒ
      • Balancing datasets with SMOTE generates synthetic fraud cases to improve ML models. πŸ”„
      • Streamlit provides an intuitive interface for deploying machine learning models. πŸ–₯️
      • Using a classification report and ROC AUC score helps in evaluating the model's performance. πŸ“ˆ
      • Creating a frontend with Streamlit enhances accessibility and user interaction. πŸ§‘β€πŸ’»

      Key Takeaways

      • LightGBM is a powerful tool for building efficient machine learning models! πŸš€
      • Data preprocessing is crucial for model accuracy and efficiency. 🧹
      • Deploying models can be made easy with Streamlit. 🌐
      • Balancing data with SMOTE helps improve model performance on imbalanced datasets. βš–οΈ
      • Evaluating models with classification reports and ROC AUC scores is vital for assessing performance. πŸ“Š

      Overview

Tensor Titans introduces an essential project focused on detecting credit card fraud using a machine learning model built with LightGBM. This approach promises speed and efficiency while keeping memory usage low. The workflow starts with data preprocessing and moves through model training and deployment with Streamlit, making it easy for users to interact with the model and check for fraudulent transactions.

Through meticulous data handling and techniques like SMOTE for balancing datasets, the workflow ensures that the model can accurately differentiate between legitimate and fraudulent transactions. The instructional video shows how to code an effective system while emphasizing the importance of evaluating models with ROC AUC scores to measure how well they classify transactions.

          Lastly, the project is capped with a deployment on Streamlit, showcasing a user-friendly interface. This commitment to making machine learning accessible and impactful highlights the channel’s dedication to equipping viewers with the knowledge to build practical and powerful AI systems. Subscribe to Tensor Titans for more insights into machine learning projects.

            Chapters

            • 00:00 - 00:30: Introduction to Credit Card Fraud Detection with Machine Learning The introduction highlights the prevalence and financial impact of credit card fraud and the potential for machine learning to mitigate it. The chapter outlines a project that includes building a fraud detection model and deploying it via a user-friendly app using LightGBM, a high-performance gradient boosting framework.
• 00:30 - 01:00: Understanding LightGBM for the Project The chapter introduces LightGBM, emphasizing its efficiency and speed due to its leaf-wise growth method. It states that LightGBM will be used as the model for the project. The workflow involves several steps: importing the credit card data, performing data pre-processing, conducting data analysis, splitting the data into training and testing sets, training the model on the training set, and then evaluating the model.
• 01:00 - 01:30: Project Workflow Overview In the 'Project Workflow Overview' chapter, the discussion focuses on the final steps of preparing the project for deployment. It covers evaluating the model on various metrics ('matrices' in the audio is a mistranscription of 'metrics'). The chapter highlights the importance of deploying the model so that it can be accessed and used by others online. Before deployment, there are necessary technical steps such as creating and activating a project environment and installing the required libraries. The chapter concludes with the creation of a new file, paving the way for the development of the project application.
• 01:30 - 02:00: Setting Up the Environment and Importing Libraries The chapter focuses on setting up the environment necessary for data handling, visualization, and machine learning tasks. It begins with the importation of the essential libraries: pandas and NumPy for data handling, seaborn and matplotlib for visualization, and scikit-learn for machine learning tools, along with modules to balance the dataset (SMOTE) and to calculate geographic distances between transaction locations (the audio garbles several of these names, so the exact packages are inferred from context). The data is then loaded into a pandas DataFrame from a CSV file named 'dataset.csv' and inspected with df.head(). Code sketches of these pipeline steps follow this chapter list.
• 02:00 - 02:30: Loading and Preprocessing the Dataset The chapter covers the dataset preparation phase of the project: extracting temporal elements such as hour, day, and month from the transaction date and time column using pandas datetime functions, and removing irrelevant columns to streamline the data for the project's requirements (see the preprocessing sketch after this list).
• 02:30 - 03:30: Handling Categorical Columns and Feature Engineering This chapter covers handling categorical columns and feature engineering. It outlines the process of converting the categorical columns, specifically the merchant category and gender columns, into numerical values using a label encoder. Additionally, it introduces a function to calculate the distance between the transaction and merchant locations, since a transaction occurring far from the expected location can signal fraud (see the encoding and distance sketch after this list).
• 03:30 - 04:30: Balancing the Dataset with SMOTE The chapter opens with the creation of a new distance feature, which measures the distance between the transaction location and the merchant location and enriches the dataset. It then defines the features that will be used to train the model: merchant category, amount, CC number, hour, day, month, and gender.
• 04:30 - 05:30: Splitting Dataset and Training the Model In this chapter, the process of preparing data for model training is discussed. The focus is on selecting features and defining the target variable, marking 'is fraud' as the y variable. It is highlighted that the dataset is unbalanced, with significantly fewer fraudulent transactions than legitimate ones, and a count plot is presented to illustrate the imbalance. SMOTE (misheard as 'smart' in the transcription) is introduced as the technique for balancing the dataset to ensure effective training.
• 05:30 - 08:00: Evaluating Model Performance The chapter discusses the use of the Synthetic Minority Over-sampling Technique (SMOTE) to balance the dataset by generating synthetic fraud cases, ensuring an equal number of fraudulent and legitimate transactions, which is visualized through a count plot. After the dataset is balanced, it is split into training and testing sets so the model can be trained and evaluated (see the SMOTE sketch after this list).
• 08:00 - 10:00: Saving the Model and Building Frontend with Streamlit The chapter discusses training the machine learning model using the LightGBM classifier for binary classification. The parameters for the model are specified, including 'gbdt' as the boosting type and a binary objective for classifying transactions as fraudulent or legitimate. The AUC metric is chosen for performance evaluation due to its effectiveness on imbalanced datasets (see the training and Streamlit sketches after this list).
• 10:00 - 11:00: Testing the Model Predictions In this chapter, the process of testing model predictions is discussed with specific focus on setting the learning rate, the number of leaves in the decision trees, and the max depth. It highlights the importance of monitoring for overfitting when trees have no limit on depth. The chapter further explains specifying the number of trees (n_estimators) and the importance of fitting the model on the training data (X_train and y_train) to enable accurate future predictions.
• 11:00 - 12:30: Deploying the Model on GitHub and Streamlit In this chapter, the discussion focuses on deploying the model using GitHub and Streamlit. It first covers using the LightGBM model to generate predictions: the test set is passed to the model, and its efficacy is assessed with a classification report and the ROC AUC score, which together show how well the model distinguishes fraudulent from legitimate cases. It then identifies the important features of the trained model by plotting the top 10 feature importances (see the evaluation sketch after this list).
            • 12:30 - 13:00: Conclusion and Encouragement to Subscribe In this section, the conclusion is drawn after discussing the top 10 most important features in training the model to differentiate between fraudulent and legitimate transactions. The chapter moves on to calculate the False Positive Rate (FPR) and True Positive Rate (TPR) of the classification model, followed by determining the ROC (Receiver Operating Characteristic) AUC (Area Under Curve) value. This measurement helps to illustrate the performance of the classification model. The ROC curve is demonstrated to provide a visual understanding of the model's ability to classify correctly. The chapter also encourages readers to subscribe for more updates or information.
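
            Code Sketches

            The chapter summaries above describe the pipeline without showing code. The minimal sketches below reconstruct each step from the video's narration; column names such as trans_date_trans_time, lat, long, merch_lat, merch_long, and is_fraud are assumptions (the summary does not reveal the exact dataset schema), so treat them as illustrative rather than the author's exact code. First, loading the data and extracting the hour, day, and month features:

```python
import pandas as pd

# Load the credit card transactions (filename as mentioned in the video).
df = pd.read_csv("dataset.csv")

# Parse the transaction timestamp and extract temporal features.
# The timestamp column name is an assumption; adjust to your schema.
df["trans_date_trans_time"] = pd.to_datetime(df["trans_date_trans_time"])
df["hour"] = df["trans_date_trans_time"].dt.hour
df["day"] = df["trans_date_trans_time"].dt.day
df["month"] = df["trans_date_trans_time"].dt.month

# Drop the raw timestamp along with any other columns the project does not use.
df = df.drop(columns=["trans_date_trans_time"])
```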
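
            Next, encoding the categorical columns and engineering the distance feature. The video does not show its distance formula, so the haversine implementation below is one common choice, not necessarily the author's:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Encode categorical columns to numeric codes (column names assumed).
encoder = LabelEncoder()
for col in ["merchant", "category", "gender"]:
    df[col] = encoder.fit_transform(df[col])

def haversine(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two latitude/longitude points."""
    lat1, lon1, lat2, lon2 = map(np.radians, [lat1, lon1, lat2, lon2])
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 6371 * 2 * np.arcsin(np.sqrt(a))

# A transaction far from the merchant's location can be a red flag.
df["distance"] = haversine(df["lat"], df["long"], df["merch_lat"], df["merch_long"])
```

            Note that reusing one LabelEncoder across columns, as the narration suggests, keeps only the last column's fitted classes; fitting a separate encoder per column is the safer pattern if you need to decode or transform new inputs later.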
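
            Balancing with SMOTE and splitting the data. The video balances the full dataset before splitting; a stricter variant applies SMOTE to the training set only, so no synthetic samples leak into the test set:

```python
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split

# Feature list as described in the video; exact column names are assumptions.
features = ["merchant", "category", "amt", "cc_num", "hour", "day",
            "month", "gender", "distance"]
X = df[features]
y = df["is_fraud"]

# SMOTE synthesizes minority-class (fraud) samples until both classes
# are roughly equal in size.
X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)

# Hold out a test set for evaluation (the 80/20 split is an assumption).
X_train, X_test, y_train, y_test = train_test_split(
    X_res, y_res, test_size=0.2, random_state=42)
```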
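
            Training the LightGBM classifier with the parameters stated in the video:

```python
from lightgbm import LGBMClassifier

lgb_model = LGBMClassifier(
    boosting_type="gbdt",  # gradient-boosted decision trees, built sequentially
    objective="binary",    # two outcomes: fraudulent or legitimate
    metric="auc",          # AUC suits imbalanced data
    learning_rate=0.05,    # gradual convergence, lower overfitting risk
    num_leaves=31,         # leaves per tree (leaf-wise growth)
    max_depth=-1,          # no depth limit; monitor for overfitting
    n_estimators=200,      # number of boosting rounds
)
lgb_model.fit(X_train, y_train)
```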
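
            Evaluating with a classification report, the ROC AUC score, the top-10 feature importances, and the ROC curve. ROC AUC is computed on predicted probabilities rather than hard labels:

```python
import matplotlib.pyplot as plt
import lightgbm as lgb
from sklearn.metrics import classification_report, roc_auc_score, roc_curve

y_pred = lgb_model.predict(X_test)
print(classification_report(y_test, y_pred))

# Probability of the positive (fraud) class for ranking-based metrics.
y_proba = lgb_model.predict_proba(X_test)[:, 1]
print("ROC AUC:", roc_auc_score(y_test, y_proba))

# Top-10 most important features learned by the model.
lgb.plot_importance(lgb_model, max_num_features=10)
plt.show()

# ROC curve: the closer it hugs the top-left corner (high TPR, low FPR),
# the better the classifier.
fpr, tpr, _ = roc_curve(y_test, y_proba)
plt.plot(fpr, tpr, label="LightGBM")
plt.plot([0, 1], [0, 1], linestyle="--", label="chance")
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()
```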
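
            Finally, saving the artifacts with joblib and serving them from a Streamlit frontend. The filenames are approximated from the video's audio, and the widget labels and feature handling are assumptions reconstructed from the narration, not the author's exact code:

```python
# Save after training (filenames approximated from the audio).
import joblib
joblib.dump(lgb_model, "fraud_detection_model.jb")
joblib.dump(encoder, "label_encoder.jb")
```

```python
# app.py -- a minimal sketch of the Streamlit interface.
import joblib
import numpy as np
import pandas as pd
import streamlit as st

model = joblib.load("fraud_detection_model.jb")
encoder = joblib.load("label_encoder.jb")

def haversine(lat1, lon1, lat2, lon2):
    """Distance in km between the transaction and the merchant."""
    lat1, lon1, lat2, lon2 = map(np.radians, [lat1, lon1, lat2, lon2])
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 6371 * 2 * np.arcsin(np.sqrt(a))

st.title("Fraud Detection System")
st.subheader("Enter the transaction details below")

# Input widgets for each model feature (labels are illustrative).
merchant = st.text_input("Merchant")
category = st.text_input("Category")
amt = st.number_input("Amount", min_value=0.0)
cc_num = st.text_input("Credit card number")
hour = st.number_input("Hour", min_value=0, max_value=23)
day = st.number_input("Day", min_value=1, max_value=31)
month = st.number_input("Month", min_value=1, max_value=12)
gender = st.selectbox("Gender", ["M", "F"])
lat = st.number_input("Latitude")
lon = st.number_input("Longitude")
merch_lat = st.number_input("Merchant latitude")
merch_lon = st.number_input("Merchant longitude")

if st.button("Check for fraud"):
    row = pd.DataFrame([{
        "merchant": merchant, "category": category, "amt": amt,
        "cc_num": cc_num, "hour": hour, "day": day, "month": month,
        "gender": gender,
        "distance": haversine(lat, lon, merch_lat, merch_lon),
    }])
    # Encode categorical inputs with the encoder saved at training time;
    # transform() raises on labels the encoder has never seen.
    for col in ["merchant", "category", "gender", "cc_num"]:
        row[col] = encoder.transform(row[col])
    if model.predict(row)[0] == 1:
        st.error("This transaction is fraudulent.")
    else:
        st.success("This transaction is legitimate.")
```

            Run it locally with streamlit run app.py; to share it, push app.py, requirements.txt, and the saved .jb files to a GitHub repository and deploy through Streamlit's cloud, as shown at the end of the video.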

            Credit Card Fraud Detection with Machine Learning in Python with Deployment Transcription

• 00:00 - 00:30 hey everyone welcome back to Tensor Titans today we are diving into an exciting and crucial project credit card fraud detection using machine learning fraud in transactions costs businesses billions of dollars each year and AI can help us fight back we will build a powerful fraud detection model and a Streamlit app to make predictions in real time so make sure to watch till the end because we will train our model evaluate it and even deploy it with a user-friendly interface so before we see the workflow for our project let me introduce you to LightGBM so LightGBM is Light Gradient Boosting Machine it is a high performance gradient boosting framework that is optimized for speed and efficiency unlike traditional
• 00:30 - 01:00 decision trees LightGBM uses leaf-wise growth which means it builds the most significant splits first making it much faster and memory efficient so now that you know what LightGBM is we will be using it as the model for our project so now let's see what's going to be the workflow for our project so first we will start by importing the credit card data after importing the data we will perform some data pre-processing over the data after the data pre-processing we will do some data analysis and after that we will split our data into the training and the testing set then on the training set we will train our model and after training our model we will evaluate
• 01:00 - 01:30 it on various metrics and then finally we will deploy our model so that other users can use it online so now let's jump into the coding part so let me just quickly create and activate an environment for our project now let's just quickly install the libraries for our project let's close the terminal now we will create a new file app.ipynb
• 01:30 - 02:00 let me just quickly import the essential libraries so here we are using pandas and NumPy for handling our data seaborn and matplotlib will help us to visualize it scikit-learn provides tools for machine learning we also use modules to balance the data set and to calculate the distance between the transaction locations now let's load our data set so df = pd.read_csv("dataset.csv") let me just quickly show you the data set so df.head()
• 02:00 - 02:30 so this is the data set for our project now we will first handle the transaction date and transaction time column we will be extracting the hour day and month from this column so we will be using the to_datetime function for the pre-processing of the transaction date and transaction time column now let's drop the irrelevant features or columns from the data set so these are the features which are irrelevant and not required by our project so we will be dropping these columns
• 02:30 - 03:00 so now this is the data set with the columns that we are using in our project so now let's first handle the categorical columns in our data set as the merchant category and gender columns are the categorical columns so we will be using a label encoder to encode these columns into a numerical format now here we will create a function which will calculate the distance between the transaction location and the merchant location as a transaction happening far from the usual location can be a red flag so we will be calculating the distance between them
• 03:00 - 03:30 we will create a separate feature for the distance between the transaction location and the merchant location so as you can see over here a separate feature distance is created in the data set now let's define the features which we will be using while training our model so we'll be using the features merchant category amount CC number hour day month and gender and we
• 03:30 - 04:00 will be passing these features as the X variable for the model and for the y variable we will be using the column is fraud as our data set is highly unbalanced let me show you through a graph how the number of fraudulent transactions is very small compared to the legitimate transactions so we will be using a count plot for plotting the number of fraudulent transactions and the legitimate transactions so as you can see the number of fraudulent transactions is very small compared to the legitimate transactions with the help of SMOTE we will balance our data set SMOTE is a
• 04:00 - 04:30 synthetic minority over-sampling technique it helps us to balance our data set by generating synthetic fraud cases now let me show you again through the count plot as you can see over here that now our data set is balanced the number of fraudulent transactions and the number of legitimate transactions are almost equal as our data set is now balanced let's split our data set into the training and the testing split so that we can use the training set for the training of our model and the testing set for the evaluation of our model
• 04:30 - 05:00 as our data set is now split into the training and the testing set let's train our model so we will be using the LGBMClassifier as our model for the project now let's specify the parameters for our project so we will specify the boosting type as gbdt as it is the most common type it is a gradient boosting decision tree where trees are built sequentially to reduce residual errors then we will specify the objective as binary as this is a binary classification where two possible outcomes are there which is the fraudulent transaction or the legitimate transaction then we will specify the metric AUC which is a good metric for an imbalanced data set as it considers both sensitivity and specificity then
• 05:00 - 05:30 we will specify the learning rate as 0.05 which gives a gradual convergence reducing the risk of overfitting then we will specify the number of leaves in each decision tree which is 31 and then we keep the max depth as minus one minus one means that there is no limit on the depth allowing trees to grow as needed however we must monitor for overfitting then we will specify n_estimators equal to 200 which specifies the number of trees and then finally we fit our model on X_train and y_train where X_train holds the features and y_train the labels this step is essential as it allows the model to learn from the training data to make future
• 05:30 - 06:00 predictions now let's predict through our model so y_pred = lgb_model.predict and we will pass X_test as the parameter now we will evaluate our model through the classification report and the ROC AUC score which will help us to measure how well the model is distinguishing between the fraud cases and the legitimate cases so the classification report and the ROC score show that our model is classifying efficiently now let me show you the important features for the training of our model so we will plot a graph which will show the top 10 features important for our model for the
• 06:00 - 06:30 prediction so these are the top 10 feature importances in the training of our model which enable the model to classify the fraudulent and the legitimate transactions now we will first calculate the FPR which is the false positive rate and the TPR the true positive rate of our classification model then we will calculate the ROC AUC value of our model now let me show you the ROC curve of this model which is a graphical representation of a classification
• 06:30 - 07:00 model which shows its performance across the different threshold values a good model will have a curve that leans toward the top left corner which is a higher TPR and a low FPR TPR is the true positive rate and FPR is the false positive rate an ROC curve closer to the point (0, 1) is a better model whereas one closer to the diagonal is a poor model you can see the ROC curve of our model and the ROC curve with an AUC of 0.9 shows that our model is classifying the legitimate and the fraudulent transactions efficiently now let's save our model with the help of joblib.dump we will also be saving the encoder used for the label encoding of the categorical
• 07:00 - 07:30 features as you can see the model is saved now so let's start building a front end for our project so we will be creating a Streamlit interface for our project so let's create a file app.py let's import the necessary libraries for our project now we will load our model so model = joblib.load("fraud_detection_model.jb") and we will also load our encoder so encoder = joblib.load
• 07:30 - 08:00 ("label_encoder.jb") and then we will create a function to handle the distance between the transaction location and the cardholder location let's give the title for our interface so st.title("Fraud Detection System") let's create a heading enter the transaction details below and then we will create the input areas for various features like
• 08:00 - 08:30 merchant category and so on and after getting the input of the latitude longitude merchant latitude and merchant longitude we will pre-process it through the function we created above which we
            • 08:30 - 09:00 will give to the model for prediction and then we will create a button check for fraud and then we will create a pandas data frame for all the inputs received from the user and we will also encode the categorical columns which are the merchant category and the gender with the help of the label encoder we loaded above
• 09:00 - 09:30 we also transform the credit card number received from the user for the predictions after all the pre-processing on the user's input we will pass the pandas data frame of input data to the model for the predictions after getting the predictions from the model we will show it on the web interface if the prediction received is one then the transaction is fraudulent and if the prediction received is zero then the transaction is
            • 09:30 - 10:00 legitimate so this is the complete code for the front end of our project so now let's see if our project is working or not so we will go to the terminal we will type streamlit run app.py now let's see if our project is working or not so let me just quickly fill the values and check whether it's working or not
            • 10:00 - 10:30 so as you can see that it predicted the legitimate transaction correctly now let's see if it's able to predict the fraudulent transaction as well or not so let me just quickly fill the values for a fraudulent transaction and check whether it's predicting correctly or not
• 10:30 - 11:00 so as you can see our model is working fine so now let's deploy our model so let me just quickly go to my GitHub account here we will create a GitHub repository let's name it fraud detection system and click on create repository we will copy this code let's activate our environment so venv\Scripts\activate and now our environment is active let's create a requirements file for our
• 11:00 - 11:30 project so pip freeze > requirements.txt now let's initialize our Git repo so git init and now we will add all the required files so git add app.py requirements.txt fraud_detection_model.jb label_encoder.jb press enter and then we will paste the code and press enter so now here you can see the files are
• 11:30 - 12:00 uploaded let's copy this URL so now let's open the project again now we will deploy our model so click on deploy now paste the link that we copied of the app.py URL over here let's name our app fraud detection system and click on deploy let's wait while it's deploying our project and yes as you can see our project is deployed now let's check if it's working or not so let me just quickly fill the values
• 12:00 - 12:30 so as you can see our project is working fine so that's it for today guys so if you liked the project do like share and comment and subscribe to Tensor Titans for more exciting machine learning videos [Music]