Editorial Summary :
An imbalanced dataset is one where one or more of the classes are heavily over-represented in relation to the other class(es) In the context of credit card fraud, we won’t be impressed by a model producing a high accuracy score . Instead, we create value for the business by accurately predicting the presence of fraud, not its absence . In this example, we use the recall score as the primary indicator of our model’s performance . The most important attribute of the data is that out of the 284,807 rows of data, only 0.173% rows are cases of fraud . Recall is a measure of how many of the positive cases are accurately predicted by the model . In our specific use case, a higher recall score means we detected more cases of fraud . In fact, we’re going to extend the idea of recall to include the financial costs associated with our classification . We want roughly 0.173% (our global fraud proportion) of our transactions to be fraudulent in both the train and test sets . The way we setup our baseline model treats both our classes as equally important . Luckily the sklearn API makes it extremely easy to let it know of our preference for properly identifying fraud . Using class_weight in your model will definitely help you get more relevant performance from your model . This simple change bumps us up to 90% recall and 93% financial recall! We’re now much more sensitive to fraud and we’ll accurately identify far more cases of fraud than we did using the baseline model . There are many other topics to discuss in this domain, so be on the lookout for future articles in this article . The sklearn API makes the process even easier than that .
Key Highlights :
- An imbalanced dataset is one where one or more of the classes are heavily over-represented in relation to the other class(es) In this example, we’ll use this example dataset to model and predict fraud .
- Recall is a measure of how many of the positive cases are accurately predicted by the model .
- In our specific use case, a higher recall score means we detected more cases of fraud .
- Sklearn uses class_weight to create a dictionary with a key for each class .
- The dictionary is a simple equation that looks at the weight of each class in the data .
The editorial is based on the content sourced from medium.com