Are you suffering from Accuracy Syndrome?

If your classification model shows great accuracy during testing, yet it cannot predict the events of interest accurately when validated against future real-time data, you are suffering from accuracy syndrome.

It is detrimental to push such models into decision making. You need to work on your model a little more.

It happens most of the time when the predictor classes are not balanced.

The following are some scenarios where the predictor class can be disproportionate:

1) Redeeming the offers issued by retail stores – People tend to forget offers very quickly. Even though retail stores spend a lot of money every week sending promotions and offers to their customers, in practice customers are reluctant to avail themselves of them. Only a very small proportion of customers come back to the store to redeem the offers and turn them into actual purchases.

2) Incidents taking place in an old-age home – In an old-age home where caretakers are experienced and staff do their work responsibly and efficiently, incidents are very rare. If you get a project to predict the incidents in the coming week or month, you will have very few incidents in comparison to non-incidents.

3) Detecting malignant tumours – When the imaging department of a hospital gives you a project to build a supervised model for finding malignant tumours, you may come across only a negligible number of cancerous tumours; most are benign. In such a scenario the model can report a high accuracy without identifying the malignant cases.

There are many other cases where the predictor classes are disproportionate. They require you to approach the modelling a little differently.
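To see why accuracy alone misleads here, consider a minimal sketch that uses the 99.2% vs 0.8% ratio of the first scenario. The data is made up for illustration, and the "model" is simply an assumed baseline that always predicts the larger class; the helper names `accuracy` and `recall` are hypothetical.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Fraction of actual positives that the model found."""
    true_pos = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    actual_pos = sum(t == positive for t in y_true)
    return true_pos / actual_pos if actual_pos else 0.0


# Illustrative data: 1000 customers, of whom 8 (0.8%) redeemed an offer.
y_true = [1] * 8 + [0] * 992

# Assumed baseline: predict "did not redeem" for everybody.
y_pred = [0] * 1000

baseline_accuracy = accuracy(y_true, y_pred)  # 992 of 1000 correct
baseline_recall = recall(y_true, y_pred)      # no redeemer is found

print(f"accuracy={baseline_accuracy:.3f} recall={baseline_recall:.3f}")
```

The baseline is right 99.2% of the time and still finds none of the customers who redeem, which is exactly the symptom described above.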

The following are the steps to overcome accuracy syndrome:

1) Check the proportions of the prediction classes. There is no fixed threshold beyond which the classes are called disproportionate; this has to be decided by speaking to the domain expert in your organization. The scenarios mentioned above have class proportions in the ratio of (99.2% vs 0.8%), (95% vs 5%) and (98% vs 2%) respectively.
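Checking the proportions takes only a few lines. The sketch below assumes the labels are available as a plain Python list; `class_proportions` is a hypothetical helper name, and the data mirrors the 95% vs 5% ratio of the second scenario.

```python
from collections import Counter


def class_proportions(labels):
    """Return the share of each class as a fraction of all rows."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: n / total for cls, n in counts.items()}


# Illustrative labels: 0 = no incident, 1 = incident.
labels = [0] * 95 + [1] * 5
proportions = class_proportions(labels)

for cls, share in sorted(proportions.items()):
    print(f"class {cls}: {share:.1%}")
```

Whether a split like this counts as disproportionate is still the domain expert's call; the code only reports the numbers.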

2) Set aside enough data points to serve as a testing dataset. The objective of separating out the testing dataset before preparing the training dataset is to test how your model works in real-time scenarios.
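One possible way to set the testing data aside is sketched below. It assumes a stratified split, so that the held-out rows keep the original class proportions; the article does not prescribe this, and `stratified_holdout`, the 20% test fraction and the sample data are illustrative choices.

```python
import random


def stratified_holdout(labels, test_fraction=0.2, seed=0):
    """Split row indices into train/test, sampling each class separately
    so the test set keeps roughly the original class proportions."""
    rng = random.Random(seed)
    by_class = {}
    for i, label in enumerate(labels):
        by_class.setdefault(label, []).append(i)

    test_idx = []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Keep at least one row of every class in the test set.
        n_test = max(1, round(len(indices) * test_fraction))
        test_idx.extend(indices[:n_test])

    held_out = set(test_idx)
    train_idx = [i for i in range(len(labels)) if i not in held_out]
    return train_idx, sorted(test_idx)


# Illustrative data: 2% positives, as in the tumour scenario.
labels = [1] * 20 + [0] * 980
train_idx, test_idx = stratified_holdout(labels)
test_positives = sum(labels[i] for i in test_idx)
print(len(train_idx), len(test_idx), test_positives)
```

Any re-sampling in the next step is then applied only to the rows in `train_idx`, so the testing dataset stays untouched.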

3) There are two ways to tackle this problem:

a) Under-sampling: When you have enough data, the larger class can be under-sampled to match the proportion of the smaller class. A calculation is given below:

Inference: Accuracy is highest for Proportion 3. Hence the model will be built on the training set that has 60% success incidents and 40% failure incidents. Once the model is built, metrics such as precision, recall and F1 score, along with the area under the curve, will be calculated. A confusion matrix will also be created to calculate the accuracy. The decision will be taken based on these metrics as measured on the testing dataset created in step 2.
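A minimal sketch of the under-sampling and the evaluation is given here. It assumes random under-sampling of the larger class down to a chosen share (60% for the smaller class, as in the inference above) and computes precision, recall and F1 from confusion-matrix counts. The function names, the sample labels and the predictions are all made up for illustration, and the area under the curve is left out because it needs predicted scores rather than labels.

```python
import random


def undersample(labels, minority_label, minority_share, seed=0):
    """Return row indices in which the minority class makes up roughly
    `minority_share` of the sample; majority rows are dropped at random."""
    rng = random.Random(seed)
    minority = [i for i, y in enumerate(labels) if y == minority_label]
    majority = [i for i, y in enumerate(labels) if y != minority_label]
    n_majority = round(len(minority) * (1 - minority_share) / minority_share)
    n_majority = min(n_majority, len(majority))
    kept = minority + rng.sample(majority, n_majority)
    rng.shuffle(kept)
    return kept


def confusion_counts(y_true, y_pred, positive=1):
    """Count true positives, false positives, false negatives, true negatives."""
    tp = fp = fn = tn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive:
            if t == positive:
                tp += 1
            else:
                fp += 1
        elif t == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Illustrative training labels: 30 successes among 1000 rows.
labels = [1] * 30 + [0] * 970
train_idx = undersample(labels, minority_label=1, minority_share=0.6)
train_labels = [labels[i] for i in train_idx]

# Made-up true labels and predictions for a small testing dataset.
y_test = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_hat = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
tp, fp, fn, tn = confusion_counts(y_test, y_hat)
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
print(len(train_idx), sum(train_labels), (tp, fp, fn, tn), precision, recall, f1)
```

Trying several values of `minority_share` and comparing the resulting metrics on the untouched testing dataset is one way to reproduce the kind of comparison the inference refers to.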

b) Over-sampling: Be careful while over-sampling the smaller class. Over-sampling repeats rows of the smaller class to enlarge it. This becomes a problem when the smaller class is very small: there may not be enough distinct data points for the model to learn from.
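The caution about repetition can be made concrete with a short sketch. It assumes random over-sampling with replacement until both classes are the same size; `oversample` and the 95/5 sample data are hypothetical.

```python
import random


def oversample(labels, minority_label, seed=0):
    """Return row indices in which minority rows are repeated (sampled
    with replacement) until both classes have the same number of rows."""
    rng = random.Random(seed)
    minority = [i for i, y in enumerate(labels) if y == minority_label]
    majority = [i for i, y in enumerate(labels) if y != minority_label]
    shortfall = len(majority) - len(minority)
    extra = rng.choices(minority, k=shortfall) if minority and shortfall > 0 else []
    return majority + minority + extra


# Illustrative labels: only 5 minority rows among 100.
labels = [1] * 5 + [0] * 95
balanced_idx = oversample(labels, minority_label=1)

minority_rows = [i for i in balanced_idx if labels[i] == 1]
distinct_minority_rows = set(minority_rows)
print(len(balanced_idx), len(minority_rows), len(distinct_minority_rows))
```

The balanced sample holds 95 minority rows, but they are copies of the same 5 originals, so the model sees no new information about the smaller class.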
