Using Aster Data's Naive Bayes functions
By Mark Ott, Teradata Aster
Naïve Bayes is a set of functions for training a classification model. The model is generated from a training data set in which we know the outcome (the Predictor column) for each combination of input variable columns.
We then run the model against a set of input variables for which we do not know the Predictor to see what the model says. It’s quite similar to a Decision Tree with one big exception: the input variables are assumed to be independent of one another. This is a strong assumption, but it makes the computation of the model extremely simple.
So let’s first look at a generic example to make sense of all this before we write the code.
Suppose you haven’t been feeling well so you go to the doctor and she diagnoses you with the flu. So you take the flu test to confirm this. For someone who really has the flu, the probability the test returns positive is 90%. If someone doesn’t have the flu, it returns positive 9% of the time. The test returns Positive for you. That’s not good news.
But what is the true probability you really have the flu? Hmmm, that’s a good question. Subtracting 9% from 90% to get 81% sounds like a good guess. But actually that’s way off. Let’s use Naïve Bayes to get the real probability you have the flu.
The first thing we need is a training data set. In other words, we need to know how common the flu is in the population. After some research you discover only 1% of the people in the US have the flu. This is your base rate and that’s what we work from. From here, it’s easy. We look at the input variables (have flu, don’t have flu) and run them through the probabilities. Once we get these numbers, we plug them into an equation to get the true probability you have the flu.
As the graphic below shows, you only have about a 9% chance of having the flu. That’s a lot different than the 81% chance we originally guessed.
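For readers who want to check the arithmetic, here is a minimal Python sketch of the flu calculation using Bayes' theorem. The function and parameter names are my own for illustration; only the three percentages (90%, 9%, and the 1% base rate) come from the example above.

```python
def posterior(prior, p_pos_given_flu, p_pos_given_no_flu):
    """Probability of having the flu given a positive test (Bayes' theorem)."""
    # People who have the flu AND test positive.
    true_positives = p_pos_given_flu * prior
    # People who do NOT have the flu but test positive anyway.
    false_positives = p_pos_given_no_flu * (1.0 - prior)
    # Share of all positive tests that are real cases.
    return true_positives / (true_positives + false_positives)


p_flu = posterior(prior=0.01, p_pos_given_flu=0.90, p_pos_given_no_flu=0.09)
print(f"P(flu | positive test) = {p_flu:.1%}")  # prints 9.2%
```

The base rate does most of the work here: because only 1% of people have the flu, the false positives from the other 99% far outnumber the true positives.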
So that’s the Big Picture. Now let’s turn our attention to an example of Naïve Bayes using the 3 pre-built functions in Aster: naiveBayesReduce, naiveBayesMap, and naiveBayesPredict.
To build the model from the training data set we will be using naiveBayesReduce and naiveBayesMap together in the code. Our known data set is the following table, named nb_samples_stolenCars.
Our Predictor column = Stolen. The Input variables will be Year, Color, Type, and Origin. Basically we will run the code against this data and it will create a model of which cars are candidates for being stolen based on the 4 input variables. We can then run the model against an entirely new set of input criteria for a car that is not in the model and it will predict if it is a candidate for being stolen.
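Before getting to the Aster code, it may help to see roughly what "creating a model" means. The Python sketch below is only a conceptual illustration, not how the Aster functions are implemented: it assumes a common approach in which categorical inputs are summarized by per-class counts and numeric inputs by a per-class mean and variance. The rows are made up for this sketch and are not the contents of nb_samples_stolenCars.

```python
from collections import Counter, defaultdict

# Made-up training rows for illustration only; these are NOT the contents
# of nb_samples_stolenCars. Columns: year, color, type, origin, stolen.
ROWS = [
    (3, "Red", "Sports", "Domestic", "Yes"),
    (5, "Red", "Sports", "Domestic", "No"),
    (2, "Yellow", "Sports", "Imported", "Yes"),
    (8, "Yellow", "SUV", "Imported", "No"),
    (6, "Yellow", "SUV", "Domestic", "No"),
    (4, "Red", "Sports", "Imported", "Yes"),
]

NUMERIC = {"year": 0}                                 # numeric input -> column index
CATEGORICAL = {"color": 1, "type": 2, "origin": 3}    # categorical inputs
RESPONSE = 4                                          # the Predictor column


def train(rows):
    """Collect the per-class statistics a Naive Bayes model is built from."""
    class_counts = Counter(row[RESPONSE] for row in rows)
    category_counts = defaultdict(Counter)   # (class, column) -> value counts
    numeric_values = defaultdict(list)       # (class, column) -> raw values
    for row in rows:
        label = row[RESPONSE]
        # Each input column is summarized on its own, per class. This is
        # the independence assumption at work.
        for name, idx in CATEGORICAL.items():
            category_counts[(label, name)][row[idx]] += 1
        for name, idx in NUMERIC.items():
            numeric_values[(label, name)].append(row[idx])
    numeric_stats = {}
    for key, values in numeric_values.items():
        mean = sum(values) / len(values)
        variance = sum((v - mean) ** 2 for v in values) / len(values)
        numeric_stats[key] = (mean, variance)
    return class_counts, category_counts, numeric_stats


class_counts, category_counts, numeric_stats = train(ROWS)
print(class_counts)
print(category_counts[("Yes", "color")])
print(numeric_stats[("Yes", "year")])
```

Notice that no column is ever combined with another. Each one is counted separately within each class, which is why the model is so cheap to compute.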
So let's get started. Here’s the initial code using the 2 functions:
The first 4 lines of code are generic, as is the last line using PARTITION, so there’s not much to talk about there. Let’s go over the other keywords:
- The ON clause points to the known data in the table shown in the earlier screenshot
- RESPONSE points to the Predictor column; in this case, the Stolen column
- NUMERICINPUTS and CATEGORICALINPUTS point to the input variable columns (in our case, Year, Color, Type, Origin). Note these are broken out by data type, with Year going into NUMERICINPUTS and the other 3 (Color, Type, Origin) lumped into CATEGORICALINPUTS since they are text-based.
That’s about it. Once you run this code, you have your model as shown below.
At this point, you can run this model through the naiveBayesPredict function and point it at the new rows for which you wish to predict whether the car will be stolen or not.
Suppose you are thinking about a new vehicle and the car dealer says he has a special on all Red SUV Domestics between 1 and 7 years old. You are concerned about the probability of theft so you look at the known data (from nb_samples_StolenCars) but there's no data for those vehicles. At this point, I would insert these 7 rows into a table named CarTypeCandidate.
insert into CarTypeCandidate values (11, 1, 'Red', 'SUV', 'Domestic');
insert into CarTypeCandidate values (12, 2, 'Red', 'SUV', 'Domestic');
insert into CarTypeCandidate values (13, 3, 'Red', 'SUV', 'Domestic');
insert into CarTypeCandidate values (14, 4, 'Red', 'SUV', 'Domestic');
insert into CarTypeCandidate values (15, 5, 'Red', 'SUV', 'Domestic');
insert into CarTypeCandidate values (16, 6, 'Red', 'SUV', 'Domestic');
insert into CarTypeCandidate values (17, 7, 'Red', 'SUV', 'Domestic');
I then point to this table and run it through the model using the Predict function as shown below:
Here's the code and the result set of the query:
It looks like just about every one of the Red SUV Domestics has a good chance of being stolen except for the 7-year-old model. Note that the higher of the YES and NO numbers determines the prediction.
So there you have it. Keep your insurance rates low and buy the 7-year-old vehicle.
In conclusion, Naïve Bayes creates a model that can then be used to predict the outcome of future observations, based on their input variables.
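To illustrate the "higher number wins" rule, here is a rough Python sketch of how a Naive Bayes prediction can be scored. It is not the naiveBayesPredict implementation. The model numbers are hypothetical values I picked for the sketch, the numeric input is assumed to follow a Gaussian (a common choice), and the helper names are my own.

```python
import math

# Hypothetical model statistics for illustration only; these are NOT the
# output of naiveBayesReduce. "year" holds an assumed (mean, variance).
MODEL = {
    "Yes": {"prior": 0.5, "year": (3.0, 2.0),
            "color": {"Red": 0.6}, "type": {"SUV": 0.2},
            "origin": {"Domestic": 0.4}},
    "No": {"prior": 0.5, "year": (6.0, 2.0),
           "color": {"Red": 0.4}, "type": {"SUV": 0.6},
           "origin": {"Domestic": 0.6}},
}


def gaussian_pdf(x, mean, variance):
    """Density of a normal distribution, used for the numeric input."""
    return math.exp(-((x - mean) ** 2) / (2 * variance)) / math.sqrt(2 * math.pi * variance)


def log_score(row, stats):
    """Log of prior times each input's likelihood, treated independently."""
    year, color, car_type, origin = row
    return (math.log(stats["prior"])
            + math.log(gaussian_pdf(year, *stats["year"]))
            + math.log(stats["color"][color])
            + math.log(stats["type"][car_type])
            + math.log(stats["origin"][origin]))


def predict(row):
    """Score every class and return the label with the highest score."""
    scores = {label: log_score(row, stats) for label, stats in MODEL.items()}
    return max(scores, key=scores.get), scores


for year in (1, 7):
    label, scores = predict((year, "Red", "SUV", "Domestic"))
    print(year, label, scores)
```

Logs are used so that multiplying many small probabilities does not underflow. A fuller version would also need to handle category values that never appeared in training, which this sketch leaves out.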