Source code for NaiveBayesMultinomial.java (weka.classifiers.bayes)

/*
 *    This program is free software; you can redistribute it and/or modify
 *    it under the terms of the GNU General Public License as published by
 *    the Free Software Foundation; either version 2 of the License, or
 *    (at your option) any later version.
 *
 *    This program is distributed in the hope that it will be useful,
 *    but WITHOUT ANY WARRANTY; without even the implied warranty of
 *    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 *    GNU General Public License for more details.
 *
 *    You should have received a copy of the GNU General Public License
 *    along with this program; if not, write to the Free Software
 *    Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
 */

/*
 * NaiveBayesMultinomial.java
 * Copyright (C) 2003 University of Waikato, Hamilton, New Zealand
 */

package weka.classifiers.bayes;

import weka.classifiers.Classifier;
import weka.core.Capabilities;
import weka.core.Instance;
import weka.core.Instances;
import weka.core.TechnicalInformation;
import weka.core.TechnicalInformationHandler;
import weka.core.Utils;
import weka.core.WeightedInstancesHandler;
import weka.core.Capabilities.Capability;
import weka.core.TechnicalInformation.Field;
import weka.core.TechnicalInformation.Type;

/**
 <!-- globalinfo-start -->
 * Class for building and using a multinomial Naive Bayes classifier. For more information see,<br/>
 * <br/>
 * Andrew Mccallum, Kamal Nigam: A Comparison of Event Models for Naive Bayes Text Classification. In: AAAI-98 Workshop on 'Learning for Text Categorization', 1998.<br/>
 * <br/>
 * The core equation for this classifier:<br/>
 * <br/>
 * P[Ci|D] = (P[D|Ci] x P[Ci]) / P[D] (Bayes rule)<br/>
 * <br/>
 * where Ci is class i and D is a document.
 * <p/>
 <!-- globalinfo-end -->
 *
 <!-- technical-bibtex-start -->
 * BibTeX:
 * <pre>
 * &#64;inproceedings{Mccallum1998,
 *    author = {Andrew Mccallum and Kamal Nigam},
 *    booktitle = {AAAI-98 Workshop on 'Learning for Text Categorization'},
 *    title = {A Comparison of Event Models for Naive Bayes Text Classification},
 *    year = {1998}
 * }
 * </pre>
 * <p/>
 <!-- technical-bibtex-end -->
 *
 <!-- options-start -->
 * Valid options are: <p/>
 *
 * <pre> -D
 *  If set, classifier is run in debug mode and
 *  may output additional info to the console</pre>
 *
 <!-- options-end -->
 *
 * @author Andrew Golightly (acg4@cs.waikato.ac.nz)
 * @author Bernhard Pfahringer (bernhard@cs.waikato.ac.nz)
 * @version $Revision: 1.15 $
 */
public class NaiveBayesMultinomial extends Classifier implements
        WeightedInstancesHandler, TechnicalInformationHandler {

    /** for serialization */
    static final long serialVersionUID = 5932177440181257085L;

    /**
     * probability that a word (w) exists in a class (H) (i.e. Pr[w|H]).
     * The matrix is in this format: probOfWordGivenClass[class][wordAttribute]
     * NOTE: the values are actually the log of Pr[w|H]
     */
    protected double[][] m_probOfWordGivenClass;

    /** the probability of a class (i.e. Pr[H]) */
    protected double[] m_probOfClass;

    /** number of attributes (the unique words plus the class attribute) */
    protected int m_numAttributes;

    /** number of class values */
    protected int m_numClasses;

    /** cache lnFactorial computations */
    protected double[] m_lnFactorialCache = new double[] { 0.0, 0.0 };

    /** copy of header information for use in toString method */
    protected Instances m_headerInfo;

    /**
     * Returns a string describing this classifier
     * @return a description of the classifier suitable for
     * displaying in the explorer/experimenter gui
     */
    public String globalInfo() {
        return "Class for building and using a multinomial Naive Bayes classifier. "
                + "For more information see,\n\n"
                + getTechnicalInformation().toString()
                + "\n\n"
                + "The core equation for this classifier:\n\n"
                + "P[Ci|D] = (P[D|Ci] x P[Ci]) / P[D] (Bayes rule)\n\n"
                + "where Ci is class i and D is a document.";
    }

    /**
     * Returns an instance of a TechnicalInformation object, containing
     * detailed information about the technical background of this class,
     * e.g., paper reference or book this class is based on.
     *
     * @return the technical information about this class
     */
    public TechnicalInformation getTechnicalInformation() {
        TechnicalInformation result;

        result = new TechnicalInformation(Type.INPROCEEDINGS);
        result.setValue(Field.AUTHOR, "Andrew Mccallum and Kamal Nigam");
        result.setValue(Field.YEAR, "1998");
        result.setValue(Field.TITLE,
                "A Comparison of Event Models for Naive Bayes Text Classification");
        result.setValue(Field.BOOKTITLE,
                "AAAI-98 Workshop on 'Learning for Text Categorization'");

        return result;
    }

143:
144:            /**
145:             * Returns default capabilities of the classifier.
146:             *
147:             * @return      the capabilities of this classifier
148:             */
149:            public Capabilities getCapabilities() {
150:                Capabilities result = super .getCapabilities();
151:
152:                // attributes
153:                result.enable(Capability.NUMERIC_ATTRIBUTES);
154:
155:                // class
156:                result.enable(Capability.NOMINAL_CLASS);
157:                result.enable(Capability.MISSING_CLASS_VALUES);
158:
159:                return result;
160:            }
161:
    /**
     * Generates the classifier.
     *
     * @param instances set of instances serving as training data
     * @throws Exception if the classifier has not been generated successfully
     */
    public void buildClassifier(Instances instances) throws Exception {
        // can classifier handle the data?
        getCapabilities().testWithFail(instances);

        // remove instances with missing class
        instances = new Instances(instances);
        instances.deleteWithMissingClass();

        m_headerInfo = new Instances(instances, 0);
        m_numClasses = instances.numClasses();
        m_numAttributes = instances.numAttributes();
        m_probOfWordGivenClass = new double[m_numClasses][];

        /*
          Initialise the matrix of word counts.
          NOTE: every count starts at 1 (Laplace estimator) so that a word which
          never appears for a class in the training set still has a non-zero
          probability if it appears for that class in the test set.
         */
        for (int c = 0; c < m_numClasses; c++) {
            m_probOfWordGivenClass[c] = new double[m_numAttributes];
            for (int att = 0; att < m_numAttributes; att++) {
                m_probOfWordGivenClass[c][att] = 1;
            }
        }

        // enumerate through the instances
        Instance instance;
        int classIndex;
        double numOccurrences;
        double[] docsPerClass = new double[m_numClasses];
        double[] wordsPerClass = new double[m_numClasses];

        java.util.Enumeration enumInsts = instances.enumerateInstances();
        while (enumInsts.hasMoreElements()) {
            instance = (Instance) enumInsts.nextElement();
            classIndex = (int) instance.value(instance.classIndex());
            docsPerClass[classIndex] += instance.weight();

            for (int a = 0; a < instance.numValues(); a++) {
                if (instance.index(a) != instance.classIndex()) {
                    // a is a position in the sparse representation, so the
                    // sparse accessor is used for the missing-value check
                    if (!instance.isMissingSparse(a)) {
                        numOccurrences = instance.valueSparse(a)
                                * instance.weight();
                        if (numOccurrences < 0)
                            throw new Exception(
                                    "Numeric attribute values must all be greater than or equal to zero.");
                        wordsPerClass[classIndex] += numOccurrences;
                        m_probOfWordGivenClass[classIndex][instance.index(a)] += numOccurrences;
                    }
                }
            }
        }

        /*
          Normalise the probOfWordGivenClass values and store the log of each.
          m_numAttributes also counts the class attribute, so the number of
          Laplace pseudo-counts added per class is m_numAttributes - 1.
         */
        for (int c = 0; c < m_numClasses; c++)
            for (int v = 0; v < m_numAttributes; v++)
                m_probOfWordGivenClass[c][v] = Math.log(m_probOfWordGivenClass[c][v]
                        / (wordsPerClass[c] + m_numAttributes - 1));

        /*
          Calculate Pr(H).
          NOTE: Laplace estimator introduced in case a class does not occur in
          the set of training instances.
         */
        final double numDocs = instances.sumOfWeights() + m_numClasses;
        m_probOfClass = new double[m_numClasses];
        for (int h = 0; h < m_numClasses; h++)
            m_probOfClass[h] = (double) (docsPerClass[h] + 1) / numDocs;
    }

    /**
     * Calculates the class membership probabilities for the given test
     * instance.
     *
     * @param instance the instance to be classified
     * @return predicted class probability distribution
     * @throws Exception if there is a problem generating the prediction
     */
    public double[] distributionForInstance(Instance instance)
            throws Exception {
        double[] probOfClassGivenDoc = new double[m_numClasses];

        // calculate the array of log(Pr[D|C])
        double[] logDocGivenClass = new double[m_numClasses];
        for (int h = 0; h < m_numClasses; h++)
            logDocGivenClass[h] = probOfDocGivenClass(instance, h);

        // subtract the largest log-likelihood before exponentiating so that
        // long documents do not underflow to zero for every class
        double max = logDocGivenClass[Utils.maxIndex(logDocGivenClass)];
        double probOfDoc = 0.0;

        for (int i = 0; i < m_numClasses; i++) {
            probOfClassGivenDoc[i] = Math.exp(logDocGivenClass[i] - max)
                    * m_probOfClass[i];
            probOfDoc += probOfClassGivenDoc[i];
        }

        Utils.normalize(probOfClassGivenDoc, probOfDoc);

        return probOfClassGivenDoc;
    }

    /**
     * The full multinomial log-likelihood is
     *
     *  log(N!) + (for all the words)(log(Pi^ni) - log(ni!))
     *
     *  where
     *      N is the total number of words
     *      Pi is the probability of obtaining word i
     *      ni is the number of times the word at index i occurs in the document
     *
     * This method returns only the sum of ni * log(Pi). The factorial terms
     * are the same for every class, so they cancel when the class
     * distribution is normalised and are left out.
     *
     * @param inst       The instance to be classified
     * @param classIndex The index of the class we are calculating the probability with respect to
     *
     * @return The log of the probability of the document occurring given the class,
     *         up to a class-independent constant
     */
    private double probOfDocGivenClass(Instance inst, int classIndex) {
        double answer = 0;
        //double totalWords = 0; //no need as we are not calculating the factorial at all.

        double freqOfWordInDoc; //should be double
        for (int i = 0; i < inst.numValues(); i++)
            if (inst.index(i) != inst.classIndex()) {
                freqOfWordInDoc = inst.valueSparse(i);
                //totalWords += freqOfWordInDoc;
                answer += (freqOfWordInDoc
                        * m_probOfWordGivenClass[classIndex][inst.index(i)]); //- lnFactorial(freqOfWordInDoc));
            }

        //answer += lnFactorial(totalWords);
        //The factorial terms don't make any difference to the
        //classifier's predictions, so they are not needed.

        return answer;
    }

    /**
     * Fast computation of ln(n!) for non-negative ints
     *
     * negative ints are passed on to the general gamma-function
     * based version in weka.core.SpecialFunctions
     *
     * if the current n value is higher than any previous one,
     * the cache is extended and filled to cover it
     *
     * the common case is reduced to a simple array lookup
     *
     * @param  n the integer
     * @return ln(n!)
     */
    public double lnFactorial(int n) {
        if (n < 0)
            return weka.core.SpecialFunctions.lnFactorial(n);

        if (m_lnFactorialCache.length <= n) {
            double[] tmp = new double[n + 1];
            System.arraycopy(m_lnFactorialCache, 0, tmp, 0,
                    m_lnFactorialCache.length);
            for (int i = m_lnFactorialCache.length; i < tmp.length; i++)
                tmp[i] = tmp[i - 1] + Math.log(i);
            m_lnFactorialCache = tmp;
        }

        return m_lnFactorialCache[n];
    }

    /**
     * Returns a string representation of the classifier.
     *
     * @return a string representation of the classifier
     */
    public String toString() {
        StringBuffer result = new StringBuffer(
                "The independent probability of a class\n--------------------------------------\n");

        for (int c = 0; c < m_numClasses; c++)
            result.append(m_headerInfo.classAttribute().value(c))
                    .append("\t")
                    .append(Double.toString(m_probOfClass[c]))
                    .append("\n");

        result.append("\nThe probability of a word given the class\n-----------------------------------------\n\t");

        for (int c = 0; c < m_numClasses; c++)
            result.append(m_headerInfo.classAttribute().value(c))
                    .append("\t");

        result.append("\n");

        for (int w = 0; w < m_numAttributes; w++) {
            result.append(m_headerInfo.attribute(w).name()).append("\t");
            for (int c = 0; c < m_numClasses; c++)
                result.append(Double.toString(Math.exp(m_probOfWordGivenClass[c][w])))
                        .append("\t");
            result.append("\n");
        }

        return result.toString();
    }

    /**
     * Main method for testing this class.
     *
     * @param argv the options
     */
    public static void main(String[] argv) {
        runClassifier(new NaiveBayesMultinomial(), argv);
    }
}
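
The training and prediction steps in the listing can be followed without Weka on a toy example. The sketch below is not part of Weka: the class `MultinomialNbSketch` and its methods `trainLogWordProbs`, `classPriors` and `posterior` are hypothetical names introduced for illustration, and it assumes dense count arrays with unit instance weights instead of Weka's `Instances`. It mirrors the Laplace-smoothed word probabilities, the Laplace-smoothed class priors, and the subtract-the-maximum step used before exponentiating the log-likelihoods.

```java
import java.util.Arrays;

/** Illustrative sketch only; not part of Weka. All names here are hypothetical. */
public class MultinomialNbSketch {

    /**
     * Returns log Pr[w|c] using an add-one (Laplace) estimator.
     * counts[d][w] is the frequency of word w in document d; labels[d] is its class.
     */
    public static double[][] trainLogWordProbs(double[][] counts, int[] labels, int numClasses) {
        int numWords = counts[0].length;
        double[][] logProb = new double[numClasses][numWords];
        double[] wordsPerClass = new double[numClasses];
        for (double[] row : logProb) {
            Arrays.fill(row, 1.0); // one pseudo-count per word and class
        }
        for (int d = 0; d < counts.length; d++) {
            for (int w = 0; w < numWords; w++) {
                logProb[labels[d]][w] += counts[d][w];
                wordsPerClass[labels[d]] += counts[d][w];
            }
        }
        // Here numWords excludes the class, so no "- 1" correction is needed.
        for (int c = 0; c < numClasses; c++) {
            for (int w = 0; w < numWords; w++) {
                logProb[c][w] = Math.log(logProb[c][w] / (wordsPerClass[c] + numWords));
            }
        }
        return logProb;
    }

    /** Returns Pr[c] with one pseudo-document per class. */
    public static double[] classPriors(int[] labels, int numClasses) {
        double[] prior = new double[numClasses];
        Arrays.fill(prior, 1.0);
        for (int label : labels) {
            prior[label] += 1.0;
        }
        for (int c = 0; c < numClasses; c++) {
            prior[c] /= labels.length + numClasses;
        }
        return prior;
    }

    /** Returns Pr[c|doc], dropping the class-independent factorial terms. */
    public static double[] posterior(double[][] logWordProbs, double[] priors, double[] doc) {
        int numClasses = priors.length;
        double[] logLik = new double[numClasses];
        double max = Double.NEGATIVE_INFINITY;
        for (int c = 0; c < numClasses; c++) {
            for (int w = 0; w < doc.length; w++) {
                logLik[c] += doc[w] * logWordProbs[c][w];
            }
            max = Math.max(max, logLik[c]);
        }
        double[] post = new double[numClasses];
        double sum = 0.0;
        for (int c = 0; c < numClasses; c++) {
            // Shifting by max keeps exp() away from underflow for long documents.
            post[c] = Math.exp(logLik[c] - max) * priors[c];
            sum += post[c];
        }
        for (int c = 0; c < numClasses; c++) {
            post[c] /= sum;
        }
        return post;
    }

    public static void main(String[] args) {
        double[][] counts = { { 3, 0, 1 }, { 0, 2, 1 } }; // two documents, three words
        int[] labels = { 0, 1 };
        double[][] logProb = trainLogWordProbs(counts, labels, 2);
        double[] priors = classPriors(labels, 2);
        double[] post = posterior(logProb, priors, new double[] { 2, 0, 0 });
        System.out.println(Arrays.toString(post));
    }
}
```

In this toy data the first word occurs only in class 0, so a document made up of that word alone receives the larger posterior for class 0, while the pseudo-counts keep the class 1 posterior above zero.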
