Source Code Cross Referenced for NormalizerDataReader.java in 6.0 JDK Modules sun » text » sun.text.normalizer



/*
 * Portions Copyright 2003-2006 Sun Microsystems, Inc.  All Rights Reserved.
 * DO NOT ALTER OR REMOVE COPYRIGHT NOTICES OR THIS FILE HEADER.
 *
 * This code is free software; you can redistribute it and/or modify it
 * under the terms of the GNU General Public License version 2 only, as
 * published by the Free Software Foundation.  Sun designates this
 * particular file as subject to the "Classpath" exception as provided
 * by Sun in the LICENSE file that accompanied this code.
 *
 * This code is distributed in the hope that it will be useful, but WITHOUT
 * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
 * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
 * version 2 for more details (a copy is included in the LICENSE file that
 * accompanied this code).
 *
 * You should have received a copy of the GNU General Public License version
 * 2 along with this work; if not, write to the Free Software Foundation,
 * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA.
 *
 * Please contact Sun Microsystems, Inc., 4150 Network Circle, Santa Clara,
 * CA 95054 USA or visit www.sun.com if you need additional information or
 * have any questions.
 */

/*
 *******************************************************************************
 * (C) Copyright IBM Corp. 1996-2005 - All Rights Reserved                     *
 *                                                                             *
 * The original version of this source code and documentation is copyrighted   *
 * and owned by IBM. These materials are provided under terms of a License     *
 * Agreement between IBM and Sun. This technology is protected by multiple     *
 * US and International patents. This notice and attribution to IBM may not    *
 * be removed.                                                                 *
 *******************************************************************************
 */

package sun.text.normalizer;

import java.io.DataInputStream;
import java.io.InputStream;
import java.io.IOException;

/**
 * @version     1.0
 * @author      Ram Viswanadha
 */

/*
 * Description of the format of unorm.icu version 2.1.
 *
 * Main change from version 1 to version 2:
 * Use of new, common Trie instead of normalization-specific tries.
 * Change to version 2.1: add third/auxiliary trie with associated data.
 *
 * For more details of how to use the data structures see the code
 * in unorm.cpp (runtime normalization code) and
 * in gennorm.c and gennorm/store.c (build-time data generation).
 *
 * For the serialized format of Trie see Trie.c/TrieHeader.
 *
 * - Overall partition
 *
 * unorm.icu customarily begins with a UDataInfo structure, see udata.h and .c.
 * After that there are the following structures:
 *
 * char indexes[INDEX_TOP];                 -- INDEX_TOP=32, see enum in this file
 *
 * Trie normTrie;                           -- size in bytes=indexes[INDEX_TRIE_SIZE]
 *
 * char extraData[extraDataTop];            -- extraDataTop=indexes[INDEX_UCHAR_COUNT]
 *                                             extraData[0] contains the number of units for
 *                                             FC_NFKC_Closure (formatVersion>=2.1)
 *
 * char combiningTable[combiningTableTop];  -- combiningTableTop=indexes[INDEX_COMBINE_DATA_COUNT]
 *                                             combiningTableTop may include one 16-bit padding unit
 *                                             to make sure that fcdTrie is 32-bit-aligned
 *
 * Trie fcdTrie;                            -- size in bytes=indexes[INDEX_FCD_TRIE_SIZE]
 *
 * Trie auxTrie;                            -- size in bytes=indexes[INDEX_AUX_TRIE_SIZE]
 *
 * The indexes array contains lengths and sizes of the following arrays and structures
 * as well as the following values:
 *  indexes[INDEX_COMBINE_FWD_COUNT]=combineFwdTop
 *      -- one more than the highest combining index computed for forward-only-combining characters
 *  indexes[INDEX_COMBINE_BOTH_COUNT]=combineBothTop-combineFwdTop
 *      -- number of combining indexes computed for both-ways-combining characters
 *  indexes[INDEX_COMBINE_BACK_COUNT]=combineBackTop-combineBothTop
 *      -- number of combining indexes computed for backward-only-combining characters
 *
 *  indexes[INDEX_MIN_NF*_NO_MAYBE] (where *={ C, D, KC, KD })
 *      -- first code point with a quick check NF* value of NO/MAYBE
 *
 * - Tries
 *
 * The main structures are two Trie tables ("compact arrays"),
 * each with one index array and one data array.
 * See Trie.h and Trie.c.
 *
 * - Tries in unorm.icu
 *
 * The first trie (normTrie above)
 * provides data for the NF* quick checks and normalization.
 * The second trie (fcdTrie above) provides data just for FCD checks.
 *
 * - norm32 data words from the first trie
 *
 * The norm32Table contains one 32-bit word "norm32" per code point.
 * It contains the following bit fields:
 * 31..16   extra data index, EXTRA_SHIFT is used to shift this field down
 *          if this index is <EXTRA_INDEX_TOP then it is an index into
 *              extraData[] where variable-length normalization data for this
 *              code point is found
 *          if this index is <EXTRA_INDEX_TOP+EXTRA_SURROGATE_TOP
 *              then this is a norm32 for a leading surrogate, and the index
 *              value is used together with the following trailing surrogate
 *              code unit in the second trie access
 *          if this index is >=EXTRA_INDEX_TOP+EXTRA_SURROGATE_TOP
 *              then this is a norm32 for a "special" character,
 *              i.e., the character is a Hangul syllable or a Jamo
 *              see EXTRA_HANGUL etc.
 *          generally, instead of extracting this index from the norm32 and
 *              comparing it with the above constants,
 *              the normalization code compares the entire norm32 value
 *              with MIN_SPECIAL, SURROGATES_TOP, MIN_HANGUL etc.
 *
 * 15..8    combining class (cc) according to UnicodeData.txt
 *
 *  7..6    COMBINES_ANY flags, used in composition to see if a character
 *              combines with any following or preceding character(s)
 *              at all
 *     7    COMBINES_BACK
 *     6    COMBINES_FWD
 *
 *  5..0    quick check flags, set for "no" or "maybe", with separate flags for
 *              each normalization form
 *              the higher bits are "maybe" flags; for NF*D there are no such flags
 *              the lower bits are "no" flags for all forms, in the same order
 *              as the "maybe" flags,
 *              which is (MSB to LSB): NFKD NFD NFKC NFC
 *  5..4    QC_ANY_MAYBE
 *  3..0    QC_ANY_NO
 *              see further related constants
 *
 * - Extra data per code point
 *
 * "Extra data" is referenced by the index in norm32.
 * It is variable-length data. It is only present, and only those parts
 * of it are, as needed for a given character.
 * The norm32 extra data index is added to the beginning of extraData[]
 * to get to a vector of 16-bit words with data at the following offsets:
 *
 * [-1]     Combining index for composition.
 *              Stored only if norm32&COMBINES_ANY .
 * [0]      Lengths of the canonical and compatibility decomposition strings.
 *              Stored only if there are decompositions, i.e.,
 *              if norm32&(QC_NFD|QC_NFKD)
 *          High byte: length of NFKD, or 0 if none
 *          Low byte: length of NFD, or 0 if none
 *          Each length byte also has another flag:
 *              Bit 7 of a length byte is set if there are non-zero
 *              combining classes (cc's) associated with the respective
 *              decomposition. If this flag is set, then the decomposition
 *              is preceded by a 16-bit word that contains the
 *              leading and trailing cc's.
 *              Bits 6..0 of a length byte are the length of the
 *              decomposition string, not counting the cc word.
 * [1..n]   NFD
 * [n+1..]  NFKD
 *
 * Each of the two decompositions consists of up to two parts:
 * - The 16-bit word with the leading and trailing cc's.
 *   This is only stored if bit 7 of the corresponding length byte
 *   is set. In this case, at least one of the cc's is not zero.
 *   High byte: leading cc==cc of the first code point in the decomposition string
 *   Low byte: trailing cc==cc of the last code point in the decomposition string
 * - The decomposition string in UTF-16, with length code units.
 *
 * - Combining indexes and combiningTable[]
 *
 * Combining indexes are stored at the [-1] offset of the extra data
 * if the character combines forward or backward with any other characters.
 * They are used for (re)composition in NF*C.
 * Values of combining indexes are arranged according to whether a character
 * combines forward, backward, or both ways:
 *    forward-only < both ways < backward-only
 *
 * The index values for forward-only and both-ways combining characters
 * are indexes into the combiningTable[].
 * The index values for backward-only combining characters are simply
 * incremented from the preceding index values to be unique.
 *
 * In the combiningTable[], a variable-length list
 * of variable-length (back-index, code point) pair entries is stored
 * for each forward-combining character.
 *
 * These back-indexes are the combining indexes of both-ways or backward-only
 * combining characters that the forward-combining character combines with.
 *
 * Each list is sorted in ascending order of back-indexes.
 * Each list is terminated with the last back-index having bit 15 set.
 *
 * Each pair (back-index, code point) takes up either 2 or 3
 * 16-bit words.
 * The first word of a list entry is the back-index, with its bit 15 set if
 * this is the last pair in the list.
 *
 * The second word contains flags in bits 15..13 that determine
 * if there is a third word and how the combined character is encoded:
 * 15   set if there is a third word in this list entry
 * 14   set if the result is a supplementary character
 * 13   set if the result itself combines forward
 *
 * According to these bits 15..14 of the second word,
 * the result character is encoded as follows:
 * 00 or 01 The result is <=0x1fff and stored in bits 12..0 of
 *          the second word.
 * 10       The result is 0x2000..0xffff and stored in the third word.
 *          Bits 12..0 of the second word are not used.
 * 11       The result is a supplementary character.
 *          Bits 9..0 of the leading surrogate are in bits 9..0 of
 *          the second word.
 *          Add 0xd800 to these bits to get the complete surrogate.
 *          Bits 12..10 of the second word are not used.
 *          The trailing surrogate is stored in the third word.
 *
 * - FCD trie
 *
 * The FCD trie is very simple.
 * It is a folded trie with 16-bit data words.
 * In each word, the high byte contains the leading cc of the character,
 * and the low byte contains the trailing cc of the character.
 * These cc's are the cc's of the first and last code points in the
 * canonical decomposition of the character.
 *
 * Since all 16 bits are used for cc's, lead surrogates must be tested
 * by checking the code unit instead of the trie data.
 * This is done only if the 16-bit data word is not zero.
 * If the code unit is a leading surrogate and the data word is not zero,
 * then instead of cc's it contains the offset for the second trie lookup.
 *
 * - Auxiliary trie and data
 *
 * The auxiliary 16-bit trie contains data for additional properties.
 * Bits
 * 15..13   reserved
 *     12   not NFC_Skippable (f) (formatVersion>=2.2)
 *     11   flag: not a safe starter for canonical closure
 *     10   composition exclusion
 *  9.. 0   index into extraData[] to FC_NFKC_Closure string
 *          (not for lead surrogate),
 *          or lead surrogate offset (for lead surrogate, if 9..0 not zero)
 *
 * Conditions for "NF* Skippable" from Mark Davis' com.ibm.text.UCD.NFSkippable:
 * (used in NormalizerTransliterator)
 *
 * A skippable character is
 * a) unassigned, or ALL of the following:
 * b) of combining class 0.
 * c) not decomposed by this normalization form.
 * AND if NFC or NFKC,
 * d) can never compose with a previous character.
 * e) can never compose with a following character.
 * f) can never change if another character is added.
 *    Example: a-breve might satisfy all but f, but if you
 *    add an ogonek it changes to a-ogonek + breve
 *
 * a)..e) must be tested from norm32.
 * Since f) is more complicated, the (not-)NFC_Skippable flag (f) is built
 * into the auxiliary trie.
 * The same bit is used for NFC and NFKC; (c) differs for them.
 * As usual, we build the "not skippable" flags so that unassigned
 * code points get a 0 bit.
 * This bit is only valid after (a)..(e) test FALSE; test NFD_NO before (f) as well.
 * Test Hangul LV syllables entirely in code.
 *
 * - FC_NFKC_Closure strings in extraData[]
 *
 * Strings are either stored as a single code unit or as the length
 * followed by that many units.
 */
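The norm32 bit layout described in the comment above can be made concrete with a small, self-contained sketch. The class, constant names, and the synthetic test word below are illustrative assumptions that mirror the comment; they are not the actual constants or values defined in NormalizerImpl.

```java
// Hypothetical decoder for the norm32 bit fields documented above.
// Constant names follow the comment; the values are assumptions for
// illustration, not those used by sun.text.normalizer.NormalizerImpl.
public class Norm32Demo {
    static final int EXTRA_SHIFT = 16;      // bits 31..16: extra data index
    static final int CC_SHIFT = 8;          // bits 15..8: combining class
    static final int COMBINES_BACK = 0x80;  // bit 7
    static final int COMBINES_FWD = 0x40;   // bit 6
    static final int QC_ANY_MAYBE = 0x30;   // bits 5..4: "maybe" flags
    static final int QC_ANY_NO = 0x0F;      // bits 3..0: "no" flags

    static int extraIndex(int norm32) {
        return norm32 >>> EXTRA_SHIFT;
    }

    static int combiningClass(int norm32) {
        return (norm32 >> CC_SHIFT) & 0xFF;
    }

    static boolean combinesAny(int norm32) {
        return (norm32 & (COMBINES_BACK | COMBINES_FWD)) != 0;
    }

    static boolean quickCheckNoOrMaybe(int norm32) {
        return (norm32 & (QC_ANY_MAYBE | QC_ANY_NO)) != 0;
    }

    public static void main(String[] args) {
        // Synthetic word: extra index 0x1234, cc 230, COMBINES_FWD,
        // and one "maybe" quick check flag set.
        int norm32 = (0x1234 << 16) | (230 << 8) | COMBINES_FWD | 0x10;
        System.out.println(extraIndex(norm32));          // 4660 (0x1234)
        System.out.println(combiningClass(norm32));      // 230
        System.out.println(combinesAny(norm32));         // true
        System.out.println(quickCheckNoOrMaybe(norm32)); // true
    }
}
```

The point of the sketch is that all four fields pack into one `int`, so the runtime can classify a code point with a single trie lookup plus masking.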
final class NormalizerDataReader implements ICUBinary.Authenticate {

    /**
     * <p>Protected constructor.</p>
     * @param inputStream ICU normalizer data (unorm.icu) input stream
     * @exception IOException thrown if the data file fails authentication
     * @draft 2.1
     */
    protected NormalizerDataReader(InputStream inputStream)
            throws IOException {

        unicodeVersion = ICUBinary.readHeader(inputStream,
                DATA_FORMAT_ID, this);
        dataInputStream = new DataInputStream(inputStream);
    }

    // protected methods -------------------------------------------------

    protected int[] readIndexes(int length) throws IOException {
        int[] indexes = new int[length];
        // Read the indexes
        for (int i = 0; i < length; i++) {
            indexes[i] = dataInputStream.readInt();
        }
        return indexes;
    }

    /**
     * <p>Reads unorm.icu and parses it into blocks of data to be stored in
     * NormalizerImpl.</p>
     * @param normBytes
     * @param fcdBytes
     * @param auxBytes
     * @param extraData
     * @param combiningTable
     * @exception IOException thrown when data reading fails
     * @draft 2.1
     */
    protected void read(byte[] normBytes, byte[] fcdBytes,
            byte[] auxBytes, char[] extraData, char[] combiningTable)
            throws IOException {

        // Read the bytes that make up the normTrie.
        // readFully is used because InputStream.read(byte[]) may return
        // before the whole array has been filled.
        dataInputStream.readFully(normBytes);

        // Read the extra data
        for (int i = 0; i < extraData.length; i++) {
            extraData[i] = dataInputStream.readChar();
        }

        // Read the combining class table
        for (int i = 0; i < combiningTable.length; i++) {
            combiningTable[i] = dataInputStream.readChar();
        }

        // Read the fcdTrie
        dataInputStream.readFully(fcdBytes);

        // Read the auxTrie
        dataInputStream.readFully(auxBytes);
    }

    public byte[] getDataFormatVersion() {
        return DATA_FORMAT_VERSION;
    }

    public boolean isDataVersionAcceptable(byte version[]) {
        // version[1] (the minor version) is not checked
        return version[0] == DATA_FORMAT_VERSION[0]
                && version[2] == DATA_FORMAT_VERSION[2]
                && version[3] == DATA_FORMAT_VERSION[3];
    }

    public byte[] getUnicodeVersion() {
        return unicodeVersion;
    }

    // private data members -------------------------------------------------

    /**
     * ICU data file input stream
     */
    private DataInputStream dataInputStream;

    private byte[] unicodeVersion;

    /**
     * File format version that this class understands.
     * No guarantees are made if an older version is used;
     * see store.c of gennorm for more information and values.
     */
    private static final byte DATA_FORMAT_ID[] = { (byte) 0x4E,
            (byte) 0x6F, (byte) 0x72, (byte) 0x6D };
    private static final byte DATA_FORMAT_VERSION[] = { (byte) 0x2,
            (byte) 0x2, (byte) 0x5, (byte) 0x2 };

}
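The combiningTable[] pair encoding documented in the header comment is the least obvious part of the format. The following sketch decodes the result code point of one (back-index, code point) entry; the class name, helper, and test values are hypothetical and follow only the bit layout stated in the comment, not the actual NormalizerImpl code.

```java
// Hypothetical decoder for one combiningTable[] list entry, following the
// bit layout in the header comment of NormalizerDataReader.java.
// table[i] is the back-index word; the result encoding starts at table[i+1].
public class CombiningPairDemo {
    // Returns the combined result code point of the entry starting at i.
    static int decodeResult(char[] table, int i) {
        char second = table[i + 1];
        boolean hasThirdWord = (second & 0x8000) != 0;   // bit 15
        boolean supplementary = (second & 0x4000) != 0;  // bit 14
        // bit 13 ("result itself combines forward") does not affect decoding

        if (!hasThirdWord) {
            // bits 15..14 are 00 or 01: result <= 0x1fff in bits 12..0
            return second & 0x1FFF;
        }
        char third = table[i + 2];
        if (!supplementary) {
            // bits 15..14 are 10: BMP result 0x2000..0xffff in the third word
            return third;
        }
        // bits 15..14 are 11: supplementary result; bits 9..0 of the lead
        // surrogate are in bits 9..0 of the second word (add 0xd800),
        // and the trail surrogate is stored in the third word
        char lead = (char) (0xD800 + (second & 0x3FF));
        return Character.toCodePoint(lead, third);
    }

    public static void main(String[] args) {
        // 2-word entry: result 'A' (0x41) in bits 12..0 of the second word
        System.out.println(decodeResult(new char[] { 0, 0x0041 }, 0));
        // 3-word BMP entry: result 0x3042 stored in the third word
        System.out.println(decodeResult(new char[] { 0, (char) 0x8000, 0x3042 }, 0));
        // 3-word supplementary entry: U+10400 (lead 0xD801, trail 0xDC00)
        System.out.println(decodeResult(new char[] { 0, (char) 0xC001, 0xDC00 }, 0));
    }
}
```

The variable-width encoding keeps the common case (a BMP result at or below 0x1fff) at two 16-bit words per pair, paying for a third word only when the result is large or supplementary.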