Source Code Cross Referenced for UnicodeSet.java in 6.0 JDK Modules sun » sun.text.normalizer

0001:        /*
0002:         * Portions Copyright 2005-2006 Sun Microsystems, Inc.  All Rights Reserved.
0003:         * DO NOT ALTER OR REMOVE COPYRIGHT NOTICES OR THIS FILE HEADER.
0004:         *
0005:         * This code is free software; you can redistribute it and/or modify it
0006:         * under the terms of the GNU General Public License version 2 only, as
0007:         * published by the Free Software Foundation.  Sun designates this
0008:         * particular file as subject to the "Classpath" exception as provided
0009:         * by Sun in the LICENSE file that accompanied this code.
0010:         *
0011:         * This code is distributed in the hope that it will be useful, but WITHOUT
0012:         * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
0013:         * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
0014:         * version 2 for more details (a copy is included in the LICENSE file that
0015:         * accompanied this code).
0016:         *
0017:         * You should have received a copy of the GNU General Public License version
0018:         * 2 along with this work; if not, write to the Free Software Foundation,
0019:         * Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA.
0020:         *
0021:         * Please contact Sun Microsystems, Inc., 4150 Network Circle, Santa Clara,
0022:         * CA 95054 USA or visit www.sun.com if you need additional information or
0023:         * have any questions.
0024:         */
0025:
0026:        /*
0027:         *******************************************************************************
0028:         * (C) Copyright IBM Corp. 1996-2005 - All Rights Reserved                     *
0029:         *                                                                             *
0030:         * The original version of this source code and documentation is copyrighted   *
0031:         * and owned by IBM, These materials are provided under terms of a License     *
0032:         * Agreement between IBM and Sun. This technology is protected by multiple     *
0033:         * US and International patents. This notice and attribution to IBM may not    *
0034:         * to removed.                                                                 *
0035:         *******************************************************************************
0036:         */
0037:
0038:        package sun.text.normalizer;
0039:
0040:        import java.text.ParsePosition;
0041:        import java.util.Map;
0042:        import java.util.HashMap;
0043:        import java.util.TreeSet;
0044:        import java.util.Iterator;
0045:        import java.util.Collection;
0046:
0047:        /**
0048:         * A mutable set of Unicode characters and multicharacter strings.  Objects of this class
0049:         * represent <em>character classes</em> used in regular expressions.
0050:         * A character class specifies a subset of Unicode code points.  Legal
0051:         * code points are U+0000 to U+10FFFF, inclusive.
0052:         *
0053:         * <p>The UnicodeSet class is not designed to be subclassed.
0054:         *
0055:         * <p><code>UnicodeSet</code> supports two APIs. The first is the
0056:         * <em>operand</em> API that allows the caller to modify the value of
0057:         * a <code>UnicodeSet</code> object. It conforms to Java 2's
0058:         * <code>java.util.Set</code> interface, although
0059:         * <code>UnicodeSet</code> does not actually implement that
0060:         * interface. All methods of <code>Set</code> are supported, with the
0061:         * modification that they take a character range or single character
0062:         * instead of an <code>Object</code>, and they take a
0063:         * <code>UnicodeSet</code> instead of a <code>Collection</code>.  The
0064:         * operand API may be thought of in terms of boolean logic: a boolean
0065:         * OR is implemented by <code>add</code>, a boolean AND is implemented
0066:         * by <code>retain</code>, a boolean XOR is implemented by
0067:         * <code>complement</code> taking an argument, and a boolean NOT is
0068:         * implemented by <code>complement</code> with no argument.  In terms
0069:         * of traditional set theory function names, <code>add</code> is a
0070:         * union, <code>retain</code> is an intersection, <code>remove</code>
0071:         * is an asymmetric difference, and <code>complement</code> with no
0072:         * argument is a set complement with respect to the superset range
0073:         * <code>MIN_VALUE-MAX_VALUE</code>.
0074:         *
0075:         * <p>The second API is the
0076:         * <code>applyPattern()</code>/<code>toPattern()</code> API from the
0077:         * <code>java.text.Format</code>-derived classes.  Unlike the
0078:         * methods that add characters, add categories, and control the logic
0079:         * of the set, the method <code>applyPattern()</code> sets all
0080:         * attributes of a <code>UnicodeSet</code> at once, based on a
0081:         * string pattern.
0082:         *
0083:         * <p><b>Pattern syntax</b></p>
0084:         *
0085:         * Patterns are accepted by the constructors and the
0086:         * <code>applyPattern()</code> methods and returned by the
0087:         * <code>toPattern()</code> method.  These patterns follow a syntax
0088:         * similar to that employed by version 8 regular expression character
0089:         * classes.  Here are some simple examples:
0090:         *
0091:         * <blockquote>
0092:         *   <table>
0093:         *     <tr align="top">
0094:         *       <td nowrap valign="top" align="left"><code>[]</code></td>
0095:         *       <td valign="top">No characters</td>
0096:         *     </tr><tr align="top">
0097:         *       <td nowrap valign="top" align="left"><code>[a]</code></td>
0098:         *       <td valign="top">The character 'a'</td>
0099:         *     </tr><tr align="top">
0100:         *       <td nowrap valign="top" align="left"><code>[ae]</code></td>
0101:         *       <td valign="top">The characters 'a' and 'e'</td>
0102:         *     </tr>
0103:         *     <tr>
0104:         *       <td nowrap valign="top" align="left"><code>[a-e]</code></td>
0105:         *       <td valign="top">The characters 'a' through 'e' inclusive, in Unicode code
0106:         *       point order</td>
0107:         *     </tr>
0108:         *     <tr>
0109:         *       <td nowrap valign="top" align="left"><code>[\\u4E01]</code></td>
0110:         *       <td valign="top">The character U+4E01</td>
0111:         *     </tr>
0112:         *     <tr>
0113:         *       <td nowrap valign="top" align="left"><code>[a{ab}{ac}]</code></td>
0114:         *       <td valign="top">The character 'a' and the multicharacter strings &quot;ab&quot; and
0115:         *       &quot;ac&quot;</td>
0116:         *     </tr>
0117:         *     <tr>
0118:         *       <td nowrap valign="top" align="left"><code>[\p{Lu}]</code></td>
0119:         *       <td valign="top">All characters in the general category Uppercase Letter</td>
0120:         *     </tr>
0121:         *   </table>
0122:         * </blockquote>
0123:         *
0124:         * Any character may be preceded by a backslash in order to remove any special
0125:         * meaning.  White space characters, as defined by UCharacterProperty.isRuleWhiteSpace(), are
0126:         * ignored, unless they are escaped.
0127:         *
0128:         * <p>Property patterns specify a set of characters having a certain
0129:         * property as defined by the Unicode standard.  Both the POSIX-like
0130:         * "[:Lu:]" and the Perl-like syntax "\p{Lu}" are recognized.  For a
0131:         * complete list of supported property patterns, see the User's Guide
0132:         * for UnicodeSet at
0133:         * <a href="http://oss.software.ibm.com/icu/userguide/unicodeSet.html">
0134:         * http://oss.software.ibm.com/icu/userguide/unicodeSet.html</a>.
0135:         * Actual determination of property data is defined by the underlying
0136:         * Unicode database as implemented by UCharacter.
0137:         *
0138:         * <p>Patterns specify individual characters, ranges of characters, and
0139:         * Unicode property sets.  When elements are concatenated, they
0140:         * specify their union.  To complement a set, place a '^' immediately
0141:         * after the opening '['.  Property patterns are inverted by modifying
0142:         * their delimiters: "[:^foo:]" and "\P{foo}".  In any other location,
0143:         * '^' has no special meaning.
0144:         *
0145:         * <p>Ranges are indicated by placing a '-' between two
0146:         * characters, as in "a-z".  This specifies the range of all
0147:         * characters from the left to the right, in Unicode order.  If the
0148:         * left character is greater than or equal to the
0149:         * right character it is a syntax error.  If a '-' occurs as the first
0150:         * character after the opening '[' or '[^', or if it occurs as the
0151:         * last character before the closing ']', then it is taken as a
0152:         * literal.  Thus "[a\\-b]", "[-ab]", and "[ab-]" all indicate the same
0153:         * set of three characters, 'a', 'b', and '-'.
0154:         *
0155:         * <p>Sets may be intersected using the '&' operator or the asymmetric
0156:         * set difference may be taken using the '-' operator, for example,
0157:         * "[[:L:]&[\\u0000-\\u0FFF]]" indicates the set of all Unicode letters
0158:         * with values less than 4096.  Operators ('&' and '-') have equal
0159:         * precedence and bind left-to-right.  Thus
0160:         * "[[:L:]-[a-z]-[\\u0100-\\u01FF]]" is equivalent to
0161:         * "[[[:L:]-[a-z]]-[\\u0100-\\u01FF]]".  This only really matters for
0162:         * difference; intersection is commutative.
0163:         *
0164:         * <table>
0165:         * <tr valign=top><td nowrap><code>[a]</code><td>The set containing 'a'
0166:         * <tr valign=top><td nowrap><code>[a-z]</code><td>The set containing 'a'
0167:         * through 'z' and all letters in between, in Unicode order
0168:         * <tr valign=top><td nowrap><code>[^a-z]</code><td>The set containing
0169:         * all characters but 'a' through 'z',
0170:         * that is, U+0000 through 'a'-1 and 'z'+1 through U+10FFFF
0171:         * <tr valign=top><td nowrap><code>[[<em>pat1</em>][<em>pat2</em>]]</code>
0172:         * <td>The union of sets specified by <em>pat1</em> and <em>pat2</em>
0173:         * <tr valign=top><td nowrap><code>[[<em>pat1</em>]&[<em>pat2</em>]]</code>
0174:         * <td>The intersection of sets specified by <em>pat1</em> and <em>pat2</em>
0175:         * <tr valign=top><td nowrap><code>[[<em>pat1</em>]-[<em>pat2</em>]]</code>
0176:         * <td>The asymmetric difference of sets specified by <em>pat1</em> and
0177:         * <em>pat2</em>
0178:         * <tr valign=top><td nowrap><code>[:Lu:] or \p{Lu}</code>
0179:         * <td>The set of characters having the specified
0180:         * Unicode property; in
0181:         * this case, Unicode uppercase letters
0182:         * <tr valign=top><td nowrap><code>[:^Lu:] or \P{Lu}</code>
0183:         * <td>The set of characters <em>not</em> having the given
0184:         * Unicode property
0185:         * </table>
0186:         *
0187:         * <p><b>Warning</b>: you cannot add an empty string ("") to a UnicodeSet.</p>
0188:         *
0189:         * <p><b>Formal syntax</b></p>
0190:         *
0191:         * <blockquote>
0192:         *   <table>
0193:         *     <tr align="top">
0194:         *       <td nowrap valign="top" align="right"><code>pattern :=&nbsp; </code></td>
0195:         *       <td valign="top"><code>('[' '^'? item* ']') |
0196:         *       property</code></td>
0197:         *     </tr>
0198:         *     <tr align="top">
0199:         *       <td nowrap valign="top" align="right"><code>item :=&nbsp; </code></td>
0200:         *       <td valign="top"><code>char | (char '-' char) | pattern-expr<br>
0201:         *       </code></td>
0202:         *     </tr>
0203:         *     <tr align="top">
0204:         *       <td nowrap valign="top" align="right"><code>pattern-expr :=&nbsp; </code></td>
0205:         *       <td valign="top"><code>pattern | pattern-expr pattern |
0206:         *       pattern-expr op pattern<br>
0207:         *       </code></td>
0208:         *     </tr>
0209:         *     <tr align="top">
0210:         *       <td nowrap valign="top" align="right"><code>op :=&nbsp; </code></td>
0211:         *       <td valign="top"><code>'&amp;' | '-'<br>
0212:         *       </code></td>
0213:         *     </tr>
0214:         *     <tr align="top">
0215:         *       <td nowrap valign="top" align="right"><code>special :=&nbsp; </code></td>
0216:         *       <td valign="top"><code>'[' | ']' | '-'<br>
0217:         *       </code></td>
0218:         *     </tr>
0219:         *     <tr align="top">
0220:         *       <td nowrap valign="top" align="right"><code>char :=&nbsp; </code></td>
0221:         *       <td valign="top"><em>any character that is not</em><code> special<br>
0222:         *       | ('\\' </code><em>any character</em><code>)<br>
0223:         *       | ('&#92;u' hex hex hex hex)<br>
0224:         *       </code></td>
0225:         *     </tr>
0226:         *     <tr align="top">
0227:         *       <td nowrap valign="top" align="right"><code>hex :=&nbsp; </code></td>
0228:         *       <td valign="top"><em>any character for which
0229:         *       </em><code>Character.digit(c, 16)</code><em>
0230:         *       returns a non-negative result</em></td>
0231:         *     </tr>
0232:         *     <tr>
0233:         *       <td nowrap valign="top" align="right"><code>property :=&nbsp; </code></td>
0234:         *       <td valign="top"><em>a Unicode property set pattern</em></td>
0235:         *     </tr>
0236:         *   </table>
0237:         *   <br>
0238:         *   <table border="1">
0239:         *     <tr>
0240:         *       <td>Legend: <table>
0241:         *         <tr>
0242:         *           <td nowrap valign="top"><code>a := b</code></td>
0243:         *           <td width="20" valign="top">&nbsp; </td>
0244:         *           <td valign="top"><code>a</code> may be replaced by <code>b</code> </td>
0245:         *         </tr>
0246:         *         <tr>
0247:         *           <td nowrap valign="top"><code>a?</code></td>
0248:         *           <td valign="top"></td>
0249:         *           <td valign="top">zero or one instance of <code>a</code><br>
0250:         *           </td>
0251:         *         </tr>
0252:         *         <tr>
0253:         *           <td nowrap valign="top"><code>a*</code></td>
0254:         *           <td valign="top"></td>
0255:         *           <td valign="top">zero or more instances of <code>a</code><br>
0256:         *           </td>
0257:         *         </tr>
0258:         *         <tr>
0259:         *           <td nowrap valign="top"><code>a | b</code></td>
0260:         *           <td valign="top"></td>
0261:         *           <td valign="top">either <code>a</code> or <code>b</code><br>
0262:         *           </td>
0263:         *         </tr>
0264:         *         <tr>
0265:         *           <td nowrap valign="top"><code>'a'</code></td>
0266:         *           <td valign="top"></td>
0267:         *           <td valign="top">the literal string between the quotes </td>
0268:         *         </tr>
0269:         *       </table>
0270:         *       </td>
0271:         *     </tr>
0272:         *   </table>
0273:         * </blockquote>
0274:         *
0275:         * @author Alan Liu
0276:         * @stable ICU 2.0
0277:         */
0278:        public class UnicodeSet implements UnicodeMatcher {
0279:
0280:            private static final int LOW = 0x000000; // LOW <= all valid values. ZERO for codepoints
0281:            private static final int HIGH = 0x110000; // HIGH > all valid values. 0x10000 for code units,
0282:            // 0x110000 for code points
0283:
0284:            /**
0285:             * Minimum value that can be stored in a UnicodeSet.
0286:             * @stable ICU 2.0
0287:             */
0288:            public static final int MIN_VALUE = LOW;
0289:
0290:            /**
0291:             * Maximum value that can be stored in a UnicodeSet.
0292:             * @stable ICU 2.0
0293:             */
0294:            public static final int MAX_VALUE = HIGH - 1;
0295:
0296:            private int len; // length used; list may be longer to minimize reallocs
0297:            private int[] list; // MUST be terminated with HIGH
0298:            private int[] rangeList; // internal buffer
0299:            private int[] buffer; // internal buffer
0300:
0301:            // NOTE: normally the field should be of type SortedSet; but that is missing a public clone!!
0302:            // is not private so that UnicodeSetIterator can get access
0303:            TreeSet strings = new TreeSet();
0304:
0305:            /**
0306:             * The pattern representation of this set.  This may not be the
0307:             * most economical pattern.  It is the pattern supplied to
0308:             * applyPattern(), with variables substituted and whitespace
0309:             * removed.  For sets constructed without applyPattern(), or
0310:             * modified using the non-pattern API, this string will be null,
0311:             * indicating that toPattern() must generate a pattern
0312:             * representation from the inversion list.
0313:             */
0314:            private String pat = null;
0315:
0316:            private static final int START_EXTRA = 16; // initial storage. Must be >= 0
0317:            private static final int GROW_EXTRA = START_EXTRA; // extra amount for growth. Must be >= 0
0318:
0319:            /**
0320:             * A set of all characters _except_ the second through last characters of
0321:             * certain ranges.  These ranges are ranges of characters whose
0322:             * properties are all exactly alike, e.g. CJK Ideographs from
0323:             * U+4E00 to U+9FA5.
0324:             */
0325:            private static UnicodeSet INCLUSIONS = null;
0326:
0327:            //----------------------------------------------------------------
0328:            // Public API
0329:            //----------------------------------------------------------------
0330:
0331:            /**
0332:             * Constructs an empty set.
0333:             * @stable ICU 2.0
0334:             */
0335:            public UnicodeSet() {
0336:                list = new int[1 + START_EXTRA];
0337:                list[len++] = HIGH;
0338:            }
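
            // Illustration only, not part of the JDK source. The constructor above
            // stores the empty set as the single terminator HIGH. A minimal sketch of
            // how a list of that shape can answer membership queries follows. The
            // class name InversionListSketch and its helper signatures are
            // hypothetical, and the binary search is an assumption about how a
            // lookup like findCodePoint could be written, not the actual method.

```java
public class InversionListSketch {
    // Terminator: greater than every valid code point (U+0000..U+10FFFF).
    static final int HIGH = 0x110000;

    /**
     * Returns the smallest index i such that c < list[i].
     * Assumes list[0..len-1] is ascending, list[len-1] == HIGH, and c < HIGH.
     */
    static int findCodePoint(int[] list, int len, int c) {
        int lo = 0;
        int hi = len - 1;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (c < list[mid]) {
                hi = mid;      // the answer is at mid or to its left
            } else {
                lo = mid + 1;  // the answer is to the right of mid
            }
        }
        return lo;
    }

    /** An odd index means c falls inside a range; an even index means outside. */
    static boolean contains(int[] list, int len, int c) {
        return (findCodePoint(list, len, c) & 1) != 0;
    }

    public static void main(String[] args) {
        int[] empty = { HIGH };               // the empty set
        int[] lower = { 'a', 'z' + 1, HIGH }; // the set [a-z]
        System.out.println(contains(empty, 1, 'a'));
        System.out.println(contains(lower, 3, 'q'));
        System.out.println(contains(lower, 3, '{'));
    }
}
```

            // Each stored value marks a boundary where membership flips, so the
            // parity of the index found tells whether c is in the set.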
0339:
0340:            /**
0341:             * Constructs a set containing the given range. If <code>end &lt;
0342:             * start</code> then an empty set is created.
0343:             *
0344:             * @param start first character, inclusive, of range
0345:             * @param end last character, inclusive, of range
0346:             * @stable ICU 2.0
0347:             */
0348:            public UnicodeSet(int start, int end) {
0349:                this();
0350:                complement(start, end);
0351:            }
0352:
0353:            /**
0354:             * Constructs a set from the given pattern.  See the class description
0355:             * for the syntax of the pattern language.  Whitespace is ignored.
0356:             * @param pattern a string specifying what characters are in the set
0357:             * @exception java.lang.IllegalArgumentException if the pattern contains
0358:             * a syntax error.
0359:             * @stable ICU 2.0
0360:             */
0361:            public UnicodeSet(String pattern) {
0362:                this();
0363:                applyPattern(pattern, null, null, IGNORE_SPACE);
0364:            }
0365:
0366:            /**
0367:             * Make this object represent the same set as <code>other</code>.
0368:             * @param other a <code>UnicodeSet</code> whose value will be
0369:             * copied to this object
0370:             * @stable ICU 2.0
0371:             */
0372:            public UnicodeSet set(UnicodeSet other) {
0373:                list = (int[]) other.list.clone();
0374:                len = other.len;
0375:                pat = other.pat;
0376:                strings = (TreeSet) other.strings.clone();
0377:                return this;
0378:            }
0379:
0380:            /**
0381:             * Modifies this set to represent the set specified by the given pattern.
0382:             * See the class description for the syntax of the pattern language.
0383:             * Whitespace is ignored.
0384:             * @param pattern a string specifying what characters are in the set
0385:             * @exception java.lang.IllegalArgumentException if the pattern
0386:             * contains a syntax error.
0387:             * @stable ICU 2.0
0388:             */
0389:            public final UnicodeSet applyPattern(String pattern) {
0390:                return applyPattern(pattern, null, null, IGNORE_SPACE);
0391:            }
0392:
0393:            /**
0394:             * Append the <code>toPattern()</code> representation of a
0395:             * string to the given <code>StringBuffer</code>.
0396:             */
0397:            private static void _appendToPat(StringBuffer buf, String s,
0398:                    boolean escapeUnprintable) {
0399:                for (int i = 0; i < s.length(); i += UTF16.getCharCount(UTF16.charAt(s, i))) {
0400:                    _appendToPat(buf, UTF16.charAt(s, i), escapeUnprintable);
0401:                }
0402:            }
0403:
0404:            /**
0405:             * Append the <code>toPattern()</code> representation of a
0406:             * character to the given <code>StringBuffer</code>.
0407:             */
0408:            private static void _appendToPat(StringBuffer buf, int c,
0409:                    boolean escapeUnprintable) {
0410:                if (escapeUnprintable && Utility.isUnprintable(c)) {
0411:                    // Use hex escape notation (<backslash>uxxxx or <backslash>Uxxxxxxxx) for anything
0412:                    // unprintable
0413:                    if (Utility.escapeUnprintable(buf, c)) {
0414:                        return;
0415:                    }
0416:                }
0417:                // Escape characters that have special meaning in a pattern, including ':'
0418:                switch (c) {
0419:                case '[': // SET_OPEN:
0420:                case ']': // SET_CLOSE:
0421:                case '-': // HYPHEN:
0422:                case '^': // COMPLEMENT:
0423:                case '&': // INTERSECTION:
0424:                case '\\': //BACKSLASH:
0425:                case '{':
0426:                case '}':
0427:                case '$':
0428:                case ':':
0429:                    buf.append('\\');
0430:                    break;
0431:                default:
0432:                    // Escape whitespace
0433:                    if (UCharacterProperty.isRuleWhiteSpace(c)) {
0434:                        buf.append('\\');
0435:                    }
0436:                    break;
0437:                }
0438:                UTF16.append(buf, c);
0439:            }
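
            // Illustration only, not part of the JDK source. The escaping rule in the
            // method above can be tried in isolation with the sketch below. The class
            // name PatternEscapeSketch is hypothetical, and Character.isWhitespace is
            // used as a stand-in for UCharacterProperty.isRuleWhiteSpace, which may
            // classify some characters differently. Unprintable-character handling
            // is left out.

```java
public class PatternEscapeSketch {

    /** Appends code point c, preceded by a backslash if it is special in a pattern. */
    static void appendEscaped(StringBuilder buf, int c) {
        switch (c) {
        case '[':
        case ']':
        case '-':
        case '^':
        case '&':
        case '\\':
        case '{':
        case '}':
        case '$':
        case ':':
            buf.append('\\');
            break;
        default:
            // Assumption: stand-in for the rule-white-space test in the listing.
            if (Character.isWhitespace(c)) {
                buf.append('\\');
            }
            break;
        }
        buf.appendCodePoint(c);
    }

    /** Escapes a whole string, stepping by code point so surrogate pairs stay intact. */
    static String escape(String s) {
        StringBuilder buf = new StringBuilder();
        for (int i = 0; i < s.length();) {
            int cp = s.codePointAt(i);
            appendEscaped(buf, cp);
            i += Character.charCount(cp);
        }
        return buf.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("a-b"));
        System.out.println(escape("[x]"));
    }
}
```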
0440:
0441:            /**
0442:             * Append a string representation of this set to result.  This will be
0443:             * a cleaned version of the string passed to applyPattern(), if there
0444:             * is one.  Otherwise it will be generated.
0445:             */
0446:            private StringBuffer _toPattern(StringBuffer result,
0447:                    boolean escapeUnprintable) {
0448:                if (pat != null) {
0449:                    int i;
0450:                    int backslashCount = 0;
0451:                    for (i = 0; i < pat.length();) {
0452:                        int c = UTF16.charAt(pat, i);
0453:                        i += UTF16.getCharCount(c);
0454:                        if (escapeUnprintable && Utility.isUnprintable(c)) {
0455:                            // If the unprintable character is preceded by an odd
0456:                            // number of backslashes, then it has been escaped.
0457:                            // Before re-emitting it in hex escape notation, we
0458:                            // delete that final backslash.
0459:                            if ((backslashCount % 2) == 1) {
0460:                                result.setLength(result.length() - 1);
0461:                            }
0462:                            Utility.escapeUnprintable(result, c);
0463:                            backslashCount = 0;
0464:                        } else {
0465:                            UTF16.append(result, c);
0466:                            if (c == '\\') {
0467:                                ++backslashCount;
0468:                            } else {
0469:                                backslashCount = 0;
0470:                            }
0471:                        }
0472:                    }
0473:                    return result;
0474:                }
0475:
0476:                return _generatePattern(result, escapeUnprintable);
0477:            }
0478:
0479:            /**
0480:             * Generate and append a string representation of this set to result.
0481:             * This does not use this.pat, the cleaned up copy of the string
0482:             * passed to applyPattern().
0483:             * @stable ICU 2.0
0484:             */
0485:            public StringBuffer _generatePattern(StringBuffer result,
0486:                    boolean escapeUnprintable) {
0487:                result.append('[');
0488:
0489:                int count = getRangeCount();
0490:
0491:                // If the set contains at least 2 intervals and includes both
0492:                // MIN_VALUE and MAX_VALUE, then the inverse representation will
0493:                // be more economical.
0494:                if (count > 1 && getRangeStart(0) == MIN_VALUE
0495:                        && getRangeEnd(count - 1) == MAX_VALUE) {
0496:
0497:                    // Emit the inverse
0498:                    result.append('^');
0499:
0500:                    for (int i = 1; i < count; ++i) {
0501:                        int start = getRangeEnd(i - 1) + 1;
0502:                        int end = getRangeStart(i) - 1;
0503:                        _appendToPat(result, start, escapeUnprintable);
0504:                        if (start != end) {
0505:                            if ((start + 1) != end) {
0506:                                result.append('-');
0507:                            }
0508:                            _appendToPat(result, end, escapeUnprintable);
0509:                        }
0510:                    }
0511:                }
0512:
0513:                // Default; emit the ranges as pairs
0514:                else {
0515:                    for (int i = 0; i < count; ++i) {
0516:                        int start = getRangeStart(i);
0517:                        int end = getRangeEnd(i);
0518:                        _appendToPat(result, start, escapeUnprintable);
0519:                        if (start != end) {
0520:                            if ((start + 1) != end) {
0521:                                result.append('-');
0522:                            }
0523:                            _appendToPat(result, end, escapeUnprintable);
0524:                        }
0525:                    }
0526:                }
0527:
0528:                if (strings.size() > 0) {
0529:                    Iterator it = strings.iterator();
0530:                    while (it.hasNext()) {
0531:                        result.append('{');
0532:                        _appendToPat(result, (String) it.next(),
0533:                                escapeUnprintable);
0534:                        result.append('}');
0535:                    }
0536:                }
0537:                return result.append(']');
0538:            }
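
            // Illustration only, not part of the JDK source. The default branch above
            // emits each range as a pair and omits the '-' when the range holds
            // exactly two adjacent code points. A minimal sketch of that rule, assuming
            // the ranges arrive as a flat array of inclusive start/end pairs and that
            // no character needs escaping. The class name PatternFromRangesSketch is
            // hypothetical; the inverse ('^') form and multicharacter strings are
            // left out.

```java
public class PatternFromRangesSketch {

    /**
     * Builds a pattern from inclusive pairs {start0, end0, start1, end1, ...}.
     * Assumes the pairs are sorted, disjoint, and contain no special characters.
     */
    static String toPattern(int[] ranges) {
        StringBuilder result = new StringBuilder("[");
        for (int i = 0; i + 1 < ranges.length; i += 2) {
            int start = ranges[i];
            int end = ranges[i + 1];
            result.appendCodePoint(start);
            if (start != end) {
                // A two-element range such as x..y is written "xy", not "x-y".
                if (start + 1 != end) {
                    result.append('-');
                }
                result.appendCodePoint(end);
            }
        }
        return result.append(']').toString();
    }

    public static void main(String[] args) {
        int[] ranges = { 'a', 'c', 'q', 'q', 'x', 'y' };
        System.out.println(toPattern(ranges)); // prints [a-cqxy]
    }
}
```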
0539:
0540:            /**
0541:             * Adds the specified range to this set if it is not already
0542:             * present.  If this set already contains the specified range,
0543:             * the call leaves this set unchanged.  If <code>end &lt; start</code>
0544:             * then an empty range is added, leaving the set unchanged.
0545:             *
0546:             * @param start first character, inclusive, of range to be added
0547:             * to this set.
0548:             * @param end last character, inclusive, of range to be added
0549:             * to this set.
0550:             * @stable ICU 2.0
0551:             */
0552:            public UnicodeSet add(int start, int end) {
0553:                if (start < MIN_VALUE || start > MAX_VALUE) {
0554:                    throw new IllegalArgumentException("Invalid code point U+"
0555:                            + Utility.hex(start, 6));
0556:                }
0557:                if (end < MIN_VALUE || end > MAX_VALUE) {
0558:                    throw new IllegalArgumentException("Invalid code point U+"
0559:                            + Utility.hex(end, 6));
0560:                }
0561:                if (start < end) {
0562:                    add(range(start, end), 2, 0);
0563:                } else if (start == end) {
0564:                    add(start);
0565:                }
0566:                return this;
0567:            }
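
            // Illustration only, not part of the JDK source. The contract of the
            // range add above (validate both ends, ignore an empty range, otherwise
            // take the union) can be sketched with a plain interval merge. This is a
            // swapped-in technique over a list of inclusive {start, end} pairs, not
            // the inversion-list merge the class itself performs. The names
            // RangeAddSketch and addRange are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class RangeAddSketch {
    static final int MIN_VALUE = 0x000000;
    static final int MAX_VALUE = 0x10FFFF;

    /**
     * Returns a new list equal to the union of ranges and [start, end].
     * Assumes ranges is sorted, disjoint, and has no two adjacent entries.
     */
    static List<int[]> addRange(List<int[]> ranges, int start, int end) {
        if (start < MIN_VALUE || start > MAX_VALUE
                || end < MIN_VALUE || end > MAX_VALUE) {
            throw new IllegalArgumentException("Invalid code point");
        }
        List<int[]> out = new ArrayList<int[]>();
        if (end < start) {
            out.addAll(ranges); // empty range: the set is unchanged
            return out;
        }
        int lo = start;
        int hi = end;
        boolean placed = false;
        for (int[] r : ranges) {
            if (r[1] < lo - 1) {
                out.add(r);                      // entirely before, not touching
            } else if (r[0] > hi + 1) {
                if (!placed) {
                    out.add(new int[] { lo, hi });
                    placed = true;
                }
                out.add(r);                      // entirely after, not touching
            } else {
                lo = Math.min(lo, r[0]);         // overlaps or touches: absorb it
                hi = Math.max(hi, r[1]);
            }
        }
        if (!placed) {
            out.add(new int[] { lo, hi });
        }
        return out;
    }

    public static void main(String[] args) {
        List<int[]> set = new ArrayList<int[]>();
        set.add(new int[] { 'a', 'c' });
        List<int[]> merged = addRange(set, 'd', 'f'); // touches a..c, so they merge
        System.out.println(merged.size());
    }
}
```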
0568:
0569:            /**
0570:             * Adds the specified character to this set if it is not already
0571:             * present.  If this set already contains the specified character,
0572:             * the call leaves this set unchanged.
0573:             * @stable ICU 2.0
0574:             */
0575:            public final UnicodeSet add(int c) {
0576:                if (c < MIN_VALUE || c > MAX_VALUE) {
0577:                    throw new IllegalArgumentException("Invalid code point U+"
0578:                            + Utility.hex(c, 6));
0579:                }
0580:
0581:                // find smallest i such that c < list[i]
0582:                // if odd, then it is IN the set
0583:                // if even, then it is OUT of the set
0584:                int i = findCodePoint(c);
0585:
0586:                // already in set?
0587:                if ((i & 1) != 0)
0588:                    return this;
0589:
0590:                // HIGH is 0x110000
0591:                // assert(list[len-1] == HIGH);
0592:
0593:                // empty = [HIGH]
0594:                // [start_0, limit_0, start_1, limit_1, HIGH]
0595:
0596:                // [..., start_k-1, limit_k-1, start_k, limit_k, ..., HIGH]
0597:                //                             ^
0598:                //                             list[i]
0599:
0600:                // i == 0 means c is before the first range
0601:
0602:                if (c == list[i] - 1) {
0603:                    // c is before start of next range
0604:                    list[i] = c;
0605:                    // if we touched the HIGH mark, then add a new one
0606:                    if (c == MAX_VALUE) {
0607:                        ensureCapacity(len + 1);
0608:                        list[len++] = HIGH;
0609:                    }
0610:                    if (i > 0 && c == list[i - 1]) {
0611:                        // collapse adjacent ranges
0612:
0613:                        // [..., start_k-1, c, c, limit_k, ..., HIGH]
0614:                        //                     ^
0615:                        //                     list[i]
0616:                        System.arraycopy(list, i + 1, list, i - 1, len - i - 1);
0617:                        len -= 2;
0618:                    }
0619:                }
0620:
0621:                else if (i > 0 && c == list[i - 1]) {
0622:                    // c is after end of prior range
0623:                    list[i - 1]++;
0624:                    // no need to check for collapse here
0625:                }
0626:
0627:                else {
0628:                    // At this point we know the new char is not adjacent to
0629:                    // any existing ranges, and it is not 10FFFF.
0630:
0631:                    // [..., start_k-1, limit_k-1, start_k, limit_k, ..., HIGH]
0632:                    //                             ^
0633:                    //                             list[i]
0634:
0635:                    // [..., start_k-1, limit_k-1, c, c+1, start_k, limit_k, ..., HIGH]
0636:                    //                             ^
0637:                    //                             list[i]
0638:
0639:                    // Don't use ensureCapacity() to save on copying.
0640:                    // NOTE: This has no measurable impact on performance,
0641:                    // but it might help in some usage patterns.
0642:                    if (len + 2 > list.length) {
0643:                        int[] temp = new int[len + 2 + GROW_EXTRA];
0644:                        if (i != 0)
0645:                            System.arraycopy(list, 0, temp, 0, i);
0646:                        System.arraycopy(list, i, temp, i + 2, len - i);
0647:                        list = temp;
0648:                    } else {
0649:                        System.arraycopy(list, i, list, i + 2, len - i);
0650:                    }
0651:
0652:                    list[i] = c;
0653:                    list[i + 1] = c + 1;
0654:                    len += 2;
0655:                }
0656:
0657:                pat = null;
0658:                return this;
0659:            }
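
The three branches of add(int) above (extend the next range downward, extend the prior range upward, or splice in a new one-character range) can be sketched on a plain int[] inversion list. This is a minimal illustration, not the class's own code: the class name InversionListAdd and its helpers are hypothetical, it copies the array on every call instead of growing it in place, and it uses a linear scan where the original uses findCodePoint.

```java
import java.util.Arrays;

final class InversionListAdd {
    static final int HIGH = 0x110000; // terminator, one past U+10FFFF

    // Smallest i such that c < list[i]. Linear scan for clarity (assumption:
    // list ends with HIGH and 0 <= c <= 0x10FFFF, so the loop terminates).
    static int find(int[] list, int c) {
        int i = 0;
        while (c >= list[i]) {
            i++;
        }
        return i;
    }

    // Returns an inversion list equal to 'list' plus the code point c.
    static int[] add(int[] list, int c) {
        int i = find(list, c);
        if ((i & 1) != 0) {
            return list; // odd index: c is already in the set
        }
        int n = list.length;
        if (c == list[i] - 1) {
            // c sits just before the start of the next range: extend it down
            int[] out = Arrays.copyOf(list, n);
            out[i] = c;
            if (c == 0x10FFFF) {
                // we overwrote the HIGH terminator, so append a new one
                out = Arrays.copyOf(out, n + 1);
                out[n] = HIGH;
                n++;
            }
            if (i > 0 && c == out[i - 1]) {
                // the prior range now touches this one: drop the two
                // equal boundaries at i-1 and i to merge the ranges
                int[] merged = new int[n - 2];
                System.arraycopy(out, 0, merged, 0, i - 1);
                System.arraycopy(out, i + 1, merged, i - 1, n - i - 1);
                return merged;
            }
            return out;
        }
        if (i > 0 && c == list[i - 1]) {
            // c sits just after the end of the prior range: extend it up
            int[] out = list.clone();
            out[i - 1]++;
            return out;
        }
        // not adjacent to anything: splice in the new range [c, c+1)
        int[] out = new int[n + 2];
        System.arraycopy(list, 0, out, 0, i);
        out[i] = c;
        out[i + 1] = c + 1;
        System.arraycopy(list, i, out, i + 2, n - i);
        return out;
    }

    public static void main(String[] args) {
        int[] set = {HIGH};          // empty set
        set = add(set, 'b');
        set = add(set, 'a');
        set = add(set, 'c');
        System.out.println(Arrays.toString(set));
    }
}
```

Starting from the empty list and adding 'b', 'a', 'c' in that order exercises the splice, extend-down, and extend-up branches in turn.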
0660:
0661:            /**
0662:             * Adds the specified multicharacter to this set if it is not already
0663:             * present.  If this set already contains the multicharacter,
0664:             * the call leaves this set unchanged.
0665:             * Thus "ch" => {"ch"}
0666:             * <br><b>Warning: you cannot add an empty string ("") to a UnicodeSet.</b>
0667:             * @param s the source string
0668:             * @return this object, for chaining
0669:             * @stable ICU 2.0
0670:             */
0671:            public final UnicodeSet add(String s) {
0672:
0673:                int cp = getSingleCP(s);
0674:                if (cp < 0) {
0675:                    strings.add(s);
0676:                    pat = null;
0677:                } else {
0678:                    add(cp, cp);
0679:                }
0680:                return this;
0681:            }
0682:
0683:            /**
0684:             * @return the code point if the string consists of a single one;
0685:             * otherwise -1.
0686:             * @param s the string to test
0687:             */
0688:            private static int getSingleCP(String s) {
0689:                if (s.length() < 1) {
0690:                    throw new IllegalArgumentException(
0691:                            "Can't use zero-length strings in UnicodeSet");
0692:                }
0693:                if (s.length() > 2)
0694:                    return -1;
0695:                if (s.length() == 1)
0696:                    return s.charAt(0);
0697:
0698:                // at this point, len = 2
0699:                int cp = UTF16.charAt(s, 0);
0700:                if (cp > 0xFFFF) { // is surrogate pair
0701:                    return cp;
0702:                }
0703:                return -1;
0704:            }
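
The same single-code-point test can be sketched with only java.lang APIs, for readers without the internal UTF16 helper. This is an assumed equivalent, not the method used by this class: the class name SingleCodePoint is hypothetical, and the check relies on String.codePointAt and Character.charCount.

```java
final class SingleCodePoint {
    // Returns the code point if s is exactly one code point long
    // (one BMP char or one surrogate pair), otherwise -1.
    static int getSingleCP(String s) {
        if (s.length() < 1) {
            throw new IllegalArgumentException(
                    "Can't use zero-length strings in UnicodeSet");
        }
        int cp = s.codePointAt(0);
        // charCount is 1 for BMP code points and 2 for supplementary ones,
        // so the string is a single code point only if the lengths agree
        return Character.charCount(cp) == s.length() ? cp : -1;
    }

    public static void main(String[] args) {
        System.out.println(getSingleCP("a"));
        System.out.println(getSingleCP("ch"));
        System.out.println(Integer.toHexString(getSingleCP("\uD83D\uDE00")));
    }
}
```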
0705:
0706:            /**
0707:             * Complements the specified range in this set.  Any character in
0708:             * the range will be removed if it is in this set, or will be
0709:             * added if it is not in this set.  If <code>start > end</code>
0710:             * then an empty range is complemented, leaving the set unchanged.
0711:             *
0712:             * @param start first character, inclusive, of range to be complemented
0713:             * in this set.
0714:             * @param end last character, inclusive, of range to be complemented
0715:             * in this set.
0716:             * @stable ICU 2.0
0717:             */
0718:            public UnicodeSet complement(int start, int end) {
0719:                if (start < MIN_VALUE || start > MAX_VALUE) {
0720:                    throw new IllegalArgumentException("Invalid code point U+"
0721:                            + Utility.hex(start, 6));
0722:                }
0723:                if (end < MIN_VALUE || end > MAX_VALUE) {
0724:                    throw new IllegalArgumentException("Invalid code point U+"
0725:                            + Utility.hex(end, 6));
0726:                }
0727:                if (start <= end) {
0728:                    xor(range(start, end), 2, 0);
0729:                }
0730:                pat = null;
0731:                return this;
0732:            }
0733:
0734:            /**
0735:             * This is equivalent to
0736:             * <code>complement(MIN_VALUE, MAX_VALUE)</code>.
0737:             * @stable ICU 2.0
0738:             */
0739:            public UnicodeSet complement() {
0740:                if (list[0] == LOW) {
0741:                    System.arraycopy(list, 1, list, 0, len - 1);
0742:                    --len;
0743:                } else {
0744:                    ensureCapacity(len + 1);
0745:                    System.arraycopy(list, 0, list, 1, len);
0746:                    list[0] = LOW;
0747:                    ++len;
0748:                }
0749:                pat = null;
0750:                return this;
0751:            }
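
Complementing an inversion list only toggles a leading 0 boundary, which is why complement() above needs no loop over the ranges. A minimal sketch on a plain int[], assuming the list is terminated by 0x110000; the class name InversionListComplement is hypothetical and the array is copied rather than shifted in place:

```java
import java.util.Arrays;

final class InversionListComplement {
    static final int LOW = 0x000000;
    static final int HIGH = 0x110000;

    // Returns the complement of the set described by 'list'.
    static int[] complement(int[] list) {
        if (list[0] == LOW) {
            // set starts at U+0000: dropping that boundary flips every range
            return Arrays.copyOfRange(list, 1, list.length);
        }
        // otherwise prepend LOW so that every boundary changes parity
        int[] out = new int[list.length + 1];
        out[0] = LOW;
        System.arraycopy(list, 0, out, 1, list.length);
        return out;
    }

    public static void main(String[] args) {
        int[] set = {4, 8, HIGH}; // [\u0004-\u0007]
        System.out.println(Arrays.toString(complement(set)));
    }
}
```

Applying the operation twice returns the original list, as expected of a complement.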
0752:
0753:            /**
0754:             * Returns true if this set contains the given character.
0755:             * @param c character to be checked for containment
0756:             * @return true if the test condition is met
0757:             * @stable ICU 2.0
0758:             */
0759:            public boolean contains(int c) {
0760:                if (c < MIN_VALUE || c > MAX_VALUE) {
0761:                    throw new IllegalArgumentException("Invalid code point U+"
0762:                            + Utility.hex(c, 6));
0763:                }
0764:
0765:                /*
0766:                // Set i to the index of the start item greater than c
0767:                // We know we will terminate without length test!
0768:                int i = -1;
0769:                while (true) {
0770:                    if (c < list[++i]) break;
0771:                }
0772:                 */
0773:
0774:                int i = findCodePoint(c);
0775:
0776:                return ((i & 1) != 0); // return true if odd
0777:            }
0778:
0779:            /**
0780:             * Returns the smallest value i such that c < list[i].  Caller
0781:             * must ensure that c is a legal value or this method will enter
0782:             * an infinite loop.  This method performs a binary search.
0783:             * @param c a character in the range MIN_VALUE..MAX_VALUE
0784:             * inclusive
0785:             * @return the smallest integer i in the range 0..len-1,
0786:             * inclusive, such that c < list[i]
0787:             */
0788:            private final int findCodePoint(int c) {
0789:                /* Examples:
0790:                                                   findCodePoint(c)
0791:                   set              list[]         c=0 1 3 4 7 8
0792:                   ===              ==============   ===========
0793:                   []               [110000]         0 0 0 0 0 0
0794:                   [\u0000-\u0003]  [0, 4, 110000]   1 1 1 2 2 2
0795:                   [\u0004-\u0007]  [4, 8, 110000]   0 0 0 1 1 2
0796:                   [:all:]          [0, 110000]      1 1 1 1 1 1
0797:                 */
0798:
0799:                // Return the smallest i such that c < list[i].  Assume
0800:                // list[len - 1] == HIGH and that c is legal (0..HIGH-1).
0801:                if (c < list[0])
0802:                    return 0;
0803:                // High runner test.  c is often after the last range, so an
0804:                // initial check for this condition pays off.
0805:                if (len >= 2 && c >= list[len - 2])
0806:                    return len - 1;
0807:                int lo = 0;
0808:                int hi = len - 1;
0809:                // invariant: c >= list[lo]
0810:                // invariant: c < list[hi]
0811:                for (;;) {
0812:                    int i = (lo + hi) >>> 1;
0813:                    if (i == lo)
0814:                        return hi;
0815:                    if (c < list[i]) {
0816:                        hi = i;
0817:                    } else {
0818:                        lo = i;
0819:                    }
0820:                }
0821:            }
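
The lookup above can be tried out in isolation against the example table in its comment. The sketch below copies the same binary search onto an explicit (list, len) pair; the class name InversionListLookup and the static signatures are assumptions made for the illustration.

```java
final class InversionListLookup {
    static final int HIGH = 0x110000;

    // Smallest i such that c < list[i]. Assumes list[len - 1] == HIGH
    // and 0 <= c < HIGH; otherwise the search may not terminate correctly.
    static int findCodePoint(int[] list, int len, int c) {
        if (c < list[0]) {
            return 0;
        }
        // fast path: c lies at or beyond the last real boundary
        if (len >= 2 && c >= list[len - 2]) {
            return len - 1;
        }
        int lo = 0;
        int hi = len - 1;
        // invariants: list[lo] <= c < list[hi]
        for (;;) {
            int i = (lo + hi) >>> 1;
            if (i == lo) {
                return hi;
            }
            if (c < list[i]) {
                hi = i;
            } else {
                lo = i;
            }
        }
    }

    // An odd index means c falls inside a [start, limit) pair.
    static boolean contains(int[] list, int len, int c) {
        return (findCodePoint(list, len, c) & 1) != 0;
    }

    public static void main(String[] args) {
        int[] list = {4, 8, HIGH}; // [\u0004-\u0007]
        for (int c : new int[] {0, 1, 3, 4, 7, 8}) {
            System.out.println(c + " -> " + findCodePoint(list, 3, c));
        }
    }
}
```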
0822:
0823:            /**
0824:             * Adds all of the elements in the specified set to this set if
0825:             * they're not already present.  This operation effectively
0826:             * modifies this set so that its value is the <i>union</i> of the two
0827:             * sets.  The behavior of this operation is unspecified if the specified
0828:             * collection is modified while the operation is in progress.
0829:             *
0830:             * @param c set whose elements are to be added to this set.
0831:             * @stable ICU 2.0
0832:             */
0833:            public UnicodeSet addAll(UnicodeSet c) {
0834:                add(c.list, c.len, 0);
0835:                strings.addAll(c.strings);
0836:                return this;
0837:            }
0838:
0839:            /**
0840:             * Retains only the elements in this set that are contained in the
0841:             * specified set.  In other words, removes from this set all of
0842:             * its elements that are not contained in the specified set.  This
0843:             * operation effectively modifies this set so that its value is
0844:             * the <i>intersection</i> of the two sets.
0845:             *
0846:             * @param c set that defines which elements this set will retain.
0847:             * @stable ICU 2.0
0848:             */
0849:            public UnicodeSet retainAll(UnicodeSet c) {
0850:                retain(c.list, c.len, 0);
0851:                strings.retainAll(c.strings);
0852:                return this;
0853:            }
0854:
0855:            /**
0856:             * Removes from this set all of its elements that are contained in the
0857:             * specified set.  This operation effectively modifies this
0858:             * set so that its value is the <i>asymmetric set difference</i> of
0859:             * the two sets.
0860:             *
0861:             * @param c set that defines which elements will be removed from
0862:             *          this set.
0863:             * @stable ICU 2.0
0864:             */
0865:            public UnicodeSet removeAll(UnicodeSet c) {
0866:                retain(c.list, c.len, 2);
0867:                strings.removeAll(c.strings);
0868:                return this;
0869:            }
0870:
0871:            /**
0872:             * Removes all of the elements from this set.  This set will be
0873:             * empty after this call returns.
0874:             * @stable ICU 2.0
0875:             */
0876:            public UnicodeSet clear() {
0877:                list[0] = HIGH;
0878:                len = 1;
0879:                pat = null;
0880:                strings.clear();
0881:                return this;
0882:            }
0883:
0884:            /**
0885:             * Iteration method that returns the number of ranges contained in
0886:             * this set.
0887:             * @see #getRangeStart
0888:             * @see #getRangeEnd
0889:             * @stable ICU 2.0
0890:             */
0891:            public int getRangeCount() {
0892:                return len / 2;
0893:            }
0894:
0895:            /**
0896:             * Iteration method that returns the first character in the
0897:             * specified range of this set.
0898:             * @exception ArrayIndexOutOfBoundsException if index is outside
0899:             * the range <code>0..getRangeCount()-1</code>
0900:             * @see #getRangeCount
0901:             * @see #getRangeEnd
0902:             * @stable ICU 2.0
0903:             */
0904:            public int getRangeStart(int index) {
0905:                return list[index * 2];
0906:            }
0907:
0908:            /**
0909:             * Iteration method that returns the last character in the
0910:             * specified range of this set.
0911:             * @exception ArrayIndexOutOfBoundsException if index is outside
0912:             * the range <code>0..getRangeCount()-1</code>
0913:             * @see #getRangeStart
0914:             * @see #getRangeCount
0915:             * @stable ICU 2.0
0916:             */
0917:            public int getRangeEnd(int index) {
0918:                return (list[index * 2 + 1] - 1);
0919:            }
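
The three range accessors above are meant to be used together in a loop. The sketch below shows that loop on a plain int[] inversion list rather than on a UnicodeSet instance; the class name RangeIteration and the describe helper are hypothetical.

```java
final class RangeIteration {
    static final int HIGH = 0x110000;

    // Each range is a [start, limit) pair; the trailing HIGH is unpaired
    // when the set does not reach U+10FFFF, so integer division drops it.
    static int rangeCount(int len) {
        return len / 2;
    }

    static int rangeStart(int[] list, int index) {
        return list[index * 2];
    }

    static int rangeEnd(int[] list, int index) {
        return list[index * 2 + 1] - 1; // stored limit is exclusive
    }

    // Formats every range as U+XXXX..U+XXXX, separated by spaces.
    static String describe(int[] list, int len) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < rangeCount(len); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X..U+%04X",
                    rangeStart(list, i), rangeEnd(list, i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        int[] digitsAndLower = {0x30, 0x3A, 0x61, 0x7B, HIGH}; // [0-9a-z]
        System.out.println(describe(digitsAndLower, digitsAndLower.length));
    }
}
```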
0920:
0921:            //----------------------------------------------------------------
0922:            // Implementation: Pattern parsing
0923:            //----------------------------------------------------------------
0924:
0925:            /**
0926:             * Parses the given pattern, starting at the given position.  The character
0927:             * at pattern.charAt(pos.getIndex()) must be '[', or the parse fails.
0928:             * Parsing continues until the corresponding closing ']'.  If a syntax error
0929:             * is encountered between the opening and closing bracket, the parse fails.
0930:             * Upon return from a successful parse, the ParsePosition is updated to
0931:             * point to the character following the closing ']', and an inversion
0932:             * list for the parsed pattern is returned.  This method
0933:             * calls itself recursively to parse embedded subpatterns.
0934:             *
0935:             * @param pattern the string containing the pattern to be parsed.  The
0936:             * portion of the string from pos.getIndex(), which must be a '[', to the
0937:             * corresponding closing ']', is parsed.
0938:             * @param pos upon entry, the position at which to begin parsing.  The
0939:             * character at pattern.charAt(pos.getIndex()) must be a '['.  Upon return
0940:             * from a successful parse, pos.getIndex() is either the index of the character after the
0941:             * closing ']' of the parsed pattern, or pattern.length() if the closing ']'
0942:             * is the last character of the pattern string.
0943:             * @return an inversion list for the parsed substring
0944:             * of <code>pattern</code>
0945:             * @exception java.lang.IllegalArgumentException if the parse fails.
0946:             */
0947:            UnicodeSet applyPattern(String pattern, ParsePosition pos,
0948:                    SymbolTable symbols, int options) {
0949:
0950:                // Need to build the pattern in a temporary string because
0951:                // the nested applyPattern call invokes add() etc., which reset pat to null.
0952:                boolean parsePositionWasNull = pos == null;
0953:                if (parsePositionWasNull) {
0954:                    pos = new ParsePosition(0);
0955:                }
0956:
0957:                StringBuffer rebuiltPat = new StringBuffer();
0958:                RuleCharacterIterator chars = new RuleCharacterIterator(
0959:                        pattern, symbols, pos);
0960:                applyPattern(chars, symbols, rebuiltPat, options);
0961:                if (chars.inVariable()) {
0962:                    syntaxError(chars, "Extra chars in variable value");
0963:                }
0964:                pat = rebuiltPat.toString();
0965:                if (parsePositionWasNull) {
0966:                    int i = pos.getIndex();
0967:
0968:                    // Skip over trailing whitespace
0969:                    if ((options & IGNORE_SPACE) != 0) {
0970:                        i = Utility.skipWhitespace(pattern, i);
0971:                    }
0972:
0973:                    if (i != pattern.length()) {
0974:                        throw new IllegalArgumentException("Parse of \""
0975:                                + pattern + "\" failed at " + i);
0976:                    }
0977:                }
0978:                return this;
0979:            }
0980:
0981:            /**
0982:             * Parse the pattern from the given RuleCharacterIterator.  The
0983:             * iterator is advanced over the parsed pattern.
0984:             * @param chars iterator over the pattern characters.  Upon return
0985:             * it will be advanced to the first character after the parsed
0986:             * pattern, or the end of the iteration if all characters are
0987:             * parsed.
0988:             * @param symbols symbol table to use to parse and dereference
0989:             * variables, or null if none.
0990:             * @param rebuiltPat the pattern that was parsed, rebuilt or
0991:             * copied from the input pattern, as appropriate.
0992:             * @param options a bit mask of zero or more of the following:
0993:             * IGNORE_SPACE, CASE.
0994:             */
0995:            void applyPattern(RuleCharacterIterator chars, SymbolTable symbols,
0996:                    StringBuffer rebuiltPat, int options) {
0997:
0998:                // Syntax characters: [ ] ^ - & { }
0999:
1000:                // Recognized special forms for chars, sets: c-c s-s s&s
1001:
1002:                int opts = RuleCharacterIterator.PARSE_VARIABLES
1003:                        | RuleCharacterIterator.PARSE_ESCAPES;
1004:                if ((options & IGNORE_SPACE) != 0) {
1005:                    opts |= RuleCharacterIterator.SKIP_WHITESPACE;
1006:                }
1007:
1008:                StringBuffer pat = new StringBuffer(), buf = null;
1009:                boolean usePat = false;
1010:                UnicodeSet scratch = null;
1011:                Object backup = null;
1012:
1013:                // mode: 0=before [, 1=between [...], 2=after ]
1014:                // lastItem: 0=none, 1=char, 2=set
1015:                int lastItem = 0, lastChar = 0, mode = 0;
1016:                char op = 0;
1017:
1018:                boolean invert = false;
1019:
1020:                clear();
1021:
1022:                while (mode != 2 && !chars.atEnd()) {
1023:                    if (false) {
1024:                        // Debugging assertion
1025:                        if (!((lastItem == 0 && op == 0)
1026:                                || (lastItem == 1 && (op == 0 || op == '-'))
1027:                                || (lastItem == 2 && (op == 0 || op == '-' || op == '&')))) {
1028:                            throw new IllegalArgumentException();
1029:                        }
1030:                    }
1031:
1032:                    int c = 0;
1033:                    boolean literal = false;
1034:                    UnicodeSet nested = null;
1035:
1036:                    // -------- Check for property pattern
1037:
1038:                    // setMode: 0=none, 1=unicodeset, 2=propertypat, 3=preparsed
1039:                    int setMode = 0;
1040:                    if (resemblesPropertyPattern(chars, opts)) {
1041:                        setMode = 2;
1042:                    }
1043:
1044:                    // -------- Parse '[' of opening delimiter OR nested set.
1045:                    // If there is a nested set, use `setMode' to define how
1046:                    // the set should be parsed.  If the '[' is part of the
1047:                    // opening delimiter for this pattern, parse special
1048:                    // strings "[", "[^", "[-", and "[^-".  Check for stand-in
1049:                    // characters representing a nested set in the symbol
1050:                    // table.
1051:
1052:                    else {
1053:                        // Prepare to backup if necessary
1054:                        backup = chars.getPos(backup);
1055:                        c = chars.next(opts);
1056:                        literal = chars.isEscaped();
1057:
1058:                        if (c == '[' && !literal) {
1059:                            if (mode == 1) {
1060:                                chars.setPos(backup); // backup
1061:                                setMode = 1;
1062:                            } else {
1063:                                // Handle opening '[' delimiter
1064:                                mode = 1;
1065:                                pat.append('[');
1066:                                backup = chars.getPos(backup); // prepare to backup
1067:                                c = chars.next(opts);
1068:                                literal = chars.isEscaped();
1069:                                if (c == '^' && !literal) {
1070:                                    invert = true;
1071:                                    pat.append('^');
1072:                                    backup = chars.getPos(backup); // prepare to backup
1073:                                    c = chars.next(opts);
1074:                                    literal = chars.isEscaped();
1075:                                }
1076:                                // Fall through to handle special leading '-';
1077:                                // otherwise restart loop for nested [], \p{}, etc.
1078:                                if (c == '-') {
1079:                                    literal = true;
1080:                                    // Fall through to handle literal '-' below
1081:                                } else {
1082:                                    chars.setPos(backup); // backup
1083:                                    continue;
1084:                                }
1085:                            }
1086:                        } else if (symbols != null) {
1087:                            UnicodeMatcher m = symbols.lookupMatcher(c); // may be null
1088:                            if (m != null) {
1089:                                try {
1090:                                    nested = (UnicodeSet) m;
1091:                                    setMode = 3;
1092:                                } catch (ClassCastException e) {
1093:                                    syntaxError(chars, "Syntax error");
1094:                                }
1095:                            }
1096:                        }
1097:                    }
1098:
1099:                    // -------- Handle a nested set.  This either is inline in
1100:                    // the pattern or represented by a stand-in that has
1101:                    // previously been parsed and was looked up in the symbol
1102:                    // table.
1103:
1104:                    if (setMode != 0) {
1105:                        if (lastItem == 1) {
1106:                            if (op != 0) {
1107:                                syntaxError(chars,
1108:                                        "Char expected after operator");
1109:                            }
1110:                            add(lastChar, lastChar);
1111:                            _appendToPat(pat, lastChar, false);
1112:                            lastItem = op = 0;
1113:                        }
1114:
1115:                        if (op == '-' || op == '&') {
1116:                            pat.append(op);
1117:                        }
1118:
1119:                        if (nested == null) {
1120:                            if (scratch == null)
1121:                                scratch = new UnicodeSet();
1122:                            nested = scratch;
1123:                        }
1124:                        switch (setMode) {
1125:                        case 1:
1126:                            nested.applyPattern(chars, symbols, pat, options);
1127:                            break;
1128:                        case 2:
1129:                            chars.skipIgnored(opts);
1130:                            nested.applyPropertyPattern(chars, pat, symbols);
1131:                            break;
1132:                        case 3: // `nested' already parsed
1133:                            nested._toPattern(pat, false);
1134:                            break;
1135:                        }
1136:
1137:                        usePat = true;
1138:
1139:                        if (mode == 0) {
1140:                            // Entire pattern is a category; leave parse loop
1141:                            set(nested);
1142:                            mode = 2;
1143:                            break;
1144:                        }
1145:
1146:                        switch (op) {
1147:                        case '-':
1148:                            removeAll(nested);
1149:                            break;
1150:                        case '&':
1151:                            retainAll(nested);
1152:                            break;
1153:                        case 0:
1154:                            addAll(nested);
1155:                            break;
1156:                        }
1157:
1158:                        op = 0;
1159:                        lastItem = 2;
1160:
1161:                        continue;
1162:                    }
1163:
1164:                    if (mode == 0) {
1165:                        syntaxError(chars, "Missing '['");
1166:                    }
1167:
1168:                    // -------- Parse special (syntax) characters.  If the
1169:                    // current character is not special, or if it is escaped,
1170:                    // then fall through and handle it below.
1171:
1172:                    if (!literal) {
1173:                        switch (c) {
1174:                        case ']':
1175:                            if (lastItem == 1) {
1176:                                add(lastChar, lastChar);
1177:                                _appendToPat(pat, lastChar, false);
1178:                            }
1179:                            // Treat final trailing '-' as a literal
1180:                            if (op == '-') {
1181:                                add(op, op);
1182:                                pat.append(op);
1183:                            } else if (op == '&') {
1184:                                syntaxError(chars, "Trailing '&'");
1185:                            }
1186:                            pat.append(']');
1187:                            mode = 2;
1188:                            continue;
1189:                        case '-':
1190:                            if (op == 0) {
1191:                                if (lastItem != 0) {
1192:                                    op = (char) c;
1193:                                    continue;
1194:                                } else {
1195:                                    // Treat final trailing '-' as a literal
1196:                                    add(c, c);
1197:                                    c = chars.next(opts);
1198:                                    literal = chars.isEscaped();
1199:                                    if (c == ']' && !literal) {
1200:                                        pat.append("-]");
1201:                                        mode = 2;
1202:                                        continue;
1203:                                    }
1204:                                }
1205:                            }
1206:                            syntaxError(chars, "'-' not after char or set");
1207:                        case '&':
1208:                            if (lastItem == 2 && op == 0) {
1209:                                op = (char) c;
1210:                                continue;
1211:                            }
1212:                            syntaxError(chars, "'&' not after set");
1213:                        case '^':
1214:                            syntaxError(chars, "'^' not after '['");
1215:                        case '{':
1216:                            if (op != 0) {
1217:                                syntaxError(chars,
1218:                                        "Missing operand after operator");
1219:                            }
1220:                            if (lastItem == 1) {
1221:                                add(lastChar, lastChar);
1222:                                _appendToPat(pat, lastChar, false);
1223:                            }
1224:                            lastItem = 0;
1225:                            if (buf == null) {
1226:                                buf = new StringBuffer();
1227:                            } else {
1228:                                buf.setLength(0);
1229:                            }
1230:                            boolean ok = false;
1231:                            while (!chars.atEnd()) {
1232:                                c = chars.next(opts);
1233:                                literal = chars.isEscaped();
1234:                                if (c == '}' && !literal) {
1235:                                    ok = true;
1236:                                    break;
1237:                                }
1238:                                UTF16.append(buf, c);
1239:                            }
1240:                            if (buf.length() < 1 || !ok) {
1241:                                syntaxError(chars,
1242:                                        "Invalid multicharacter string");
1243:                            }
1244:                            // We have new string. Add it to set and continue;
1245:                            // we don't need to drop through to the further
1246:                            // processing
1247:                            add(buf.toString());
1248:                            pat.append('{');
1249:                            _appendToPat(pat, buf.toString(), false);
1250:                            pat.append('}');
1251:                            continue;
1252:                        case SymbolTable.SYMBOL_REF:
1253:                            //         symbols  nosymbols
1254:                            // [a-$]   error    error (ambiguous)
1255:                            // [a$]    anchor   anchor
1256:                            // [a-$x]  var "x"* literal '$'
1257:                            // [a-$.]  error    literal '$'
1258:                            // *We won't get here in the case of var "x"
1259:                            backup = chars.getPos(backup);
1260:                            c = chars.next(opts);
1261:                            literal = chars.isEscaped();
1262:                            boolean anchor = (c == ']' && !literal);
1263:                            if (symbols == null && !anchor) {
1264:                                c = SymbolTable.SYMBOL_REF;
1265:                                chars.setPos(backup);
1266:                                break; // literal '$'
1267:                            }
1268:                            if (anchor && op == 0) {
1269:                                if (lastItem == 1) {
1270:                                    add(lastChar, lastChar);
1271:                                    _appendToPat(pat, lastChar, false);
1272:                                }
1273:                                add(UnicodeMatcher.ETHER);
1274:                                usePat = true;
1275:                                pat.append(SymbolTable.SYMBOL_REF).append(']');
1276:                                mode = 2;
1277:                                continue;
1278:                            }
1279:                            syntaxError(chars, "Unquoted '$'");
1280:                        default:
1281:                            break;
1282:                        }
1283:                    }
1284:
1285:                    // -------- Parse literal characters.  This includes both
1286:                    // escaped chars ("\u4E01") and non-syntax characters
1287:                    // ("a").
1288:
1289:                    switch (lastItem) {
1290:                    case 0:
1291:                        lastItem = 1;
1292:                        lastChar = c;
1293:                        break;
1294:                    case 1:
1295:                        if (op == '-') {
1296:                            if (lastChar >= c) {
1297:                                // Don't allow redundant (a-a) or empty (b-a) ranges;
1298:                                // these are most likely typos.
1299:                                syntaxError(chars, "Invalid range");
1300:                            }
1301:                            add(lastChar, c);
1302:                            _appendToPat(pat, lastChar, false);
1303:                            pat.append(op);
1304:                            _appendToPat(pat, c, false);
1305:                            lastItem = op = 0;
1306:                        } else {
1307:                            add(lastChar, lastChar);
1308:                            _appendToPat(pat, lastChar, false);
1309:                            lastChar = c;
1310:                        }
1311:                        break;
1312:                    case 2:
1313:                        if (op != 0) {
1314:                            syntaxError(chars, "Set expected after operator");
1315:                        }
1316:                        lastChar = c;
1317:                        lastItem = 1;
1318:                        break;
1319:                    }
1320:                }
1321:
1322:                if (mode != 2) {
1323:                    syntaxError(chars, "Missing ']'");
1324:                }
1325:
1326:                chars.skipIgnored(opts);
1327:
1328:                if (invert) {
1329:                    complement();
1330:                }
1331:
1332:                // Use the rebuilt pattern (pat) only if necessary.  Prefer the
1333:                // generated pattern.
1334:                if (usePat) {
1335:                    rebuiltPat.append(pat.toString());
1336:                } else {
1337:                    _generatePattern(rebuiltPat, false);
1338:                }
1339:            }
1340:
1341:            private static void syntaxError(RuleCharacterIterator chars,
1342:                    String msg) {
1343:                throw new IllegalArgumentException("Error: " + msg + " at \""
1344:                        + Utility.escape(chars.toString()) + '"');
1345:            }
1346:
1347:            //----------------------------------------------------------------
1348:            // Implementation: Utility methods
1349:            //----------------------------------------------------------------
1350:
1351:            private void ensureCapacity(int newLen) {
1352:                if (newLen <= list.length)
1353:                    return;
1354:                int[] temp = new int[newLen + GROW_EXTRA];
1355:                System.arraycopy(list, 0, temp, 0, len);
1356:                list = temp;
1357:            }
1358:
1359:            private void ensureBufferCapacity(int newLen) {
1360:                if (buffer != null && newLen <= buffer.length)
1361:                    return;
1362:                buffer = new int[newLen + GROW_EXTRA];
1363:            }
1364:
1365:            /**
1366:             * Assumes start <= end.
1367:             */
1368:            private int[] range(int start, int end) {
1369:                if (rangeList == null) {
1370:                    rangeList = new int[] { start, end + 1, HIGH };
1371:                } else {
1372:                    rangeList[0] = start;
1373:                    rangeList[1] = end + 1;
1374:                }
1375:                return rangeList;
1376:            }
1377:
1378:            //----------------------------------------------------------------
1379:            // Implementation: Fundamental operations
1380:            //----------------------------------------------------------------
1381:
1382:            // polarity = 0, 3 is normal: x xor y
1383:            // polarity = 1, 2: x xor ~y == x === y
1384:
1385:            private UnicodeSet xor(int[] other, int otherLen, int polarity) {
1386:                ensureBufferCapacity(len + otherLen);
1387:                int i = 0, j = 0, k = 0;
1388:                int a = list[i++];
1389:                int b;
1390:                if (polarity == 1 || polarity == 2) {
1391:                    b = LOW;
1392:                    if (other[j] == LOW) { // skip base if already LOW
1393:                        ++j;
1394:                        b = other[j];
1395:                    }
1396:                } else {
1397:                    b = other[j++];
1398:                }
1399:                // simplest of all the routines
1400:                // sort the values, discarding identicals!
1401:                while (true) {
1402:                    if (a < b) {
1403:                        buffer[k++] = a;
1404:                        a = list[i++];
1405:                    } else if (b < a) {
1406:                        buffer[k++] = b;
1407:                        b = other[j++];
1408:                    } else if (a != HIGH) { // at this point, a == b
1409:                        // discard both values!
1410:                        a = list[i++];
1411:                        b = other[j++];
1412:                    } else { // DONE!
1413:                        buffer[k++] = HIGH;
1414:                        len = k;
1415:                        break;
1416:                    }
1417:                }
1418:                // swap list and buffer
1419:                int[] temp = list;
1420:                list = buffer;
1421:                buffer = temp;
1422:                pat = null;
1423:                return this;
1424:            }
1425:
1426:            // polarity = 0 is normal: x union y
1427:            // polarity = 2: x union ~y
1428:            // polarity = 1: ~x union y
1429:            // polarity = 3: ~x union ~y
1430:
1431:            private UnicodeSet add(int[] other, int otherLen, int polarity) {
1432:                ensureBufferCapacity(len + otherLen);
1433:                int i = 0, j = 0, k = 0;
1434:                int a = list[i++];
1435:                int b = other[j++];
1436:                // change from xor is that we have to check overlapping pairs
1437:                // polarity bit 1 means a is second, bit 2 means b is.
1438:                main: while (true) {
1439:                    switch (polarity) {
1440:                    case 0: // both first; take lower if unequal
1441:                        if (a < b) { // take a
1442:                            // Back up over overlapping ranges in buffer[]
1443:                            if (k > 0 && a <= buffer[k - 1]) {
1444:                                // Pick latter end value in buffer[] vs. list[]
1445:                                a = max(list[i], buffer[--k]);
1446:                            } else {
1447:                                // No overlap
1448:                                buffer[k++] = a;
1449:                                a = list[i];
1450:                            }
1451:                            i++; // Common if/else code factored out
1452:                            polarity ^= 1;
1453:                        } else if (b < a) { // take b
1454:                            if (k > 0 && b <= buffer[k - 1]) {
1455:                                b = max(other[j], buffer[--k]);
1456:                            } else {
1457:                                buffer[k++] = b;
1458:                                b = other[j];
1459:                            }
1460:                            j++;
1461:                            polarity ^= 2;
1462:                        } else { // a == b, take a, drop b
1463:                            if (a == HIGH)
1464:                                break main;
1465:                            // This is symmetrical; it doesn't matter if
1466:                            // we backtrack with a or b. - liu
1467:                            if (k > 0 && a <= buffer[k - 1]) {
1468:                                a = max(list[i], buffer[--k]);
1469:                            } else {
1470:                                // No overlap
1471:                                buffer[k++] = a;
1472:                                a = list[i];
1473:                            }
1474:                            i++;
1475:                            polarity ^= 1;
1476:                            b = other[j++];
1477:                            polarity ^= 2;
1478:                        }
1479:                        break;
1480:                    case 3: // both second; take higher if unequal, and drop other
1481:                        if (b <= a) { // take a
1482:                            if (a == HIGH)
1483:                                break main;
1484:                            buffer[k++] = a;
1485:                        } else { // take b
1486:                            if (b == HIGH)
1487:                                break main;
1488:                            buffer[k++] = b;
1489:                        }
1490:                        a = list[i++];
1491:                        polarity ^= 1; // factored common code
1492:                        b = other[j++];
1493:                        polarity ^= 2;
1494:                        break;
1495:                    case 1: // a second, b first; if b < a, overlap
1496:                        if (a < b) { // no overlap, take a
1497:                            buffer[k++] = a;
1498:                            a = list[i++];
1499:                            polarity ^= 1;
1500:                        } else if (b < a) { // OVERLAP, drop b
1501:                            b = other[j++];
1502:                            polarity ^= 2;
1503:                        } else { // a == b, drop both!
1504:                            if (a == HIGH)
1505:                                break main;
1506:                            a = list[i++];
1507:                            polarity ^= 1;
1508:                            b = other[j++];
1509:                            polarity ^= 2;
1510:                        }
1511:                        break;
1512:                    case 2: // a first, b second; if a < b, overlap
1513:                        if (b < a) { // no overlap, take b
1514:                            buffer[k++] = b;
1515:                            b = other[j++];
1516:                            polarity ^= 2;
1517:                        } else if (a < b) { // OVERLAP, drop a
1518:                            a = list[i++];
1519:                            polarity ^= 1;
1520:                        } else { // a == b, drop both!
1521:                            if (a == HIGH)
1522:                                break main;
1523:                            a = list[i++];
1524:                            polarity ^= 1;
1525:                            b = other[j++];
1526:                            polarity ^= 2;
1527:                        }
1528:                        break;
1529:                    }
1530:                }
1531:                buffer[k++] = HIGH; // terminate
1532:                len = k;
1533:                // swap list and buffer
1534:                int[] temp = list;
1535:                list = buffer;
1536:                buffer = temp;
1537:                pat = null;
1538:                return this;
1539:            }
1540:
1541:            // polarity = 0 is normal: x intersect y
1542:            // polarity = 2: x intersect ~y == set-minus
1543:            // polarity = 1: ~x intersect y
1544:            // polarity = 3: ~x intersect ~y
1545:
1546:            private UnicodeSet retain(int[] other, int otherLen, int polarity) {
1547:                ensureBufferCapacity(len + otherLen);
1548:                int i = 0, j = 0, k = 0;
1549:                int a = list[i++];
1550:                int b = other[j++];
1551:                // change from xor is that we have to check overlapping pairs
1552:                // polarity bit 1 means a is second, bit 2 means b is.
1553:                main: while (true) {
1554:                    switch (polarity) {
1555:                    case 0: // both first; drop the smaller
1556:                        if (a < b) { // drop a
1557:                            a = list[i++];
1558:                            polarity ^= 1;
1559:                        } else if (b < a) { // drop b
1560:                            b = other[j++];
1561:                            polarity ^= 2;
1562:                        } else { // a == b, take one, drop other
1563:                            if (a == HIGH)
1564:                                break main;
1565:                            buffer[k++] = a;
1566:                            a = list[i++];
1567:                            polarity ^= 1;
1568:                            b = other[j++];
1569:                            polarity ^= 2;
1570:                        }
1571:                        break;
1572:                    case 3: // both second; take lower if unequal
1573:                        if (a < b) { // take a
1574:                            buffer[k++] = a;
1575:                            a = list[i++];
1576:                            polarity ^= 1;
1577:                        } else if (b < a) { // take b
1578:                            buffer[k++] = b;
1579:                            b = other[j++];
1580:                            polarity ^= 2;
1581:                        } else { // a == b, take one, drop other
1582:                            if (a == HIGH)
1583:                                break main;
1584:                            buffer[k++] = a;
1585:                            a = list[i++];
1586:                            polarity ^= 1;
1587:                            b = other[j++];
1588:                            polarity ^= 2;
1589:                        }
1590:                        break;
1591:                    case 1: // a second, b first;
1592:                        if (a < b) { // NO OVERLAP, drop a
1593:                            a = list[i++];
1594:                            polarity ^= 1;
1595:                        } else if (b < a) { // OVERLAP, take b
1596:                            buffer[k++] = b;
1597:                            b = other[j++];
1598:                            polarity ^= 2;
1599:                        } else { // a == b, drop both!
1600:                            if (a == HIGH)
1601:                                break main;
1602:                            a = list[i++];
1603:                            polarity ^= 1;
1604:                            b = other[j++];
1605:                            polarity ^= 2;
1606:                        }
1607:                        break;
1608:                    case 2: // a first, b second; if a < b, overlap
1609:                        if (b < a) { // no overlap, drop b
1610:                            b = other[j++];
1611:                            polarity ^= 2;
1612:                        } else if (a < b) { // OVERLAP, take a
1613:                            buffer[k++] = a;
1614:                            a = list[i++];
1615:                            polarity ^= 1;
1616:                        } else { // a == b, drop both!
1617:                            if (a == HIGH)
1618:                                break main;
1619:                            a = list[i++];
1620:                            polarity ^= 1;
1621:                            b = other[j++];
1622:                            polarity ^= 2;
1623:                        }
1624:                        break;
1625:                    }
1626:                }
1627:                buffer[k++] = HIGH; // terminate
1628:                len = k;
1629:                // swap list and buffer
1630:                int[] temp = list;
1631:                list = buffer;
1632:                buffer = temp;
1633:                pat = null;
1634:                return this;
1635:            }
1636:
1637:            private static final int max(int a, int b) {
1638:                return (a > b) ? a : b;
1639:            }
1640:
1641:            //----------------------------------------------------------------
1642:            // Generic filter-based scanning code
1643:            //----------------------------------------------------------------
1644:
1645:            private static interface Filter {
1646:                boolean contains(int codePoint);
1647:            }
1648:
1649:            // VersionInfo for unassigned characters
1650:            static final VersionInfo NO_VERSION = VersionInfo.getInstance(0, 0,
1651:                    0, 0);
1652:
1653:            private static class VersionFilter implements Filter {
1654:                VersionInfo version;
1655:
1656:                VersionFilter(VersionInfo version) {
1657:                    this.version = version;
1658:                }
1659:
1660:                public boolean contains(int ch) {
1661:                    VersionInfo v = UCharacter.getAge(ch);
1662:                    // Reference comparison ok; VersionInfo caches and reuses
1663:                    // unique objects.
1664:                    return v != NO_VERSION && v.compareTo(version) <= 0;
1665:                }
1666:            }
1667:
1668:            private static synchronized UnicodeSet getInclusions() {
1669:                if (INCLUSIONS == null) {
1670:                    UCharacterProperty property = UCharacterProperty
1671:                            .getInstance();
1672:                    INCLUSIONS = property.getInclusions();
1673:                }
1674:                return INCLUSIONS;
1675:            }
1676:
1677:            /**
1678:             * Generic filter-based scanning code for UCD property UnicodeSets.
1679:             */
1680:            private UnicodeSet applyFilter(Filter filter) {
1681:                // Walk through all Unicode characters, noting the start
1682:                // and end of each range for which filter.contains(ch) is
1683:                // true.  Add each range to a set.
1684:                //
1685:                // To improve performance, use the INCLUSIONS set, which
1686:                // encodes information about character ranges that are known
1687:                // to have identical properties, such as the CJK Ideographs
1688:                // from U+4E00 to U+9FA5.  INCLUSIONS contains only the
1689:                // first characters of such ranges.
1690:                //
1691:                // TODO Where possible, instead of scanning over code points,
1692:                // use internal property data to initialize UnicodeSets for
1693:                // those properties.  Scanning code points is slow.
1694:
1695:                clear();
1696:
1697:                int startHasProperty = -1;
1698:                UnicodeSet inclusions = getInclusions();
1699:                int limitRange = inclusions.getRangeCount();
1700:
1701:                for (int j = 0; j < limitRange; ++j) {
1702:                    // get current range
1703:                    int start = inclusions.getRangeStart(j);
1704:                    int end = inclusions.getRangeEnd(j);
1705:
1706:                    // for all the code points in the range, process
1707:                    for (int ch = start; ch <= end; ++ch) {
1708:                        // only add to the unicodeset on inflection points --
1709:                        // where the hasProperty value changes to false
1710:                        if (filter.contains(ch)) {
1711:                            if (startHasProperty < 0) {
1712:                                startHasProperty = ch;
1713:                            }
1714:                        } else if (startHasProperty >= 0) {
1715:                            add(startHasProperty, ch - 1);
1716:                            startHasProperty = -1;
1717:                        }
1718:                    }
1719:                }
1720:                if (startHasProperty >= 0) {
1721:                    add(startHasProperty, 0x10FFFF);
1722:                }
1723:
1724:                return this;
1725:            }
1726:
1727:            /**
1728:             * Remove leading and trailing rule white space and compress
1729:             * internal rule white space to a single space character.
1730:             *
1731:             * @see UCharacterProperty#isRuleWhiteSpace
1732:             */
1733:            private static String mungeCharName(String source) {
1734:                StringBuffer buf = new StringBuffer();
1735:                for (int i = 0; i < source.length();) {
1736:                    int ch = UTF16.charAt(source, i);
1737:                    i += UTF16.getCharCount(ch);
1738:                    if (UCharacterProperty.isRuleWhiteSpace(ch)) {
1739:                        if (buf.length() == 0
1740:                                || buf.charAt(buf.length() - 1) == ' ') {
1741:                            continue;
1742:                        }
1743:                        ch = ' '; // convert to ' '
1744:                    }
1745:                    UTF16.append(buf, ch);
1746:                }
1747:                if (buf.length() != 0 && buf.charAt(buf.length() - 1) == ' ') {
1748:                    buf.setLength(buf.length() - 1);
1749:                }
1750:                return buf.toString();
1751:            }
1752:
1753:            //----------------------------------------------------------------
1754:            // Property set API
1755:            //----------------------------------------------------------------
1756:
1757:            /**
1758:             * Modifies this set to contain those code points which have the
1759:             * given value for the given property.  Prior contents of this
1760:             * set are lost.
1761:             * @param propertyAlias
1762:             * @param valueAlias
1763:             * @param symbols if not null, then symbols are first called to see if a property
1764:             * is available. If true, then everything else is skipped.
1765:             * @return this set
1766:             * @draft ICU 3.2
1767:             * @deprecated This is a draft API and might change in a future release of ICU.
1768:             */
1769:            public UnicodeSet applyPropertyAlias(String propertyAlias,
1770:                    String valueAlias, SymbolTable symbols) {
1771:                if (propertyAlias.equals("Age")) {
1772:                    // Must munge name, since
1773:                    // VersionInfo.getInstance() does not do
1774:                    // 'loose' matching.
1775:                    VersionInfo version = VersionInfo
1776:                            .getInstance(mungeCharName(valueAlias));
1777:                    applyFilter(new VersionFilter(version));
1778:                    return this;
1779:                } else
1780:                    throw new IllegalArgumentException("Unsupported property");
1781:            }
1782:
1783:            /**
1784:             * Return true if the given iterator appears to point at a
1785:             * property pattern.  Regardless of the result, return with the
1786:             * iterator unchanged.
1787:             * @param chars iterator over the pattern characters.  Upon return
1788:             * it will be unchanged.
1789:             * @param iterOpts RuleCharacterIterator options
1790:             */
1791:            private static boolean resemblesPropertyPattern(
1792:                    RuleCharacterIterator chars, int iterOpts) {
1793:                boolean result = false;
1794:                iterOpts &= ~RuleCharacterIterator.PARSE_ESCAPES;
1795:                Object pos = chars.getPos(null);
1796:                int c = chars.next(iterOpts);
1797:                if (c == '[' || c == '\\') {
1798:                    int d = chars.next(iterOpts
1799:                            & ~RuleCharacterIterator.SKIP_WHITESPACE);
1800:                    result = (c == '[') ? (d == ':')
1801:                            : (d == 'N' || d == 'p' || d == 'P');
1802:                }
1803:                chars.setPos(pos);
1804:                return result;
1805:            }
1806:
1807:            /**
1808:             * Parse the given property pattern at the given parse position.
1809:             * @param symbols TODO
1810:             */
1811:            private UnicodeSet applyPropertyPattern(String pattern,
1812:                    ParsePosition ppos, SymbolTable symbols) {
1813:                int pos = ppos.getIndex();
1814:
1815:                // On entry, ppos should point at the opening delimiter of a
1816:                // property pattern: [:, \p, \P, or \N.
1817:                // Minimum length is 5 characters, e.g. \p{L}
1818:                if ((pos + 5) > pattern.length()) {
1819:                    return null;
1820:                }
1821:
1822:                boolean posix = false; // true for [:pat:], false for \p{pat} \P{pat} \N{pat}
1823:                boolean isName = false; // true for \N{pat}, o/w false
1824:                boolean invert = false;
1825:
1826:                // Look for an opening [:, [:^, \p, \P, or \N
1827:                if (pattern.regionMatches(pos, "[:", 0, 2)) {
1828:                    posix = true;
1829:                    pos = Utility.skipWhitespace(pattern, pos + 2);
1830:                    if (pos < pattern.length() && pattern.charAt(pos) == '^') {
1831:                        ++pos;
1832:                        invert = true;
1833:                    }
1834:                } else if (pattern.regionMatches(true, pos, "\\p", 0, 2)
1835:                        || pattern.regionMatches(pos, "\\N", 0, 2)) {
1836:                    char c = pattern.charAt(pos + 1);
1837:                    invert = (c == 'P');
1838:                    isName = (c == 'N');
1839:                    pos = Utility.skipWhitespace(pattern, pos + 2);
1840:                    if (pos == pattern.length() || pattern.charAt(pos++) != '{') {
1841:                        // Syntax error; "\p", "\P", or "\N" not followed by "{"
1842:                        return null;
1843:                    }
1844:                } else {
1845:                    // Open delimiter not seen
1846:                    return null;
1847:                }
1848:
1849:                // Look for the matching close delimiter, either :] or }
1850:                int close = pattern.indexOf(posix ? ":]" : "}", pos);
1851:                if (close < 0) {
1852:                    // Syntax error; close delimiter missing
1853:                    return null;
1854:                }
1855:
1856:                // Look for an '=' sign.  If this is present, we will parse a
1857:                // medium \p{gc=Cf} or long \p{GeneralCategory=Format}
1858:                // pattern.
1859:                int equals = pattern.indexOf('=', pos);
1860:                String propName, valueName;
1861:                if (equals >= 0 && equals < close && !isName) {
1862:                    // Equals seen; parse medium/long pattern
1863:                    propName = pattern.substring(pos, equals);
1864:                    valueName = pattern.substring(equals + 1, close);
1865:                }
1866:
1867:                else {
1868:                    // Handle case where no '=' is seen, and \N{}
1869:                    propName = pattern.substring(pos, close);
1870:                    valueName = "";
1871:
1872:                    // Handle \N{name}
1873:                    if (isName) {
1874:                        // This is a little inefficient since it means we have to
1875:                        // parse "na" back to UProperty.NAME even though we already
1876:                        // know it's UProperty.NAME.  If we refactor the API to
1877:                        // support args of (int, String) then we can remove
1878:                        // "na" and make this a little more efficient.
1879:                        valueName = propName;
1880:                        propName = "na";
1881:                    }
1882:                }
1883:
1884:                applyPropertyAlias(propName, valueName, symbols);
1885:
1886:                if (invert) {
1887:                    complement();
1888:                }
1889:
1890:                // Move to the limit position after the close delimiter
1891:                ppos.setIndex(close + (posix ? 2 : 1));
1892:
1893:                return this;
1894:            }
1895:
1896:            /**
1897:             * Parse a property pattern.
1898:             * @param chars iterator over the pattern characters.  Upon return
1899:             * it will be advanced to the first character after the parsed
1900:             * pattern, or the end of the iteration if all characters are
1901:             * parsed.
1902:             * @param rebuiltPat the pattern that was parsed, rebuilt or
1903:             * copied from the input pattern, as appropriate.
1904:             * @param symbols TODO
1905:             */
1906:            private void applyPropertyPattern(RuleCharacterIterator chars,
1907:                    StringBuffer rebuiltPat, SymbolTable symbols) {
1908:                String pat = chars.lookahead();
1909:                ParsePosition pos = new ParsePosition(0);
1910:                applyPropertyPattern(pat, pos, symbols);
1911:                if (pos.getIndex() == 0) {
1912:                    syntaxError(chars, "Invalid property pattern");
1913:                }
1914:                chars.jumpahead(pos.getIndex());
1915:                rebuiltPat.append(pat.substring(0, pos.getIndex()));
1916:            }
1917:
1918:            //----------------------------------------------------------------
1919:            // Case folding API
1920:            //----------------------------------------------------------------
1921:
1922:            /**
1923:             * Bitmask for constructor and applyPattern() indicating that
1924:             * white space should be ignored.  If set, ignore characters for
1925:             * which UCharacterProperty.isRuleWhiteSpace() returns true,
1926:             * unless they are quoted or escaped.  This may be ORed together
1927:             * with other selectors.
1928:             * @internal
1929:             */
1930:            public static final int IGNORE_SPACE = 1;
1931:
1932:        }
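
The xor(), add(), and retain() methods above all merge two sorted boundary arrays, each terminated by a sentinel. The following is a minimal standalone sketch of that idea for the simplest case (xor with normal polarity). It is not part of UnicodeSet: the class name InversionListSketch, the value chosen for HIGH, and the contains() helper are assumptions made for illustration only.

```java
import java.util.Arrays;

// Hypothetical illustration class; not part of sun.text.normalizer.
class InversionListSketch {
    // Sentinel that terminates every list. Assumed here to be one past the
    // last Unicode code point, in the role HIGH plays in the listing above.
    static final int HIGH = 0x110000;

    /**
     * Symmetric difference of two inversion lists. Each list holds sorted
     * boundaries {start0, limit0, start1, limit1, ..., HIGH}, where each
     * limit is one past the end of its range.
     */
    static int[] xor(int[] x, int[] y) {
        int[] out = new int[x.length + y.length];
        int i = 0, j = 0, k = 0;
        int a = x[i++];
        int b = y[j++];
        while (true) {
            if (a < b) {            // boundary only in x: keep it
                out[k++] = a;
                a = x[i++];
            } else if (b < a) {     // boundary only in y: keep it
                out[k++] = b;
                b = y[j++];
            } else if (a != HIGH) { // same boundary in both: they cancel
                a = x[i++];
                b = y[j++];
            } else {                // both lists exhausted
                out[k++] = HIGH;
                break;
            }
        }
        return Arrays.copyOf(out, k);
    }

    /** A code point is in the set if an odd number of boundaries are <= c. */
    static boolean contains(int[] list, int c) {
        int n = 0;
        while (list[n] <= c) {      // HIGH stops the scan for any valid c
            n++;
        }
        return (n & 1) == 1;
    }

    public static void main(String[] args) {
        int[] x = { 'a', 'e', HIGH }; // [a-d]
        int[] y = { 'c', 'g', HIGH }; // [c-f]
        int[] r = xor(x, y);          // boundaries for [a-b] and [e-f]
        System.out.println(Arrays.toString(r));
        System.out.println(contains(r, 'b') + " " + contains(r, 'c'));
    }
}
```

Because equal boundaries cancel, the merge needs no polarity tracking; union and intersection must additionally track whether each cursor sits on a range start or a range limit, which is what the polarity bits in add() and retain() do.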