au.id.jericho.lib.html

Jericho HTML Parser

A simple but powerful Java library that allows analysis and manipulation of parts of an HTML document, including some common server-side tags, while reproducing verbatim any unrecognised or invalid HTML. It also provides high-level HTML form manipulation functions.

For an introduction to the API, the documentation of the au.id.jericho.lib.html.Source class is the best place to start.

For a summary of features and sample applications, visit the homepage at http://jerichohtml.sourceforge.net

For downloads, support and updates visit the SourceForge.net project page at http://sourceforge.net/projects/jerichohtml/

The Jericho HTML Parser is an open source library released under the GNU Lesser General Public License (LGPL). You are therefore free to use it in commercial applications subject to the terms detailed in the licence document.

Java Source File Name | Type | Comment
Attribute.java | Class | Represents a single attribute name/value segment within a StartTag.
Attributes.java | Class | Represents the list of Attribute objects present within a particular StartTag.
AttributesOutputSegment.java | Class | Implements an OutputSegment whose content is a list of attribute name/value pairs.
BasicLogFormatter.java | Class | Provides basic formatting for log messages.

This class extends the java.util.logging.Formatter class, allowing it to be specified as a formatter for the java.util.logging system.

The static BasicLogFormatter.format(String level, String message, String loggerName) method provides a means of using the same formatting outside of the java.util.logging framework.
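
The arrangement described above, a java.util.logging.Formatter subclass that also exposes a static format method, can be sketched as follows. This is only an illustration: the class name SketchLogFormatter and the "LEVEL: message" layout are assumptions, not the actual BasicLogFormatter source.

```java
import java.util.logging.Formatter;
import java.util.logging.LogRecord;

// Hypothetical stand-in for BasicLogFormatter; the output layout is an assumption.
class SketchLogFormatter extends Formatter {

    // Called by java.util.logging for every record passed to a Handler.
    @Override
    public String format(LogRecord record) {
        return format(record.getLevel().getName(), record.getMessage(), record.getLoggerName());
    }

    // Static entry point so the same layout can be produced without java.util.logging.
    // The logger name is accepted but not printed in this sketch.
    public static String format(String level, String message, String loggerName) {
        return level + ": " + message + "\n";
    }
}
```

A handler would use it through handler.setFormatter(new SketchLogFormatter()), while code outside the logging framework can call SketchLogFormatter.format("INFO", "some message", "some.logger") directly.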

BlankOutputSegment.java | Class | Implements an OutputSegment whose content is a string of spaces with the same length as the segment.
Cache.java | Class | Represents a cached map of character positions to tags.
CharacterEntityReference.java | Class | Represents an HTML Character Entity Reference.
CharacterReference.java | Class | Represents an HTML Character Reference, implemented by the subclasses CharacterEntityReference and NumericCharacterReference.
CharOutputSegment.java | Class | Implements an OutputSegment whose content is a single character constant.
CharStreamSource.java | Interface | Represents a character stream source.
CharStreamSourceUtil.java | Class | Contains static utility methods for manipulating the way data is retrieved from a CharStreamSource object.
Config.java | Class | Encapsulates global configuration properties which determine the behaviour of various functions.
Element.java | Class | Represents an element in a specific document, which encompasses a start tag, an optional end tag and all content in between.

Take the following HTML segment as an example:

<p>This is a sample paragraph.</p>

The whole segment is represented by an Element object.

EncodingDetector.java | Class
EndTag.java | Class | Represents the end tag of an element in a specific document.
EndTagType.java | Class | Defines the syntax for an end tag type.
EndTagTypeGenericImplementation.java | Class | Provides a generic implementation of the abstract EndTagType class based on the most common end tag behaviour.
EndTagTypeMasonComponentCalledWithContent.java | Class
EndTagTypeMasonNamedBlock.java | Class
EndTagTypeNormal.java | Class
EndTagTypeUnregistered.java | Class
FormControl.java | Class | Represents an HTML form control.

A FormControl consists of a single element that matches one of the form control types.

The term output element is used to describe the element that is output if this form control is replaced in an OutputDocument.

A predefined value control is a form control for which FormControl.getFormControlType().hasPredefinedValue() returns true.

FormControlOutputStyle.java | Class | An enumerated type representing the three major output styles of a form control's output element.
FormControlType.java | Class | Represents the control type of a FormControl.
FormField.java | Class | Represents a field in an HTML form, a field being defined as the group of all form controls having the same name.

The FormField.getFormControls() method can be used to obtain the collection of this field's constituent FormControl objects.

The FormFields class, which represents a collection of FormField objects, provides the highest level interface for dealing with form fields and controls.

FormFields.java | Class | Represents a collection of FormField objects.
HTMLElementName.java | Interface | Contains static fields representing the names of all elements defined in the HTML 4.01 specification.
HTMLElementNameSet.java | Class
HTMLElements.java | Class | Contains static methods which group HTML element names by the characteristics of their associated elements.

An HTML element is a normal element with a name that matches one of the HTML element names (ignoring case). This type of element spans the logical HTML element as described in the HTML 4.01 specification section 3.2.1, which may be implicitly terminated if it specifies an optional end tag.

The term Non-HTML element refers to a normal element with a name that does not match one of the HTML element names. This type of element must be either a single tag element or explicitly terminated.

All of the sets returned by the methods in this class may be modified to customise the behaviour of the parser. Care must be taken however to ensure that the sets only contain tag names in lower case.

Below is a table summarising the default characteristics of each HTML element.

HTMLElementTerminatingTagNameSets.java | Class
IntStringHashMap.java | Class | This is an internal class used to efficiently map integers to strings, which is used in the CharacterEntityReference class.
Logger.java | Interface | Defines the interface for handling log messages.
LoggerDisabled.java | Class
LoggerFactory.java | Class
LoggerProvider.java | Interface | Defines the interface for a factory class to provide Logger instances for each Source object.
LoggerProviderDisabled.java | Class
LoggerProviderJava.java | Class
LoggerProviderJCL.java | Class
LoggerProviderLog4J.java | Class
LoggerProviderSLF4J.java | Class
LoggerProviderSTDERR.java | Class
MasonTagTypes.java | Class | Contains tag types related to the Mason server platform.
NumericCharacterReference.java | Class | Represents an HTML Numeric Character Reference.
OutputDocument.java | Class | Represents a modified version of an original Source document.

An OutputDocument represents an original source document that has been modified by substituting segments of it with other text. Each of these substitutions must be registered in the output document, which is most commonly done using the various replace, remove or insert methods in this class. These methods internally register one or more OutputSegment objects to define each substitution. After all of the substitutions have been registered, the modified text can be retrieved using the OutputDocument.writeTo(Writer) or OutputDocument.toString() methods.

The registered output segments may be adjacent, and as of version 2.5 may also overlap. In most cases only output segments that have been or legitimately overlap each other.
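
The idea of registering substitutions against the original text and applying them only when the output is requested can be sketched with plain JDK classes. OutputDocumentSketch and its nested Substitution class are hypothetical names used for illustration; this is not the Jericho implementation, and the sketch assumes the registered segments do not overlap.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical miniature of the substitution idea; not the Jericho OutputDocument class.
class OutputDocumentSketch {

    // One registered substitution: source[begin, end) is replaced by text.
    private static final class Substitution {
        final int begin;
        final int end;
        final String text;

        Substitution(int begin, int end, String text) {
            this.begin = begin;
            this.end = end;
            this.text = text;
        }
    }

    private final String source;
    private final List<Substitution> substitutions = new ArrayList<>();

    OutputDocumentSketch(String source) {
        this.source = source;
    }

    // Each method only registers a substitution; the source text is never modified.
    void replace(int begin, int end, String text) {
        substitutions.add(new Substitution(begin, end, text));
    }

    void remove(int begin, int end) {
        replace(begin, end, "");
    }

    void insert(int pos, String text) {
        replace(pos, pos, text);
    }

    // Applies the substitutions in source order, copying untouched text verbatim.
    // Assumption: registered segments do not overlap.
    @Override
    public String toString() {
        List<Substitution> ordered = new ArrayList<>(substitutions);
        ordered.sort(Comparator.comparingInt((Substitution s) -> s.begin));
        StringBuilder out = new StringBuilder();
        int pos = 0;
        for (Substitution s : ordered) {
            out.append(source, pos, s.begin); // unmodified text before the segment
            out.append(s.text);               // replacement text
            pos = s.end;
        }
        out.append(source, pos, source.length());
        return out.toString();
    }
}
```

Because substitutions refer to positions in the original source, they can be registered in any order without earlier edits shifting the offsets of later ones.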

OutputSegment.java | Interface | Defines the interface for an output segment, which is used in an OutputDocument to replace segments of the source document with other text.
OutputSegmentComparator.java | Class
OverlappingOutputSegmentsException.java | Class | Previously signalled the detection of overlapping output segments in an OutputDocument.
ParseText.java | Class | Represents the text from the document that is to be parsed.
PHPTagTypes.java | Class | Contains tag types related to the PHP server platform.
RemoveOutputSegment.java | Class | Implements an OutputSegment with no content.
Renderer.java | Class | Performs a simple rendering of HTML markup into text.
RowColumnVector.java | Class | Represents the row and column number of a character position in the source document.
Segment.java | Class | Represents a segment of a Source document.
Source.java | Class | Represents a source HTML document.

The first step in parsing an HTML document is always to construct a Source object from the source data, which can be a String, Reader, InputStream or URL. Each constructor uses all the evidence available to determine the original encoding of the data.

Once the Source object has been created, you can immediately start searching for tags or elements within the document using the tag search methods.

In certain circumstances you may be able to improve performance by calling the Source.fullSequentialParse() method before calling any tag search methods.

SourceFormatter.java | Class | Formats HTML source by laying out each non-inline-level element on a new line with an appropriate indent.

Any indentation present in the original source text is removed.

Use one of the following methods to obtain the output:

The output text is functionally equivalent to the original source and should be rendered identically unless specified below.

The following points describe the process in general terms. Any aspect of the algorithm not specifically mentioned here is subject to change without notice in future versions.

  • Every element that is not an inline-level element appears on a new line with an indent corresponding to its depth in the document element hierarchy.
  • The indent is formed by writing n repetitions of the string specified in the IndentString property (set with SourceFormatter.setIndentString(String)), where n is the depth of the indentation.
  • The content of an indented element starts on a new line and is indented at a depth one greater than that of the element, with the end tag appearing on a new line at the same depth as the start tag. If the content contains only text and inline-level elements, it may continue on the same line as the start tag.
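
The indentation rule in the points above can be sketched in a few lines. IndentSketch and its methods are hypothetical helpers written for illustration, not part of the SourceFormatter API; the sketch only shows how n repetitions of the indent string place a start tag, its content and its end tag.

```java
// Hypothetical helper illustrating the indentation rule; not the SourceFormatter class.
class IndentSketch {

    // Builds the indent for a given depth: n repetitions of the indent string.
    static String indent(String indentString, int depth) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < depth; i++) {
            sb.append(indentString);
        }
        return sb.toString();
    }

    // Lays out one block-level element: the start and end tags sit at the element's
    // depth, and the content sits one level deeper on its own line.
    static String block(String indentString, int depth, String startTag, String content, String endTag) {
        return indent(indentString, depth) + startTag + "\n"
             + indent(indentString, depth + 1) + content + "\n"
             + indent(indentString, depth) + endTag + "\n";
    }
}
```

For example, IndentSketch.block("\t", 1, "<div>", "<p>text</p>", "</div>") puts both div tags behind one tab and the paragraph behind two.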
StartTag.java | Class | Represents the start tag of an element in a specific document.
StartTagType.java | Class | Defines the syntax for a start tag type.

A start tag type is any TagType that starts with the character '<' (as with all tag types), but whose second character is not '/'.

This includes types for many tags which stand alone, without a corresponding end tag, and would not intuitively be categorised as a "start tag".

StartTagTypeCDATASection.java | Class
StartTagTypeComment.java | Class
StartTagTypeDoctypeDeclaration.java | Class
StartTagTypeGenericImplementation.java | Class | Provides a generic implementation of the abstract StartTagType class based on the most common start tag behaviour.
StartTagTypeMarkupDeclaration.java | Class
StartTagTypeMasonComponentCall.java | Class
StartTagTypeMasonComponentCalledWithContent.java | Class
StartTagTypeMasonNamedBlock.java | Class
StartTagTypeNormal.java | Class
StartTagTypePHPScript.java | Class
StartTagTypePHPShort.java | Class
StartTagTypePHPStandard.java | Class
StartTagTypeServerCommon.java | Class
StartTagTypeUnregistered.java | Class
StartTagTypeXMLDeclaration.java | Class
StartTagTypeXMLProcessingInstruction.java | Class
StreamEncodingDetector.java | Class
StringOutputSegment.java | Class | Implements an OutputSegment whose content is a CharSequence.
SubCache.java | Class | Represents a cached map of character positions to tags for a particular tag type, or for all tag types if the tagType field is null.
Tag.java | Class | Represents either a StartTag or EndTag in a specific document.

Take the following HTML segment as an example:

<p>This is a sample paragraph.</p>

The "<p>" is represented by a StartTag object, and the "</p>" is represented by an EndTag object, both of which are subclasses of the Tag class. The whole segment, including the start tag, its corresponding end tag and all of the content in between, is represented by an Element object.

Tag Parsing Process

The following process describes how each tag is identified by the parser:
  1. Every '<' character found in the source document is considered to be the start of a tag. The characters following it are compared with the start delimiters of all the registered tag types, and a list of matching tag types is determined.
  2. A more detailed analysis of the source is performed according to the features of each matching tag type from the first step, in order of precedence, until a valid tag is able to be constructed.

    The analysis performed in relation to each candidate tag type is a two-stage process:

    1. The position of the tag is checked to determine whether it is valid. In theory, a server tag is valid in any position, but a non-server tag is not valid inside another non-server tag.

      The TagType#isValidPosition(Source, int pos, int[] fullSequentialParseData) method is responsible for this check and has a common default implementation for all tag types (although custom tag types can override it if necessary). Its behaviour differs depending on whether or not a full sequential parse is performed. See the documentation of the isValidPosition method for full details.

    2. A final analysis is performed by the TagType#constructTagAt(Source, int pos) method of the candidate tag type. This method returns a valid Tag object if all conditions of the candidate tag type are met, otherwise it returns null and the process continues with the next candidate tag type.
  3. If the source does not match the start delimiter or syntax of any registered tag type, the segment spanning it and the next '>' character is taken to be an unregistered tag. Some tag search methods ignore unregistered tags.
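
The three steps above can be sketched with a toy scanner. Everything in it is an assumption made for illustration: the class TagScanSketch, the nested SketchTagType, and the three registered types are hypothetical and far simpler than the real tag types. The position validity check from stage one is omitted; the sketch simply resumes scanning after the end of each tag it finds.

```java
import java.util.ArrayList;
import java.util.List;

// Toy scanner illustrating the tag identification steps; not the Jericho parser.
class TagScanSketch {

    // Hypothetical tag type: a name, a start delimiter and a closing delimiter.
    static final class SketchTagType {
        final String name;
        final String startDelimiter;
        final String closingDelimiter;
        final boolean requiresName; // true if a letter must follow the start delimiter

        SketchTagType(String name, String startDelimiter, String closingDelimiter, boolean requiresName) {
            this.name = name;
            this.startDelimiter = startDelimiter;
            this.closingDelimiter = closingDelimiter;
            this.requiresName = requiresName;
        }

        // Returns the end position of a valid tag starting at pos, or -1 if the
        // source does not satisfy this type's (simplified) syntax.
        int constructTagAt(String source, int pos) {
            int nameStart = pos + startDelimiter.length();
            if (requiresName
                    && (nameStart >= source.length() || !Character.isLetter(source.charAt(nameStart)))) {
                return -1;
            }
            int close = source.indexOf(closingDelimiter, nameStart);
            return close < 0 ? -1 : close + closingDelimiter.length();
        }
    }

    // Listed in order of precedence: longer, more specific start delimiters first.
    static final SketchTagType[] REGISTERED = {
        new SketchTagType("comment", "<!--", "-->", false),
        new SketchTagType("end tag", "</", ">", true),
        new SketchTagType("start tag", "<", ">", true),
    };

    static List<String> scan(String source) {
        List<String> found = new ArrayList<>();
        int pos = source.indexOf('<');
        while (pos >= 0) {
            int end = -1;
            String name = null;
            for (SketchTagType type : REGISTERED) {
                // Step 1: keep only types whose start delimiter matches at this '<'.
                if (!source.startsWith(type.startDelimiter, pos)) {
                    continue;
                }
                // Step 2: try each candidate in precedence order until one yields a tag.
                end = type.constructTagAt(source, pos);
                if (end >= 0) {
                    name = type.name;
                    break;
                }
            }
            // Step 3: no registered type fits, so span to the next '>' as an unregistered tag.
            if (end < 0) {
                int close = source.indexOf('>', pos);
                if (close >= 0) {
                    end = close + 1;
                    name = "unregistered";
                }
            }
            if (end >= 0) {
                found.add(name + ": " + source.substring(pos, end));
            }
            pos = source.indexOf('<', end >= 0 ? end : pos + 1);
        }
        return found;
    }
}
```

Calling TagScanSketch.scan("<!-- note --><p>Hi</p><?x>") reports a comment, a start tag, an end tag and, because '?' is not accepted as the start of a name by any registered type in this sketch, an unregistered tag for "<?x>".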
TagType.java | Class | Defines the syntax for a tag type that can be recognised by the parser.

This class is the root abstract class common to all tag types, and contains methods to register and deregister tag types as well as various methods to aid in their implementation.

Every tag type is represented by an instance of a class (usually a singleton) that must be a subclass of either StartTagType or EndTagType.

TagTypeRegister.java | Class
TextExtractor.java | Class | Extracts the textual content from HTML markup.
Util.java | Class | Contains miscellaneous utility methods not directly associated with the HTML Parser library.
WriterLogger.java | Class | Provides an implementation of the Logger interface that sends output to the specified java.io.Writer.