org.archive.crawler

Introduction to Heritrix.

Heritrix is designed to be easily extensible via third-party modules.

Architecture

The software is divided into several packages of varying importance. The relationship between them will be covered in some greater depth after their introductions.

The root package (this one) contains the executable class {@link org.archive.crawler.Heritrix Heritrix}. That class loads the crawler and parses command-line arguments. If a WUI (web user interface) is to be launched, it launches it. It can also start jobs, with or without the WUI, that are specified in command-line options.
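
Because Heritrix exposes a standard main method, it can also be launched from another Java program. The following minimal sketch simply delegates to that entry point; it assumes the Heritrix jar is on the classpath, and the set of accepted command-line options is version-dependent (see CommandLineParser for the usage message), so no particular options are shown here.

    // Minimal launcher sketch: delegates to Heritrix's own entry point.
    // The options accepted are version-dependent; pass them through just
    // as the wrapper shell script would.
    public class HeritrixLauncher {
        public static void main(String[] args) throws Exception {
            org.archive.crawler.Heritrix.main(args);
        }
    }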

framework

{@link org.archive.crawler.framework org.archive.crawler.framework}

The framework package contains most of the core classes for running a crawl. It also contains a number of interfaces for extensible items, the implementations of which can be found in other packages.

Heritrix is in effect divided into two types of classes.

  1. Core classes - these can often be configured but not replaced.
  2. Pluggable classes - these must implement a given interface or extend a specific class, but third parties can introduce their own implementations.
The framework thus contains a selection of the core classes and a number of the Interfaces and base classes for the pluggable classes.

datamodel

{@link org.archive.crawler.datamodel org.archive.crawler.datamodel}

Contains various classes that make up the crawler's data structures, including such essentials as the CandidateURI and CrawlURI classes that wrap the discovered URIs for processing.

admin

{@link org.archive.crawler.admin org.archive.crawler.admin}

The admin package contains classes that are used by the Web UI. This includes some core classes and a specific implementation of the Statistics Tracking interface found in the framework package that is designed to provide the UI with information about ongoing crawls.

Pluggable modules

The following is a listing of the types of pluggable modules found in Heritrix, with brief explanations of each and links to their respective API documentation.

Frontier

A Frontier maintains the internal state of a crawl while it is in progress: what URIs have been discovered, which should be crawled next, and so on.

Needless to say, this is one of the most important modules in any crawl, and the provided implementation should generally be appropriate unless a very different strategy for ordering URIs for crawling is desired.

{@link org.archive.crawler.framework.Frontier Frontier} is the interface that all Frontiers must implement.
The {@link org.archive.crawler.frontier org.archive.crawler.frontier} package contains the provided implementation of a Frontier along with its supporting classes.
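
To make the Frontier's bookkeeping concrete, here is a deliberately simplified, self-contained sketch. It is not the Heritrix Frontier interface (which additionally covers politeness, retries, and reporting); it only illustrates the discovered/next-to-crawl state described above.

    import java.util.ArrayDeque;
    import java.util.HashSet;
    import java.util.Queue;
    import java.util.Set;

    // Toy frontier: remembers which URIs have been seen and hands out the
    // pending ones in FIFO order. The real Frontier also handles politeness
    // delays, retries, precedence, and persistence.
    public class ToyFrontier {
        private final Queue<String> pending = new ArrayDeque<>();
        private final Set<String> seen = new HashSet<>();

        // Schedule a newly discovered URI, ignoring duplicates.
        public void schedule(String uri) {
            if (seen.add(uri)) {
                pending.add(uri);
            }
        }

        // Hand out the next URI to crawl, or null if none is pending.
        public String next() {
            return pending.poll();
        }

        public boolean isEmpty() {
            return pending.isEmpty();
        }

        public static void main(String[] args) {
            ToyFrontier frontier = new ToyFrontier();
            frontier.schedule("http://example.com/");
            frontier.schedule("http://example.com/"); // duplicate, ignored
            frontier.schedule("http://example.com/about");
            while (!frontier.isEmpty()) {
                System.out.println("crawl next: " + frontier.next());
            }
        }
    }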

Processor

Processing Steps

When a URI is crawled, a {@link org.archive.crawler.framework.ToeThread ToeThread} will execute a series of processors on it.

The processors are split into five distinct chains that are executed in sequence:

  1. Pre-fetch processing chain
  2. Fetch processing chain
  3. Extractor processing chain
  4. Write/Index processing chain
  5. Post-processing chain
Each of these chains contains any number of processors. The processors all inherit from a generic {@link org.archive.crawler.framework.Processor Processor}. While the processors are divided into the five categories above, that is strictly a high-level configuration, and any processor can be placed in any chain (although doing link extraction before fetching a document is clearly of no use).
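
As a conceptual sketch (not the actual Heritrix API, in which processors operate on CrawlURI objects and carry far more state), the chained execution performed per URI looks roughly like this:

    import java.util.List;

    // A processing step applied to a URI; stands in for Heritrix's
    // Processor base class in this sketch.
    interface Step {
        void process(String uri);
    }

    public class ChainRunner {
        // Run each chain in order: pre-fetch, fetch, extract, write, post.
        public static void run(String uri, List<List<Step>> chains) {
            for (List<Step> chain : chains) {
                for (Step step : chain) {
                    step.process(uri);
                }
            }
        }

        public static void main(String[] args) {
            List<List<Step>> chains = List.of(
                List.of(u -> System.out.println("pre-fetch checks: " + u)),
                List.of(u -> System.out.println("fetching: " + u)),
                List.of(u -> System.out.println("extracting links: " + u)),
                List.of(u -> System.out.println("writing ARC record: " + u)),
                List.of(u -> System.out.println("post-processing: " + u)));
            run("http://example.com/", chains);
        }
    }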

Numerous processors are provided with Heritrix in the following packages:
{@link org.archive.crawler.prefetch org.archive.crawler.prefetch} package contains processors run before the URI is fetched from the Internet.
{@link org.archive.crawler.fetcher org.archive.crawler.fetcher} package contains processors that fetch URIs from the Internet. Typically each processor handles a different protocol.
{@link org.archive.crawler.extractor org.archive.crawler.extractor} package contains processors that perform link extractions on various document types.
{@link org.archive.crawler.writer org.archive.crawler.writer} package contains a processor that writes an ARC file with the fetched document.
{@link org.archive.crawler.postprocessor org.archive.crawler.postprocessor} package contains processors that wrap up the processing, reporting discovered links back to the Frontier, etc.

Filter

Scope

Scopes are special filters that are applied to the crawl as a whole to define its scope. Any given crawl employs exactly one scope object to define which URIs are considered 'within scope'.

Several implementations covering the most commonly desired scopes are provided (broad, domain, host, etc.), but custom implementations can be written to define any arbitrary scope. Note, though, that most limitations on the scope of a crawl can be achieved more easily by taking one of the existing scopes and modifying it with appropriate filters.

{@link org.archive.crawler.framework.CrawlScope CrawlScope} - Base class for scopes.
The {@link org.archive.crawler.scope org.archive.crawler.scope} package contains the provided scopes.
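
Conceptually, a scope is a predicate over candidate URIs. The sketch below implements a host-based scope under that simplifying assumption; it is not the CrawlScope API, which operates on CandidateURI objects and composes with filters.

    import java.net.URI;
    import java.util.Set;

    // Toy host scope: a URI is in scope if its host matches a seed host.
    public class ToyHostScope {
        private final Set<String> hosts;

        public ToyHostScope(Set<String> seedHosts) {
            this.hosts = seedHosts;
        }

        public boolean accepts(String uri) {
            try {
                String host = URI.create(uri).getHost();
                return host != null && hosts.contains(host);
            } catch (IllegalArgumentException e) {
                return false; // unparseable URIs are out of scope
            }
        }

        public static void main(String[] args) {
            ToyHostScope scope = new ToyHostScope(Set.of("example.com"));
            System.out.println(scope.accepts("http://example.com/page")); // true
            System.out.println(scope.accepts("http://other.org/"));       // false
        }
    }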

Statistics Tracking

Any number of statistics tracking modules can be added to a crawl to gather run-time information about its progress.

These modules can interrogate the Frontier for the sparse data it exposes, and they can also subscribe to {@link org.archive.crawler.event.CrawlURIDispositionListener Crawled URI Disposition} events to monitor the completion of each URI that is processed.
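
As a rough sketch of the event-driven half of this (the actual listener interface is CrawlURIDispositionListener; the method names below are illustrative, not its real signatures), a disposition-based tracker might simply count outcomes:

    import java.util.concurrent.atomic.AtomicLong;

    // Toy statistics tracker driven by per-URI disposition events. Atomic
    // counters are used because many ToeThreads report concurrently.
    public class ToyStatsTracker {
        private final AtomicLong succeeded = new AtomicLong();
        private final AtomicLong failed = new AtomicLong();

        public void onSuccess(String uri) { succeeded.incrementAndGet(); }
        public void onFailure(String uri) { failed.incrementAndGet(); }

        public String report() {
            return "succeeded=" + succeeded.get() + " failed=" + failed.get();
        }

        public static void main(String[] args) {
            ToyStatsTracker tracker = new ToyStatsTracker();
            tracker.onSuccess("http://example.com/");
            tracker.onFailure("http://example.com/missing");
            System.out.println(tracker.report());
        }
    }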

An interface for {@link org.archive.crawler.framework.StatisticsTracking statistics tracking} is provided as well as a partial implementation ({@link org.archive.crawler.framework.AbstractTracker AbstractTracker}) that does much of the work common to most statistics tracking modules.

Furthermore, the admin package implements a statistics tracking module ({@link org.archive.crawler.admin.StatisticsTracker StatisticsTracker}) that generates a log of the crawler's progress and provides the information the UI uses. It also compiles end-of-crawl reports that contain all of the information it has gathered in the course of the crawl.
It is highly recommended that it always be used when running crawls via the UI.

Java Source File Name     Type    Comment
CommandLineParser.java    Class   Print Heritrix command-line usage message.
Heritrix.java             Class   Main class for the Heritrix crawler. Heritrix is usually launched by a shell script that backgrounds Heritrix and redirects all stdout and stderr it emits to a log file.
SimpleHttpServer.java     Class   Wrapper for the embedded Jetty server.
WebappLifecycle.java      Class   Calls start and stop of Heritrix when Heritrix is bundled as a webapp.