public class Robots extends Object
Modifier and Type | Class and Description |
---|---|
`protected class` | `Robots.Host`: This class maintains status for a given host. |
`protected static class` | `Robots.Record`: This class represents a record in a robots.txt file. |
Modifier and Type | Field and Description |
---|---|
`static String` | `_rcsid` |
`protected Map` | `cache`: The cache map, keyed by protocol/host/port, with a Host object as the value. |
`protected ThrottledFetcher` | `fetcher`: Fetcher used to retrieve the robots data. |
`protected int` | `refCount`: Reference count. |
`protected static String` | `ROBOT_CONNECTION_TYPE`: Robots connection type value. |
`protected static String` | `ROBOT_FILE_NAME`: Robots file name value. |
`protected static int` | `ROBOT_TIMEOUT_MILLISECONDS`: Robots fetch timeout value. |
Constructor and Description |
---|
`Robots(ThrottledFetcher fetcher)`: Constructor. |
Modifier and Type | Method and Description |
---|---|
`protected static boolean` | `doesPathMatch(String path, int pathIndex, String spec, int specIndex)`: Recursive method for matching a specification to a path. |
`protected static boolean` | `doesPathMatch(String path, String spec)`: Check whether a path matches a specification. |
`boolean` | `isFetchAllowed(IThreadContext threadContext, String throttleGroupName, String protocol, int port, String hostName, String pathString, String userAgent, String from, String proxyHost, int proxyPort, String proxyAuthDomain, String proxyAuthUsername, String proxyAuthPassword, IProcessActivity activities, int connectionLimit)`: Decide whether a specific robot can crawl a specific URL. |
`protected static String` | `makeReadable(String inputString)`: Convert a string from the robots file into a readable form that does NOT contain NUL characters (since PostgreSQL does not accept those). |
`void` | `noteConnectionEstablished()`: Note that a connection has been established. |
`void` | `noteConnectionReleased()`: Note that a connection has been released, and free resources if there is no reason to retain them. |
`void` | `poll()`: Clean idle entries out of the cache. |
public static final String _rcsid
protected static final int ROBOT_TIMEOUT_MILLISECONDS
protected static final String ROBOT_CONNECTION_TYPE
protected static final String ROBOT_FILE_NAME
protected ThrottledFetcher fetcher
protected int refCount
protected Map cache
public Robots(ThrottledFetcher fetcher)
public void noteConnectionEstablished()
public void noteConnectionReleased()
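The pairing of `noteConnectionEstablished()` and `noteConnectionReleased()` with the `refCount` field suggests a reference-counted cache. The following is a minimal sketch of that pattern only, not the actual ManifoldCF code: the class `RefCountSketch`, the helper methods `remember` and `cachedHostCount`, and the choice to clear the whole cache when the count reaches zero are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of a reference-counted cache; not the real Robots class.
public class RefCountSketch {
    // Stand-in for the protocol/host/port -> Host map described in the field summary.
    protected final Map<String, Object> cache = new HashMap<>();
    protected int refCount = 0;

    // Each established connection adds one reference.
    public synchronized void noteConnectionEstablished() {
        refCount++;
    }

    // Each released connection drops one reference. Assumption: once nothing
    // references the cache any more, its contents are discarded.
    public synchronized void noteConnectionReleased() {
        if (refCount == 0) {
            throw new IllegalStateException("release without matching establish");
        }
        refCount--;
        if (refCount == 0) {
            cache.clear();
        }
    }

    // Hypothetical helper: store a cache entry.
    public synchronized void remember(String key, Object value) {
        cache.put(key, value);
    }

    // Hypothetical helper: report how many entries are cached.
    public synchronized int cachedHostCount() {
        return cache.size();
    }
}
```

In this sketch both methods are synchronized so the counter and the map are always updated together.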
public void poll()
public boolean isFetchAllowed(IThreadContext threadContext, String throttleGroupName, String protocol, int port, String hostName, String pathString, String userAgent, String from, String proxyHost, int proxyPort, String proxyAuthDomain, String proxyAuthUsername, String proxyAuthPassword, IProcessActivity activities, int connectionLimit) throws ManifoldCFException, ServiceInterruption
Parameters:
userAgent - is the user-agent string used by the robot.
from - is the email address.
protocol - is the name of the protocol (e.g. "http").
port - is the port number (-1 being the default for the protocol).
hostName - is the FQDN of the host.
pathString - is the path (non-query) part of the URL.
Throws:
ManifoldCFException
ServiceInterruption
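Because `port` may be passed as -1 and the cache is keyed by protocol/host/port, a caller-side view of how those three values could be normalized may help. This is a sketch under stated assumptions: `HostKeySketch`, `effectivePort`, and `hostKey` are hypothetical names, the key format is invented for illustration, and only the well-known defaults for "http" (80) and "https" (443) are handled.

```java
import java.util.Locale;

// Hypothetical helper; the real cache key format used by Robots is not documented here.
public class HostKeySketch {
    // Resolve -1 to the well-known default port for the protocol.
    public static int effectivePort(String protocol, int port) {
        if (port != -1) {
            return port;
        }
        switch (protocol.toLowerCase(Locale.ROOT)) {
            case "http":
                return 80;
            case "https":
                return 443;
            default:
                throw new IllegalArgumentException("No default port known for protocol: " + protocol);
        }
    }

    // Build one illustrative protocol/host/port key; host names are case-insensitive,
    // so the host is lowercased before it is used in the key.
    public static String hostKey(String protocol, String hostName, int port) {
        return protocol.toLowerCase(Locale.ROOT)
            + "://" + hostName.toLowerCase(Locale.ROOT)
            + ":" + effectivePort(protocol, port);
    }
}
```

With this normalization, a request that passes -1 and one that passes the explicit default port would share the same key.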
protected static String makeReadable(String inputString)
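The summary states only that the result must not contain NUL characters. One way to meet that requirement is sketched below; it assumes control characters are rendered in caret notation (NUL becomes `^@`, TAB becomes `^I`), which is a choice made for this example and is not confirmed by the documentation. The class name `ReadableSketch` is hypothetical.

```java
// Hypothetical illustration; the real makeReadable may use a different encoding.
public class ReadableSketch {
    public static String makeReadable(String inputString) {
        StringBuilder sb = new StringBuilder(inputString.length());
        for (int i = 0; i < inputString.length(); i++) {
            char c = inputString.charAt(i);
            if (c >= ' ') {
                // Printable (or at least non-control) character: keep as-is.
                sb.append(c);
            } else {
                // Control character, including NUL: emit caret notation,
                // e.g. 0x00 -> "^@", 0x09 -> "^I".
                sb.append('^').append((char) (c + '@'));
            }
        }
        return sb.toString();
    }
}
```

The output stays human-readable for logging and can be stored in a PostgreSQL text column, since no NUL character survives.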
protected static boolean doesPathMatch(String path, String spec)
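The two `doesPathMatch` overloads suggest a public-facing entry point that delegates to a recursive, index-based matcher. A minimal sketch of such a matcher follows. It assumes the widely used robots.txt wildcard conventions, where `*` matches zero or more characters, a trailing `$` anchors the end of the path, and anything else is a prefix match; the documentation above does not state which syntax the real method supports, and `PathMatchSketch` is a hypothetical class name.

```java
// Hypothetical illustration of recursive robots.txt path matching.
public class PathMatchSketch {
    // Entry point: match from the start of both strings.
    public static boolean doesPathMatch(String path, String spec) {
        return doesPathMatch(path, 0, spec, 0);
    }

    public static boolean doesPathMatch(String path, int pathIndex, String spec, int specIndex) {
        while (true) {
            if (specIndex == spec.length()) {
                // Specification exhausted: the path matches as a prefix.
                return true;
            }
            char specChar = spec.charAt(specIndex++);
            if (specChar == '*') {
                // Collapse runs of '*' so repeated wildcards do not multiply the work.
                while (specIndex < spec.length() && spec.charAt(specIndex) == '*') {
                    specIndex++;
                }
                // '*' may consume zero or more characters: try every remaining offset.
                while (true) {
                    if (doesPathMatch(path, pathIndex, spec, specIndex)) {
                        return true;
                    }
                    if (pathIndex == path.length()) {
                        return false;
                    }
                    pathIndex++;
                }
            }
            if (specChar == '$' && specIndex == spec.length()) {
                // A trailing '$' requires the path to end here.
                return pathIndex == path.length();
            }
            if (pathIndex == path.length() || path.charAt(pathIndex) != specChar) {
                return false;
            }
            pathIndex++;
        }
    }
}
```

Under these assumptions, `doesPathMatch("/private/data.html", "/private/")` is a prefix match, while `doesPathMatch("/public/a.pdf?x=1", "/*.pdf$")` fails because the path continues past `.pdf`.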