Commit
Merge pull request #6 from WordCollector/staging
misc!: Restructure the API, add tests, add validation.
vxern authored Jan 9, 2023
2 parents de0a3a4 + 4771fbb commit 4387f3e
Showing 14 changed files with 1,434 additions and 238 deletions.
47 changes: 47 additions & 0 deletions CHANGELOG.md
@@ -1,3 +1,50 @@
## 2.2.0+1

- Compiled the example files into a single file `example.dart`.

## 2.2.0

- Added support for the `Crawl-delay` and `Host` fields.

## 2.1.0+1

- Updated README.md.

## 2.1.0

- Added a method `.validate()` for validating files.
- Renamed `parser.dart` to `robots.dart`.

## 2.0.1

- Converted the `onlyApplicableTo` parameter in `Robots.parse()` from a `String`
into a `Set` to allow multiple user-agents to be specified at once.
- Fixed the `onlyApplicableTo` parameter in `Robots.parse()` not being taken
into account.

## 2.0.0

- Additions:
- Added dependencies:
- `meta` for static analysis.
- Added developer dependencies:
- `test` for testing.
- Added support for the 'Sitemap' field.
- Added support for specifying:
- The precedent rule type for determining whether a certain user-agent can
or cannot access a certain path. (`PrecedentRuleType`)
- The comparison strategy to use for comparing rule precedence.
(`PrecedenceStrategy`)
- Added tests.
- Changes:
- Bumped the minimum SDK version to `2.17.0` for enhanced enum support.
- Improvements:
- Made all structs `const` and marked them as `@sealed` and `@immutable`.
- Deletions:
- Removed dependencies:
- `sprint`
- `web_scraper`

## 1.1.1

- Updated project description.
120 changes: 106 additions & 14 deletions README.md
@@ -1,24 +1,116 @@
## A complete, dependency-less and fully documented `robots.txt` ruleset parser.

### Usage

You can obtain the robot exclusion rulesets for a particular website as follows:

```dart
// Get the contents of the `robots.txt` file.
final contents = /* Your method of obtaining the contents of a `robots.txt` file. */;
// Parse the contents.
final robots = Robots.parse(contents);
```

Now that you have parsed the `robots.txt` file, you can perform checks to
establish whether or not a user-agent is allowed to visit a particular path:

```dart
final userAgent = /* Your user-agent. */;
print(robots.verifyCanAccess('/gist/', userAgent: userAgent)); // False
print(robots.verifyCanAccess('/wordcollector/robots_txt/', userAgent: userAgent)); // True
```

If you are not concerned about rules pertaining to any other user-agents, and
only care about your own, you may instruct the parser to ignore them by
specifying only those that matter to you:

```dart
// Parse the contents, disregarding user-agents other than 'WordCollector'.
final robots = Robots.parse(contents, onlyApplicableTo: const {'WordCollector'});
```

The `Robots.parse()` function does not have any built-in structure validation.
It will not throw exceptions, and will fail silently wherever appropriate. If
the file contents passed into it are not a valid `robots.txt` file, there is no
guarantee that it will produce useful data, or that it will disallow a bot
wherever possible.
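
A minimal sketch of this leniency, reusing the `rulesets` collection that
`Robots` exposes (the same one iterated over in `example.dart`): parsing
nonsense does not throw, it just produces nothing of use.

```dart
// Nonsense contents: the parser accepts them without complaint.
final robots = Robots.parse('This is an obviously invalid robots.txt file.');
// No exception is thrown; for contents like these, no useful rulesets are
// likely to have been extracted.
print(robots.rulesets.isEmpty);
```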

If you wish to ensure before parsing that a particular file is valid, use the
`Robots.validate()` function. Unlike `Robots.parse()`, this one **will throw** a
`FormatException` if the file is not valid:

```dart
// Validating an invalid file will throw a `FormatException`.
try {
  Robots.validate('This is an obviously invalid robots.txt file.');
} on FormatException {
  print('As expected, this file is flagged as invalid.');
}

// Validating an already valid file will not throw anything.
try {
  Robots.validate('''
User-agent: *
Crawl-delay: 10
Disallow: /
Allow: /file.txt
Host: https://hosting.example.com/
Sitemap: https://example.com/sitemap.xml
''');
  print('As expected also, this file is not flagged as invalid.');
} on FormatException {
  // Code to handle an invalid file.
}
```

By default, the validator will only accept the following fields:

- User-agent
- Allow
- Disallow
- Sitemap
- Crawl-delay
- Host

If you want to accept files that feature any other fields, you will have to
specify them like so:

```dart
try {
  Robots.validate(
    '''
User-agent: *
Custom-field: value
''',
    allowedFieldNames: {'Custom-field'},
  );
} on FormatException {
  // Code to handle an invalid file.
}
```

By default, the `Allow` field is treated as having precedence by the parser.
This is the standard approach to both writing and reading `robots.txt` files;
however, you can instruct the parser to follow the opposite approach:

```dart
robots.verifyCanAccess(
  '/path',
  userAgent: userAgent,
  typePrecedence: PrecedentRuleType.disallow,
);
```

Similarly, fields defined **later** in the file are considered to have
precedence over those defined earlier. This, too, is the standard approach, but
you can instruct the parser to rule otherwise:

```dart
robots.verifyCanAccess(
  '/path',
  userAgent: userAgent,
  comparisonMethod: PrecedenceStrategy.lowerTakesPrecedence,
);
```
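
The two options are independent and may be combined in a single query; here is
a brief sketch, reusing the `robots` and `userAgent` values from the snippets
above:

```dart
// Combine both non-default behaviours: `Disallow` rules take precedence over
// `Allow` rules, and earlier-defined rules take precedence over later ones.
final canAccess = robots.verifyCanAccess(
  '/path',
  userAgent: userAgent,
  typePrecedence: PrecedentRuleType.disallow,
  comparisonMethod: PrecedenceStrategy.lowerTakesPrecedence,
);
print(canAccess);
```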
6 changes: 5 additions & 1 deletion analysis_options.yaml
@@ -1 +1,5 @@
include: package:words/core.yaml

linter:
  rules:
    directives_ordering: false
110 changes: 93 additions & 17 deletions example/example.dart
@@ -1,32 +1,108 @@
import 'dart:convert';
import 'dart:io';

import 'package:robots_txt/robots_txt.dart';

Future<void> main() async {
  // Get the contents of the `robots.txt` file.
  final contents = await fetchFileContents(host: 'github.com');
  // Parse the contents.
  final robots = Robots.parse(contents);

  // Print the rulesets.
  for (final ruleset in robots.rulesets) {
    // Print the user-agent this ruleset applies to.
    print('User-agent: ${ruleset.userAgent}');

    if (ruleset.allows.isNotEmpty) {
      print('Allowed:');
    }
    // Print the regular expressions that match to paths allowed by this
    // ruleset.
    for (final rule in ruleset.allows) {
      print(' - ${rule.pattern}');
    }

    if (ruleset.disallows.isNotEmpty) {
      print('Disallowed:');
    }
    // Print the regular expressions that match to paths disallowed by this
    // ruleset.
    for (final rule in ruleset.disallows) {
      print(' - ${rule.pattern}');
    }
  }

  const userAgent = 'WordCollector';

  // False: it cannot.
  print(
    "Can '$userAgent' access /gist/? ${robots.verifyCanAccess('/gist/', userAgent: userAgent)}",
  );
  // True: it can.
  print(
    "Can '$userAgent' access /wordcollector/robots_txt/? ${robots.verifyCanAccess('/wordcollector/robots_txt/', userAgent: userAgent)}",
  );

  // Validating an invalid file will throw a `FormatException`.
  try {
    Robots.validate('This is an obviously invalid robots.txt file.');
  } on FormatException {
    print('As expected, the first file is flagged as invalid.');
  }

  // Validating an already valid file will not throw anything.
  try {
    Robots.validate('''
User-agent: *
Crawl-delay: 10
Disallow: /
Allow: /file.txt
Host: https://hosting.example.com/
Sitemap: https://example.com/sitemap.xml
''');
    print('As expected also, the second file is not flagged as invalid.');
  } on FormatException {
    print('Welp, this was not supposed to happen.');
  }

  late final String contentsFromBefore;

  // Validating a file with unsupported fields.
  try {
    Robots.validate(
      contentsFromBefore = '''
User-agent: *
Some-field: abcd.txt
''',
    );
  } on FormatException {
    print(
      'This file is invalid on the grounds that it contains fields we did not '
      'expect it to have.',
    );
    print(
      "Let's fix that by including the custom field in the call to validate().",
    );
    try {
      Robots.validate(contentsFromBefore, allowedFieldNames: {'Some-field'});
      print('Aha! Now there are no issues.');
    } on FormatException {
      print('Welp, this also was not supposed to happen.');
    }
  }
}

Future<String> fetchFileContents({required String host}) async {
  final client = HttpClient();

  final contents = await client
      .get(host, 80, '/robots.txt')
      .then((request) => request.close())
      .then((response) => response.transform(utf8.decoder).join());

  client.close();

  return contents;
}
5 changes: 3 additions & 2 deletions lib/robots_txt.dart
@@ -1,5 +1,6 @@
/// Lightweight, fully documented `robots.txt` file parser.
library robots_txt;

export 'src/robots.dart' show Robots, PrecedentRuleType, FieldType;
export 'src/rule.dart' show Rule, FindRule, Precedence, PrecedenceStrategy;
export 'src/ruleset.dart' show Ruleset, FindRuleInRuleset;