Merge pull request #6 from WordCollector/staging
misc!: Restructure the API, add tests, add validation.
Showing 14 changed files with 1,434 additions and 238 deletions.
@@ -1,24 +1,116 @@
## A complete, dependency-less and fully documented `robots.txt` ruleset parser.

### Usage

You can obtain the robot exclusion rulesets for a particular website as follows:

```dart
// Get the contents of the `robots.txt` file.
final contents = /* Your method of obtaining the contents of a `robots.txt` file. */;
// Parse the contents.
final robots = Robots.parse(contents);
```

Now that you have parsed the `robots.txt` file, you can perform checks to
establish whether or not a user-agent is allowed to visit a particular path:

```dart
final userAgent = /* Your user-agent. */;
print(robots.verifyCanAccess('/gist/', userAgent: userAgent)); // False
print(robots.verifyCanAccess('/wordcollector/robots_txt/', userAgent: userAgent)); // True
```

If you are not concerned about rules pertaining to any other user-agents, and
only care about your own, you may instruct the parser to ignore them by
specifying only those that matter to you:

```dart
// Parse the contents, disregarding user-agents other than 'WordCollector'.
final robots = Robots.parse(contents, onlyApplicableTo: const {'WordCollector'});
```

The `Robots.parse()` function does not have any built-in structure validation.
It will not throw exceptions, and will fail silently wherever appropriate. If
the file contents passed into it are not a valid `robots.txt` file, there is no
guarantee that it will produce useful data, and it will disallow a bot wherever
possible.
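
To illustrate (a minimal sketch; the malformed contents and the path below are
made up for the example, and the outcome of the check depends on what, if
anything, the parser managed to recover):

```dart
// Parsing clearly malformed contents does not throw; it simply may not yield
// anything useful.
final malformed = Robots.parse('<html>This is not a robots.txt file</html>');

// Checks still work, but err on the side of disallowing a bot where possible.
print(malformed.verifyCanAccess('/some/path', userAgent: 'WordCollector'));
```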

If you wish to ensure before parsing that a particular file is valid, use the
`Robots.validate()` function. Unlike `Robots.parse()`, this one **will throw** a
`FormatException` if the file is not valid:

```dart
// Validating an invalid file will throw a `FormatException`.
try {
  Robots.validate('This is an obviously invalid robots.txt file.');
} on FormatException {
  print('As expected, this file is flagged as invalid.');
}

// Validating an already valid file will not throw anything.
try {
  Robots.validate('''
User-agent: *
Crawl-delay: 10
Disallow: /
Allow: /file.txt
Host: https://hosting.example.com/
Sitemap: https://example.com/sitemap.xml
''');
  print('As expected also, this file is not flagged as invalid.');
} on FormatException {
  // Code to handle an invalid file.
}
```

By default, the validator will only accept the following fields:

- User-agent
- Allow
- Disallow
- Sitemap
- Crawl-delay
- Host

If you want to accept files that feature any other fields, you will have to
specify them as follows:

```dart
try {
  Robots.validate(
    '''
User-agent: *
Custom-field: value
''',
    allowedFieldNames: {'Custom-field'},
  );
} on FormatException {
  // Code to handle an invalid file.
}
```

By default, the `Allow` field is treated as having precedence by the parser.
This is the standard approach to both writing and reading `robots.txt` files;
however, you can instruct the parser to follow another approach:

```dart
robots.verifyCanAccess(
  '/path',
  userAgent: userAgent,
  typePrecedence: RuleTypePrecedence.disallow,
);
```
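
As a concrete illustration, here is a sketch of how the two settings can differ
for a path matched by both an `Allow` and a `Disallow` rule (the file contents
below are made up for the example):

```dart
// `/file.txt` is matched by both the `Disallow` and the `Allow` rule.
final robots = Robots.parse('''
User-agent: *
Disallow: /
Allow: /file.txt
''');

// Default behaviour: the `Allow` rule takes precedence for `/file.txt`.
print(robots.verifyCanAccess('/file.txt', userAgent: '*'));

// With `Disallow` given precedence instead, the same check becomes stricter.
print(
  robots.verifyCanAccess(
    '/file.txt',
    userAgent: '*',
    typePrecedence: RuleTypePrecedence.disallow,
  ),
);
```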

Similarly, fields defined **later** in the file are considered to have
precedence. This, too, is the standard approach, but you can instruct the
parser to rule otherwise:

```dart
robots.verifyCanAccess(
  '/path',
  userAgent: userAgent,
  comparisonMethod: PrecedenceStrategy.lowerTakesPrecedence,
);
```
@@ -1 +1,5 @@
include: package:words/core.yaml

linter:
  rules:
    directives_ordering: false
@@ -1,32 +1,108 @@
import 'dart:convert';
import 'dart:io';

import 'package:robots_txt/robots_txt.dart';

Future<void> main() async {
  // Get the contents of the `robots.txt` file.
  final contents = await fetchFileContents(host: 'github.com');
  // Parse the contents.
  final robots = Robots.parse(contents);

  // Print the rulesets.
  for (final ruleset in robots.rulesets) {
    // Print the user-agent this ruleset applies to.
    print('User-agent: ${ruleset.userAgent}');

    if (ruleset.allows.isNotEmpty) {
      print('Allowed:');
    }
    // Print the regular expressions that match to paths allowed by this
    // ruleset.
    for (final rule in ruleset.allows) {
      print(' - ${rule.pattern}');
    }

    if (ruleset.disallows.isNotEmpty) {
      print('Disallowed:');
    }
    // Print the regular expressions that match to paths disallowed by this
    // ruleset.
    for (final rule in ruleset.disallows) {
      print(' - ${rule.pattern}');
    }
  }

  const userAgent = 'WordCollector';

  // False: it cannot.
  print(
    "Can '$userAgent' access /gist/? ${robots.verifyCanAccess('/gist/', userAgent: userAgent)}",
  );
  // True: it can.
  print(
    "Can '$userAgent' access /wordcollector/robots_txt/? ${robots.verifyCanAccess('/wordcollector/robots_txt/', userAgent: userAgent)}",
  );

  // Validating an invalid file will throw a `FormatException`.
  try {
    Robots.validate('This is an obviously invalid robots.txt file.');
  } on FormatException {
    print('As expected, the first file is flagged as invalid.');
  }

  // Validating an already valid file will not throw anything.
  try {
    Robots.validate('''
User-agent: *
Crawl-delay: 10
Disallow: /
Allow: /file.txt
Host: https://hosting.example.com/
Sitemap: https://example.com/sitemap.xml
''');
    print('As expected also, the second file is not flagged as invalid.');
  } on FormatException {
    print('Welp, this was not supposed to happen.');
  }

  late final String contentsFromBefore;

  // Validating a file with unsupported fields.
  try {
    Robots.validate(
      contentsFromBefore = '''
User-agent: *
Some-field: abcd.txt
''',
    );
  } on FormatException {
    print(
      'This file is invalid on the grounds that it contains fields we did not '
      'expect it to have.',
    );
    print(
      "Let's fix that by including the custom field in the call to validate().",
    );
    try {
      Robots.validate(contentsFromBefore, allowedFieldNames: {'Some-field'});
      print('Aha! Now there are no issues.');
    } on FormatException {
      print('Welp, this also was not supposed to happen.');
    }
  }
}

Future<String> fetchFileContents({required String host}) async {
  final client = HttpClient();

  final contents = await client
      .get(host, 80, '/robots.txt')
      .then((request) => request.close())
      .then((response) => response.transform(utf8.decoder).join());

  client.close();

  return contents;
}
@@ -1,5 +1,6 @@
/// Lightweight, fully documented `robots.txt` file parser.
library robots_txt;

export 'src/robots.dart' show Robots, PrecedentRuleType, FieldType;
export 'src/rule.dart' show Rule, FindRule, Precedence, PrecedenceStrategy;
export 'src/ruleset.dart' show Ruleset, FindRuleInRuleset;
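
For reference, a minimal sketch of how this public surface might be used
together, mirroring the usage shown in the README above (the ruleset contents
here are a stand-in rather than anything fetched from a real site):

```dart
import 'package:robots_txt/robots_txt.dart';

void main() {
  // A stand-in ruleset; in practice this would be fetched from a website.
  const contents = '''
User-agent: *
Disallow: /private/
Allow: /public/
''';

  // `Robots` (and its `parse()` and `verifyCanAccess()` members) comes from
  // the `src/robots.dart` export above.
  final robots = Robots.parse(contents);
  print(robots.verifyCanAccess('/public/index.html', userAgent: '*'));
  print(robots.verifyCanAccess('/private/data.txt', userAgent: '*'));
}
```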