Commit
Merge pull request #6 from WordCollector/staging
misc!: Restructure the API, add tests, add validation.
vxern authored Jan 9, 2023
2 parents de0a3a4 + 4771fbb commit 4387f3e
Showing 14 changed files with 1,434 additions and 238 deletions.
47 changes: 47 additions & 0 deletions CHANGELOG.md
@@ -1,3 +1,50 @@
## 2.2.0+1

- Compiled the example files into a single file `example.dart`.

## 2.2.0

- Added support for the `Crawl-delay` and `Host` fields.

## 2.1.0+1

- Updated README.md.

## 2.1.0

- Added a method `.validate()` for validating files.
- Renamed `parser.dart` to `robots.dart`.

## 2.0.1

- Converted the `onlyApplicableTo` parameter in `Robots.parse()` from a `String`
into a `Set` to allow multiple user-agents to be specified at once.
- Fixed the `onlyApplicableTo` parameter in `Robots.parse()` not being taken
into account.

## 2.0.0

- Additions:
- Added dependencies:
- `meta` for static analysis.
- Added developer dependencies:
- `test` for testing.
- Added support for the 'Sitemap' field.
- Added support for specifying:
- The precedent rule type for determining whether a certain user-agent can
or cannot access a certain path. (`PrecedentRuleType`)
- The comparison strategy to use for comparing rule precedence.
(`PrecedenceStrategy`)
- Added tests.
- Changes:
- Bumped the minimum SDK version to `2.17.0` for enhanced enum support.
- Improvements:
- Made all structs `const` and marked them as `@sealed` and `@immutable`.
- Deletions:
- Removed dependencies:
- `sprint`
- `web_scraper`

## 1.1.1

- Updated project description.
120 changes: 106 additions & 14 deletions README.md
@@ -1,24 +1,116 @@
## A complete, dependency-less and fully documented `robots.txt` ruleset parser.

### Usage

You can obtain the robot exclusion rulesets for a particular website as follows:

```dart
// Get the contents of the `robots.txt` file.
final contents = /* Your method of obtaining the contents of a `robots.txt` file. */;
// Parse the contents.
final robots = Robots.parse(contents);
```

Now that you have parsed the `robots.txt` file, you can perform checks to
establish whether or not a user-agent is allowed to visit a particular path:

```dart
final userAgent = /* Your user-agent. */;
print(robots.verifyCanAccess('/gist/', userAgent: userAgent)); // False
print(robots.verifyCanAccess('/wordcollector/robots_txt/', userAgent: userAgent)); // True
```

If you are not concerned about rules pertaining to any other user-agents, and
only care about your own, you may instruct the parser to ignore them by
specifying only those that matter to you:

```dart
// Parse the contents, disregarding user-agents other than 'WordCollector'.
final robots = Robots.parse(contents, onlyApplicableTo: const {'WordCollector'});
```

The `Robots.parse()` function does not have any built-in structure validation.
It will not throw exceptions, and will fail silently wherever appropriate. If
the file contents passed into it are not a valid `robots.txt` file, there is no
guarantee that it will produce useful data, or that it will disallow a bot
wherever possible.
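
A minimal sketch of this leniency, reusing the `rulesets` collection that
`Robots` exposes (the same one iterated over in `example.dart`): parsing
nonsense does not throw, it just produces nothing of use.

```dart
// Nonsense contents: the parser accepts them without complaint.
final robots = Robots.parse('This is an obviously invalid robots.txt file.');
// No exception is thrown; for contents like these, no useful rulesets are
// likely to have been extracted.
print(robots.rulesets.isEmpty);
```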

If you wish to ensure before parsing that a particular file is valid, use the
`Robots.validate()` function. Unlike `Robots.parse()`, this one **will throw** a
`FormatException` if the file is not valid:

```dart
// Validating an invalid file will throw a `FormatException`.
try {
  Robots.validate('This is an obviously invalid robots.txt file.');
} on FormatException {
  print('As expected, this file is flagged as invalid.');
}

// Validating an already valid file will not throw anything.
try {
  Robots.validate('''
User-agent: *
Crawl-delay: 10
Disallow: /
Allow: /file.txt
Host: https://hosting.example.com/
Sitemap: https://example.com/sitemap.xml
''');
  print('As expected also, this file is not flagged as invalid.');
} on FormatException {
  // Code to handle an invalid file.
}
```

By default, the validator will only accept the following fields:

- User-agent
- Allow
- Disallow
- Sitemap
- Crawl-delay
- Host

If you want to accept files that feature any other fields, you will have to
specify them like so:

```dart
try {
  Robots.validate(
    '''
User-agent: *
Custom-field: value
''',
    allowedFieldNames: {'Custom-field'},
  );
} on FormatException {
  // Code to handle an invalid file.
}
```

By default, the `Allow` field is treated as having precedence by the parser.
This is the standard approach to both writing and reading `robots.txt` files;
however, you can instruct the parser to follow the opposite approach:

```dart
robots.verifyCanAccess(
  '/path',
  userAgent: userAgent,
  typePrecedence: PrecedentRuleType.disallow,
);
```

Similarly, fields defined **later** in the file are considered to have
precedence over those defined earlier. This, too, is the standard approach, but
you can instruct the parser to rule otherwise:

```dart
robots.verifyCanAccess(
  '/path',
  userAgent: userAgent,
  comparisonMethod: PrecedenceStrategy.lowerTakesPrecedence,
);
```
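
The two options are independent and may be combined in a single query; here is
a brief sketch, reusing the `robots` and `userAgent` values from the snippets
above:

```dart
// Combine both non-default behaviours: `Disallow` rules take precedence over
// `Allow` rules, and earlier-defined rules take precedence over later ones.
final canAccess = robots.verifyCanAccess(
  '/path',
  userAgent: userAgent,
  typePrecedence: PrecedentRuleType.disallow,
  comparisonMethod: PrecedenceStrategy.lowerTakesPrecedence,
);
print(canAccess);
```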
6 changes: 5 additions & 1 deletion analysis_options.yaml
@@ -1 +1,5 @@
include: package:words/core.yaml

linter:
  rules:
    directives_ordering: false
110 changes: 93 additions & 17 deletions example/example.dart
@@ -1,32 +1,108 @@
import 'dart:convert';
import 'dart:io';

import 'package:robots_txt/robots_txt.dart';

Future<void> main() async {
  // Get the contents of the `robots.txt` file.
  final contents = await fetchFileContents(host: 'github.com');
  // Parse the contents.
  final robots = Robots.parse(contents);

  // Print the rulesets.
  for (final ruleset in robots.rulesets) {
    // Print the user-agent this ruleset applies to.
    print('User-agent: ${ruleset.userAgent}');

    if (ruleset.allows.isNotEmpty) {
      print('Allowed:');
    }
    // Print the regular expressions that match to paths allowed by this
    // ruleset.
    for (final rule in ruleset.allows) {
      print(' - ${rule.pattern}');
    }

    if (ruleset.disallows.isNotEmpty) {
      print('Disallowed:');
    }
    // Print the regular expressions that match to paths disallowed by this
    // ruleset.
    for (final rule in ruleset.disallows) {
      print(' - ${rule.pattern}');
    }
  }

  const userAgent = 'WordCollector';

  // False: it cannot.
  print(
    "Can '$userAgent' access /gist/? ${robots.verifyCanAccess('/gist/', userAgent: userAgent)}",
  );
  // True: it can.
  print(
    "Can '$userAgent' access /wordcollector/robots_txt/? ${robots.verifyCanAccess('/wordcollector/robots_txt/', userAgent: userAgent)}",
  );

  // Validating an invalid file will throw a `FormatException`.
  try {
    Robots.validate('This is an obviously invalid robots.txt file.');
  } on FormatException {
    print('As expected, the first file is flagged as invalid.');
  }

  // Validating an already valid file will not throw anything.
  try {
    Robots.validate('''
User-agent: *
Crawl-delay: 10
Disallow: /
Allow: /file.txt
Host: https://hosting.example.com/
Sitemap: https://example.com/sitemap.xml
''');
    print('As expected also, the second file is not flagged as invalid.');
  } on FormatException {
    print('Welp, this was not supposed to happen.');
  }

  late final String contentsFromBefore;

  // Validating a file with unsupported fields.
  try {
    Robots.validate(
      contentsFromBefore = '''
User-agent: *
Some-field: abcd.txt
''',
    );
  } on FormatException {
    print(
      'This file is invalid on the grounds that it contains fields we did not '
      'expect it to have.',
    );
    print(
      "Let's fix that by including the custom field in the call to validate().",
    );
    try {
      Robots.validate(contentsFromBefore, allowedFieldNames: {'Some-field'});
      print('Aha! Now there are no issues.');
    } on FormatException {
      print('Welp, this also was not supposed to happen.');
    }
  }
}

Future<String> fetchFileContents({required String host}) async {
  final client = HttpClient();

  final contents = await client
      .get(host, 80, '/robots.txt')
      .then((request) => request.close())
      .then((response) => response.transform(utf8.decoder).join());

  client.close();

  return contents;
}
5 changes: 3 additions & 2 deletions lib/robots_txt.dart
@@ -1,5 +1,6 @@
/// Lightweight, fully documented `robots.txt` file parser.
library robots_txt;

export 'src/robots.dart' show Robots, PrecedentRuleType, FieldType;
export 'src/rule.dart' show Rule, FindRule, Precedence, PrecedenceStrategy;
export 'src/ruleset.dart' show Ruleset, FindRuleInRuleset;