
Add rudimentary keyword search for CCPA policies using HTML file #5

Open
wants to merge 2 commits into base: master

Conversation

objorkman

import polipy

url = 'https://docs.github.com/en/github/site-policy/github-privacy-statement'  # Not used!
result = polipy.get_policy(url, html_file='policy.html')  # policy.html contains the file to check

result.save(output_dir='.')

The output is currently printed and stored as a string.
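
For reference, a minimal sketch of what the keyword search could look like; the keyword list and function body below are illustrative assumptions pieced together from the diff fragments in this PR, not the exact code:

from bs4 import BeautifulSoup

# Hypothetical CCPA keyword list; the real terms are defined in the PR.
CCPA_KEYWORDS = ['ccpa', 'california consumer privacy act', 'do not sell', 'right to delete']

def extract_ccpa_info(html_file):
    # Read the local HTML file and strip markup to get plain text.
    with open(html_file, encoding='utf-8') as f:
        text = BeautifulSoup(f.read(), 'html.parser').get_text().lower()
    # Collect every keyword that appears in the policy text.
    substring = ''
    for w in CCPA_KEYWORDS:
        if w in text:
            substring += w + ','
    return substring + '\n'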

@@ -11,8 +12,10 @@ def extract(extractor, **kwargs):
     content = extract_text(**kwargs)
     return content

-def extract_text(url_type, url=None, dynamic_source=None, static_source=None, **kwargs):
+def extract_text(url_type, url=None, dynamic_source=None, static_source=None, html_file=None, **kwargs):
     if url_type is None or url_type in ['html', 'other']:
Member

Looks good! I would do it this way, so you can save the extracted keywords separately under their own extractor:

def extract(extractor, **kwargs):
    content = None
    if extractor == 'text':
        content = extract_text(**kwargs)
    elif extractor == 'keywords' and 'html_file' in kwargs and kwargs['html_file'] is not None:
        content = extract_ccpa_info(kwargs['html_file'])
    return content

def extract_text(url_type, url=None, dynamic_source=None, static_source=None, **kwargs):
    if url_type is None or url_type in ['html', 'other']:
        content = extract_html(dynamic_source, url)
    elif url_type == 'pdf':
        content = extract_pdf(static_source)
    elif url_type == 'plain':
        content = dynamic_source
    else:
        content = dynamic_source
    return content

Maybe take a look and see if that achieves the same functionality? (apart from not saving it under the "text" extractor).
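
For what it's worth, a quick sketch of how a caller might hit both extractors with that dispatch (the argument values are made up for illustration):

kwargs = {
    'url_type': 'html',
    'dynamic_source': '<html>...</html>',  # page source from the scraping step
    'html_file': 'policy.html',            # local copy used for the keyword search
}

text = extract('text', **kwargs)          # full policy text
keywords = extract('keywords', **kwargs)  # CCPA keyword matches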

# if w in text:
# substring += w + ','
result += substring + '\n'
print(result)
Member

Don't forget to remove the print statements at some later point.
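
If some of this output is worth keeping, one option is to route it through the logger this module already imports rather than print; a minimal sketch, assuming the repo's get_logger can be called without arguments (check its actual signature):

from .logger import get_logger

logger = get_logger()  # assumption: get_logger may expect a name argument instead

# ... then, where the keyword result is built, replace print(result) with:
logger.debug(result)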

polipy/polipy.py (Outdated)

@@ -3,6 +3,7 @@
 from .constants import UTC_DATE, CWD
 from .exceptions import NetworkIOException, ParserException
 from .logger import get_logger
+from bs4 import BeautifulSoup
Member
Is this import needed here?

@@ -39,6 +40,7 @@ def __init__(self, url):
         self.url['url'] = url
         self.url = self.url | parse_url(url)
         self.url['domain'] = self.url['domain'].strip().strip('.').strip('/')
+        self.html_file = html_file
Member

Yeah, I think this is fine for now. Ultimately we could merge self.html_file more cleanly into the Polipy object, since it is essentially the same attribute as self.source['dynamic_html'] after the scraping step. But otherwise it's good to have the option of providing either a URL (and then scraping) or the already-scraped HTML to the constructor.
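
A rough sketch of that direction, treating a provided HTML file as the already-scraped source (everything beyond self.source['dynamic_html'] is an assumption about the surrounding code):

def __init__(self, url, html_file=None):
    # ... existing URL parsing as above ...
    self.source = {}
    if html_file is not None:
        # A local file counts as the already-scraped page,
        # so the scraping step can be skipped entirely.
        with open(html_file, encoding='utf-8') as f:
            self.source['dynamic_html'] = f.read()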
