
Add rudimentary keyword search for CCPA policies using HTML file #5

Open
wants to merge 2 commits into base: master

Conversation

objorkman

import polipy

url = 'https://docs.github.com/en/github/site-policy/github-privacy-statement'  # Not used!
result = polipy.get_policy(url, html_file='policy.html')  # policy.html contains the file to check

result.save(output_dir='.')

The output is currently printed and stored as a string.
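
For reference, a minimal sketch of what the keyword search could look like; the keyword list and function body below are illustrative assumptions pieced together from the diff fragments in this PR, not the exact code:

from bs4 import BeautifulSoup

# Hypothetical CCPA keyword list; the real terms are defined in the PR.
CCPA_KEYWORDS = ['ccpa', 'california consumer privacy act', 'do not sell', 'right to delete']

def extract_ccpa_info(html_file):
    # Read the local HTML file and strip markup to get plain text.
    with open(html_file, encoding='utf-8') as f:
        text = BeautifulSoup(f.read(), 'html.parser').get_text().lower()
    # Collect every keyword that appears in the policy text.
    substring = ''
    for w in CCPA_KEYWORDS:
        if w in text:
            substring += w + ','
    return substring + '\n'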

@@ -11,8 +12,10 @@ def extract(extractor, **kwargs):
     content = extract_text(**kwargs)
     return content

-def extract_text(url_type, url=None, dynamic_source=None, static_source=None, **kwargs):
+def extract_text(url_type, url=None, dynamic_source=None, static_source=None, html_file=None, **kwargs):
     if url_type is None or url_type in ['html', 'other']:
Member

Looks good! I would do it this way, so you can save the extracted keywords separately under their own extractor:

def extract(extractor, **kwargs):
    content = None
    if extractor == 'text':
        content = extract_text(**kwargs)
    elif extractor == 'keywords' and 'html_file' in kwargs and kwargs['html_file'] is not None:
        content = extract_ccpa_info(kwargs['html_file'])
    return content

def extract_text(url_type, url=None, dynamic_source=None, static_source=None, **kwargs):
    if url_type is None or url_type in ['html', 'other']:
        content = extract_html(dynamic_source, url)
    elif url_type == 'pdf':
        content = extract_pdf(static_source)
    elif url_type == 'plain':
        content = dynamic_source
    else:
        content = dynamic_source
    return content

Maybe take a look and see if that achieves the same functionality? (apart from not saving it under the "text" extractor).
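
For what it's worth, a quick sketch of how a caller might hit both extractors with that dispatch (the argument values are made up for illustration):

kwargs = {
    'url_type': 'html',
    'dynamic_source': '<html>...</html>',  # page source from the scraping step
    'html_file': 'policy.html',            # local copy used for the keyword search
}

text = extract('text', **kwargs)          # full policy text
keywords = extract('keywords', **kwargs)  # CCPA keyword matches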

# if w in text:
# substring += w + ','
result += substring + '\n'
print(result)
Member

Don't forget to remove the print statements at some later point.
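
If some of this output is worth keeping, one option is to route it through the logger this module already imports rather than print; a minimal sketch, assuming the repo's get_logger can be called without arguments (check its actual signature):

from .logger import get_logger

logger = get_logger()  # assumption: get_logger may expect a name argument instead

# ... then, where the keyword result is built, replace print(result) with:
logger.debug(result)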

polipy/polipy.py (Outdated)

@@ -3,6 +3,7 @@
 from .constants import UTC_DATE, CWD
 from .exceptions import NetworkIOException, ParserException
 from .logger import get_logger
+from bs4 import BeautifulSoup
Member
Is this import needed here?

@@ -39,6 +40,7 @@ def __init__(self, url):
         self.url['url'] = url
         self.url = self.url | parse_url(url)
         self.url['domain'] = self.url['domain'].strip().strip('.').strip('/')
+        self.html_file = html_file
Member

Yeah, I think this is fine for now. Ultimately we could merge self.html_file more cleanly into the Polipy object, since it is essentially the same attribute as self.source['dynamic_html'] after the scraping step. But otherwise it's good to have the option of providing either a URL (and then scraping) or the already-scraped HTML to the constructor.
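
A rough sketch of that direction, treating a provided HTML file as the already-scraped source (everything beyond self.source['dynamic_html'] is an assumption about the surrounding code):

def __init__(self, url, html_file=None):
    # ... existing URL parsing as above ...
    self.source = {}
    if html_file is not None:
        # A local file counts as the already-scraped page,
        # so the scraping step can be skipped entirely.
        with open(html_file, encoding='utf-8') as f:
            self.source['dynamic_html'] = f.read()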
