Moving go license tools from KFP repo (kubeflow#540)
* Moving go license tools from KFP repo

* Move go license tools to py folder

* Add __init__.py file

* Update setup.py

* Fix pylint format

* Fix lint errors
Bobgy authored and k8s-ci-robot committed Dec 18, 2019
1 parent b0ec604 commit e94be0c
Showing 7 changed files with 539 additions and 0 deletions.
46 changes: 46 additions & 0 deletions py/kubeflow/testing/go-license-tools/README.md
@@ -0,0 +1,46 @@
# CLI tools to fetch go library's license info

## Why do we need this?

When we release Go library images, we are effectively redistributing
third-party binaries.

To be compliant, we need to:
* Put license declarations in the image for the licenses of all dependencies and transitive dependencies.
* Mirror source code in the image for code with MPL, EPL, GPL or CDDL licenses.

Getting the licenses of all (transitive) dependencies of a Go library is not
easy, so these tools automate the task.

## How to get all dependencies with license and source code?

1. Install the CLI tools in this folder: `python setup.py install`
1. Collect the dependencies and transitive dependencies of a Go library. Put them together in a text file called `dep.txt`. Format: one library name per line. Each library name should be a valid Go import module name.

   Example ways to get it:
   * argo uses dep (Gopkg) for package management. It has a [Gopkg.lock file](https://github.com/argoproj/argo/blob/master/Gopkg.lock)
     with all of its dependencies and transitive dependencies. All the name fields in this file are what we need. You can run `parse-toml-dep` to parse it.
   * minio uses [official go modules](https://blog.golang.org/using-go-modules). There's a [go.mod file](https://github.com/minio/minio/blob/master/go.mod) describing its direct dependencies. Run `go list -m all` to get the final versions that will be used in a build for all direct and indirect dependencies ([reference](https://github.com/golang/go/wiki/Modules#daily-workflow)). Parse its output to make the file we need.

   Reminder: don't forget to put the library itself into `dep.txt`.
1. Run `get-github-repo` to resolve the github repos of the golang imports. The
   script cannot resolve every import; fewer than 2% of libraries need manual help.

   For a library we cannot resolve, manually put it in `dep-repo-mapping.manual.csv`, so the tool knows how to find its github repo the next time.

   By default, it reads dependencies from `dep.txt` and writes to `repo.txt`.
1. Run `get-github-license-info` to crawl the github license info of these libraries. (Not all repos have a license that github recognizes; fewer than 2% of libraries need manual help.)

   By default, it reads repos from `repo.txt` and writes to `license_info.csv`. You
   need to configure a github personal access token because the tool sends a lot of
   requests to github. Follow the instructions in `get-github-license-info -h`.

   For repos whose license fails to fetch, the usual cause is that the github repo
   doesn't have a license file that github understands. Check its readme and
   put the correct info into `license_info.csv`. (Usually, you can use its README file, which mentions the license.)
1. Edit the license info file. Manually check the license file for all repos with a license categorized as "Other" by github. Figure out their true license names.
1. Run `concatenate-license` to crawl the full-text license files for all dependencies and concatenate them into one file.

   By default, it reads license info from `license_info.csv` and writes to `license.txt`.
   Put `license.txt` at `third_party/library/license.txt`, where it is read when building docker images.
1. Manually update the list of dependencies that require source code and put it into `third_party/library/repo-MPL.txt`.
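
The "parse its output" part of the `go list -m all` step can be sketched as follows. This is a minimal illustration, not one of the tools in this folder: `modules_from_go_list` is a hypothetical helper, and the sample module versions are made up. It relies only on the fact that each output line of `go list -m all` starts with the module path, with the main module listed first.

```python
def modules_from_go_list(output):
  """Return the module paths from `go list -m all` output, in order.

  Each line looks like `module [version] [=> replacement]`; only the
  first field (the module path) is needed for dep.txt.
  """
  modules = []
  seen = set()
  for line in output.splitlines():
    fields = line.split()
    if not fields:
      continue  # skip blank lines
    module = fields[0]
    if module not in seen:
      seen.add(module)
      modules.append(module)
  return modules


# Illustrative sample only; the versions below are assumptions.
sample = """\
github.com/minio/minio
github.com/gorilla/mux v1.7.0
golang.org/x/text v0.3.0 => golang.org/x/text v0.3.2
"""
dep_lines = modules_from_go_list(sample)
print('\n'.join(dep_lines))
```

Because the main module is the first line of the output, the library itself ends up in the list as the reminder above requires. Writing `dep_lines` to `dep.txt`, one per line, produces the input expected by `get-github-repo`.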
Empty file.
82 changes: 82 additions & 0 deletions py/kubeflow/testing/go-license-tools/concatenate_license.py
@@ -0,0 +1,82 @@
# Copyright 2019 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import argparse
import requests
import sys
import traceback

parser = argparse.ArgumentParser(
  description='Concatenate the full license texts listed in a license info CSV file.')
parser.add_argument(
  'license_info_file',
  nargs='?',
  default='license_info.csv',
  help='CSV file with license info fetched from github using get-github-license-info CLI tool. '
  +'(default: %(default)s)',
)
parser.add_argument(
  '-o',
  '--output',
  dest='output_file',
  nargs='?',
  default='license.txt',
  help=
  'Concatenated license file path this command generates. (default: %(default)s)'
)
args = parser.parse_args()


def fetch_license_text(download_link):
  response = requests.get(download_link)
  assert response.ok, 'Fetching {} failed with {} {}'.format(
    download_link, response.status_code, response.reason)
  return response.text


def main():
  with open(args.license_info_file, 'r') as license_info_file, \
       open(args.output_file, 'w') as output_file:
    repo_failed = []
    for line in license_info_file:
      line = line.strip()
      [repo, license_link, license_name,
       license_download_link] = line.split(',')
      try:
        print('Repo {} has license download link {}'.format(
          repo, license_download_link),
              file=sys.stderr)
        license_text = fetch_license_text(license_download_link)
        print(
          '--------------------------------------------------------------------------------',
          file=output_file,
        )
        print('{} {} {}'.format(repo, license_name, license_link),
              file=output_file)
        print(
          '--------------------------------------------------------------------------------',
          file=output_file,
        )
        print(license_text, file=output_file)
      except Exception as e: # pylint: disable=broad-except
        print('[failed]', e, file=sys.stderr)
        traceback.print_exc(file=sys.stderr)
        repo_failed.append(repo)
    print('Failed to download license file for {} repos.'.format(len(repo_failed)), file=sys.stderr)
    for repo in repo_failed:
      print(repo, file=sys.stderr)


main()
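
The loop in `concatenate_license.py` splits each row with `line.split(',')`, which only works when no field contains a comma. A minimal sketch of a more tolerant reader, assuming the CSV file uses standard quoting for such fields (the file written by `get-github-license-info` in this commit does not quote, so this is an assumption about a hand-edited or differently generated file). `read_license_rows` is a hypothetical helper, and the sample row is made up.

```python
import csv
import io


def read_license_rows(csv_file):
  """Yield (repo, license_link, license_name, download_link) tuples."""
  for row in csv.reader(csv_file):
    if not row:
      continue  # csv.reader returns an empty list for blank lines
    if len(row) != 4:
      raise ValueError('expected 4 fields, got {}: {}'.format(len(row), row))
    yield tuple(field.strip() for field in row)


# Illustrative row only: the license name contains a comma and is quoted.
sample = io.StringIO(
  'org/repo,https://example.com/LICENSE,"Foo License, v1",'
  'https://example.com/raw/LICENSE\n')
rows = list(read_license_rows(sample))
print(rows)
```

A malformed row raises `ValueError` with the offending fields, which is easier to trace than the unpacking error produced by a plain `split(',')`.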
107 changes: 107 additions & 0 deletions py/kubeflow/testing/go-license-tools/get_github_license_info.py
@@ -0,0 +1,107 @@
# Copyright 2019 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import argparse
import requests
import sys
import traceback
from pathlib import Path

home = str(Path.home())
parser = argparse.ArgumentParser(
  description='Get github license info from github APIs.')
parser.add_argument(
  'repo_list',
  nargs='?',
  default='repo.txt',
  help=
  'Github repo list file with one line per github repo. Format: org/repo. (default: %(default)s)',
)
parser.add_argument(
  '-o',
  '--output',
  dest='output_file',
  nargs='?',
  default='license_info.csv',
  help=
  'Output file with one line per github repo. Line format: '
  +'org/repo,license_html_url,license_name,license_download_url. (default: %(default)s)',
)
parser.add_argument(
  '--github-api-token-file',
  dest='github_api_token_file',
  default='{}/.github_api_token'.format(home),
  help='You need to create a github personal access token at https://github.com/settings/tokens, '
  +'because github has very strict limit on anonymous API usage. (default: %(default)s) Format: a '
  +'text file with one line. '
  +'"<40 characters string shown when a new personal access token is created>"'
)
args = parser.parse_args()


def main():
  token = None
  try:
    with open(args.github_api_token_file, 'r') as token_file:
      token = token_file.read().strip()
      print('Read github API token from {}, length {}.'.format(
        args.github_api_token_file, len(token)),
            file=sys.stderr)
  except FileNotFoundError:
    raise Exception((
      'Please put a github api token file at {}, or specify a different token file path by '
      +'--github-api-token-file. Github API token is needed because anonymous API access limit '
      +'is not enough.'
    ).format(args.github_api_token_file))

  # github personal access token is always 40 characters long
  assert len(token) == 40
  # reference: https://developer.github.com/v3/#oauth2-token-sent-in-a-header
  headers = {'Authorization': 'token {}'.format(token)}
  with open(args.repo_list, 'r') as repo_list_file, \
       open(args.output_file, 'w') as output_file:
    repo_succeeded = []
    repo_failed = []
    for repo in repo_list_file:
      repo = repo.strip()
      print('Fetching license for {}'.format(repo), file=sys.stderr)
      try:
        url = 'https://api.github.com/repos/{}/license'.format(repo)
        response = requests.get(url, headers=headers)
        if not response.ok:
          print('Error response content:\n{}'.format(response.content), file=sys.stderr)
          raise Exception('fetching {} failed with {} {}'.format(
            url, response.status_code, response.reason))
        data = response.json()

        download_url = data['download_url']
        license_name = data['license']['name']
        html_url = data['html_url']
        print('{},{},{},{}'.format(repo, html_url, license_name, download_url), file=output_file)
        repo_succeeded.append(repo)
      except Exception as e: # pylint: disable=broad-except
        print('[failed]', e, file=sys.stderr)
        traceback.print_exc(file=sys.stderr)
        repo_failed.append(repo)
    print('Fetched github license info, {} succeeded, {} failed.'.format(
      len(repo_succeeded), len(repo_failed)), file=sys.stderr)
    if repo_failed:
      print('The following repos failed:', file=sys.stderr)
      for repo in repo_failed:
        print(repo, file=sys.stderr)


if __name__ == '__main__':
  main()
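
`get_github_license_info.py` sends one authenticated request per repo and does not look at rate-limit information in the responses. A minimal sketch of how a caller might decide to pause, assuming GitHub's `X-RateLimit-Remaining` and `X-RateLimit-Reset` response headers (the latter is a UTC epoch timestamp in seconds). `seconds_until_reset` is a hypothetical helper, not part of this commit, and the sample header values are made up.

```python
import time


def seconds_until_reset(headers, now=None):
  """Return how many seconds to wait before sending the next request.

  Returns 0 when requests remain in the current rate-limit window, or
  when the rate-limit headers are missing.
  """
  remaining = headers.get('X-RateLimit-Remaining')
  reset = headers.get('X-RateLimit-Reset')
  if remaining is None or reset is None:
    return 0
  if int(remaining) > 0:
    return 0
  if now is None:
    now = time.time()
  return max(0, int(reset) - int(now))


# Illustrative values only: the quota is exhausted and resets 100 seconds later.
sample_headers = {
  'X-RateLimit-Remaining': '0',
  'X-RateLimit-Reset': '1576627300',
}
wait = seconds_until_reset(sample_headers, now=1576627200)
print(wait)
```

Inside the fetch loop this could be used as `time.sleep(seconds_until_reset(response.headers))` after each request, so that a long repo list waits for the quota to reset instead of failing.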