Web Crawler for Goodinfo and Financial News of companies in Taiwan.

allenyummy/GoodInfo

Web Crawler for GoodInfo and Financial News of Companies in Taiwan

This is a web crawler that collects information about listed companies in Taiwan, including their English names, Chinese names, abbreviations, stock codes, and key related people. It also collects news about these companies. My goal is to build a dataset of financial news to support the development of Chinese natural language processing.


Prepare virtual environment

$ conda create --name GoodInfo-py38 python=3.8
$ conda activate GoodInfo-py38
$ git clone https://github.com/allenyummy/GoodInfo.git

Install basic dependencies

If you only want to use the repo as a service, installing the basic dependencies is enough. There are two ways to do this.

  • Run commands with poetry install

    $ pip install poetry
    $ poetry install --no-dev
    
  • Run commands with pip install

    $ pip install -r requirements.txt
    

Install basic and dev dependencies (optional)

If you want to modify the code, it is highly recommended to install both the basic and dev dependencies.

  • Run command with poetry install
    $ pip install poetry
    $ poetry install
    $ pre-commit install
    

Run Code

Set up

Run the command below to make sure "./" is in PYTHONPATH.

$ export PYTHONPATH=./

After doing this, make sure the working directory is xxx/GoodInfo/.
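
Both conditions can be checked from Python before starting a long crawl. The sketch below is not part of the repo: the helper names `is_repo_root` and `pythonpath_has_cwd` are hypothetical, and it assumes the repository root can be recognized by the presence of `src/entry_goodinfo.py`.

```python
import os
from pathlib import Path


def is_repo_root(path="."):
    """Return True if `path` contains src/entry_goodinfo.py (assumed marker of the repo root)."""
    return (Path(path) / "src" / "entry_goodinfo.py").is_file()


def pythonpath_has_cwd():
    """Return True if PYTHONPATH lists the current directory as "./" or "."."""
    entries = os.environ.get("PYTHONPATH", "").split(os.pathsep)
    return any(entry in ("./", ".") for entry in entries)
```

If either function returns False, re-run `export PYTHONPATH=./` or change into the GoodInfo directory before continuing.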

Get company info

  • Get entire basic information
    • Run with command line interface

      $ python src/entry_goodinfo.py \
          -o $(outfile).json \
          -c $(cachefile).json
      

      Note:

      • Caution

        GoodInfo may recognize the program as an automated robot and block your IP. To reduce the impact, run the program over an iPhone hotspot. If your IP address gets blocked, turn flight mode on and then off, and re-open the hotspot to get a new IP address (hotspot IP addresses are dynamic).

    • Run with python package

      from src.crawler.goodinfo.goodinfo import get_code_name, get_basic
      
      data = get_code_name()
      for d in data:
          stock_code = d["股票代號"]
          d_basic = get_basic(stock_code)
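
Because a run can be interrupted by an IP block, it may help to persist results as they arrive and skip codes that were already fetched. The following is a minimal sketch of that idea, not the repo's own caching logic: the function name `collect_basic_info` and the cache layout (a JSON object keyed by stock code) are assumptions, and `fetch` stands for any callable such as `get_basic`, assumed here to return JSON-serializable data.

```python
import json
from pathlib import Path


def collect_basic_info(codes, fetch, cache_path):
    """Fetch info for each stock code, skipping codes already in the cache file.

    `codes` should be strings, since JSON object keys are always strings.
    `fetch` is any callable mapping a stock code to JSON-serializable data.
    """
    cache_file = Path(cache_path)
    if cache_file.exists():
        cache = json.loads(cache_file.read_text(encoding="utf-8"))
    else:
        cache = {}
    for code in codes:
        if code in cache:
            continue  # fetched in an earlier run
        cache[code] = fetch(code)
        # Save after every fetch so an interruption loses no finished work.
        cache_file.write_text(
            json.dumps(cache, ensure_ascii=False, indent=2), encoding="utf-8"
        )
    return cache
```

With the loop above, this could be called as `collect_basic_info([d["股票代號"] for d in data], get_basic, "cache.json")`, where the file name is only an example.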
      

Get google news info

  • Get google news

    • Run with command line interface

      $ python src/crawler/googlenews/gnews.py \
          -iq $(input_query_file).txt \
          -o $(outfile).json \
          -c $(cachefile).json
      

      Note:

      • $(input_query_file).txt contains the queries for src/crawler/googlenews/gnews.py, one query per line. If a query contains multiple keywords, separate them with spaces. For example:

        台積電 2330
        聯發科 2454
        ...
        
      • $(outfile) and $(cachefile) can be the same file or different files. Both must be JSON files.

      • The results should contain title, description, media, datetime, and link. See src/utils/struct::GoogleNewsStruct for details.

      • Caution

        Google may recognize the program as an automated robot and block your IP. Running on a cloud server or fetching data at high frequency increases the chance of being blocked. Therefore, run gnews.py over an iPhone hotspot. If your IP address gets blocked by Google, turn flight mode on and then off, and re-open the hotspot to get a new IP address (hotspot IP addresses are dynamic). In addition, I use time.sleep(random.uniform(30, 50)), which may lower the chance of getting blocked.

        If you get blocked, you will receive the message HTTP Error 429: Too Many Requests. Follow the steps above and you should be able to run the program again.
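
The random pause mentioned above can be wrapped in a small generator so that any loop over queries is slowed down in the same way. This is only a sketch: `throttled` is a hypothetical helper, not something defined in gnews.py, and the default range simply mirrors the `random.uniform(30, 50)` call quoted above.

```python
import random
import time


def throttled(items, delay_range=(30.0, 50.0), sleep=time.sleep):
    """Yield items one by one, pausing a random number of seconds between them.

    `sleep` is injectable so the pauses can be replaced in tests.
    """
    for index, item in enumerate(items):
        if index > 0:
            # No pause before the first item, only between consecutive ones.
            sleep(random.uniform(*delay_range))
        yield item
```

A loop such as `for query in throttled(queries): ...` then waits 30 to 50 seconds between queries. Pausing alone does not lift an existing block; changing the IP address as described above is still needed in that case.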

    • Run with python package

      from src.crawler.googlenews.gnews import GoogleNewsCrawler
      
      gnc = GoogleNewsCrawler()
      results = gnc.getInfo(query="台積電 2330")
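
To run the package interface over a whole query file, the file format described earlier (one query per line, keywords separated by spaces) can be parsed first. The sketch below is an illustration under assumptions: `read_queries` and `crawl_all` are hypothetical names, and `get_info` is expected to be a callable that accepts a `query` keyword argument, as `gnc.getInfo` does in the snippet above.

```python
from pathlib import Path


def read_queries(path):
    """Return the queries in a file: one per line, blank lines skipped.

    Runs of whitespace inside a query are collapsed to single spaces.
    """
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [" ".join(line.split()) for line in lines if line.strip()]


def crawl_all(queries, get_info):
    """Map each query to whatever `get_info(query=...)` returns."""
    return {query: get_info(query=query) for query in queries}
```

For example, `crawl_all(read_queries("queries.txt"), gnc.getInfo)` would collect results for every line of a file named queries.txt (the file name is only a placeholder).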
      

Get news (given a link or links)


Test

  • Test media crawler
    $ make run_test_all
    
