This is a web crawler that collects information about listed companies, including their English names, Chinese names, abbreviations, stock codes, and key related people. It also collects news about these companies. The goal is to build a dataset of financial news to support the development of Chinese Natural Language Processing.
```
$ conda create --name GoodInfo-py38 python=3.8
$ conda activate GoodInfo-py38
$ git clone https://github.com/allenyummy/GoodInfo.git
```
If you just use the repo as a service, installing the basic dependencies is enough. There are two ways to do so.

- Install with `poetry`

```
$ pip install poetry
$ poetry install --no-dev
```

- Install with `pip`

```
$ pip install -r requirements.txt
```
If you want to modify the code, it's highly recommended to install both the basic and dev dependencies.

- Install with `poetry`

```
$ pip install poetry
$ poetry install
$ pre-commit install
```
Run the command below to make sure the `./` directory is in `PYTHONPATH`.

```
$ export PYTHONPATH=./
```

After doing this, make sure the working directory is xxx/GoodInfo/.
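To double-check the effect of this setting from inside Python, here is a small sketch (the helper function is mine, not part of the repo):

```python
import os
import sys

def repo_root_importable(paths, root="."):
    """Return True if the repo root is on the import path, which is the
    effect of `export PYTHONPATH=./` when run from the repo directory."""
    root_abs = os.path.abspath(root)
    # An empty string entry in sys.path also means "current directory".
    return any(os.path.abspath(p or ".") == root_abs for p in paths)

# With PYTHONPATH=./ exported, "src.*" modules resolve from the repo root.
print(repo_root_importable(sys.path))
```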
- Get entire basic information
  - Run with the command line interface

```
$ python src/entry_goodinfo.py \
    -o $(outfile).json \
    -c $(cachefile).json
```
Noted:

- Caution: GoodInfo may recognize the program as an automated robot and block its IP; therefore, connect to an iPhone hotspot to run the program. Once your IP address is blocked, turn flight mode on and then off, and re-open the hotspot to change your IP address (hotspot IP addresses are floating).
  - Run with the python package

```python
from src.crawler.goodinfo.goodinfo import get_code_name, get_basic

# get_code_name() returns one entry per listed company
data = get_code_name()
for d in data:
    stock_code = d["股票代號"]  # "stock code"
    d_basic = get_basic(stock_code)
```
- Get google news
  - Run with the command line interface
```
$ python src/crawler/googlenews/gnews.py \
    -iq $(input_query_file).txt \
    -o $(outfile).json \
    -c $(cachefile).json
```
Noted:

- `$(input_query_file).txt` contains queries for `src/crawler/googlenews/gnews.py`, one query per line. If a query contains multiple keywords, concatenate the keywords with spaces. For example:

```
台積電 2330
聯發科 2454
...
```
- `$(outfile)` and `$(cachefile)` could be the same or different. Both must be JSON files.
- The results should contain title, description, media, datetime, and link. See more details in `src/utils/struct::GoogleNewsStruct`.
- Caution: Google may recognize the program as an automated robot and block the IP; using a cloud server or fetching data at high frequency raises the chance of being blocked. Therefore, connect to an iPhone hotspot to run `gnews.py`. Once your IP address is blocked by Google, turn flight mode on and then off, and re-open the hotspot to change your IP address (hotspot IP addresses are floating). Besides, I use `time.sleep(random.uniform(30, 50))` between requests, which may lower the chance of getting the IP blocked. If blocked, you'll receive an `HTTP Error 420: Too many requests` message. Just follow the steps above and you can run the program again.
  - Run with the python package

```python
from src.crawler.googlenews.gnews import GoogleNewsCrawler

gnc = GoogleNewsCrawler()
results = gnc.getInfo(query="台積電 2330")  # TSMC; keywords separated by a space
```
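The sleep-between-requests advice above can be wrapped into a small retry loop. A sketch under my own naming (the `fetch` callable is injected; the repo's actual error handling may differ):

```python
import random
import time

def polite_fetch(fetch, queries, delay_range=(30, 50), max_retries=3):
    """Run fetch(query) for each query, pausing a random 30-50 seconds
    between requests to lower the chance of an HTTP 420 block, and
    retrying after a pause if a request still fails."""
    results = {}
    for query in queries:
        for attempt in range(max_retries):
            try:
                results[query] = fetch(query)
                break
            except Exception:
                if attempt == max_retries - 1:
                    raise  # still blocked: time to reset the hotspot IP
                time.sleep(random.uniform(*delay_range))
        time.sleep(random.uniform(*delay_range))  # polite gap before next query
    return results
```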
- Get news details
  - Run with the command line interface
```
$ python src/entry_media.py \
    -m $(media) \
    -l $(link) \
    -o $(outfile).json \
    -c $(cachefile).json
```
Noted:

- `$(link)` could be one or multiple links. If multiple, separate them with white space.
- `$(outfile)` and `$(cachefile)` could be the same or different. Both must be JSON files.
- The results should contain title, content, keywords, category, media, datetime, and link. See more details in `src/utils/struct::NewsStruct`.
- `$(media)` args could be as follows ("o" = supported, "x" = not supported):

  | media | supported | main url |
  | --- | --- | --- |
  | appledaily | o | 蘋果日報 |
  | bcc | o | 中國廣播公司 |
  | bnext | o | 數位時代<br>Meet創業小聚 |
  | chinatimes | o | 中時新聞網 |
  | cmmedia | o | 信傳媒 |
  | cna | o | 中央社 |
  | cnews | o | 匯流新聞網 |
  | ctee | o | 工商時報 |
  | ctitv | o | 中天新聞 |
  | cts | o | 華視新聞 |
  | ctv | x (audio-visual) | 中視新聞 |
  | cynes | o | 鉅亨網<br>鉅亨新聞網 |
  | digitimes | o | DigiTimes |
  | ebc | x (contents are locked) | 東森財經新聞 |
  | epochtimes | o | 大紀元 |
  | era | o | 年代新聞 |
  | ettoday | o | ETtoday新聞雲<br>ETtoday財經雲 |
  | ftv | o | 民視新聞 |
  | kairos | o | 風向新聞 |
  | ltn | o | 自由時報電子報<br>自由財經 |
  | mirror | o | 鏡週刊 |
  | moneydj | o | MoneyDJ理財網 |
  | moneyudn | o | 經濟日報 |
  | mypeoplevol | o | 民眾新聞網 |
  | newtalk | o | 新頭殼 |
  | nownews | o | 今日新聞 |
  | pchome | o | PChome新聞 |
  | peoplenews | o | 民報 |
  | pts | o | 公視新聞 |
  | rti | o | 中央廣播電臺 |
  | setn | o | 三立新聞 |
  | sina | o | 新浪新聞 |
  | storm | o | 風傳媒 |
  | taiwanhot | o | 台灣好新聞 |
  | taronews | o | 芋傳媒 |
  | technews | o | 科技新報<br>財經新報 |
  | thenewslens | o | 關鍵評論網 |
  | ttv | o | 台視新聞 |
  | tvbs | o | TVBS |
  | udn | o | 聯合新聞網 |
  | upmedia | o | 上報 |
  | ustv | o | 非凡新聞 |
  | wealth | o | 財訊 |
  | worldjournal | o | 世界新聞網 |
  | yahoo | o | Yahoo奇摩新聞<br>Yahoo奇摩股市 |
- Run the example code for a specific media, and it will generate `out/sample_$(media).json` locally.

```
$ make run_example media=$(media)
```
  - Run with the python package

```python
from src.crawler.media.factory import MediaNewsCrawlerFactory

nc = MediaNewsCrawlerFactory(media_name="appledaily")
result = nc.getInfo(link="xxxxxx")
```
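`MediaNewsCrawlerFactory` dispatches on `media_name` to pick a per-media crawler. A minimal self-contained sketch of that factory pattern (the class and function names here are hypothetical illustrations, not the repo's):

```python
class BaseNewsCrawler:
    """Common interface: every media crawler implements getInfo(link)."""
    def getInfo(self, link):
        raise NotImplementedError

class AppleDailyCrawler(BaseNewsCrawler):
    def getInfo(self, link):
        # a real crawler would fetch and parse the article page here
        return {"media": "appledaily", "link": link}

class UdnCrawler(BaseNewsCrawler):
    def getInfo(self, link):
        return {"media": "udn", "link": link}

# one registry entry per supported media key (see the table above)
_CRAWLERS = {"appledaily": AppleDailyCrawler, "udn": UdnCrawler}

def media_news_crawler_factory(media_name):
    """Look up the crawler class for a media key and instantiate it."""
    try:
        return _CRAWLERS[media_name]()
    except KeyError:
        raise ValueError(f"unsupported media: {media_name}")
```

Registering each crawler under its media key keeps `entry_media.py` free of per-media branching: adding a new outlet only means adding one class and one registry entry.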
- Test media crawler

```
$ make run_test_all
```
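`make run_test_all` runs the repo's crawler tests. For reference, a self-contained sketch of the kind of check such a test might perform on a crawl result (field names follow the `NewsStruct` fields listed above; the helper function is hypothetical):

```python
REQUIRED_FIELDS = ("title", "content", "keywords", "category",
                   "media", "datetime", "link")

def check_news_result(result):
    """Return the list of NewsStruct fields missing from a crawl result."""
    return [f for f in REQUIRED_FIELDS if f not in result]

sample = {
    "title": "t", "content": "c", "keywords": [], "category": "finance",
    "media": "udn", "datetime": "2021-01-01", "link": "https://example.com",
}
print(check_news_result(sample))  # an empty list means the result is complete
```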