This is a web crawler that collects information about listed companies, including their English names, Chinese names, abbreviations, stock codes, and key related people. It also collects news about these companies. The goal is to build a dataset of financial news to support the development of Chinese Natural Language Processing.
```
$ conda create --name GoodInfo-py38 python=3.8
$ conda activate GoodInfo-py38
$ git clone https://github.com/allenyummy/GoodInfo.git
```
If you just use the repo as a service, installing the basic dependencies is enough. There are two ways to do so.

- Install with `poetry`

```
$ pip install poetry
$ poetry install --no-dev
```

- Install with `pip`

```
$ pip install -r requirements.txt
```
If you want to modify the code, it's highly recommended to install both the basic and dev dependencies.

- Install with `poetry`

```
$ pip install poetry
$ poetry install
$ pre-commit install
```
Run the command below to make sure the `./` directory is in `PYTHONPATH`.

```
$ export PYTHONPATH=./
```

After doing this, make sure the working directory is xxx/GoodInfo/.
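To double-check the effect of this setting from inside Python, here is a small sketch (the helper function is mine, not part of the repo):

```python
import os
import sys

def repo_root_importable(paths, root="."):
    """Return True if the repo root is on the import path, which is the
    effect of `export PYTHONPATH=./` when run from the repo directory."""
    root_abs = os.path.abspath(root)
    # An empty string entry in sys.path also means "current directory".
    return any(os.path.abspath(p or ".") == root_abs for p in paths)

# With PYTHONPATH=./ exported, "src.*" modules resolve from the repo root.
print(repo_root_importable(sys.path))
```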
- Get entire basic information
  - Run with the command line interface

```
$ python src/entry_goodinfo.py \
    -o $(outfile).json \
    -c $(cachefile).json
```
Noted:

- Caution: GoodInfo may recognize the program as an automated robot and block its IP; therefore, connect to an iPhone hotspot to run the program. Once your IP address is blocked, turn flight mode on and then off, and re-open the hotspot to change your IP address (hotspot IP addresses are floating).
  - Run with the python package

```python
from src.crawler.goodinfo.goodinfo import get_code_name, get_basic

# get_code_name() returns one entry per listed company
data = get_code_name()
for d in data:
    stock_code = d["股票代號"]  # "stock code"
    d_basic = get_basic(stock_code)
```
- Get google news
  - Run with the command line interface
```
$ python src/crawler/googlenews/gnews.py \
    -iq $(input_query_file).txt \
    -o $(outfile).json \
    -c $(cachefile).json
```
Noted:

- `$(input_query_file).txt` contains queries for `src/crawler/googlenews/gnews.py`, one query per line. If a query contains multiple keywords, concatenate the keywords with spaces. For example:

```
台積電 2330
聯發科 2454
...
```
- `$(outfile)` and `$(cachefile)` could be the same or different. Both must be JSON files.
- The results should contain title, description, media, datetime, and link. See more details in `src/utils/struct::GoogleNewsStruct`.
- Caution: Google may recognize the program as an automated robot and block the IP; using a cloud server or fetching data at high frequency raises the chance of being blocked. Therefore, connect to an iPhone hotspot to run `gnews.py`. Once your IP address is blocked by Google, turn flight mode on and then off, and re-open the hotspot to change your IP address (hotspot IP addresses are floating). Besides, I use `time.sleep(random.uniform(30, 50))` between requests, which may lower the chance of getting the IP blocked. If blocked, you'll receive an `HTTP Error 420: Too many requests` message. Just follow the steps above and you can run the program again.
  - Run with the python package

```python
from src.crawler.googlenews.gnews import GoogleNewsCrawler

gnc = GoogleNewsCrawler()
results = gnc.getInfo(query="台積電 2330")  # TSMC; keywords separated by a space
```
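The sleep-between-requests advice above can be wrapped into a small retry loop. A sketch under my own naming (the `fetch` callable is injected; the repo's actual error handling may differ):

```python
import random
import time

def polite_fetch(fetch, queries, delay_range=(30, 50), max_retries=3):
    """Run fetch(query) for each query, pausing a random 30-50 seconds
    between requests to lower the chance of an HTTP 420 block, and
    retrying after a pause if a request still fails."""
    results = {}
    for query in queries:
        for attempt in range(max_retries):
            try:
                results[query] = fetch(query)
                break
            except Exception:
                if attempt == max_retries - 1:
                    raise  # still blocked: time to reset the hotspot IP
                time.sleep(random.uniform(*delay_range))
        time.sleep(random.uniform(*delay_range))  # polite gap before next query
    return results
```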
- Get news details
  - Run with the command line interface
```
$ python src/entry_media.py \
    -m $(media) \
    -l $(link) \
    -o $(outfile).json \
    -c $(cachefile).json
```
Noted:

- `$(link)` could be one or multiple links. If multiple, separate them with white space.
- `$(outfile)` and `$(cachefile)` could be the same or different. Both must be JSON files.
- The results should contain title, content, keywords, category, media, datetime, and link. See more details in `src/utils/struct::NewsStruct`.
- `$(media)` args could be as follows ("o" = supported, "x" = not supported):

  | media | supported | main url |
  | --- | --- | --- |
  | appledaily | o | 蘋果日報 |
  | bcc | o | 中國廣播公司 |
  | bnext | o | 數位時代<br>Meet創業小聚 |
  | chinatimes | o | 中時新聞網 |
  | cmmedia | o | 信傳媒 |
  | cna | o | 中央社 |
  | cnews | o | 匯流新聞網 |
  | ctee | o | 工商時報 |
  | ctitv | o | 中天新聞 |
  | cts | o | 華視新聞 |
  | ctv | x (audio-visual) | 中視新聞 |
  | cynes | o | 鉅亨網<br>鉅亨新聞網 |
  | digitimes | o | DigiTimes |
  | ebc | x (contents are locked) | 東森財經新聞 |
  | epochtimes | o | 大紀元 |
  | era | o | 年代新聞 |
  | ettoday | o | ETtoday新聞雲<br>ETtoday財經雲 |
  | ftv | o | 民視新聞 |
  | kairos | o | 風向新聞 |
  | ltn | o | 自由時報電子報<br>自由財經 |
  | mirror | o | 鏡週刊 |
  | moneydj | o | MoneyDJ理財網 |
  | moneyudn | o | 經濟日報 |
  | mypeoplevol | o | 民眾新聞網 |
  | newtalk | o | 新頭殼 |
  | nownews | o | 今日新聞 |
  | pchome | o | PChome新聞 |
  | peoplenews | o | 民報 |
  | pts | o | 公視新聞 |
  | rti | o | 中央廣播電臺 |
  | setn | o | 三立新聞 |
  | sina | o | 新浪新聞 |
  | storm | o | 風傳媒 |
  | taiwanhot | o | 台灣好新聞 |
  | taronews | o | 芋傳媒 |
  | technews | o | 科技新報<br>財經新報 |
  | thenewslens | o | 關鍵評論網 |
  | ttv | o | 台視新聞 |
  | tvbs | o | TVBS |
  | udn | o | 聯合新聞網 |
  | upmedia | o | 上報 |
  | ustv | o | 非凡新聞 |
  | wealth | o | 財訊 |
  | worldjournal | o | 世界新聞網 |
  | yahoo | o | Yahoo奇摩新聞<br>Yahoo奇摩股市 |
- Run the example code for a specific media, and it will generate `out/sample_$(media).json` locally.

```
$ make run_example media=$(media)
```
  - Run with the python package

```python
from src.crawler.media.factory import MediaNewsCrawlerFactory

nc = MediaNewsCrawlerFactory(media_name="appledaily")
result = nc.getInfo(link="xxxxxx")
```
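`MediaNewsCrawlerFactory` dispatches on `media_name` to pick a per-media crawler. A minimal self-contained sketch of that factory pattern (the class and function names here are hypothetical illustrations, not the repo's):

```python
class BaseNewsCrawler:
    """Common interface: every media crawler implements getInfo(link)."""
    def getInfo(self, link):
        raise NotImplementedError

class AppleDailyCrawler(BaseNewsCrawler):
    def getInfo(self, link):
        # a real crawler would fetch and parse the article page here
        return {"media": "appledaily", "link": link}

class UdnCrawler(BaseNewsCrawler):
    def getInfo(self, link):
        return {"media": "udn", "link": link}

# one registry entry per supported media key (see the table above)
_CRAWLERS = {"appledaily": AppleDailyCrawler, "udn": UdnCrawler}

def media_news_crawler_factory(media_name):
    """Look up the crawler class for a media key and instantiate it."""
    try:
        return _CRAWLERS[media_name]()
    except KeyError:
        raise ValueError(f"unsupported media: {media_name}")
```

Registering each crawler under its media key keeps `entry_media.py` free of per-media branching: adding a new outlet only means adding one class and one registry entry.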
- Test media crawler

```
$ make run_test_all
```
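`make run_test_all` runs the repo's crawler tests. For reference, a self-contained sketch of the kind of check such a test might perform on a crawl result (field names follow the `NewsStruct` fields listed above; the helper function is hypothetical):

```python
REQUIRED_FIELDS = ("title", "content", "keywords", "category",
                   "media", "datetime", "link")

def check_news_result(result):
    """Return the list of NewsStruct fields missing from a crawl result."""
    return [f for f in REQUIRED_FIELDS if f not in result]

sample = {
    "title": "t", "content": "c", "keywords": [], "category": "finance",
    "media": "udn", "datetime": "2021-01-01", "link": "https://example.com",
}
print(check_news_result(sample))  # an empty list means the result is complete
```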