This project applied several text analytics techniques and algorithms to identify a) emerging industries and b) emerging technologies used in those industries. The approach can be summarized as follows:
Preprocessing using regular expressions (regexpr)
Benchmarking different preprocessing assumptions
Corpus creation and boundary testing
LDA training
Identifying emerging topics by emergence analysis
Verification of LDA approach with test/train KWIC analysis
Using bigrams to identify specific complicated industries
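Three of the steps above (regular-expression preprocessing, bigram extraction, and KWIC lookup) can be illustrated with a minimal sketch. It is written in Python using only the standard library, which may not be the toolchain the project used. The sample texts, the function names (`preprocess`, `bigrams`, `kwic`), and the cleaning rule (keep only letters) are assumptions made for illustration, not the project's actual code or data.

```python
import re
from collections import Counter

# Hypothetical toy company descriptions; not data from the project.
DOCS = [
    "Acme Robotics builds machine learning software for warehouse robots.",
    "We offer machine learning consulting & cloud hosting (since 2015).",
]

def preprocess(text):
    """Lowercase, replace every non-letter character with a space, and tokenize."""
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    return text.split()

def bigrams(tokens):
    """Return all pairs of adjacent tokens."""
    return list(zip(tokens, tokens[1:]))

def kwic(tokens, keyword, window=2):
    """Return (left context, keyword, right context) for each keyword hit."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            hits.append((left, tok, right))
    return hits

tokenized = [preprocess(d) for d in DOCS]

# Count bigrams across the whole corpus (assumed use: spotting multi-word terms).
bigram_counts = Counter(bg for toks in tokenized for bg in bigrams(toks))

# Keyword-in-context lines for one term, to inspect how it is used.
learning_hits = [hit for toks in tokenized for hit in kwic(toks, "learning")]
```

Frequent bigrams such as `("machine", "learning")` point to multi-word terms that single-word topics may miss, and the KWIC output shows each keyword with its neighbouring words so that a topic's interpretation can be checked by reading actual usage.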
A presentation of the results (19 November 2019) with more details can be downloaded here:
Training LDA with 75 topics and assigning the topics to individual companies by gamma (the per-document topic probability) led to 7 emerging topics: