Hi everyone,
We (@reneedaos, k3nn.eth, and @snowy_coast) conducted some preliminary natural language processing (NLP) topic modeling analysis of the SCRF forum and would like to get feedback on how this approach could be further developed to benefit SCRF.
We pulled 3,703 posts from the SCRF website using a fork of the SourceCred Discourse plug-in, then removed punctuation, symbols, and common English stop words. A word cloud of the posts (4,191,546 words in total) depicts the frequency of common words:
We ran two unsupervised topic modeling analyses in Python, each with the number of latent topics set to 10: a Latent Dirichlet Allocation (LDA) model based on the gensim package, and a model based on the LDA Mallet package. The LDA Mallet model obtained better topic coherence (c_v score of 0.71 vs. 0.36 for gensim). The topic distances obtained from LDA Mallet are visualized below:
Skimming over the latent topics (numbered from 0 in the output), it looks like the second (topic 1) is about DeFi, the fourth (topic 3) is about central bank digital currency, the fifth (topic 4) is about community culture, the sixth (topic 5) is about security, etc. Each number is the weight (probability) of the term within that topic:
[(0,
  '0.013*"aside" + 0.013*"avatar" + 0.011*"lazy" + 0.010*"quote" + '
  '0.009*"store" + 0.009*"commodity" + 0.009*"title" + 0.009*"address" + '
  '0.008*"wallet" + 0.007*"private"'),
 (1,
  '0.030*"dollars" + 0.024*"span" + 0.021*"market" + 0.014*"liquidity" + '
  '0.014*"use" + 0.014*"true" + 0.013*"tokens" + 0.012*"function" + '
  '0.012*"strong" + 0.012*"fa"'),
 (2,
  '0.026*"investment" + 0.019*"speculative" + 0.018*"value" + 0.015*"logical" '
  '+ 0.013*"completely" + 0.011*"opportunities" + 0.008*"seem" + 0.008*"tied" '
  '+ 0.007*"may" + 0.007*"system"'),
 (3,
  '0.074*"currency" + 0.059*"money" + 0.045*"naira" + 0.035*"government" + '
  '0.022*"nigerian" + 0.022*"national" + 0.017*"flash" + 0.016*"value" + '
  '0.013*"loan" + 0.013*"maximize"'),
 (4,
  '0.018*"mention" + 0.017*"people" + 0.017*"think" + 0.015*"want" + '
  '0.014*"good" + 0.012*"community" + 0.011*"culture" + 0.011*"like" + '
  '0.010*"research" + 0.009*"looking"'),
 (5,
  '0.014*"strong" + 0.014*"security" + 0.012*"contracts" + 0.012*"network" + '
  '0.010*"paper" + 0.010*"privacy" + 0.008*"data" + 0.007*"users" + '
  '0.007*"transactions" + 0.006*"authors"'),
 (6,
  '0.019*"math" + 0.018*"challenges" + 0.016*"election" + 0.015*"big" + '
  '0.010*"fundamental" + 0.010*"em" + 0.007*"knowledge" + 0.007*"test" + '
  '0.006*"game" + 0.006*"cold"'),
 (7,
  '0.019*"data" + 0.016*"country" + 0.010*"services" + 0.008*"outside" + '
  '0.008*"decentralized" + 0.008*"ideal" + 0.007*"enter" + '
  '0.007*"conversation" + 0.007*"shared" + 0.007*"government"'),
 (8,
  '0.024*"transaction" + 0.019*"transactions" + 0.013*"energy" + 0.012*"block" '
  '+ 0.012*"fee" + 0.012*"inflation" + 0.011*"repeating" + 0.010*"rate" + '
  '0.009*"synthetic" + 0.009*"nodes"'),
 (9,
  '0.048*"currencies" + 0.028*"strong" + 0.023*"governance" + 0.017*"country" '
  '+ 0.015*"anchor" + 0.015*"digital" + 0.013*"gdp" + 0.009*"amp" + '
  '0.009*"control" + 0.008*"technology"')]
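For anyone who wants to work with this output programmatically, here is a minimal sketch of how one topic string could be turned into ranked (term, weight) pairs. It assumes the gensim-style `weight*"term"` format with straight double quotes; the names `PAIR` and `parse_topic` are our own illustrative choices, not part of any package.

```python
import re

# Matches one weight*"term" pair in a gensim-style topic string.
# PAIR and parse_topic are hypothetical helper names for this sketch.
PAIR = re.compile(r'(\d*\.\d+)\*"([^"]+)"')


def parse_topic(topic_string):
    """Return [(term, weight), ...] sorted by descending weight."""
    pairs = [(term, float(weight)) for weight, term in PAIR.findall(topic_string)]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


# Topic 3 from the output above, joined into a single string.
topic_3 = (
    '0.074*"currency" + 0.059*"money" + 0.045*"naira" + 0.035*"government" + '
    '0.022*"nigerian" + 0.022*"national" + 0.017*"flash" + 0.016*"value" + '
    '0.013*"loan" + 0.013*"maximize"'
)

terms = parse_topic(topic_3)
top_three = [term for term, _ in terms[:3]]
# Total weight covered by the ten listed terms; the rest of the topic's
# probability mass is spread over the remaining vocabulary.
listed_mass = sum(weight for _, weight in terms)

print(top_three)
print(round(listed_mass, 3))
```

The ten listed terms cover only part of each topic's probability mass, which is one reason the top words alone can be an incomplete guide to what a topic is about.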
This type of NLP analysis could be extended in many ways that would potentially benefit the users of SCRF. For example, we could:
- Create co-occurrence matrices of latent topics with metadata such as category tags, and visualize the number of likes, to help us better understand which topics and categories are driving engagement on the forum.
- Identify common noun and verb phrases to get a better idea of the precise ideas that are central to engagement in SCRF.
- Fine-tune the latent topic modeling to determine the optimal number of latent topics; we set the number of latent topics in the above models to 10, but the optimal number (as determined by a measure such as coherence) could be quite different.
- Convert the unsupervised topics into human-readable labels by feeding the top words of each topic to GPT-3.
- Employ newer topic modeling packages like BERTopic, which support guided topic modeling, to get better topic suggestions.
- Pursue the development of commercial software such as a DAO governance tool, knowledge graph management system, and research chat bot.
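To make the first idea in the list more concrete, here is a rough sketch of a topic-by-category co-occurrence table with like counts. It assumes each post has already been assigned a single dominant topic and carries a category tag and a like count. The records, the tag names, and the `cooccurrence` helper are all hypothetical placeholders; none of the values come from the SCRF data.

```python
from collections import Counter, defaultdict

# Hypothetical records: (category tag, dominant topic id, number of likes).
# These values are invented for illustration and are NOT from the SCRF forum.
posts = [
    ("defi", 1, 4),
    ("defi", 1, 2),
    ("defi", 8, 1),
    ("privacy", 5, 3),
    ("community", 4, 6),
    ("community", 4, 0),
]


def cooccurrence(records):
    """Count posts and sum likes for every (category, topic) pair."""
    counts = Counter()
    likes = defaultdict(int)
    for category, topic, n_likes in records:
        counts[(category, topic)] += 1
        likes[(category, topic)] += n_likes
    return counts, dict(likes)


counts, likes = cooccurrence(posts)

# Print a small category-by-topic table; each cell is "posts/likes".
categories = sorted({category for category, _, _ in posts})
topics = sorted({topic for _, topic, _ in posts})
print("category".ljust(10) + "".join(f"t{topic}".rjust(6) for topic in topics))
for category in categories:
    cells = (
        f"{counts[(category, topic)]}/{likes.get((category, topic), 0)}".rjust(6)
        for topic in topics
    )
    print(category.ljust(10) + "".join(cells))
```

In practice a post is a mixture of topics, so a fuller version might weight each cell by the post's topic probabilities rather than using one dominant topic per post.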
We welcome collaboration as well as input from the wonderful folks here to suggest fruitful avenues for NLP research and applications that would benefit SCRF. We are particularly interested in joint ventures with the SCRF community to build out data analysis tools and products for DAO tooling.