Proposal Exploration: NLP Analysis of SCRF Forum

Hi everyone,

We (@reneedaos, k3nn.eth, and @snowy_coast) conducted some preliminary natural language processing (NLP) topic modeling analysis of the SCRF forum and would like to get feedback on how this approach could be further developed to benefit SCRF.

We pulled 3703 posts from the SCRF website using a fork of the Sourcecred Discourse plug-in and removed punctuation, symbols, and common English “stop-words”. A word cloud of the posts (total of 4,191,546 words) depicts the frequency of common words in the posts:

[Image: word cloud of SCRF forum posts, 2022-12-13]
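
For anyone who wants to try the cleaning step, here is a minimal sketch of the kind of preprocessing described above: lowercase the text, drop punctuation and symbols, remove stop words, then count word frequencies. The stop-word list and the sample posts are made up for illustration; this is not the exact pipeline we ran.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption); a real run would use a
# fuller list such as the ones shipped with NLTK or spaCy.
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it",
              "for", "on", "that", "this", "with"}


def tokenize(post: str) -> list[str]:
    """Lowercase a post, keep alphabetic runs only, and drop stop words."""
    words = re.findall(r"[a-z]+", post.lower())
    return [w for w in words if w not in STOP_WORDS]


def word_frequencies(posts: list[str]) -> Counter:
    """Count how often each remaining word occurs across all posts."""
    counts = Counter()
    for post in posts:
        counts.update(tokenize(post))
    return counts


# Hypothetical sample posts standing in for the scraped forum data.
sample_posts = [
    "Liquidity in DeFi markets is a function of the tokens locked.",
    "Is privacy on the network compatible with transparent transactions?",
    "DeFi liquidity and privacy: the trade-offs.",
]
freqs = word_frequencies(sample_posts)
print(freqs.most_common(3))
```

The resulting counts are what a word cloud library would take as input to size each word.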

We ran two unsupervised topic modeling analyses in Python: a Latent Dirichlet Allocation (LDA) model from the gensim package and a model based on the LDA Mallet package, each with the number of latent topics set to 10. The LDA Mallet model obtained better topic coherence (c_v score of 0.71 vs. 0.36). The topic distances obtained from LDA Mallet are visualized below:

Skimming over the latent topics, it looks like the second is about DeFi, the fourth is about central bank digital currency, the fifth is about community culture, the sixth is about security, etc. The numbers are each term’s weight (probability) within its topic:

0.013*"aside" + 0.013*"avatar" + 0.011*"lazy" + 0.010*"quote" + 0.009*"store" + 0.009*"commodity" + 0.009*"title" + 0.009*"address" + 0.008*"wallet" + 0.007*"private"
0.030*"dollars" + 0.024*"span" + 0.021*"market" + 0.014*"liquidity" + 0.014*"use" + 0.014*"true" + 0.013*"tokens" + 0.012*"function" + 0.012*"strong" + 0.012*"fa"
0.026*"investment" + 0.019*"speculative" + 0.018*"value" + 0.015*"logical" + 0.013*"completely" + 0.011*"opportunities" + 0.008*"seem" + 0.008*"tied" + 0.007*"may" + 0.007*"system"
0.074*"currency" + 0.059*"money" + 0.045*"naira" + 0.035*"government" + 0.022*"nigerian" + 0.022*"national" + 0.017*"flash" + 0.016*"value" + 0.013*"loan" + 0.013*"maximize"
0.018*"mention" + 0.017*"people" + 0.017*"think" + 0.015*"want" + 0.014*"good" + 0.012*"community" + 0.011*"culture" + 0.011*"like" + 0.010*"research" + 0.009*"looking"
0.014*"strong" + 0.014*"security" + 0.012*"contracts" + 0.012*"network" + 0.010*"paper" + 0.010*"privacy" + 0.008*"data" + 0.007*"users" + 0.007*"transactions" + 0.006*"authors"
0.019*"math" + 0.018*"challenges" + 0.016*"election" + 0.015*"big" + 0.010*"fundamental" + 0.010*"em" + 0.007*"knowledge" + 0.007*"test" + 0.006*"game" + 0.006*"cold"
0.019*"data" + 0.016*"country" + 0.010*"services" + 0.008*"outside" + 0.008*"decentralized" + 0.008*"ideal" + 0.007*"enter" + 0.007*"conversation" + 0.007*"shared" + 0.007*"government"
0.024*"transaction" + 0.019*"transactions" + 0.013*"energy" + 0.012*"block" + 0.012*"fee" + 0.012*"inflation" + 0.011*"repeating" + 0.010*"rate" + 0.009*"synthetic" + 0.009*"nodes"
0.048*"currencies" + 0.028*"strong" + 0.023*"governance" + 0.017*"country" + 0.015*"anchor" + 0.015*"digital" + 0.013*"gdp" + 0.009*"amp" + 0.009*"control" + 0.008*"technology"
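
If you want to work with these topic strings programmatically, a small parser is enough. The sketch below turns one `weight*"term"` string into (term, weight) pairs; the function name is our own, and the sample string is the first four terms of the fourth topic above. It accepts both straight and typographic quotes, since the forum may have converted them.

```python
import re

# Matches entries such as 0.074*"currency"; straight or curly double quotes.
TERM_PATTERN = re.compile(r'(\d+\.\d+)\*["“]([^"”]+)["”]')


def parse_topic(topic: str) -> list[tuple[str, float]]:
    """Return (term, weight) pairs in the order they appear in the string."""
    return [(term, float(weight)) for weight, term in TERM_PATTERN.findall(topic)]


# First four terms of the fourth topic in the listing above.
topic_four = '0.074*"currency" + 0.059*"money" + 0.045*"naira" + 0.035*"government"'
pairs = parse_topic(topic_four)

# The heaviest term is a quick (if crude) hint at what the topic is about.
top_term, top_weight = max(pairs, key=lambda pair: pair[1])
print(top_term, top_weight)
```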

This type of NLP analysis could be extended in many ways that would potentially benefit the users of SCRF. For example, we could:

  • Create co-occurrence matrices of latent topics with metadata such as category tags, and visualize the number of likes, to help us better understand which topics and categories are driving engagement on the forum.
  • Identify common noun and verb phrases to get a better idea of the precise ideas that are central to engagement in SCRF.
  • Fine-tune the latent topic modeling to determine the optimal number of latent topics; we set the number of latent topics in the above models to 10, but the optimal number (as determined by a measure such as coherence) could be quite different.
  • Convert the unsupervised topic word lists into human-readable topic labels by feeding the top words of each topic to GPT-3.
  • Employ newer topic modeling packages like BERTopic that allow guided topic modeling to get better topic suggestions.
  • Pursue the development of commercial software such as a DAO governance tool, knowledge graph management system, and research chat bot.
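
To make the idea of tuning the number of topics a bit more concrete, here is a minimal sketch of a coherence sweep. It assumes gensim is installed and that `texts` is a list of tokenized posts (a list of token lists). The function names and the `passes` and `seed` defaults are placeholders of ours, not the settings behind the results above, and the sketch covers the plain gensim LDA model only, not the Mallet variant.

```python
def coherence_by_topic_count(texts, candidate_counts, passes=5, seed=0):
    """Train one gensim LDA model per candidate topic count.

    Returns a dict mapping num_topics to its c_v coherence score.
    """
    # Imported here so the helper below stays usable without gensim.
    from gensim.corpora import Dictionary
    from gensim.models import CoherenceModel, LdaModel

    dictionary = Dictionary(texts)
    corpus = [dictionary.doc2bow(tokens) for tokens in texts]

    scores = {}
    for k in candidate_counts:
        lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=k,
                       passes=passes, random_state=seed)
        coherence = CoherenceModel(model=lda, texts=texts,
                                   dictionary=dictionary, coherence="c_v")
        scores[k] = coherence.get_coherence()
    return scores


def best_topic_count(scores):
    """Pick the topic count with the highest coherence.

    Ties go to the smaller count, on the assumption that fewer topics
    are easier to interpret.
    """
    return min(scores, key=lambda k: (-scores[k], k))
```

A call might look like `best_topic_count(coherence_by_topic_count(texts, range(5, 31, 5)))`. Coherence is only one criterion, so it would still be worth reading the top words of the winning model before settling on a number.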

We welcome collaboration as well as input from the wonderful folks here to suggest fruitful avenues for NLP research and applications that would benefit SCRF :slight_smile: We are particularly interested in joint ventures with the SCRF community to build out data analysis tools and products for DAO tooling :slight_smile:


Thank you @snowy_coast for posting! I’m very excited about this project and the potential it has for both the SCRF and talentDAO communities.

I look forward to everyone’s comments and feedback :slightly_smiling_face:

Please do share your thoughts on:

  1. What research questions do you have about this data? What can we learn from analyzing the SCRF forum?

  2. What products/services could be built with this analysis?

  3. What DAO governance tooling might we build from this?


It would be really nice if all your ideas for this project came through as planned… the possibilities it holds for the talentDAO and SCRF communities are just exciting… Nice work @snowy_coast


Hi @snowy_coast, thanks for this post. It will be valuable to the community.

As BERTopic is new to me, I had to research it. I found that, on Google, BERTopic can be used to detect a searcher’s intent even before they finish typing their query into the search box. A sort of autocomplete.

I don’t know if BERTopic will serve the same function here on SCRF. If so, I believe it will make it easier to find information in the forum.

Reading the post quoted above, I have two questions about NLP:

  1. Will NLP be used on the forum for correcting phrases, spelling, etc.?

  2. Can NLP be used to detect plagiarized content on the forum?


Interesting work. You can also analyze “expertise” by creating a two-mode matrix (LDA topic - contributor) and using it to see how many generalists and specialists exist in the community. Of course there are limitations (contributing != expertise). If this sounds interesting, I can dig up the paper + Python package for it.
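
A rough sketch of what such a two-mode matrix could look like in practice, assuming each post has already been assigned a dominant topic. The contributor names and data are made up, and using normalized entropy as a generalist/specialist score is just one illustrative choice, not necessarily the method from the paper.

```python
import math
from collections import Counter, defaultdict


def contributor_topic_matrix(posts):
    """posts: iterable of (contributor, topic_id) pairs.

    Returns {contributor: Counter({topic_id: number_of_posts})}.
    """
    matrix = defaultdict(Counter)
    for contributor, topic_id in posts:
        matrix[contributor][topic_id] += 1
    return dict(matrix)


def topic_spread(topic_counts, num_topics):
    """Normalized Shannon entropy of a contributor's topic counts.

    0 means every post is in one topic (specialist); 1 means posts are
    spread evenly over all topics (generalist). Requires num_topics > 1.
    """
    total = sum(topic_counts.values())
    shares = [n / total for n in topic_counts.values() if n]
    entropy = sum(p * math.log(1 / p) for p in shares)
    return entropy / math.log(num_topics)


# Hypothetical (contributor, dominant topic) pairs, one per post.
posts = [("alice", 1), ("alice", 1), ("alice", 1),
         ("bob", 0), ("bob", 1), ("bob", 2), ("bob", 3)]
matrix = contributor_topic_matrix(posts)
for name, counts in matrix.items():
    print(name, round(topic_spread(counts, num_topics=4), 2))
```

As noted, a high or low score only describes where someone posts, not how much they know.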


@snowy_coast I must commend you for this wonderful analysis

The NLP extensions proposed above will be useful for effective discussion on this platform.


Hi @yeoriton56,

"Will NLP be used on the Forum for correction of phrases and spellings, etc

Can NLP be used to detect plagiarized contents on the forum?"

We hadn’t looked into these particular use cases but would be happy to explore them if there’s interest :grin:


Hi @katerinabohlec,

Yes, that sounds intriguing. Please forward the relevant paper and Python package when you have a chance, thanks!


@snowy_coast, I’m also interested in seeing how this works out. It will be nice to see how it plays out here at SCRF. I hope my vote counts towards helping you consider adding it☺.
