Proposal Exploration: NLP Analysis of SCRF Forum

snowy_coast · December 15, 2022, 3:32pm

Hi everyone,

We (@reneedaos, k3nn.eth, and @snowy_coast) conducted some preliminary natural language processing (NLP) topic modeling analysis of the SCRF forum and would like to get feedback on how this approach could be further developed to benefit SCRF.

We pulled 3703 posts from the SCRF website using a fork of the Sourcecred Discourse plug-in and removed punctuation, symbols, and common English “stop-words”. A word cloud of the posts (total of 4,191,546 words) depicts the frequency of common words in the posts:

[Image: word cloud of SCRF forum posts, 2022-12-13]
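
The cleaning and counting step behind a word cloud like this can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not necessarily the pipeline used for the analysis; the stop-word list and the sample posts are assumptions made up for the example.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption); a real run would use a
# fuller list such as the ones shipped with common NLP packages.
STOP_WORDS = {"a", "an", "and", "are", "is", "it", "of", "on", "the", "to", "we"}

def tokenize(post):
    """Lowercase a post, drop punctuation and symbols, remove stop words."""
    words = re.findall(r"[a-z]+", post.lower())
    return [w for w in words if w not in STOP_WORDS]

def word_frequencies(posts):
    """Count how often each remaining word appears across all posts."""
    counts = Counter()
    for post in posts:
        counts.update(tokenize(post))
    return counts

# Hypothetical sample posts, for illustration only.
sample_posts = [
    "Liquidity is the key to DeFi markets!",
    "We discuss liquidity, governance, and security.",
]
freqs = word_frequencies(sample_posts)
print(freqs.most_common(3))
```

The resulting counts are the kind of input a word cloud renderer would take, with more frequent words drawn larger.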

We ran two different unsupervised topic modeling analyses in Python, a Latent Dirichlet Allocation (LDA) model based on the gensim package, and a model based on the LDA Mallet package, with the number of latent topics set at 10. The LDA Mallet package obtained better topic coherence (c_v score of 0.71 vs. 0.36). The topic distances obtained from LDA Mallet are visualized below:

Skimming over the latent topics, it looks like the second is about DeFi, the fourth is about central bank digital currency, the fifth is about community culture, the sixth is about security, etc. The numbers are the weights (probabilities) of each term within a given topic:

[(0,
  '0.013*"aside" + 0.013*"avatar" + 0.011*"lazy" + 0.010*"quote" + '
  '0.009*"store" + 0.009*"commodity" + 0.009*"title" + 0.009*"address" + '
  '0.008*"wallet" + 0.007*"private"'),
 (1,
  '0.030*"dollars" + 0.024*"span" + 0.021*"market" + 0.014*"liquidity" + '
  '0.014*"use" + 0.014*"true" + 0.013*"tokens" + 0.012*"function" + '
  '0.012*"strong" + 0.012*"fa"'),
 (2,
  '0.026*"investment" + 0.019*"speculative" + 0.018*"value" + 0.015*"logical" '
  '+ 0.013*"completely" + 0.011*"opportunities" + 0.008*"seem" + 0.008*"tied" '
  '+ 0.007*"may" + 0.007*"system"'),
 (3,
  '0.074*"currency" + 0.059*"money" + 0.045*"naira" + 0.035*"government" + '
  '0.022*"nigerian" + 0.022*"national" + 0.017*"flash" + 0.016*"value" + '
  '0.013*"loan" + 0.013*"maximize"'),
 (4,
  '0.018*"mention" + 0.017*"people" + 0.017*"think" + 0.015*"want" + '
  '0.014*"good" + 0.012*"community" + 0.011*"culture" + 0.011*"like" + '
  '0.010*"research" + 0.009*"looking"'),
 (5,
  '0.014*"strong" + 0.014*"security" + 0.012*"contracts" + 0.012*"network" + '
  '0.010*"paper" + 0.010*"privacy" + 0.008*"data" + 0.007*"users" + '
  '0.007*"transactions" + 0.006*"authors"'),
 (6,
  '0.019*"math" + 0.018*"challenges" + 0.016*"election" + 0.015*"big" + '
  '0.010*"fundamental" + 0.010*"em" + 0.007*"knowledge" + 0.007*"test" + '
  '0.006*"game" + 0.006*"cold"'),
 (7,
  '0.019*"data" + 0.016*"country" + 0.010*"services" + 0.008*"outside" + '
  '0.008*"decentralized" + 0.008*"ideal" + 0.007*"enter" + '
  '0.007*"conversation" + 0.007*"shared" + 0.007*"government"'),
 (8,
  '0.024*"transaction" + 0.019*"transactions" + 0.013*"energy" + 0.012*"block" '
  '+ 0.012*"fee" + 0.012*"inflation" + 0.011*"repeating" + 0.010*"rate" + '
  '0.009*"synthetic" + 0.009*"nodes"'),
 (9,
  '0.048*"currencies" + 0.028*"strong" + 0.023*"governance" + 0.017*"country" '
  '+ 0.015*"anchor" + 0.015*"digital" + 0.013*"gdp" + 0.009*"amp" + '
  '0.009*"control" + 0.008*"technology"')]
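
To work with these weights programmatically, the topic strings can be parsed back into (term, weight) pairs. The sketch below is a minimal illustration that assumes the strings use plain straight quotes; `parse_topic` is a hypothetical helper name, not part of gensim. Note that only the top ten terms per topic are listed here, so the listed weights do not sum to 1.

```python
import re

def parse_topic(topic_string):
    """Turn '0.030*"dollars" + 0.024*"span"' into [("dollars", 0.03), ...]."""
    pairs = re.findall(r'([0-9.]+)\*"([^"]+)"', topic_string)
    return [(word, float(weight)) for weight, word in pairs]

# First three terms of topic 3 from the listing above.
topic_3 = '0.074*"currency" + 0.059*"money" + 0.045*"naira"'
terms = parse_topic(topic_3)
for word, weight in terms:
    print(f"{word:10s} {weight:.3f}")
```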

This type of NLP analysis could be extended in many ways that would potentially benefit the users of SCRF. For example, we could:

Create co-occurrence matrices of latent topics with metadata like category tags, and visualize the number of likes, to help us better understand which topics and categories are driving engagement on the forum.
Identify common noun and verb phrases to get a better idea of the precise ideas that are central to engagement in SCRF.
Fine-tune the latent topic modeling to determine the optimal number of latent topics; we set the number of latent topics in the above models to 10, but the optimal number (as determined by a measure such as coherence) could be quite different.
Convert the unsupervised topic clouds to real topics by feeding the top words in each topic to GPT-3 to provide better human readability.
Employ newer topic modeling packages like BERTopic that allow guided topic modeling to get better topic suggestions.
Pursue the development of commercial software such as a DAO governance tool, knowledge graph management system, and research chat bot.
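
As a rough illustration of the first idea in the list above, here is a minimal sketch of a topic-by-category co-occurrence count with average likes per topic. The records are hypothetical: in practice the dominant topic id would come from the fitted model, and the category tag and like count from the forum export.

```python
from collections import Counter, defaultdict

# Hypothetical records: (dominant topic id, category tag, number of likes).
# All values are made up for illustration.
posts = [
    (1, "DeFi", 12),
    (1, "DeFi", 3),
    (3, "Governance", 7),
    (5, "Privacy", 4),
    (5, "DeFi", 9),
]

def cooccurrence(records):
    """Count posts per (topic, category) pair."""
    return Counter((topic, category) for topic, category, _ in records)

def mean_likes_by_topic(records):
    """Average likes per topic as a rough engagement signal."""
    totals = defaultdict(lambda: [0, 0])  # topic -> [like sum, post count]
    for topic, _, likes in records:
        totals[topic][0] += likes
        totals[topic][1] += 1
    return {topic: s / n for topic, (s, n) in totals.items()}

matrix = cooccurrence(posts)
likes = mean_likes_by_topic(posts)
for (topic, category), count in sorted(matrix.items()):
    print(f"topic {topic} x {category}: {count} post(s)")
print(likes)
```

Assigning each post a single dominant topic is a simplification; a fuller version could weight each post by its full topic distribution.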

We welcome collaboration as well as input from the wonderful folks here to suggest fruitful avenues for NLP research and applications that would benefit SCRF. We are particularly interested in joint ventures with the SCRF community to build out data analysis tools and products for DAO tooling.

reneedaos · December 15, 2022, 3:54pm

Thank you @snowy_coast for posting! I’m very excited about this project and the potential it has for both the SCRF and talentDAO community.

I look forward to everyone’s comments and feedback.

Please do share your thoughts on:

What research questions do you have about this data? What can we learn from analyzing the SCRF forum?
What products/services could be built with this analysis?
What DAO governance tooling might we build from this?

Amazingdez · December 15, 2022, 8:12pm

It would be really nice if all your ideas on this project came through as planned. The possibilities it holds for the talentDAO and SCRF communities are just exciting. Nice work, @snowy_coast.

Yeoriton56 · December 16, 2022, 6:00am

Hi @snowy_coast, thanks for this post. It will be valuable to the community.

As BERTopic is new to me, I had to research it. I found out that, on Google, BERTopic can be used to detect a searcher’s intent even before they type the full query into the search box. A sort of autocomplete.

I don’t know if BERTopic will serve the same function here on SCRF. If so, I believe it will give easier access to sources of information in the forum.

Reading the post quoted above, I have two questions about NLP:

Will NLP be used on the forum for correction of phrases, spelling, etc.?
Can NLP be used to detect plagiarized content on the forum?

katerinabohlec · December 20, 2022, 6:42pm

Interesting work. You can also analyze “expertise” by creating a two-mode matrix (LDA topic - contributor) and through this see how many generalists and specialists exist in the community. Of course there are limitations (contributing != expertise). If this sounds interesting, I can dig up the paper + Python package for it.
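
One possible way to sketch the two-mode (topic - contributor) idea is to count each contributor's posts per topic and score how spread out they are. The use of Shannon entropy here is an assumption for illustration, not necessarily the method in the paper or package mentioned above, and the contributor names and topic ids are made up.

```python
import math
from collections import Counter, defaultdict

# Hypothetical (contributor, dominant topic id) pairs, one per post.
posts = [
    ("alice", 1), ("alice", 1), ("alice", 1), ("alice", 1),
    ("bob", 1), ("bob", 3), ("bob", 5), ("bob", 8),
]

def contributor_topic_matrix(records):
    """Build the two-mode matrix as contributor -> Counter of topic ids."""
    matrix = defaultdict(Counter)
    for contributor, topic in records:
        matrix[contributor][topic] += 1
    return matrix

def topic_entropy(topic_counts):
    """Shannon entropy (bits) of one contributor's topic distribution."""
    total = sum(topic_counts.values())
    probs = [n / total for n in topic_counts.values()]
    return sum(-p * math.log2(p) for p in probs)

matrix = contributor_topic_matrix(posts)
for name, counts in matrix.items():
    print(name, round(topic_entropy(counts), 2))
```

Low entropy suggests a contributor concentrated on few topics (a "specialist" in this rough sense) and high entropy suggests a "generalist", with the same caveat as above that contributing is not the same as expertise.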

Raphking · December 22, 2022, 11:47am

@snowy_coast I must commend you for this wonderful analysis.

The proposed NLP extensions will be useful for effective discussion on this platform.

snowy_coast · December 24, 2022, 5:22am

Hi @yeoriton56,

"Will NLP be used on the forum for correction of phrases, spelling, etc.?

Can NLP be used to detect plagiarized content on the forum?"

We hadn’t looked into these particular use cases but would be happy to explore them if there’s interest.

snowy_coast · December 24, 2022, 5:25am

Hi @katerinabohlec,

Yes, that sounds intriguing. Please forward the relevant paper and Python package when you have a chance, thanks!

Ulysses · December 24, 2022, 11:25am

@snowy_coast, I’m also interested in seeing how this works out. It will be nice to see how it plays out here at SCRF. I hope my vote counts towards helping you consider adding it☺.
