KLUSTER
Kluster is a project that explores a different way to search for knowledge. It combines Natural Language Processing (NLP) techniques with clustering algorithms and presents the results through a graph-based search interface, aiming for a more decentralized and well-rounded search experience.
Introduction
Traditional search engines such as Google and Bing excel at returning results quickly using ranking algorithms such as PageRank, but they can make lesser-known yet significant information hard to reach.
Consider the query "the most renowned Asian actress of all time" on a conventional search engine. Unsurprisingly, names like Lucy Liu and Gong Li dominate the top results, potentially overshadowing historically important figures like Anna May Wong, the first Chinese American film star in Hollywood, who is buried deeper in the rankings. Sometimes, valuable insights might only appear as far down as the 26th position on the first search page.
My analysis has pinpointed two primary factors that contribute to the challenge of accessing critical marginal knowledge in such searches.
Firstly, the PageRank algorithm inherently centralizes popular information, resulting in a skewed distribution of search outcomes.
Secondly, the linear list format used on search pages gives results unequal exposure depending on their position, so items lower in the list are far less likely to be seen.
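The first factor can be illustrated with a minimal power-iteration sketch of PageRank. The toy link graph and page names below are hypothetical and only show how rank concentrates on a page that many others link to; this is not how any production search engine is implemented.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank on a dict {page: [pages it links to]}."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with the "random jump" share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outgoing in links.items():
            if not outgoing:
                # A page with no outgoing links spreads its rank evenly.
                for p in pages:
                    new_rank[p] += damping * rank[page] / n
            else:
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical toy graph: every niche page links to the popular page.
links = {
    "popular": ["niche_a", "niche_b"],
    "niche_a": ["popular"],
    "niche_b": ["popular"],
    "niche_c": ["popular"],
    "niche_d": ["popular"],
}
ranks = pagerank(links)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(f"{page}: {score:.3f}")
```

In this toy graph the popular page ends up with the largest share of rank, while pages that nothing links to (niche_c, niche_d) keep only the minimum share.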
With these considerations in mind, our project sets out to develop a design solution that promotes a more decentralized and equitable search experience. Kluster represents a step towards a future where information's richness is more fully accessible, allowing users to unearth hidden gems within the vast landscape of knowledge.
Design Solutions
Design Problems
- Limited exposure to diverse perspectives and alternative viewpoints during the search experience.
- Conventional information/knowledge ranking algorithms perpetuate centralized information.
- Information is presented in a linear and biased fashion.
Tech solution
To make search results less centralized, I chose a clustering algorithm, which groups items by the density of the relationships between them, rather than a ranking algorithm, which orders results by centralized priority.
Design Challenges
- Find and implement a new, decentralized way of organizing knowledge.
- Create a more impartial, non-linear experience of searching and viewing knowledge.
Visual Solution
Recognizing the inherent disparities in exposure coefficients stemming from a linear indexing format, I have opted for a graph-based presentation format for the search results.
Product Positioning
Under what circumstances could this search experience be beneficial? We do not propose this design as a replacement for traditional search engines. However, a graph-based decentralized search experience could serve as a valuable tool for users without a specific search goal, or those seeking exposure to a diverse range of information. For instance, consider a design student in search of a meaningful project topic, but one that avoids overly popular subjects.
Workflow
Text File Dataset
This text file dataset is an example for the topic "philosophy".
philosophy_and_the_social_problem.txt
the_consolation_of_philosophy.txt
initiation_into_philosophy.txt
High-Frequency Entities
13,000 frequent words (philosophy, Plato, human, morality, intelligence, ...)
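As a rough sketch of how a list of frequent words could be extracted, the snippet below counts lowercase word tokens with the standard library. The sample sentences and the `top_entities` helper are hypothetical; the project's actual extraction method is not described here, and a real run would read the text files listed above instead.

```python
import re
from collections import Counter


def top_entities(texts, limit):
    """Return the `limit` most frequent lowercase word tokens across texts."""
    counts = Counter()
    for text in texts:
        # Keep runs of letters only; punctuation and digits are dropped.
        counts.update(re.findall(r"[a-z]+", text.lower()))
    return [word for word, _ in counts.most_common(limit)]


# Hypothetical stand-in sentences, not taken from the dataset.
sample_texts = [
    "Philosophy asks how a human should live.",
    "Plato treats philosophy as the love of wisdom.",
    "Morality and philosophy both concern human conduct.",
]
print(top_entities(sample_texts, 2))  # ['philosophy', 'human']
```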
Language Model
We trained a language model (philosophy.kv) on the text file dataset. The model represents each word as a vector that captures how the word is used in the context of philosophy.
Example:
{ stories: [-3.468945 -0.08118384 -1.0659883 -1.6832864 -2.128204 -0.6112681 -1.1394693 -0.97605973 1.4505659 -0.15622564 2.0152066 0.7203029 -1.7843394 .... ],
human:[array...] }
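Once each word maps to a vector, relatedness between two words can be estimated from the angle between their vectors. The sketch below uses cosine similarity on hypothetical 3-dimensional toy vectors, not values from philosophy.kv. The .kv extension is commonly used for gensim KeyedVectors files, but which library produced this model is an assumption.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy embeddings; real vectors have many more dimensions.
embeddings = {
    "stories": [1.0, 0.0, 0.0],
    "human": [0.8, 0.6, 0.0],
    "ego": [0.0, 0.0, 1.0],
}

close = cosine_similarity(embeddings["stories"], embeddings["human"])
far = cosine_similarity(embeddings["stories"], embeddings["ego"])
print(close, far)
```

A value near 1 means the two words point in nearly the same direction in the vector space, and a value near 0 means they are unrelated in this representation.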
Embedded Dataset
Entities_1 : [Vectors]
Entities_2 : [Vectors]
Entities_3 : [Vectors]
Entities_4 : [Vectors]
Entities_5 : [Vectors]
Entities_6 : [Vectors]
This is the initial data-processing step, in which the text dataset is converted into a set of word vectors.
Model Training
In this step, we applied a clustering algorithm (DBSCAN) to group the embedded words.
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) finds core samples in regions of high density and expands clusters from them.
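The idea can be sketched without any dependencies as follows. This is a simplified illustration of DBSCAN on hypothetical 2-dimensional word vectors, not the project's training code. The parameter names `eps` and `min_samples` follow scikit-learn's `DBSCAN`, which would normally be used in practice.

```python
import math

NOISE = -1


def dbscan(points, eps, min_samples):
    """Label each point with a cluster id (0, 1, ...) or NOISE (-1)."""

    def neighbors(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    labels = [None] * len(points)
    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_samples:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        labels[i] = cluster_id  # i is a core sample: start a new cluster
        queue = list(seeds)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point joins, is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_samples:
                queue.extend(j_neighbors)  # j is also a core sample
        cluster_id += 1
    return labels


# Hypothetical toy vectors; the groupings are arbitrary, for illustration only.
words = ["plato", "philosophy", "morality", "human", "intelligence", "ego", "stories"]
vectors = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
           (5.0, 5.0), (5.1, 5.0), (5.0, 5.1), (10.0, 0.0)]

labels = dbscan(vectors, eps=0.5, min_samples=3)
clusters = {}
for word, label in zip(words, labels):
    clusters.setdefault(label, []).append(word)
print(clusters)
```

With these toy vectors the two dense groups become clusters 0 and 1, and the isolated word is labeled as noise (-1). Changing `eps` or `min_samples` changes which words end up grouped together.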
Figure: pyplot visualization of a 3,000-word sample of the 13,000-word dataset before clustering (perplexity 40).
Cluster Results
This example illustrates a portion of a cluster result obtained when appropriate parameters are applied. Words enclosed within braces {} are grouped together based on their distances in the high-dimensional vector space, indicating potential relationships within the context of "philosophy."
When varying parameters are employed, multiple results are generated. Within each result, words are grouped according to different potential relationships.
The figure below shows how the word "ego" is grouped with different words under different clustering parameters.
Customized Dataset
As a result, we have a customized search dataset, derived from training, in which words are clustered into groups, with a separate set of groups for each clustering result.
The following shows how a user's search interacts with the clustered dataset.
User Interaction
The searched word
1. Check database
2. Return results
Clustering
Output
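The lookup flow above can be sketched as a search over several stored clustering results. The data structure, the parameter labels, and the words grouped with "ego" below are all hypothetical placeholders, since the actual stored format and cluster contents are not given here.

```python
# Hypothetical clustered dataset: one list of word groups per parameter setting.
cluster_results = {
    "eps=0.3": [{"ego", "self"}, {"plato", "socrates"}],
    "eps=0.5": [{"ego", "self", "consciousness"}, {"plato", "socrates", "aristotle"}],
}


def search(word, results):
    """Return, per parameter setting, the other words clustered with `word`."""
    matches = {}
    for setting, clusters in results.items():
        for cluster in clusters:
            if word in cluster:
                # Exclude the searched word itself and sort for stable output.
                matches[setting] = sorted(cluster - {word})
                break
    return matches


print(search("ego", cluster_results))
print(search("unknown", cluster_results))  # word not in the dataset: {}
```

Returning one group per parameter setting mirrors the idea that the same searched word can surface different related words depending on which clustering result is consulted.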
Interface
search bar
slider to control display density
word calculation search
cluster result list display
a cluster of results
closer to center means closer relationship
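One way to realize "closer to center means closer relationship" is to place each related word at a radius that shrinks as its similarity to the searched word grows. The sketch below is an assumption about how such a layout could work, not the interface's actual code. The `radial_layout` helper and the similarity scores are hypothetical.

```python
import math


def radial_layout(similarities, max_radius=1.0):
    """Map {word: similarity in [0, 1]} to {word: (x, y)} around the origin."""
    # The searched word is assumed to sit at the origin (0, 0).
    words = sorted(similarities, key=similarities.get, reverse=True)
    positions = {}
    for k, word in enumerate(words):
        radius = max_radius * (1.0 - similarities[word])  # similar = closer
        angle = 2.0 * math.pi * k / len(words)  # spread words evenly around
        positions[word] = (radius * math.cos(angle), radius * math.sin(angle))
    return positions


# Hypothetical similarity scores between the searched word and its cluster.
similarities = {"self": 0.9, "mind": 0.6, "stories": 0.2}
positions = radial_layout(similarities)
for word, (x, y) in positions.items():
    print(f"{word}: ({x:.2f}, {y:.2f})")
```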