KLUSTER

Kluster is a project focused on improving the way we search for knowledge. By combining Natural Language Processing (NLP) techniques with clustering algorithms, we aim to offer a more nuanced exploration of information. Paired with a graph-based search interface, Kluster strives to provide a decentralized and well-rounded search experience.


Introduction

While traditional search engines like Google and Bing excel at providing rapid results through algorithms such as PageRank, there can be limitations in accessing lesser-known, yet significant information.

Consider the query "the most renowned Asian actress of all time" on a conventional search engine. It's no surprise that names like Lucy Liu and Gong Li dominate the top results, potentially overshadowing historically important figures like Anna May Wong, the first Chinese actress in Hollywood, buried deeper in the rankings. Sometimes, valuable insights might only appear as far down as the 26th position on the first search page.

My analysis has pinpointed two primary factors that contribute to the challenge of accessing critical marginal knowledge in such searches.

Firstly, the PageRank algorithm inherently centralizes popular information, resulting in a skewed distribution of search outcomes.

Secondly, the linear indexing format used on search pages gives unequal exposure to different pieces of information: the further down the list an item sits, the less likely it is to be seen.

With these considerations in mind, our project sets out to develop a design solution that promotes a more decentralized and equitable search experience. Kluster represents a step towards a future where information's richness is more fully accessible, allowing users to unearth hidden gems within the vast landscape of knowledge.

Design Solutions

Design Problems

Limited exposure to diverse perspectives and alternative viewpoints during the search experience.
Conventional information/knowledge ranking algorithms perpetuate centralized information.
Information is presented in a linear and biased fashion.

Tech solution

To steer the search algorithm toward decentralized results, I have opted for a clustering algorithm, which groups information by the density of its relationships, rather than a ranking algorithm, which orders results by centralized priority.

Design Challenges

Find and implement a new way of organizing knowledge in a decentralized manner
Create a more impartial and non-linear experience of searching and viewing knowledge

Visual Solution

Recognizing the inherent disparities in exposure that stem from a linear indexing format, I have opted for a graph-based presentation format for the search results.


Product Positioning

Under what circumstances could this search experience be beneficial? We do not propose this design as a replacement for traditional search engines. However, a graph-based decentralized search experience could serve as a valuable tool for users without a specific search goal, or those seeking exposure to a diverse range of information. For instance, consider a design student in search of a meaningful project topic, but one that avoids overly popular subjects.


Workflow

Text File Dataset

This text file dataset is an example for the topic "philosophy".

philosophy_and_the_social_problem.txt
the_consolation_of_philosophy.txt
initiation_into_philosophy.txt

High frequency Entities

13,000 frequent words
(philosophy, Plato, human, morality, intelligence, ...)
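The extraction of high-frequency entities can be sketched roughly as follows. This is only an illustration under our own assumptions: the sample text, the stop-word list, and the function name frequent_words are hypothetical stand-ins, not the pipeline actually used to produce the 13,000 words.

```python
import re
from collections import Counter

# Hypothetical stand-in for the contents of the philosophy text files.
SAMPLE_TEXT = """
Philosophy begins in wonder. Plato held that philosophy is the love of wisdom,
and that human morality depends on intelligence. Human intelligence, for Plato,
is what makes philosophy possible.
"""

# A tiny illustrative stop-word list; a real pipeline would need a fuller one.
STOP_WORDS = {"the", "of", "in", "is", "that", "and", "on", "for", "what", "a"}


def frequent_words(text, top_n):
    """Return the top_n most frequent lowercased tokens that are not stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(t for t in tokens if t not in STOP_WORDS)
    return counts.most_common(top_n)


if __name__ == "__main__":
    for word, count in frequent_words(SAMPLE_TEXT, 5):
        print(word, count)
```

Raising top_n on the full dataset would yield a vocabulary of the size mentioned above.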

Language Model

We trained a language model (philosophy.kv) on the text file dataset. The model captures how words are used in the context of philosophy by mapping each word to a vector.

Example:
{ stories: [-3.468945 -0.08118384 -1.0659883 -1.6832864 -2.128204 -0.6112681 -1.1394693 -0.97605973 1.4505659 -0.15622564 2.0152066 0.7203029 -1.7843394 .... ],
human:[array...] }
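To show how such word vectors can be compared, here is a minimal sketch using cosine similarity. The three-dimensional vectors are assumptions for illustration: "stories" is truncated from the example above, and the other two are made up. The function names are hypothetical, and this is not necessarily the similarity measure used inside the trained model.

```python
import math

# Illustrative 3-dimensional vectors; real embeddings have many more dimensions.
# "stories" is truncated from the example above, the others are invented.
embeddings = {
    "stories": [-3.47, -0.08, -1.07],
    "human": [-2.90, 0.10, -0.80],
    "plato": [1.20, 2.00, 0.30],
}


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(word, table):
    """Return the (other_word, similarity) pair closest to `word` in `table`."""
    others = [
        (w, cosine_similarity(table[word], vec))
        for w, vec in table.items()
        if w != word
    ]
    return max(others, key=lambda pair: pair[1])


if __name__ == "__main__":
    print(most_similar("stories", embeddings))
```

Words whose vectors point in similar directions are treated as related, which is the property the later clustering step relies on.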

{Embedded Dataset}

Entities_1 : [Vectors]
Entities_2 : [Vectors]
Entities_3 : [Vectors]
Entities_4 : [Vectors]
Entities_5 : [Vectors]
Entities_6 : [Vectors]

This part is the initial data processing, in which the text dataset is converted into a set of word vectors.

Model Training

In this part, we applied a clustering algorithm (DBSCAN) to further process the data into groups.

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) finds core samples in regions of high density and expands clusters from them.
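The idea of core samples and cluster expansion can be sketched in plain Python as below. This is a simplified teaching version, not the implementation used in the project. The parameter names eps and min_samples follow common DBSCAN conventions, and the 2D points are hypothetical stand-ins for word vectors.

```python
import math

NOISE = -1  # label for points that belong to no cluster


def region_query(points, i, eps):
    """Indices of all points within eps of points[i], including i itself."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points, eps, min_samples):
    """Return one cluster label per point; NOISE marks unclustered points."""
    labels = [None] * len(points)
    cluster_id = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue  # already assigned to a cluster or marked as noise
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_samples:
            labels[i] = NOISE  # not a core sample (may later become a border point)
            continue
        cluster_id += 1  # points[i] is a core sample: start a new cluster
        labels[i] = cluster_id
        to_visit = [j for j in neighbors if j != i]
        while to_visit:
            j = to_visit.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point reached from a core sample
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_samples:
                to_visit.extend(j_neighbors)  # j is also core: keep expanding
    return labels


if __name__ == "__main__":
    # Hypothetical 2D points: two dense groups and one isolated point.
    pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
    print(dbscan(pts, eps=1.5, min_samples=3))
```

A smaller eps or a larger min_samples makes clusters tighter and leaves more points as noise, which is why different parameter choices produce different groupings.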

pyplot visualization of 3,000 of the 13,000 words before clustering (perplexity 40)

Cluster Results

This example illustrates a portion of a cluster result obtained when appropriate parameters are applied. Words enclosed within braces {} are grouped together based on their distances in the high-dimensional vector space, indicating potential relationships within the context of "philosophy."

When varying parameters are employed, multiple results are generated. Within each result, words are grouped according to different potential relationships.

The figure below shows how the word "ego" is grouped with different words under different cluster parameters.

Customized Dataset

As a result, we have a customized (training-based) search dataset in which information is clustered into different groups across multiple results.

The following shows how a user's search behavior interacts with the clustered dataset.

User Interaction

The searched word

1. Check database

2. Return results

Clustering

Output
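The lookup flow above (check the database for the searched word, then return its cluster under each result) might look roughly like this. The dataset layout, the parameter keys, and the words grouped with "ego" are all invented for illustration; the project's actual storage format is not described here.

```python
# Hypothetical customized dataset: one clustering result per parameter setting.
# Keys are illustrative eps values; each result is a list of word groups.
clustered_results = {
    0.3: [{"ego", "self"}, {"plato", "socrates"}],
    0.5: [{"ego", "self", "consciousness"}, {"plato", "socrates", "aristotle"}],
}


def search(word, results):
    """Return, per parameter setting, the words grouped with `word`.

    Settings in which the word does not appear in any group are omitted,
    so an unknown word yields an empty dict.
    """
    hits = {}
    for setting, groups in results.items():
        for group in groups:
            if word in group:
                hits[setting] = sorted(group - {word})
                break
    return hits


if __name__ == "__main__":
    print(search("ego", clustered_results))
    print(search("kant", clustered_results))
```

Because the clustering is computed ahead of time, a search only needs this lookup rather than a fresh clustering run.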

Interface

search bar

slider controlling display density

word calculation search

cluster result list display

a cluster of results

closer to center means closer relationship
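One possible way to realize "closer to center means closer relationship" is a radial layout in which the searched word sits at the origin and each related word is placed at a radius proportional to its vector distance. The sketch below is an assumption about how such a layout could be computed, not the interface's actual code. The function name radial_layout and the sample vectors are hypothetical.

```python
import math


def radial_layout(center_vec, neighbors, max_radius=1.0):
    """Map each neighbor word to an (x, y) position around the origin.

    The radius is proportional to the Euclidean distance between the
    neighbor's vector and center_vec; angles are spread evenly.
    """
    if not neighbors:
        return {}
    distances = {w: math.dist(center_vec, v) for w, v in neighbors.items()}
    farthest = max(distances.values()) or 1.0  # avoid division by zero
    ordered = sorted(distances.items(), key=lambda pair: pair[1])
    positions = {}
    for k, (word, d) in enumerate(ordered):
        angle = 2 * math.pi * k / len(ordered)
        radius = max_radius * d / farthest
        positions[word] = (radius * math.cos(angle), radius * math.sin(angle))
    return positions


if __name__ == "__main__":
    # Hypothetical 2D vectors for words clustered with the searched word.
    related = {"self": [0.1, 0.0], "plato": [3.0, 4.0]}
    for word, (x, y) in radial_layout([0.0, 0.0], related).items():
        print(word, round(x, 3), round(y, 3))
```

Normalizing by the farthest neighbor keeps every cluster inside the same circle regardless of how spread out its vectors are.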