The Great Compression

10 Oct

We are witnessing an era of information compression, spearheaded by large language models (LLMs) that proficiently process web text. These LLMs take in an inconceivably vast array of word combinations and reduce them to a set of parameters numbering, at the largest scale, around a trillion. Embedding models such as text-embedding-ada-002 condense further still, representing any passage of text as a vector of just 1536 dimensions. Reflect on this when using Retrieval-Augmented Generation (RAG): every chunk you retrieve is located by 1536 coordinates in a space distilled from the web's information.

Let's delve into why the web compresses so well. Effective compression relies on redundancy, regularity and predictability, in other words on low entropy. Despite its apparent chaos, the web is intricately structured: it is the collective intellectual endeavour of billions of people, who adhere to universal standards and write in natural language with its robust inherent semantic structures. These shared semantics are what make the web compressible.

⚡Shared Semantics = Lower Data Entropy = Enhanced Compressibility⚡
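To make the equation above tangible, here is a minimal sketch in Python. It assumes a toy setup that is not from any real system: 300 field labels that all refer to the same concept, written once with a single agreed term and once with whatever local term each system happens to use. The term list, the seed and the helper names are all hypothetical. It measures Shannon entropy per label and how far zlib can shrink each list.

```python
import math
import random
import zlib
from collections import Counter


def shannon_entropy(tokens):
    """Shannon entropy, in bits per token, of the empirical token distribution."""
    counts = Counter(tokens)
    total = len(tokens)
    return sum((c / total) * math.log2(total / c) for c in counts.values())


def compressed_ratio(tokens):
    """Compressed size divided by raw size (lower means more compressible)."""
    raw = " ".join(tokens).encode("utf-8")
    return len(zlib.compress(raw, 9)) / len(raw)


# Hypothetical local names that different systems use for one concept.
LOCAL_TERMS = ["customer", "client", "cust_nm", "acct_holder", "buyer", "party"]

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Shared semantics: every system uses the same agreed term.
shared = ["customer"] * 300
# Diverse terminology: each label is whichever local term a system picked.
diverse = [rng.choice(LOCAL_TERMS) for _ in range(300)]

for name, labels in [("shared", shared), ("diverse", diverse)]:
    print(name, round(shannon_entropy(labels), 2), round(compressed_ratio(labels), 3))
```

The shared list has zero entropy and compresses to a small fraction of its size, while the diverse list carries roughly two and a half bits of avoidable uncertainty per label and compresses noticeably worse. Real organisational data is far messier than this toy, but the direction of the effect is the same.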

The imperative now for each organisation is to compress its own data. However, organisations often juggle multiple databases, encompassing hundreds of thousands of columns with diverse terminologies, culminating in a high-entropy, disorganised data landscape that resists efficient compression. While entities with substantial resources and simpler data architectures might invest in a $3 million custom OpenAI model, the majority will need to consider alternative approaches.

A viable solution is to emulate the web’s efficiency within organisational structures. Envisage aligning the core data from various applications with shared ontologies and adopting standardised protocols, such as using URLs as universal identifiers. This method will create a more interconnected and coherent semantic framework. Organisational ontologies then become generalised abstraction spaces, enabling each organisation to effectively compress its data within its specific niche. This strategy not only enhances data optimisation but also empowers organisations to distil and clarify the essence of their core identity in a human-centric way.
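As a rough illustration of what such alignment could look like, the sketch below re-keys rows from two applications so that both use URLs from a shared ontology as their identifiers. Everything in it is assumed for the example: the `https://example.org/ontology/` namespace, the application names, the column names and the `align` helper are hypothetical, and a real project would more likely use RDF tooling than plain dictionaries.

```python
# Hypothetical shared ontology: each concept is identified by a URL.
ONTOLOGY = "https://example.org/ontology/"

# Hypothetical per-application mappings from local column names to shared terms.
MAPPINGS = {
    "crm": {
        "cust_nm": ONTOLOGY + "customerName",
        "e_mail": ONTOLOGY + "email",
    },
    "billing": {
        "client_name": ONTOLOGY + "customerName",
        "contact_email": ONTOLOGY + "email",
    },
}


def align(app, row):
    """Re-key one row from a source application using shared ontology URLs.

    Columns without a mapping are dropped, on the assumption that only
    core data is being aligned.
    """
    mapping = MAPPINGS[app]
    return {mapping[col]: value for col, value in row.items() if col in mapping}


crm_row = {"cust_nm": "Ada Lovelace", "e_mail": "ada@example.org", "internal_flag": "x"}
billing_row = {"client_name": "Ada Lovelace", "contact_email": "ada@example.org"}

aligned_crm = align("crm", crm_row)
aligned_billing = align("billing", billing_row)

# Five distinct local column names collapse to two shared identifiers.
distinct_before = set(crm_row) | set(billing_row)
distinct_after = set(aligned_crm) | set(aligned_billing)
print(len(distinct_before), "->", len(distinct_after))
print(aligned_crm == aligned_billing)
```

Once both rows speak the same vocabulary they turn out to be identical, which is the redundancy that diverse terminology was hiding.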

⭕ Ontologies and LLMs: https://www.knowledge-graph-guys.com/blog/llms-ontologies

⭕ Humans in the Loop: https://www.knowledge-graph-guys.com/blog/human-in-the-loop

⭕ The Free Energy Principle & Data Connectivity: https://www.knowledge-graph-guys.com/blog/data-connectivity-and-the-free-energy-principle

Tony Seale
