2025-12-15

The Function Graph: Mapping What Runs the World

#ai-training #infrastructure #code-graph #scaling #open-source

The idea: GitHub hosts hundreds of millions of repositories, but the number of unique functions (the actual operations that run the world's software each day) is finite and mappable.

If we built a code-graph of all unique functions, we could focus model training and framework development on the code that actually matters.


The Insight

┌─────────────────────────────────────────────────────────────────┐
│  GITHUB TODAY                                                   │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  330+ million repositories                                      │
│  100+ million developers                                        │
│  Billions of files                                              │
│                                                                 │
│  But how many UNIQUE FUNCTIONS?                                 │
│                                                                 │
│  ┌───────────────────────────────────────────────────────────┐ │
│  │  The world runs on a surprisingly small set of patterns:  │ │
│  │                                                           │ │
│  │  • HTTP request handling                                  │ │
│  │  • Database queries                                       │ │
│  │  • Auth/session management                                │ │
│  │  • File I/O                                               │ │
│  │  • String manipulation                                    │ │
│  │  • Array/list operations                                  │ │
│  │  • Date/time handling                                     │ │
│  │  • JSON parsing                                           │ │
│  │  • Error handling                                         │ │
│  │  • Logging                                                │ │
│  │                                                           │ │
│  │  These patterns repeat across MILLIONS of repos.          │ │
│  └───────────────────────────────────────────────────────────┘ │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
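
One way to make "unique function" concrete is to hash a function's structure while ignoring its name and local variable names, so that renamed copies collapse into one pattern. The sketch below assumes Python sources and a single top-level def per snippet; `fingerprint` is a hypothetical helper, and real deduplication would also need to handle docstrings, other languages, and near-duplicates.

```python
import ast
import hashlib


def fingerprint(source: str) -> str:
    """Hash one function's structure, ignoring its name and local variable names."""
    func = ast.parse(source).body[0]  # assumes the source holds a single def

    # Parameters and assigned names are local, so their spelling is arbitrary.
    local_names = [a.arg for a in func.args.args]
    local_names += [
        n.id for n in ast.walk(func)
        if isinstance(n, ast.Name) and isinstance(n.ctx, ast.Store)
    ]
    placeholders = {}
    for name in local_names:
        placeholders.setdefault(name, f"v{len(placeholders)}")

    # Rewrite locals to v0, v1, ...; leave globals such as `json` untouched.
    for node in ast.walk(func):
        if isinstance(node, ast.arg):
            node.arg = placeholders.get(node.arg, node.arg)
        elif isinstance(node, ast.Name):
            node.id = placeholders.get(node.id, node.id)
    func.name = "f"

    return hashlib.sha256(ast.dump(func).encode("utf-8")).hexdigest()[:16]


a = "def parse_json(text):\n    data = json.loads(text)\n    return data\n"
b = "def load_payload(raw):\n    obj = json.loads(raw)\n    return obj\n"
print(fingerprint(a) == fingerprint(b))
```

The two snippets differ only in naming, so they share a fingerprint; a function that calls a different library would not.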

The Code Graph

┌─────────────────────────────────────────────────────────────────┐
│  THE FUNCTION GRAPH                                             │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Extract unique function signatures across all of GitHub:       │
│                                                                 │
│  ┌─────────────────────┐                                        │
│  │  parse_json(str)    │──┬──> Used in 47M repos               │
│  └─────────────────────┘  │                                     │
│                           │                                     │
│  ┌─────────────────────┐  │                                     │
│  │  http_get(url)      │──┼──> Used in 31M repos               │
│  └─────────────────────┘  │                                     │
│                           │                                     │
│  ┌─────────────────────┐  │                                     │
│  │  db_query(sql)      │──┼──> Used in 28M repos               │
│  └─────────────────────┘  │                                     │
│                           │                                     │
│  ┌─────────────────────┐  │                                     │
│  │  hash_password(pw)  │──┴──> Used in 12M repos               │
│  └─────────────────────┘                                        │
│                                                                 │
│  Each node = unique function pattern                            │
│  Edges = dependency/call relationships                          │
│  Weight = usage frequency across repos                          │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
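
The node, edge, and weight definitions above could be held in a structure like the following. This is a minimal in-memory sketch: the `FunctionGraph` class, its method names, and the repo names in the usage lines are all hypothetical, and a graph at GitHub scale would need a real graph store.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class FunctionGraph:
    # node -> set of repos in which the function pattern appears
    repos_using: dict = field(default_factory=lambda: defaultdict(set))
    # caller -> set of callees (the dependency/call edges)
    calls: dict = field(default_factory=lambda: defaultdict(set))

    def add_call(self, repo: str, caller: str, callee: str) -> None:
        """Record that `caller` calls `callee` somewhere in `repo`."""
        self.repos_using[caller].add(repo)
        self.repos_using[callee].add(repo)
        self.calls[caller].add(callee)

    def weight(self, pattern: str) -> int:
        """Usage frequency: the number of distinct repos containing the pattern."""
        return len(self.repos_using.get(pattern, ()))

    def hotspots(self, n: int) -> list:
        """The n heaviest nodes, ties broken alphabetically."""
        return sorted(self.repos_using, key=lambda p: (-self.weight(p), p))[:n]


graph = FunctionGraph()
graph.add_call("repo-a", "handle_request", "parse_json")
graph.add_call("repo-b", "load_config", "parse_json")
graph.add_call("repo-b", "load_config", "read_file")
print(graph.hotspots(1))
```

Counting distinct repos rather than call sites keeps one repo that calls a function a thousand times from outweighing a thousand repos that call it once.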

The Application

┌─────────────────────────────────────────────────────────────────┐
│  WHERE TO FOCUS EFFORT                                          │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  1. MODEL TRAINING                                              │
│  ────────────────                                               │
│  Train on the functions that matter most.                       │
│  If parse_json() appears in 47M repos, make sure the model      │
│  is EXCELLENT at generating and understanding JSON parsing.     │
│                                                                 │
│  2. FRAMEWORK DEVELOPMENT                                       │
│  ────────────────────────                                       │
│  Build frameworks around the highest-frequency functions.       │
│  Don't reinvent — identify the patterns that already run        │
│  the world and make them better.                                │
│                                                                 │
│  3. SECURITY FOCUS                                              │
│  ────────────────                                               │
│  The most-used functions are the highest-risk attack surface.   │
│  Focus security audits on the code-graph hotspots.              │
│                                                                 │
│  4. DOCUMENTATION                                               │
│  ─────────────                                                  │
│  Document the patterns that run everything.                     │
│  Make the most-used functions the best-documented.              │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
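
For the model-training item, usage frequency could be turned into sampling weights for a training mix. The sketch below is one possible scheme, not a known recipe: `sampling_weights` and the smoothing exponent `alpha` are assumptions, and the counts are the illustrative figures from the diagram above.

```python
def sampling_weights(repo_counts: dict, alpha: float = 0.5) -> dict:
    """Convert per-function repo counts into sampling probabilities.

    alpha=1.0 samples in proportion to usage; alpha=0.0 samples uniformly.
    Values in between favor common functions without starving rare ones.
    """
    scaled = {name: count ** alpha for name, count in repo_counts.items()}
    total = sum(scaled.values())
    return {name: value / total for name, value in scaled.items()}


counts = {
    "parse_json": 47_000_000,
    "http_get": 31_000_000,
    "db_query": 28_000_000,
    "hash_password": 12_000_000,
}
for name, weight in sampling_weights(counts).items():
    print(f"{name}: {weight:.3f}")
```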

The Math

┌─────────────────────────────────────────────────────────────────┐
│  CONCENTRATION OF IMPACT                                        │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Pareto applies:                                                │
│                                                                 │
│    ┌──────────────────────────────────────────────────────┐    │
│    │                                                      │    │
│    │   ~1% of unique functions                            │    │
│    │   run in ~80% of production software                 │    │
│    │                                                      │    │
│    └──────────────────────────────────────────────────────┘    │
│                                                                 │
│  If we identify that 1%, we can:                                │
│  • Optimize AI training for maximum real-world impact           │
│  • Focus security resources where they matter                   │
│  • Build frameworks that handle the common cases excellently    │
│  • Reduce the surface area of "code that runs the world"        │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
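
The 1% / 80% figure is a hypothesis until measured. Given per-function usage counts, the check is simple: sort by usage and find how small a share of functions reaches the target coverage. A minimal sketch, with a hypothetical helper name and made-up toy counts:

```python
def smallest_covering_fraction(usage_counts, target=0.80):
    """Fraction of functions, taken most-used first, needed to cover `target` of total usage."""
    counts = sorted(usage_counts, reverse=True)
    total = sum(counts)
    if total == 0:
        raise ValueError("usage_counts must contain at least one nonzero count")

    running = 0
    for k, count in enumerate(counts, start=1):
        running += count
        if running >= target * total:
            return k / len(counts)
    return 1.0  # only reachable for a target above 1.0


# Toy data: one dominant function and 100 rarely used ones.
toy_counts = [900] + [1] * 100
print(smallest_covering_fraction(toy_counts))
```

If the measured fraction on real data came out near 0.01, the Pareto claim above would hold; a much larger value would mean usage is flatter than assumed.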

The Vision

┌─────────────────────────────────────────────────────────────────┐
│  FROM REPOS TO FUNCTIONS TO PATTERNS                            │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  LAYER 1: REPOS                                                 │
│  330M+ repositories — too many to comprehend                    │
│                                                                 │
│            │                                                    │
│            v                                                    │
│                                                                 │
│  LAYER 2: FUNCTIONS                                             │
│  Extract unique function signatures — maybe 10M patterns        │
│                                                                 │
│            │                                                    │
│            v                                                    │
│                                                                 │
│  LAYER 3: CORE PATTERNS                                         │
│  The ~100K functions that run 80% of everything                 │
│                                                                 │
│            │                                                    │
│            v                                                    │
│                                                                 │
│  LAYER 4: PRIMITIVES                                            │
│  The ~1K operations that are truly fundamental                  │
│  (parse, query, auth, io, transform, validate, log, handle)     │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
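
As a rough illustration of the last step, function names can be bucketed into the primitives listed in Layer 4 by keyword. The keyword table below is invented for this sketch; a real mapping would more likely rely on what a function does than on what it is called.

```python
# Primitive names come from Layer 4 above; the keyword lists are assumptions.
PRIMITIVE_KEYWORDS = {
    "parse": ("parse", "decode", "load"),
    "query": ("query", "select", "fetch"),
    "auth": ("auth", "login", "password"),
    "io": ("read", "write", "open", "http"),
    "transform": ("map", "convert", "format"),
    "validate": ("validate", "check"),
    "log": ("log",),
    "handle": ("handle",),
}


def primitive_for(function_name: str):
    """Return the first primitive whose keyword matches a snake_case token, else None."""
    tokens = function_name.lower().split("_")
    for primitive, keywords in PRIMITIVE_KEYWORDS.items():
        if any(keyword in tokens for keyword in keywords):
            return primitive
    return None


for name in ("parse_json", "http_get", "db_query", "hash_password"):
    print(name, "->", primitive_for(name))
```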

Connection to World State

This extends the Repo as World State pattern:

  • Repo as World State = each repo is a readable world
  • Function Graph = the aggregate of all those worlds, forming a map of what runs everything

If every repo follows AI-native patterns, the function graph becomes:

  • Extractable — clear function signatures, documented intent
  • Navigable — AI can traverse the graph
  • Trainable — models can learn from the most important code

Why This Matters

  1. Focus AI training on high-impact code patterns
  2. Direct framework development to the functions that run the world
  3. Concentrate security efforts on the most critical paths
  4. Reduce redundancy — why reinvent parse_json() 47 million times?
  5. Map infrastructure — understand what code actually runs each day

The Question

What are the ~100K functions that run 80% of the world's software, and the ~1K primitives beneath them?

If we could answer that, we could focus AI development on the code that actually matters.


Published to the Digital Library as a sharable idea.