2025-12-15
The Function Graph: Mapping What Runs the World
#ai-training #infrastructure #code-graph #scaling #open-source
The idea: GitHub hosts hundreds of millions of repositories, but the number of unique functions, the actual operations that run the world's software each day, is far smaller and can be mapped.
If we built a code-graph of all unique functions, we could focus model training and framework development on the code that actually matters.
The Insight
┌─────────────────────────────────────────────────────────────────┐
│ GITHUB TODAY │
├─────────────────────────────────────────────────────────────────┤
│ │
│ 330+ million repositories │
│ 100+ million developers │
│ Billions of files │
│ │
│ But how many UNIQUE FUNCTIONS? │
│ │
│ ┌───────────────────────────────────────────────────────────┐ │
│ │ The world runs on a surprisingly small set of patterns: │ │
│ │ │ │
│ │ • HTTP request handling │ │
│ │ • Database queries │ │
│ │ • Auth/session management │ │
│ │ • File I/O │ │
│ │ • String manipulation │ │
│ │ • Array/list operations │ │
│ │ • Date/time handling │ │
│ │ • JSON parsing │ │
│ │ • Error handling │ │
│ │ • Logging │ │
│ │ │ │
│ │ These patterns repeat across MILLIONS of repos. │ │
│ └───────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘
The Code Graph
┌─────────────────────────────────────────────────────────────────┐
│ THE FUNCTION GRAPH │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Extract unique function signatures across all of GitHub (repo counts are illustrative): │
│ │
│ ┌─────────────────────┐ │
│ │ parse_json(str) │──┬──> Used in 47M repos │
│ └─────────────────────┘ │ │
│ │ │
│ ┌─────────────────────┐ │ │
│ │ http_get(url) │──┼──> Used in 31M repos │
│ └─────────────────────┘ │ │
│ │ │
│ ┌─────────────────────┐ │ │
│ │ db_query(sql) │──┼──> Used in 28M repos │
│ └─────────────────────┘ │ │
│ │ │
│ ┌─────────────────────┐ │ │
│ │ hash_password(pw) │──┴──> Used in 12M repos │
│ └─────────────────────┘ │
│ │
│ Each node = unique function pattern │
│ Edges = dependency/call relationships │
│ Weight = usage frequency across repos │
│ │
└─────────────────────────────────────────────────────────────────┘
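As a rough illustration of the node, edge, and weight definitions above, here is a minimal sketch that builds a toy function graph from Python source strings using the standard `ast` module. The `build_function_graph` name and the shape of the `repos` input are assumptions made for this example, and matching functions by bare name is a deliberate simplification: a real extractor would have to resolve imports, methods, and other languages.

```python
import ast
from collections import Counter, defaultdict


def build_function_graph(repos):
    """Build a toy function graph.

    `repos` maps a repo name to a list of Python source strings
    (a hypothetical input shape chosen for this sketch).
    Returns (usage, edges):
      usage[name] = number of repos that define or call `name` (node weight)
      edges[caller] = set of function names called by `caller`
    """
    usage = Counter()
    edges = defaultdict(set)
    for repo_name, sources in repos.items():
        seen_in_repo = set()
        for source in sources:
            tree = ast.parse(source)
            for node in ast.walk(tree):
                if not isinstance(node, ast.FunctionDef):
                    continue
                seen_in_repo.add(node.name)
                # Record direct calls by plain name, e.g. parse_json(s).
                for inner in ast.walk(node):
                    if isinstance(inner, ast.Call) and isinstance(inner.func, ast.Name):
                        edges[node.name].add(inner.func.id)
                        seen_in_repo.add(inner.func.id)
        # Each repo contributes at most 1 to a function's weight.
        usage.update(seen_in_repo)
    return usage, edges


repos = {
    "repo_a": ["def load(s):\n    return parse_json(s)\n"],
    "repo_b": ["def read(s):\n    return parse_json(s)\n"],
}
usage, edges = build_function_graph(repos)
```

In this toy input, `parse_json` is counted once per repo, while `load` and `read` each appear in only one.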
The Application
┌─────────────────────────────────────────────────────────────────┐
│ WHERE TO FOCUS EFFORT │
├─────────────────────────────────────────────────────────────────┤
│ │
│ 1. MODEL TRAINING │
│ ──────────────── │
│ Train on the functions that matter most. │
│ If parse_json() appears in 47M repos, make sure the model │
│ is EXCELLENT at generating and understanding JSON parsing. │
│ │
│ 2. FRAMEWORK DEVELOPMENT │
│ ──────────────────────── │
│ Build frameworks around the highest-frequency functions. │
│ Don't reinvent — identify the patterns that already run │
│ the world and make them better. │
│ │
│ 3. SECURITY FOCUS │
│ ──────────────── │
│ The most-used functions are the highest-risk attack surface. │
│ Focus security audits on the code-graph hotspots. │
│ │
│ 4. DOCUMENTATION │
│ ───────────── │
│ Document the patterns that run everything. │
│ Make the most-used functions the best-documented. │
│ │
└─────────────────────────────────────────────────────────────────┘
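One possible way to make "code-graph hotspots" concrete for the security case is to count how many functions reach a given function through the call graph, directly or transitively. The sketch below assumes edges are stored as a caller-to-callees mapping; the `transitive_dependents` helper and the sample function names are hypothetical.

```python
from collections import defaultdict


def transitive_dependents(edges):
    """Count, for each function, how many other functions reach it.

    `edges` maps a caller name to the set of names it calls.
    A higher count suggests a wider blast radius if the function is flawed.
    """
    reverse = defaultdict(set)  # callee -> direct callers
    nodes = set(edges)
    for caller, callees in edges.items():
        for callee in callees:
            reverse[callee].add(caller)
            nodes.add(callee)

    counts = {}
    for node in nodes:
        stack, seen = [node], set()
        while stack:
            current = stack.pop()
            for caller in reverse.get(current, ()):
                if caller not in seen:
                    seen.add(caller)
                    stack.append(caller)
        seen.discard(node)  # do not count a function reaching itself via a cycle
        counts[node] = len(seen)
    return counts


edges = {
    "login": {"hash_password", "db_query"},
    "signup": {"hash_password"},
    "handler": {"login"},
}
counts = transitive_dependents(edges)
hotspots = sorted(counts, key=counts.get, reverse=True)
```

Here `hash_password` ranks first because `login`, `signup`, and `handler` all depend on it, which is the kind of node an audit would prioritize under this heuristic.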
The Math
┌─────────────────────────────────────────────────────────────────┐
│ CONCENTRATION OF IMPACT │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Hypothesis: a Pareto-like distribution applies: │
│ │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ │ │
│ │ ~1% of unique functions │ │
│ │ run in ~80% of production software │ │
│ │ │ │
│ └──────────────────────────────────────────────────────┘ │
│ │
│ If we identify that 1%, we can: │
│ • Optimize AI training for maximum real-world impact │
│ • Focus security resources where they matter │
│ • Build frameworks that handle the common cases excellently │
│ • Reduce the surface area of "code that runs the world" │
│ │
└─────────────────────────────────────────────────────────────────┘
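Whether the ~1% / ~80% split actually holds is something the graph itself could test. A minimal sketch, assuming usage counts per function are already available: sort functions by usage and find how many are needed to cover a target share of total usage. The function name and the sample counts are made up for illustration.

```python
def functions_needed_for_coverage(usage_counts, target=0.8):
    """Return (n, fraction): the smallest number of top functions whose
    combined usage reaches `target` of total usage, and that number as a
    fraction of all unique functions."""
    counts = sorted(usage_counts.values(), reverse=True)
    total = sum(counts)
    if total == 0:
        return 0, 0.0
    running = 0
    for index, count in enumerate(counts, start=1):
        running += count
        if running >= target * total:
            return index, index / len(counts)
    return len(counts), 1.0


# Illustrative counts only, not measured data.
usage_counts = {
    "parse_json": 47,
    "http_get": 31,
    "db_query": 28,
    "hash_password": 12,
    "rare_a": 1,
    "rare_b": 1,
}
needed, fraction = functions_needed_for_coverage(usage_counts)
```

On this tiny sample, three of six functions cover 80% of usage. On a real graph, the hypothesis predicts a fraction closer to 0.01.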
The Vision
┌─────────────────────────────────────────────────────────────────┐
│ FROM REPOS TO FUNCTIONS TO PATTERNS │
├─────────────────────────────────────────────────────────────────┤
│ │
│ LAYER 1: REPOS │
│ 330M+ repositories — too many to comprehend │
│ │
│ │ │
│ v │
│ │
│ LAYER 2: FUNCTIONS │
│ Extract unique function signatures — maybe 10M patterns │
│ │
│ │ │
│ v │
│ │
│ LAYER 3: CORE PATTERNS │
│ The ~100K functions that run 80% of everything │
│ │
│ │ │
│ v │
│ │
│ LAYER 4: PRIMITIVES │
│ The ~1K operations that are truly fundamental │
│ (parse, query, auth, io, transform, validate, log, handle) │
│ │
└─────────────────────────────────────────────────────────────────┘
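The step from Layer 1 to Layer 2 depends on deciding when two functions count as "the same". One possible approach, sketched below for Python only, is to normalize a function's syntax tree by renaming identifiers to placeholders and then hash the result, so that functions differing only in naming collapse into one pattern. The `pattern_fingerprint` helper is an assumption for this example; it ignores docstrings, semantics, and cross-language equivalence, all of which a real system would need to handle.

```python
import ast
import hashlib


class _Normalizer(ast.NodeTransformer):
    """Rename functions, arguments, and variables to positional placeholders."""

    def __init__(self):
        self.names = {}

    def _alias(self, name):
        # The first distinct name becomes v0, the second v1, and so on.
        return self.names.setdefault(name, f"v{len(self.names)}")

    def visit_FunctionDef(self, node):
        node.name = "f"
        self.generic_visit(node)
        return node

    def visit_arg(self, node):
        node.arg = self._alias(node.arg)
        node.annotation = None  # ignore type hints when comparing
        return node

    def visit_Name(self, node):
        node.id = self._alias(node.id)
        return node


def pattern_fingerprint(source):
    """Hash a function's normalized AST so renamed copies share one fingerprint."""
    tree = _Normalizer().visit(ast.parse(source))
    return hashlib.sha256(ast.dump(tree).encode()).hexdigest()


a = "def parse(s):\n    return json.loads(s)\n"
b = "def decode(text):\n    return json.loads(text)\n"
c = "def add(x, y):\n    return x + y\n"
same = pattern_fingerprint(a) == pattern_fingerprint(b)
different = pattern_fingerprint(a) != pattern_fingerprint(c)
```

Counting distinct fingerprints instead of distinct names is one way to estimate how many patterns Layer 2 actually contains.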
Connection to World State
This extends the Repo as World State pattern:
- Repo as World State = each repo is a readable world
- Function Graph = the aggregate of all worlds forms a map of what runs everything
If every repo follows AI-native patterns, the function graph becomes:
- Extractable — clear function signatures, documented intent
- Navigable — AI can traverse the graph
- Trainable — models can learn from the most important code
Why This Matters
- Focus AI training on high-impact code patterns
- Direct framework development to the functions that run the world
- Concentrate security efforts on the most critical paths
- Reduce redundancy — why reinvent parse_json() 47 million times?
- Map infrastructure — understand what code actually runs each day
The Question
What are the ~100K functions, and the ~1K primitives beneath them, that run 80% of the world's software?
If we could answer that, we could focus AI development on the code that actually matters.
Published to the Digital Library as a sharable idea.