White Paper · 01
THE PLANNING MIND
How Grid Mathematics, Weight Learning, and Attention Can Build a Thinking Map of the World
A Technical-Conceptual Paper for Physical Planners, Engineers, Economists and Curious Minds
Plans, settlements and infrastructure works are large assemblies of geometries in which each element holds some relationship to the others, mostly intrinsic in nature. We plan these geometries around a set of parameters, with the intent of improving overall quality of life and enabling growth. This requires a framework in which a plan's intended performance can be measured before it is selected and executed, letting us choose the best alternative, the most optimal alignment, or the minimal set of interventions that would bring about the maximum intended change.
Part 1: Problem Statement and the Proposed Idea
What creates the difference in quality of life between two neighbourhoods, two villages, two towns? It would take at least a reconnaissance survey to form some flimsy ideas, or perhaps a month's stay and interactions at both places to understand the difference. Or we can overlay the entire area with an imaginary grid of chosen resolution and assign each grid box a set of measurable quantities (how many trees, how many roads, how many people, what the land costs, surface water, ground water level, ground water quality, electricity supply, sewer connection… and many more), allow those boxes to make sense of each other in a progressive manner, and let mathematics be their language.
| Fact (Dimension) | Number |
|---|---|
| Forest cover (%) | 0.34 (34%) |
| Population (people) | 4,200 |
| Road density (km/km²) | 2.1 |
| Land price (INR/m²) | 8,500 |
| Flood risk (0-1) | 0.72 (high risk) |
Since we know the values of these quantities, and of the derived parametric quantities born out of the overall assembly of elements in each box, we can play the prediction game with this mathematical system. We let the system predict, then input the known value; we highlight the error and give the system a mechanism to learn from each such iteration. LLMs have done this for language. In this paper we do it for spatial geometries at the scale of human settlements.
Part 2: What Is a Grid? The Foundation of Everything
The grid is an imaginary set of orthogonal lines in the XY plane, overlaid on the entire geography of interest, as if someone had laid a transparent graph paper over the whole area. Each box of the grid becomes a container that holds information about that piece of geography. We record a current-state set of measurable tangible quantities (built mass, carriageway, intersections, number of floors, height above ground, ground water level, ground water quality, etc.) and a set of non-tangible derived parameters (residential land value, income group, median income level, gross production, commercial land value, rate of industrialisation, level of urbanisation, etc.), whose values we know.
Each grid cell is described as a vector:
g = [ 0.34, 4200, 2.1, 8500, 0.72, ... ]
A spatial region overlaid with a grid of resolution 10 m × 10 m for a selected area of 1 km² gives 10,000 grid boxes, each containing 50 dimensions; that is 500,000 numbers describing one region. This is the data foundation of the whole system.
Part 3: Numbers Need to Be Comparable — Normalisation
'Population = 4,200' and 'Forest cover = 0.34' are on completely different scales. We force every number to fit between 0 and 1:
x_normalised = (x - x_minimum) / (x_maximum - x_minimum)
x_normalised = (4200 - 0) / (50000 - 0) = 0.084
4,200 people becomes 0.084. Now it is comparable to flood risk, forest cover etc. We do this for every quantity.
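The normalisation step can be sketched in a few lines of Python; the bounds 0 and 50,000 for population are assumed survey-wide minimum and maximum, used here only for illustration:

```python
def min_max_normalise(x, x_min, x_max):
    """Map a raw measurement into [0, 1] using survey-wide bounds."""
    return (x - x_min) / (x_max - x_min)

# 4,200 people against an assumed regional range of 0..50,000
population_norm = min_max_normalise(4200, 0, 50000)
print(population_norm)  # 0.084
```

The same function is applied per dimension, each with its own min and max taken across all grid boxes in the survey.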
Part 4: The Weight Matrix — How Dimensions Talk to Each Other
We live in an ecology where everything is connected. Road infrastructure can be linked to economic activity and to accidents at the same time. In the world of neural networks, these dimensions are connected to each other through weights: numbers expressing the "how much" of each interrelation. For a 50-dimension vector system, we get a table of 50 × 50 = 2,500 weights. A weight w_ij answers: 'a slight change in dimension i corresponds to what change in dimension j?'
output_j = SUM (g_i × w_ij) for all i from 1 to 50
land_price = (0.084 × 5.2) + (0.34 × -1.8) + (0.72 × -3.1) = -2.407
The weights are not assigned by us — the iterative predictive and data input function is teaching the system. That learning process is described in Part 8.
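A minimal sketch of the weighted sum, restricted to three of the fifty dimensions; the weights 5.2, -1.8 and -3.1 are the same illustrative numbers used above, not learned values:

```python
def weighted_sum(g, w):
    """output_j = SUM(g_i * w_ij) over all input dimensions i."""
    return sum(g_i * w_i for g_i, w_i in zip(g, w))

g = [0.084, 0.34, 0.72]   # normalised population, forest cover, flood risk
w = [5.2, -1.8, -3.1]     # hypothetical weights feeding the 'land price' output
print(round(weighted_sum(g, w), 3))  # -2.407
```

In the full system this runs once per output dimension, which is exactly a matrix-vector multiplication.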
Part 5: The Activation Function — Adding Bends to Straight Lines
The real world is not linear. If we keep increasing road width, adjacent land price will not keep increasing indefinitely; there is always a local maximum or minimum. We bend these straight-line relationships using activation functions:
ReLU(x) = max(0, x) — no negatives allowed
sigma(x) = 1 / (1 + e^(-x)) — squeezes any number to (0,1)
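Both activation functions are one-liners in Python:

```python
import math

def relu(x):
    """max(0, x): passes positives through, zeroes out negatives."""
    return max(0.0, x)

def sigmoid(x):
    """1 / (1 + e^-x): squeezes any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

print(relu(-2.407))   # 0.0, a negative signal is clipped
print(sigmoid(0.0))   # 0.5, the midpoint of the (0, 1) range
```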
Part 6: Layers — Stacking Simple Steps to Get Complex Understanding
Physical planning is too complex to be learned through one layer. We need a stack of multiple layers. Each layer learns increasingly abstract patterns:
Layer 1: 'high roads + high population = urban area'
Layer 2: 'urban area near water = flood risk premium'
Layer 3: 'flood risk premium + low literacy = underserved urban area'
h(1) = ReLU( W(1) . g )
h(2) = ReLU( W(2) . h(1) )
y_hat = sigma( W(3) . h(2) )
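The three-layer forward pass above can be sketched in plain Python; the tiny weight matrices standing in for W(1), W(2), W(3) are hypothetical, sized 3 → 2 → 2 → 1 rather than the full 50-dimensional system:

```python
import math

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def relu_vec(v):
    return [max(0.0, x) for x in v]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(g, W1, W2, W3):
    """y_hat = sigma(W3 . ReLU(W2 . ReLU(W1 . g)))"""
    h1 = relu_vec(matvec(W1, g))
    h2 = relu_vec(matvec(W2, h1))
    return [sigmoid(z) for z in matvec(W3, h2)]

# Hypothetical 3-feature cell and tiny weight matrices
g = [0.084, 0.34, 0.72]
W1 = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]]   # 3 features -> 2 hidden
W2 = [[1.0, -1.0], [0.4, 0.6]]              # 2 hidden -> 2 hidden
W3 = [[0.7, -0.3]]                          # 2 hidden -> 1 output
y_hat = forward(g, W1, W2, W3)
print(y_hat)  # a single number in (0, 1)
```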
Part 7: The Loss Function — Measuring How Wrong We Are
Once we have a prediction y_hat and the real answer y (the actual land price measured in the field), we apply Mean Squared Error (MSE):
L = (1/N) × SUM [ (y_hat_k - y_k)² ] for all k cells
When L = 0 there is no error: a perfect prediction. The square ensures the model is punished disproportionately for large mistakes.
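MSE is a single expression in Python; the prediction and ground-truth vectors below are made-up numbers for illustration:

```python
def mse(y_hat, y):
    """Mean squared error over N cells."""
    n = len(y)
    return sum((p - t) ** 2 for p, t in zip(y_hat, y)) / n

predictions = [0.50, 0.80, 0.10]    # model outputs for three cells
ground_truth = [0.55, 0.60, 0.10]   # field-measured values
print(round(mse(predictions, ground_truth), 6))  # 0.014167
```

Note how the 0.20 miss on the second cell dominates the total, while the exact hit on the third contributes nothing.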
Part 8: Gradient Descent — Teaching the Weights to Be Better
We want to reduce the loss L as much as possible. The fastest way down is to move each weight against the gradient of L, the direction in which L rises most steeply:
dL/dw = 2 × (error) × (input)
w_new = w_old - alpha × (dL/dw)
Where alpha is the learning rate, a small number like 0.001 that controls how big a step we take. The minus sign moves each weight against the gradient, so every step reduces the loss.
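The update rule can be watched in action on a one-weight toy problem (a single hypothetical input of 0.5 with ground truth 0.6): each step moves w against the gradient, and the prediction converges:

```python
# Toy gradient descent: one weight, one sample, L = (w*x - y)^2
x, y = 0.5, 0.6        # single input feature and its ground truth
w = 0.0                # initialised deliberately wrong
alpha = 0.1            # learning rate

for step in range(200):
    y_hat = w * x                      # forward pass
    grad = 2 * (y_hat - y) * x         # dL/dw = 2 * error * input
    w = w - alpha * grad               # move against the gradient

print(round(w, 4))  # approaches 1.2, since 1.2 * 0.5 = 0.6
```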
Part 9: Backpropagation — Spreading the Error Blame Backwards
In a multi-layer network we need gradients for every weight in every layer. Computing them is called backpropagation: the chain rule of calculus applied repeatedly, backwards through the network:
dL/dw = (dL/dh) × (dh/dw)
dL/dw(1) = (dL/dy_hat) × (dy_hat/dh(2)) × (dh(2)/dh(1)) × (dh(1)/dw(1))
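The chain of factors can be verified numerically. A sketch for a tiny two-layer network with scalar weights (all values hypothetical): the hand-coded chain-rule gradient should agree with a finite-difference estimate.

```python
import math

def forward(w1, w2, x):
    """Two-layer net: h = ReLU(w1 * x), y_hat = sigmoid(w2 * h)."""
    h = max(0.0, w1 * x)
    return 1.0 / (1.0 + math.exp(-w2 * h))

def grad_w1(w1, w2, x, y):
    """dL/dw1 via the chain rule, L = (y_hat - y)^2 (assumes w1*x > 0)."""
    h = w1 * x
    y_hat = 1.0 / (1.0 + math.exp(-w2 * h))
    dL_dyhat = 2 * (y_hat - y)
    dyhat_dh = y_hat * (1 - y_hat) * w2   # sigmoid derivative times w2
    dh_dw1 = x                            # ReLU acts as identity when input > 0
    return dL_dyhat * dyhat_dh * dh_dw1

w1, w2, x, y = 0.8, -0.5, 0.6, 0.3
analytic = grad_w1(w1, w2, x, y)
eps = 1e-6
numeric = ((forward(w1 + eps, w2, x) - y) ** 2 -
           (forward(w1 - eps, w2, x) - y) ** 2) / (2 * eps)
print(abs(analytic - numeric) < 1e-6)  # True: the chain rule agrees
```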
Part 10: Queries, Keys, and Values — The Mathematics of Attention
Physical planning is all about adjacencies. This relational binding is described as attention — a powerful idea introduced in the famous 2017 Google Brain paper 'Attention Is All You Need' by Vaswani et al. For any given grid cell, attention computes a weighted average of all other cells' information, where the weights are determined by how relevant each other cell is to this one.
Three projections of each cell's vector are created by multiplying by three learned weight matrices:
Query (Q): 'What is this cell looking for?'
Key (K): 'What does this cell advertise about itself?'
Value (V): 'What information does this cell actually share?'
q_i = W_Q . g_i
k_j = W_K . g_j
v_j = W_V . g_j
score(i,j) = (q_i . k_j) / sqrt(D) = SUM[ q_i_d × k_j_d ] / sqrt(D)
Softmax: Converting Scores to Probabilities
alpha_ij = e^score(i,j) / SUM[ e^score(i,k) ] for all k cells
z_i = SUM[ alpha_ij × v_j ] for all j cells
Attention(Q, K, V) = Softmax( Q × K_transposed / sqrt(D) ) × V
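A minimal sketch of scaled dot-product attention in plain Python, for three hypothetical cells with 2-dimensional projections (real systems use matrix libraries, but the arithmetic is identical):

```python
import math

def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """z_i = SUM_j(alpha_ij * v_j), alpha from scaled dot-product scores."""
    d = len(queries[0])
    outputs = []
    for q in queries:
        scores = [sum(qd * kd for qd, kd in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        alphas = softmax(scores)
        z = [sum(a * v[dim] for a, v in zip(alphas, values))
             for dim in range(len(values[0]))]
        outputs.append(z)
    return outputs

# Three hypothetical cells with 2-D query/key/value projections
Q = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[0.2, 0.8], [0.9, 0.1], [0.5, 0.5]]
Z = attention(Q, K, V)
print(Z)  # each row is a relevance-weighted blend of the value vectors
```

Because softmax weights sum to 1, every output stays inside the range spanned by the value vectors: each cell receives a blend, never an extrapolation.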
Part 11: Why Attention Is Perfect for Planning
The attention weights alpha_ij are content-dependent, not just distance-dependent. A cell 50 km away with a port may matter more than a cell 2 km away with nothing. They are learned, not set by a human — the model figures out what adjacency patterns matter for each type of prediction. And they are different for each head, so infrastructure adjacency and ecological adjacency are computed separately and then combined.
Part 12: Multi-Head Attention — Many Types of Looking
One attention head asks one type of question. But planning has many different types of relationships simultaneously — economic, environmental, infrastructural. We run H separate attention operations in parallel, each with its own W_Q, W_K, W_V:
MultiHead(Q,K,V) = Concat( head_1, ..., head_H ) × W_O
Each head specialises — through training — in a different kind of spatial relationship. The model discovers these specialisations on its own, from the data.
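The concatenation step can be sketched on its own; the two heads' outputs below are made-up numbers, and in the full model the concatenated vectors are then multiplied by W_O to mix the heads:

```python
def concat_heads(head_outputs):
    """head_outputs[h][i] is head h's output vector for cell i.
    Returns, per cell, the concatenation across all H heads."""
    n_cells = len(head_outputs[0])
    return [[x for head in head_outputs for x in head[i]]
            for i in range(n_cells)]

# Two hypothetical heads, two cells, 2-D outputs each
head_1 = [[0.1, 0.2], [0.3, 0.4]]   # e.g. an 'infrastructure adjacency' head
head_2 = [[0.9, 0.8], [0.7, 0.6]]   # e.g. an 'ecological adjacency' head
print(concat_heads([head_1, head_2]))
# [[0.1, 0.2, 0.9, 0.8], [0.3, 0.4, 0.7, 0.6]]
```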
Part 13: Positional Encoding
For geographic data, we encode latitude and longitude using sine and cosine waves of different frequencies:
p_(phi,lambda) = [ sin(phi/10000^(0/D)), cos(phi/10000^(0/D)), sin(phi/10000^(2/D)), cos(phi/10000^(2/D)), ..., sin(lambda/10000^(0/D)), cos(lambda/10000^(0/D)), ... ]
g_tilde_i = g_i + p_i
Cells that are far apart have very different encodings (low dot product). Cells that are close have similar ones (high dot product). Position and content information are added together; the network learns to separate the two during training.
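A sketch of the sinusoidal encoding for a single coordinate, with an illustrative D = 8; each sin/cos pair shares one frequency, and the frequencies are geometrically spaced:

```python
import math

def positional_encoding(pos, d):
    """Sinusoidal encoding: sin/cos pairs at geometrically spaced frequencies."""
    enc = []
    for i in range(0, d, 2):
        freq = 1.0 / (10000 ** (i / d))
        enc.append(math.sin(pos * freq))
        enc.append(math.cos(pos * freq))
    return enc

near = positional_encoding(10.0, 8)
far = positional_encoding(5000.0, 8)
dot = lambda a, b: sum(x * y for x, y in zip(a, b))
# An encoding's dot product with itself is D/2, since sin^2 + cos^2 = 1 per pair;
# distant positions score strictly lower against it
print(round(dot(near, near), 6))  # 4.0
```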
Part 14: Putting Everything Together — The Complete Spatial Planning Transformer
Input: N grid cells, each 50-dimensional feature vector
Normalise: Squish all features to [0,1]
Positional encoding: Add geographic coordinates as sinusoidal vectors
Embedding layer: Project 50D → D=256 or 512
Transformer blocks: Multi-head attention → residual + norm → FFN → residual + norm
Output head: Project to prediction target (land price, flood risk, etc.)
Loss: MSE (continuous) or Cross-Entropy (categories)
Training: Backpropagation + gradient descent
The residual connection adds the original input back after each attention block:
z_i_out = z_i_attention + g_i
Part 15: The Training Process — Teaching With Real Places
Step 1: Collect ground truth (satellite data + census + land records)
Step 2: Initialise weights randomly
Step 3: Forward pass — run calculations to get predictions y_hat
Step 4: Compute loss over a batch of, say, 32 cells: L = (1/32) × SUM (y_hat - y)²
Step 5: Backward pass — compute gradients via chain rule
Step 6: Update weights: w = w - alpha × (dL/dw)
Step 7: Repeat for thousands of batches across many regions
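Steps 2 to 7 can be condensed into a toy end-to-end loop: a single linear weight vector trained on three hypothetical cells (features and land prices invented for illustration) until the loss is effectively zero.

```python
import random

# Toy training loop: linear model y_hat = SUM(w * g), MSE loss, per-sample updates
cells = [([0.084, 0.34, 0.72], 0.20),   # (features, true land price)
         ([0.500, 0.10, 0.05], 0.70),
         ([0.250, 0.60, 0.30], 0.40)]

random.seed(0)
w = [random.uniform(-0.1, 0.1) for _ in range(3)]   # Step 2: random init
alpha = 0.5

def predict(w, g):
    return sum(wi * gi for wi, gi in zip(w, g))

for epoch in range(20000):                # Step 7: many iterations
    for g, y in cells:
        y_hat = predict(w, g)             # Step 3: forward pass
        error = y_hat - y                 # Step 4: loss is error^2
        grad = [2 * error * gi for gi in g]               # Step 5: gradients
        w = [wi - alpha * gi for wi, gi in zip(w, grad)]  # Step 6: update

loss = sum((predict(w, g) - y) ** 2 for g, y in cells) / len(cells)
print(loss < 1e-6)  # True: the toy model fits the three cells
```

The real system differs only in scale: millions of weights, batches of cells, and backpropagation through many layers instead of one.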
Part 16: Making Predictions on New, Unseen Places
The model outputs predictions for land price, urbanisation potential, economic growth, flood risk, traffic scenario, accident potential and whatever else we trained it on. The model has learned the grammar of how planning parameters work — not from one specific place, but from patterns that appear across all training locations. A new region that 'looks like' a combination of patterns the model has seen will get accurate predictions, even though the model has never visited that exact place.
Part 17: What Questions Can This System Answer?
Given a trained model, a physical planner can ask:
- If we build a new highway through these 12 grid cells, how does the economy get impacted?
- Which 5 cells in this region have the highest economic development potential given current constraints?
- This proposed alignment passes through cells with these features — what is the predicted resettlement requirement and environmental impact?
- Which cells in this new region most resemble cells in Pune that developed into successful industrial clusters?
- What combination of interventions — road, power, water — would have the highest impact per rupee of investment?
Summary: The Mathematics of Adjacency Holds the Key
y_hat_i = f( g_i, SUM[ alpha_ij × g_j ] ) for all j ≠ i
Planning is a heavily adjacency-dependent subject. The mathematics of attention captures this perfectly.
| Mathematical Component | Planning Equivalent |
|---|---|
| Grid cell vector g | Every measurable fact about 1 km² of land |
| Forward pass (matrix multiply + activate) | Making a prediction from known features |
| Gradient dL/dw | Which direction to adjust each weight |
| Gradient descent update | Improving the model one small step at a time |
| Multi-head attention | Multiple types of spatial relationships simultaneously |
| Residual connection | Never forgetting what you originally knew |
| Mixture of experts | Specialised models for roads, ecology, urban growth |
Tabulating the discussed concepts and their planning equivalents — to be updated with test results from secondary source data available for different geographies.