Update: The code is here https://github.com/farnasirim/max-flow-optimization.
What

There was this particular weighted matching problem that I needed to solve some time ago. The problem specification is not relevant here, but it's got only a few more constraints than a normal maximum weighted matching, and instead of actually thinking, I decided to reduce the whole thing to min-cost flow, similar to what one would do for a normal maximum matching.
For my actual usage, everything was fun and games. My C++ implementation would give me the answer almost instantly on a synthesized small input and in a few minutes on my actual data.
Now for some reason, I needed to implement the same thing in Python too. You see where this is going, right?
This is basically my status report from trying to understand why the exact same code runs 1000s of times (not kidding) slower in Python.
Some background

I used the augmenting path algorithm for finding a min-cost flow in the graph. At a very high level, it looks like this:
```
while we haven't reached the desired flow:
    find the cheapest path from source to sink
    send flow along that path
```

Your algorithm for the shortest-path subtask has to support negative edges (or you can use potential functions to handle those separately). I used the D'Esopo–Pape algorithm there.
On my small input, the C++ version takes about `0.1s`, and the Python version takes about `1.5s`.
Here’s a part of the Python version:
```python
def min_cost_flow(self, desired_flow, src, sink):
    # Initialization
    self.adj = [[] for _ in range(self.n)]
    self.cost = [[0] * self.n for _ in range(self.n)]
    self.capacity = [[0] * self.n for _ in range(self.n)]
    # Some boring init code omitted...

    flow, cost = 0, 0
    d, p = [], []
    while flow < desired_flow:
        d, p = self.shortest_paths_from(src)
        f = desired_flow - flow
        cur = sink
        while cur != src:
            f = min(f, self.capacity[p[cur]][cur])
            cur = p[cur]

        # Send the flow
        flow += f
        cost += f * d[sink]
        cur = sink
        while cur != src:
            self.capacity[p[cur]][cur] -= f
            self.capacity[cur][p[cur]] += f
            cur = p[cur]

    return flow, cost
```
```python
def shortest_paths_from(self, v0):
    d = [INF] * self.n
    p = [-1] * self.n
    d[v0] = 0
    m = [2] * self.n  # 2: never queued, 1: in queue, 0: previously popped
    q = collections.deque()
    q.append(v0)
    while len(q) > 0:
        u = q.popleft()
        m[u] = 0
        for v in self.adj[u]:
            if self.capacity[u][v] > 0 and \
                    d[v] > d[u] + self.cost[u][v]:
                d[v] = d[u] + self.cost[u][v]
                p[v] = u
                if m[v] == 2:
                    m[v] = 1
                    q.append(v)
                elif m[v] == 0:
                    m[v] = 1
                    q.appendleft(v)
    return d, p
```
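To make the snippets above concrete, here is a minimal self-contained harness. The `MCMFGraph` wrapper, the `add_edge` helper, the `d[sink] == INF` guard, and the test graph are my own additions for illustration, not from the original code:

```python
import collections

INF = float('inf')


class MCMFGraph:
    """Hypothetical wrapper around the two routines above (names are mine)."""

    def __init__(self, n):
        self.n = n
        self.adj = [[] for _ in range(n)]
        self.cost = [[0] * n for _ in range(n)]
        self.capacity = [[0] * n for _ in range(n)]

    def add_edge(self, u, v, cap, cost):
        # Forward edge plus a zero-capacity residual edge with negated cost.
        self.adj[u].append(v)
        self.adj[v].append(u)
        self.capacity[u][v] = cap
        self.cost[u][v] = cost
        self.cost[v][u] = -cost

    def shortest_paths_from(self, v0):
        d = [INF] * self.n
        p = [-1] * self.n
        d[v0] = 0
        m = [2] * self.n
        q = collections.deque([v0])
        while q:
            u = q.popleft()
            m[u] = 0
            for v in self.adj[u]:
                if self.capacity[u][v] > 0 and d[v] > d[u] + self.cost[u][v]:
                    d[v] = d[u] + self.cost[u][v]
                    p[v] = u
                    if m[v] == 2:
                        m[v] = 1
                        q.append(v)
                    elif m[v] == 0:
                        m[v] = 1
                        q.appendleft(v)
        return d, p

    def min_cost_flow(self, desired_flow, src, sink):
        flow, cost = 0, 0
        while flow < desired_flow:
            d, p = self.shortest_paths_from(src)
            if d[sink] == INF:
                break  # guard (mine): no augmenting path left
            f = desired_flow - flow
            cur = sink
            while cur != src:
                f = min(f, self.capacity[p[cur]][cur])
                cur = p[cur]
            flow += f
            cost += f * d[sink]
            cur = sink
            while cur != src:
                self.capacity[p[cur]][cur] -= f
                self.capacity[cur][p[cur]] += f
                cur = p[cur]
        return flow, cost


g = MCMFGraph(4)
g.add_edge(0, 1, 2, 1)
g.add_edge(0, 2, 1, 2)
g.add_edge(1, 2, 1, 1)
g.add_edge(1, 3, 1, 3)
g.add_edge(2, 3, 2, 1)
print(g.min_cost_flow(3, 0, 3))  # (3, 10): flow 3 at total cost 10
```

The answer is forced here: the only way to push 3 units into node 3 uses every sink-facing edge to capacity, for a total cost of 10.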
Show me the slow

Well, I would guess that `shortest_paths_from` is the suspect here, with the largest share of the total running time. Let's verify that. I wrote this decorator to measure the time a piece of code takes to run:
```python
global_info = collections.Counter()


def measure_time_by_key(time_key):
    def func_decorator(func):
        def decorator(*args, **kwargs):
            st = time.process_time_ns()
            return_value = func(*args, **kwargs)
            fin = time.process_time_ns()
            global_info[time_key] += fin - st
            return return_value
        return decorator
    return func_decorator
```
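As a quick illustration, decorating any function accumulates its runtime under the given key (the decorator is repeated so the snippet runs standalone; `busy_work` is a toy function of mine, not from the original code):

```python
import collections
import time

global_info = collections.Counter()


def measure_time_by_key(time_key):
    def func_decorator(func):
        def decorator(*args, **kwargs):
            st = time.process_time_ns()
            return_value = func(*args, **kwargs)
            fin = time.process_time_ns()
            global_info[time_key] += fin - st
            return return_value
        return decorator
    return func_decorator


# Hypothetical example function, just to exercise the decorator:
@measure_time_by_key("busy_work")
def busy_work(n):
    return sum(i * i for i in range(n))


busy_work(100_000)
busy_work(100_000)
# global_info["busy_work"] now holds the total CPU nanoseconds of both calls
print(global_info["busy_work"])
```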
This is pretty handy, as you can measure time at a finer grain than whole functions too. For example, to measure the time it takes to manipulate the paths and apply the flow inside `min_cost_flow`, we can indent the relevant lines into a local closure, capture the local variables, and measure the time it takes for the closure to run:
```python
while ...:
    ...
    f = desired_flow - flow

    @measure_time_by_key("applying the flow")
    def _internal_apply_flow():
        nonlocal flow, cost, f
        cur = sink
        while cur != src:
            f = min(f, self.capacity[p[cur]][cur])
            cur = p[cur]

        # Send the flow
        flow += f
        cost += f * d[sink]
        cur = sink
        while cur != src:
            self.capacity[p[cur]][cur] -= f
            self.capacity[cur][p[cur]] += f
            cur = p[cur]

    _internal_apply_flow()
```
Anyway, here's the result in our case (for a VERY small input):
```
shortest_paths_from took 1.428 seconds
min_cost_flow       took 1.431 seconds
```
Is it the deque operations, then? Well, not really. In hindsight, this is obvious: each operation on the queue may be expensive, but their count is far smaller than the number of times we just check whether or not we can update the shortest path using an edge:
```
deque operations    took 0.012 seconds
shortest_paths_from took 1.800 seconds
path_stuff          took 0.000 seconds
min_cost_flow       took 1.804 seconds
```
So I looked at the bytecode of the inner loop. All I really saw at first glance was the number of LOAD operations. I'm sure you, with a completely unbiased opinion, now agree.
```
 37          94 SETUP_LOOP             162 (to 258)
             96 LOAD_FAST                0 (self)
             98 LOAD_ATTR                7 (adj)
            100 LOAD_FAST                6 (u)
            102 BINARY_SUBSCR
            104 GET_ITER
        >>  106 FOR_ITER               148 (to 256)
            108 STORE_FAST               7 (v)

 38         110 LOAD_FAST                0 (self)
            112 LOAD_ATTR                8 (capacity)
            114 LOAD_FAST                6 (u)
            116 BINARY_SUBSCR
            118 LOAD_FAST                7 (v)
            120 BINARY_SUBSCR
            122 LOAD_CONST               2 (0)
            124 COMPARE_OP               4 (>)
            126 POP_JUMP_IF_FALSE      106

 39         128 LOAD_FAST                2 (d)
            130 LOAD_FAST                7 (v)
            132 BINARY_SUBSCR
            134 LOAD_FAST                2 (d)
            136 LOAD_FAST                6 (u)
            138 BINARY_SUBSCR
            140 LOAD_FAST                0 (self)
            142 LOAD_ATTR                9 (cost)
            144 LOAD_FAST                6 (u)
            146 BINARY_SUBSCR
            148 LOAD_FAST                7 (v)
            150 BINARY_SUBSCR
            152 BINARY_ADD
            154 COMPARE_OP               4 (>)
            156 POP_JUMP_IF_FALSE      106

 40         158 LOAD_FAST                2 (d)
            160 LOAD_FAST                6 (u)
            162 BINARY_SUBSCR
            164 LOAD_FAST                0 (self)
            166 LOAD_ATTR                9 (cost)
            168 LOAD_FAST                6 (u)
            170 BINARY_SUBSCR
            172 LOAD_FAST                7 (v)
            174 BINARY_SUBSCR
            176 BINARY_ADD
            178 LOAD_FAST                2 (d)
            180 LOAD_FAST                7 (v)
            182 STORE_SUBSCR

 41         184 LOAD_FAST                6 (u)
            186 LOAD_FAST                3 (p)
            188 LOAD_FAST                7 (v)
            190 STORE_SUBSCR

 42         192 LOAD_FAST                4 (m)
            194 LOAD_FAST                7 (v)
            196 BINARY_SUBSCR
            198 LOAD_CONST               3 (2)
            200 COMPARE_OP               2 (==)
            202 POP_JUMP_IF_FALSE      224

 43         204 LOAD_CONST               4 (1)
            206 LOAD_FAST                4 (m)
            208 LOAD_FAST                7 (v)
            210 STORE_SUBSCR

 44         212 LOAD_FAST                5 (q)
            214 LOAD_METHOD              4 (append)
            216 LOAD_FAST                7 (v)
            218 CALL_METHOD              1
            220 POP_TOP
            222 JUMP_ABSOLUTE          106

 45     >>  224 LOAD_FAST                4 (m)
            226 LOAD_FAST                7 (v)
            228 BINARY_SUBSCR
            230 LOAD_CONST               2 (0)
            232 COMPARE_OP               2 (==)
            234 POP_JUMP_IF_FALSE      106

 46         236 LOAD_CONST               4 (1)
            238 LOAD_FAST                4 (m)
            240 LOAD_FAST                7 (v)
            242 STORE_SUBSCR

 47         244 LOAD_FAST                5 (q)
            246 LOAD_METHOD             10 (appendleft)
            248 LOAD_FAST                7 (v)
            250 CALL_METHOD              1
            252 POP_TOP
            254 JUMP_ABSOLUTE          106
        >>  256 POP_BLOCK
        >>  258 JUMP_ABSOLUTE           64
        >>  260 POP_BLOCK
```
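For reference, dumps like the one above come from the standard `dis` module. Here it's applied to a stand-in function (my own, since the original class isn't reproduced here), along with a programmatic count of the LOAD instructions via `dis.get_instructions`:

```python
import dis


def relax_edges(capacity, cost, adj, d, p, u):
    # Stand-in for the inner loop of shortest_paths_from
    for v in adj[u]:
        if capacity[u][v] > 0 and d[v] > d[u] + cost[u][v]:
            d[v] = d[u] + cost[u][v]
            p[v] = u


dis.dis(relax_edges)  # prints a disassembly similar to the dump above

# Count how many of the emitted instructions are LOADs:
loads = sum(1 for ins in dis.get_instructions(relax_edges)
            if ins.opname.startswith("LOAD"))
print(loads)
```

The exact opcodes and counts vary across CPython versions, but the shape of the dump is the same.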
Now, can we reduce the number of LOADs? It turns out we can. If we start thinking like the compiler, keep an open mind about losing code beauty, and account for the shortcomings of the actual compiler, we can see that there are a few extra LOADs: the variable `u` is invariant throughout the inner loop, so the `u`-indexed lookups can be hoisted out of it.
Let’s see the results:
```
real	0m0.790s
user	0m0.776s
sys	0m0.010s
```

Unreal! The compiler hasn't been doing any of this for us. That's about a 2x performance improvement over the original code (about 1.5 seconds) on the small synthesized test case.
On the bigger test case, this improves the running time from ~15 minutes to ~7 minutes.
Bottom line

Unfortunately, I did lots more without anything to show for it. Almost everything else that I tried (from using C native types to fooling around with how I stored things) ended up degrading performance.
I ended up achieving the performance I was looking for by using another algorithm, but the question of why this (still) runs thousands of times slower than the C++ version remains open for me.