In this tutorial, we implement a production-grade, large-scale graph analytics pipeline in NetworKit 11.2.1, focusing on speed, memory efficiency, and version-safe APIs. We generate a large scale-free network, extract the largest connected component, and then compute structural backbone signals via k-core decomposition and centrality ranking. We also detect communities with PLM and quantify quality using modularity; estimate distance structure using effective and estimated diameters; and, finally, sparsify the graph to reduce cost while preserving key properties. We export the sparsified graph as an edge list so we can reuse it in downstream workflows, benchmarking, or graph ML preprocessing.

```python
!pip -q install networkit pandas numpy psutil

import gc, time, os
import numpy as np
import pandas as pd
import psutil
import networkit as nk

print("NetworKit:", nk.__version__)
nk.setNumberOfThreads(min(2, nk.getMaxNumberOfThreads()))
nk.setSeed(7, False)

def ram_gb():
    p = psutil.Process(os.getpid())
    return p.memory_info().rss / (1024**3)

def tic():
    return time.perf_counter()

def toc(t0, msg):
    print(f"{msg}: {time.perf_counter()-t0:.3f}s | RAM~{ram_gb():.2f} GB")

def report(G, name):
    print(f"\n[{name}] nodes={G.numberOfNodes():,} edges={G.numberOfEdges():,} directed={G.isDirected()} weighted={G.isWeighted()}")

def force_cleanup():
    gc.collect()

PRESET = "LARGE"

if PRESET == "LARGE":
    N = 120_000
    M_ATTACH = 6
    AB_EPS = 0.12
    ED_RATIO = 0.9
elif PRESET == "XL":
    N = 250_000
    M_ATTACH = 6
    AB_EPS = 0.15
    ED_RATIO = 0.9
else:
    N = 80_000
    M_ATTACH = 6
    AB_EPS = 0.10
    ED_RATIO = 0.9

print(f"\nPreset={PRESET} | N={N:,} | m={M_ATTACH} | approx-betweenness epsilon={AB_EPS}")
```

We set up the Colab environment with NetworKit and monitoring utilities, and we lock in a stable random seed. We configure thread usage to match the runtime and define timing and RAM-tracking helpers for each major stage.
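As a quick sanity check on these presets, the expected edge count can be estimated up front: a Barabási–Albert generator attaches roughly m edges per new node, so the graph has approximately N·m edges. This is a back-of-the-envelope sketch using the preset values above (exact counts vary slightly per run):

```python
# Rough sizing per preset: a BA generator attaches ~m edges per new node,
# so the edge count is approximately N * m (exact values vary slightly).
presets = {
    "DEFAULT": dict(N=80_000,  m=6),
    "LARGE":   dict(N=120_000, m=6),
    "XL":      dict(N=250_000, m=6),
}
for name, p in presets.items():
    approx_edges = p["N"] * p["m"]
    print(f"{name:8s} nodes={p['N']:>8,} edges~{approx_edges:>10,}")
```

For the LARGE preset this works out to roughly 720,000 edges, which fits comfortably in a Colab runtime while still being large enough to justify the approximation knobs.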
We choose a scale preset that controls graph size and approximation knobs so the pipeline stays large but manageable.

```python
t0 = tic()
G = nk.generators.BarabasiAlbertGenerator(M_ATTACH, N).generate()
toc(t0, "Generated BA graph")
report(G, "G")

t0 = tic()
cc = nk.components.ConnectedComponents(G)
cc.run()
toc(t0, "ConnectedComponents")
print("components:", cc.numberOfComponents())

if cc.numberOfComponents() > 1:
    t0 = tic()
    G = nk.graphtools.extractLargestConnectedComponent(G, compactGraph=True)
    toc(t0, "Extracted LCC (compactGraph=True)")
    report(G, "LCC")

force_cleanup()
```

We generate a large Barabási–Albert graph and immediately log its size and runtime footprint. We compute connected components to understand fragmentation and quickly diagnose topology. We extract the largest connected component and compact it to improve the rest of the pipeline’s performance and reliability.

```python
t0 = tic()
core = nk.centrality.CoreDecomposition(G)
core.run()
toc(t0, "CoreDecomposition")
core_vals = np.array(core.scores(), dtype=np.int32)
print("degeneracy (max core):", int(core_vals.max()))
print("core stats:", pd.Series(core_vals).describe(percentiles=[0.5, 0.9, 0.99]).to_dict())

k_thr = int(np.percentile(core_vals, 97))
t0 = tic()
nodes_backbone = [u for u in range(G.numberOfNodes()) if core_vals[u] >= k_thr]
G_backbone = nk.graphtools.subgraphFromNodes(G, nodes_backbone)
toc(t0, f"Backbone subgraph (k>={k_thr})")
report(G_backbone, "Backbone")
force_cleanup()

t0 = tic()
pr = nk.centrality.PageRank(G, damp=0.85, tol=1e-8)
pr.run()
toc(t0, "PageRank")
pr_scores = np.array(pr.scores(), dtype=np.float64)
top_pr = np.argsort(-pr_scores)[:15]
print("Top PageRank nodes:", top_pr.tolist())
print("Top PageRank scores:", pr_scores[top_pr].tolist())

t0 = tic()
abw = nk.centrality.ApproxBetweenness(G, epsilon=AB_EPS)
abw.run()
toc(t0, "ApproxBetweenness")
abw_scores = np.array(abw.scores(), dtype=np.float64)
top_abw = np.argsort(-abw_scores)[:15]
print("Top ApproxBetweenness nodes:", top_abw.tolist())
print("Top ApproxBetweenness scores:", abw_scores[top_abw].tolist())
force_cleanup()
```

We compute the core decomposition to measure degeneracy and identify the network’s high-density backbone. We extract a backbone subgraph using a high core-percentile threshold to focus on structurally important nodes. We run PageRank and approximate betweenness to rank nodes by influence and bridge-like behavior at scale.

```python
t0 = tic()
plm = nk.community.PLM(G, refine=True, gamma=1.0, par="balanced")
plm.run()
toc(t0, "PLM community detection")
part = plm.getPartition()
num_comms = part.numberOfSubsets()
print("communities:", num_comms)

t0 = tic()
Q = nk.community.Modularity().getQuality(part, G)
toc(t0, "Modularity")
print("modularity Q:", Q)

sizes = np.array(list(part.subsetSizeMap().values()), dtype=np.int64)
print("community size stats:", pd.Series(sizes).describe(percentiles=[0.5, 0.9, 0.99]).to_dict())
```

t0 = tic() eff = nk.dist