
Commit 5e2c77f

committed
feat: Add files for profiling blogpost
1 parent 724fe63 commit 5e2c77f

20 files changed

Lines changed: 177 additions & 20 deletions

_posts/2025-11-25-a-tale-of-two-kernels.markdown

Lines changed: 0 additions & 2 deletions
```diff
@@ -5,8 +5,6 @@ date: 2025-11-25 20:36:38 +0100
 categories: blog
 ---

-### Overview
-
 For developers integrating Haskell into data science workflows or interactive documentation, the Jupyter notebook is the standard interface. Currently, there are two primary ways to run Haskell in Jupyter: [IHaskell](https://github.com/IHaskell/IHaskell) and [xeus-haskell](https://github.com/jupyter-xeus/xeus-haskell).

 While both achieve the same end user experience (executing Haskell code in cells) their internal architectures represent fundamentally different engineering trade-offs:
```
Lines changed: 154 additions & 0 deletions
@@ -0,0 +1,154 @@

---
layout: post
title: "Exploring GHC profiling data in Jupyter"
date: 2025-12-26 14:01:38 +0100
categories: blog
---
Exploratory data analysis (EDA) isn't just for data scientists. Anyone who uses a system that emits data can benefit from the tools of EDA. And since charity begins at home, what better way to motivate this than a short post using DataHaskell tools to analyse GHC profiling logs?

By treating performance analysis as a data exploration problem, we can unlock insights that are hard to see in a static report.

This article is for you if:

* You've profiled Haskell programs before and want a more flexible way to explore the data beyond what static tools provide: filtering, joining, and comparing runs programmatically.
* You're comfortable with data tools (pandas, R, SQL) but haven't worked with GHC's profiling infrastructure: this post will show you how to bridge that gap.
* You want to compare performance across code changes by diffing two profiling runs to see exactly what got better or worse, attributed to specific cost centres.
* Or you're just curious about DataHaskell!

## The example program

For our scenario today, we're going to compare the memory behaviour of two variants of a simple summing function. The first is a textbook strict fold:

```haskell
sumFast :: B.ByteString -> Int
sumFast bs =
  let xs = parseAll bs
  in foldlStrict (\ !acc x -> acc + x) 0 xs
```

The second version is almost identical, but it retains a sample of partial sums as it runs: a controlled memory leak that accumulates data over time.

```haskell
data Acc = Acc !Int !Int !SampleList

sumLeaky :: B.ByteString -> Int
sumLeaky bs =
  let xs = parseAll bs
      Acc s _ samples = foldlStrict step (Acc 0 0 Nil) xs
  in sampleLength samples `seq` s
  where
    step (Acc s i history) x =
      let !s' = s + x
          !i' = i + 1
          !history' = if i' `rem` 50000 == 0
                        then Cons s' history
                        else history
      in Acc s' i' history'
```

Both compute the same result. Both have similar runtime. But one retains memory that the other doesn't, and we want to see that difference in the profiling data.
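
The snippets above lean on a few helpers (`SampleList`, `foldlStrict`, `parseAll`) that aren't shown. Here's a minimal, self-contained sketch of what they might look like, assuming the input is newline-separated decimal integers (the names mirror the snippets; the actual definitions in the example program may differ):

```haskell
{-# LANGUAGE BangPatterns #-}

import qualified Data.ByteString.Char8 as B

-- A simple strict cons-list used to retain sampled partial sums.
data SampleList = Nil | Cons !Int SampleList

sampleLength :: SampleList -> Int
sampleLength = go 0
  where
    go !n Nil         = n
    go !n (Cons _ xs) = go (n + 1) xs

-- A left fold that forces the accumulator at every step.
foldlStrict :: (b -> a -> b) -> b -> [a] -> b
foldlStrict f = go
  where
    go !acc []       = acc
    go !acc (x : xs) = go (f acc x) xs

-- Lazily parse newline-separated decimal integers.
parseAll :: B.ByteString -> [Int]
parseAll bs = case B.readInt (B.dropWhile (== '\n') bs) of
  Nothing        -> []
  Just (n, rest) -> n : parseAll rest
```

The laziness of `parseAll` matters: the list of parsed integers is produced on demand, so `sumFast` can consume it in constant space while `sumLeaky`'s retained `SampleList` grows.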

To generate eventlogs, you need to:

1. Compile with profiling enabled: `cabal build --enable-profiling --profiling-detail=all-functions -O2`
2. Run with the right RTS flags: `cabal run your-program -- +RTS -hc -l-agu -RTS`

Here `-hc` requests heap profiling by cost centre (we'll explain what that means shortly), and `-l-agu` enables the eventlog while disabling some less relevant event classes. This produces a `.eventlog` file alongside your executable. For our example, we generate two: `fast.eventlog` and `leaky.eventlog`.

The standard tool for visualising these logs is eventlog2html, which produces beautiful interactive HTML reports.

![A heap usage graph generated by eventlog2html](/images/heap_usage_eventlog2html.png)
![A cost centre usage graph generated by eventlog2html](/images/cost_center_breakdown_eventlog2html.png)

It's excellent for getting a quick overview, but sometimes you need more. You might want to filter to specific cost centres, compare two runs side by side, compute derived metrics, or ask questions the tool's authors didn't anticipate.

We'd like to see whether we can get all of this, and more, using a SQL-like API to explore our eventlog data. In this post we'll do two things: first, show how to regenerate the heap usage chart from eventlog2html; second, show how to use operations like join to diff two runs.

## Generating eventlogs

GHC can emit detailed runtime information into eventlog files. These binary logs contain a stream of timestamped events: garbage collection statistics, heap samples, cost centre attributions, and more.

## From eventlogs to dataframes

An eventlog is just a pile of raw facts: disjointed descriptions of what happened and when. To turn it into answers, we need tools that can ingest event streams, normalise timestamps, join disparate metrics, and visualise relationships.

We'll use the ghc-events library to parse the binary format, then load the data into a DataFrame. This gives us a familiar columnar interface: if you've used pandas or dplyr, the operations should feel natural.

We write some [custom code to parse eventlog files](https://gist.github.com/mchav/ae93567450075fb65d0579254f7dc406) into structures that we'll eventually turn into dataframes.
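
The linked gist is the authoritative version; as a rough sketch of the approach, the core of it looks something like this (field and constructor names follow recent versions of `ghc-events` and may differ slightly in yours):

```haskell
import qualified Data.Map.Strict as Map
import qualified Data.Text as T
import qualified Data.Vector.Unboxed as VU
import Data.Word (Word32, Word64)
import GHC.RTS.Events (Data (..), Event (..), EventInfo (..), EventLog (..), readEventLogFromFile)

-- Eventlog timestamps are nanoseconds; convert to seconds for plotting.
toSeconds :: Word64 -> Double
toSeconds ns = fromIntegral ns / 1e9

-- One pass over the event stream: collect the cost-centre id -> label map
-- and the (time, cost-centre id, residency-in-bytes) heap samples.
heapRows :: FilePath -> IO (Map.Map Word32 String, [(Double, Word32, Word64)])
heapRows path = do
  parsed <- readEventLogFromFile path
  case parsed of
    Left err   -> error err
    Right elog -> pure (foldr step (Map.empty, []) (events (dat elog)))
  where
    step ev (labels, rows) = case evSpec ev of
      -- Metadata event: maps a cost-centre id to its label and module.
      HeapProfCostCentre { heapProfCostCentreId = ccId
                         , heapProfLabel = lbl
                         , heapProfModule = m } ->
        (Map.insert ccId (T.unpack lbl <> " (" <> T.unpack m <> ")") labels, rows)
      -- A heap sample, attributed (here) to the top of the cost-centre stack.
      HeapProfSampleCostCentre { heapProfResidency = res
                               , heapProfStack = stack }
        | not (VU.null stack) ->
            (labels, (toSeconds (evTime ev), VU.head stack, res) : rows)
      _ -> (labels, rows)
```

From here, the label map and the sample rows can be zipped together into the three-column table shown below.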

![Reading eventlogs into a heap profile dataframe](/images/read_into_dataframe.png)

After parsing, we have a table with three columns: `time`, `cc_label` (the cost centre label), and `residency` (bytes retained):

![Sampling the heap dataframe](/images/sample_heapdf.png)

### A quick note on cost centres

A cost centre is GHC's unit of attribution for profiling. When you compile with `--profiling-detail=all-functions`, GHC inserts cost centres at function boundaries. Each heap sample then records how much memory is attributed to each cost centre.

Cost centre labels look like `sumLeaky.step.history' (Main)`: that's the binding `history'` defined inside `step`, inside `sumLeaky`, in the `Main` module. These hierarchical names let you trace allocations back to specific expressions in your code.
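
To make those hierarchical names queryable, it can help to split a label into its call-path components and module. A tiny hypothetical parser (assuming the `path.to.binding (Module)` shape shown above):

```haskell
-- Split "sumLeaky.step.history' (Main)" into (["sumLeaky", "step", "history'"], "Main").
parseLabel :: String -> ([String], String)
parseLabel s =
  let (path, rest) = break (== '(') s
      modName      = takeWhile (/= ')') (drop 1 rest)
      -- Turn the dotted call path into words by mapping '.' to a space.
      comps        = words (map (\c -> if c == '.' then ' ' else c) path)
  in (comps, modName)
```

With the path split out, you can filter samples to everything defined under `sumLeaky`, say, rather than matching on raw strings.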

## Aggregation

Raw performance data is noisy. The runtime might emit a "Heap Size" event every few microseconds, creating thousands of data points per second of runtime. Using DataFrame functions, we can group these by a coarser time grain and calculate the mean, effectively "smoothing" the signal.
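
The DataFrame library expresses this as a group-by; the underlying bucketing-and-averaging idea can be sketched in plain Haskell with `Data.Map` (a sketch of the concept, not the library's API):

```haskell
import qualified Data.Map.Strict as Map

-- Bucket (seconds, label, bytes) samples into time grains per label,
-- then average the residency within each bucket.
meanResidency :: Double -> [(Double, String, Double)] -> Map.Map (Int, String) Double
meanResidency grain samples =
  Map.map (\(total, n) -> total / fromIntegral n) sums
  where
    sums :: Map.Map (Int, String) (Double, Int)
    sums = Map.fromListWith
      (\(a, m) (b, n) -> (a + b, m + n))
      [ ((floor (t / grain), lbl), (bytes, 1)) | (t, lbl, bytes) <- samples ]
```

A grain of `1.0` collapses every cost centre's samples into one mean per second, which is plenty of resolution for a chart.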

![Aggregate heap by cost centre](/images/aggregate_residency.png)

![A sample of the aggregated residency dataframe](/images/sample_aggregate_residency.png)

Now we have one row per cost centre, with summary statistics. But the real power comes when we compare runs.

## Joining two profiling runs

To diff the fast and leaky versions, we perform a full outer join on the cost centre label:

![A diff of memory usage between runs](/images/diff_residency_of_runs.png)

The result tells us exactly where the memory went: the `history'` binding, where we cons onto the sample list, accounts for the retained memory.
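
Conceptually, the full outer join does the following (again a plain-`Data.Map` sketch rather than the DataFrame API, treating a cost centre missing from one run as zero bytes on that side):

```haskell
import qualified Data.Map.Strict as Map

-- Full outer join of two (label -> mean bytes) tables, plus the difference.
diffRuns :: Map.Map String Double -> Map.Map String Double
         -> Map.Map String (Double, Double, Double)
diffRuns fast leaky =
  Map.fromList
    [ (lbl, (f, l, l - f))
    | lbl <- Map.keys (Map.union fast leaky)
    , let f = Map.findWithDefault 0 lbl fast
          l = Map.findWithDefault 0 lbl leaky
    ]
```

Sorting the result by the third column surfaces the biggest regressions first; that's how `history'` floats to the top.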

## Correlating different metrics

Eventlogs contain multiple types of events. Beyond heap samples by cost centre, we get:

* `HeapLive`: actual live bytes after GC
* `HeapSize`: total heap size requested from the OS
* `BlocksSize`: memory in the block allocator

We can parse these into separate DataFrames, aggregate by time, and join them together:

![Join the different memory usages](/images/join_heap_bytes.png)

This gives us a wide table where each row is a time bucket, and columns show different metrics for each run:

![A sample of the joined heap bytes](/images/sample_join_heap_bytes.png)
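
The same outer-join idea builds that wide table, this time keyed on the time bucket instead of the label; sketched with `Data.Map` (buckets present in only one series get `Nothing` on the other side):

```haskell
import qualified Data.Map.Strict as Map

-- Align two per-bucket metric series (e.g. live bytes and heap size)
-- into one wide table keyed by time bucket.
alignMetrics :: Map.Map Int Double -> Map.Map Int Double
             -> Map.Map Int (Maybe Double, Maybe Double)
alignMetrics live size =
  Map.unionWith
    (\(a, _) (_, b) -> (a, b))   -- bucket in both series: keep both values
    (Map.map (\v -> (Just v, Nothing)) live)
    (Map.map (\v -> (Nothing, Just v)) size)
```

Adding a third or fourth metric is just another `alignMetrics`-style union, which is exactly what a multi-way join gives you for free in the DataFrame world.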

## Plotting the results

Now we can recreate the famous eventlog2html heap usage chart with both runs on the same axes.

![A recreation of the eventlog2html plot but with two runs](/images/full_heap_chart.png)

We can "zoom in" on the chart by taking the first 5 rows to see the difference in live bytes.

![A chart showing just the first few seconds of the run](/images/zoomed_in_chart.png)

The leaky version shows steadily increasing live bytes; the fast version stays flat. But more importantly, we can now query this data: what's the rate of growth? When does the leak become significant? Does block size track live bytes, or does the allocator hold onto memory after GC reclaims it?
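
Questions like the first one are a short function away once the data is tabular. For instance, a least-squares slope over (time, live bytes) points gives the leak's growth rate in bytes per second (a sketch; it assumes at least two distinct time values):

```haskell
-- Least-squares slope of (seconds, bytes) points: growth in bytes/second.
growthRate :: [(Double, Double)] -> Double
growthRate pts = covXY / varX
  where
    n     = fromIntegral (length pts)
    mx    = sum (map fst pts) / n
    my    = sum (map snd pts) / n
    covXY = sum [ (x - mx) * (y - my) | (x, y) <- pts ]
    varX  = sum [ (x - mx) ^ (2 :: Int) | (x, _) <- pts ]
```

Run over the leaky series this should come out clearly positive, and roughly zero for the fast one.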

As an added bonus, we can run our profiling code in GHCi or a regular script as well and produce similar output.

![A profiling chart in the terminal](/images/terminal_run.png)

## What we learned

The notebook captures the entire pipeline from raw eventlog to final visualisation. Re-run it after a code change and you get an updated diff automatically. Because everything is in DataFrames, we can filter to specific time ranges, compute ratios, correlate with GC events, or export to CSV for further analysis in other tools.

This gives us a flexible way of exploring performance.

The full notebook is available to explore in the [DataHaskell playground](https://ulwazi-exh9dbh2exbzgbc9.westus-01.azurewebsites.net/lab/tree/examples/performance_exploration.ipynb).

## What's next?

We are currently turning this workflow into a dedicated library, and we're talking to developers to find out which data and charts are most useful for understanding performance. Our goal is to provide a seamless bridge between the GHC RTS and high-level analysis tools. Along with the library, we will publish guides on using these data-science techniques to diagnose specific performance pathologies.

As always, watch this space, and if this sort of work is interesting to you, hop over to the DataHaskell Discord to get in on the action.

_sass/_layout.scss

Lines changed: 2 additions & 1 deletion
```diff
@@ -12,7 +12,7 @@

 .site-title {
   font-size: 26px;
-  font-weight: 300;
+  font-weight: 400;
   line-height: 56px;
   letter-spacing: -1px;
   margin-bottom: 0;
@@ -215,6 +215,7 @@

 .post-content {
   margin-bottom: $spacing-unit;
+  padding: 1em;

   h2 {
     font-size: 32px;
```

_sass/_normalize.scss

Lines changed: 2 additions & 2 deletions
```diff
@@ -114,7 +114,7 @@ abbr[title] {

 b,
 strong {
-  font-weight: bold;
+  font-weight: bolder;
 }

 /**
@@ -404,7 +404,7 @@ textarea {
 */

 optgroup {
-  font-weight: bold;
+  font-weight: bolder;
 }

 /* Tables
```

assets/css/main.css

Lines changed: 4 additions & 4 deletions
```diff
@@ -1569,7 +1569,7 @@
   color: black;
   font-family: 'Lato', sans-serif;
   font-size: 15pt;
-  font-weight: 300;
+  font-weight: 400;
   letter-spacing: 0.025em;
   line-height: 1.75em;
 }
@@ -1589,7 +1589,7 @@
 }

 strong, b {
-  font-weight: 400;
+  font-weight: 900;
 }

 p, ul, ol, dl, table, blockquote {
@@ -1598,7 +1598,7 @@

 h1, h2, h3, h4, h5, h6 {
   color: inherit;
-  font-weight: 300;
+  font-weight: 900;
   line-height: 1.75em;
   margin-bottom: 1em;
   text-transform: uppercase;
@@ -2202,7 +2202,7 @@
 }

 #header h1 span {
-  font-weight: 300;
+  font-weight: 900;
 }

 #header nav {
```

assets/sass/main.scss

Lines changed: 3 additions & 3 deletions
```diff
@@ -70,7 +70,7 @@
   color: _palette(fg);
   font-family: 'Lato', sans-serif;
   font-size: 15pt;
-  font-weight: 300;
+  font-weight: 400;
   letter-spacing: 0.025em;
   line-height: 1.75em;
 }
@@ -96,7 +96,7 @@

 h1, h2, h3, h4, h5, h6 {
   color: inherit;
-  font-weight: 300;
+  font-weight: 400;
   line-height: 1.75em;
   margin-bottom: 1em;
   text-transform: uppercase;
@@ -503,7 +503,7 @@
 }

 th {
-  font-weight: 400;
+  font-weight: 500;
   padding: 0.5em 1em 0.5em 1em;
   text-align: left;
 }
```

blog.html

Lines changed: 11 additions & 7 deletions
```diff
@@ -6,16 +6,20 @@

 <div class="posts">
   {% for post in site.posts %}
-  <article class="post">
+  <div style="padding: 1em; margin: 1em; border-style: solid; border-radius: 1em; border-color: grey; border-width: 0.1em;">
+    <hr>
+    <article class="post">

-    <h1><a href="{{ site.baseurl }}{{ post.url }}">{{ post.title }}</a></h1>
+      <h1><a href="{{ site.baseurl }}{{ post.url }}">{{ post.title }}</a></h1>

-    <div class="entry">
-      {{ post.excerpt }}
-    </div>
+      <div class="entry">
+        {{ post.excerpt }}
+      </div>

-    <a href="{{ site.baseurl }}{{ post.url }}" class="read-more">Read More</a>
-  </article>
+      <a href="{{ site.baseurl }}{{ post.url }}" class="read-more">Read More</a>
+    </article>
+    <hr>
+  </div>
   {% endfor %}
 </div>
```

css/main.scss

Lines changed: 1 addition & 1 deletion
```diff
@@ -8,7 +8,7 @@
 // Our variables
 $base-font-family: "Helvetica Neue", Helvetica, Arial, sans-serif;
 $base-font-size:   16px;
-$base-font-weight: 400;
+$base-font-weight: 500;
 $small-font-size:  $base-font-size * 0.875;
 $base-line-height: 1.5;
```
images/aggregate_residency.png

71.3 KB
41.1 KB

0 commit comments
