It’s Not Rocket Science, But We Don’t Know What It Is Yet!

  • Thinking about OKRs

    At Google, like at any big company, we did OKRs every quarter. We all knew the exercise was important, and we all knew it would shape how our team’s performance was judged at the end of the cycle. What I failed to deeply understand was the true meaning behind OKRs, the history behind them, and why we need to spend high-quality time crafting them. I could give you the textbook answer for why we need OKRs, but I don’t think I ever truly got it, until now. I often like to re-derive existing concepts or re-solve already-solved problems from the ground up; it helps me better grasp and appreciate the original. So I asked myself: how would I design the OKR process from scratch? Here’s what I came up with. There are five phases in the OKR lifecycle: Formation, Communication, Execution, Evaluation, and Retrospection.

    Formation: This phase can be structured into the following steps:

    • #1 Pull: from higher level OKRs and business goals. Sometimes these will be set by your product owner or your parent org.
    • #2 Depixelate: by diving into the next level of detail and fleshing out granular OKRs for your org that serve the top-level OKRs. This is where you, as the leader, make strategic prioritization decisions, including saying no to a lot of things.
    • #3 Align: with stakeholders such as the managers and ICs on your team, x-functional partners, x-org partners, leadership, etc.

    Iterate quickly through #1, #2, and #3 until all stakeholder parties sign off.

    • #4 Percolate: into your org by asking the managers and leaders reporting to you to form their own orgs’ OKRs.

    Communication: This is probably the most important step for building a truly “aligned” organization, where everyone is rowing in the same direction with a clear purpose and a connection to the bigger mission. It is your job as the leader to first truly understand that connection and then communicate it as well as you can to your team and stakeholders. This is where your skills as an effective communicator are crucial. Most leaders will share a deck and walk through the OKRs in a meeting, but good communicators go a step further and connect the OKRs to a clear purpose. Note that I am choosing the word purpose, not business goals or objectives. When a team has a purpose attached to its OKRs, magic happens.

    Execution: This is where you trust the team and let them run – always trust, but verify. Entire books have been written about execution, so I will not delve into it too deeply here. This is the longest phase of the OKR lifecycle. In startups and smaller, fast-moving teams, there needs to be a tight feedback loop with senior leadership during this phase so that OKRs can be tweaked mid-cycle if needed.

    Evaluation: In this phase, the OKRs get evaluated from top to bottom using whatever grading rubric your company follows. At Google, OKRs are graded on a scale of 0 to 1, and they are deliberately set as a stretch, so a score of around 0.7 is considered acceptable. How you set and grade OKRs will define the output and impact of your org.

    Retrospection: This step needs to happen recursively in your sub-teams and orgs as well. One of the biggest challenges here is that not everyone is good at retrospectives, or motivated to do them. It is critical that you identify the right person upfront to lead these retros in each of your subgroups. In bigger companies, you will often have an x-functional TPM or PM partner who will be thrilled to help; I’d highly recommend involving them in this exercise.

    Relevant Reads

    Andy Grove’s High Output Management

    John Doerr’s Measure What Matters

  • Stack Split Dilemma

    When a product is in its early stages and the team is small, sticking to a single language keeps the team in flow: builds are quick, hiring is simple, and nobody wrestles with two toolchains before coffee. Over time, a second language often slips in, maybe for an analytics job or a machine-learning prototype. The trick isn’t adding that language; it’s how many people end up living in both worlds and paying the “overlap tax” of constant context-switching.

    Picture five engineers who speak TypeScript and five who speak Python. If only one person spends part of their week shuttling data across a thin API, everyone else stays in deep focus. But when half of each group is debugging build scripts in unfamiliar runtimes, what used to take minutes becomes a morning ritual of environment wrangling.
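    To make that “thin API” concrete, here is a minimal sketch of what the single boundary could look like on the TypeScript side: one small client module talking to one HTTP endpoint owned by the Python team. The endpoint URL, payload shape, and field names are hypothetical, just to illustrate how narrow the overlap can stay.

    ```ts
    // score-client.ts — the only place the TypeScript app touches the Python service.
    // The endpoint (http://localhost:8000/score) and payload shape are made up for
    // illustration; use whatever contract the two teams agree on.
    // Assumes Node 18+ (global fetch).

    export interface ScoreRequest {
      userId: string;
      features: number[];
    }

    export interface ScoreResponse {
      score: number;        // e.g., a model score produced by the Python service
      modelVersion: string;
    }

    export async function getScore(req: ScoreRequest): Promise<ScoreResponse> {
      const res = await fetch("http://localhost:8000/score", {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify(req),
      });
      if (!res.ok) {
        throw new Error(`scoring service returned ${res.status}`);
      }
      return (await res.json()) as ScoreResponse;
    }
    ```

    Everything else in the TypeScript codebase imports getScore and never thinks about Python again; that is the whole point of the boundary.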

    Let’s take an example of such a tradeoff in the messaging layer: choosing between Apache Pulsar and NATS JetStream.

    Apache Pulsar is a powerhouse: tiered storage, geo-replication, multi-tenant isolation, and peak consumer throughputs measured in millions of messages per second, roughly sixteen times what NATS JetStream reached in public benchmarks (2.6M msg/s vs 160K msg/s) [ref]. That rich feature set comes at the cost of extra services (ZooKeeper or Pulsar’s built-in metadata service, plus BookKeeper for storage) and native dependencies in its Node client.
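    For a feel of what that looks like from Node, here is a minimal producer sketch using the pulsar-client npm package, which wraps the native client mentioned above. The service URL and topic name are placeholders.

    ```ts
    // pulsar-producer.ts — minimal Pulsar producer sketch (pulsar-client npm package).
    // Service URL and topic are placeholders; the package relies on a native client under the hood.
    import Pulsar from "pulsar-client";

    async function main() {
      const client = new Pulsar.Client({ serviceUrl: "pulsar://localhost:6650" });

      const producer = await client.createProducer({
        topic: "persistent://public/default/events",
      });

      // Messages are sent as raw bytes; here we serialize a small JSON payload.
      await producer.send({
        data: Buffer.from(JSON.stringify({ type: "signup", userId: "u1" })),
      });

      await producer.flush();
      await producer.close();
      await client.close();
    }

    main().catch(console.error);
    ```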

    NATS JetStream, by contrast, lives in a single 10 MB binary that you enable on your NATS server. You get persistence, replay, and simple stream abstractions without adding new runtimes. It delivers sub-millisecond hops in small clusters and scales horizontally by adding nodes [ref]. You trade away Pulsar’s tiered long-term storage and built-in multi-tenant isolation, but you avoid extra binaries, CI gymnastics, and a second language at runtime.
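    The JetStream side stays entirely in TypeScript via the nats npm package. Here is a minimal sketch; the server address, stream name, and subjects are placeholders.

    ```ts
    // jetstream-demo.ts — minimal JetStream sketch using the nats npm package.
    // Server address, stream name, and subjects are placeholders for illustration.
    import { connect, StringCodec } from "nats";

    async function main() {
      const nc = await connect({ servers: "localhost:4222" });
      const sc = StringCodec();

      // Create a stream that captures every subject under "events.>".
      const jsm = await nc.jetstreamManager();
      await jsm.streams.add({ name: "EVENTS", subjects: ["events.>"] });

      // Publish a persisted message; the ack confirms the stream stored it.
      const js = nc.jetstream();
      const ack = await js.publish(
        "events.signup",
        sc.encode(JSON.stringify({ userId: "u1" }))
      );
      console.log(`stored in stream ${ack.stream} at seq ${ack.seq}`);

      await nc.drain();
    }

    main().catch(console.error);
    ```

    No native add-ons, no extra services to run beyond the NATS server itself, which is exactly the low-overlap property the small-team argument leans on.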

    Especially for small teams at early-stage startups, I tend to lean toward the option that keeps overlap as low as possible: let the TypeScript folks stay in JetStream and the Python folks keep their backend pure, with a single, clear interface in between. One neat boundary beats a blurry Venn diagram any day.

    A skeptic might point out that strict language silos can leave performance and reliability on the table. Polyglot teams can pick the best tool for each layer (Pulsar for high-scale streaming, Rust for data crunching, Python for ML) rather than whatever happens to match the app stack. Modern containerized environments, monorepo tooling, and remote build caches can shrink context-switching costs, and engineers comfortable in multiple languages can unlock Pulsar’s full power without tying the architecture to a single stack.

    Ultimately, the right choice depends on where your product is in its lifecycle and on the size and skill mix of your team. Early on, simplicity wins; at scale, raw features may justify a polyglot approach.

  • 🧠 AI Research Assistants and Human Experts Will Be Best Friends!

    Amplifying Human Expertise

    Paper Link

    Just finished reading a fascinating paper titled “Towards Artificial Intelligence Research Assistant for Expert-Involved Learning” (ARIEL), which explores how AI, particularly LLMs and large multimodal models (LMMs), can act as a scientific co-pilot in biomedical research. Some of its findings reinforced my assumptions, and others prompted me to think more deeply about AI’s impact on scaling human expertise.

    Here are a few reflections and takeaways:

    🔬 The Two Datasets

    The authors evaluated AI models across two distinct datasets:

    1. Long-form biomedical articles (focused on COVID-19 and healthcare), to test summarization and reasoning abilities.
    2. Biomedical figures from multiomics studies (e.g., genomics + proteomics), used to assess multimodal reasoning capabilities.

    They benchmarked a variety of open- and closed-source models, with a clear focus on evaluating reasoning, summarization, and figure interpretation.

    Task                 | Best Performer      | Fine-tuned Open-source Performance | Takeaway
    Text Summarization   | ChatGLM-LoRA (open) | Better than GPT-4, Claude 3.5      | Fine-tuning works extremely well
    Figure Understanding | o1 (closed)         | Open-source models far behind      | Test-time computation helps

    🧪 Fine-tuning Can Beat the Giants

    The standout open-source model in their tests? ChatGLM, a 9B parameter model. After fine-tuning with domain-specific biomedical data using LoRA (low-rank adaptation), the upgraded ChatGLM-LoRA outperformed even closed-source SOTA models like GPT-4 and Claude 3.5 across multiple text summarization benchmarks.

    This is huge: with access to the right data and tools, high-quality open-source models can not only match, but surpass, proprietary alternatives.

    🧠 Augmenting Human Intelligence, Not Replacing It

    One of the most interesting experiments involved comparing AI ensembles against human expert ensembles. Even the best AI ensemble couldn’t outperform the top-performing human. But here’s the twist: human performance improved significantly with AI assistance.

    In essence, AI can elevate even average domain experts to the level of top experts — not by replacing them, but by augmenting their decision-making with rapid summarization, figure comprehension, and hypothesis generation.

    📐 Evaluating Text with Embeddings

    The authors used LLM-generated embeddings and cosine similarity to compare AI- and human-generated summaries — a useful reminder that this remains the de facto approach for evaluating free-form text in research and perhaps even clinical workflows.
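    Mechanically, that comparison is very simple: embed each summary and take the cosine similarity of the two vectors. Here is a small sketch; the embedding values are made-up placeholders, since real ones would come from whichever embedding model you use.

    ```ts
    // cosine-similarity.ts — compare two summaries via their embedding vectors.
    // The embeddings below are tiny made-up placeholders; real ones come from an embedding model.

    function cosineSimilarity(a: number[], b: number[]): number {
      if (a.length !== b.length) throw new Error("embeddings must have the same dimension");
      let dot = 0;
      let normA = 0;
      let normB = 0;
      for (let i = 0; i < a.length; i++) {
        dot += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
      }
      return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Hypothetical embeddings of an AI-generated and a human-written summary.
    const aiSummaryEmbedding = [0.12, -0.03, 0.55, 0.31];
    const humanSummaryEmbedding = [0.1, -0.01, 0.58, 0.27];

    // Values near 1.0 indicate the two summaries are semantically close.
    console.log(cosineSimilarity(aiSummaryEmbedding, humanSummaryEmbedding).toFixed(3));
    ```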

    🚀 Lessons for Healthcare + AI

    Here are a few takeaways that resonated with me for real-world applications in healthcare and biomedical AI:

    • Fine-tuned open-source models can outperform closed-source models if the domain-specific data is strong.
    • Embedding-based comparison offers a scalable way to evaluate free-text clinical notes and AI summaries.
    • AI is a powerful amplifier, but it’s not a great equalizer. Those who learn to use these tools well will see massive productivity gains. Those who don’t will fall behind — widening, not narrowing, the performance gap.

    🧭 Final Thought

    I believe we’re at a stage where AI won’t flatten expertise, but rather magnify it. Those who master these tools will build, discover, and understand faster than ever — and it’s on all of us to stay on the right side of that curve.

  • Paper of the week: Reasoning Models Don’t Always Say What They Think.

    Paper Link

    The paper “Reasoning Models Don’t Always Say What They Think” dives into how large language models do chain-of-thought (CoT) reasoning — basically, how they explain their thinking — and whether those explanations actually reflect what’s going on internally.

    What stood out to me is that even when models are trained with outcome-based RL (just rewarding correct answers), they often learn to exploit shortcuts — like metadata or patterns in the prompt — and still give “rational-sounding” explanations that don’t match what they really did. In some cases, they get >99% reward by using hacks, but only verbalize what they did in <2% of cases. That feels like a problem if we want to trust what these systems are doing.

    It got me thinking: humans aren’t totally transparent either. We don’t always know or share how we arrived at a decision. So should we expect more from LLMs?

    I think – Yes, we kind of have to.

    We can hold humans accountable in all sorts of ways — legally, socially, culturally. But we don’t have those levers with AI. If the model gives a polished explanation that’s not what it actually used to make the decision, we might have no way to catch it. So faithful CoTs might be one of the few tools we have for actually understanding and auditing model behavior.

    And this matters a lot in domains like healthcare, finance, or law — where decisions can’t just look good on the surface. If a model recommends a treatment plan or flags a transaction, we need to know why — and make sure it’s for the right reasons.

    Anyway — cool paper, kind of unsettling results, and a good reminder that reward ≠ alignment.

    #ML #LLMs #ChainOfThought #AIAlignment #AISafety #Anthropic #HealthcareAI