My Notes on Chip Huyen’s ‘AI EngineeringReading Chip Huyen’s “AI Engineering”, I've learned a lot about the revolutionary foundation models which are changing software engineering and even the AI itself. I am now clear regarding how AI is going to turn the tech world upside down and evolve.
If you’re interested in the key insights of this book, this is what caught my eye. For those of you who have already read it, this should be a useful refresher.
Please note that I have focused virtually on the most interesting ideas, and skipping some background info.
♦image from O’reillyChapter 1: The Rise of Foundation ModelsI believe that foundation models have completely changed the whole scenery of AI. Nowadays, we don’t even start by designing our own models; we typically use pre-trained models. Chip Huyen explains how this move brings AI to experts, allowing them to harness powerful tools. This chapter is in itself a journey, in retrospect and through vision, with AI from its prehistory to the sophisticated systems of today. A major takeaway is the tangible impact of AI on society, just how critical the need for expertise in AI engineering has become.
The sheer versatility of foundation models is best exemplified by the Super-NaturalInstructions benchmark (Wang et al., 2022), which highlights the multiple functions the models are capable of conducting across translation, question answering, sentiment analysis, and others (see the following picture).
♦Image from AI Engineering by Chip HuyenChapter 2: Understanding Foundation ModelsWhen I learnt AI engineering, it became obvious that comprehension of foundation models isn’t optional — it’s a requirement. In this chapter, I learnt a more detailed and digestible version of the underpinnings of these models from Chip Huyen. If you’re tasked with AI systems building or fine-tuning this chapter underlines the most essential factors.
Starting with the basics: Training these models takes a lot of resources and expertise out of reach of many developers like me. There’s no need to construct all the pieces ourselves — we have models to use. We can use existing models. But, before we can safely pick and tweak models, we need to understand a few key things: Data models used for training, structure of data models, and adjustments made after training the models.
One of the central features discussed in this chapter is the notion of training data. The performance of a model depends on the data it used while training. A model without Vietnamese training data will bomb with Vietnamese text. When most of the training data emanates from Wikipedia, these models are not very good in technical areas such as Science or law. Knowing the data used to train a model, you are able to see its strengths and its weaknesses.
Transformers occupy a considerable part of the structure topic. Chip explains why the transformers became the reflection and why they still are the winning choice now. Still, she does not hide behind the questions: what amazing progress could await us after the transformer era? While the response is not entirely clear, this is a rather important trend for observation.
Size also plays a role. The chapter explains “scaling laws,” which help developers make principled choices as to the size of the model given historical and computational limitations. I see it as a triangle of trade-offs: parameters, tokens, and FLOPS (floating point operations). Building an overly large model if you do not have enough data or computational ability may simply make it redundant and useless.
♦Image from AI Engineering by Chip HuyenHow the model selects the next word is also an important factor; it is called sampling. Chip selects this as the top neglected concept in the sphere of AI, and this revelation really does speak to me now. Sampling approaches like temperature and top-k have a huge effect on the results generated by the model. Want fewer errors? Adjust your sampling.
Once the model has been trained, it is fine-tuned to fit human needs. More often than not, this translates to additional training, which is goal-oriented and where indications of end users are heard. Though there are several benefits, there is a risk of mistreating the performance in some of the domains from time to time. Developers have to balance this.
♦Image from AI Engineering by Chip HuyenThough this chapter does not promise to transform you into a foundational model pro, it does provide builders and tinkerers with basic strategies for decision-making.
With the growth of artificial intelligence in day-to-day tools and services, understanding the basics enables to creation of structures that are both creative and reliable, with the needs of the users at the forefront.
Chapter 3: Evaluation MethodologyThis chapter shifted my opinion on working with AI more than anything else. Although we could easily become excited about what they can offer, developers should check whether they accurately provide results, Chip Huyen argues. In this article, we discuss how to evaluate AI systems by taking a closer look at open-ended tools like chatbots and also creative platforms.
Assessment does not merely precede a final step in any process. It has to be used consistently and systematically from the onset to the end when establishing a system. What Chip had believed to be the key obstacle to the adoption of AI — evaluation — resonated with me after diving into this chapter. Outputs of AI systems that are unregulated pose real threats. The risks become apparent upon watching chatbots entertaining questionable guidance or AI tools creating false legal precedents.
What makes evaluation tough? Interestingly, foundation models are meant to function in an unfamiliar manner. Contrary to traditional models, which make a selection among a given set of outcomes, AI systems like the GPT generate unique variations in responses. The fact that many prompts have no single clear answer makes the process of evaluating metrics more complicated.
Chip presents a number of evaluation techniques that address the complexity of open-ended responses:
- Although it is easy to compute accuracy scores, it is best when conducting tasks that have fixed known correct answers.
- Subjective evaluation entails application of human assessors, entailing delays and high costs.
- AI-as-a-judge is an innovative method in which models of AI judges evaluate the quality of other AI generated results. Although it is fast and efficient, questions about possible biases and reliability issues between various AI judges surface.
♦Image from AI Engineering by Chip Huyen♦Image from AI Engineering by Chip HuyenAn important lesson learned is to understand potential failure points in your system. Chip supports architectural changes that make failure mechanisms transparent so that personalised evaluation efforts can be developed.
We should also consider sampling- how models pick a word to suggest next. This aspect is more than a technicality. Although it increases a model’s ability to produce new ideas and options, it may also result in mistakes. Through polishing your sampling processes, you can significantly amplify the reliability of your product.
Chapter 4: Evaluating Foundation Models : The Most Challenging Parts of AI EngineeringAfter reading and going through Chapter 4, it became clear that the reason why AI systems often fail in the real world is that they lack reliable evaluation. There is not much self-deception in her evaluation — the dearth of relevant evaluation is still a significant impediment to AI deployment.
Moving ahead, our fundamental problem is: open-endedness. Unlike standard ML problems with clear right or wrong answers, GPT and other generative models generate unpredictable outputs influenced by their context. This creates a challenge because metrics such as precision or accuracy aren’t sufficient; evaluation is a complicated, multi-layered challenge.
Chip introduces three primary evaluation strategies:
- Exact Evaluation: Accurate indicators like or BLEU score. Such metrics benefit structured tasks, like classification, but do not work so well when it comes to creativity.
- Subjective Evaluation: Human rating of outputs. It may be the preferred approach, but it is very resource-intensive and time-consuming.
- AI-as-a-Judge: Using automated approaches to evaluate the quality of AI-created content. It is helpful and applicable to many situations, where valid criticisms about possible biases can be raised.
♦Image from AI Engineering by Chip HuyenOne important takeaway for me was to emphasize careful tracking of experiments. We are prone to making changes to prompts or training data, without keeping track of them. Lacking systematic documentation, it will be so easy to encounter major complications later. To know what works Chip highlights the need to log all the variables — iterations of the prompt, rubric revisions, and demographics of users.
Another particularly interesting discovery that I made was: The log-priorities of the tokens generated by the model. Staring at a model’s confidence for individual tokens provides insights regarding its fluency, coherence, and possible truthfulness. I never would have guessed how much worth a logprob could add to a model output evaluation — this realization in and of itself is game-changing for the work that I do.
The topic also includes using the perplexity to rate fluency and the return of traditional NLG metrics such as relevance and faithfulness. However, such approaches are applicable only in special situations. Open-ended tasks frequently require customised scoring metrics that adapt to your very specific application.
A big eye-opener for me: Although helpful, MCQs do not make for an ideal metric to compare generative models. They test recognition, not generation. Still, there are a few research that remain with MMLU-style metrics ignoring the areas where models shine or struggle catastrophically in generation tasks.
Chip emphasizes multi-layered evaluation pipelines. You can utilize naive classifiers for detecting overall performance and then rely on more refined human or machine arbiters for all details. By dividing evaluation data according to user types or input categories, you might discover unanticipated biases or performance problems.
And don’t underestimate hallucination. It’s not random — it arises from models believing generated information; is accurate. In a case study from the book, it is shown how a model (with misinformation) judged a shampoo bottle to be a milk container. Why? It told lies to itself and treated them as actual facts.
Primarily, Chip argues for the evaluation to be embedded at every stage, from model selection, throughout development cycles, or even post-production. Establishing robust sets of annotations for tests should be an integral part of the process, not just nice-to-have frills and not an afterthought.
This chapter reshaped my approach. For AI applications where trust becomes a key factor, evaluation needs to be a continuous process, targeted, and core to our procedures. Measuring it as a QA checklist is a shallow approach: it’s the foundation on which real-world success is built.
Chapter 5: Prompt Engineering — The Art of Asking AI the Right WayPrompt engineering was the very first model adaptation technique I ever discovered — and the one I might underestimate. In chapter 5 of AI Engineering, it turns out prompt engineering is less about crafting witty questions and more about excelling at communicating with smart machines. Chip Huyen leads us on a technical and philosophical odyssey and makes “playing with prompts” an engineering practice.
We begin with the fundamentals: a prompt is a direction to a model. That might be a straightforward question, such as, “What is the capital of France?” or a multi-action assignment like, “Break down the following sales report and summarise its insights in bullet form.” Prompts may involve:
- Task specification (what is to be done),
- Examples of desired input-output behaviour (few-shot),
- Context or previous conversation (particularly for chat interfaces).
What is distinctive about prompt engineering is the fact that it doesn’t modify the model’s weights — it is completely input-centric. This makes it the best and least expensive method to fine-tune a foundation model.
Don’t let its simplicity fool you. Prompting has its nuances. The location of your instructions — beginning or end — can significantly impact performance. Chip has empirical data: models such as GPT-4 respond better to instructions at the start but others like LLaMA-3 might like them at the end.
We also learn to test robustness. If replacing “five” with “5” causes your model’s response to break, it’s not robust. Red flag. More robust models are less fragile to such perturbations to prompts, and so fiddling with prompts is a good proxy for model quality.
Chip believes in approaching prompt engineering scientifically. Keep track of experiments, test variations, and systematically optimise. It’s quite like A/B testing in product development.
One of the most relevant sections of interest here is prompt security. While prompts can also mislead models as much as they direct models, malicious users can cause models to ignore instructions by fooling them through prompt injection attacks. This is most dangerous in environments with multiple users like in finance or customer support. Some defensive methods are:
- Input/output filtering
- Escaping unsafe tokens
- Utilizing system prompts to constrain model behavior
- And more broadly, treating prompts as if they’re code: structured, vetted, and protected.
The chapter also discusses in-context learning:
- Zero-shot prompting
- Few-shot prompting
- Chain-of-Thought (CoT) prompting, which requires the model to reason in a step-by-step
♦Image from AI Engineering by Chip HuyenChip demonstrates tangible benchmarks: by utilising CoT, Gemini Ultra’s MMLU, score was boosted from 83.7% to 90.04%. That’s a testament to how influential structured prompting can be, even when contrasted with fine-tuning.
Ultimately, this chapter transformed the way I perceive prompt engineering. And when executed optimally, it enables us to harness a model’s full power without incurring a single cent on training costs.
Chapter 6: RAG and Agents — Providing Models with Memory and ToolsI had not yet realised how much context informs the behaviour of AI systems before reading this chapter. Chapter 6 of Chip’s AI Engineering is a technical deep dive into two of the most influential patterns used to scale AI applications: Retrieval-Augmented Generation (RAG) and agentic systems.
Foundation models are strong but forgetful. They are unable to recall document-length text or maintain a track of changing conversation. To solve this problem, we have to equip them with tools to retrieve and engage with related information — welcome to RAG and agents.
RAG: Retrieval-Augmented Generation♦Image from AI Engineering by Chip HuyenRAG alters the paradigm by allowing models to retrieve useful external information prior to generating a response. Rather than providing the model with a behemoth prompt containing all possible data, you can:
- Pull out only the relevant parts using techniques such as vector search.
- Pass it on to the prompt.
The consequence? More relevance and less hallucination.
RAG is analogous to feature engineering for foundation models. While traditional ML had you hand-engineer features, RAG has you engineer the appropriate context. Chip notes that this pattern excels when your app relies on private knowledge bases or domain-specific material.
The mechanisms of RAG include:- An embedding model which transforms documents and queries into vectors.
- A search system (e.g., Weaviate, FAISS) to retrieve the most relevant documents.
- A composer who combines all these into a request to the model.
One interesting real-life example I discovered: a user queries, “Can the Printer-A300 print 100 pages per second?” The system pulls in the manual section with specifications and includes it in the model’s prompt — providing grounded and accurate responses.
Agents: Equipping Models with Tools and Autonomy Agents bring it to the next level. Not merely fetching static information, agents are able to utilize tools, plan and execute actions. They’re how you transition from “static chatbot” to “AI assistant scheduling appointments, checking the weather, and following up.”
Chip refers to agents as models implemented using APIs or plugins — such as web search engines, scheduling applications, or CRM applications. The intrigue lies in the dynamic interactivity. Agents may:
- Think about their actions
- Make Multi-Step Decisions
- Use buffer memory
- Call outsource functions
But it’s not completely rosy. Agents are fragile. They call the wrong tool, don’t plan properly, or delude themselves into taking fictional steps. Managing their own memory is vital. Strategies are:
- FIFO memory: Store the most recent turns only
- Summary Memory: Recap past conversations
- Long-term vector memory: Retrieve by similarity
The architectural considerations are enormous. Your app architecture becomes a modular, dynamic system with layers of search, planning, and decision-making when using RAG and agents.
This chapter made me rethink the architecture of AI. Whether you are designing systems that require reliability, context comprehension, or autonomy, RAG and agents aren’t “nice to have” but are a necessity.
Chapter 7: Finetuning — Should You Train or Tune?I used to believe finetuning was a matter of having a large GPU and lots of data. That was all blown out of the water by chapter 7 of AI Engineering. Chip Huyen takes us through why and when and how to finetune foundation models — and how to do it without breaking the bank.
The chapter begins with a fundamental fact: you might never have to finetune at all. Prompt engineering and retrieval-augmented generation (RAG) already yield strong customization as it is, and finetuning is more a last resort than a first action. Chip provides a framework to guide you to decide: if your system requires behavioral change greater than prompting and RAG, and if you have good data, finetuning is the optimal choice.
Full Finetuning is Obsolete:Traditional finetuning finetuned all model weights, which was possible when models were tiny. But when models grew out of control, it became infeasible. Finetuning all parameters of a multi-billion-parameter model takes up too much memory and compute, something most practitioners do not even have.
The industry moved to more parameter-efficient fine-tuning (PEFT) methods instead.
♦Image from AI Engineering by Chip HuyenEnter PEFT and LoRA:Chip discusses several PEFT approaches, beginning with LoRA (Low-Rank Adaptation).
LoRA is a star among finetuning methods because it:
- Keeps the base model on hold.
- Injects light adapter modules
- Needs fewer resources,
- And enables modular deployment (e.g., changing adapters to accommodate different scenarios).
♦Image from AI Engineering by Chip HuyenShe even deconstructs the architecture of LoRA and how it differs from partial finetuning when a subset of layers are updated. Surprisingly, LoRA consistently beats partial finetuning on both sample and memory efficiency, but at the expense of a slightly greater inference latency.
This modularity also creates a system referred to as model merging — taking multiple specially fine-tuned adapters and merging them into a single model. This is particularly beneficial to deploy on edge devices or scenarios in which multiple features are to be packaged into a single model.
The Bottleneck Isn’t Finetuning. The real bottleneck is data. There’s a single thing to learn from this chapter: Finetuning is not difficult. Good data is.
Finetuning relies on instruction data to work properly. High-quality and neatly labeled data is costly and time-consuming to collect. This sets up a paradox: you can simply add a LoRA adapter to your model, but if your data aren’t clean, the output will be trash.
Trade-offs and Advice:Throughout the chapter, Chip highlights trade-offs:
- Parameter efficiency vs. latency
- Data Quality vs. Scale
- Customization versus generalizability
She also discusses quantized training and distillation, and how both of these methods complement the larger toolkit of tuning.
At the end of the chapter, I realized that finetuning is not about brute force anymore. Finetuning is about precision, minimalism, and informed decisions — that’s a mindset any modern AI engineer should adopt.
Chapter 8: Dataset Engineering — The Unseen Backbone of AI SuccessIf I’ve taken anything away from working with AI, it’s this: however advanced your model is, it’s as good as the data. Chapter 8 of AI Engineering by Chip Huyen delves deep into the fundamental reality and demonstrates why dataset engineering is the most underappreciated but essential skill in AI.
Why Data is Truly the Differentiator:As models become commoditized, firms are unable to count on model innovation anymore. Chip maintains that if it contains high-quality domain-specific or proprietary data, your dataset becomes your differentiator. Since compute is readily available and open-source models abound in the market, data is the moat now.
Still, dataset work is unglamorous. Chip is refreshingly blunt about it here: “Data will mostly just be toil, tears, and sweat.” She is correct. Good data work involves tedious and iterative work, but it’s also what distinguishes toy prototypes from solid AI products.
Key Pillars of Dataset Engineering:The chapter splits dataset work into three key activities:
- Curation — What data do I require? In how much quantity? Where do I get it? How do I maintain quality?
- Synthesis — Leveraging AI itself to produce annotated examples, particularly when it is too expensive or time-consuming to get it from humans.
- Processing — Cleaning and de-duplicating and formatting data to make it usable.
I was glad to see Chip’s emphasis on dataset life cycle. She discusses how pre-training and post-training call for varying data approaches:
- Pretraining: Emphasizes on breadth (expressed in tokens).
- Posttraining: Emphasizes depth and clarity (measured in examples).
♦Image from AI Engineering by Chip HuyenAnd it’s nonlinear — you’ll constantly go back and forth between curation, synthesis, and cleanup. Chip recommends approaching dataset creation as software development: version control, documentation, reproducibility.
Human vs. Synthetic Data Human-labeled data, particularly instructions and conversation, is still gold. It’s costly, however, Chip estimates a good (prompt, response) pair to cost $10 and thousands to train instruction-following models. No wonder firms like OpenAI hire graduate-degree-carrying professional annotators.
On the other hand, synthetic data (data created by AI itself) is taking hold. Faster, scalable, and inexpensive — but dangerous. Unless it’s carefully filtered, you’re left with self-reinforcing biases or poor quality signals. Nonetheless, lots of start-ups are using it to bootstrap new models successfully.
The Shifting Data Landscape:Another wake-up call: web data is less open than it was before. Chip mentioned earlier in the book how sites such as Reddit and Stack Overflow have imposed restrictions on data access and how copyright wars are intensifying. This has caused firms to enter into licensing arrangements with publishers or dig up internal corpora — emails, contracts, help tickets — to produce private datasets.
She also refers to a chilling trend: the web is filling up with content created by AI. Future models learn from this “echo chamber” and may perform worse as a consequence. Having a human-originated, high-quality dataset may become a luxury most firms are no longer in a position to afford.
Chapter 9: Inference Optimization — Making AI Cheaper, Faster, SmarterI read Chapter 9 and discovered the pulse of production AI: inference optimisation. Not how much it does — how fast and affordably it does it. No consumer will wait five seconds to hear back from a chatbot. No firm will pay $10,000 per day to avoid it. Chip Huyen puts this problem into stark relief and shows a toolkit of methods to make foundation models feasible at scale.
What is Inference?Chip first defines the distinction between training and inference. While training refers to the process of educating a model, inference refers to applying it to make predictions in real time. Most AI engineers work with inference more than training, particularly if you are using pre-trained models.
The inference server executes the model, dispatches user requests, manages hardware allocation, and returns responses. Speed is not the issue — it’s a coordination problem mixing model design, systems engineering, and hardware planning.
Bottlenecks and Metrics Inference are prone to fail because of two fundamental bottlenecks:
- Compute-bound operations: hampered by math-intensive operations (such as matrix multiplication).
- Tasks bound on memory bandwidth: hampered by data transfer between CPU/GPU.
♦Image from AI Engineering by Chip HuyenThey guide you in selecting the correct hardware or model settings. Chip takes us through latency measurements such as:
- Time to first token
- Time by token
- Total query latency
Understanding which of the following is most important to your user flow is vital. You may accept more verbose responses if the first token appears immediately, say.
Techniques to Optimize:Now to the point: how to make inference faster and more affordable.
- Quantization — Lower model precision (e.g., float32 → int8). Saves space, compute and money.
- Distillation — Train a smaller “student” model to approximate a large “teacher.” Faster but a bit less precise.
- Caching — Store results to avoid multiple querying. Straightforward and efficient.
- Batching — Run multiple requests together. Improves GPU utilization but adds waiting time.
- Early stopping — Implement constraints on how many tokens to produce, or when to halt on certain criteria.
They all have trade-offs: you may gain speed at the expense of small performance losses. Chip forces us to strike a balance between latency, price, and quality, particularly on user-facing products.
Model Parallelism:If a model is too large to fit on a single GPU, model parallelism divides it across devices:
- Tensor parallelism: divides math operations
- Pipeline parallelism: splits model stages
- Split the computation or input by function.
This applies more to teams having their own models. For others who are using APIs (OpenAI and Anthropic), the lesson is to understand what is underneath the hood — so in case it’s necessary to scale, you’ll make informed decisions.
Business Impact Chip ends with the harsh reality: optimisation is not a choice. Inferencing costs increase linearly as a function of usage. Every 10,000 users means a potential daily spend of thousands if you don’t optimise.
What hit me most was her framing: inference is not simply a back-end issue — it’s a product feature. Users experience it. Companies pay for it. And good engineers learn it.
Chapter 10: AI Engineering Architecture and User Feedback — From Prototype to ProductionAs I wrapped up AI Engineering, Chapter 10 pulled everything together. It’s not just about clever prompts or smart finetuning — it’s about how all these parts interact in a real system. This chapter walks us through the evolving architecture of AI applications and why user feedback is the backbone of iteration and trust.
Building Your AI Stack, One Layer At A Time
Chip starts with the simplest version of an AI app: a user sends a query, the model generates a response, and that’s it. But as any developer knows, this doesn’t scale. Real-world AI apps need guardrails, memory, context injection, and optimization. So the chapter introduces a modular architecture that evolves over time:
- Basic Model API Call — No augmentation or caching.
- Context Construction — Add external knowledge via RAG or tool use.
- Guardrails — Protect against harmful inputs and outputs.
- Routing and Gateways — Support multiple models and APIs.
- Caching — Speed up frequent queries.
- Write Actions and Agent Patterns — Let AI perform actions like booking or writing.
Each addition boosts capability — but also introduces complexity and failure points. Chip stresses the need for observability, with logging and metrics across all layers.
Guardrails: Safety Nets for AIAs models get smarter, the risks increase, especially with tools and write permissions. Guardrails are the protective layer:
- Input Guardrails detect and block harmful prompts (e.g., prompt injection).
- Output Guardrails check for toxic or unsafe model outputs.
- PII Handling masks sensitive data before sending it to external APIs.
♦Image from AI Engineering by Chip HuyenShe also dives into real-world risks: what if a user pastes private info into a prompt? What if an AI agent triggers a bank transfer? These aren’t hypotheticals anymore.
User Feedback: The Ultimate ModelOptimizer AI outputs don’t improve on their own. Feedback is how we learn what’s working and what’s not. But here’s the twist: natural language feedback is easy to give and hard to parse. Chip outlines methods to extract structured signals from conversations, like:
- Explicit thumbs-up/down
- Implicit signals (time spent, follow-up questions)
- Flagging problematic behaviour
She urges us to design feedback systems upfront, not as an afterthought. It should be a core part of your data pipeline and evaluation loop.
The Rise of Full-Stack AI EngineersThis chapter highlights a broader shift — AI engineering is merging with full-stack development. Thanks to APIs like OpenAI’s Node SDK and frameworks like LangChain.js, more frontend developers are entering AI. Those who can build fast, iterate fast, and collect feedback fast are the ones who’ll win.
What really stuck with me?
You don’t need a giant model or deep ML expertise. You need good systems thinking, strong UX, and a feedback loop. That’s the new stack.
ConclusionI hope this summary gives you a good overview of what to expect from “AI Engineering.” The book taught me valuable skills in AI, especially in understanding the full engineering lifecycle, not just building models. I’m still reading and learning the concepts in this book, but I wanted to share these insights as they’ve already transformed my understanding of the field. Whether you’re a developer, product manager, or just an AI learner, I hope my summary gives you a short explanation for why this book is worth your time.
♦Building Real AI Systems was originally published in Code Like A Girl on Medium, where people are continuing the conversation by highlighting and responding to this story.