Research Map
A guided map of the themes connecting my papers, talks, projects, and field notes on AI systems.
This map is the guided route through the archive. Each theme starts with the research question, then points to the papers, talks, projects, and notes that carry the idea forward.
- Publications: 39
- Talks: 32
- Projects: 29
- Posts: 60
Theme 01
Behavior Shaping
How prompts, post-training, feedback, and reward design steer model behavior under real constraints.
- Start here: Publication Prior Prompt Engineering for Reinforcement Fine-Tuning This paper investigates prior prompt engineering (pPE) in the context of reinforcement fine-tuning (RFT), where language models (LMs) are incentivized to exhibit behaviors that maximize performance through reward signals. While existing RFT research has primarily focused on algorithms, reward shaping, and data curation, the design of the prior prompt--the instructions prepended to queries during training to elicit behaviors such as step-by-step reasoning--remains underexplored. We investigate whether different pPE approaches can guide LMs to internalize distinct behaviors after RFT. Inspired by inference-time prompt engineering (iPE), we translate five representative iPE strategies--reasoning, planning, code-based reasoning, knowledge recall, and null-example utilization--into corresponding pPE approaches. We experiment with Qwen2.5-7B using each of the pPE approaches, then evaluate performance on in-domain and out-of-domain benchmarks (e.g., AIME2024, HumanEval+, and GPQA-Diamond). Our results show that all pPE-trained models surpass their iPE-prompted counterparts, with the null-example pPE approach achieving the largest average performance gain and the highest improvement on AIME2024 and GPQA-Diamond, surpassing the commonly used reasoning approach. Furthermore, by adapting a behavior-classification framework, we demonstrate that different pPE strategies instill distinct behavioral styles in the resulting models. These findings position pPE as a powerful yet understudied axis for RFT.
- Publication Null-Shot Prompting: Rethinking Prompting Large Language Models With Hallucination This paper presents a series of investigations into an interesting phenomenon where we observe performance increases in large language models (LLMs) when providing a prompt that causes and exploits hallucination. We propose null-shot prompting, a counter-intuitive approach where we intentionally instruct LLMs to look at and utilize information from a null section. We investigate null-shot prompting on a wide range of tasks, including arithmetic reasoning, commonsense reasoning, and reading comprehension. We observe a substantial increase in performance in arithmetic reasoning tasks for various models, with up to a 44.62% increase compared to a baseline in one model. We therefore investigate this task more deeply by utilizing a more challenging mathematics problem-solving benchmark. We observe that LLMs benefit from hallucination in null-shot prompting in this task and discuss the mathematical topics that benefit the most from introducing hallucination in the prompt. We continue our investigation by evaluating hallucination detection abilities of the LLMs when using null-shot prompting. We find surprising results where hallucination in prompts can improve hallucination detection abilities of many LLMs. We also examine the effects of introducing both reasoning, which is known to mitigate hallucination, and hallucination simultaneously in the prompt and observe another surprising turn for the mathematics problem-solving benchmark with many performance improvements. We hope this paper will spark more interest, investigations, and discussions on how hallucination in prompts affects LLMs and even bolsters them in certain cases.
- Talk Prior Prompt Engineering for Reinforcement Fine-Tuning General audience and SuperAI Engineer Season 5 (SuperAI Exhibition) · on-site
- Project OpenRLHF (Contributor) Contributed to OpenRLHF, an easy-to-use, scalable, and high-performance RLHF framework.
Theme 02
Evaluation
Benchmarks, failure taxonomies, and practical checks for whether model outputs hold up beyond demos.
- Start here: Publication BenchING: A Benchmark for Evaluating Large Language Models in Following Structured Output Format Instruction in Text-Based Narrative Game Tasks This paper presents BenchING, a new benchmark for evaluating large language models (LLMs) on their ability to follow structured output format instructions in text-based procedural content generation (PCG) tasks. The ability to condition LLMs to output in specified formats proves useful, as downstream components in LLM-integrated games often require structured outputs for exchanging information. However, there is a gap in evaluating this aspect of LLMs, especially in narrative PCG tasks, making it difficult to select LLMs and design games or applications integrating these LLMs. To demonstrate the potential of our benchmark, we evaluate nine LLMs for their ability to generate parseable formatted outputs using five selected text-based PCG tasks. We report on the performance of these LLMs on these tasks. Additionally, we categorize more detailed error types and propose solutions by utilizing LLMs to fix these errors. We also conduct a scaling study, investigating an emergent point of LLMs for their ability to fix malformed formatted content using eight quantized LLMs with varying original sizes from 0.62B to 72.3B. Furthermore, we perform a qualitative study to assess the quality of the generated content. We make our source code and raw data available for future research.
- Publication On the Robustness of Answer Formats in Medical Reasoning Models Medical reasoning models (MRMs) achieve superior performance on medical benchmarks compared to medical LLMs; however, high accuracy alone is insufficient for practical deployment. One requirement for real-world application is robustness to varying output constraints. Specifically, posing the same medical question while requesting different answer formats should not affect the underlying correctness of the response. We investigate this phenomenon in this paper, focusing on MRMs. To quantify this behavior, we propose the metric answer-format robustness: the ability to reliably generate correct outputs across varying specified formats. We examine three representative formats: multiple-choice, open-ended question-answering, and ranked lists. Across 15 proprietary and open-weight models, we observe substantial variation in format robustness (35-100%). Furthermore, we conduct controlled fine-tuning experiments on a shared backbone with matched training data to isolate the effects of the fine-tuning paradigm. We find that supervised fine-tuning yields more stable behavior across formats, whereas reinforcement fine-tuning often exhibits higher cross-format brittleness, with the degree of instability strongly dependent on reward design. Overall, answer-format robustness in MRMs is trainable yet brittle and requires careful evaluation for practical medical use.
- Project Themis Lightweight evaluation platform for LLM experiments.
- Project BenchING: Structured Output Benchmark for LLMs A benchmark and framework for evaluating how well LLMs follow structured output formats in narrative PCG tasks, with error taxonomy and scaling analysis.
Theme 03
Reasoning Models
Open reasoning models and analyses for Thai, medical, and domain-specific reasoning settings.
- Start here: Publication Typhoon T1: An Open Thai Reasoning Model This paper introduces Typhoon T1, an open effort to develop an open Thai reasoning model. A reasoning model is a relatively new type of generative model built on top of large language models (LLMs). A reasoning model generates a long chain of thought before arriving at a final answer, an approach found to improve performance on complex tasks. However, details on developing such a model are limited, especially for reasoning models that can generate traces in a low-resource language. Typhoon T1 presents an open effort that dives into the details of developing a reasoning model in a more cost-effective way by leveraging supervised fine-tuning using open datasets, instead of reinforcement learning. This paper shares the details about synthetic data generation and training, as well as our dataset and model weights. Additionally, we provide insights gained from developing a reasoning model that generalizes across domains and is capable of generating reasoning traces in a low-resource language, using Thai as an example. We hope this open effort provides a foundation for further research in this field.
- Project Typhoon T1: Open Thai Reasoning Model An open Thai reasoning model (research preview) exploring test-time reasoning strategies and instruction-following for Thai language tasks.
- Project Typhoon-Si Med-Thinking 4B A 4B medical reasoning model from Typhoon and SiData+ that generates ranked diagnoses, capturing clinical uncertainty and outperforming larger models on major medical QA benchmarks.
- Writing The Current Landscape of Reasoning Model Development Latest insights on reasoning model development approaches.
- Writing Rethinking How Medical AI Reasons: Introducing the Typhoon–SiData+ Ranked-List Medical Reasoning Model A collaborative research project exploring whether small models can outperform frontier models—like Gemini 2.5 Pro—when trained to produce ranked lists that better reflect real clinical reasoning.
Theme 04
Agentic Systems
Tools, workflows, and applied systems that turn model behavior research into usable practice.
- Start here: Writing Mastering Agentic Workflows - 20 Principles to Build Smarter AI Systems In recent years, large language models (LLMs) have evolved beyond text-based chatbots into agents capable of executing tools: functions that let them gather new information, interact with external systems, or even take actions that affect the real world.
- Talk Open Models, Smarter Agents: Practical Lessons from Modern Agentic Workflows FOSSASIA Summit 2026 · hybrid
- Talk Agentic AI With Context Engineering Agentic AI For Healthcare <Hackathon> · on-site
- Project Typhoon Application Week Built and shipped 7+ web apps integrating LLM capabilities as part of a rapid prototyping initiative.