How to build function calling and JSON mode for open-source and fine-tuned LLMs
Baseten has announced support for function calling and structured output for LLMs deployed with its TensorRT-LLM Engine Builder, adding model-server-level support for two key features. Function calling lets users pass a set of defined tools to an LLM in the request body, while structured output enforces an output schema supplied as part of the LLM input. Both features are built into Baseten's customized version of NVIDIA's Triton Inference Server and use logit biasing to ensure that only schema-valid tokens are generated during LLM inference. After the first call with a given schema, the implementation adds minimal latency, allowing efficient use of these new features.
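To make the two features concrete, below is a minimal sketch of a request that exercises both. The deployment URL, auth header, and exact field names (`tools`, `response_format`) are illustrative assumptions modeled on common OpenAI-compatible request bodies, not a verbatim copy of the Engine Builder API; consult the deployment's documentation for the exact schema.

```python
import requests

# Hypothetical deployment URL and API key -- substitute your own values.
MODEL_URL = "https://model-<id>.api.baseten.co/production/predict"
API_KEY = "YOUR_API_KEY"

# Function calling: a tool definition the model may choose to invoke.
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Look up the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
]

# Structured output: a JSON schema the model's output must conform to.
response_format = {
    "type": "json_schema",
    "json_schema": {
        "schema": {
            "type": "object",
            "properties": {
                "city": {"type": "string"},
                "temperature_c": {"type": "number"},
            },
            "required": ["city", "temperature_c"],
        }
    },
}

# Both fields are shown in one payload for brevity; a real request would
# typically use whichever feature the use case calls for.
payload = {
    "messages": [{"role": "user", "content": "What's the weather in Paris?"}],
    "tools": tools,
    "response_format": response_format,
    "max_tokens": 256,
}

resp = requests.post(
    MODEL_URL,
    headers={"Authorization": f"Api-Key {API_KEY}"},
    json=payload,
    timeout=60,
)
print(resp.json())
```

Because the logit-biasing masks are cached per schema, only the first request with a new schema pays a compilation cost; subsequent requests with the same schema run at near-normal latency.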
Company: Baseten
Date published: Sept. 12, 2024
Author(s): Bryce Dubayah, Philip Kiely
Word count: 1339
Language: English
Hacker News points: 1