Extracting entities from text
The extract function is a tool for pulling lists of structured entities from text. It can identify and retrieve many kinds of data, from primitive types such as integers and strings to custom types and Pydantic models. It can also follow nuanced instructions, which makes it suitable for a wide range of extraction tasks.
What it does
The extract function pulls lists of structured entities from text.
Extract product features from user feedback:
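A minimal sketch of such a call, assuming the marvin package is installed and an LLM API key is configured. The helper name extract_features is hypothetical, and the import is deferred into the function so that defining it has no side effects.

```python
def extract_features(feedback: str) -> list[str]:
    """Return the product features mentioned in a piece of user feedback."""
    import marvin  # deferred: needs the marvin package and an API key

    # No target is passed, so extract returns a list of strings.
    return marvin.extract(
        feedback,
        instructions="extract any product features that were mentioned",
    )
```

Calling extract_features("I love my new phone's camera, but the battery life could be improved.") might return something like ['camera', 'battery life']; the exact result depends on the model.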
Suppose you want to extract any people mentioned in some text
How it works
Marvin creates a schema from the provided type and instructs the LLM to use the schema to format its JSON response. Unlike casting, the LLM is told not to use the entire text, but rather to look for any mention that satisfies the schema and any additional instructions.
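The idea can be illustrated with a small standard-library sketch. This is not Marvin's actual implementation: the names JSON_TYPES, Person, schema_for, and build_prompt are hypothetical, the prompt wording is invented, and only flat TypedDicts with primitive fields are handled.

```python
import json
from typing import TypedDict, get_type_hints

# Hypothetical mapping from Python types to JSON Schema type names.
JSON_TYPES = {str: "string", int: "integer", float: "number", bool: "boolean"}


class Person(TypedDict):
    name: str
    age: int


def schema_for(target: type) -> dict:
    """Build a JSON Schema describing a list of `target` items."""
    hints = get_type_hints(target)
    item = {
        "type": "object",
        "properties": {k: {"type": JSON_TYPES[t]} for k, t in hints.items()},
        "required": list(hints),
    }
    # Extraction always yields a list, so the top-level schema is an array.
    return {"type": "array", "items": item}


def build_prompt(text: str, target: type, instructions: str = "") -> str:
    """Compose a prompt asking for every mention that fits the schema."""
    return (
        "Find every entity in the text that matches the schema. "
        "Do not try to convert the whole text.\n"
        f"Schema: {json.dumps(schema_for(target))}\n"
        f"Instructions: {instructions or 'none'}\n"
        f"Text: {text}"
    )
```

The point of the sketch is the division of labor: the type supplies the output format, while the prompt tells the model to search for matching mentions rather than convert the whole input.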
extract supports almost all builtin Python types, plus Pydantic models and Python's TypedDict. Pydantic models are especially useful for describing particular features of the generated data, such as locations, dates, or more complex types. Builtin types are most useful in conjunction with instructions that provide more precise criteria for generation.
To specify the output type, pass it as the target argument to extract. The function will always return a list of matching items of the specified type. If no target type is provided, extract will return a list of strings.
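As a rough sketch of the target argument, assuming marvin is installed and configured (the helper name extract_ints is hypothetical):

```python
def extract_ints(text: str) -> list[int]:
    """Return every integer that the model finds in the text."""
    import marvin  # deferred: needs the marvin package and an API key

    # target=int asks for a list of ints; omitting target would
    # yield a list of strings instead.
    return marvin.extract(text, target=int)
```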
To extract multiple types in one call, use a Union (or | in Python 3.10+). Here's a simple example for combining float and int values, but you could do the same for any other types:
LLMs perform best with clear instructions, so compound types may require more guidance as the type itself isn't sending as clear a signal.
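A sketch of a compound target paired with explicit instructions, assuming marvin is installed and configured. The helper name extract_amounts and the instruction wording are hypothetical; typing.Union is used so the code also runs on Python versions before 3.10.

```python
from typing import Union


def extract_amounts(text: str) -> list:
    """Return monetary amounts from the text as ints or floats."""
    import marvin  # deferred: needs the marvin package and an API key

    return marvin.extract(
        text,
        target=Union[int, float],
        # The union alone does not say which numbers matter, so the
        # instructions carry that signal.
        instructions="only extract amounts of money",
    )
```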
extract will always return a list of the type you provide.
When extracting entities, it is often necessary to give detailed guidance about either the criteria for extraction or the format of the output. For example, you may want to extract all numbers from a text, or you may want to extract all numbers that represent prices, or you may want to extract all numbers that represent prices greater than $100. You may want to extract all dates, or you may want to extract all dates that are in the future. You may want to extract all locations, or you may want to extract all locations that are in the United States.
For this purpose, extract accepts an instructions argument, which is a natural language description of the desired output. The LLM will use these instructions, in addition to the provided type, to guide its extraction process. Instructions are especially important for types that are not self-documenting, such as Python builtins.
Here are the above examples, illustrated with appropriate instructions. First, extracting different sets of numerical values:
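The three numeric cases could be sketched as follows, assuming marvin is installed and configured. NUMBER_INSTRUCTIONS and extract_numbers are hypothetical names, and the instruction strings are one possible wording rather than a required format.

```python
# One instruction string per case; the target type stays the same.
NUMBER_INSTRUCTIONS = {
    "all": "all numbers",
    "prices": "only numbers that represent prices",
    "large_prices": "only numbers that represent prices greater than $100",
}


def extract_numbers(text: str, kind: str = "all") -> list[float]:
    """Extract numbers from text, narrowed by the chosen instruction."""
    import marvin  # deferred: needs the marvin package and an API key

    return marvin.extract(
        text,
        target=float,
        instructions=NUMBER_INSTRUCTIONS[kind],
    )
```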
Next, extracting specific dates:
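One way to sketch the date case, assuming marvin is installed and configured. The helper extract_future_dates is hypothetical; it asks for strings in YYYY-MM-DD form and parses them locally, which is a design choice of this sketch, not something the library requires.

```python
from datetime import date


def extract_future_dates(text: str, today: date) -> list[date]:
    """Return the dates mentioned in text that fall after `today`."""
    import marvin  # deferred: needs the marvin package and an API key

    raw = marvin.extract(
        text,
        target=str,
        instructions=(
            f"all dates after {today.isoformat()}, formatted as YYYY-MM-DD"
        ),
    )
    parsed = []
    for item in raw:
        try:
            parsed.append(date.fromisoformat(item))
        except ValueError:
            continue  # skip anything that is not a valid ISO date
    # Re-check locally, since the model may not follow the instruction exactly.
    return [d for d in parsed if d > today]
```

Filtering again on the Python side keeps the result correct even when the model returns a malformed or past date.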
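Finally, extracting specific locations:
from pydantic import BaseModel
from marvin import extract

class Location(BaseModel):  # fields inferred from the outputs shown below
    city: str
    country: str

text = 'I live in New York, but I am visiting London next week.'

extract(text, Location)
# [Location(city="New York", country="US"), Location(city="London", country="UK")]

extract(text, Location, instructions='all locations in the United States')
# [Location(city="New York", country="US")]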
Sometimes the cast operation is obvious, as in the "big apple" example above. Other times, it may be more nuanced. In these cases, the LLM may require guidance or examples to make the right decision. You can provide natural language instructions when calling cast() in order to steer the output.
In a simple case, instructions can be used independent of any type-casting. Here, we want to keep the output a string, but get the 2-letter abbreviation of the state.
You can pass parameters to the underlying API via the model_kwargs argument of extract. These parameters are passed directly to the API, so you can use any supported parameter.
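A short sketch of model_kwargs, assuming marvin is installed and configured and that the underlying API accepts a temperature parameter. The helper name extract_ints_deterministic is hypothetical.

```python
def extract_ints_deterministic(text: str) -> list[int]:
    """Extract integers while asking the API for low-variance output."""
    import marvin  # deferred: needs the marvin package and an API key

    return marvin.extract(
        text,
        target=int,
        # Forwarded to the API as-is; temperature is an assumed parameter.
        model_kwargs={"temperature": 0},
    )
```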
If you are using Marvin in an async environment, you can use