
Training Data Requirements in Generative AI

Understand the guidelines for creating training data for fine-tuning the pretrained models in OCI Generative AI.
Custom models accept only one training dataset file, in JSONL (JSON Lines) format. The file must contain at least 32 prompt/completion pair examples. The dataset is randomly split in an 80:20 ratio for training and validation. There's no maximum number of examples for the training file, but large datasets take longer to train.
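To get a feel for what an 80:20 split means for a small dataset, the following minimal sketch shuffles 32 examples locally and cuts the list at 80 percent. This only illustrates the ratio: the service performs its own split, and the function name split_examples, the fixed seed, and the round-down behavior are assumptions made for this example, not the service's implementation.

```python
import random


def split_examples(examples, train_percent=80, seed=0):
    """Shuffle a copy of the examples and cut it at train_percent (rounded down)."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # seeded only to make the sketch repeatable
    cut = len(shuffled) * train_percent // 100
    return shuffled[:cut], shuffled[cut:]


# 32 placeholder examples, the minimum stated in this topic.
examples = [{"prompt": f"Question {i}", "completion": f"Answer {i}"} for i in range(32)]
train, validation = split_examples(examples)
print(len(train), len(validation))
```

With the minimum of 32 examples, only a handful end up in the validation portion, so adding more examples than the minimum gives the validation metrics more to work with.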
About JSONL
A JSONL file contains a new JSON value or object on each line. Unlike a regular JSON file, the file isn't evaluated as a whole. Instead, each line is treated as if it is a separate JSON file. This format is ideal for storing a set of inputs in JSON format.
The OCI Generative AI service accepts a JSONL file for fine-tuning custom models in the following format:
{"prompt": "<first prompt>", "completion": "<expected completion given first prompt>"}
{"prompt": "<second prompt>", "completion": "<expected completion given second prompt>"}
.
.
.
JSONL Example
{"prompt": "What is the capital of France?", "completion": "The capital of France is Paris."}
{"prompt": "What is the smallest state in the USA?", "completion": "The smallest state in the USA is Rhode Island."}
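A file in this format can be generated with a few lines of Python instead of being written by hand. The following is a minimal sketch using only the standard library; the function names to_jsonl and write_jsonl and the file name training_data.jsonl are hypothetical, chosen for this example.

```python
import json


def to_jsonl(pairs):
    """Serialize (prompt, completion) pairs, one JSON object per line."""
    lines = []
    for prompt, completion in pairs:
        record = {"prompt": prompt, "completion": completion}
        # json.dumps escapes newlines inside strings, so each object stays on one line.
        lines.append(json.dumps(record, ensure_ascii=False))
    return "".join(line + "\n" for line in lines)


def write_jsonl(path, pairs):
    """Write the pairs to a UTF-8 encoded file with \n line endings."""
    with open(path, "w", encoding="utf-8", newline="\n") as handle:
        handle.write(to_jsonl(pairs))


pairs = [
    ("What is the capital of France?", "The capital of France is Paris."),
    ("What is the smallest state in the USA?", "The smallest state in the USA is Rhode Island."),
]

if __name__ == "__main__":
    write_jsonl("training_data.jsonl", pairs)  # hypothetical file name
```

The two pairs here only demonstrate the format; a real training file needs at least 32 examples.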
Note: Ensure that each JSONL dataset file that you create for Generative AI has the following properties:
The file is UTF-8 encoded.
Each line item contains a valid JSON object.
Each JSON object has two properties: "prompt" and "completion".
Each JSON object is entered on a new line or followed by a newline character (\n).
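The properties above can be checked locally before the file is uploaded. The following is a minimal sketch, not an official validation tool: the function name validate_jsonl_bytes is hypothetical, the check reads "two properties" as exactly "prompt" and "completion" (an interpretation), and the minimum of 32 examples comes from the requirement stated earlier in this topic.

```python
import json

REQUIRED_KEYS = {"prompt", "completion"}
MIN_EXAMPLES = 32  # minimum number of examples stated in this topic


def validate_jsonl_bytes(data):
    """Return a list of problems found; an empty list means every check passed."""
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return [f"file is not valid UTF-8: {exc}"]

    lines = text.split("\n")
    if lines and lines[-1] == "":
        lines.pop()  # a newline after the last object is fine

    problems = []
    valid_count = 0
    for number, line in enumerate(lines, start=1):
        try:
            value = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append(f"line {number}: invalid JSON ({exc.msg})")
            continue
        if not isinstance(value, dict):
            problems.append(f"line {number}: not a JSON object")
            continue
        if set(value) != REQUIRED_KEYS:
            problems.append(f"line {number}: expected exactly 'prompt' and 'completion'")
            continue
        valid_count += 1

    if valid_count < MIN_EXAMPLES:
        problems.append(f"only {valid_count} valid examples; at least {MIN_EXAMPLES} are required")
    return problems
```

To check a file, read it in binary mode and pass the bytes in, for example validate_jsonl_bytes(open("training_data.jsonl", "rb").read()), where the file name is again hypothetical. Reading bytes rather than text lets the sketch report an encoding problem itself instead of failing while opening the file.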
After you create the JSONL file, add your dataset to an Object Storage bucket.
