FAQ about Ocsai 1.5: languages and new tasks
A few months ago, we put up the Ocsai 1.5 model, which is multi-lingual and multi-test. I didn’t want to wait until my sabbatical ended and the papers were written to make it available. It was trained with a randomized masking method, where the model sometimes didn’t get all available information, to encourage more flexibility and interpretative ability in the model.
Here is the direct Pearson correlation performance of Ocsai 1.5 against multi-judge ground truth, per language and task:
Alternate Uses Task
Language | r |
---|---|
ara | 0.273 |
chi | 0.543 |
dut | 0.726 |
eng | 0.736 |
fre | 0.722 |
ger | 0.754 |
heb | 0.463 |
ita | 0.602 |
pol | 0.672 |
rus | 0.614 |
spa | 0.603 |
Other tasks, English
Task type | r |
---|---|
complete the sentence | 0.860 |
consequences | 0.560 |
instances | 0.917 |
metaphors | 0.704 |
Performance is generally a mid-to-high correlation with human judges. The primary concern is the right-to-left languages, which need attention in future work.
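For reference, this is roughly how each cell in those tables is computed: the model’s score is correlated with the multi-judge ground truth for the same responses. A minimal sketch, assuming pandas and scipy, that the ground truth is the average of the judges’ ratings, and made-up column names (`judge_*`, `model_score`, `language`):

```python
# Rough sketch of the evaluation: Pearson r between model scores and
# multi-judge ground truth, per language. Column names are illustrative.
import pandas as pd
from scipy.stats import pearsonr

df = pd.read_csv("evaluation_set.csv")  # hypothetical file, one row per response

# Ground truth: average the available human judges for each response.
judge_cols = [c for c in df.columns if c.startswith("judge_")]
df["judge_mean"] = df[judge_cols].mean(axis=1)

# One correlation per language, as in the tables above.
for lang, group in df.groupby("language"):
    r, p = pearsonr(group["judge_mean"], group["model_score"])
    print(f"{lang}\tr={r:.3f}")
```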
Training was done on a variety of open datasets (Acar et al., 2024; Beaty et al., 2018; Beaty & Silvia, 2012; DiStefano et al., 2024; Dumas et al., 2020; Hass, 2017; Hass et al., 2018; Hofelich Mohr et al., 2016; Patterson et al., 2023; Silvia et al., 2008, 2009, 2017; Yang et al., 2023).
The training for this model was done with more deduplication than in past work. Earlier work only de-duplicated exact matches, folding together identical responses so that the same response wasn’t sitting in the training data while also artificially inflating performance in the evaluation dataset. Ocsai 1.5 uses a more detailed text fingerprint, so 'paperweight', 'Paperweight', and 'Weight (paper)' are considered the same response.
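To give a sense of what that kind of fingerprinting looks like (an illustrative sketch, not the exact normalization Ocsai 1.5 uses): lowercase the text, strip punctuation, and sort the remaining tokens before comparing.

```python
# Illustrative text fingerprint for de-duplication; not the exact rules used in Ocsai 1.5.
import re

def fingerprint(response: str) -> str:
    """Collapse casing, punctuation, and word order so trivial variants match."""
    tokens = re.findall(r"[a-z0-9]+", response.lower())
    return "".join(sorted(tokens))

# 'paperweight', 'Paperweight', and 'Weight (paper)' all share one fingerprint.
assert fingerprint("paperweight") == fingerprint("Paperweight") == fingerprint("Weight (paper)")

# De-duplicate by fingerprint, keeping the first surface form seen.
responses = ["paperweight", "Paperweight", "Weight (paper)", "doorstop"]
unique = {}
for r in responses:
    unique.setdefault(fingerprint(r), r)
print(list(unique.values()))  # ['paperweight', 'doorstop']
```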
What we don’t want to measure in evaluation is ‘how well does this system score things it has seen before’, because it’s not informative: we know it will nearly always be right on those cases! What’s more interesting is whether a system can be taught to understand the nature of originality enough to mimic a human judge on unseen responses, items, or even tasks.
“Okay, you called it 1.5 because there’s clearly a 2, right?”
Yes! But I’ll share that later. Ocsai 2 standardizes response scores across languages and will supersede Ocsai 1.5 in most cases. It works a bit differently, because addressing cross-lingual normalization meant addressing the challenge of ‘fuzzy duplication’ even more - where two responses are written differently but have the same semantic meaning.
How do the original English-only models perform on the same evaluation data?
Other researchers have noted that the original Ocsai didn’t do too badly with other languages. Here is the evaluation on ocsai-davinci2:
Multi-lingual Performance, Ocsai 1
Language | r |
---|---|
ara | 0.244 |
chi | 0.387 |
dut | 0.538 |
fre | 0.560 |
ger | 0.574 |
heb | 0.304 |
ita | 0.505 |
pol | 0.487 |
rus | 0.443 |
spa | 0.631 |
Yes, it did surprisingly well, but in extremely unsurprising news, training in individual languages improves performance in those languages (except Spanish here 🤷). The caveat is that it’s not just the languages that Ocsai 1.5 knows about, but also (different responses from) the same datasets. The gulf might be narrower with, say, a completely new Polish dataset judged by completely different human judges. Ocsai 2 will give a slightly better comparison here, smoothing more over dataset- and judge-specific biases.
I also evaluated the old model on the new evaluation data in English, excluding anything that had been in the ocsai-davinci2 training dataset. It performed at r=0.601 (vs r=0.773 for Ocsai 1.5), partly because the newly added English data is a bit more diverse, and partly because of the stricter de-duplication.
Full question vs prompt
Part of Ocsai 1.5 is the option to describe items as a full question rather than just a short ‘prompt’.
For example: “What is a surprising use for a paperclip?”, rather than just “paperclip”. For uses, instances, and complete the sentence, the full question is unnecessary - something like ‘paperclip’ will suffice. For more complex questions, like consequences prompts, writing prompts, or problem-solving items, the full question can help the model.
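As a rough illustration of the difference (the function and parameter names here are hypothetical placeholders standing in for whichever scoring client or endpoint you use, not the actual Ocsai API fields):

```python
# Hypothetical call pattern: short item vs full question.
# score_response() and its parameters are placeholders, not the real Ocsai API.

def score_response(item: str, response: str, task: str = "uses",
                   language: str = "eng") -> float:
    """Stub standing in for a real scoring call."""
    return 0.0  # replace with an actual request to your scoring service

# Short items are enough for uses, instances, and complete the sentence:
score_response(item="paperclip", response="hold loose papers together", task="uses")

# For consequences, writing prompts, or problem-solving items, pass the full question:
score_response(
    item="What would be the consequences if people no longer needed sleep?",
    response="night shifts would become the norm",
    task="consequences",
)
```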
More Languages and Task Types
A less-noticed feature of Ocsai 1.5 is that you can type your own tasks and languages into the dropdowns.
For tasks, it’s better to be a bit more verbose if they’re not part of the dropdown. For example, “Write a creative short story” explains to the model what the task goal is much more clearly than, say, ‘writing’.
For languages, the model expects the three-character ISO-639-2 language code, though it can figure out other forms.
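The codes used in this post’s evaluation are easy to map from language names. A small illustrative helper (the mapping only covers the languages above and is not part of Ocsai itself):

```python
# Map common language names to the ISO-639-2 codes used above; purely illustrative.
ISO_639_2 = {
    "arabic": "ara", "chinese": "chi", "dutch": "dut", "english": "eng",
    "french": "fre", "german": "ger", "hebrew": "heb", "italian": "ita",
    "polish": "pol", "russian": "rus", "spanish": "spa",
}

def to_iso_639_2(name: str) -> str:
    """Normalize a language name (or an already-valid code) to a three-letter code."""
    key = name.strip().lower()
    if key in ISO_639_2.values():  # already a code like 'ger'
        return key
    return ISO_639_2[key]

print(to_iso_639_2("German"))  # -> 'ger'
```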
How well Ocsai 1.5 works on previously-unseen tasks or languages - the ones not in the dropdown - is not yet evaluated. If you have or know of additional datasets with human-judged originality, particularly for tasks/languages not in the above evaluation, let me know. Ocsai is trained on open data, so I encourage you to share with everybody, not just us!
Which one should I choose: 1.5 vs ocsai-chatgpt?
For non-English tasks or custom tasks, use ocsai-1.5. The performance on the English-language consequences task is better with ocsai-davinci3 or ocsai-chatgpt2. For English-language Alternate Uses, either can be justified for common items, though 1.5 seems better for less typical items.
Bug Fixes and Updates
From the last newsletter, famous last words: “I was surprised how well Python has improved in its handling of asynchronous code”. Some bugs revealed themselves after last week’s update - a big headache because they were inconsistent and hard to replicate. I think it’s fixed now, for the most part; thank you to Pier-Luc de Chantal and Cortney Rodet for helping me find the issue.
I also updated the secondary large-file scoring system. The speed-up in the main system makes the backup site less important, though it still shows progress better for large datasets. If you’re scoring fewer than a few thousand responses, use the main site, since it gets more attention from us.
— Peter