This continues the theme from Part 1 where I explored how ChatGPT with the GPT-3.5 model would do with a code interpretation challenge. It was somewhat impressive, but with obvious shortcomings. So how did the GPT-4 model do?
Session Start
What will be the result?
```python
c = [False if int(i) % 2 != 0 else True for i in str(24901)]
b = [False if False in c else True]
print(b[0])
```
In this code snippet, the list comprehension is creating a list `c` with boolean values based on the condition that each digit in the integer 24901 is even. Let's break down the list comprehension step by step:

- Convert integer 24901 to a string and iterate through its digits: '2', '4', '9', '0', '1'
- Convert each digit back to an integer and check if it's even (`i % 2 == 0`). If it's even, append `True` to the list `c`, otherwise append `False`.

Resulting list `c`: `[True, True, False, True, False]`
Then, another list comprehension creates the list `b` based on the condition that all elements in the list `c` must be `True`. If there is any `False` in the list `c`, `b` will be `[False]`. In this case, since `c` contains two `False` elements, the list `b` will be `[False]`.

Finally, the code snippet prints the first (and only) element in the list `b`, which is `False`.
Thoughts
So… it nailed it. It explained the snippet better and got the intermediate steps correct. The explanation covers the transformations between intermediate types, the action of the list comprehension, and how each array entry is evaluated. The evaluation of the array `c` is completely correct, and how `b` is constructed is well explained.
Comparing GPT-4's performance on this to GPT-3.5's, it has definitely done its homework.
Now, will it believe in itself?
Expressing Doubt
are you sure that’s right?
Apologies for the mistake in my previous response. Let’s go through the code snippet again:
```python
c = [False if int(i) % 2 != 0 else True for i in str(24901)]
b = [False if False in c else True]
print(b[0])
```
- Convert integer 24901 to a string and iterate through its digits: '2', '4', '9', '0', '1'
- Convert each digit back to an integer and check if it's odd (`i % 2 != 0`). If it's odd, append `False` to the list `c`, otherwise append `True`.
Resulting list `c`: `[True, True, False, True, False]`

The list comprehension for `b` should actually be a single-element list containing `True` if there are no `False` values in `c`, and `False` otherwise. In this case, since `c` contains two `False` elements, the list `b` will be `[False]`.

Finally, the code snippet prints the first (and only) element in the list `b`, which is `False`.
Conclusion
So, it's still a bit apologetic, but unlike the GPT-3.5 model, it doesn't trick itself into changing its answer. It simply works through the problem again and reasserts the right answer.
It did the same when I told it flat out that I thought it was wrong: it apologized and then regenerated the explanation through to the correct final answer.
The consistent apology pattern isn't surprising given we're dealing with a language model: the most likely response when challenged or questioned is probably an apology. Still, a huge improvement would be choosing words that express optionality and doubt. Imagine how close to perfect the last response would be if its first sentence were instead "Oh, maybe I made a mistake".
All in all, I think this coding challenge serves as a good early barometer of code explanation power, and a challenging one: unlike simply generating code from a pattern, evaluation is required (as the generated explanations detail). The increase in performance from the GPT-3.5 model to the GPT-4 model on this is definitely impressive.
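As an aside, the evaluation the challenge forces has a more idiomatic one-liner equivalent using `all()` — a sketch for comparison, not part of the challenge as posed:

```python
# Idiomatic equivalent of the challenge snippet: True only if every
# digit of 24901 is even; all() short-circuits at the first odd digit.
result = all(int(i) % 2 == 0 for i in str(24901))
print(result)  # False, because the digits 9 and 1 are odd
```

Part of what makes the original snippet a good test is precisely that it avoids this recognizable idiom, so the model has to evaluate the logic rather than pattern-match it.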
If I get the chance, I'll come up with a more complicated challenge to push GPT-4 closer to its limits, if for nothing else than to see whether later editions improve. But I'll also retain this little coding challenge as an early test for other generative language models that include code interpretation.
Next up, I’ll see how Google’s Bard fares…