Learning from the master: Using ChatGPT for reinforcement learning – part 3

Jokke Ruokolainen Machine Learning Architect, Solita

Published 02 Feb 2023

Reading time 6 min

In the final part of this series, we will be taking the Q-learning solution generated by ChatGPT and optimising it for maximum performance. In the previous posts, we demonstrated how ChatGPT can be used as a tool for data science workflows, including generating boilerplate solutions and addressing bugs and errors. Now, with a working solution in hand, we will explore various optimisation techniques to fine-tune and improve the performance of the Q-learning algorithm. From adjusting hyperparameters to implementing advanced techniques, this post will show you how to take your ChatGPT-generated solution to the next level.

In the previous parts of my blog series, I demonstrated how to use ChatGPT as a tool in your data science workflow and how much time it saves compared to doing everything from scratch. In this third part, I will ask ChatGPT to optimise the code in a working solution and comment on how good a job it does. As previously, the source code is available here: solita/chatGPT_RL_blog3: All resources referenced in Solita ChatGPT for RL blog part 3

And as a treat, my own solution to the university assignment is also available in the repository. Feel free to compare and evaluate both.

Optimising the agent

I’ll start with the actual neural network architecture. Let’s ask ChatGPT to suggest a better Deep Q-learning architecture.

Input: "Suggest a better deep Q-learning architecture than this one:

def build_model(self):
    """Build the neural network model for the DQN"""
    model = Sequential()
    model.add(Dense(24, input_dim=self.state_size, activation='relu'))
    model.add(Dense(24, activation='relu'))
    model.add(Dense(self.action_size, activation='linear'))
    model.compile(loss='mse', optimizer=Adam(learning_rate=self.learning_rate))
    return model

explain why this would work better"

Output: One improvement to this architecture would be to use a convolutional neural network (CNN) instead of a fully connected network for the initial layers. This is because CNNs are well-suited for processing images and other grid-like data, which is often the type of input that a DQN receives.

Additionally, you can use a technique called experience replay, which stores past experiences and randomly samples from them to improve the stability and efficiency of learning.

You can also add a target network that is used to compute the target Q-values during training. This can help to stabilise the training process and improve the overall performance of the DQN.

Another improvement would be to use a technique called Double DQN which addresses the problem of overestimation of Q values by having two different networks to estimate the Q values, one for action selection and the other for action evaluation.

Incorporating these techniques can lead to a better DQN architecture and improve the performance of the agent.
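ChatGPT's target-network and Double DQN suggestions boil down to one formula for the training targets: the online network selects the next action, and the target network evaluates it. Here is a minimal, framework-agnostic sketch of that formula; the function name and array-based signature are mine, not ChatGPT's:

```python
import numpy as np

def double_dqn_targets(q_current, q_next_online, q_next_target,
                       actions, rewards, dones, gamma=0.99):
    """Compute Double DQN training targets.

    q_current, q_next_online, q_next_target are (batch, n_actions) arrays:
    the online net's Q-values for the current and next states, and the
    target net's Q-values for the next states.
    """
    batch = np.arange(len(actions))
    best = np.argmax(q_next_online, axis=1)        # online net selects the action
    evaluated = q_next_target[batch, best]         # target net evaluates it
    targets = q_current.copy()
    targets[batch, actions] = rewards + gamma * evaluated * (1.0 - dones)
    return targets
```

Decoupling selection from evaluation is what counters the overestimation bias mentioned above; with a single network, the same noisy maximum is used for both.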

Input: Produce a model with the suggested improvements; keep in mind that the input data is 1-dimensional if you use convolutional layers.


Output: (ChatGPT's generated model code)

A solid proposition for a better neural network architecture. Convolutional layers are used for feature extraction, and their output is then passed to feed-forward layers for estimating Q-values. You can see from the notebook provided that this change had quite a drastic effect on the reward the agent gets.
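For reference, a sketch of a Conv1D-based variant along the lines described above, assuming Keras and a flat state vector of length state_size; the layer sizes and kernel widths here are illustrative assumptions, and the actual generated model is in the notebook:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Reshape, Conv1D, Flatten, Dense
from tensorflow.keras.optimizers import Adam

def build_model(state_size, action_size, learning_rate=1e-3):
    """1-D convolutional feature extractor feeding dense Q-value heads."""
    model = Sequential([
        Input(shape=(state_size,)),
        # treat the flat state vector as a 1-D signal with one channel
        Reshape((state_size, 1)),
        Conv1D(32, kernel_size=3, activation='relu', padding='same'),
        Conv1D(64, kernel_size=3, activation='relu', padding='same'),
        Flatten(),
        Dense(64, activation='relu'),
        Dense(action_size, activation='linear'),  # one Q-value per action
    ])
    model.compile(loss='mse', optimizer=Adam(learning_rate=learning_rate))
    return model
```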

I’ll test another function from the agent class before moving to the environment class.

Input: Do you have propositions on how to make this function faster?

def get_action(self, state, possible_actions_index):
    """get action from model using epsilon-greedy policy"""

I transferred the epsilon decay method to the notebook.

The ChatGPT-generated function only chooses a random action or the action with the highest predicted Q-value.

It should also consider the possible actions available in the current state. Additionally, the function only decreases epsilon after each episode, while it should decrease epsilon after each sample.

I don't want to pass the environment class as a parameter just to access the env.requests() function. We'll simply pass the possible action indices and actions and rewrite this function.


if np.random.rand() <= self.epsilon:
    # explore: choose a random action from possible actions
    return random.choice(possible_actions_index)

# exploit: choose the action with the highest predicted Q-value
state = np.array(self.convert_state_to_vector(state)).reshape(1, self.state_size)
q_vals = self.model.predict(state, verbose=0)
return possible_actions_index[np.argmax(q_vals[0][possible_actions_index])]


Output: (image of ChatGPT's suggestions, discussed point by point below)

In all honesty, some of these don't make much sense, because ChatGPT got no context in the input, just the code.

  1. Would be valid if the possible_actions_index variable were a NumPy array and not a list.
  2. With refactoring, it would be possible to use vectorised data; a good suggestion.
  3. Preprocessing the requests could be done in this case, since all requests can be pre-computed from the time matrix.
  4. Kind of makes this whole code pointless, since the agent's job is to estimate the Q-values.
  5. Epsilon decay is already done after each step.
  6. A prioritised replay buffer is a valid suggestion and wouldn't be too complex to implement, but since it involves extra computation and lookbacks during each step (compared to the vanilla replay buffer I implemented), it would make the solution slower, although the Q-learning algorithm would probably converge faster.
  7. This is the architecture that I originally submitted myself; it makes training a little slower, but the model converges faster due to increased stability.
  8. A valid option as well.
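Points 5 and 6 can be made concrete with a minimal sketch of the vanilla replay buffer plus per-step epsilon decay; the class and parameter names here are illustrative, not the repo's exact code:

```python
import random
from collections import deque

class ReplayBuffer:
    """Vanilla (uniform) experience replay. A prioritised buffer (point 6)
    would add per-sample weights and weighted sampling on top of this."""

    def __init__(self, capacity=2000):
        # deque drops the oldest transition automatically once full
        self.memory = deque(maxlen=capacity)

    def append(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # uniform random sample, at most as many items as stored
        return random.sample(self.memory, min(batch_size, len(self.memory)))

def decay_epsilon(epsilon, epsilon_min=0.01, decay=0.999):
    """Per-step (per-sample) epsilon decay, as in point 5."""
    return max(epsilon_min, epsilon * decay)
```

Calling decay_epsilon once per environment step, rather than once per episode, gives a much smoother exploration schedule over long episodes.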

Based on this, I would say that ChatGPT makes some nice propositions at the function level for optimising the agent class, but not at the solution level, since it lacks the context.

Here's the code for the optimised DQN Agent.

Let's move on to the environment class. In the previous part of this blog series, I pointed out that there is a repeated-code problem in the reward and next-state functions. How will ChatGPT react to it?

Time to see how ChatGPT would optimise the environment class.

Input: Give a suggestion on how to optimise this class


Output: (image of ChatGPT's suggestions for the environment class)
  1. The first suggestion is already implemented in the code.
  2. The second and third suggestions are good, but you must be careful with the state_space tuple, since you need to add a state for the offline action, and once the tuple is created you cannot modify it.
  3. Suggestions 4 to 6 resonate well with the earlier point about not calling the state-vectorisation function: if the data were already in NumPy format, we wouldn't need the vectorisation calls at all. That would, however, require extensive refactoring and take some time to achieve.
  4. The last suggestion is perhaps too much for our use case.
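To illustrate suggestions 4 to 6, here is a hedged sketch of encoding a state into a NumPy vector up front, assuming the classic cab-driver formulation with 5 locations, 24 hours and 7 days; these dimensions are my assumption about the assignment, not taken from the post:

```python
import numpy as np

def encode_state(loc, hour, day, n_loc=5, n_hour=24, n_day=7):
    """One-hot encode (location, hour, day) into a flat NumPy vector once,
    so no per-call Python-level vectorisation is needed during training."""
    vec = np.zeros(n_loc + n_hour + n_day, dtype=np.float32)
    vec[loc] = 1.0                      # location block
    vec[n_loc + hour] = 1.0             # hour-of-day block
    vec[n_loc + n_hour + day] = 1.0     # day-of-week block
    return vec
```

If every state is stored in this format from the start, the repeated convert_state_to_vector calls in the agent disappear entirely.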

Let's check how ChatGPT optimises reward_func and next_state_func, where I pointed out that we're doing the same 4-D array slicing in both functions.

Input: Explain how to make this code snippet computationally less expensive and easier to read. *the reward_func and next_state_func as input*


Output: (image of ChatGPT's suggested refactoring)

Good suggestions; you can see how I implemented them in the repository: optimised environment class.
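The gist of that refactor is to compute the shared 4-D slice once in a helper and reuse it in both functions. A sketch follows; the Time_matrix layout and the state/action tuples are my assumptions about the assignment, not necessarily the repo's exact code:

```python
import numpy as np

def get_trip_times(time_matrix, state, action):
    """Return (time_to_pickup, time_of_ride) from the shared 4-D lookup,
    so reward_func and next_state_func no longer duplicate the slicing.

    Assumed layout: time_matrix[start, end, hour, day] gives trip duration,
    state = (location, hour, day), action = (pickup, drop).
    """
    loc, hour, day = state
    pickup, drop = action
    time_to_pickup = time_matrix[loc, pickup, hour, day]
    # roll the clock forward before looking up the ride itself
    ride_hour = int((hour + time_to_pickup) % 24)
    ride_day = int((day + (hour + time_to_pickup) // 24) % 7)
    time_of_ride = time_matrix[pickup, drop, ride_hour, ride_day]
    return time_to_pickup, time_of_ride
```

Both the reward and the next-state computation can then call this helper once per step instead of slicing the matrix independently.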


As a conclusion to this blog series, I have demonstrated how data scientists can use ChatGPT as a tool to streamline their working process and get suggestions for problems or code-quality enhancements.

ChatGPT is quite proficient at suggesting good and mostly relevant changes to solutions that already work, but it isn't yet quite able to produce a working solution entirely by itself.

It will be quite exciting to start solving business problems with large language models, and personally, I can't wait to test the coding-optimised version of GPT-3.5 that Microsoft is now offering: Azure OpenAI Service – Advanced Language Models | Microsoft Azure

If you are interested in learning more about opportunities with generative models and optimisation, or are looking for a partner to help you with your use cases, don't hesitate to contact us.
