I skipped the weeknotes last week and popped up to London to see some friends. So, so nice to give them a hug, have a beer, share dinner. I’ve been missing that most out in the countryside. I just got back from seeing my grandparents on my Dad’s side for the first time in maybe a year and a half. It was a hot weekend, and very good to see them. They’re getting on now, but doing remarkably well considering everything. We fixed a bunch of things and generally tried to make life a bit easier while we were there; good time spent with family.

While we were driving back, something awful happened in the village: the social centre burned down. It housed a community pub and the post office, provided a rentable space to local groups, and formed part of the heart of the village. I’ve been thinking about environments as a result of all of this: what I’m lucky to be in, where I would like to be, the things we build and the ways in which they slowly or suddenly fall apart. I’m not thinking anything interesting about them; it just feels like it’s weighing on my mind.

Tetris and my Dissertation

I’m working on the environment for my dissertation experiments at the moment. It’s a simplified version of Tetris where you indicate the piece orientation and the column to drop it into, essentially collapsing the drop trajectory into a single action and allowing learning to be applied at a quicker rate. There are some other reinforcement learning methods I’ve heard about more recently that seem like they might be promising for the full-trajectory version, but I don’t think I’m in a place to try them out yet.
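
As a rough sketch of what that action space looks like in practice (the names and sizes here are illustrative, not my actual environment code), a (rotation, column) pair maps neatly onto gym’s MultiDiscrete space:

```python
# Illustrative only: a (rotation, column) action encoded as a gym
# MultiDiscrete space, assuming 4 rotations and a 10-column board.
from gym import spaces

n_rotations, n_columns = 4, 10
action_space = spaces.MultiDiscrete([n_rotations, n_columns])

action = action_space.sample()   # e.g. array([2, 7])
rotation, column = action        # the piece drops straight into place
```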

In particular, “Exploration by Random Network Distillation” sounds like it might be applicable? My reasoning is that exploration in a full-trajectory game of Tetris, where reward is only gained by clearing a row, isn’t too dissimilar to Montezuma’s Revenge, where a complex sequence of actions also has to occur before receiving any reward. This might be something to try out, but it sounds pretty high risk in terms of achieving results, so I’m going to work on the simpler goals from my project plan first. I have a similar plan for Monte-Carlo Tree Search strategies, which would probably work but might involve a lot of awkward engineering practicalities. A lot of the excellent machine learning papers are in part just demonstrating really impressive feats of engineering with their training pipelines.
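
For context, the core of RND is just an exploration bonus taken from the prediction error between a fixed random network and a trained predictor; the sketch below is my reading of that idea in generic PyTorch (sizes made up), not code from the paper or from my project.

```python
# A minimal sketch of the RND exploration bonus; shapes are illustrative.
import torch
import torch.nn as nn

obs_dim = 24  # e.g. a flattened board plus current-piece info (made up)

def make_net():
    return nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, 32))

target = make_net()     # fixed, randomly initialised network
predictor = make_net()  # trained to imitate the target
for p in target.parameters():
    p.requires_grad_(False)

optimiser = torch.optim.Adam(predictor.parameters(), lr=1e-4)

def intrinsic_reward(obs_batch):
    # Prediction error is high on novel states, so it can be added to the
    # sparse extrinsic reward (e.g. cleared rows) as an exploration bonus.
    error = (predictor(obs_batch) - target(obs_batch)).pow(2).mean(dim=1)
    optimiser.zero_grad()
    error.mean().backward()
    optimiser.step()
    return error.detach()
```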

So far I’ve created the simplified environment (it needs more documentation) and a smaller-scale 2 x 6 board that uses a reduced piece set styled after this version from S. Melax. When I threw the stable-baselines3 implementation of PPO at the smaller board environment, I started to see some promising signs of learning pretty quickly! Hopefully that indicates PPO will be promising on the standard-size board too.
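
For anyone curious, getting PPO running is really only a few lines with stable-baselines3; `SmallTetrisEnv` below is just a stand-in name for my 2 x 6 environment, and the timestep budget is arbitrary.

```python
# A minimal sketch of training PPO from stable-baselines3 on the small board.
from stable_baselines3 import PPO

env = SmallTetrisEnv()                    # hypothetical gym-compatible env
model = PPO("MlpPolicy", env, verbose=1)  # default MLP policy
model.learn(total_timesteps=200_000)      # arbitrary budget for a first run
```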

Reading List