Monday, July 01, 2024

Google Datastore URL-safe keys: incompatibilities from db to ndb inaccessible to LLMs

This is a story about accidental, unnecessary platform migration incompatibilities, which turned into problems undiscovered by LLMs with access to the relevant corpus.

On Google App Engine, in Python 2.7 using webapp2 and db, I had to do this sort of thing to get a list of the URL-safe keys for a set of entities:

return_string = ""
cities = db.GqlQuery("SELECT * FROM City ORDER BY name ASC")
for c in cities:
  return_string = return_string + str(c.key()) + ","
return return_string

Now, moving to Python 3, with Flask and ndb:

return_string = ""
with client.context():
    cities = City.query().order(City.name).fetch()
    for c in cities:
        # to_legacy_urlsafe() returns bytes, so decode before concatenating
        urlsafe = c.key.to_legacy_urlsafe(location_prefix="s~")
        return_string = return_string + urlsafe.decode('utf-8') + ","
return return_string

... and I'm sure this won't be the end of the saga.
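The Python 3 loop above can be pulled into a small helper. This is only a sketch: legacy_urlsafe_csv and FakeKey are hypothetical names, and FakeKey is a stand-in for ndb.Key so the snippet runs without a Datastore connection. The only ndb call it relies on is to_legacy_urlsafe(location_prefix=...), which returns bytes. The trailing comma is kept on purpose, to match the output of the old db code.

```python
def legacy_urlsafe_csv(keys, location_prefix="s~"):
    """Return each key's legacy URL-safe string, each followed by a comma."""
    parts = []
    for key in keys:
        # to_legacy_urlsafe() returns bytes, so decode before building the string
        token = key.to_legacy_urlsafe(location_prefix=location_prefix)
        parts.append(token.decode("utf-8"))
    # Keep the trailing comma so the result matches the old db-based output
    return "".join(part + "," for part in parts)


class FakeKey:
    """Hypothetical stand-in for ndb.Key; it just echoes a fixed token."""

    def __init__(self, token):
        self._token = token

    def to_legacy_urlsafe(self, location_prefix):
        # A real ndb.Key would serialize itself here; this stub does not
        return self._token.encode("utf-8")
```

With real entities, the call would look something like legacy_urlsafe_csv(c.key for c in cities), made inside the client.context() block.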

Datastore, which is now the Datastore mode of the Firestore database at Google, is now accessed through a library called ndb. But when you look at the database viewer in the Google Cloud Console, the identifier shown for an entity, which is called the URL-safe key or reference, is not the same as the value you normally extract through the programmatic interface of ndb (for example, with key.urlsafe()). The values are just different. For the same entity.
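Why would two strings for the same entity differ? As far as I can tell from the ndb documentation for to_legacy_urlsafe, the legacy form includes an App Engine location prefix (such as "s~") in front of the application ID before the key is serialized and base64-encoded. The toy sketch below is NOT the real Datastore wire format; toy_urlsafe, toy_decode, and the "my-project" ID are made up for illustration. It only shows that a two-character prefix inside the encoded payload produces a visibly different token, even though both decode to the same kind and ID.

```python
import base64


def toy_urlsafe(app_id, kind, entity_id):
    """Toy stand-in for a URL-safe key (not the real serialization)."""
    raw = "{}/{}/{}".format(app_id, kind, entity_id).encode("utf-8")
    # URL-safe base64 with the '=' padding stripped
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def toy_decode(token):
    """Reverse toy_urlsafe by restoring the padding and decoding."""
    padded = token + "=" * (-len(token) % 4)
    return base64.urlsafe_b64decode(padded).decode("utf-8")


plain = toy_urlsafe("my-project", "City", 42)     # no location prefix
legacy = toy_urlsafe("s~my-project", "City", 42)  # with the "s~" prefix

print(plain)
print(legacy)
print(toy_decode(plain), toy_decode(legacy))
```

The two printed tokens differ from their first characters onward, which matches what I saw: same entity, different string.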

I was quite surprised that neither Google's AI (Gemini) nor OpenAI's ChatGPT 4o could make heads or tails of this problem. The code is Google's, and Google keeps it on GitHub, along with a long conversation thread about the problem by someone poking the Datastore team until the "legacy" method was implemented. But Microsoft bought GitHub in order to feed all of this sort of information to OpenAI. So, why could it not solve the above problem?

Partly, it's because there was no real documentation of the problem: just a polite complaint that turned into a conversation which would be incomprehensible to an LLM without the context of actually building an application around this feature/bug, and having it fail. So, LLMs are still bound to have trouble from a lack of interaction with their fellow machines, and the lack of human narration for that interaction when it IS allowed.

In the meantime, their vast capacity for reading technical material cannot solve difficult problems like this for us, where the expectation (that an entity's URL-safe key would look the same no matter which library is looking at it) is just an obvious assumption, but not really described by anybody, and hence inaccessible to an LLM.
