Thinking Of You

I built a tool that sends me sweet memories sourced from 25 years of incoming emails, using gpt-4 and Pinecone. Read on!

While I was adapting my text message corpus to train Joshbot, I started thinking about how my email archive is just as robust as my text message corpus — and equally under-utilized. I’ve always been a digital packrat, and I have emails dating as far back as ~1996, when I was in sixth grade. I saved my Eudora archive when I left middle school and eventually uploaded it (and everything that came after) into a dedicated archival Gmail account that I basically only use when I have to dredge up some old document (like a lease from 7 years ago) or some dusty fact (like what year my friend and I saw Andrew Bird at Carnegie Hall [it was 2009]).

The early years aren’t that interesting.

I’ve also always had a terrible memory. In my adult years, I’ve learned that this isn’t just a fun idiosyncrasy. It turns out it’s a trauma response common to C-PTSD survivors such as myself. And as time goes by and my personal healing journey evolves, it’s become more of a fixation for me. It also becomes more poignant and urgent each time I lose someone I care about. When my friend Chloe died in 2014, I started thinking about how she – like many of my close friends – was the holder of the memories. Chloe had an incredible, unparalleled memory. She created monthly iTunes playlists and associated memories with songs and seasons (which she documented in Sound of Summer, a dataviz project that’s miraculously still online), so she always had a reference point. I have never been so lucky, but at least I have my archives.

Something else that stood out to me while playing with Joshbot is just how dangerous the misinformation is. Chatbot hallucination is somewhat old news by now, but here I’ve gone and trained a chatbot on my actual words, learning my tone-of-voice, imitating me better than anyone else. Yet it invents people, scenarios, gossip with ease. It knows just enough to be extremely damaging. And it has the potential to even trick me, with false statements that my real memory might not be able to disprove. Yikes.

Fortunately there’s a decent if hacky workaround to combat hallucination. Basically you pair the vibes-based power of ChatGPT with… actual facts, derived from some known source. A typical workflow might involve uploading your data to a vector database like Pinecone, then querying the database to find results that are semantically similar to the initial request. Then you pass those search results back to ChatGPT and have it use these facts to formulate a final – truthier! – response that it sends back to the user. (It took me a while to wrap my mind around this as a non-data-scientist, but this post sums it up pretty clearly, this one gets into some fun examples, and this one gets really deep into the power of embeddings.)
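To make that workflow a little more concrete, here is a minimal sketch of the retrieve-then-answer idea in plain Python. It is not the code from this project: the tiny in-memory index stands in for Pinecone, the three-number vectors are made up by hand rather than produced by an embeddings model, and cosine similarity is assumed as the metric (the post doesn't say which one the real index uses). The function names are hypothetical.

```python
import math

# Toy stand-in for a vector database: (text, vector) pairs with hand-made vectors.
TOY_INDEX = [
    ("note about a concert", [0.9, 0.1, 0.0]),
    ("note about a lease", [0.0, 0.2, 0.9]),
    ("note about a road trip", [0.6, 0.7, 0.1]),
]


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_matches(query_vector, index, k=5):
    """Return the k (score, text) pairs most similar to the query, best first."""
    scored = [(cosine_similarity(query_vector, vec), text) for text, vec in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


def grounded_prompt(question, matches):
    """Fold the retrieved texts into a prompt so the model answers from them."""
    facts = "\n".join(f"- {text}" for _, text in matches)
    return f"Answer using only these facts:\n{facts}\n\nQuestion: {question}"
```

In a real setup the query vector would come from an embeddings model and the search would run inside the vector database; only the shape of the flow (embed, search, hand the hits back to the model) carries over.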

This seemed like the perfect way to recapture, relive and re-encode old memories — with the bonus of helping me learn about vector databases at the same time.

After a couple of days of work, I’ve got this rigged up to send me a sweet old message every few hours. Sometimes it’s a compliment. Sometimes it’s a memory. Sometimes it’s a love letter. Sometimes it’s a dream someone had featuring me. I’ve always had too much internalized toxic shame to keep a brag folder, but this is an even better mood booster.

So, how does it work?

I prepared my dataset as follows:

  1. Run a query within my Gmail archive to isolate personal emails sent to me, then tag the resulting messages with a label
  2. Export emails matching that label into an mbox file using Google Takeout
  3. Run a script that converts my mbox file to JSONL (and includes the sender’s first name as metadata, while stripping out other weirdness from each email such as ‘—- Original message —-‘ or ‘Sent from my iPhone’)
  4. Manually run QA on the resulting JSONL file
  5. Run a script that converts each email into a chunk, converts each chunk to a vector embedding and inserts the vectors into Pinecone
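Step 3 above could look roughly like the sketch below, which uses only Python's standard library. It is an illustration under assumptions, not the script from the repo: the two cleanup patterns only cover the two examples named above, the `sender` and `text` field names are hypothetical, and only the first `text/plain` part of each email is kept.

```python
import json
import mailbox
import re
from email.utils import parseaddr

# Hypothetical cleanup patterns covering only the two examples named in the post.
QUOTED_REPLY = re.compile(r"-+\s*Original message\s*-+.*", re.IGNORECASE | re.DOTALL)
MOBILE_SIGNATURE = re.compile(r"^Sent from my iPhone\s*$", re.IGNORECASE | re.MULTILINE)


def clean_body(text):
    """Drop everything from an 'Original message' divider onward, plus the mobile signature."""
    text = QUOTED_REPLY.sub("", text)
    text = MOBILE_SIGNATURE.sub("", text)
    return text.strip()


def sender_first_name(from_header):
    """First word of the display name, falling back to the address's local part."""
    name, address = parseaddr(from_header)
    if name.strip():
        return name.split()[0]
    return address.split("@")[0]


def plain_text(message):
    """Return the first text/plain part of a message as a str, or '' if there is none."""
    for part in message.walk():
        if part.get_content_type() != "text/plain":
            continue
        payload = part.get_payload(decode=True)
        if payload is None:
            continue
        charset = part.get_content_charset() or "utf-8"
        try:
            return payload.decode(charset, errors="replace")
        except LookupError:  # unknown charset name in the header
            return payload.decode("utf-8", errors="replace")
    return ""


def mbox_to_jsonl(mbox_path, jsonl_path):
    """Write one JSON object per non-empty email; return how many were written."""
    written = 0
    with open(jsonl_path, "w", encoding="utf-8") as out:
        for message in mailbox.mbox(mbox_path):
            body = clean_body(plain_text(message))
            if not body:
                continue
            record = {
                "sender": sender_first_name(str(message.get("From", ""))),
                "text": body,
            }
            out.write(json.dumps(record, ensure_ascii=False) + "\n")
            written += 1
    return written
```

Each line of the resulting file is one email, which lines up with step 5's one-chunk-per-email approach; the manual QA pass in step 4 is still what catches whatever the patterns miss.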

Here’s my imperfect but mostly-effective Gmail query, designed to strip out commercial email:

-has:attachment -automated -confirm -"no-reply" -trademark -http -noreply -account -customer -category:social -category:promotions -copyright -css label:ArchivedIn -unsubscribe -"view in browser"

And here’s an example of an early test run, querying the Pinecone database for the word ‘menswear’ and getting results like ‘That too-short tie is horrendous.’ So delicious, stripped of sender and context.

So many use cases!

Following the rough workflow described earlier, my script does the following things (scheduled via a Google Cloud Functions cron job):

  1. Tasks gpt-4 with generating a random, fictive memory of me
  2. Converts the resulting fictive memory into an embedding using OpenAI’s embeddings API
  3. Passes that embedding to my Pinecone database as a semantic similarity query
  4. Passes the top 5 matching search results back to ChatGPT, which then isolates the best verbatim
  5. Sends the verbatim to my iPhone as a push notification, via an IFTTT webhook
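The five steps above can be sketched as one small function that wires the pieces together. This is a hedged outline rather than the actual script: each external service (gpt-4, the embeddings API, Pinecone, IFTTT) is passed in as a plain callable so the flow can be read and tested without network access, and every function and parameter name here is hypothetical. The webhook helper assumes IFTTT's Webhooks trigger URL and its `value1` payload field, and it only builds the request without sending it.

```python
import json
import urllib.request


def build_ifttt_request(event, key, text):
    """Build (but do not send) a POST request for an IFTTT Webhooks trigger."""
    url = f"https://maker.ifttt.com/trigger/{event}/with/key/{key}"
    body = json.dumps({"value1": text}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_once(generate_memory, embed, search, select_quote, notify, top_k=5):
    """Run the five steps once and return the quote that was sent.

    generate_memory() -> str              step 1: ask the model for a fictive memory
    embed(text) -> list of floats         step 2: turn that text into a vector
    search(vector, top_k) -> list of str  step 3: nearest email chunks
    select_quote(chunks) -> str           step 4: ask the model to pick one verbatim
    notify(quote) -> None                 step 5: push it to the phone
    """
    memory = generate_memory()
    vector = embed(memory)
    chunks = search(vector, top_k)
    quote = select_quote(chunks)
    notify(quote)
    return quote
```

In practice, `notify` could wrap `urllib.request.urlopen(build_ifttt_request(...))`, and a scheduler would call `run_once` every few hours. Keeping the services behind callables also makes it easy to swap in stubs while tuning the prompts.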

I’m still perfecting the two gpt-4 prompts, but they go something like this:

{"role": "user", "content": "Without any comment or subject line, generate a message capturing a scenario in which someone is reaching out to tell me they've been thinking of me due to a fictional memory or dream. Respond in 2-3 sentences with no carriage returns, and make sure your message is thoughtful, personal and describes a specific anecdote and compliment or positive attribute."}
{"role": "user", "content": f"From the following text chunks, select a single quote that's personal, directed to me, memorable, meaningful, not negative, and under 150 characters. Respond without comment or carriage returns. No need for quotation marks, but do not rephrase or edit the verbatim you've selected: \n\n{text_chunks}"}
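Since the second prompt asks the model not to rephrase, one optional safeguard is to check its answer against the chunks before sending anything. The sketch below is an assumption on top of the workflow described here, not something the post says the script does: it shows one hypothetical way to assemble the selection message from the retrieved chunks, and a check that the returned quote really appears word for word in one of them and stays under 150 characters.

```python
def format_chunks(chunks):
    """Join retrieved chunks into one block of text, separated by blank lines."""
    return "\n\n".join(chunk.strip() for chunk in chunks)


def selection_messages(instructions, chunks):
    """Build the chat messages list for the quote-selection prompt."""
    content = f"{instructions}\n\n{format_chunks(chunks)}"
    return [{"role": "user", "content": content}]


def is_verbatim(quote, chunks, max_chars=150):
    """True if the quote is non-empty, under max_chars, and found in a chunk.

    Whitespace is collapsed on both sides, so line breaks inside an email
    don't cause a false mismatch.
    """
    normalized = " ".join(quote.split())
    if not normalized or len(normalized) >= max_chars:
        return False
    return any(normalized in " ".join(chunk.split()) for chunk in chunks)
```

If the check fails, the script could retry the selection prompt or skip that round, so a rephrased or invented line never reaches the notification.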

It’s ironic that my workflow involves inventing fictional memories, but it seems like this is the best way to generate rich queries that can map to my inbox. It generates pretty fascinating and compelling narratives, such as:

Here are ten new fictional memories someone might recall of you:

  1. “Do you remember the time we tried to assemble that complicated piece of furniture without instructions? Our trial-and-error method made for a hilarious and memorable day.”
  2. “I often think of that chilly evening when we went stargazing. We didn’t spot any shooting stars, but your fascinating tales about constellations made the night magical.”
  3. “Our spontaneous salsa dancing lesson at the local fair is still fresh in my mind. We were clearly novices, but we laughed through every misstep.”
  4. “There’s that memory of us getting lost during our overseas trip, turning it into an unexpected adventure as we discovered hidden alleys and local eateries, making it one of our best days.”
  5. “I recall when you gave that touching toast at our friend’s wedding, capturing the essence of love so beautifully that there wasn’t a dry eye in the room.”
  6. “How can I forget our marathon board game night? We played until the early hours, fueled by snacks and endless jokes.”
  7. “That weekend when we volunteered at the animal shelter together always comes to mind. Watching you bond with the animals showcased your gentle and compassionate side.”
  8. “Your excitement was infectious when we visited the theme park and went on every single ride, reliving our childhood memories and making new ones.”
  9. “I reminisce about the time we went fruit picking at that orchard. We competed to find the juiciest fruits and ended the day with a delicious homemade pie.”
  10. “There’s that memorable afternoon when we tried our hand at pottery. While our creations were far from perfect, the process and the mess we made had us in stitches.”

These memories emphasize camaraderie, discovery, compassion, and shared experiences that bring joy and laughter.

Passed through my Pinecone database (in vector format), that fictive memory about going fruit-picking and eating a delicious pie yields the following actual memory of a road trip up the coast from Florida in 2009:

Search results for the query: "I reminisce about the time we went fruit picking at that orchard. We competed to find the juiciest fruits and ended the day with a delicious homemade pie."

Text chunk: "Today all I can think about are mangoes and pineapples and going to the beach near US1 and getting mussels and cold beer for lunch at nearby restaurant on the intercoastal. And a few minutes ago my sense memories were pumping hard about our trip back to New York last summer, in particular the Savannah leg. I smelled the hotel room smell of the morning, saw the hot highway blurring beneath us, remember the heat and sweat of line outside Auntie Wilkes', the glory awaiting within. Even the fried food we had in that bayside restaurant in the NC town when we couldn't make it to Ocracoke. Sense memories going crazy today."

Similarity score: 0.84610939

So, what now? Am I healed?

It turns out that one of the most surprisingly compelling uses for this has been as conversational fodder with people I don’t catch up with that often.

It’s been so meaningful to get a free excuse to catch up with old friends — and deeply poignant to breathe life into my memories with old friends like Chloe who aren’t here anymore. My aunt Eva died earlier this summer, and I kept thinking about how she was the last person in my immediate family with a direct connection to our history as a family of immigrants. She was born in Hungary, a few years after my grandmother returned home from Auschwitz, and just a few years before they’d again be displaced and find their way to what became Israel and eventually Brooklyn. She’s gone, but her memories aren’t, and I’m happy to have unlocked a few of them from old emails she sent me.

If you’d like to experiment with your own email corpus, the repo (which was heavily co-authored by gpt-4) can be found on GitHub.