unsnap_biceps 12 hours ago

I'm curious as to why you choose to break out specific headers in the schema.

For example, you have recipients, subject, and sender as JSON fields, when you could have just a headers field with all of them, and even add the rest of the headers in the message.

If it's performance related, you can still have headers as a single json blob and then use generated columns for the specific fields.

For example

  CREATE TABLE IF NOT EXISTS "messages" (
    "id" INTEGER NOT NULL PRIMARY KEY, -- internal id
    "message_id" TEXT NOT NULL, -- Gmail message id
    "thread_id" TEXT NOT NULL, -- Gmail thread id
    "headers" JSON NOT NULL, -- JSON object of { "header": value },
    "subject" TEXT GENERATED ALWAYS AS (json_extract("headers", '$.Subject')) VIRTUAL NOT NULL)
    ...
  );
  CREATE INDEX subjectidx on messages(subject);
I've found this model really powerful, as it allows users to just alter table to add indexed generated columns as they need for their specific queries. For example, if I wanted to query dkim status, it's as simple as

  ALTER TABLE messages ADD dkim TEXT GENERATED ALWAYS AS (json_extract("headers", '$."Dkim-Signature"')) VIRTUAL NOT NULL);
  CREATE INDEX dkimidx on messages(dkim);
  SELECT dkim, COUNT(0) FROM messages GROUP BY dkim;
or whatever you want.
  • Hakkin 9 hours ago

    Note that you don't actually need the generated column either, SQLite supports indexes on expressions, so you can do, for example,

      CREATE INDEX subjectidx ON messages(json_extract(headers, '$.Subject'))
    
    and it will use this index anywhere you reference that expression.

    I find it useful to create indexes like this, then create VIEWs using these expressions instead of ALTER'ing the main table with generated columns.

  • tqi 4 hours ago

    Adding indexes to support a one off query seems like bad practice?

    In general I prefer break out columns that I expect to have/use consistently, especially for something as stable as email headers. Maybe schema changes are a bit easier with a headers column, but imo its just trading the pain on write for pain on read (while leaving the door open to stuff failing silently).

  • dotancohen 12 hours ago

    I see that you defined the `dkim` column as NOT NULL. So what happens when an email message does not contain the Dkim-Signature header?

    • hun3 11 hours ago

      Probably something like

        Error: stepping, NOT NULL constraint failed: messages.dkim (19)
      
      because, unlike MySQL, SQLite apparently returns SQL NULL for JSON null value.
    • unsnap_biceps 9 hours ago

      As hun3 said, it would throw an error. My mistake in the quick example I put together. I just noticed an unbalanced () as well. Whoops.

  • xearl 12 hours ago

    TIL, thanks a lot!

    • formerly_proven 9 hours ago

      You can also create indices directly on expressions, including json_extract etc.

terhechte 13 hours ago

I build something to visualize huge amounts of email (such as from Gmail) some years ago:

https://github.com/terhechte/postsack

  • andai 13 hours ago

    Hey this is really neat! It's like those disk usage visualizers, except that it seems to focus on the total volume of the mail rather than the disk usage.

    Is there a size option too? To see which senders are using most of my storage.

    (Also your website's SSL certificate has expired.)

    • terhechte 12 hours ago

      No currently not. It would be easy to add though. I haven't updated the tool in a while (after using it to clean up my Gmail inbox). Thanks for pointing out the certificate!

  • nijave 6 hours ago

    This looks interesting. I've DIY'd something similar with qdirstat before but you need to arrange your emails a certain way like dated folders and can't re-slice with different criteria.

    On the other hand, qdirstat "cache" files are really easy to generate so can be used for visualizing a bunch of file-like things

the_mitsuhiko 11 hours ago

I really lament that you cannot sign in even with an application specific password any more and you need to get an oauth client and go through an oauth flow. It’s my email, but Google takes away an open standard even for myself to access it.

  • sdoering 9 hours ago

    Given the amount of spam I receive on my free Gmail addresses (compared to my paid for freelance one), and the amount of spam I receive from Gmail servers on my non Gmail-E-Mail accounts I get more and more inclined towards degoogling myself.

    Especially as I receive more and more information that my freelance e-mail is put into spam by recipient systems.

    Not sure how to get rid of my Google ecosystem routines, though. Feels daunting.

    • redeeman 7 hours ago

      step 1: extract data step 2: just dont use google shit anymore. Deal with it.

      you dont get it done by moping about it, but by doing

      • someguydave 5 hours ago

        It would also help if you did step 0: buy your own email domain

        • cowboylowrez 3 hours ago

          Personally, I love sending emails nobody will receive, it removes inhibitions and lets me speak my mind without regrets!

          • codazoda 15 minutes ago

            This isn’t as hard as you might think. I pay for https://mailwip.com because the founder helped me figure out mine. It was ultimately relatively strait-forward. I stay because I appreciate his work, my email is flawless, and I like the logs they provide.

          • acheong08 an hour ago

            I've been self hosting email for a few years at this point and haven't had any delivery issues. Just make sure you set up all your DNS correctly and avoid polluted IP ranges like DO or AWS

  • kasey_junk 6 hours ago

    Sorry, why do you consider app specific passwords an open standard but oauth not?

    • simonw 4 hours ago

      POP3/IMAP work with any client that supports those protocols.

      OAuth really doesn't. Every OAuth integration I've ever built always feels like it needs a tiny bit of custom development.

      Also the OAuth flow is usually absolutely horrible for when you're trying to get a token for accessing your own data. I've had to spin up a temporary web app to handle a hunch of redirects just to get my own token!

      • sir 3 hours ago

        I built a proxy a while ago to make this easier - it lets you stick with IMAP/POP/SMTP as-is. No need for your client to even know that OAuth exists. See here: https://github.com/simonrob/email-oauth2-proxy

        • the_mitsuhiko 3 hours ago

          > No need for your client to even know that OAuth exists

          Yes, you can do that, however the problem is getting a client_id/client_secret in the first place. You need to register yourself for one, you need to (nowadays) whitelist every single account or go through a google verification process. At one point you could apply for a client_id that allowed anyone to use it, but that ship has sailed.

noer an hour ago

This is just a single table DB though? At that point, why not just export to a csv or dataframe or whatever and leverage analysis packages to analyze whatever you wanted to.

I admittedly might just not have or understand the use case nor have I thought about how large a Gmail account actually is so feel free to ignore if I'm missing something!

  • hiAndrewQuinn 37 minutes ago

    A couple of reasons which pop to mind:

    - Searching a plain text data file is O(n). Searching a SQLite database that has been properly indexed, which is very easy to do nowadays with FTS5, is O(log n) worst case scenario and O(1) in the best case. This doesn't explain why SQLite over a dataframe or anything, but it definitely justifies it over plain text for large email collections.

    - SQLite is really easy to write custom views and programs around. Virtually every major programming language can work with it without issue. See also: simonw's wonderful https://datasette.io/ .

    - SQLite is an accepted archival format by the Library of Congress, if you ever want to go down the rabbit hole of digital preservation.

gitroom 38 minutes ago

man, the whole gmail backup mess reminds me why i avoid locking myself into someone else's sandbox. figuring out what actually keeps me sticking with a platform even when i know all the downsides - is it just laziness or something deeper?

TekMol 13 hours ago

Shouldn't this be "imap to sqlite" or something? Why tie it to one specific email provider?

  • isaachinman 12 hours ago

    Because _it is_ specific to Gmail. It's using OAuth and presumable API access.

    IMAP is much harder, and much slower, and is bound by Google's bandwidth limits.

    • pastage 8 hours ago

      Doing a mbox export with Google Takeout from gmail is pretty fast.

      • remram 5 hours ago

        What? You have to schedule it, they literally wait 3 days before they start it, and then it can take most of a day to get it ready for download. It is not fast.

        • crazygringo 2 hours ago

          I've never had that experience. You don't need to schedule anything, and it takes maybe part of an hour to be ready to download?

          Maybe there have been times when it was broken or under high demand though?

        • kilroy123 3 hours ago

          I've never had to wait that long. I usually can download within 20 minutes and it's 15 GB of data.

  • phh 7 hours ago

    FWIW, for several years I've tried backuping my gmail account with imap (including some stuff made specifically for gmail): It never succeeded. The best syncer were running for one month, and after one month it hit some mails that it simply couldn't retrieve? Like I guess it was in too cold storage and timeout-ed? I don't know.

    So I can understand why using Google's proprietary API might work better (or not, I don't know)

    Anyway, as a sibling says, nowadays Google Takeout includes mbox and work properly (and is pretty fast, like half a day), but doesn't allow continuous update.

    And I migrated to another mail provider (infomaniak), and I've thanked myself for using my own mail domain name years earlier.

    • pertique 6 hours ago

      I had the same problem when I switched off Google. I didn't have a ton of data, and I just wanted content for past search purposes, so I didn't dig into how the data would be transformed but I can at least offer my scuffed solution.

      I installed a third-party client (Thunderbird, but I imagine any would work) on a local box, signed in with both emails, and just copied the mail over from one to the other. Low-tech, but it worked quite well. I may have forced some local cache/download for the original email, but I can't recall. I'll check later if it preserves headers and the like. I assume it would, but it wasn't that important to me.

      I actually thought about writing at some point about the process of getting off gmail and all the funny things I ran across.

yread 13 hours ago

Would be nice to enable fulltext search as well

  • padjo 12 hours ago

    Yes! I find gmail’s full text search surprisingly bad given it’s run by a search company.

    • NelsonMinar 5 hours ago

      It got a lot worse recently when they added bad AI to it. Now it does dumb synonyms. Like I search for "doctorate" and it starts highlighting every instance of D alone, like the word "he'd". (Presumably trying to pick up Ph.D.?) For awhile searches for "A" would have it highlighting "the", too.

    • porker 10 hours ago

      But not as bad as Outlook 365's search...

      • pastage 8 hours ago

        Outlook must be the worst email client there is. Something about least common denominator.

      • jbverschoor 9 hours ago

        Boot as bad as the Mail.app from iOS and macOS

    • jgalt212 5 hours ago

      It's the least bad search--better than Yahoo, and Thunderbird desktop in my direct experience. However, I don't download the full message into Thunderbird out of fear of blasting through gmail bandwidth limits.

flas9sd 6 hours ago

having sqlite exporters for platforms is great help for archiving, but also general questions: I used https://github.com/ltdangle/mail2db to see how much mail volume I still receive monthly on a mail account that I want to move away from. A top10 of senders directed my un- and resubscribe actions.

oulipo 8 hours ago

What's the best open-source GMail backup software that exists? Someone has setup something like that? (also archiving attachments, etc)

  • Leftium 7 hours ago

    https://github.com/GAM-team/got-your-back

    - Open source

    - Resume (so backups/restores will eventually complete)

    Honorable mention: https://www.mailstore.com/en/products/mailstore-home/

    - Not open source

    - GUI with index: nice for searching mail locally

    - Resume only for backup (so large restores generally fail)

  • sbarre 8 hours ago

    This isn't exactly what you're asking for, but Google offers a service called Takeout that lets you request and download backups of all your data from their services, including Gmail.

    I have a reminder to trigger this every few months and update my local backup. If I recall it comes as a gzipped mbox file.

  • nijave 6 hours ago

    You can also use an IMAP client and set it to offline/download mode so it downloads everything and saves it locally. I think "offline mode" is what it's called in Evolution--not sure what Thunderbird or other clients call it.

hamburglar 2 hours ago

This looks great and simple. I’ll likely try it out. Any chance you’re working on including attachment metadata (and/or broken out access to the attachments themselves) in the future?

jokoon 12 hours ago

I would have preferred a script that parses the mail backup Google sends you.

I think it's a big eml file.

  • ukuina 12 hours ago

    Google Takeout regularly fails to complete for me. Syncing via the API seems like a reasonable alternative.

pdimitar 9 hours ago

Would love a comparison to gbackup-rs[0].

To me having to install a tool through Python is a show-stopper.

[0] https://github.com/djipko/gbackup-rs

  • simonw 4 hours ago

    Does that tool still work? It's not had any updates in 3 years and it looks like it uses IMAP, which may not be available for new Gmail accounts now that they're moving away from even per-app passwords.

    • pdimitar 4 hours ago

      My account is very old and the tool works on it. Figured it will probably stop working at some point, thanks for the refresh.

      I have that tool activating once every 24h still, to this day.

einpoklum 11 hours ago

Let us stop using GMail:

* Google collects vast amounts of personal data, specifically through receiving all of your email and analyizing it.

* It builds elaborate user profiles and uses them to target you with ads designed to better influence you.

* Its hold on information (from different sources) has made it excessively powerful economically, and thus also politically.

* Google/Alphabet has long started to affect legislation, including through direct registered lobbying: ~15 Million USD in 2024 (opensecrets.org).

* It has been known to pass, and likely still passes, the information it collects - including copies of your email correspondence - on to the US government (Edward Snowden leaks).

and finally:

* There are multiple email providers, many of them quite good - both for pay and gratis. Naturally most of the gratis ones have their own interests, but nothing like Google.

  • evertedsphere 9 hours ago
    • pastage 8 hours ago

      This should be seen as a encouraging to switch to something else rather than defeatist. Many of my communications do not touch Google services, professionally it has been judged as too risky, personally I keep a google account but also others.

      Edit: You can create groups of people that are not affected by Google/Apple/Facebook, this should be seen as a goal.

    • devrandoom 3 hours ago

      This could be an interesting take, GDPR wise. Google handling personal emails of people that have no contract or business relationship with Google.

  • remram 5 hours ago

    What's a good replacement? Needs to work on web and mobile (or desktop and mobile), have search, have labels, have automated filters.

    • bob1029 5 hours ago

      I've been using AWS WorkMail since it was released. I prefer this arrangement because I can administer the related domain & DNS concerns in the same place.

      $4/m seems nominal for a 50GB mailbox with no weird adtech shit built in.

      • remram 5 hours ago

        Leaving Google for Amazon makes no sense to me, Amazon fits GP's list just as well.

        • bob1029 4 hours ago

          This elevation of products into their parent organizations makes no sense to me.

          Gmail != Google

          WorkMail != Amazon

          Gmail is targeted at consumers and is engineered to suck up your data to pay for itself. WorkMail is targeted at businesses and is engineered to not piss off IT administrators and middle management.

          • remram 2 hours ago

            That's fair, still if I put in the effort to migrate, I would rather not do it to another company whose business is selling eyeballs. Amazon's business (maybe not WorkMail? who knows) is to sell you a maximum of stuff on their marketplace, building a detailed profile of you to recommend you more stuff.

            I could also pay for a Google Workspace and stay with Gmail.

  • phantompeace 6 hours ago

    What are the best free replacements to gmail that are realistic to switch to? I.E well established and not poised to close down any time soon

    • justin_oaks 3 hours ago

      You may have to temper your expectations. Free usually means "sells/uses your data to offset costs". If you're OK with that, there's no need to switch off of GMail. If you're not OK with that, you'll have to pay.

      Also, hosting email under your own domain gives you the freedom to move from one email provider to another even if they do shut down.

      I put my money where my mouth is. I wanted to degoogle and so pay $50/year for Fastmail. One feature I like is automatically snoozing certain emails. Most of my non-personal email is automatically snoozed until 6pm every day. This way I don't get multiple notifications throughout the day for emails that aren't time sensitive.

  • mediumsmart 8 hours ago

    I was with you from day one and never started using gmail.