I'm curious as to why you choose to break out specific headers in the schema.
For example, you have recipients, subject, and sender as JSON fields, when you could have just a headers field with all of them, and even add the rest of the headers in the message.
If it's performance related, you can still have headers as a single json blob and then use generated columns for the specific fields.
For example
CREATE TABLE IF NOT EXISTS "messages" (
"id" INTEGER NOT NULL PRIMARY KEY, -- internal id
"message_id" TEXT NOT NULL, -- Gmail message id
"thread_id" TEXT NOT NULL, -- Gmail thread id
"headers" JSON NOT NULL, -- JSON object of { "header": value },
"subject" TEXT GENERATED ALWAYS AS (json_extract("headers", '$.Subject')) VIRTUAL NOT NULL)
...
);
CREATE INDEX subjectidx on messages(subject);
I've found this model really powerful, as it allows users to just alter table to add indexed generated columns as they need for their specific queries. For example, if I wanted to query dkim status, it's as simple as
ALTER TABLE messages ADD dkim TEXT GENERATED ALWAYS AS (json_extract("headers", '$."Dkim-Signature"')) VIRTUAL NOT NULL);
CREATE INDEX dkimidx on messages(dkim);
SELECT dkim, COUNT(0) FROM messages GROUP BY dkim;
Adding indexes to support a one off query seems like bad practice?
In general I prefer break out columns that I expect to have/use consistently, especially for something as stable as email headers. Maybe schema changes are a bit easier with a headers column, but imo its just trading the pain on write for pain on read (while leaving the door open to stuff failing silently).
Hey this is really neat! It's like those disk usage visualizers, except that it seems to focus on the total volume of the mail rather than the disk usage.
Is there a size option too? To see which senders are using most of my storage.
(Also your website's SSL certificate has expired.)
No currently not. It would be easy to add though. I haven't updated the tool in a while (after using it to clean up my Gmail inbox). Thanks for pointing out the certificate!
This looks interesting. I've DIY'd something similar with qdirstat before but you need to arrange your emails a certain way like dated folders and can't re-slice with different criteria.
On the other hand, qdirstat "cache" files are really easy to generate so can be used for visualizing a bunch of file-like things
I really lament that you cannot sign in even with an application specific password any more and you need to get an oauth client and go through an oauth flow. It’s my email, but Google takes away an open standard even for myself to access it.
Given the amount of spam I receive on my free Gmail addresses (compared to my paid for freelance one), and the amount of spam I receive from Gmail servers on my non Gmail-E-Mail accounts I get more and more inclined towards degoogling myself.
Especially as I receive more and more information that my freelance e-mail is put into spam by recipient systems.
Not sure how to get rid of my Google ecosystem routines, though. Feels daunting.
This isn’t as hard as you might think. I pay for https://mailwip.com because the founder helped me figure out mine. It was ultimately relatively strait-forward. I stay because I appreciate his work, my email is flawless, and I like the logs they provide.
I've been self hosting email for a few years at this point and haven't had any delivery issues. Just make sure you set up all your DNS correctly and avoid polluted IP ranges like DO or AWS
POP3/IMAP work with any client that supports those protocols.
OAuth really doesn't. Every OAuth integration I've ever built always feels like it needs a tiny bit of custom development.
Also the OAuth flow is usually absolutely horrible for when you're trying to get a token for accessing your own data. I've had to spin up a temporary web app to handle a hunch of redirects just to get my own token!
I built a proxy a while ago to make this easier - it lets you stick with IMAP/POP/SMTP as-is. No need for your client to even know that OAuth exists. See here: https://github.com/simonrob/email-oauth2-proxy
> No need for your client to even know that OAuth exists
Yes, you can do that, however the problem is getting a client_id/client_secret in the first place. You need to register yourself for one, you need to (nowadays) whitelist every single account or go through a google verification process. At one point you could apply for a client_id that allowed anyone to use it, but that ship has sailed.
This is just a single table DB though? At that point, why not just export to a csv or dataframe or whatever and leverage analysis packages to analyze whatever you wanted to.
I admittedly might just not have or understand the use case nor have I thought about how large a Gmail account actually is so feel free to ignore if I'm missing something!
- Searching a plain text data file is O(n). Searching a SQLite database that has been properly indexed, which is very easy to do nowadays with FTS5, is O(log n) worst case scenario and O(1) in the best case. This doesn't explain why SQLite over a dataframe or anything, but it definitely justifies it over plain text for large email collections.
- SQLite is really easy to write custom views and programs around. Virtually every major programming language can work with it without issue. See also: simonw's wonderful https://datasette.io/ .
- SQLite is an accepted archival format by the Library of Congress, if you ever want to go down the rabbit hole of digital preservation.
man, the whole gmail backup mess reminds me why i avoid locking myself into someone else's sandbox. figuring out what actually keeps me sticking with a platform even when i know all the downsides - is it just laziness or something deeper?
What? You have to schedule it, they literally wait 3 days before they start it, and then it can take most of a day to get it ready for download. It is not fast.
FWIW, for several years I've tried backuping my gmail account with imap (including some stuff made specifically for gmail): It never succeeded. The best syncer were running for one month, and after one month it hit some mails that it simply couldn't retrieve? Like I guess it was in too cold storage and timeout-ed? I don't know.
So I can understand why using Google's proprietary API might work better (or not, I don't know)
Anyway, as a sibling says, nowadays Google Takeout includes mbox and work properly (and is pretty fast, like half a day), but doesn't allow continuous update.
And I migrated to another mail provider (infomaniak), and I've thanked myself for using my own mail domain name years earlier.
I had the same problem when I switched off Google. I didn't have a ton of data, and I just wanted content for past search purposes, so I didn't dig into how the data would be transformed but I can at least offer my scuffed solution.
I installed a third-party client (Thunderbird, but I imagine any would work) on a local box, signed in with both emails, and just copied the mail over from one to the other. Low-tech, but it worked quite well. I may have forced some local cache/download for the original email, but I can't recall. I'll check later if it preserves headers and the like. I assume it would, but it wasn't that important to me.
I actually thought about writing at some point about the process of getting off gmail and all the funny things I ran across.
It got a lot worse recently when they added bad AI to it. Now it does dumb synonyms. Like I search for "doctorate" and it starts highlighting every instance of D alone, like the word "he'd". (Presumably trying to pick up Ph.D.?) For awhile searches for "A" would have it highlighting "the", too.
It's the least bad search--better than Yahoo, and Thunderbird desktop in my direct experience. However, I don't download the full message into Thunderbird out of fear of blasting through gmail bandwidth limits.
having sqlite exporters for platforms is great help for archiving, but also general questions: I used https://github.com/ltdangle/mail2db to see how much mail volume I still receive monthly on a mail account that I want to move away from. A top10 of senders directed my un- and resubscribe actions.
This isn't exactly what you're asking for, but Google offers a service called Takeout that lets you request and download backups of all your data from their services, including Gmail.
I have a reminder to trigger this every few months and update my local backup. If I recall it comes as a gzipped mbox file.
You can also use an IMAP client and set it to offline/download mode so it downloads everything and saves it locally. I think "offline mode" is what it's called in Evolution--not sure what Thunderbird or other clients call it.
This looks great and simple. I’ll likely try it out. Any chance you’re working on including attachment metadata (and/or broken out access to the attachments themselves) in the future?
Does that tool still work? It's not had any updates in 3 years and it looks like it uses IMAP, which may not be available for new Gmail accounts now that they're moving away from even per-app passwords.
* Google collects vast amounts of personal data, specifically through receiving all of your email and analyizing it.
* It builds elaborate user profiles and uses them to target you with ads designed to better influence you.
* Its hold on information (from different sources) has made it excessively powerful economically, and thus also politically.
* Google/Alphabet has long started to affect legislation, including through direct registered lobbying: ~15 Million USD in 2024 (opensecrets.org).
* It has been known to pass, and likely still passes, the information it collects - including copies of your email correspondence - on to the US government (Edward Snowden leaks).
and finally:
* There are multiple email providers, many of them quite good - both for pay and gratis. Naturally most of the gratis ones have their own interests, but nothing like Google.
This should be seen as a encouraging to switch to something else rather than defeatist. Many of my communications do not touch Google services, professionally it has been judged as too risky, personally I keep a google account but also others.
Edit: You can create groups of people that are not affected by Google/Apple/Facebook, this should be seen as a goal.
I've been using AWS WorkMail since it was released. I prefer this arrangement because I can administer the related domain & DNS concerns in the same place.
$4/m seems nominal for a 50GB mailbox with no weird adtech shit built in.
This elevation of products into their parent organizations makes no sense to me.
Gmail != Google
WorkMail != Amazon
Gmail is targeted at consumers and is engineered to suck up your data to pay for itself. WorkMail is targeted at businesses and is engineered to not piss off IT administrators and middle management.
That's fair, still if I put in the effort to migrate, I would rather not do it to another company whose business is selling eyeballs. Amazon's business (maybe not WorkMail? who knows) is to sell you a maximum of stuff on their marketplace, building a detailed profile of you to recommend you more stuff.
I could also pay for a Google Workspace and stay with Gmail.
You may have to temper your expectations. Free usually means "sells/uses your data to offset costs". If you're OK with that, there's no need to switch off of GMail. If you're not OK with that, you'll have to pay.
Also, hosting email under your own domain gives you the freedom to move from one email provider to another even if they do shut down.
I put my money where my mouth is. I wanted to degoogle and so pay $50/year for Fastmail. One feature I like is automatically snoozing certain emails. Most of my non-personal email is automatically snoozed until 6pm every day. This way I don't get multiple notifications throughout the day for emails that aren't time sensitive.
I'm curious as to why you choose to break out specific headers in the schema.
For example, you have recipients, subject, and sender as JSON fields, when you could have just a headers field with all of them, and even add the rest of the headers in the message.
If it's performance related, you can still have headers as a single json blob and then use generated columns for the specific fields.
For example
I've found this model really powerful, as it allows users to just alter table to add indexed generated columns as they need for their specific queries. For example, if I wanted to query dkim status, it's as simple as or whatever you want.Note that you don't actually need the generated column either, SQLite supports indexes on expressions, so you can do, for example,
and it will use this index anywhere you reference that expression.I find it useful to create indexes like this, then create VIEWs using these expressions instead of ALTER'ing the main table with generated columns.
And since view and indexes don't change the data, you can use tools like https://github.com/fsaintjacques/recordlite to automate schema management.
Adding indexes to support a one off query seems like bad practice?
In general I prefer break out columns that I expect to have/use consistently, especially for something as stable as email headers. Maybe schema changes are a bit easier with a headers column, but imo its just trading the pain on write for pain on read (while leaving the door open to stuff failing silently).
I see that you defined the `dkim` column as NOT NULL. So what happens when an email message does not contain the Dkim-Signature header?
Probably something like
because, unlike MySQL, SQLite apparently returns SQL NULL for JSON null value.As hun3 said, it would throw an error. My mistake in the quick example I put together. I just noticed an unbalanced () as well. Whoops.
TIL, thanks a lot!
You can also create indices directly on expressions, including json_extract etc.
I build something to visualize huge amounts of email (such as from Gmail) some years ago:
https://github.com/terhechte/postsack
Hey this is really neat! It's like those disk usage visualizers, except that it seems to focus on the total volume of the mail rather than the disk usage.
Is there a size option too? To see which senders are using most of my storage.
(Also your website's SSL certificate has expired.)
No currently not. It would be easy to add though. I haven't updated the tool in a while (after using it to clean up my Gmail inbox). Thanks for pointing out the certificate!
This looks interesting. I've DIY'd something similar with qdirstat before but you need to arrange your emails a certain way like dated folders and can't re-slice with different criteria.
On the other hand, qdirstat "cache" files are really easy to generate so can be used for visualizing a bunch of file-like things
Looks interesting, the link to gmvault in your readme is now a dead end, is this it https://github.com/gaubert/gmvault. Thanks!
I really lament that you cannot sign in even with an application specific password any more and you need to get an oauth client and go through an oauth flow. It’s my email, but Google takes away an open standard even for myself to access it.
Given the amount of spam I receive on my free Gmail addresses (compared to my paid for freelance one), and the amount of spam I receive from Gmail servers on my non Gmail-E-Mail accounts I get more and more inclined towards degoogling myself.
Especially as I receive more and more information that my freelance e-mail is put into spam by recipient systems.
Not sure how to get rid of my Google ecosystem routines, though. Feels daunting.
step 1: extract data step 2: just dont use google shit anymore. Deal with it.
you dont get it done by moping about it, but by doing
It would also help if you did step 0: buy your own email domain
Personally, I love sending emails nobody will receive, it removes inhibitions and lets me speak my mind without regrets!
This isn’t as hard as you might think. I pay for https://mailwip.com because the founder helped me figure out mine. It was ultimately relatively strait-forward. I stay because I appreciate his work, my email is flawless, and I like the logs they provide.
I've been self hosting email for a few years at this point and haven't had any delivery issues. Just make sure you set up all your DNS correctly and avoid polluted IP ranges like DO or AWS
Sorry, why do you consider app specific passwords an open standard but oauth not?
POP3/IMAP work with any client that supports those protocols.
OAuth really doesn't. Every OAuth integration I've ever built always feels like it needs a tiny bit of custom development.
Also the OAuth flow is usually absolutely horrible for when you're trying to get a token for accessing your own data. I've had to spin up a temporary web app to handle a hunch of redirects just to get my own token!
I built a proxy a while ago to make this easier - it lets you stick with IMAP/POP/SMTP as-is. No need for your client to even know that OAuth exists. See here: https://github.com/simonrob/email-oauth2-proxy
> No need for your client to even know that OAuth exists
Yes, you can do that, however the problem is getting a client_id/client_secret in the first place. You need to register yourself for one, you need to (nowadays) whitelist every single account or go through a google verification process. At one point you could apply for a client_id that allowed anyone to use it, but that ship has sailed.
This is just a single table DB though? At that point, why not just export to a csv or dataframe or whatever and leverage analysis packages to analyze whatever you wanted to.
I admittedly might just not have or understand the use case nor have I thought about how large a Gmail account actually is so feel free to ignore if I'm missing something!
A couple of reasons which pop to mind:
- Searching a plain text data file is O(n). Searching a SQLite database that has been properly indexed, which is very easy to do nowadays with FTS5, is O(log n) worst case scenario and O(1) in the best case. This doesn't explain why SQLite over a dataframe or anything, but it definitely justifies it over plain text for large email collections.
- SQLite is really easy to write custom views and programs around. Virtually every major programming language can work with it without issue. See also: simonw's wonderful https://datasette.io/ .
- SQLite is an accepted archival format by the Library of Congress, if you ever want to go down the rabbit hole of digital preservation.
man, the whole gmail backup mess reminds me why i avoid locking myself into someone else's sandbox. figuring out what actually keeps me sticking with a platform even when i know all the downsides - is it just laziness or something deeper?
Shouldn't this be "imap to sqlite" or something? Why tie it to one specific email provider?
Because _it is_ specific to Gmail. It's using OAuth and presumable API access.
IMAP is much harder, and much slower, and is bound by Google's bandwidth limits.
Doing a mbox export with Google Takeout from gmail is pretty fast.
What? You have to schedule it, they literally wait 3 days before they start it, and then it can take most of a day to get it ready for download. It is not fast.
I've never had that experience. You don't need to schedule anything, and it takes maybe part of an hour to be ready to download?
Maybe there have been times when it was broken or under high demand though?
I've never had to wait that long. I usually can download within 20 minutes and it's 15 GB of data.
FWIW, for several years I've tried backuping my gmail account with imap (including some stuff made specifically for gmail): It never succeeded. The best syncer were running for one month, and after one month it hit some mails that it simply couldn't retrieve? Like I guess it was in too cold storage and timeout-ed? I don't know.
So I can understand why using Google's proprietary API might work better (or not, I don't know)
Anyway, as a sibling says, nowadays Google Takeout includes mbox and work properly (and is pretty fast, like half a day), but doesn't allow continuous update.
And I migrated to another mail provider (infomaniak), and I've thanked myself for using my own mail domain name years earlier.
I had the same problem when I switched off Google. I didn't have a ton of data, and I just wanted content for past search purposes, so I didn't dig into how the data would be transformed but I can at least offer my scuffed solution.
I installed a third-party client (Thunderbird, but I imagine any would work) on a local box, signed in with both emails, and just copied the mail over from one to the other. Low-tech, but it worked quite well. I may have forced some local cache/download for the original email, but I can't recall. I'll check later if it preserves headers and the like. I assume it would, but it wasn't that important to me.
I actually thought about writing at some point about the process of getting off gmail and all the funny things I ran across.
Would be nice to enable fulltext search as well
Yes! I find gmail’s full text search surprisingly bad given it’s run by a search company.
It got a lot worse recently when they added bad AI to it. Now it does dumb synonyms. Like I search for "doctorate" and it starts highlighting every instance of D alone, like the word "he'd". (Presumably trying to pick up Ph.D.?) For awhile searches for "A" would have it highlighting "the", too.
But not as bad as Outlook 365's search...
Outlook must be the worst email client there is. Something about least common denominator.
Boot as bad as the Mail.app from iOS and macOS
Agreed! One of the reasons we started working on Marco.
https://marcoapp.io
It's the least bad search--better than Yahoo, and Thunderbird desktop in my direct experience. However, I don't download the full message into Thunderbird out of fear of blasting through gmail bandwidth limits.
having sqlite exporters for platforms is great help for archiving, but also general questions: I used https://github.com/ltdangle/mail2db to see how much mail volume I still receive monthly on a mail account that I want to move away from. A top10 of senders directed my un- and resubscribe actions.
What's the best open-source GMail backup software that exists? Someone has setup something like that? (also archiving attachments, etc)
https://github.com/GAM-team/got-your-back
- Open source
- Resume (so backups/restores will eventually complete)
Honorable mention: https://www.mailstore.com/en/products/mailstore-home/
- Not open source
- GUI with index: nice for searching mail locally
- Resume only for backup (so large restores generally fail)
I've been using:
https://github.com/gaubert/gmvault
For a long time and it's worked great. But it seems like GYB is actively maintained, so maybe I should switch.
This isn't exactly what you're asking for, but Google offers a service called Takeout that lets you request and download backups of all your data from their services, including Gmail.
I have a reminder to trigger this every few months and update my local backup. If I recall it comes as a gzipped mbox file.
You can also use an IMAP client and set it to offline/download mode so it downloads everything and saves it locally. I think "offline mode" is what it's called in Evolution--not sure what Thunderbird or other clients call it.
This looks great and simple. I’ll likely try it out. Any chance you’re working on including attachment metadata (and/or broken out access to the attachments themselves) in the future?
I would have preferred a script that parses the mail backup Google sends you.
I think it's a big eml file.
Google Takeout regularly fails to complete for me. Syncing via the API seems like a reasonable alternative.
this is great if only there was a tool for whatsapp to sqlite it would make my data so much more useful
https://medium.com/@Med1um1/extracting-whatsapp-messages-fro...
in both iOS and Android it's all stored in sqlite already. Table schemas circulate forensics blogs and QA sites and how to obtain the unencrypted db
Would love a comparison to gbackup-rs[0].
To me having to install a tool through Python is a show-stopper.
[0] https://github.com/djipko/gbackup-rs
Does that tool still work? It's not had any updates in 3 years and it looks like it uses IMAP, which may not be available for new Gmail accounts now that they're moving away from even per-app passwords.
My account is very old and the tool works on it. Figured it will probably stop working at some point, thanks for the refresh.
I have that tool activating once every 24h still, to this day.
[dead]
Let us stop using GMail:
* Google collects vast amounts of personal data, specifically through receiving all of your email and analyizing it.
* It builds elaborate user profiles and uses them to target you with ads designed to better influence you.
* Its hold on information (from different sources) has made it excessively powerful economically, and thus also politically.
* Google/Alphabet has long started to affect legislation, including through direct registered lobbying: ~15 Million USD in 2024 (opensecrets.org).
* It has been known to pass, and likely still passes, the information it collects - including copies of your email correspondence - on to the US government (Edward Snowden leaks).
and finally:
* There are multiple email providers, many of them quite good - both for pay and gratis. Naturally most of the gratis ones have their own interests, but nothing like Google.
https://mako.cc/copyrighteous/google-has-most-of-my-email-be...
This should be seen as a encouraging to switch to something else rather than defeatist. Many of my communications do not touch Google services, professionally it has been judged as too risky, personally I keep a google account but also others.
Edit: You can create groups of people that are not affected by Google/Apple/Facebook, this should be seen as a goal.
This could be an interesting take, GDPR wise. Google handling personal emails of people that have no contract or business relationship with Google.
What's a good replacement? Needs to work on web and mobile (or desktop and mobile), have search, have labels, have automated filters.
I've been using AWS WorkMail since it was released. I prefer this arrangement because I can administer the related domain & DNS concerns in the same place.
$4/m seems nominal for a 50GB mailbox with no weird adtech shit built in.
Leaving Google for Amazon makes no sense to me, Amazon fits GP's list just as well.
This elevation of products into their parent organizations makes no sense to me.
Gmail != Google
WorkMail != Amazon
Gmail is targeted at consumers and is engineered to suck up your data to pay for itself. WorkMail is targeted at businesses and is engineered to not piss off IT administrators and middle management.
That's fair, still if I put in the effort to migrate, I would rather not do it to another company whose business is selling eyeballs. Amazon's business (maybe not WorkMail? who knows) is to sell you a maximum of stuff on their marketplace, building a detailed profile of you to recommend you more stuff.
I could also pay for a Google Workspace and stay with Gmail.
What are the best free replacements to gmail that are realistic to switch to? I.E well established and not poised to close down any time soon
You may have to temper your expectations. Free usually means "sells/uses your data to offset costs". If you're OK with that, there's no need to switch off of GMail. If you're not OK with that, you'll have to pay.
Also, hosting email under your own domain gives you the freedom to move from one email provider to another even if they do shut down.
I put my money where my mouth is. I wanted to degoogle and so pay $50/year for Fastmail. One feature I like is automatically snoozing certain emails. Most of my non-personal email is automatically snoozed until 6pm every day. This way I don't get multiple notifications throughout the day for emails that aren't time sensitive.
I was with you from day one and never started using gmail.