E*TRADE's web interface will show you your recent transactions. Go back far enough and you hit a wall. Their API covers maybe two years of history, and even that's generous. After that, nothing. Your entire trade history just sits locked inside PDFs that you technically own but can't do anything with.
I had 5,254 of them. Going back to 2020. I wanted the data.
Why bother
Not just for fun. Well, not only for fun.
Having your complete trade history as structured data lets you do things the brokerage UI actively prevents. Cross-reference fills across years. Run analysis on your own behavior. Actually see patterns in your entries and exits instead of clicking through PDF pages one at a time.
Tax prep is the obvious use case. The more interesting one is behavioral review. Did you actually buy the dip, or did you just tell yourself that afterward? With 29,700 rows of structured data, you can find out.
E*TRADE should offer this natively. They don't. Morgan Stanley acquired them in 2020 and the result is two overlapping document systems with different PDF formats and a UI that feels designed to discourage bulk access. My account has documents in both systems, and it was frustrating to deal with.
Getting the PDFs
First I tried the obvious thing. E*TRADE has a public API and I'd used it before for quotes and account data, so I figured there had to be a document endpoint somewhere.
There isn't. The public API doesn't expose trade confirmations at all. You can get balances, positions, recent orders, but not the confirmation PDFs. Those live in a separate document management system that the API pretends doesn't exist.
So I opened Chrome DevTools and watched what the E*TRADE Angular app actually calls when you load the documents page. There it was: an internal REST endpoint returning paginated document metadata, with a separate URL to download each PDF. Not documented, not publicly supported, but it worked.
I built a Playwright script to handle auth on a remote server. The login flow involves SSO, device verification, and session cookies that expire, so I couldn't just replicate the calls with curl. Playwright drove a headless browser, completed the login, captured the session, and then hit the document API directly.
It worked fine for about 4,200 downloads.
Then E*TRADE started rate-limiting, but not with a clean 429. More like requests that silently stalled, then timed out, then started returning empty responses. Took me a while to figure out what was happening. I thought the server was down. It wasn't.
Adding cooldown logic with exponential backoff helped, plus checkpoint files so I could resume without re-downloading what I already had. I spread the remaining ~1,054 downloads across a few sessions with deliberate pauses between batches. Total time: a few days. Total storage: about 1.5GB.
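For anyone building something similar, here is a minimal sketch of the cooldown-plus-checkpoint idea, not my actual script. `fetch` is a hypothetical callable standing in for whatever performs the authenticated download, and the delay values are assumptions. An empty body or a timeout is treated as a throttling signal, since that's what the rate limiting looked like in practice.

```python
import json
import random
import time
from pathlib import Path


def load_checkpoint(path):
    """Return the set of document IDs already downloaded (empty if no checkpoint)."""
    p = Path(path)
    if not p.exists():
        return set()
    return set(json.loads(p.read_text()))


def save_checkpoint(path, done):
    """Write to a temp file, then rename, so a crash can't leave a half-written file."""
    p = Path(path)
    tmp = p.with_name(p.name + ".tmp")
    tmp.write_text(json.dumps(sorted(done)))
    tmp.replace(p)


def download_all(doc_ids, fetch, out_dir, checkpoint_path, max_retries=5,
                 base_delay=2.0, max_delay=300.0, sleep=time.sleep):
    """Download each document once, backing off when the server goes quiet.

    `fetch(doc_id)` is a hypothetical callable that returns the PDF bytes.
    Returns the IDs fetched during this run.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    done = load_checkpoint(checkpoint_path)
    fetched = []
    for doc_id in doc_ids:
        if doc_id in done:
            continue  # already on disk from an earlier session
        for attempt in range(max_retries):
            try:
                body = fetch(doc_id)
            except OSError:  # includes TimeoutError and connection errors
                body = b""
            if body:
                (out / f"{doc_id}.pdf").write_bytes(body)
                done.add(doc_id)
                save_checkpoint(checkpoint_path, done)
                fetched.append(doc_id)
                break
            # Exponential backoff with a little jitter: 2s, 4s, 8s, ... capped.
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(delay + random.uniform(0, delay / 10))
        else:
            raise RuntimeError(f"giving up on {doc_id} after {max_retries} attempts")
    return fetched
```

Saving the checkpoint after every file costs a tiny write each time, but it means an interrupted session loses at most the download in flight.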
The two-format problem
If you open any of these PDFs by hand, you'll notice two distinct layouts. Pre-Morgan Stanley confirmations look like old E*TRADE: dense text, specific column positions, a particular field ordering. Post-acquisition confirmations follow a different template entirely. Different headers, different field labels, different line structure. Same brokerage, two completely different document designs.
I built the parser assuming one format. It handled about 3,800 files cleanly and then silently mangled the rest. I didn't catch this for a while because I was checking output row counts, not field values. Classic mistake.
The solution was format detection at the top of the extraction pipeline. The parser reads a small header block from each PDF, figures out which template it's looking at, and routes to the right extraction logic. Once that was in place, both formats came through cleanly.
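A rough sketch of what that routing can look like. The marker strings and the two parser functions here are placeholders I made up for illustration; the real templates need their own distinguishing text. The important part is that an unrecognized layout raises an error instead of quietly falling through to the wrong parser.

```python
LEGACY = "legacy_etrade"
MORGAN_STANLEY = "morgan_stanley"

# Hypothetical marker strings. Morgan Stanley is checked first because a
# post-acquisition document may still mention E*TRADE in its header.
FORMAT_MARKERS = {
    MORGAN_STANLEY: ("MORGAN STANLEY",),
    LEGACY: ("E*TRADE SECURITIES",),
}


def detect_format(header_text):
    """Return the template name for a header block, or raise if unknown."""
    upper = header_text.upper()
    for fmt, markers in FORMAT_MARKERS.items():
        if any(marker in upper for marker in markers):
            return fmt
    raise ValueError("unrecognized confirmation layout")


def parse_legacy(text):
    """Placeholder for the legacy-template extraction logic."""
    return [{"line_count": len(text.splitlines())}]


def parse_morgan_stanley(text):
    """Placeholder for the Morgan Stanley-template extraction logic."""
    return [{"line_count": len(text.splitlines())}]


PARSERS = {LEGACY: parse_legacy, MORGAN_STANLEY: parse_morgan_stanley}


def extract(text, header_chars=500):
    """Detect the template from the first few hundred characters, then route."""
    fmt = detect_format(text[:header_chars])
    rows = PARSERS[fmt](text)
    for row in rows:
        row["source_format"] = fmt  # keep provenance on every row
    return rows
```

Tagging each row with its source format also makes it easy to spot-check the two templates separately later.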
The extraction script uses Python with pdfplumber. PDFs are annoying because "text" is really just positioned glyphs. Getting fields in the right order requires understanding the spatial layout, not just reading characters sequentially. pdfplumber handles most of this okay, though it needed some tuning on the Morgan Stanley format where field positions are less predictable.
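To make the "positioned glyphs" point concrete, here is a small sketch of rebuilding reading order from coordinates. It works on word dicts shaped like the ones pdfplumber's `page.extract_words()` returns (`text`, `x0`, `top`, and so on). The grouping function and its tolerance value are my own illustration, not something pdfplumber ships.

```python
def words_to_lines(words, y_tolerance=3.0):
    """Group word dicts into visual lines, each read left to right.

    Words whose `top` coordinate is within `y_tolerance` points of the first
    word on the current line are treated as sitting on the same line.
    """
    lines = []
    for word in sorted(words, key=lambda w: (w["top"], w["x0"])):
        if lines and abs(word["top"] - lines[-1][0]["top"]) <= y_tolerance:
            lines[-1].append(word)
        else:
            lines.append([word])
    return [
        " ".join(w["text"] for w in sorted(line, key=lambda w: w["x0"]))
        for line in lines
    ]


# Words arrive in arbitrary order with slightly uneven baselines.
sample = [
    {"text": "100", "x0": 200, "top": 50.4},
    {"text": "AAPL", "x0": 50, "top": 50.0},
    {"text": "BUY", "x0": 10, "top": 50.2},
    {"text": "Symbol", "x0": 50, "top": 20.0},
    {"text": "Action", "x0": 10, "top": 20.5},
]
print(words_to_lines(sample))  # ['Action Symbol', 'BUY AAPL 100']
```

A tolerance that is too tight splits one row into several; too loose and adjacent rows merge, which is roughly the kind of tuning the less predictable layout needs.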
Verifying the output
Raw output: 29,700 trade rows from 5,254 PDFs. Zero failures, zero low-confidence rows.
That number felt suspiciously clean, so I checked it a few different ways. E*TRADE encodes the trade date and account number in each document's filename, so I compared those against the dates and account identifiers extracted from the PDF text. Mismatches would mean wrong fields or misnamed files.
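That cross-check is easy to automate. The sketch below assumes a made-up filename layout of `<account>_<YYYYMMDD>.pdf` and hypothetical row keys (`trade_date`, `account`); the real filenames follow their own pattern, so the regex would need adjusting.

```python
import re
from datetime import date, datetime

# Hypothetical filename layout: "<account digits>_<YYYYMMDD>.pdf".
FILENAME_RE = re.compile(r"(?P<account>\d+)_(?P<ymd>\d{8})\.pdf$")


def check_against_filename(filename, rows):
    """Return human-readable mismatches between a filename and its extracted rows."""
    match = FILENAME_RE.search(filename)
    if match is None:
        return [f"{filename}: filename does not match the expected pattern"]
    expected_date = datetime.strptime(match["ymd"], "%Y%m%d").date()
    expected_account = match["account"]
    problems = []
    for i, row in enumerate(rows):
        if row["trade_date"] != expected_date:
            problems.append(
                f"{filename} row {i}: trade_date {row['trade_date']} != {expected_date}"
            )
        if row["account"] != expected_account:
            problems.append(
                f"{filename} row {i}: account {row['account']} != {expected_account}"
            )
    return problems


rows = [
    {"trade_date": date(2021, 3, 15), "account": "12345678"},
    {"trade_date": date(2021, 3, 16), "account": "12345678"},  # wrong date
]
for problem in check_against_filename("12345678_20210315.pdf", rows):
    print(problem)
```

An empty list for every file is the outcome you want; anything else points at either a misparsed field or a misnamed file.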
I also re-read a random sample of raw PDFs by hand and compared them against the extracted rows. Tedious. The numbers matched. And I sanity-checked aggregate totals against what I remembered from brokerage statements, which isn't a perfect comparison since statements aggregate differently, but nothing was obviously off.
After that I was reasonably confident the data was clean.
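One more check worth having, shown here as a sketch rather than something from my pipeline: make each row agree with its own arithmetic. The field names are hypothetical, and the sign convention (fees added to the gross on buys, subtracted on sells) plus the one-cent tolerance are assumptions that may need adjusting for rounding on fractional prices.

```python
from decimal import Decimal

BUY_ACTIONS = ("buy", "buy to cover")


def check_row_arithmetic(row, tolerance=Decimal("0.01")):
    """Return a list of arithmetic inconsistencies found in one trade row."""
    qty = Decimal(row["quantity"])
    price = Decimal(row["price"])
    gross = Decimal(row["gross_amount"])
    fees = Decimal(row["fees"])
    net = Decimal(row["net_amount"])

    problems = []
    if abs(qty * price - gross) > tolerance:
        problems.append("quantity * price does not match gross_amount")
    # Assumed convention: buys cost gross plus fees, sells yield gross minus fees.
    expected_net = gross + fees if row["action"] in BUY_ACTIONS else gross - fees
    if abs(expected_net - net) > tolerance:
        problems.append("gross_amount and fees do not reconcile with net_amount")
    return problems


good = {"action": "buy", "quantity": "10", "price": "150.25",
        "gross_amount": "1502.50", "fees": "0.05", "net_amount": "1502.55"}
# Same row with price and gross swapped, as a misaligned column would produce.
swapped = dict(good, price="1502.50", gross_amount="150.25")

print(check_row_arithmetic(good))     # []
print(check_row_arithmetic(swapped))  # two problems reported
```

This catches exactly the failure that row counts miss: a row that exists but has its columns shifted.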
How it came together
The download script and the parser were built in parallel by two AI subagents running at the same time. One handled the Playwright automation and download logic; the other wrote and tested the PDF extraction pipeline against a sample of files I'd already pulled down.
This worked better than I expected. The downloader gave the parser something to work with immediately, and the parser's early failures revealed edge cases in the download output (some PDFs came through malformed, which turned out to be a timeout issue on the server side). They informed each other in real time rather than me discovering the format problem only after processing all 5,254 files.
What the output looks like
A CSV and a JSON file. Each row has:
- Trade date and settlement date
- Account number
- Symbol
- Action (buy / sell / buy to cover / sell short)
- Quantity
- Price per share
- Gross amount
- Commission and fees
- Net amount
- Source format (legacy E*TRADE vs. Morgan Stanley)
From there you can load it into a spreadsheet, a SQLite database, a Pandas DataFrame, whatever works. I've been running queries against it for the past week. Some of what I found was expected. Some wasn't.
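As one example, here is a sketch of loading the CSV into SQLite using only the standard library. The column names are my guess at reasonable headers for the fields listed above; match them to whatever your CSV actually uses.

```python
import csv
import io
import sqlite3

# Hypothetical column names, in the same order as the field list above.
COLUMNS = ["trade_date", "settlement_date", "account", "symbol", "action",
           "quantity", "price", "gross_amount", "fees", "net_amount",
           "source_format"]
NUMERIC = {"quantity", "price", "gross_amount", "fees", "net_amount"}


def load_trades(conn, csv_file):
    """Create the trades table if needed and insert every CSV row; return the count."""
    col_defs = ", ".join(
        f"{c} {'REAL' if c in NUMERIC else 'TEXT'}" for c in COLUMNS
    )
    conn.execute(f"CREATE TABLE IF NOT EXISTS trades ({col_defs})")
    rows = [tuple(r[c] for c in COLUMNS) for r in csv.DictReader(csv_file)]
    placeholders = ", ".join("?" * len(COLUMNS))
    conn.executemany(f"INSERT INTO trades VALUES ({placeholders})", rows)
    conn.commit()
    return len(rows)


def activity_by_symbol(conn):
    """Count trades and total share quantity per symbol and action."""
    return conn.execute(
        "SELECT symbol, action, COUNT(*), SUM(quantity) FROM trades "
        "GROUP BY symbol, action ORDER BY symbol, action"
    ).fetchall()


# Tiny in-memory example; with real data, pass open("trades.csv", newline="").
sample_csv = io.StringIO(
    ",".join(COLUMNS) + "\n"
    "2021-03-15,2021-03-17,12345678,AAPL,buy,10,120.00,1200.00,0.00,1200.00,legacy\n"
    "2021-04-01,2021-04-05,12345678,AAPL,buy,5,125.00,625.00,0.00,625.00,legacy\n"
    "2021-06-10,2021-06-14,12345678,AAPL,sell,15,130.00,1950.00,0.03,1949.97,legacy\n"
)
conn = sqlite3.connect(":memory:")
print(load_trades(conn, sample_csv))
print(activity_by_symbol(conn))
```

Dates stored as ISO `YYYY-MM-DD` text sort correctly in SQLite, so range queries by year or month work without any date parsing.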
If you've been trading since before 2020 or have a more active account, your PDF count will differ. The approach should still work. One thing I'd do differently: build in the rate-limit cooldowns from the start rather than adding them after E*TRADE starts quietly dropping your connections. You'll save yourself a few hours of confusion.
What E*TRADE should just do
A "download all as CSV" button. That's it.
Your trade history is your data. The fact that getting to it requires reverse-engineering an Angular app's internal API, running a headless browser, handling undocumented rate limits, and writing a format-aware PDF parser is a brokerage failure, not a user problem.
Morgan Stanley's private wealth division will build you a custom reporting dashboard. If you don't have eight figures with them, you get the public UI and a document library that loads however many pages they feel like showing you.
I'll take my chances with Chrome DevTools.
I'm some guy on the internet who spent a few days reverse-engineering his brokerage's document API. Nothing here is financial advice. Don't do anything with this that violates E*TRADE's terms of service, and verify your extracted data before using it for anything important like taxes.