The tech behind the personal health project
Did you know that (presumably) every piece of information doctors record about a patient is available on a MyChart website? Every lab result, doctor’s consultation, glucose measurement, blood test, and so on is on this site. Once my mother and I found this out, we kept ourselves busy reading each entry and trying to understand what the doctors were thinking when reviewing my health and data.
Manual Effort
The developer in me saw this as a big opportunity to acquire that data for analysis. The first night I got access to my laptop (and could sit up in bed > 30deg), I manually printed something like 50 pages on the site to PDFs so I could use them with ChatGPT to interpret the data. That was really cool, and it’s how the Personal Health posts on this blog were written. But the next day there were probably another 30, and the next day another 30. I realized it wasn’t going to scale if I had to manually download each doctor’s note to use it in a RAG approach.
Scraper
As a result, I spent the past week working on a scraper for the MyChart system that will go into any “Past Visits”, find notes from the care team, and save them as PDFs that other systems can use for analysis. It can also combine the PDFs into a single file, because ChatGPT only supports a limited number of files per conversation/project, while the size of the files is less of a restriction.
The scraper uses Playwright, just because I wanted to learn how it works in Python, but it could’ve been written with Selenium or any Chromium driver framework.
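To give a sense of what that looks like, here’s a minimal Playwright sketch using the sync API (the URL is a placeholder, not the real portal):

```python
# Minimal Playwright sync-API sketch -- the URL is a placeholder, not the real portal.
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=False)  # headful makes it easier to debug selectors
    page = browser.new_page()
    page.goto("https://mychart.example-hospital.org/")
    page.wait_for_load_state("networkidle")
    print(page.title())
    browser.close()
```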
Here’s the repo: tfitz237/health-mychart-scraper
I had GitHub Copilot, using GPT 5.2-Codex-Max, scheme up the initial structure, page objects, and CLI project. From there it was just a matter of finding the right buttons and selectors on the pages. Playwright has a nice codegen tool that lets you run a live browser session while it captures selectors for the elements you interact with.
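For reference, a codegen session is started from the command line; something like this opens a browser and records selectors as you click (the URL is a placeholder):

```
playwright codegen --save-storage=auth.json https://mychart.example-hospital.org/
```

The `--save-storage` flag writes the session to a file so it can be reloaded later with `--load-storage`, which dovetails with the session-reuse trick mentioned in the Login section below.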
Pages
The project only has three page objects:
Login
- This required some finesse to make it robust across multiple sessions while still being able to provide a 2FA code for the first one. I learned early how easy it is to store a browser session so that the 2FA code only has to happen once (see the sketch below); otherwise it would’ve been a pain to wait for an email/text each time I wanted to start the scraper.
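The repo has its own login flow, but the session-reuse trick boils down to Playwright’s storage_state. Here’s a sketch; the file name and URL are placeholders, not the scraper’s actual values:

```python
# Sketch of session reuse via Playwright's storage_state.
# The file name and URL are placeholders, not the scraper's actual values.
from pathlib import Path
from playwright.sync_api import sync_playwright

STATE_FILE = Path("mychart_state.json")

with sync_playwright() as p:
    browser = p.chromium.launch(headless=False)
    if STATE_FILE.exists():
        # Reuse cookies/local storage from a previous login, skipping 2FA.
        context = browser.new_context(storage_state=str(STATE_FILE))
    else:
        context = browser.new_context()
    page = context.new_page()
    page.goto("https://mychart.example-hospital.org/")
    # ...perform login + 2FA here on the first run...
    context.storage_state(path=str(STATE_FILE))  # persist the session for next time
    browser.close()
```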
Visits
- Once logged in, the scraper can navigate directly to the hospital visits page, which lists all upcoming and past visits for the patient. In my case I really only cared about one specific hospital event, but it’s designed to try to open them all and find any “Notes from Care Team” (see the sketch below).
- If I end up using this for future doctor visits with endo or cardio, I’ll also add support for upcoming visits so I can understand what’s going to happen in those appointments. But I had to stay on top of it while I was in the hospital, because things were mostly scheduled a few hours in advance and I didn’t give myself time to update the scraper before they turned into past visits.
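The visit-walking loop, in spirit, looks something like this; the selectors are made up for illustration (the real ones came from codegen):

```python
# Hypothetical sketch of walking past visits -- the selectors are illustrative,
# not the real MyChart ones.
def scrape_past_visits(page, visits_url):
    page.goto(visits_url)
    count = page.locator("a.pastVisit").count()      # hypothetical selector
    for i in range(count):
        page.goto(visits_url)                        # re-open the list each pass
        page.locator("a.pastVisit").nth(i).click()
        notes = page.get_by_text("Notes from Care Team")
        if notes.count() > 0:
            notes.first.click()
            # ...capture each note to PDF here...
```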
Notes
- Clicking into a hospital visit can surface Notes from the Care Team, which is the raw data from doctor consults, like notes from physical therapy or a doctor’s evaluation of an X-ray. It also includes the raw data for things like blood tests, so it can contain lots of numbers as well as interpreted findings.
Challenges
Cross-Origin Iframes
- I found that there are multiple MyChart websites depending on which hospital the data comes from. You can combine them under a single login, so past hospital visits from one MyChart show up inside another, but the combined data is rendered in cross-origin iframes within the site, which made scraping it much more challenging.
- When attempting to capture text, save to PDF, or print a page with a cross-origin iframe, you only get the content from the current page, not the iframe. This took me a while to understand; I thought it was a weird z-index thing, like the iframe was a popup that couldn’t be found. Once I saw that it was a different URL, it made sense why it couldn’t be captured.
- Solution:
- scrape the different MyChart websites separately
- screenshot it instead of capturing the text (see the sketch after this list)
- this approach doesn’t work as well with RAG, since an image is far less useful to an LLM than raw text.
- I ended up just scraping the main hospital visit. However, I will go back and make sure I capture the ER and previous hospital visit data as well for use in my RAG-analysis approach. I might not even make the scraper capable of it directly, and just download it manually. It’s not like that data is live-updating anymore, as I’m done with those hospital visits.
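For what it’s worth, the screenshot workaround boils down to an element screenshot of the iframe itself; a sketch, with a placeholder selector:

```python
# Sketch of the screenshot workaround: page.pdf() only renders the top-level
# document, but an element screenshot of the <iframe> captures what it displays.
# The selector is a placeholder.
def capture_embedded_note(page, out_path):
    frame_element = page.locator("iframe#externalChart")  # hypothetical selector
    frame_element.screenshot(path=out_path)               # PNG of the rendered iframe
```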
Result
After getting the scraper working, any time a new consult note shows up I can run it to re-acquire all notes from a hospital visit, and it will download them all into separate PDFs. From there I can run a separate CLI command to combine them into a single PDF that can be used by any LLM for analysis. I have a ChatGPT project set up with instructions that help it understand the context of the combined PDF, plus some documentation for this blog platform (Bearblog). Now I can tell ChatGPT to write blog posts for each day, or summarize information for myself.
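The combine step is essentially a PDF merge. Sketched here with pypdf; the repo may use a different library, and the paths are placeholders:

```python
# Sketch of the combine step using pypdf -- the repo may do this differently;
# the paths are placeholders.
from pathlib import Path
from pypdf import PdfWriter

def combine_pdfs(note_dir: Path, out_file: Path) -> None:
    writer = PdfWriter()
    for pdf in sorted(note_dir.glob("*.pdf")):
        writer.append(str(pdf))          # append every downloaded note, in order
    with out_file.open("wb") as f:
        writer.write(f)

combine_pdfs(Path("notes"), Path("combined_notes.pdf"))
```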
Future Goals
- Keep “state” of already captured data.
- Right now the system has no idea whether it has scraped something already, so it goes through every visit and note each time and re-acquires it, even if it’s already been done. It’s based on index right now, rather than datetime, doctor, or other info that could be captured. Not a big deal, since these data points are (or soon will be) in the past and therefore won’t change, but it does mean extra work. A sketch of one approach follows after this list.
- Set up a database for specific retrieval, instead of just a single PDF or combined data.
- I could add an organizational step after scraping that analyzes each document and tags the info by the specific type of data.
- Examples:
- it’s easy to recognize that a note is from my physical therapist, so it could be tagged as such. Then an agent could look up all PT sessions to consider the patient’s improvements in that part of their process.
- Capture all glucose readings and read them over time to see whether there is any pattern based on the food given to the patient
- Read blood reports and discover trends/changes over time to interpret whether the patient is recovering or declining
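The state-keeping idea from the first bullet could be as simple as a small JSON manifest keyed on something stable. Keying on visit date plus note title is an assumption here, not what the repo stores today:

```python
# Sketch of a "have I scraped this already?" manifest. Keying on visit date +
# note title is an assumption, not what the repo currently stores.
import json
from pathlib import Path

MANIFEST = Path("scraped_notes.json")

def load_manifest() -> set[str]:
    if MANIFEST.exists():
        return set(json.loads(MANIFEST.read_text()))
    return set()

def mark_scraped(seen: set[str], visit_date: str, note_title: str) -> None:
    seen.add(f"{visit_date}::{note_title}")
    MANIFEST.write_text(json.dumps(sorted(seen), indent=2))
```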
Anyway, that’s it for now. I just finished the scraper today, so next up is to make sure the blog posts I’ve written so far are up to date with the data, and to write posts for the rest of the days I spent in the hospital.
Thanks for reading!