Consider scraping data from webpage for current week rather than doing API request #234

Closed
maxpatiiuk opened this issue Dec 16, 2023 · 4 comments
Labels
enhancement New feature or request

Comments

@maxpatiiuk
Owner

Given that I have 14 calendars, fetching a week of events requires 14 API requests. This is inefficient.
Instead, I should consider scraping the data from the webpage. That may be bug-prone, but would be good enough for several features:

@maxpatiiuk maxpatiiuk added the enhancement New feature or request label Dec 16, 2023
@maxpatiiuk maxpatiiuk moved this to 🆕 New in Calendar Stats Backlog Dec 16, 2023
@maxpatiiuk
Owner Author

maxpatiiuk commented Jun 16, 2024

Spent a few hours prototyping this.
It is going to look like https://gist.github.com/maxpatiiuk/f315cd3cdafa4a68cdd9838a4d70e05d, but much more robust and production-ready.
I faced a challenge: the script above was parsing event details out of the accessible event label.
That works somewhat reliably for EN (only somewhat, because event names and locations may include commas, which would break the parsing).
There is also AM/PM to account for, which is quite doable.
But then, the accessible event label is quite different for each language (and Google Calendar supports 74 for now).

Thus, I am considering parsing event details out of the DOM preview rather than the accessible label.
This makes the design a bit more fragile to DOM changes, but I will still keep the API request code so that it can seamlessly fall back if DOM parsing fails (the API request code will be needed anyway, since DOM parsing would only work for the day, custom days and week views).

The calendar ID is present inside a jslog attribute on each event node. Hopefully they don't change this much, but yes, this will be a bit fragile. Alternatively, I can try to parse the calendar name out of the accessible label. That is going to be tricky though, since calendar names in the DOM don't always match what the API responds with (your main calendar is named after your email address in the API, but after your preferred name in the UI; the Birthdays calendar is also named differently depending on the UI language). Alternatively, I could parse out the event node color (which is in RGB, so convert that to hex) and find the nearest calendar with that color. Since there is a limited number of colors to choose from, reliable rgb->hex conversion will be easy. The tricky part is that multiple calendars can share a color, so you would still need to match the "possible calendar names" against what's included in the accessible label.
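
A minimal sketch of the color-based lookup, assuming the event node's computed color comes back as an rgb(...) string and that calendars are available as a list of { id, summary, color } objects with hex colors (that shape and both function names are hypothetical). It uses an exact hex match rather than a nearest-color search, and it returns every calendar with that color, because several calendars may share one:

```javascript
// Convert a computed CSS color such as "rgb(3, 155, 229)" to "#039BE5".
// Returns undefined if the string is not in rgb()/rgba() form.
function rgbToHex(rgb) {
  const match = /^rgba?\(\s*(\d+)[,\s]+(\d+)[,\s]+(\d+)/u.exec(rgb.trim());
  if (match === null) return undefined;
  const hex = match
    .slice(1, 4)
    .map((part) => Number(part).toString(16).padStart(2, '0'))
    .join('');
  return `#${hex.toUpperCase()}`;
}

// Hypothetical calendar shape: { id, summary, color } where color is a hex string.
// More than one calendar can be returned, so the caller still has to
// disambiguate (for example against names found in the accessible label).
function findCalendarsByColor(calendars, rgb) {
  const hex = rgbToHex(rgb);
  if (hex === undefined) return [];
  return calendars.filter((calendar) => calendar.color.toUpperCase() === hex);
}
```

If this returns exactly one calendar, the color alone is enough; otherwise the name matching described above is still needed.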

Google Calendar's front-end code that adds the jslog attribute has a few conditionals.
The first 3 branches seem to be for special non-event entities (tasks, appointments, etc.).
The last one (the else branch) has a conditional where the event's calendar ID is supposed to be: (t = _.F(c, 33)) != null ? t : "". It looks like the calendar ID is not specified the first time events are rendered, but is specified on subsequent renders.

      if (_.Yjf(c.Qb().Nu()))
        b.attr("jslog", "139091; track:impression,click,dblclick");
      else if (_.Xjf(c.Qb().Nu()))
        _.mFd(c.Qb().Nu()).iH() ? b.attr("jslog", "185434; track:impression,click,dblclick") : b.attr("jslog", "185338; track:impression,click,dblclick");
      else if (c.Qb().jr())
        b.attr("jslog", "35464; track:click,dblclick");
      else {
        let t;
        b.attr("jslog", "35463; 2:" + _.F(c, 31) + ";" + (_.Yd(c, 33) ? "1:" + ((t = _.F(c, 33)) != null ? t : "") + ";" : "") + "track:impression,click,dblclick,rfjeo,Hu9wEd; mutable:true")
      }
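
Based on the snippet above, reading the calendar ID back out of the attribute could look roughly like this. It is a sketch under assumptions: the function name is hypothetical, it assumes the values never contain a ";", and it takes the "1:" segment to be the calendar ID exactly as it will be needed later (the real value may be encoded in some way). It returns undefined for the non-event branches and for the first render, where the "1:" segment is missing or empty:

```javascript
// The numeric prefix used by the else branch (regular events) in the code above
const REGULAR_EVENT_JSLOG_ID = '35463';

// Extract the "1:<calendar id>" segment from a jslog attribute value.
// Returns undefined when the node is not a regular event or has no calendar ID yet.
function parseCalendarIdFromJslog(jslog) {
  const segments = jslog.split(';').map((segment) => segment.trim());
  if (segments[0] !== REGULAR_EVENT_JSLOG_ID) return undefined;
  const calendarSegment = segments.find((segment) => segment.startsWith('1:'));
  if (calendarSegment === undefined) return undefined;
  const calendarId = calendarSegment.slice('1:'.length);
  return calendarId.length === 0 ? undefined : calendarId;
}
```

In the page, the input would come from something like eventNode.getAttribute('jslog').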

Google Calendar's front-end definition of the event colors (en):

[Xw("#795548", "Cocoa", 21), Xw("#E67C73", "Flamingo", 3), Xw("#D50000", "Tomato", 2), Xw("#F4511E", "Tangerine", 4), Xw("#EF6C00", "Pumpkin", 5), Xw("#F09300", "Mango", 6), Xw("#009688", "Eucalyptus", 13), Xw("#0B8043", "Basil", 12), Xw("#7CB342", "Pistachio", 10), Xw("#C0CA33", "Avocado", 9), Xw("#E4C441", "Citron", 8), Xw("#F6BF26", "Banana", 7), Xw("#33B679", "Sage", 11), Xw("#039BE5", "Peacock", 14), Xw("#4285F4", "Cobalt", 15), Xw("#3F51B5", "Blueberry", 16), Xw("#7986CB", "Lavender", 17), Xw("#B39DDB", "Wisteria", 18), Xw("#616161", "Graphite", 22), Xw("#A79B8E", "Birch", 23), Xw("#AD1457", "Radicchio", 0), Xw("#D81B60", "Cherry Blossom", 1), Xw("#8E24AA", "Grape", 20), Xw("#9E69AF", "Amethyst", 19)]

I also thought of intercepting the network requests that Google Calendar makes, so that I don't need to redundantly fetch from the API but can reuse their responses. Even beyond the question of whether I have the ability to intercept network requests (which is questionable, since ad blockers complained about that feature missing in MV3), there are some challenges:

  • They are using an internal-only API that is distinct from the public one. The internal API responds with nested arrays, not objects, so keys are missing. If they add or remove a field, every later index could shift by one, causing issues for me.
  • I would also have to keep track of events being added/removed/edited during the page lifecycle and update my internal store accordingly. This could prove tricky to get right.

Going back to DOM parsing then. There are DOM nodes for event name, event location (which I ignore) and event time. It seemed simple until I realized that the event time node sometimes includes only the start time, not the end time. And this is not just for 15-minute events: sometimes 30-minute and even 45-minute events don't include the end time, only the start. The end time is consistently present in the accessible event name, but that could differ too much between locales.

I compared the DOM nodes to see if there is any other hint that indicates whether this is a 15, 30 or 45-minute event. There is a "height" attribute. But, besides the reliability concerns across screen sizes and UI settings, I found that the height is a bit different for CJK languages than for EN, so I can't rely on it safely. Looking at the extension stats, it is being used in several dozen languages, so these are the cases I should support.

The DOM nodes have class names that indicate some of these things. Besides the class names common to all events on the page, there are 7 that indicate specific traits (whether the event occupies only part of the column width due to overlap, whether it occurs today, whether it has a handle, whether it should be grayed out, etc.). However, these class names are minified, so they might not be very stable (though Google Calendar seems to hash them rather than minify them, so they don't change too often). I found a class name string that reliably indicates 15-minute events, but I didn't find ones for 30 or 45-minute events, so this probably isn't going to work. Putting class name parsing on ice for now.

Google Calendar's JS code that determines the class name string:

b.attr("class", "GTG3wb ChfiMc rFUW1c " + e + (d ? " F262Ye" : "") + (_.A3(c.getStyle()) ? " Sb44q" : "") + (_.D3(c.getStyle()) && !_.K6(c) ? " KKjvXb" : "") + (_.B3(c.getStyle()) ? " oMTcIc" : "") + (_.oe(c, 3, !1) ? " Hrn1mc" : "") + (_.oe(c, 18, !1) ? " MmaWIb" : "") + (_.C3(c.getStyle()) ? " UflSff" : "") + (_.se(c.getStyle(), 12) ? " pZRd0d" : "") + (_.se(c.getStyle(), 13) ? " JpY6Fd" : "") + (_.se(c, 22) ? " TuXzcb" : "") + (_.z3(c.getStyle()) === 4 ? " uBfZ1b" + (_.se(c.getStyle(), 14) ? " sKK0ee" : "") + (_.se(c.getStyle(), 11) ? " MIxzAd" : "") + (_.Oh(c, 21) === 0 ? " FxGsse" : "") : "") + (_.G3(c.getStyle()) ? " wYv2Ye" : "") + (_.K6(c) ? " BS4SQb" : "") + (_.se(c, 19) || _.se(c, 17) ? " afiDFd" : ""));

Thus, it seems that the only reliable place in the DOM from where I can get the event end time is the accessible label string, which looks like this in EN:

// regular event
10:00 to 13:00, Controllers, Calendar: Esri, No location, June 10, 2024
// multi-day event:
June 9, 2024 at 22:15 to June 10, 2024 at 06:30, Sleep, Calendar: Critical, No location, Color: Graphite, 
// am-pm:
10am to 13:15pm, Controllers, Calendar: Esri, No location, June 11, 2024
// event that doesn't have an end-date visible in the UI (but present in accessible label), and has location:
19:30 to 20:15, WWDC24 Apple Keynote, Max Patiiuk, Location: https://www.apple.com/apple-events/, June 10, 2024

For personal use, my script has been reliably parsing these accessible labels since 2021, and the label format for EN did not change in that time. The only issue I had was when an event name included a comma, so I avoided commas in my event names.
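
For illustration, parsing the simplest EN case (a regular, single-day event on a 24hr clock) could look like the sketch below. This is not the script from the gist; the function name and the returned shape are assumptions, and multi-day and am/pm labels are deliberately left unhandled (it returns undefined for them):

```javascript
// Matches the leading "HH:MM to HH:MM, " part of a regular EN label on a 24hr clock
const REGULAR_EVENT_LABEL = /^(\d{1,2}):(\d{2}) to (\d{1,2}):(\d{2}), /u;

// Returns the start and end as minutes since midnight, or undefined
// when the label is not in the simple "HH:MM to HH:MM, ..." form.
function parseEnglishLabel(label) {
  const match = REGULAR_EVENT_LABEL.exec(label);
  if (match === null) return undefined;
  const [startHour, startMinute, endHour, endMinute] = match.slice(1).map(Number);
  return {
    startMinutes: startHour * 60 + startMinute,
    endMinutes: endHour * 60 + endMinute,
  };
}
```

Because only the leading time range is read, commas in the event name do not affect this part; they only matter once the name, calendar and location are split out of the rest of the label.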

But the real challenge with parsing accessible labels is going to be the internationalization.

Here is Chinese, for example:

19:15至 19:30,Charge Apple Watch,日曆:maxxxxxdlp@gmail.com,沒有位置資料,2024年6月16日
下午7:15至 下午7:30,Charge Apple Watch,日曆:maxxxxxdlp@gmail.com,沒有位置資料,2024年6月16日

And to add to the fun, here is Myanmar, with non-Arabic numerals:

၁၉:၁၅ မှ ၁၉:၃၀၊ Charge Apple Watch၊ ပြက္ခဒိန်- maxxxxxdlp@gmail.com၊ နေရာ ထည့်မထားပါ၊ ၂၀၂၄၊ ဇွန် ၁၆
ညနေ ၇:၁၅ မှ ညနေ ၇:၃၀၊ Charge Apple Watch၊ ပြက္ခဒိန်- maxxxxxdlp@gmail.com၊ နေရာ ထည့်မထားပါ၊ ၂၀၂၄၊ ဇွန် ၁၆

To minimize the need to deal with parsing the locale-specific string, I will go with this:

  1. Get event name from the visible DOM node (would also bypass the comma issues). Doesn't seem like long event names get trimmed in the DOM, so that's good
  2. Get calendar ID from the jslog attribute on the event node
  3. Get times from the visible DOM node
  • If the visible time node only includes the start time, only then look for it in the accessible string

More details on step 3:
Since there is a lot of variation in the time format in the accessible label string, having one regex handle them all would be fragile and probably slow.
Instead, I will:

  • Try to look for front-end Google Calendar code that builds the accessible string to see what variations of it are possible in each language
  • If I can't find that, then manually switch the Google Calendar UI to each of 71 languages and record the accessible strings (for a regular event and a multi-day event, each in 24hr and 12hr time format)
  • Once I get the variations, write runtime code that will get the accessible label for the first event and try to guess the language and am/pm format from it. I am naively hoping that this doesn't have to be sophisticated. For example, for the 24hr clock, the /\d\d:\d\d/gu format would work for the majority of languages, and I don't need to differentiate between those languages as long as this regex matches one of them. For 12hr, /\d{1,2}(?::\d\d)?[ap]m/gu would work for a few languages, though the am/pm part is translated differently in most languages (下午 in Chinese, пн in Ukrainian). Once I find a regex that successfully matches the accessible string for the first event, I apply it to all the other events on the page.
    • Note, I have to concern myself with parsing the am/pm part because otherwise the number in 8am and January 8 would be confused. Something as simple as \d{1,2}\B won't work because:
      • \B only works for a-z, i.e. it breaks with Cyrillic 8пн. There may be a more Unicode-aware matcher like \p{Separator}, but it would be slower, and still won't handle the next problem:
      • Doesn't handle CJK where am/pm indicator is before the number, not after
    • I am hesitant to persist the resolved language code & am/pm in cache as language can be changed on the fly (this causes page reload though), and am/pm changes on the fly (this doesn't cause page reload, but does re-create DOM nodes)
    • Verify if it is safe to assume for each language that the first number is the starting time and 2nd number is the ending time. I don't care about event date, only time because I can know the current date from the column I am in. For multi-day events, the event will be duplicated in each column it spans, so parsing can consider just one column at a time.
      • No need to parse out the date from the UI, only time as I already have the code to get date from the URL, to watch the URL changes and to resolve a URL date to a date range that is currently displayed in the UI.
    • As soon as any step fails to parse reliably, bail out quickly, log a warning to the console and fall back to using the API. In that case, disable the features that rely on DOM parsing for performance reasons? And hopefully I notice that it broke quickly enough and have time to fix it? (Or never, if the error only occurs for a language I am not using?) "tests: Add automated end-to-end scheduled tests" (#236) would be really useful here. Otherwise, some sort of crash analytics.
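
The "detect the format from the first event, then reuse it" idea from the list above can be sketched as follows. The two regexes are the ones mentioned in the list; the format names, the function names and the rule "a format matches if it finds at least two times" are assumptions made for this sketch:

```javascript
// Candidate formats, tried in order. Only the two regexes discussed above are
// included; a real table would have more entries (for example CJK am/pm prefixes).
const TIME_FORMATS = [
  { name: '24hr', regex: /\d\d:\d\d/gu },
  { name: '12hr-latin', regex: /\d{1,2}(?::\d\d)?[ap]m/gu },
];

// Pick the first format that finds both a start and an end time in the label.
// Returns undefined when nothing matches, which is the signal to bail out.
function detectTimeFormat(firstLabel) {
  return TIME_FORMATS.find(({ regex }) => (firstLabel.match(regex) ?? []).length >= 2);
}

// Apply an already detected format to another label on the page.
function extractTimes(format, label) {
  const [start, end] = label.match(format.regex) ?? [];
  return start === undefined || end === undefined ? undefined : { start, end };
}
```

An undefined result from either function is where the "bail out quickly and fall back to the API" step would kick in.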

Fun fact: 12hr clock sucks (besides the fact that it is harder to parse). It is ambiguous. See https://en.wikipedia.org/wiki/12-hour_clock#Confusion_at_noon_and_midnight

Implementation details:

  • Have a DOM mutation listener with 1s timeout (leading edge) to reparse events on DOM change (i.e. event being added or weeks changing). In my benchmarking the code for parsing EN accessible labels alone takes 1-3ms.
    • (optional) only run the mutation listener if there is a feature active that needs access to the events? (i.e Calendar Plus is open, or badges are enabled)
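
A sketch of the leading-edge throttling described above. The helper names are hypothetical, and the clock is injectable only so that the throttle can be exercised without waiting. Note that a pure leading-edge throttle ignores mutations that arrive inside the 1s window, so a real implementation may also want a trailing reparse:

```javascript
// Run the callback immediately, then ignore further calls until waitMs has passed.
// Returns a function that reports whether the callback actually ran.
function leadingThrottle(callback, waitMs, now = Date.now) {
  let lastRun = -Infinity;
  return (...args) => {
    const current = now();
    if (current - lastRun < waitMs) return false;
    lastRun = current;
    callback(...args);
    return true;
  };
}

// Browser-only wiring: reparse events when the calendar DOM changes.
// "container" is whichever element wraps the event nodes (an assumption here).
function watchEvents(container, reparse) {
  const observer = new MutationObserver(leadingThrottle(reparse, 1000));
  observer.observe(container, { childList: true, subtree: true });
  return () => observer.disconnect();
}
```

The function returned by watchEvents disconnects the observer, which fits the optional idea of only listening while a feature that needs the events is active.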

@maxpatiiuk
Owner Author

I was a bit afraid that the accessible label code would be on the back-end, but no, it is all on the front-end.

The format can differ between languages.

"{TIME_PERIOD}, {TITLE}, {CALENDAR}, {DATE}"

There is also special logic for tasks, reminders, scheduled slots, proposed events without rsvp, with rsvp, denied event, etc...

I think if I don't find the calendar ID, I should just bail out of parsing that event, so as to exclude all these non-event entities (rather than bail out of parsing the entire page, unless I fail to find a calendar ID for any event at all).


I also found that their DOM elements have a property with virtual dom data. I was excitedly looking if that would contain the calendar event data, but no, it just contains the DOM attribute data that I could already access via direct DOM APIs (which would also be more stable than accessing internal data structures)


But, much better, there is a global gcal object that contains on it every JS module of the page, including the date formats:

> gcal.mvd.EL
'{START_DATE_TIME} – {END_DATE_TIME}'
> gcal.kvd.EL
'{FULL_DATE}, {TIME}'
> gcal.LDd.EL
'{DATE}, {START_TIME} to {END_TIME}'

Useful util for exploring the global object:

// Convert a complex object to JSON
stringifyJson = (data,short=true,seen=new Set())=>JSON.stringify(data, (_key, value)=>{
  if(value == null || typeof value === 'string' || typeof value === 'number' || typeof value === 'boolean')
    return value;
  else if(seen.has(value)) return short ? undefined : '[RECURSIVE]';
  seen.add(value);
  if(typeof value ==='object') {
    const result = {};
    for (const key in value) {
      if (Object.prototype.hasOwnProperty.call(value, key)) {
        try {
          const objectValue = value[key];
          objectValue.constructor;
          result[key] = objectValue === globalThis ? short ? undefined : '[GLOBAL_THIS]' : objectValue;
        } catch (err) {
          result[key] = short ? undefined : `[ERROR: ${err.message}]`;
        }
      }
    }
    return result;
  }
  else if(short && typeof value === 'function') return undefined;
  else return value.toString()
})
// Convert gcal object to a string
stringifyJson(gcal);
// Copy the resulting string into an editor, pretty-print it as JSON, and explore!

And then also this to quickly find interesting keys and values in the resulting object:

resultingObject= {...}
a = new Set()
d = j=>Object.entries(j).forEach(([k,v])=>{
  if(k.length > 10) a.add(k);
  if(typeof v === 'string' && v.length > 4) a.add(v)
  // Arrays are objects too, so this one check recurses into both (and skips null entries)
  if(typeof v === 'object' && v !== null) d(v);
})
d(resultingObject)
Array.from(a).sort()

And then this to find a path to a string in a deeply nested object:

resultingObject= {...}
l = (j,f,p=[])=>Object.entries(j).forEach(([k,v])=>{
  if(v === f) { console.log([...p,k].join('.')); }
  else if(typeof v === 'object' && v !== null) l(v,f,[...p,k]);
})
l(resultingObject,"Charge Apple Watch")

Some findings:

  • Found definition of all the calendars the user has
    • Path: gcal.dc.ka.ma.n73qwf.0.ma.__wizmanager.ha.3458.__jsmodel.v6ikb.Sz.Fa.3.1
  • All time-zones:
    • Path: gcal.dc.ka.ma.n73qwf.0.ma.__wizmanager.ha.3.closure_lm_675541.ha.keydown.4.handler.source.nL.Du.Qya.qa.qa.ka.208.0
  • Formatting settings for current locale
    • Path: gcal.Jx
  • Feature flag names
  • Calendar settings
    • Path: gcal.dha
  • Fetched event names
    • Path: gcal.dc.ka.ma.n73qwf.0.ma.__wizmanager.ha.3.closure_lm_675541.ha.keydown.4.handler.source.ma.source.PO.ji.kE.Ab.history.ma.ka.ha.ma.bi.0.__jscontroller.Sz.QG.ka.ha.ma.0.6.Cq.1.5

While it has a lot of information I would love to have access to in the extension, as you can see, some of the property access paths are very long, making any reliance on them super fragile.
It seems like many of these keys are hashed rather than randomly minified, giving them some form of persistence, but it would still be too fragile to rely on them.
Still, this is going to be helpful here:


To fetch JS bundles, Google makes a request like this:

https://calendar.google.com/calendar/_/web/calendar-static/_/js/k=calendar-web.matasync.en.kiV42_DT6Uo.2020.O/am=QAICIAA7Ek8CAAAI/d=0/rs=ABFko3_cKNOcL2ZJoL-KJZ1fYcDgE694mA/m=syw9,syw8,xDNx2e,KHdXW,ZDBS7d,jjykEd,sy19u,sy19v,IiAxCb,s0ef2c,Hkkrld,synd,ws9Tlc,siKnQd,UAyYnd,sy2a,sy2b,ndDKmb,UItRMc,yqBu4c,zWPBS,qadpGd,rIjGQb,gCKuke,WMXaid,wzzigb,NmJjzb,beFWRb,Cc7Sob,RtZYV,B78tCd,ayMid,o83wje,C8yvoe,bUUOIe,sy1lt,KEohkb,EZnnmd,bAxIgb,sy1h2,ttg67c,I8m4he,SzkWee,lQR3Hd,bqZpcb,sy1a7,e8v9gb,MpJwZc,n73qwf,MtLh9c,...... very long string like this continues

In the above, the en in web.matasync.en denotes the language. Changing those two letters could return the JS bundle for a different language (with formatting strings for that language, which is quite helpful for my research).
Edit: turns out some other parts of the URL, like kiV42_DT6Uo, also change with the language, so it's not as simple 😢
The 5-6 letter parameters at the end of the URL denote which files to include in the bundle. Removing each of these from the URL removes a bit of code from the resulting .js file. Very interesting!


After a bit of experimentation, I see that the bundle that includes the time formats is the following:
https://calendar.google.com/calendar/_/web/calendar-static/_/js/k=calendar-web.matasync.en.kiV42_DT6Uo.2020.O/am=QAICIAA7Ek8CAAAI/d=0/rs=ABFko3_cKNOcL2ZJoL-KJZ1fYcDgE694mA/m=keU3Q,tXMUsb,ndDKmb

this.gcal=this.gcal||{};(function(_){var window=this;
try{
_.B("tXMUsb");
var KDd,FDd;_.GDd=function(a,b){return b?FDd.format({GREGORIAN_DATE:a,ALTERNATE_DATE:b}):a};_.KL.prototype.va=function(a,b,c){return c?_.QCd(_.YCd(this,a.month),a.qb):_.QCd(_.XCd(this,a.month),a.qb)};_.ZCd.prototype.va=function(a){return _.aDd(this,a)};_.bDd.prototype.va=function(a){return _.dDd(a)};_.HDd=function(a,b,c){return _.Vx(_.G(a.ha,_.Ox,23),[b,c])};_.IDd=function(a,b){if(!a.ka||!a.ha)return"";b=a.ha(b);return a.ka.ka(b,!0)};_.JDd=new _.Sx("{START_TIME} to {END_TIME}");KDd=new _.Sx("{START_DATE} at {START_TIME} to {END_DATE} at {END_TIME}");
_.LDd=new _.Sx("{DATE}, {START_TIME} to {END_TIME}");FDd=new _.Sx("{GREGORIAN_DATE}, {ALTERNATE_DATE}");_.ML=function(a){_.U.call(this,a.Ha);this.ha=a.service.tb;this.Uq=a.service.Uq;this.vi=a.service.vi};_.N(_.ML,_.U);_.ML.Ga=_.U.Ga;_.ML.Ea=function(){return{service:{Uq:_.LL,vi:_.Sud,tb:_.bx}}};_.NL=function(a,b,c=!1,d=!1){c=c?_.Wud(a.vi,b):_.Uud(a.vi,b);d&&(a=a.Uq,a.ka&&a.ha?(b=a.ha(b),b=a.ka.va(b)):b="",c=_.GDd(c,b));return c};_.MDd=function(a,b,c=!1){return c?_.Vud(a.vi,b):_.Tud(a.vi,b)};
_.NDd=function(a,b){return a.ha.nS()?_.avd(a.vi,b):b.minute===0?_.Zud(a.vi,b):_.$ud(a.vi,b)};_.ODd=function(a,b,c){return KDd.format({START_DATE:_.Tud(a.vi,b.Cb()),START_TIME:_.NDd(a,b.ae()),END_DATE:_.Tud(a.vi,c.Cb()),END_TIME:_.NDd(a,c.ae())})};_.Ct(_.kv,_.ML);
_.C();
}catch(e){_._DumpException(e)}
///// trimmed...
})(this.gcal);
// Google Inc.

Now, I can find the bundle for each of the other formatting strings that make up the accessible event label, and then fetch those for each language.

Found the function that formats time:

    _.NDd = function(a, b) {
      // if 24hr clock format is enabled, then do 6:30
      return a.ha.nS() ? _.avd(a.vi, b) :
        // else if 0 minutes, do 6am
        b.minute === 0 ? _.Zud(a.vi, b) :
        // else do 6:30am
        _.$ud(a.vi, b)
    }
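
To make the three branches concrete, here is a readable stand-in for that function. It only mirrors the branching. The real helpers (_.avd, _.Zud, _.$ud) are locale-aware, so the English "am"/"pm" suffixes, the unpadded hour and the function name below are assumptions for illustration:

```javascript
// Mirrors the branching of the minified function above:
// 24hr clock -> "6:30", 12hr with 0 minutes -> "6am", otherwise -> "6:30am"
function formatTimeSketch(use24HourClock, { hour, minute }) {
  const minutes = String(minute).padStart(2, '0');
  if (use24HourClock) return `${hour}:${minutes}`;
  const suffix = hour < 12 ? 'am' : 'pm';
  const hour12 = hour % 12 === 0 ? 12 : hour % 12;
  return minute === 0 ? `${hour12}${suffix}` : `${hour12}:${minutes}${suffix}`;
}
```

The practical consequence for parsing is that, in 12hr mode, a time may or may not contain a minutes part, so any matcher has to treat the minutes as optional.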

Also interesting: I found code like this: _.Qud = new Set(['ja', 'ko', 'zh_CN', 'zh_TW']).has('en');.
That indicates that the language code is substituted into the code after the minifier is run.
I guess they wanted to keep the build fast, so the bundle is built once, and then parts of it are substituted on the fly when it is time to serve it?


Google Calendar languages:

{
  "af": "Afrikaans",
  "az": "azərbaycan",
  "id": "Bahasa Indonesia",
  "ca": "Català",
  "cy": "Cymraeg",
  "da": "Dansk",
  "de": "Deutsch",
  "en_GB": "English (UK)‎",
  "en": "English (US)‎",
  "es": "Español",
  "es_419": "Español (Latinoamérica)‎",
  "eu": "euskara",
  "fil": "Filipino",
  "fr": "Français",
  "fr_CA": "Français (Canada)‎",
  "gl": "galego",
  "hr": "Hrvatski",
  "zu": "isiZulu",
  "it": "Italiano",
  "sw": "Kiswahili",
  "lv": "Latviešu",
  "lt": "Lietuvių",
  "hu": "Magyar",
  "ms": "Melayu",
  "nl": "Nederlands",
  "no": "Norsk (bokmål)‎",
  "pl": "Polski",
  "pt_BR": "Português (Brasil)‎",
  "pt_PT": "Português (Portugal)‎",
  "ro": "Română",
  "sk": "Slovenčina",
  "sl": "Slovenščina",
  "fi": "Suomi",
  "sv": "Svenska",
  "vi": "Tiếng Việt",
  "tr": "Türkçe",
  "is": "íslenska",
  "cs": "Čeština",
  "el": "Ελληνικά",
  "be": "беларуская",
  "bg": "Български",
  "mn": "монгол",
  "ru": "Русский",
  "sr": "Српски",
  "uk": "Українська",
  "kk": "қазақ тілі",
  "hy": "Հայերեն",
  "iw": "עברית",
  "ar": "العربية",
  "ur": "اُردُو‬",
  "fa": "فارسی",
  "ne": "नेपाली",
  "mr": "मराठी",
  "hi": "हिन्दी",
  "bn": "বাংলা",
  "pa": "ਪੰਜਾਬੀ",
  "gu": "ગુજરાતી",
  "ta": "தமிழ்",
  "te": "తెలుగు",
  "kn": "ಕನ್ನಡ",
  "ml": "മലയാളം",
  "si": "සිංහල",
  "th": "ภาษาไทย",
  "lo": "ລາວ",
  "my": "မြန်မာ",
  "ka": "ქართული",
  "am": "አማርኛ",
  "km": "ខ្មែរ",
  "zh_HK": "中文 (香港)‎",
  "zh_CN": "中文(简体)‎",
  "zh_TW": "中文(繁體)‎",
  "ja": "日本語",
  "ko": "한국어"
}

@maxpatiiuk
Owner Author

Looks like besides the language code, some other language-specific parameters are specified in the URL:
https://calendar.google.com/calendar/_/web/calendar-static/_/js/k=calendar-web.matasync.en.kiV42_DT6Uo.2020.O/am=QAICIAA7Ek8CAAAI/d=0/rs=ABFko3_cKNOcL2ZJoL-KJZ1fYcDgE694mA/m=keU3Q,tXMUsb,ndDKmb

The request for the HTML home page returns a script tag with those parameters included in the URL. Let's extract them:

await fetch("https://calendar.google.com/calendar/u/0/r/week?pli=1", {"headers": {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:127.0) Gecko/20100101 Firefox/127.0"},}).then(t=>t.text()).then(r=>{
l='<script id="base-js" src="/calendar/_/web/calendar-static/_/js/k=calendar-web.matasync.';
s=r.slice(r.indexOf(l)+l.length);
return s.slice(0,s.indexOf('/m=base"')).replace('d=1','d=0');
});

Changing settings for each language and extracting that language-specific url part:

[
  "af.fGjYHqi3siQ.2020.O/am=QAICAAArEk8CAAAI/d=0/rs=ABFko38t2kWSQxCgHMYs21szC-klpIN-Zw",
  "az.O9dsS0mL00k.2020.O/am=QAICAAArEk8CAAAI/d=0/rs=ABFko3-GhVrGf7jBQ5RIhiPfOr6qnUv7mg",
  "id.aZIa6c4zhPM.2020.O/am=QAICAAArEk8CAAAI/d=0/rs=ABFko380p-O1HnMkmfop-u74FXjI_zuBTg",
  "ca.O-_1HH8FjL8.2020.O/am=QAICAAArEk8CAAAI/d=0/rs=ABFko39LoBjXJDwlJ1l8nRa5YWkI5WJAyA",
  "cy.q5BBAITHbDs.2020.O/am=QAICAAArEk8CAAAI/d=0/rs=ABFko38QxBSQisieYXEpfbifjf1FMPZLxQ",
  "da.ZjXJzpTk2NQ.2020.O/am=QAICAAArEk8CAAAI/d=0/rs=ABFko38w7wwVS6vy0FSshnTFRFnaY_-dNw",
  "de.2OZiqpfWTHw.2020.O/am=QAICAAArEk8CAAAI/d=0/rs=ABFko3-_aa3qMTXDzk4moqM4tM7XaLgbkw",
  "en_GB.KYHrsBTODfk.2020.O/am=QAICAAArEk8CAAAI/d=0/rs=ABFko3_YfrVPQQ7-ugqib4_v0wTk60Oqig",
  "en.kiV42_DT6Uo.2020.O/am=QAICAAArEk8CAAAI/d=0/rs=ABFko39lIyVIeFOxM6CrUuDmwE5CXKPeVw",
  "es.WogQQhrZn6E.2020.O/am=QAICAAArEk8CAAAI/d=0/rs=ABFko3_hzhp-cmiD6H-fKZGu-CqL53exkQ",
  "es_419.4WrUvkTfE7E.2020.O/am=QAICAAArEk8CAAAI/d=0/rs=ABFko38XXUigONuWBtjcm3AyGFsIexgWSg",
  "eu.FVyXvyQ-vBk.2020.O/am=QAICAAArEk8CAAAI/d=0/rs=ABFko39lBysf7io9wObOyT8OmBWkucpwFA",
  "fil.ZG0z2n6Utsk.2020.O/am=QAICAAArEk8CAAAI/d=0/rs=ABFko3-t2s8VG9NZD6Vosz20v0ou0H1gSw",
  "fr.NeL94Vwnkkk.2020.O/am=QAICAAArEk8CAAAI/d=0/rs=ABFko39_YAnfYSOb96ct-wV94vbSeNaF9Q",
  "fr_CA.yDuZVdev-2c.2020.O/am=QAICAAArEk8CAAAI/d=0/rs=ABFko38VJsMZyrH7O7Boj7BY1cy-NbN1WA",
  "gl.m4lqE4mGoLA.2020.O/am=QAICAAArEk8CAAAI/d=0/rs=ABFko3-3-tUHefVRfhBrkw3urg8pNDiobw",
  "hr.f_13jldKXQU.2020.O/am=QAICAAArEk8CAAAI/d=0/rs=ABFko38wFyLVcEIG5Pw2jkVRe48PHmUrAw",
  "zu.-Ndbezb-QjE.2020.O/am=QAICAAArEk8CAAAI/d=0/rs=ABFko38T-CKmgnsz3ZrtPnM0Kc_77ujR-Q",
  "it.8xr9VoJDSVQ.2020.O/am=QAICAAArEk8CAAAI/d=0/rs=ABFko39EMB15h7Uf-WQ0xgELh-fDvwLCsw",
  "sw.K_fsoydgg-k.2020.O/am=QAICAAArEk8CAAAI/d=0/rs=ABFko3_-iLzAnxNNypnD8axQnKhqSDC3Bw",
  "lv.8beiVA8-HfI.2020.O/am=QAICAAArEk8CAAAI/d=0/rs=ABFko38Bbob8EJLaffNMqXVlOetbGdt7VQ",
  "lt.POBohtPFRKc.2020.O/am=QAICAAArEk8CAAAI/d=0/rs=ABFko39dNttPvcsiVp1VyO6nwDiUzQC0wg",
  "hu.Z45PzMZCOYs.2020.O/am=QAICAAArEk8CAAAI/d=0/rs=ABFko3_RvbhZiA_-yqO1iJfLbYoJj3tqLA",
  "ms.iY6PmE8PJw0.2020.O/am=QAICAAArEk8CAAAI/d=0/rs=ABFko3_urRPMNmJVgaAbPkf_h3JGjFW5Eg",
  "nl.Wfd3wg2oLAg.2020.O/am=QAICAAArEk8CAAAI/d=0/rs=ABFko39eKT9DEYKyZROZLmP-mVl0mPDKkQ",
  "no._UH1m-jairc.2020.O/am=QAICAAArEk8CAAAI/d=0/rs=ABFko3_2Q01On91v7SPmlSxwNzrXqPk84Q",
  "pl.HoTxzj70m44.2020.O/am=QAICAAArEk8CAAAI/d=0/rs=ABFko387q6BCo_p4xmzsrpFdRYXNOUzeZw",
  "pt_BR.yPVb0sWPdB8.2020.O/am=QAICAAArEk8CAAAI/d=0/rs=ABFko3_2iipauXn_cofC3bPPKJGAs2ADyQ",
  "pt_PT.M40ScvnltBc.2020.O/am=QAICAAArEk8CAAAI/d=0/rs=ABFko3-1yyAdr1_yLh2_DAHAzcKvZ1kO1Q",
  "ro.o-J6YJwXM_Y.2020.O/am=QAICAAArEk8CAAAI/d=0/rs=ABFko3-7FuKKra7twKFgP7PYjfl13VRkPQ",
  "sk.O_XlP1I57Is.2020.O/am=QAICAAArEk8CAAAI/d=0/rs=ABFko3_xzfE5Q5cT9dGECG9K6FuIX9-7nA",
  "sl.iVc6mkEVHYE.2020.O/am=QAICAAArEk8CAAAI/d=0/rs=ABFko3_F49K5WjbHkbPQgqywoiK4wIYeyA",
  "fi.u4jLcFduYtA.2020.O/am=QAICAAArEk8CAAAI/d=0/rs=ABFko383sOE93juaPXTKHbhv4HyFVUeVuQ",
  "sv.NZYxm4-5grE.2020.O/am=QAICAAArEk8CAAAI/d=0/rs=ABFko3-cRb50wPQHIRJ4dAzQY6l6WZHxYg",
  "vi.jDliExU6ZZc.2020.O/am=QAICAAArEk8CAAAI/d=0/rs=ABFko3-7j_NRqcfBnRm8OdXONu-4_eKZgA",
  "tr.YBgq_f4fsAs.2020.O/am=QAICAAArEk8CAAAI/d=0/rs=ABFko38BGabY2dw0GmHWlefnVzAAfch6iQ",
  "is.8qGrdxsnvuA.2020.O/am=QAICAAArEk8CAAAI/d=0/rs=ABFko3-QPANWC2JKRRJhAj757jH-ClfXfA",
  "cs.xENm3_vyrB0.2020.O/am=QAICAAArEk8CAAAI/d=0/rs=ABFko3_voHJz8EMeHCYOpulB2SnEKBSAEw",
  "el._wUNKxcYJc0.2020.O/am=QAICAAArEk8CAAAI/d=0/rs=ABFko39SSwcrlGtBeBLsM7raY8Y39bbcbw",
  "be.KqarmVCCkNE.2020.O/am=QAICAAArEk8CAAAI/d=0/rs=ABFko38KE6sQG8SBLfKzMU8b6_K-NtVpoQ",
  "bg.TNjdIY1ZhxE.2020.O/am=QAICAAArEk8CAAAI/d=0/rs=ABFko3_zZaCjReXnqgCHwp1NqDOGIK-yag",
  "mn.M5V9EhgLRPw.2020.O/am=QAICAAArEk8CAAAI/d=0/rs=ABFko38UD8OeeMIw0zHIihoR-PBmmx0Csg",
  "ru.JM2QQSAXLtM.2020.O/am=QAICAAArEk8CAAAI/d=0/rs=ABFko39mWI4tqJ2VKFbpk91yLzn30335qQ",
  "sr.IAZvMlFG9B0.2020.O/am=QAICAAArEk8CAAAI/d=0/rs=ABFko38JBKny69eqkRrwLQfYFm4iXSse4A",
  "uk.T3edqZOotJc.2020.O/am=QAICAAArEk8CAAAI/d=0/rs=ABFko38BvNB5B4EE9gP_oY9NFc1yisZX3A",
  "kk.dSk0-wa9tpo.2020.O/am=QAICAAArEk8CAAAI/d=0/rs=ABFko3_pvVerfXbVWKNiWrIEAoNIhPwNaw",
  "hy.FfLUK2mnymg.2020.O/am=QAICAAArEk8CAAAI/d=0/rs=ABFko3-F98XiXCuizpat1r8O398JTxsJog",
  "iw.lPRGc4QfCK0.2020.O/am=QAICAAArEk8CAAAI/d=0/rs=ABFko39t-lljHsElqM_xSa_R-pnjQmOEDA",
  "ar.n7y-oa7O2zo.2020.O/am=QAICAAArEk8CAAAI/d=0/rs=ABFko385XGuSt2zpkzb84j7MftY3vM5Dbg",
  "ur.hKhfKhqQ738.2020.O/am=QAICAAArEk8CAAAI/d=0/rs=ABFko381I65m_-hdJIjKzY0Eq3Q6V_diVQ",
  "fa.geNz2OVQpOY.2020.O/am=QAICAAArEk8CAAAI/d=0/rs=ABFko399Zg4iV-cYBXO1vpO47s6zSrkdSQ",
  "ne.EcGx837Lpp4.2020.O/am=QAICAAArEk8CAAAI/d=0/rs=ABFko38z64qGvu9I41bOCtvs8ir8A2HZsw",
  "mr.K0sef7tGL9I.2020.O/am=QAICAAArEk8CAAAI/d=0/rs=ABFko3-whkOOYFwRV0YuLARrYU9QSh5heA",
  "hi.WenmEHoVQr0.2020.O/am=QAICAAArEk8CAAAI/d=0/rs=ABFko38FQq2hjlcIHfy8Z4PU4YTjQY37mw",
  "bn.Wov9pdgoJoE.2020.O/am=QAICAAArEk8CAAAI/d=0/rs=ABFko3-ji-RDFUY-waVtOT4-Q_ICEyjRVQ",
  "pa.8c7rkc9tc5s.2020.O/am=QAICAAArEk8CAAAI/d=0/rs=ABFko38yS8NeN4zyH-Kcb2PUSzMCpKqs6w",
  "gu.rL2XGee_j98.2020.O/am=QAICAAArEk8CAAAI/d=0/rs=ABFko38ePxUJmwyZ3fFqpq8BkjFFEt-1CQ",
  "ta.97lPW3Dw-jk.2020.O/am=QAICAAArEk8CAAAI/d=0/rs=ABFko39LdleOjtQSUJx1CfNVkdhvMvLJzQ",
  "te.dcyfQzF_UF4.2020.O/am=QAICAAArEk8CAAAI/d=0/rs=ABFko38W5zlRXPfKwXSMzm297mW4ZisXuQ",
  "kn.zHgUSeh1vOk.2020.O/am=QAICAAArEk8CAAAI/d=0/rs=ABFko39ih84cJ2WFl1CnSupT_FsQQpoaiw",
  "ml.6OxWWkPNjzI.2020.O/am=QAICAAArEk8CAAAI/d=0/rs=ABFko38XISnQNXKkU0zXRrisP05JS2323A",
  "si.wtLZozlWq5I.2020.O/am=QAICAAArEk8CAAAI/d=0/rs=ABFko3_d0cdzuK1-ncLdp7TTypnsv9-4rg",
  "th.hZ4ZlMcBHEw.2020.O/am=QAICAAArEk8CAAAI/d=0/rs=ABFko3-7dTPSxa-cS3Zdp0CbKHLm50LN0g",
  "lo.yq1i_m3Dfzc.2020.O/am=QAICAAArEk8CAAAI/d=0/rs=ABFko39xwaGqdjxuomgsJICyTzyDRPuQZg",
  "my.KnBiGO1WoOc.2020.O/am=QAICAAArEk8CAAAI/d=0/rs=ABFko3-0H-R5e6La5iRfKmCyzA6dqhQbWg",
  "ka.2TtkghregFM.2020.O/am=QAICAAArEk8CAAAI/d=0/rs=ABFko38eAq0Sl8Qcv2OJoAFKj7omFQAxaw",
  "am.9IRwIRcj-uw.2020.O/am=QAICAAArEk8CAAAI/d=0/rs=ABFko3-Qogi8tVhTPCo6z5Y01qJFEyOzhQ",
  "km.l-4sjqqxBdY.2020.O/am=QAICAAArEk8CAAAI/d=0/rs=ABFko38A8e4XHgrE3DtOevCYw8XetVZRsA",
  "zh_HK.Cq4_R4Oqa8c.2020.O/am=QAICAAArEk8CAAAE/d=0/rs=ABFko38iW_FtC5whx39xQTbzElvVG_W1kA",
  "zh_CN.37fgdVobyrM.2020.O/am=QAICAAArEk8CAAAE/d=0/rs=ABFko3_vfz9WqfSlyyv061gvr_AFSrcNng",
  "zh_TW.K7JItRhZf3s.2020.O/am=QAICAAArEk8CAAAE/d=0/rs=ABFko38r62uPGXG9Dj8CZfuB_mlunf53SA",
  "ja.GIKsEBuWpIE.2020.O/am=QAICAAArEk8CAAAE/d=0/rs=ABFko3_vXk3Mt69gwOJzP_WILuq-LdTt0A",
  "ko.tKvKcd-IL0I.2020.O/am=QAICAAArEk8CAAAE/d=0/rs=ABFko3-apa7LgjbDQq0_LqcLfMkjga5awA"
]
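
With these language-specific parts, the bundle URL for each language can be assembled from the pieces already shown in this thread. A small sketch (the function names are hypothetical; the prefix and the three module names are copied from the URLs above):

```javascript
const BUNDLE_URL_PREFIX =
  'https://calendar.google.com/calendar/_/web/calendar-static/_/js/k=calendar-web.matasync.';

// The modules that were found above to contain the time formats
const TIME_FORMAT_MODULES = ['keU3Q', 'tXMUsb', 'ndDKmb'];

// Build the bundle URL for one entry of the list above
function buildBundleUrl(languagePart, modules = TIME_FORMAT_MODULES) {
  return `${BUNDLE_URL_PREFIX}${languagePart}/m=${modules.join(',')}`;
}

// The language code is everything before the first dot, e.g. "zh_HK"
function languageOf(languagePart) {
  return languagePart.slice(0, languagePart.indexOf('.'));
}
```

Mapping the whole array through buildBundleUrl gives one URL per language to fetch and search for formatting strings.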

I found that the page makes this request when I re-open the tab after it has been in the background:

https://calendar.google.com/calendar/u/0/hello?cwuik=10

which responds with: hello
(a cute active status indicator)

@maxpatiiuk
Owner Author

Rather than reading the JavaScript source code for each language to see what formatting it uses, I saw in the code what kinds of formatting changes to expect between languages, and simply scraped the aria labels and DOM times for every one of Google Calendar's 72 languages, each in both the am/pm and the 24-hour clock style.

A few edge cases to handle:

  • Some languages don't use Arabic numerals, so I need to run a normalization pass on the scraped data
  • There may be non-event entities on the calendar (tasks, schedules, etc.). I handle those by trying to find a calendar ID for an event; if I can't, then it's not an event
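
The numeral normalization pass could be sketched like this. The table of scripts is illustrative and incomplete (an assumption, not the list the extension uses); it relies on each script's digits 0-9 occupying ten consecutive code points:

```javascript
// Code point of the digit zero in a few scripts (illustrative, incomplete):
// Arabic-Indic, Extended Arabic-Indic, Devanagari, Bengali, Myanmar
const DIGIT_ZEROS = [0x0660, 0x06f0, 0x0966, 0x09e6, 0x1040];

// Replace digits from the scripts above with ASCII digits.
// Digits from scripts that are not in the table are left unchanged.
function normalizeDigits(text) {
  return text.replace(/\p{Nd}/gu, (digit) => {
    const codePoint = digit.codePointAt(0);
    const zero = DIGIT_ZEROS.find(
      (start) => codePoint >= start && codePoint <= start + 9
    );
    return zero === undefined ? digit : String(codePoint - zero);
  });
}
```

After this pass, the time part of the Myanmar label shown earlier reads 19:15 and 19:30 in ASCII digits, so the same number-finding logic can be used for every language.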

Then I had a great observation: I don't actually need to know whether a given time is am or pm! That's because I have access to the DOM, so I can infer the following from it:

  • Does the event touch the top of the current day? (Meaning its start time should be 0, even if the label says otherwise, because the event spans from the previous day)
  • Does the event touch the bottom of the current day? (Meaning its end time should be 24, even if the label says otherwise, because it spans to the next day)
  • Is the top of the event above the midpoint of the day? Then the start time must be in am
  • Is the bottom of the event above the midpoint of the day? Then the end time must be in am
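
The four checks above can be folded into one small function. In this sketch the position of an event edge is expressed as a fraction of the day column's height (0 is the top, 1 is the bottom), which would be computed from the DOM geometry; the function name, the parameters and the tolerance are assumptions, and rounding right at the midpoint would need more care in practice:

```javascript
// labelHour: the hour as found in the label or DOM time (12hr or 24hr, no am/pm)
// edgeFraction: vertical position of the event edge within the day column (0..1)
// kind: 'start' for the top edge, 'end' for the bottom edge
// Returns the hour on a 24hr clock; minutes are taken from the label as-is,
// except that a result of 24 means the end of the day.
function resolveHour(labelHour, edgeFraction, kind, tolerance = 0.001) {
  // Touching the top: the event started at or before midnight of this day
  if (kind === 'start' && edgeFraction <= tolerance) return 0;
  // Touching the bottom: the event continues into the next day
  if (kind === 'end' && edgeFraction >= 1 - tolerance) return 24;
  // Above the midpoint means am, at or below it means pm
  const hour12 = labelHour % 12;
  return edgeFraction < 0.5 ? hour12 : hour12 + 12;
}
```

Because the hour is reduced modulo 12 first, the same function works whether the label was written in 12hr or 24hr style.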

With this, the task is simplified to merely "finding numbers in strings", rather than "finding times that could be written in am/pm in any one of 72 languages, where more languages could be added at any point without notice, causing your extension to break".

Still a bit tricky because:

  • The dom time may only have the start time
  • The event label string has many numbers - one or 2 day numbers (which can definitely be confused for time in am/pm style - 2 AM 2 Jan) and a year number (which, fortunately is unambiguous)
    • I deal with it by knowing that the dom time node will always have the start time. So if dom time only has the start time, I find that start time in the event label string and exclude it from the consideration. Then, I also exclude from the consideration the day number for the day that this event is from. And if event spans to next or previous day, I also exclude that day number
  • The event label also includes event summary and calendar summary, which may include numbers in them
    • But, in all languages calendar summary appears after event summary. And, I have event summary from the dom node. So I simply ignore everything in the event label after the last occurrence of the event summary substring.
  • In RTL languages, the start time in the string may occur after the end time
    • But that doesn't matter because after I exclude all other matches from the event string (start time and day numbers), the only remaining number must be the end time
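
The exclusion steps above can be sketched as follows. This is a simplified illustration, not the extension's code: the function name and parameter shape are hypothetical, the time separator is narrowed to ":" or "." for brevity, and the "stop excluding once a single candidate is left" guard is my own addition so that an end time equal to the day number (9 to 10 on the 10th) is not thrown away:

```javascript
// One or two digits, optionally followed by ":" or "." and two more digits.
// The lookarounds skip longer digit runs such as the year.
const TIME_OR_NUMBER = /(?<!\d)\d{1,2}(?:[:.]\d{2})?(?!\d)/gu;

// label: accessible label; summary: event name from the DOM node
// domStartTime: text of the DOM time node when it only has the start time
// dayNumbers: day-of-month numbers this event can mention (current and adjacent day)
function findEndTime({ label, summary, domStartTime, dayNumbers }) {
  // Ignore everything from the last occurrence of the event summary onwards
  const cut = label.lastIndexOf(summary);
  const prefix = cut === -1 ? label : label.slice(0, cut);
  const remaining = prefix.match(TIME_OR_NUMBER) ?? [];
  const startToken = (domStartTime.match(TIME_OR_NUMBER) ?? [])[0];
  for (const excluded of [startToken, ...dayNumbers.map(String)]) {
    if (remaining.length <= 1) break;
    const index = remaining.indexOf(excluded);
    if (index !== -1) remaining.splice(index, 1);
  }
  // Exactly one number must be left, otherwise parsing is not reliable
  return remaining.length === 1 ? remaining[0] : undefined;
}
```

An undefined result means more or fewer than one candidate survived, which is the point where falling back to the API is safer than guessing.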

With this, I am able to parse all event label strings in all languages, even in am/pm, with almost zero reliance on locale-specific behavior. In particular, I am making two small assumptions:

  • Event summary in all languages will continue to appear after event start and end times in the event label string. And calendar summary is always after event summary
  • In some languages, from 9 to 10 am is shown as 9–10am. My regex for detecting an individual number (<number><not number><number><number>, i.e. 9:10 or 9.10) needed to be modified to say that the <not number> part is <not a number, not - and not –>. My assumption is that, going forward, only the - and – characters will need to be excluded like this. But to be sure, having continuous UI tests in multiple languages would be great
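
The regex change described in the second assumption can be illustrated like this. Both patterns are my reading of the description above rather than the exact expressions used in the extension, and the names are hypothetical:

```javascript
// Before: any non-digit may separate the hour from the two-digit minutes,
// so a range like "9–10" is read as one number
const NAIVE_NUMBER = /\d{1,2}(?:\D\d{2})?/gu;

// After: the separator may be any non-digit except "-" and "–" (U+2013)
const TIME_NUMBER = /\d{1,2}(?:[^\d\-\u2013]\d{2})?/gu;

function extractNumbers(text, pattern = TIME_NUMBER) {
  return text.match(pattern) ?? [];
}
```

With the naive pattern, "9–10am" yields the single match "9–10"; with the adjusted one it yields "9" and "10", while "9:10" and "9.10" are still kept whole.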

All of this is supplemented with test cases that verify parsing is correct in every one of 72 languages in am/pm. The only risk now is that Google will change formatting on their side. But for that, and other things I need #236 anyway, and that would be crucial in making my extension more stable.

The other change I did today is that on my machine only, if extension logs any warning/error, it will also be printed on the screen. So if any bug occurs while I am just passively using the extension rather than developing it, I will still notice that it happened (i.e if dom parsing will start failing, I will be notified quickly, at least in case of en, but my parsing is now pretty much language-independent)

In any case, as soon as a failure occurs, I fall back to API calls for the duration of the session.
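
The session-wide fallback can be sketched as a small wrapper. All names here are hypothetical: parseDom stands for the DOM parsing described in this thread (assumed to throw or return undefined on failure) and fetchFromApi for the existing API request code:

```javascript
// Returns a getEvents() function that prefers DOM parsing, and after the first
// failure uses the API for the rest of the session.
function createEventSource({ parseDom, fetchFromApi, warn = console.warn }) {
  let domParsingDisabled = false;
  return function getEvents() {
    if (!domParsingDisabled) {
      try {
        const events = parseDom();
        if (events !== undefined) return events;
        warn('DOM parsing returned nothing; falling back to the API');
      } catch (error) {
        warn('DOM parsing failed; falling back to the API', error);
      }
      // Do not retry DOM parsing until the page is reloaded
      domParsingDisabled = true;
    }
    // May return a promise, depending on how the API code is written
    return fetchFromApi();
  };
}
```

Callers would treat the result as possibly asynchronous, since the API path is likely to return a promise.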

All of the above is now implemented and functioning well. With some refactoring, it weighs 1000 lines of code in total, plus 5.3k lines of text fixtures.
Next, I plan on simplifying my event duration calculation, caching and cache invalidation logic, while also improving performance.

Projects
Status: Done