Skip to content

Latest commit

 

History

History
95 lines (73 loc) · 3.44 KB

onSome.md

File metadata and controls

95 lines (73 loc) · 3.44 KB

 

home roadmap issues ©2022,2023 by tim menzies

DOI

Automated SE and Reservoir Sampling

To sample an infinite stream, only keep some of the data (and keep it in the "reservoir"):

  • and as time goes on, keep less and less.
  • and if the reservoir fills up, just let new things replace on things (at random)
    • and do not fret about losing important information.
    • if something is common, deleting it once will not matter since either it exists elsewhere in the reservoir, or it will soon reappear on inout.

E.g. if run on 10,000 numbers, with a reservoir of size 32, this code would keep a sample across the whose space of numbers.

{  18  687 
 1545 
 2022  2324 2693 2758 
 3247  3533 
 4067 4168 4469 4570 
 5863 5907 5957 
 6147 6440 6727 
 7228 7517 7574 7598 7765 
 8311 8379 8538 
 9052 9189 9323}
SOME = obj"SOME"
function SOME.new(i,max)
  i.ok, i.max, i.n = true, max or the.Some or 256, 0  
  i._has = {} end -- marked private with "_" so we do not print large lists


function SOME.add(i,x) --> nil. If full, add at odds i.max/i.n (replacing old items at random)
  if x ~= "?" then
    local pos
    i.n = i.n + 1
    if     #i._has < i.max   --- note that "#i._has" means "length of the list i._has"
    then   pos= 1+#i._has    -- easy case. if cache not full, then add to end
    elseif rand() < i.max/i.n then pos= lib.rint(#i._has) end -- else, replace any at random
    if pos then
       i._has[pos]=x
       i.ok=false end end end -- "ok=false" means we may need to resort

There is more SOME below but before that I note that the above can handle numbers or symbolics inside the SOME. But what happens next is all about SOMEs of nums.

function SOME.has(i) --> t; return kept contents, sorted
  if not i.ok then i._has = sort(i._has) end -- only resort if needed
  i.ok = true
  return i._has end


function SOME.mid(x) --> n; return the number in middle of sort
  return per(i:has(),.5) end


function SOME.div(x) --> n; return the standard deviation as (.9 - .1)/2.58
  return (per(i:has(), .9) - per(i:has(), .1))/2.56 end


function  per(t,p) --> num; return the `p`th(=.5) item of sorted list `t`
  p = math.floor(((p or .5)*#t)+.5) --
  return t[math.max(1,math.min(#t,p))] end

To understand SOME.div, recall that in a normal curve:

  • 99% of values are in 2.58 standard deviations of mean (-2.58s <= X <= 2.58s)
  • 95% of values are in 1.96 standard deviations of mean (-1.96s <= X <= 1.96s)
  • 90% of values are in 1.28 standard deviations of mean (-1.28s <= X <= 1.28s)
    • so 2*1.28*sd = 90th - 10th percentile
    • i.e. sd = (90th - 10th)/(2*1.28)