Take a working system (written in LUA) and write it in any other language.
The task is to write some code to read a CSV file and generate summaries of columns (median and standard deviation for numeric columns; mode and entropy for symbolic columns).
E.g. when run on our test file, this code reports (see the test case `eg.stats`):
xmid {:Clndrs 4.0 :Model 76.0 :Volume 146.0 :origin 1.0}
xdiv {:Clndrs 1.55 :Model 3.876 :Volume 100.775 :origin 0.775}
ymid {:Acc+ 15.5 :Lbs- 2800.0 :Mpg+ 20.0}
ydiv {:Acc+ 2.713 :Lbs- 887.209 :Mpg+ 7.752}
The task is very simple (lots of very small moving parts).
- The real goal here is to get your group to practice working together.
- You have five people in your group. Time to divide and conquer.
Note: write as much as you can from scratch. So, in Python, no Pandas or scikit-learn or argparse or optparse (or their equivalents in other languages).
Homework | Task |
---|---|
HW2 | Get this going for the Num and Sym classes (below) and the test cases the, sym, num, bignum. |
HW3 | Get this going for the Cols, Row, Data classes and the test cases eg.csv, eg.data, eg.stats. |
HW4 | Add all the bling from HW1. Also, add post-commit hooks to auto run all the test cases, the code coverage checks (if your language supports them), and the documentation generators. For more on what kinds of documentation are acceptable, see the web site from lecture1. |
HW5 | For five other groups from cs510 (picked at random), apply the Project1 rubric. Important note: whatever scores you offer, these will not change the other group's scores. |
- All submissions will be a repo url, posted to Moodle.
- If you have outputs, or the text file with the rubrics, add them to e.g. `out1.txt` in `/docs/out`.
- Example implementation: source code
- quick tutorial on LUA.
- Sample input csv file
- Read the quick tutorial on LUA (but you only have to read it, not write it)
- Three ideas cover much of LUA:
  - In LUA, names are global by default (unless marked with `local`).
    - Or, in the function header, you add extra variables.
  - LUA is a language with only one data structure, the associative array.
    - If the keys are all consecutive integers, then these are simple lists (with indexes 1..n).
      - Contents are accessed via `list[1]`.
    - If the keys are symbols, then we use javascript/python style access:
      - Contents are accessed via `list.slotName`.
  - LUA is a minimal "batteries not included" language. So the code given to you starts with a bunch of functions that are built-in to other languages.
- Read the source code. It is written in LUA. You must not write in LUA. Note that:
- Column1 of that pdf has a header string showing help, from which I build my `the` settings object.
- Column2 of that pdf = some misc utils
- Column3 of that pdf = classes/methods.
- Last column of that pdf = test cases for this system
- Second last column of that pdf = shows code that exits to operating system with the number of test failures
- No failures = return zero.
- HINT: when reading strange source code:
- FIRST look for the data structures (see column2)
- SECOND look at the tests (last column)
- THIRD look at the details.
- Find some way to divide the functionality across many small files
- e.g. one file per class
- e.g. anything to do with command-line options goes into its own file
- e.g. misc utilities into its own file.
- e.g. test cases in a separate file
The function Y=F(X) computes dependent variables Y from independent variables X.
Some variables are `Num`eric (which we denote with a leading upper case letter) and some are `Sym`bolic.
Some dependent variables need to be minimized (denoted with a trailing `-` sign) and others need to be maximized (denoted with a trailing `+`).
We say a CSV file contains lots of X,Y examples. Line one of the file has names showing the column names and types. E.g. in our test file, the dependent variables are columns 4, 5 and 8.
Clndrs,Volume,Hp:,Lbs-,Acc+,Model,origin,Mpg+
8,304.0,193,4732,18.5,70,1,10
8,360,215,4615,14,70,1,10
8,307,200,4376,15,70,1,10
8,318,210,4382,13.5,70,1,10
8,429,208,4633,11,72,1,10
Note also the ":" at the end of the `Hp:` header (above). This denotes a column to skip.
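The naming conventions above can be sketched in a few lines of Python. This is only an illustration, not the reference implementation: the `ColSpec` record, its field names, and the `parse_header` function are hypothetical, and giving non-goal columns a weight of 1 is an assumption.

```python
from dataclasses import dataclass


@dataclass
class ColSpec:
    at: int           # 1-based column position
    name: str         # header text, suffix included
    is_num: bool      # leading upper-case letter means numeric
    is_skipped: bool  # trailing ":" means ignore this column
    is_goal: bool     # trailing "+" or "-" means dependent variable
    w: int            # -1 to minimize, 1 otherwise (assumed default)


def parse_header(names):
    """Turn the first CSV line (already split) into column descriptions."""
    specs = []
    for at, name in enumerate(names, start=1):
        specs.append(ColSpec(
            at=at,
            name=name,
            is_num=name[0].isupper(),
            is_skipped=name.endswith(":"),
            is_goal=name[-1] in "+-",
            w=-1 if name.endswith("-") else 1,
        ))
    return specs


header = "Clndrs,Volume,Hp:,Lbs-,Acc+,Model,origin,Mpg+".split(",")
cols = parse_header(header)
goals = [c.at for c in cols if c.is_goal and not c.is_skipped]
print(goals)  # [4, 5, 8]
```

Keeping the header rules in one small function makes it easy to test them before any data rows are read.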
You have to report the middle and diversity of each non-skipped column.
For numbers:
- mid = median = sort numbers seen so far, return the middle value
- div = standard deviation = sort numbers, find the 90th and 10th percentiles, return (90th-10th)/2.56
- Why? Well, you know that plus or minus 1 to 2 standard deviations captures about 68% to 95% of the mass of a normal curve. Somewhere in between 1 and 2 is the point where the 90th percentile falls: 1.28 standard deviations above the middle. The 10th percentile sits 1.28 below, so the gap between them is 2*1.28 = 2.56 standard deviations.
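A minimal Python sketch of these numeric summaries follows. The class name `Num` comes from the assignment, but the method names (`add`, `per`, `mid`, `div`), the percentile indexing, and the replacement policy are assumptions rather than a copy of the reference code. The defaults `cap=512` and `seed=10019` are borrowed from the `nums` and `seed` values in the sample settings output. Because the assignment notes that `Num` only keeps a bounded number of values, the sketch overwrites a random stored value once it is full.

```python
import random


class Num:
    """Summarize a stream of numbers using a bounded sample."""

    def __init__(self, cap=512, seed=10019):
        self.n = 0                      # how many numbers were seen
        self.cap = cap                  # how many numbers are kept
        self.has = []                   # the kept sample
        self.rng = random.Random(seed)  # private generator, repeatable runs

    def add(self, x):
        self.n += 1
        if len(self.has) < self.cap:
            self.has.append(x)          # still room: keep everything
        elif self.rng.random() < self.cap / self.n:
            # full: sometimes replace a randomly chosen old value
            self.has[self.rng.randrange(self.cap)] = x

    def per(self, p):
        """Return the value at percentile p (0..1) of the kept sample."""
        xs = sorted(self.has)
        i = int(p * len(xs) + 0.5) - 1
        return xs[max(0, min(len(xs) - 1, i))]

    def mid(self):
        return self.per(0.5)

    def div(self):
        return (self.per(0.9) - self.per(0.1)) / 2.56


num = Num()
for x in range(1, 101):
    num.add(x)
print(num.mid(), round(num.div(), 3))  # 50 31.25
```

The sample `num` output shown later on this page (`50 31.007751937984`) equals 80/2.58, so the reference code may divide by a slightly different constant than the 2.56 quoted here; expect small differences of that kind.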
For symbols:
- mid = mode = most common symbol
- div = entropy = for symbols occurring at probability p1,p2,... then entropy = ∑ -pi * log2(pi).
  - Why? Well, entropy can be viewed as the effort required to recreate a signal.
  - If a signal has parts that occur at probability p1,p2,...
  - then the probability that we want to search for part i is, wait for it, pi.
  - And the effort to find that part is log2(pi) (assuming a binary chop).
  - So the search effort, weighted by the probability of needing that search, is -pi * log2(pi) (the minus sign is added by convention, since log2(pi) is negative when pi is below 1). Entropy is the sum of these terms.
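The symbolic summaries can be sketched as below. This is an illustration only: the function names `mode` and `entropy` and the toy input `"aaaabbc"` are assumptions, not taken from the reference code.

```python
import math
from collections import Counter


def mode(symbols):
    """Most common symbol (ties go to the one seen first)."""
    return Counter(symbols).most_common(1)[0][0]


def entropy(symbols):
    """Sum of -p * log2(p) over the observed symbol frequencies."""
    counts = Counter(symbols)
    n = sum(counts.values())
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


sample = list("aaaabbc")
print(mode(sample), round(entropy(sample), 3))  # a 1.379
```

A column holding a single repeated symbol has entropy 0, and entropy grows as the symbols become more evenly mixed, which is why it works as a diversity measure.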
Your code must support the following (and outside of these bounds, feel free to NOT do things like csv):
- Five classes (or more): `Data`, `Cols`, `Sym`, `Num`, `Row`. For notes on these, see column2 of the source code.
  - Note that `Num` is a reservoir sampler; i.e. it keeps N numbers and if you see more than N, some old number is replaced at random.
- A help string that can be displayed with `-h`.
- A `cli` function that can update the slots of `the` from the command-line.
  - e.g. if `the` is `{name="Tim", age=20}` then `-n X` and `-a X` can update `name` and `age`.
- A `csv` function that reads each line of a text file, divides on some separator (e.g. comma), removes leading and trailing white space, then coerces the cells to ints, floats, booleans or (failing the rest) to strings.
- A set of demo tasks that can be called with `-e task` (and `-e ALL` runs all tests).
  - The tests of source code (and perhaps your own, if you want).
  - On our system, `-e ALL` returns the following (but [YMMV](https://dictionary.cambridge.org/us/dictionary/english/ymmv)).
  - The following output is in alphabetical order of the tests. A better ordering would be simpler to more complex. To see that ordering, look at the last column of the source code.
- Note that our tests can handle a crash (see "BAD"). To actually debug that one, you need the stack dump info that is hidden by the usual test output. So we have a
-----------------------------------
!!!!!! CRASH BAD false
-----------------------------------
!!!!!! FAIL LIST true
-----------------------------------
Examples lua csv -e ...
ALL
BAD
LIST
LS
bignum
csv
data
num
stats
sym
the
!!!!!! PASS LS true
-----------------------------------
{1 28 49 50 56 85 86 156 208 237 294 327 444 459 461 485 490 503
546 618 653 712 723 727 770 801 849 915 928 941 967 987}
!!!!!! PASS bignum true
-----------------------------------
{Clndrs Volume Hp: Lbs- Acc+ Model origin Mpg+}
{8 304 193 4732 18.5 70 1 10}
{8 360 215 4615 14 70 1 10}
{8 307 200 4376 15 70 1 10}
{8 318 210 4382 13.5 70 1 10}
{8 429 208 4633 11 72 1 10}
{8 400 150 4997 14 73 1 10}
{8 350 180 3664 11 73 1 10}
{8 383 180 4955 11.5 71 1 10}
{8 350 160 4456 13.5 72 1 10}
!!!!!! PASS csv true
-----------------------------------
{:at 4 :hi 5140 :isSorted false :lo 1613 :n 398 :name Lbs- :w -1}
{:at 5 :hi 24.8 :isSorted false :lo 8 :n 398 :name Acc+ :w 1}
{:at 8 :hi 50 :isSorted false :lo 10 :n 398 :name Mpg+ :w 1}
!!!!!! PASS data true
-----------------------------------
50 31.007751937984
!!!!!! PASS num true
-----------------------------------
xmid {:Clndrs 4.0 :Model 76.0 :Volume 146.0 :origin 1.0}
xdiv {:Clndrs 1.55 :Model 3.876 :Volume 100.775 :origin 0.775}
ymid {:Acc+ 15.5 :Lbs- 2800.0 :Mpg+ 20.0}
ydiv {:Acc+ 2.713 :Lbs- 887.209 :Mpg+ 7.752}
!!!!!! PASS stats true
-----------------------------------
{:div 1.378 :mid a}
!!!!!! PASS sym true
-----------------------------------
{:dump false :eg ALL :file ../data/auto93.csv :help false :nums 512 :seed 10019 :seperator ,}
!!!!!! PASS the true
!!!!!! PASS ALL true
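One way to get a test engine that survives crashing tests is sketched below in Python. It is not the reference implementation: the function names `run_one` and `run_all`, the `dump` parameter, the two toy tests, and the exact print format are all assumptions that only loosely mirror the output above.

```python
import traceback


def run_one(name, fn, dump=False):
    """Run one test function; return 'PASS', 'FAIL' or 'CRASH'."""
    try:
        status = "PASS" if fn() else "FAIL"
    except Exception:
        if dump:
            traceback.print_exc()  # show the stack only when asked
        status = "CRASH"
    print("!!!!!!", status, name)
    return status


def run_all(tests, dump=False):
    """Run every test in name order; return how many did not pass."""
    fails = 0
    for name in sorted(tests):
        print("-" * 35)
        if run_one(name, tests[name], dump) != "PASS":
            fails += 1
    return fails


def eg_bad():
    return {}["missing"]  # deliberately raises KeyError


def eg_num():
    return sum([1, 2, 3]) == 6


TESTS = {"BAD": eg_bad, "num": eg_num}
fails = run_all(TESTS)
print("failures:", fails)  # failures: 1
# A real entry point would finish with: sys.exit(fails)
```

Catching the exception inside `run_one` is what lets the loop carry on after a crash, and handing the failure count to the operating system means a zero exit status signals that every test passed.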
The task is done when we can see in your repo:
- Your repo shows a record of 5 people doing (approx) equal work.
- Something that can display a help string and where internal settings can be updated from the command-line.
- Something that can read csv files whose first line has ":+-" and upper and lower case in column names.
- Output like the above (and YMMV)
- For hw2, you need classes and methods for `Sym`, `Num`.
- For hw2, you need to show output from `eg.the`, `eg.sym`, `eg.num`, `eg.bignum`.
- For hw3, you also need classes and methods for `Data`, `Cols`, `Row`.
- For hw3, you need to show output from `eg.csv`, `eg.data`, `eg.stats`.
- For hw3, you need a test engine that can call one, or all, tests and which returns the number of failed tests to the operating system (so returning zero means "no errors").
  - The test engine should be able to handle crashing tests (and keep running tests after the crash). For Python people, see `try: ... except:`.
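For the checklist items about command-line settings and csv reading, here is a small Python sketch written from scratch (no argparse, no csv module). The names `coerce`, `csv_rows` and `cli` echo the assignment's vocabulary, but their signatures are assumptions. So are two behaviours: a flag is matched on the first letter of a setting name, and a boolean setting is flipped when its flag appears.

```python
import io


def coerce(s):
    """Turn a cell into bool, int or float, else leave it a trimmed string."""
    s = s.strip()
    if s.lower() == "true":
        return True
    if s.lower() == "false":
        return False
    for cast in (int, float):
        try:
            return cast(s)
        except ValueError:
            pass
    return s


def csv_rows(lines, sep=","):
    """Yield one list of coerced cells per non-blank line."""
    for line in lines:
        line = line.strip()
        if line:
            yield [coerce(cell) for cell in line.split(sep)]


def cli(the, argv):
    """Update settings in place, e.g. `-n X` sets the slot starting with n."""
    for key, old in list(the.items()):
        flag = "-" + key[0]
        for i, arg in enumerate(argv):
            if arg == flag:
                if isinstance(old, bool):
                    the[key] = not old            # assumed: booleans flip
                elif i + 1 < len(argv):
                    the[key] = coerce(argv[i + 1])
    return the


rows = list(csv_rows(io.StringIO("Clndrs,Volume,Hp:\n8, 304.0 ,193\n")))
print(rows)  # [['Clndrs', 'Volume', 'Hp:'], [8, 304.0, 193]]
the = cli({"name": "Tim", "age": 20}, ["-n", "Jane", "-a", "21"])
print(the)   # {'name': 'Jane', 'age': 21}
```

Sharing `coerce` between the csv reader and the command-line parser keeps type handling in one place, so a value like `21` becomes an int whichever way it enters the program.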