From 29a8ed339c1cfb3554052b11e223c43f26f0c7cc Mon Sep 17 00:00:00 2001 From: Gavin Gilmour Date: Sat, 15 Sep 2018 12:03:34 +0100 Subject: [PATCH] Doc improvements Tidied up some of the field definitions doc --- docs/Variable-definition.rst | 360 ++++++++++++++++++++--------------- 1 file changed, 202 insertions(+), 158 deletions(-) diff --git a/docs/Variable-definition.rst b/docs/Variable-definition.rst index b61db0ac4..be7c07b4a 100644 --- a/docs/Variable-definition.rst +++ b/docs/Variable-definition.rst @@ -1,41 +1,50 @@ -Variable definitions -================= +Variable Definitions +==================== -Core Variables +Variable Types -------------- A variable definition describes the records that you want to match. It is a dictionary where the keys are the fields and the values are the -field specification - +field specification. For example:- .. code:: python - variables = [ - {'field' : 'Site name', 'type': 'String'}, - {'field' : 'Address', 'type': 'String'}, - {'field' : 'Zip', 'type': 'String', 'has missing':True}, - {'field' : 'Phone', 'type': 'String', 'has missing':True} - ] + [ + {'field': 'Site name', 'type': 'String'}, + {'field': 'Address', 'type': 'String'}, + {'field': 'Zip', 'type': 'String', 'has missing': True}, + {'field': 'Phone', 'type': 'String', 'has missing': True} + ] String Types ^^^^^^^^^^^^ -A 'String' type variable must declare the name of the record field to -compare a 'String' type declaration ex. -``{'field' : 'Address', type:'String'}`` The string type expects fields to be of class string. +A ``String`` type field must declare the name of the record field to compare +and a ``String`` type declaration. The ``String`` type expects fields to be of +class string. -String types are compared using `affine gap string +``String`` types are compared using `affine gap string distance `__. +For example:- + +.. 
code:: python + + {'field': 'Address', 'type': 'String'} + ShortString Types ^^^^^^^^^^^^^^^^^ -Short strings are just like String types except that dedupe will not -try to learn a canopy blocking rule for these fields, which can speed -up the training phase considerably. Zip codes and city names are good -candidates for this type. If in doubt, just use 'String.' +A ``ShortString`` type field is just like a ``String`` type field except that dedupe +will not try to learn a canopy blocking rule for these fields, which can +speed up the training phase considerably. + +Zip codes and city names are good candidates for this type. If in doubt, +always use ``String``. + +For example:- .. code:: python @@ -46,42 +55,45 @@ candidates for this type. If in doubt, just use 'String.' Text Types ^^^^^^^^^^ -If you want to compare fields comparing long blocks of text, like -product descriptions or article abstracts you should use this -type. Text types fields are compared using the `cosine similarity -metric `__. +If you want to compare fields containing long blocks of text, e.g. product +descriptions or article abstracts, you should use this type. ``Text`` type +fields are compared using the `cosine similarity metric +`__. -Basically, this is a measurement of the amount of words that two -documents have in common. This measure can be made more useful the -overlap of rare words counts more than the overlap of common words. If -provide a sequence of example fields than (a corpus), dedupe will -learn these weights for you. +This is a measurement of the number of words that two documents have in +common. This measure can be made more useful if the overlap of rare words +counts more than the overlap of common words. + +If you provide a sequence of example fields (i.e. a corpus), dedupe will +learn these weights for you. For example:- .. 
code:: python - {'field': 'Product description', 'type' : 'Text', - 'corpus' : ['this product is great', - 'this product is great and blue']} - } + { + 'field': 'Product description', + 'type': 'Text', + 'corpus' : [ + 'this product is great', + 'this product is great and blue' + ] + } -If you don't want to adjust the measure to your data, just leave 'corpus' -out of the variable definition. +If you don't want to adjust the measure to your data, just leave 'corpus' out +of the variable definition entirely. .. code:: python - {'field' : 'Product description', 'type' : 'Text'} - + {'field': 'Product description', 'type': 'Text'} Custom Types ^^^^^^^^^^^^ -A 'Custom' type field must have specify the field it wants to compare, -a 'type' declaration of 'Custom', and a 'comparator' declaration. The -comparator must be a function that can take in two field values and -return a number. +A ``Custom`` type field must specify the field it wants to compare, a +type declaration of ``Custom``, and a comparator declaration. The comparator +must be a function that can take in two field values and return a number. -Example custom comparator: +For example, a custom comparator: .. code:: python @@ -92,58 +104,65 @@ Example custom comparator: else: return 1 -variable definition: +The corresponding variable definition: .. code:: python - {'field' : 'Zip', 'type': 'Custom', - 'comparator' : sameOrNotComparator} + { + 'field': 'Zip', + 'type': 'Custom', + 'comparator': sameOrNotComparator + } LatLong ^^^^^^^ -A 'LatLong' type field must have as the name of a field and a 'type' -declaration of custom. LatLong fields are compared using the -`Haversine Formula `__. -A 'LatLong' type field must consist of tuples of floats corresponding -to a latitude and a longitude. +A ``LatLong`` type field must have the name of a field and a type +declaration of ``LatLong``. ``LatLong`` fields are compared using the `Haversine +Formula `__. 
+ +A ``LatLong`` +type field must consist of tuples of floats corresponding to a latitude and a +longitude. .. code:: python - {'field' : 'Location', 'type': 'LatLong'}} + {'field': 'Location', 'type': 'LatLong'} Set ^^^ -A 'Set' type field is for comparing lists of elements, like keywords -or client names. Set types are very similar to -:ref:`text-types-label`. They use the same comparison function and -you can also let dedupe learn which terms are common or rare by -providing a corpus. Within a record, a Set types field have to be -hashable sequences like tuples or frozensets. +A ``Set`` type field is for comparing lists of elements, like keywords or +client names. ``Set`` types are very similar to :ref:`text-types-label`. They +use the same comparison function and you can also let dedupe learn which +terms are common or rare by providing a corpus. Within a record, a ``Set`` +type field has to be a hashable sequence like a tuple or frozenset. .. code:: python - {'field' : 'Co-authors', 'type': 'Set', - 'corpus' : [('steve edwards'), - ('steve edwards', 'steve jobs')]} + { + 'field': 'Co-authors', + 'type': 'Set', + 'corpus' : [ + ('steve edwards',), + ('steve edwards', 'steve jobs') + ] } or .. code:: python - {'field' : 'Co-authors', 'type': 'Set'} - } + {'field': 'Co-authors', 'type': 'Set'} Interaction ^^^^^^^^^^^ -An interaction field multiplies the values of the multiple variables. -An interaction variable is created with 'type' declaration of -'Interaction' and an 'interaction variables' declaration. +An ``Interaction`` field multiplies the values of multiple variables. +An ``Interaction`` variable is created with a type declaration of +``Interaction`` and an ``interaction variables`` declaration. -The 'interaction variables' must be a sequence of 'variable names' of +The ``interaction variables`` field must be a sequence of variable names of other fields you have defined in your variable definition. 
`Interactions `__ @@ -151,47 +170,46 @@ are good when the effect of two predictors is not simply additive. .. code:: python - [{'field': 'Name', 'variable name': 'name', 'type': 'String'}, - {'field': 'Zip', 'variable name': 'zip', 'type': 'Custom', - 'comparator' : sameOrNotComparator}, - {'type': 'Interaction', - 'interaction variables': ['name', 'zip']}] + [ + { 'field': 'Name', 'variable name': 'name', 'type': 'String' }, + { 'field': 'Zip', 'variable name': 'zip', 'type': 'Custom', + 'comparator' : sameOrNotComparator }, + {'type': 'Interaction', 'interaction variables': ['name', 'zip']} + ] Exact ^^^^^ -'Exact' variables measure whether two fields are exactly the same or not. +``Exact`` variables measure whether two fields are exactly the same or not. .. code:: python - {'field' : 'city', 'type': 'Exact'}} + {'field': 'city', 'type': 'Exact'} Exists ^^^^^^ -'Exists' variables measure whether both, one, or neither of the fields -are defined. This can be useful if the presence or absence of a field tells -you something about meaningful about the record. +``Exists`` variables measure whether both, one, or neither of the fields are +defined. This can be useful if the presence or absence of a field tells you +something meaningful about the record. .. code:: python - {'field' : 'first_name', 'type': 'Exists'} + {'field': 'first_name', 'type': 'Exists'} Categorical ^^^^^^^^^^^ -Categorical variables are useful when you are dealing with qualitatively -different types of things. For example, you may have data on businesses -and you find that taxi cab businesses tend to have very similar names -but law firms don't. Categorical variables would let you indicate -whether two records are both taxi companies, both law firms, or one of -each. +``Categorical`` variables are useful when you are dealing with qualitatively +different types of things. 
For example, you may have data on businesses and +you find that taxi cab businesses tend to have very similar names but law +firms don't. ``Categorical`` variables would let you indicate whether two records +are both taxi companies, both law firms, or one of each. -Dedupe would represent these three possibilities using two dummy -variables: +Dedupe would represent these three possibilities using two dummy variables: :: @@ -202,7 +220,7 @@ variables: A categorical field declaration must include a list of all the different strings that you want to treat as different categories. -So if you data looks like this +So if your data looks like this:- :: @@ -211,53 +229,62 @@ So if you data looks like this AA1 Taxi taxi Hindelbert Esq lawyer -You would create a definition like: +You would create a definition such as: .. code:: python - {'field' : 'Business Type', 'type': 'Categorical', - 'categories' : ['taxi', 'lawyer']} + { + 'field': 'Business Type', + 'type': 'Categorical', + 'categories' : ['taxi', 'lawyer'] + } Price ^^^^^ -Price variables are useful for comparing positive, nonzero numbers -like prices. The values of 'Price' field must be a positive float. If -the value is 0 or negative, then an exception will be raised. +``Price`` variables are useful for comparing positive, non-zero numbers like +prices. The value of a ``Price`` field must be a positive float. If the value is +0 or negative, then an exception will be raised. .. code:: python - {'field' : 'cost', 'type': 'Price'} + {'field': 'cost', 'type': 'Price'} DateTime ^^^^^^^^ -DateTime variables are useful for comparing dates and timestamps. This variable -can accept strings or Python datetime objects as inputs. +``DateTime`` variables are useful for comparing dates and timestamps. This +variable can accept strings or Python datetime objects as inputs. 
-The DateTime variable definition accepts a few optional arguments that can help -improve behavior if you know your field follows an unusual format: +The ``DateTime`` variable definition accepts a few optional arguments that +can help improve behavior if you know your field follows an unusual format: -* :code:`fuzzy` - Use fuzzy parsing to automatically extract dates from strings like "It happened on June 2nd, 2017" (default :code:`True`) +* :code:`fuzzy` - Use fuzzy parsing to automatically extract dates from strings like "It happened on June 2nd, 2018" (default :code:`True`) * :code:`dayfirst` - Ambiguous dates should be parsed as dd/mm/yy (default :code:`False`) * :code:`yearfirst`- Ambiguous dates should be parsed as yy/mm/dd (default :code:`False`) -Note that the DateTime variable defaults to mm/dd/yy for ambiguous dates. -If both :code:`dayfirst` and :code:`yearfirst` are set to :code:`True`, then :code:`dayfirst` will take -precedence. +Note that the ``DateTime`` variable defaults to mm/dd/yy for ambiguous dates. +If both :code:`dayfirst` and :code:`yearfirst` are set to :code:`True`, then +:code:`dayfirst` will take precedence. -Sample DateTime variable definition, using the defaults: +For example, a sample ``DateTime`` variable definition, using the defaults: .. code:: python - {'field' : 'time_of_sale', 'type': 'DateTime', - 'fuzzy': True, 'dayfirst': False, 'yearfirst': False} + { + 'field': 'time_of_sale', + 'type': 'DateTime', + 'fuzzy': True, + 'dayfirst': False, + 'yearfirst': False + } -If you're happy with the defaults, you can simply define the :code:`field` and :code:`type`: +If you're happy with the defaults, you can simply define the :code:`field` +and :code:`type`: .. code:: python - {'field' : 'time_of_sale', 'type': 'DateTime'} + {'field': 'time_of_sale', 'type': 'DateTime'} Optional Variables @@ -266,66 +293,84 @@ Optional Variables Address Type ^^^^^^^^^^^^ -An 'Address' variable should be used for United States addresses. 
It -uses the `usaddress `__ -package to split apart an address string into components like address -number, street name, and street type and compares component to component. +An ``Address`` variable should be used for United States addresses. It uses +the `usaddress `__ package to +split apart an address string into components like address number, street +name, and street type and compares component to component. + +For example:- .. code:: python - {'field' : 'address', 'type' : 'Address'} + {'field': 'address', 'type': 'Address'} -Install the `dedupe-variable-address `__ package for Address Type. +Install the `dedupe-variable-address +`__ package for +``Address`` Type. Name Type ^^^^^^^^^ -A 'Name' variable should be used for a field that contains American -names, corporations and households. It uses the `probablepeople -`__ package to split -apart an name string into components like give name, surname, -generational suffix, for people names, and abbreviation, company type, -and legal form for corporations. +A ``Name`` variable should be used for a field that contains American names, +corporations and households. It uses the `probablepeople +`__ package to split apart +a name string into components like given name, surname, generational suffix, +for people names, and abbreviation, company type, and legal form for +corporations. + +For example:- .. code:: python - {'field' : 'name', 'type' : 'Name'} + {'field': 'name', 'type': 'Name'} -Install the `dedupe-variable-name `__ package for Name Type. +Install the `dedupe-variable-name +`__ package for ``Name`` +Type. Fuzzy Category ^^^^^^^^^^^^^^ -A 'FuzzyCategorical' variable should be used for when you for -categorical data that has variations. Occupations are example, where -the you may have Attorney, Counsel, and Lawyer. For this variable -type, you need to supply a corpus of records that contain your focal -record and other field types. 
This corpus should either be all the -data you are trying to link or a representative sample. +A ``FuzzyCategorical`` variable should be used for +categorical data that has variations. -.. code:: python +Occupations are an example, where you may have 'Attorney', 'Counsel', and +'Lawyer'. For this variable type, you need to supply a corpus of records that +contain your focal record and other field types. This corpus should either be +all the data you are trying to link or a representative sample. - {'field' : 'occupation', 'type' : 'FuzzyCategorical', - 'corpus' : [{'name' : 'Jim Doe', 'occupation' : 'Attorney'}, - {'name' : 'Jim Doe', 'occupation' : 'Lawyer'}]} +For example:- +.. code:: python -Install the `dedupe-variable-fuzzycategory `__ package for the FuzzyCategorical Type. + { + 'field': 'occupation', + 'type': 'FuzzyCategorical', + 'corpus' : [ + {'name' : 'Jim Doe', 'occupation' : 'Attorney'}, + {'name' : 'Jim Doe', 'occupation' : 'Lawyer'} + ] + } +Install the `dedupe-variable-fuzzycategory +`__ package for +the ``FuzzyCategorical`` Type. Missing Data ------------ If the value of field is missing, that missing value should be represented as -a ``None`` +a ``None`` object. .. code:: python - data = [{'Name' : 'AA Taxi', 'Phone' : '773.555.1124'}, - {'Name' : 'AA Taxi', 'Phone' : None}, - {'Name' : None, 'Phone' : '773-555-1123'}] + [ + {'Name': 'AA Taxi', 'Phone': '773.555.1124'}, + {'Name': 'AA Taxi', 'Phone': None}, + {'Name': None, 'Phone': '773-555-1123'} + ] If you want to model this missing data for a field, you can set ``'has missing' : True`` in the variable definition. This creates a new, @@ -338,17 +383,15 @@ no field will be created to account for missing data. This approach is called 'response augmented data' and is described in Benjamin Marlin's thesis `"Missing Data Problems in Machine Learning" -`__. 
Basically, -this approach says that, even without looking at the value of the -field comparisons, the pattern of observed and missing responses will -affect the probability that a pair of records are a match. +`__. +Basically, this approach says that, even without looking at the value of the +field comparisons, the pattern of observed and missing responses will affect +the probability that a pair of records are a match. This approach makes a few assumptions that are usually not completely true: -- Whether a field is missing data is not associated with any other - field missing data -- That the weighting of the observed differences in field A should be - the same regardless of whether field B is missing. +- Whether a field is missing data is not associated with any other field missing data. +- That the weighting of the observed differences in field A should be the same regardless of whether field B is missing. If you define an an interaction with a field that you declared to have @@ -359,25 +402,28 @@ Longer example of a variable definition: .. 
code:: python - variables = [{'field' : 'name', 'variable name' : 'name', 'type' : 'String'}, - {'field' : 'address', 'type' : 'String'}, - {'field' : 'city', variable name' : 'city', 'type' : 'String'}, - {'field' : 'zip', 'type' : 'Custom', 'comparator' : sameOrNotComparator}, - {field' : 'cuisine', 'type' : 'String', 'has missing': True} - {'type' : 'Interaction', 'interaction variables' : ['name', 'city']} - ] + [ + {'field': 'name', 'variable name' : 'name', 'type': 'String'}, + {'field': 'address', 'type': 'String'}, + {'field': 'city', 'variable name' : 'city', 'type': 'String'}, + {'field': 'zip', 'type': 'Custom', 'comparator' : sameOrNotComparator}, + {'field': 'cuisine', 'type': 'String', 'has missing': True}, + {'type': 'Interaction', 'interaction variables' : ['name', 'city']} + ] Multiple Variables comparing same field --------------------------------------- It is possible to define multiple variables that all compare the same variable. -For example +For example:- .. code:: python - variables = [{'field' : 'name', 'type' : 'String'}, - {'field' : 'name', 'type' : 'Text'}] + [ + {'field': 'name', 'type': 'String'}, + {'field': 'name', 'type': 'Text'} + ] Will create two variables that both compare the 'name' field but @@ -387,13 +433,11 @@ in different ways. Optional Edit Distance ---------------------- -For String, ShortString, Address, and Name fields, you can choose to -use the a conditional random field distance measure for strings. This -measure can give you more accurate results but is much slower than the +For ``String``, ``ShortString``, ``Address``, and ``Name`` fields, you can +choose to use a conditional random field distance measure for strings. +This measure can give you more accurate results but is much slower than the default edit distance. .. 
code:: python - {'field' : 'name', 'type' : 'String', 'crf' : True} - - + {'field': 'name', 'type': 'String', 'crf': True}
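The longer variable definition from the patch can be sketched as plain, runnable Python to check that it is well formed. Two things here are assumptions and not taken from the patch: the body of ``sameOrNotComparator`` (the diff shows only its last lines), and ``check_definition``, a hypothetical helper that is not part of dedupe and only tests the rules stated in the documentation text (every entry needs a type, an ``Interaction`` may only name declared variable names, a ``Custom`` entry needs a callable comparator).

```python
def sameOrNotComparator(field_1, field_2):
    """Assumed body: 0 when the two values are equal, 1 otherwise."""
    return 0 if field_1 == field_2 else 1


# The longer example, written as a valid Python list of dictionaries.
variables = [
    {'field': 'name', 'variable name': 'name', 'type': 'String'},
    {'field': 'address', 'type': 'String'},
    {'field': 'city', 'variable name': 'city', 'type': 'String'},
    {'field': 'zip', 'type': 'Custom', 'comparator': sameOrNotComparator},
    {'field': 'cuisine', 'type': 'String', 'has missing': True},
    {'type': 'Interaction', 'interaction variables': ['name', 'city']},
]


def check_definition(variables):
    """Hypothetical helper (not a dedupe API): list problems in a definition."""
    problems = []
    # Names that Interaction entries are allowed to refer to.
    declared = {v['variable name'] for v in variables if 'variable name' in v}
    for i, v in enumerate(variables):
        if 'type' not in v:
            problems.append(f'entry {i}: missing type')
        elif v['type'] == 'Interaction':
            for name in v.get('interaction variables', []):
                if name not in declared:
                    problems.append(f'entry {i}: unknown variable name {name!r}')
        else:
            if 'field' not in v:
                problems.append(f'entry {i}: missing field')
            if v['type'] == 'Custom' and not callable(v.get('comparator')):
                problems.append(f'entry {i}: Custom needs a callable comparator')
    return problems


print(check_definition(variables))
```

A check like this would flag, for instance, an ``Interaction`` entry that refers to ``'zip'`` when no entry declares ``'variable name': 'zip'``.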