-
Notifications
You must be signed in to change notification settings - Fork 703
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[CARBONDATA-276]add trim property for cols #200
Conversation
@@ -465,6 +466,7 @@ private static void generateGraph(IDataProcessStatus schmaModel, SchemaInfo info | |||
model.setEscapeCharacter(schmaModel.getEscapeCharacter()); | |||
model.setQuoteCharacter(schmaModel.getQuoteCharacter()); | |||
model.setCommentCharacter(schmaModel.getCommentCharacter()); | |||
model.setTrim(schmaModel.getTrim()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why we need to set this in schemamodel?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not clear about the usecase, can you make it more clear by providing more details
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In some cases, users want to trim some string field in csv, he set in Load DML option, and we need to handle to in csvinput step. So we the trim should be transformed to CSVinput step.
For Example, in some usecases, the LeadingWhiteSpace in a String, like " fsdfsd" maybe meaningful. Users want to keep the LeadingWhiteSpace. So we add this option for users, let them choose.
I think the dictionary generation is the same with the CSVInput, not ignore any white space. |
Trimming data feature is used to support some users requirements. There is no causality between the dictionary generation is the same with the CSVInput and triming data or not. |
@@ -102,8 +102,8 @@ public void initialize() throws IOException { | |||
parserSettings.setMaxColumns( | |||
getMaxColumnsForParsing(csvParserVo.getNumberOfColumns(), csvParserVo.getMaxColumns())); | |||
parserSettings.setNullValue(""); | |||
parserSettings.setIgnoreLeadingWhitespaces(false); | |||
parserSettings.setIgnoreTrailingWhitespaces(false); | |||
parserSettings.setIgnoreLeadingWhitespaces(csvParserVo.getTrim()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same you need to handle in CarbonFilters.scala also, since while processing filter expressions the spaces are not getting trimmed, so here also you need to take care based on your feature
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hi sujith
i am thinking if user already trim the data with the option setting,then when user query with some space filter,it would no getting result.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hi sujith,
I agree with Eason, when user query with some space filter, it should be considered as forbidden action.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Guys i think if while data loading we are reading from configuration inorder to trim it or not same we need to do while doing filter also, based on configuration value decide.
ex: while loading i enabled trim property, so system will trim and load the data, now in filter query also if user provides while space then it needs to be trimmed while creating filter model. This will provide more system consistentency. if user enable trim then we wont trim it while loading and also while querying.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also one more point it will be better to set this property in column level while creating the table itself as its column properties , this will avoid user to provide this option every time while data loading
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pros of this approach will be suppose in one load user loaded with dirty data and suddenly he realizes no i need to trim then in the next load he will enable the option and load the data, this will increase the dictionary space also, also in query dictionary lookup overhead will increase.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So better to set this while creating the table as column properties metadata
import org.apache.spark.sql.Row | ||
|
||
/** | ||
* Created by x00381807 on 2016/9/26. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please remove
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, my fault
@sujith71955 @jackylk pls review again. The code has been updated. The trim property for cols is set in Create. |
Can you provide more information about this PR, like what kind of query and why is it a bug? |
@jackylk there is no bug, I update my description. |
GraphConfigurationInfo graphConfig) { | ||
List<Boolean> isUseTrimList = new ArrayList<Boolean>(); | ||
for (CarbonDimension dimension : dims) { | ||
if (dimension.isUseTrim()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we add this trim option as property of the column, i mean inside a column property rather than setting directly as CarbonDimension property, Already in CarbonColumn there is a property map which holds column related properties, i think we can use that, please check the feasibility.
public void setIsUseTrim(Boolean[] isUseTrim) { | ||
for (Boolean flag: isUseTrim) { | ||
if (flag) { | ||
this.isUseTrim += "T"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use TRUE/FALSE for better readability
@@ -472,6 +475,7 @@ public boolean processRow(StepMetaInterface smi, StepDataInterface sdi) throws K | |||
break; | |||
} | |||
} | |||
<<<<<<< HEAD |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this file is having any conflict?
I suggestion the trim should not be the properties of the table. |
@Bill, i think while creating table the user shall know the entire schema which he is trying to generate and what kind of data/datatype is required for his table based on the expected data, if we provide the support in data load there are few drawbacks which we discussed in the prior discussion thread, please check. |
i will repeat the cons again if we set this property while data loading. |
just a thinking you can add your points :) |
@sujith I got your point. |
@Bill , I agree to your valid points but i am thinking data load is a frequent operation compare to Create, and more over once you create table with the column properties specified, it will remain the same throughout the table life cycle, if we give in load command there is always a scope of inconsistency by keeping few loads with trimmed data and few without. i think user shall able to judge/decide if data should contain white spaces or it shall be trimmed for particular column if he knows his schema well . |
@sujith71955 @bill1208 |
Can one of the admins verify this patch? |
@lion-x please rebase it. |
remove exporter label at cleanup.sh
Why raise this PR?
Add trim property for cols.
trim a col or not can be set during create table time, like,
CREATE TABLE IF NOT EXISTS t3 (ID Int, date Timestamp, country String, name String, phonetype String, serialname String, salary Int)
STORED BY 'carbondata'
TBLPROPERTIES("trim"="country")
In order to delete some meaningless white space in some fields, we do this PR. It will also improve the index finding performance with shorter string.
How to test?
Pass all the test cases.