Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CARBONDATA-276]add trim property for cols #200

Closed
wants to merge 1 commit into from
Closed

[CARBONDATA-276]add trim property for cols #200

wants to merge 1 commit into from

Conversation

lion-x
Copy link
Contributor

@lion-x lion-x commented Sep 27, 2016

Why raise this PR?

Add trim property for cols.
trim a col or not can be set during create table time, like,

CREATE TABLE IF NOT EXISTS t3 (ID Int, date Timestamp, country String, name String, phonetype String, serialname String, salary Int)
STORED BY 'carbondata'
TBLPROPERTIES("trim"="country")

In order to delete some meaningless white space in some fields, we do this PR. It will also improve the index finding performance with shorter string.

How to test?

Pass all the test cases.

@lion-x lion-x changed the title addtrimoption add trim option Sep 27, 2016
@lion-x lion-x changed the title add trim option [CARBONDATA-276]add trim option Sep 27, 2016
@lion-x
Copy link
Contributor Author

lion-x commented Sep 27, 2016

@@ -465,6 +466,7 @@ private static void generateGraph(IDataProcessStatus schmaModel, SchemaInfo info
model.setEscapeCharacter(schmaModel.getEscapeCharacter());
model.setQuoteCharacter(schmaModel.getQuoteCharacter());
model.setCommentCharacter(schmaModel.getCommentCharacter());
model.setTrim(schmaModel.getTrim());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why we need to set this in schemamodel?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not clear about the usecase, can you make it more clear by providing more details

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In some cases, users want to trim some string field in csv, he set in Load DML option, and we need to handle to in csvinput step. So we the trim should be transformed to CSVinput step.

For Example, in some usecases, the LeadingWhiteSpace in a String, like " fsdfsd" maybe meaningful. Users want to keep the LeadingWhiteSpace. So we add this option for users, let them choose.

@QiangCai
Copy link
Contributor

I think the dictionary generation is the same with the CSVInput, not ignore any white space.
It is unneccessary to trim data.

@lion-x
Copy link
Contributor Author

lion-x commented Oct 11, 2016

Trimming data feature is used to support some users requirements. There is no causality between the dictionary generation is the same with the CSVInput and triming data or not.

@@ -102,8 +102,8 @@ public void initialize() throws IOException {
parserSettings.setMaxColumns(
getMaxColumnsForParsing(csvParserVo.getNumberOfColumns(), csvParserVo.getMaxColumns()));
parserSettings.setNullValue("");
parserSettings.setIgnoreLeadingWhitespaces(false);
parserSettings.setIgnoreTrailingWhitespaces(false);
parserSettings.setIgnoreLeadingWhitespaces(csvParserVo.getTrim());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same you need to handle in CarbonFilters.scala also, since while processing filter expressions the spaces are not getting trimmed, so here also you need to take care based on your feature

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hi sujith
i am thinking if user already trim the data with the option setting,then when user query with some space filter,it would no getting result.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hi sujith,
I agree with Eason, when user query with some space filter, it should be considered as forbidden action.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Guys i think if while data loading we are reading from configuration inorder to trim it or not same we need to do while doing filter also, based on configuration value decide.
ex: while loading i enabled trim property, so system will trim and load the data, now in filter query also if user provides while space then it needs to be trimmed while creating filter model. This will provide more system consistentency. if user enable trim then we wont trim it while loading and also while querying.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also one more point it will be better to set this property in column level while creating the table itself as its column properties , this will avoid user to provide this option every time while data loading

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pros of this approach will be suppose in one load user loaded with dirty data and suddenly he realizes no i need to trim then in the next load he will enable the option and load the data, this will increase the dictionary space also, also in query dictionary lookup overhead will increase.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So better to set this while creating the table as column properties metadata

import org.apache.spark.sql.Row

/**
* Created by x00381807 on 2016/9/26.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please remove

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, my fault

@lion-x lion-x changed the title [CARBONDATA-276]add trim option [CARBONDATA-276]add trim property for cols Oct 27, 2016
@lion-x
Copy link
Contributor Author

lion-x commented Oct 27, 2016

@sujith71955 @jackylk pls review again. The code has been updated. The trim property for cols is set in Create.

@jackylk
Copy link
Contributor

jackylk commented Oct 27, 2016

Can you provide more information about this PR, like what kind of query and why is it a bug?

@lion-x
Copy link
Contributor Author

lion-x commented Oct 28, 2016

@jackylk there is no bug, I update my description.

GraphConfigurationInfo graphConfig) {
List<Boolean> isUseTrimList = new ArrayList<Boolean>();
for (CarbonDimension dimension : dims) {
if (dimension.isUseTrim()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add this trim option as property of the column, i mean inside a column property rather than setting directly as CarbonDimension property, Already in CarbonColumn there is a property map which holds column related properties, i think we can use that, please check the feasibility.

public void setIsUseTrim(Boolean[] isUseTrim) {
for (Boolean flag: isUseTrim) {
if (flag) {
this.isUseTrim += "T";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use TRUE/FALSE for better readability

@@ -472,6 +475,7 @@ public boolean processRow(StepMetaInterface smi, StepDataInterface sdi) throws K
break;
}
}
<<<<<<< HEAD
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this file is having any conflict?

@bill1208
Copy link
Contributor

bill1208 commented Nov 1, 2016

I suggestion the trim should not be the properties of the table.
because in actually scenario the table designer is in front, then data of the this table is generated.
so the designer may don not consider which column need trim that may only can know after data generated.
So I suggestion this trim operation should be consider when data load.
during the data load user can set all column trim or not and also set trim for special column both for datalod command and insert command

@sujith71955
Copy link
Contributor

@Bill, i think while creating table the user shall know the entire schema which he is trying to generate and what kind of data/datatype is required for his table based on the expected data, if we provide the support in data load there are few drawbacks which we discussed in the prior discussion thread, please check.

@sujith71955
Copy link
Contributor

i will repeat the cons again if we set this property while data loading.
cons of this approach will be suppose in one load user loaded with dirty data and suddenly he realizes no i need to trim then in the next load he will enable the option and load the data, this will increase the dictionary space also, also in query dictionary lookup overhead will increase.

@sujith71955
Copy link
Contributor

just a thinking you can add your points :)

@bill1208
Copy link
Contributor

bill1208 commented Nov 2, 2016

@sujith I got your point.
but I think trim data should belong to the step when process the data, which not belong to the table properties.
the scenario what's your des is exist even if specify it when create table.
For example when create the table they donot think it need trim the column1, but when actually data come or some business changed it need trim it. how to process this scenario? recreate the table or alter the table properties? :)

@sujith71955
Copy link
Contributor

@Bill , I agree to your valid points but i am thinking data load is a frequent operation compare to Create, and more over once you create table with the column properties specified, it will remain the same throughout the table life cycle, if we give in load command there is always a scope of inconsistency by keeping few loads with trimmed data and few without. i think user shall able to judge/decide if data should contain white spaces or it shall be trimmed for particular column if he knows his schema well .
And more over if the initial load the column data is having white space and as you told above he thinks to trim it, for that again user will perform data loading, but in this case his column count/distinct count and all will be different compare to the actual and also he has to see more query result with the data he is not interested, now if he delete the particular load also again he needs to reload the particular load data which will be again a cumbersome process. we shall try to make user to make his steps right and keep it simple.
You can put a view on my above points

@lion-x
Copy link
Contributor Author

lion-x commented Nov 7, 2016

@sujith71955 @bill1208
actually, in my opinion, trimming data should belong to data preprocessing. If we must put trim property in carbon, I think it is better to put trim option in loading data part. If user put unprepocessing data into table, it is user's fault.

@ravipesala
Copy link
Contributor

Can one of the admins verify this patch?

@chenliang613
Copy link
Contributor

@lion-x please rebase it.

@lion-x lion-x closed this Dec 12, 2016
@lion-x lion-x deleted the addTrimOption branch December 13, 2016 06:59
Beyyes pushed a commit to Beyyes/carbondata that referenced this pull request Jul 12, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants