Cleaning the Data frame

Question

Steve Jones - SSC Editor

SSC Guru

Points: 742706
August 7, 2019 at 12:00 am

#3664362

Comments posted to this topic are about the item Cleaning the Data frame

Stewart "Arturius" Campbell SSC Guru Points: 72941 · Answer 1

Nice reminder, thanks Steve

____________________________________________
Space, the final frontier? not any more...
All limits henceforth are self-imposed.
“libera tute vulgaris ex”

Carlo Romagnano SSC-Insane Points: 22929 · Answer 2

I found this in the documentation:

"header is set to TRUE if and only if the first row contains one fewer field than the number of columns."

So, because the file has the same number of column titles as columns, the right answer is:

x = read.csv2("Flights.csv",header=FALSE,sep=",",na.strings = "!")

jschmidt 17654 SSCommitted Points: 1747 · Answer 3

The "one fewer field" guidance is weird to me. I've been using read.csv2 with header=TRUE on files that have the same number of header fields and columns, and have read many files successfully. I wonder if there is some implied row number field in a csv or if the guidance isn't clear.
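
The "one fewer field" rule seems to be about row names rather than ordinary headers: the read.table documentation says that when there is a header and its first row has one fewer field than the number of columns, the first column of the input is used for the row names. Below is a minimal sketch of that behaviour using made-up inline data (the names dep, arr, id and the values are illustrative assumptions, not taken from Flights.csv):

```r
# Header has 2 fields, data rows have 3:
# the first data column is used as row names.
with_rownames <- read.csv2(sep = ",", text = "
dep,arr
f1,10,20
f2,30,40
")
rownames(with_rownames)   # row labels taken from the first data column
names(with_rownames)      # only the two header fields are column names

# Header and data rows both have 3 fields:
# an ordinary header, with automatic row names.
plain_header <- read.csv2(sep = ",", text = "
id,dep,arr
f1,10,20
f2,30,40
")
names(plain_header)
```

That would explain why files with equal numbers of header fields and columns read fine with header=TRUE: the rule only decides the default when header is not supplied, and read.csv2 sets header = TRUE by default anyway.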

Carlo Romagnano SSC-Insane Points: 22929 · Answer 4

I should try it, but I think that whether you specify header=TRUE or header=FALSE, the first row contains column names (TRUE = fewer names than columns, FALSE = the same number of names and columns).

Steve Jones - SSC Editor SSC Guru Points: 742706 · Answer 5

August 7, 2019 at 2:57 pm

#3669518

Not sure that works.

[Screenshot: RStudio, 2019-08-07 08:56:47]

George Vobr SSChampion Points: 10279 · Answer 6

The syntax description in the reference states that header is a logical value indicating whether the file contains the names of the variables as its first line. If missing, the value is determined from the file format: header is set to TRUE if and only if the first row contains one fewer field than the number of columns.

If header = FALSE is explicitly specified, the first line is always treated as data values. See both examples given above by Steve: the resulting data frame has the default column names V1, V2, V3, V4, and its first row holds the original column names from Flights.csv.
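
The difference can be reproduced without the original file. The sketch below uses a made-up three-column stand-in for Flights.csv (the column names, values, and the "!" placeholder row are assumptions for illustration only) and reads it once with header = FALSE and once with header = TRUE:

```r
# Made-up stand-in for a file like Flights.csv; "!" marks a missing value.
csv_text <- "
flight,origin,delay
AA1,JFK,5
BA2,LHR,!
"

# header = FALSE: the title row is read as data,
# and the columns get the default names V1, V2, V3.
no_header <- read.csv2(text = csv_text, header = FALSE,
                       sep = ",", na.strings = "!")
names(no_header)
no_header[1, ]            # holds the original column titles

# header = TRUE: the title row supplies the column names.
with_header <- read.csv2(text = csv_text, header = TRUE,
                         sep = ",", na.strings = "!")
names(with_header)
is.na(with_header$delay)  # the "!" entry is read as NA
```

With header = FALSE the titles also force every column to be read as text, so numeric columns such as delay lose their type.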

Try a simple inline text import to check how the header parameter behaves, for example:

1.
read.csv2(sep = ",", text = "
a,b,c,
1,2,3,4,5
")
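The header parameter is not specified, and the first row has 4 fields (a, b, c and an empty one) for 5 columns.
The read.csv2 default header = TRUE applies.
Because the header has one fewer field than the data, the first data value (1) is used as the row name. The result is a data frame whose empty header field is named X, with one row of data: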
  a b c X
1 2 3 4 5

2. But a header that is two or more fields short fails:
read.csv2(sep = ",", text = "
a,b,c
1,2,3,4,5
")
Result:
Error: more columns than column names

3. header = FALSE is explicitly specified:
read.csv2(header = FALSE, sep = ",", text = "
a,b
1,2,3,4,5
")
Result is a data frame with the default column names.
The first line is read as data and padded with NA.

  V1 V2 V3 V4 V5
1  a  b NA NA NA
2  1  2  3  4  5

4. header = TRUE is explicitly specified, with more column names than columns:
read.csv2(header = TRUE, sep = ",", text = "
a,b,c,d,e,f,g,
1,2,3,4,5
")
Result is a data frame whose empty last header field is named X. The first row of data is padded with NA.
  a b c d e  f  g  X
1 1 2 3 4 5 NA NA NA
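
To see in advance which of these cases a file falls into, the fields on each line can be counted before importing. This is only a sketch, using base R's count.fields on the inline text from example 1; the variable names are illustrative:

```r
# Count the fields on each line before importing,
# to see how the header compares with the data rows.
csv_text <- "a,b,c,
1,2,3,4,5
"
con <- textConnection(csv_text)
fields <- count.fields(con, sep = ",")
close(con)
fields   # fields on the header line, then on each data line

# Exactly one fewer header field than data fields means the
# first column will be used as row names when header = TRUE.
first_col_is_rownames <- (max(fields[-1]) - fields[1]) == 1
first_col_is_rownames
```

For a real file, the path can be passed to count.fields directly instead of a text connection.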