File Gotchas
Roedy Green
2013-03-17 03:16:03 UTC
/*
 * [TestFileCombine.java]
 *
 * Summary: combining two filenames with java.io.File
 *
 * Copyright: (c) 2013 Roedy Green, Canadian Mind Products, http://mindprod.com
 *
 * Licence: This software may be copied and used freely for any purpose but military.
 * http://mindprod.com/contact/nonmil.html
 *
 * Requires: JDK 1.7+
 *
 * Created with: JetBrains IntelliJ IDEA IDE http://www.jetbrains.com/idea/
 *
 * Version History:
 * 1.0 2013-03-16 initial version
 */
package com.mindprod.example;

import com.mindprod.common11.Misc;

import java.io.File;
import java.io.IOException;

import static java.lang.System.out;

/**
* combining two filenames with java.io.File
*
* @author Roedy Green, Canadian Mind Products
* @version 1.0 2013-03-16 initial version
* @since 2013-03-16
*/
public final class TestFileCombine
{

    /**
     * Experiment with various ways of combining file names
     *
     * @param args not used
     *
     * @throws java.io.IOException on I/O failure.
     */
    public static void main( String[] args ) throws IOException
    {

        // file is not suitable for resolving relative or absolute offsets from a base filename.
        File root = new File( "E:/mindprod" );
        File o1 = new File( root, "index.html" );
        out.println( Misc.getCanOrAbsPath( o1 ) );
        // prints: E:/mindprod/index.html (actually with backslashes)

        File o2 = new File( root, "/index.html" );
        out.println( Misc.getCanOrAbsPath( o2 ) );
        // prints: E:/mindprod/index.html

        File base = new File( "E:/mindprod/jgloss/encoding" );
        File o3 = new File( base, "pad.html" );
        out.println( Misc.getCanOrAbsPath( o3 ) );
        // prints: E:\mindprod\jgloss\encoding\pad.html

        File o4 = new File( base, "../pad.html" );
        out.println( Misc.getCanOrAbsPath( o4 ) );
        // prints: E:\mindprod\jgloss\pad.html

        File o5 = new File( base, "/jgloss/pad.html" );
        out.println( Misc.getCanOrAbsPath( o5 ) );
        // prints: E:\mindprod\jgloss\encoding\jgloss\pad.html (ouch)
        // You might have naively hoped for: E:/mindprod/jgloss/pad.html
        // However, File has no idea that / on your website refers to E:/mindprod.

        File base2 = new File( "E:/mindprod/jgloss/encoding/utf8.html" );
        File o6 = new File( base2, "pad.html" );
        out.println( Misc.getCanOrAbsPath( o6 ) );
        // prints: E:\mindprod\jgloss\encoding\utf8.html\pad.html (ouch)
        // You might have hoped for: E:\mindprod\jgloss\encoding\pad.html

        File o7 = new File( base2, "../pad.html" );
        out.println( Misc.getCanOrAbsPath( o7 ) );
        // prints: E:\mindprod\jgloss\encoding\pad.html (ouch)
        // You might have hoped for: E:\mindprod\jgloss\pad.html

        File o8 = new File( base2, "/jgloss/pad.html" );
        out.println( Misc.getCanOrAbsPath( o8 ) );
        // prints: E:\mindprod\jgloss\encoding\utf8.html\jgloss\pad.html (ouch)
        // You might have naively hoped for: E:/mindprod/jgloss/pad.html
        // However, File has no idea that / on your website refers to E:/mindprod.
    }

}
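For the o6 case, one possible workaround is to combine against the directory that holds the document rather than against the document itself. This is only a minimal sketch: it assumes the caller already knows the base names a file (File cannot tell), and the names SiblingFile and sibling are made up for illustration.

```java
import java.io.File;

public final class SiblingFile
{
    /**
     * Combine a name with the directory containing document.
     * Hypothetical helper, not part of java.io.File.
     */
    public static File sibling( File document, String name )
    {
        // getParentFile() drops the last name in the path; it does not touch the disk.
        // If document has no parent, new File( (File) null, name ) behaves like new File( name ).
        return new File( document.getParentFile(), name );
    }

    public static void main( String[] args )
    {
        File doc = new File( "mindprod/jgloss/encoding/utf8.html" );
        // Path separators in the output depend on the platform.
        System.out.println( sibling( doc, "pad.html" ).getPath() );
    }
}
```

The same caveat applies as in the listing above: nothing here knows whether utf8.html is really a file, so the choice to strip the last name is the caller's.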
--
Roedy Green Canadian Mind Products http://mindprod.com
The computer programmer is a creator of universes for which he alone
is the lawgiver. No playwright, no stage director, no emperor, however
powerful, has ever exercised such absolute authority to arrange a stage
or a field of battle and to command such unswervingly dutiful actors or
troops.
~ Joseph Weizenbaum (born: 1923-01-08 died: 2008-03-05 at age: 85)
Joerg Meier
2013-03-17 12:56:20 UTC
Post by Roedy Green
File o5 = new File( base, "/jgloss/pad.html" );
out.println( Misc.getCanOrAbsPath( o5 ) );
// prints: E:\mindprod\jgloss\encoding\jgloss\pad.html (ouch)
// You might have naively hoped for: E:/mindprod/jgloss/pad.html
// However, File has no idea that / on your website refers to E:/mindprod.
What website? Now websites are involved? I'm not really sure what's going on
here. Why would I hope that / randomly refers to E:/mindprod? Why not to
E:/ or E:/mindprod/jgloss?

Leaving out the part about a website that I don't understand, why would you
assume that Java would randomly pick the parts of the filename you were
thinking of? I can see no indication why it would be that specific part
other than "If I wish really hard, maybe it will come true". At most, I
would have expected that a leading / would be interpreted as the drive's
root, as it works under Linux.
Post by Roedy Green
File base2 = new File( "E:/mindprod/jgloss/encoding/utf8.html" );
File o6 = new File( base2, "pad.html" );
out.println( Misc.getCanOrAbsPath( o6 ) );
// prints: E:\mindprod\jgloss\encoding\utf8.html\pad.html (ouch)
// You might have hoped for: E:\mindprod\jgloss\encoding\pad.html
That would be a defect that I would immediately file a bug report for. It
would mean that it would be impossible to access folders/directories that
have a period in their name. Why you would hope that those would randomly
be cut off for no reason is beyond me.
Post by Roedy Green
File o7 = new File( base2, "../pad.html" );
out.println( Misc.getCanOrAbsPath( o7 ) );
// prints: E:\mindprod\jgloss\encoding\pad.html (ouch)
// You might have hoped for: E:\mindprod\jgloss\pad.html
Again: behaviour like that would mean a bug with regard to directories
with a period in their name. Not sure why that would be desirable.
Post by Roedy Green
File o8 = new File( base2, "/jgloss/pad.html" );
out.println( Misc.getCanOrAbsPath( o8 ) );
// prints: E:\mindprod\jgloss\encoding\utf8.html\jgloss\pad.html (ouch)
// You might have naively hoped for: E:/mindprod/jgloss/pad.html
// However, File has no idea that / on your website refers to E:/mindprod.
Same response as above: what website? Why would / refer to that particular
piece of the path?

Liebe Gruesse,
Joerg
--
Ich lese meine Emails nicht, replies to Email bleiben also leider
ungelesen.
Roedy Green
2013-03-17 17:20:18 UTC
Post by Joerg Meier
What website ? Now websites are involved ? Not really sure whats going on
here. Why would I hope that / randomly refers to E:/mindprod ? Why not to
E:/ or E:/mindprod/jgloss ?
That is the point. Your local file system has no idea that
E:/mindprod represents the root of your local mirror of a website, and
neither do your browsers. If they did, you could have links in the
local mirror of the form href="/jgloss/jgloss.html" to refer to
E:\mindprod\jgloss\jgloss.html where E:\mindprod is the root of the
website mirror. You must use relative addresses, e.g.
href="../jgloss/jgloss.html". My examples mainly come up when you try
navigating the local files of a website mirror with the file system.

For a remote website, the browser does know the root. I have not
experimented to see if /-type links work there.
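
If the program itself is told where the mirror root is, the mapping can be written by hand. The following is a minimal sketch, not a tested tool: it assumes hrefs are plain paths with no query strings, fragments or percent-escapes, and the names MirrorLinks, toLocal, mirrorRoot and pageDir are invented for the example. It uses java.nio.file.Path (JDK 1.7+) rather than java.io.File.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

public final class MirrorLinks
{
    /**
     * Map an href found in a mirrored page to a local path.
     *
     * @param mirrorRoot local directory standing in for the website root (assumption)
     * @param pageDir    local directory containing the page the href came from
     * @param href       link text, either "/..." or document-relative
     */
    public static Path toLocal( Path mirrorRoot, Path pageDir, String href )
    {
        if ( href.startsWith( "/" ) )
        {
            // Site-root-relative: drop the leading slash, resolve against the mirror root.
            return mirrorRoot.resolve( href.substring( 1 ) ).normalize();
        }
        // Document-relative: resolve against the page's own directory.
        return pageDir.resolve( href ).normalize();
    }

    public static void main( String[] args )
    {
        Path root = Paths.get( "mindprod" );
        Path pageDir = root.resolve( "jgloss" ).resolve( "encoding" );
        // Separators in the printed paths depend on the platform.
        System.out.println( toLocal( root, pageDir, "/jgloss/pad.html" ) );
        System.out.println( toLocal( root, pageDir, "../pad.html" ) );
        System.out.println( toLocal( root, pageDir, "pad.html" ) );
    }
}
```

The point of the sketch is that the knowledge "/ means the mirror root" lives in the caller, as an explicit parameter, not in the file system.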
markspace
2013-03-17 18:17:54 UTC
Post by Roedy Green
Post by Joerg Meier
What website ? Now websites are involved ? Not really sure whats going on
here. Why would I hope that / randomly refers to E:/mindprod ? Why not to
E:/ or E:/mindprod/jgloss ?
That is the point. Your local file system has no idea that
E:/mindprod represents the root of your local mirror of a website, and
neither do your browsers. If they did, you could have links in the
local mirror of the form href="/jgloss/jgloss.html" to refer to
E:\mindprod\jgloss\jgloss.html where E:\mindprod is the root of the
website mirror. You must use relative addresses, e.g.
href="../jgloss/jgloss.html". My examples mainly come up when you try
navigating the local files of a website mirror with the file system.
For a remote website, the browser does know the root. I have not
experimented to see if /-type links work there.
I was going to sort of defend you but now you're just being silly.
Check out the documentation for the wget unix utility. There are some
hints there.

What I think you are missing is:

1. You have to maintain your own root if you're
parsing/browsing/scraping a website. You have to remember that you
fetched a document from http://www.mindprod.com/stuffs, for example, and
all your paths are relative to that. I haven't actually looked at HTML
semantics in a while, so you might have to also remove the path from
that root and just use the protocol + host part. The URL class in Java
does this for you.

2. Once you have the root, you have to look at the start of the path
from the HTML document and determine if you just append, or if you have
to use just the hostname, based on the leading characters of the
path ("." or "/"). I'm quite certain the HTML RFCs spell this out
explicitly. Expecting the Java File class to implement these special
semantics for you just isn't going to work. It's "naive," or
something, alright.
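
The two points above can be sketched with java.net.URI, whose resolve method applies the standard reference-resolution rules. Note this uses URI rather than the URL class mentioned above, and the class name HrefResolution and the sample document address are assumptions made for illustration.

```java
import java.net.URI;

public final class HrefResolution
{
    /** Resolve an href against the URI of the document it appeared in. */
    public static URI resolve( String documentUri, String href )
    {
        return URI.create( documentUri ).resolve( href );
    }

    public static void main( String[] args )
    {
        // Hypothetical document address, chosen to mirror the File examples in this thread.
        String doc = "http://mindprod.com/jgloss/encoding/utf8.html";

        // Document-relative: replaces the last path segment.
        System.out.println( resolve( doc, "pad.html" ) );
        // prints: http://mindprod.com/jgloss/encoding/pad.html

        // Parent-relative: climbs one directory from the document's directory.
        System.out.println( resolve( doc, "../pad.html" ) );
        // prints: http://mindprod.com/jgloss/pad.html

        // Leading slash: keeps only scheme and host from the base.
        System.out.println( resolve( doc, "/jgloss/pad.html" ) );
        // prints: http://mindprod.com/jgloss/pad.html
    }
}
```

These are the results the original post hoped to get from File; URI can produce them because a base without a trailing slash is treated as a document, not a directory.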
Steven Simpson
2013-03-17 19:38:06 UTC
Post by Roedy Green
Your local file system has no idea that
E:/mindprod represents the root of your local mirror of a website, and
neither do your browsers. If they did, you could have links in the
local mirror of the form href="/jgloss/jgloss.html" to refer to
E:\mindprod\jgloss\jgloss.html where E:\mindprod is the root of the
website mirror. You must use relative addresses, e.g.
href="../jgloss/jgloss.html". My examples mainly come up when you try
navigating the local files of a website mirror with the file system.
For a remote website, the browser does know the root. I have not
experimented to see if /-type links work there.
I gather you're trying to write some off-line site-checking program,
where you have a local copy of your site, which you FTP to the server,
and the program needs to interpret links (among other things).

java.io.File does not capture distinctions between files and
directories, but java.net.URI does distinguish between URIs with and
without terminating slashes. I suggest you do as much work as possible
with URIs - identify each document you're handling by its URI; parse
href values as URIs and resolve against the document's - and only
convert to File when you need to access the disc. Here's a barely
tested class that might help with that:

import java.net.URI;
import java.io.File;

/**
 * Maps URIs within a site to local files.
 */
class FileMapping {
    final URI site;
    final URI copy;
    final String index;

    /**
     * Create a file mapping.
     *
     * @param site the base URI of the site; anything after the last
     * slash is ignored
     *
     * @param copy the directory of the local copy of the site
     *
     * @param index the default filename to use to map directory-like
     * URIs
     */
    public FileMapping(String site, String copy, String index) {
        this(URI.create(site), new File(copy), index);
    }

    /**
     * Create a file mapping using a default leafname.
     *
     * @param site the base URI of the site; anything after the last
     * slash is ignored
     *
     * @param copy the directory of the local copy of the site
     */
    public FileMapping(String site, String copy) {
        this(URI.create(site), new File(copy));
    }

    /**
     * Create a file mapping using a default leafname.
     *
     * @param site the base URI of the site; anything after the last
     * slash is ignored
     *
     * @param copy the directory of the local copy of the site
     */
    public FileMapping(URI site, File copy) {
        this(site, copy, "index.html");
    }

    /**
     * Create a file mapping.
     *
     * @param site the base URI of the site; anything after the last
     * slash is ignored
     *
     * @param copy the directory of the local copy of the site
     *
     * @param index the default filename to use to map directory-like
     * URIs
     */
    public FileMapping(URI site, File copy, String index) {
        /* We must have a slash-terminated base URI for relativize to
         * work. */
        this.site = site.resolve("./");

        /* We must add a dummy element so that we can ensure a
         * trailing slash. */
        this.copy = new File(copy, "dummy").toURI().resolve("./");

        this.index = index;
    }

    /**
     * Map the URI to a file.
     *
     * @param addr the URI to be mapped
     *
     * @return the file that the URI maps to, or null if it is
     * external
     */
    public File map(URI addr) {
        URI rel = site.relativize(addr);
        if (rel.isAbsolute()) return null;
        if (rel.resolve("./").equals(rel))
            rel = rel.resolve(index);
        rel = copy.resolve(rel);
        return new File(rel);
    }

    private static void test(FileMapping mapping, String addrText) {
        URI addr = URI.create(addrText);
        File file = mapping.map(addr);
        System.out.printf("%s -> %s%n", addr, file);
    }

    public static void main(String[] args) throws Exception {
        FileMapping mapping =
            new FileMapping("http://mindprod.com/", "/var/site");
        test(mapping, "http://www.example.com/");
        test(mapping, "http://mindprod.com/jgloss/pad.html");
        test(mapping, "http://mindprod.com/jgloss/encoding/pad.html");
    }
}
--
ss at comp dot lancs dot ac dot uk
Lew
2013-03-17 18:58:54 UTC
Post by Joerg Meier
Post by Roedy Green
File o5 = new File( base, "/jgloss/pad.html" );
out.println( Misc.getCanOrAbsPath( o5 ) );
// prints: E:\mindprod\jgloss\encoding\jgloss\pad.html (ouch)
// You might have naively hoped for: E:/mindprod/jgloss/pad.html
// However, File has no idea that / on your website refers to E:/mindprod.
'File' is meant to assist with file-system navigation, not web navigation.

It is not an abstraction of a file system, either. It is "[a]n abstract representation of
file and directory pathnames."

It only models the names. From that point of view, all the behavior you observed
is consistent with expectation.
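
A small sketch of "it only models the names": combining pathnames works the same whether or not anything by that name exists. The directory name no-such-dir is an assumption (the example presumes nothing by that name is in the working directory), and FileNamesOnly and combine are names invented here.

```java
import java.io.File;

public final class FileNamesOnly
{
    /** Combine parent and child pathnames; no file-system access is involved. */
    public static File combine( String parent, String child )
    {
        return new File( new File( parent ), child );
    }

    public static void main( String[] args )
    {
        File parent = new File( "no-such-dir/utf8.html" );
        File child = combine( "no-such-dir/utf8.html", "pad.html" );

        // The parent pathname is kept whole, even though it looks like a file name.
        System.out.println( child.getParentFile().equals( parent ) );

        // The last name in the combined pathname.
        System.out.println( child.getName() );

        // Only calls such as exists() ask the OS about the name;
        // the result depends on what is actually on disk.
        System.out.println( child.exists() );
    }
}
```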
Post by Joerg Meier
What website ? Now websites are involved ? Not really sure whats going on
here. Why would I hope that / randomly refers to E:/mindprod ? Why not to
E:/ or E:/mindprod/jgloss ?
In the case of 'File', you are not even promised that it refers to "/".

You are promised that it represents the pathname "/", the resource for which is
out of its scope.
Post by Joerg Meier
Leaving out the part about a website that I don't understand, why would you
assume that Java randomly would pick the parts of the filename you were
thinking of ? I can see no indication why it would be that specific part
other than "If I wish really hard, maybe it will come true". At most, I
would have expected that a leading / would be interpreted as the drives
root, as it works under Linux.
Which is actually more than it does. All it represents is the pathname "/".

To put it another way, 'File' is not responsible for how the pathname is
interpreted.

If that is the drive root, that's up to the OS service to which 'File' passes
the pathname.
Post by Joerg Meier
Post by Roedy Green
File base2 = new File( "E:/mindprod/jgloss/encoding/utf8.html" );
File o6 = new File( base2, "pad.html" );
out.println( Misc.getCanOrAbsPath( o6 ) );
// prints: E:\mindprod\jgloss\encoding\utf8.html\pad.html (ouch)
new File( base2, "pad.html" );
E:/mindprod/jgloss/encoding/utf8.html/pad.html
E:\mindprod\jgloss\encoding\utf8.html\pad.html

Why "ouch"?
Post by Joerg Meier
Post by Roedy Green
// You might have hoped for: E:\mindprod\jgloss\encoding\pad.html
That would violate the documented behavior of the constructor:
"Creates a new File instance from a parent pathname string and a child pathname string."
Post by Joerg Meier
That would be a defect that I would immediately file a bug report for. It
would mean that it would be impossible to access folders/directories that
It is not the job of 'File' to access any resource. Its job is only to manage pathnames
and the interaction of those pathnames with host services.
Post by Joerg Meier
have a period in their name. Why you would hope that those would randomly
be cut off for no reason is beyond me.
And it would violate the contract.
Post by Joerg Meier
... [snip] ...
Same response as above: what website ? Why would / refer to that particular
piece of the path ?
In point of fact, the shortcut of thinking that "/" refers to anything is a mismatch
to what 'File' actually does. 'File' manages the name and its communication to the OS.

The OS decides what it matches.

With that in mind, the logic of 'File's documented behavior and Joerg's incredulity
that expectations would diverge therefrom are perfectly explicable.
--
Lew